In this project, I attempt to build a small language model, trained on all the articles of the Hungarian news portal telex.hu, using a character-based tokenizer.
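To illustrate what a character-based tokenizer does, here is a minimal sketch in Python. The class name `CharTokenizer` and its methods are hypothetical and are not taken from this repository; the actual implementation may differ, for example in how it handles characters that were not seen during training.

```python
class CharTokenizer:
    """Hypothetical sketch: maps every distinct character of a corpus to an integer id."""

    def __init__(self, corpus: str) -> None:
        # Sort the vocabulary so that the ids are stable across runs.
        self.vocab = sorted(set(corpus))
        self.stoi = {ch: i for i, ch in enumerate(self.vocab)}
        self.itos = {i: ch for ch, i in self.stoi.items()}

    def encode(self, text: str) -> list[int]:
        # Raises KeyError for characters that are not in the vocabulary.
        return [self.stoi[ch] for ch in text]

    def decode(self, ids: list[int]) -> str:
        return "".join(self.itos[i] for i in ids)


tokenizer = CharTokenizer("árvíztűrő tükörfúrógép")
ids = tokenizer.encode("tükör")
print(ids, tokenizer.decode(ids))
```

With this approach the vocabulary stays small (one entry per distinct character), at the cost of longer token sequences than a subword tokenizer would produce.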
The Python environment is managed with pipenv. You can set up your environment with the following steps:
- Run pipenv lock to generate the Pipfile.lock, which lists the versions of your Python packages.
- Run pipenv install --dev to create a virtual environment and install the Python packages. The --dev flag also installs the development packages (for linting, ...).
- Run pipenv shell to activate the virtual environment.
The ML pipeline is managed with DVC. Here are a few tips on how to use it:
- Run the complete pipeline:
dvc repro
- Run a specific step of the pipeline with all its dependencies:
dvc repro <step_name>
DVC Stages:
- scrape: downloads and saves all articles published since October 2020, using the telex API
- preprocess: removes HTML tags and collects the contents of all articles in a single JSON file
- train: initializes the DataLoader and the language model, then trains the model at the character level in a semi-supervised fashion
- evaluate: calculates corpus perplexity on a test set and generates random text from an input context
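
As a rough illustration of the perplexity metric used in the evaluate stage, the sketch below computes it as the exponential of the mean negative log-likelihood per character. The function name `perplexity` and its input format (a list of natural-log probabilities that the model assigned to the true next characters) are assumptions made for this example, not the repository's actual code.

```python
import math


def perplexity(token_log_probs: list[float]) -> float:
    """Return exp(mean negative log-likelihood) over the given tokens.

    token_log_probs holds the natural-log probability the model assigned
    to each true next character (hypothetical input format).
    """
    if not token_log_probs:
        raise ValueError("at least one token is required")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# A model that always spreads its probability uniformly over 4 characters
# should have a perplexity of about 4.
uniform_log_probs = [math.log(0.25)] * 10
print(perplexity(uniform_log_probs))
```

A lower value means the model is, on average, less surprised by the test text; a model that guesses uniformly over a vocabulary of V characters has a perplexity of V.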
.
├── Pipfile <- requirements for running the project
├── Pipfile.lock <- versions of the required packages
├── README.md
├── dvc.lock <- automatically records the states of the DVC pipeline
├── dvc.yaml <- lists the stages for the DVC pipeline
├── pyproject.toml <- contains the build system requirements of the projects
├── notebooks
├── params.py <- contains the parameters of the project
├── data
│   ├── preprocessed
│   └── raw
└── telex <- source code of the project
    ├── models <- ml model definitions
    │   ├── base_model.py
    │   ├── bigram.py
    │   └── transformer.py
    ├── pipeline <- scripts for each stage in the DVC pipeline
    │   ├── evaluate
    │   ├── preprocess
    │   ├── scrape <- scraping articles from telex
    │   └── train <- model training scripts
    └── utils <- helper scripts
        ├── dataset.py <- defines pytorch Dataset object from raw articles
        └── io.py <- input/output related functions