πŸ“œ telex-nlp

In this project, I attempt to build a small language model, trained on all the articles of the Hungarian news portal telex.hu, using a character-based tokenizer.
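
A character-based tokenizer simply maps every distinct character in the corpus to an integer id. A minimal sketch of the idea (the vocabulary construction here is illustrative, not necessarily the repo's exact code):

# build a character-level vocabulary from the corpus text
text = "Szia, telex!"                              # stand-in for the concatenated articles
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}       # character -> integer id
itos = {i: ch for ch, i in stoi.items()}           # integer id -> character

def encode(s: str) -> list[int]:
    return [stoi[c] for c in s]

def decode(ids: list[int]) -> str:
    return "".join(itos[i] for i in ids)

assert decode(encode("telex")) == "telex"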

πŸ”§ Set up environment

The Python environment is managed with pipenv. You can set it up with the following steps:

  • Run pipenv lock to generate the Pipfile.lock, which pins the versions of the required Python packages.
  • Run pipenv install --dev to create the virtual environment and install the Python packages. The --dev flag also installs the development packages (for linting, etc.).
  • Run pipenv shell to activate the virtual environment.

πŸš€ Run the DVC pipeline

The ML pipeline is managed with DVC. Here are a few tips on how to use it:

  • Run the complete pipeline: dvc repro
  • Run a specific stage of the pipeline with all its dependencies: dvc repro <stage_name>

DVC stages:

  • scrape: downloads all articles published since October 2020 via the telex API and saves them
  • preprocess: strips HTML tags and collects all article contents into a single JSON file
  • train: initializes the DataLoader and the language model, then trains it on next-character prediction in a self-supervised fashion
  • evaluate: computes corpus perplexity on a test set and generates text from an input context (see the sketch below)
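
Corpus perplexity is the exponential of the average per-character cross-entropy on the test set. A minimal sketch of that computation, assuming a model that returns per-character logits (function and variable names are illustrative, not the repo's actual API):

import math
import torch
import torch.nn.functional as F

def corpus_perplexity(model, loader, device="cpu"):
    # sum the next-character negative log-likelihood over the whole test set
    model.eval()
    total_nll, total_chars = 0.0, 0
    with torch.no_grad():
        for x, y in loader:                      # x, y: (batch, block_size) integer tensors
            logits = model(x.to(device))         # (batch, block_size, vocab_size)
            nll = F.cross_entropy(
                logits.view(-1, logits.size(-1)),
                y.to(device).view(-1),
                reduction="sum",
            )
            total_nll += nll.item()
            total_chars += y.numel()
    # exponentiated average negative log-likelihood per character
    return math.exp(total_nll / total_chars)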

πŸ—οΈ Structure

.
β”œβ”€β”€ Pipfile                 <- requirements for running the project
β”œβ”€β”€ Pipfile.lock            <- versions of the required packages
β”œβ”€β”€ README.md
β”œβ”€β”€ dvc.lock                <- automatically records the states of the DVC pipeline
β”œβ”€β”€ dvc.yaml                <- lists the stages for the DVC pipeline
β”œβ”€β”€ pyproject.toml          <- contains the build system requirements of the project
β”œβ”€β”€ notebooks
β”œβ”€β”€ params.py               <- contains the parameters of the project
β”œβ”€β”€ data
β”‚   β”œβ”€β”€ preprocessed
β”‚   └── raw
└── telex                   <- source code of the project
    β”œβ”€β”€ models              <- ML model definitions
    β”‚   β”œβ”€β”€ base_model.py
    β”‚   β”œβ”€β”€ bigram.py
    β”‚   └── transformer.py
    β”œβ”€β”€ pipeline            <- scripts for each stage in the DVC pipeline
    β”‚   β”œβ”€β”€ evaluate
    β”‚   β”œβ”€β”€ preprocess
    β”‚   β”œβ”€β”€ scrape          <- scraping articles from telex
    β”‚   └── train           <- model training scripts
    └── utils               <- helper scripts
        β”œβ”€β”€ dataset.py      <- defines the PyTorch Dataset built from the raw articles (sketched below)
        └── io.py           <- input/output related functions
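
For reference, a character-level Dataset along the lines of telex/utils/dataset.py can slice the encoded corpus into fixed-length blocks, with the target sequence being the input shifted by one character. A minimal sketch (the block size, class name, and encoding step are assumptions, not the repo's exact implementation):

import torch
from torch.utils.data import Dataset

class CharDataset(Dataset):
    # next-character prediction pairs cut from one long encoded character sequence

    def __init__(self, ids, block_size=128):
        self.data = torch.tensor(ids, dtype=torch.long)    # encoded corpus as integer ids
        self.block_size = block_size

    def __len__(self):
        return len(self.data) - self.block_size

    def __getitem__(self, idx):
        x = self.data[idx : idx + self.block_size]            # input characters
        y = self.data[idx + 1 : idx + self.block_size + 1]    # same sequence shifted by one
        return x, y

Wrapping such a dataset in a torch DataLoader then yields shuffled (x, y) batches for the train stage.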
