📜 telex-nlp

In this project, I attempt to build a small language model, trained on all the articles of the Hungarian news portal telex.hu, using a character-based tokenizer.
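
To make the idea of a character-based tokenizer concrete, here is a minimal sketch. The class name CharTokenizer and its methods are hypothetical and are not taken from the project's source; the sketch only assumes that every distinct character in the corpus gets its own integer id.

```python
from __future__ import annotations


class CharTokenizer:
    """Hypothetical character-level tokenizer (illustration only)."""

    def __init__(self, text: str) -> None:
        # The vocabulary is the sorted set of distinct characters in the corpus.
        self.chars = sorted(set(text))
        self.stoi = {ch: i for i, ch in enumerate(self.chars)}
        self.itos = {i: ch for ch, i in self.stoi.items()}

    @property
    def vocab_size(self) -> int:
        return len(self.chars)

    def encode(self, text: str) -> list[int]:
        # Map each character to its integer id (raises KeyError on unseen characters).
        return [self.stoi[ch] for ch in text]

    def decode(self, ids: list[int]) -> str:
        # Map integer ids back to characters and join them into a string.
        return "".join(self.itos[i] for i in ids)


if __name__ == "__main__":
    tokenizer = CharTokenizer("Jó reggelt, Magyarország!")
    ids = tokenizer.encode("Jó reggel")
    print(tokenizer.vocab_size, ids, tokenizer.decode(ids))
```

A character-level vocabulary stays small and handles Hungarian accented letters without any special casing, at the cost of longer sequences than a subword tokenizer would produce.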

🔧 Set up environment

The Python environment is managed with pipenv. Set it up with the following steps:

Run pipenv lock to generate the Pipfile.lock, which lists the versions of your Python packages.
Run pipenv install --dev to create a virtual environment and install the Python packages. The --dev flag also installs the development packages (for linting, ...).
Run pipenv shell to activate the virtual environment.

🚀 Run the DVC pipeline

The ML pipeline is managed with DVC. Here are a few tips on how to use it:

Run the complete pipeline: dvc repro
Run a specific step of the pipeline with all its dependencies: dvc repro <step_name>

DVC stages:

scrape : downloads and saves all articles published since October 2020, using the telex API
preprocess : removes HTML tags and collects all article contents in a single JSON file
train : initializes the data loader and the language model, then trains character-wise in a self-supervised fashion
evaluate : calculates corpus perplexity on a test set and generates random text from an input context
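
The core idea of the preprocess, train and evaluate stages can be sketched with the standard library alone. This is an illustration under stated assumptions, not the project's code: the function names strip_html, next_char_pairs and perplexity are hypothetical, and perplexity is assumed to be computed per character as the exponential of the mean negative log-likelihood.

```python
from __future__ import annotations

import math
from html.parser import HTMLParser
from typing import Iterator


class _TextExtractor(HTMLParser):
    """Collects text nodes and drops the tags around them."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []

    def handle_data(self, data: str) -> None:
        self.parts.append(data)


def strip_html(raw: str) -> str:
    """Preprocess sketch: remove HTML tags and normalize whitespace."""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()
    return " ".join("".join(parser.parts).split())


def next_char_pairs(ids: list[int], block_size: int) -> Iterator[tuple[list[int], list[int]]]:
    """Train sketch: the target is the context shifted by one character,
    so the labels come from the text itself (self-supervised)."""
    for start in range(len(ids) - block_size):
        context = ids[start : start + block_size]
        target = ids[start + 1 : start + block_size + 1]
        yield context, target


def perplexity(target_probs: list[float]) -> float:
    """Evaluate sketch: exp of the mean negative log-likelihood that a model
    assigned to the true next characters (assumed per-character definition)."""
    nll = -sum(math.log(p) for p in target_probs) / len(target_probs)
    return math.exp(nll)


if __name__ == "__main__":
    print(strip_html("<p>Hello <b>world</b></p>"))
    print(list(next_char_pairs([0, 1, 2, 3], block_size=2)))
    # A model that always assigns probability 1/4 to the true character
    # is as uncertain as a uniform choice among four characters.
    print(perplexity([0.25, 0.25, 0.25, 0.25]))
```

Lower perplexity means the model is less surprised by the held-out text; a model guessing uniformly over the whole vocabulary would score a perplexity equal to the vocabulary size.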

🏗️ Structure

.
├── Pipfile                 <- requirements for running the project
├── Pipfile.lock            <- versions of the required packages
├── README.md
├── dvc.lock                <- automatically records the states of the DVC pipeline
├── dvc.yaml                <- lists the stages for the DVC pipeline
├── pyproject.toml          <- contains the build system requirements of the project
├── notebooks
├── params.py               <- contains the parameters of the project
├── data
│   ├── preprocessed
│   └── raw
└── telex                   <- source code of the project
    ├── models              <- ml model definitions
    │   ├── base_model.py
    │   ├── bigram.py
    │   └── transformer.py
    ├── pipeline            <- scripts for each stage in the DVC pipeline
    │   ├── evaluate
    │   ├── preprocess
    │   ├── scrape          <- scraping articles from telex
    │   └── train           <- model training scripts
    └── utils               <- helper scripts
        ├── dataset.py      <- defines pytorch Dataset object from raw articles
        └── io.py           <- input/output related functions

hbenedek/telex-nlp

Folders and files

Latest commit

History

Repository files navigation

📜 telex-nlp

🔧 Set up environment

🚀 Run the DVC pipeline

🏗️ Structure

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages