The task is to clean the text data of web pages crawled from different sites.
It is necessary to ensure that the distribution of the 100 most frequent words includes only meaningful English words (no particles, conjunctions, prepositions, numbers, tags, or symbols).
Determine the correct order of the operations below and carry out the appropriate cleaning.
- Remove non-English words
- Remove HTML tags (try to do it with a regular expression, or play with the BeautifulSoup library); a short sketch of one possible pipeline is given after this list
- Apply lemmatization / stemming
- Remove stop-words
- Additional processing, at your own discretion, if it helps to obtain a better distribution
- Detect duplicated texts (duplicates do not imply an exact word-for-word match, but texts that may contain paraphrases or rearranged words and sentences)
- Plot the number of duplicates as a function of shingle size (with a fixed MinHash length)
- Plot the number of duplicates as a function of MinHash length (with a fixed shingle size)
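A minimal sketch of one possible pipeline is shown below. It assumes the NLTK stop-word and WordNet resources are available and uses the datasketch library for MinHash; the function names (`clean_text`, `shingles`, `minhash`), the chosen order of operations, and the default parameters are illustrative, not the required answer.

```python
# A possible cleaning pipeline (the order of operations is a suggestion, not the answer).
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

STOPWORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def clean_text(raw_html: str) -> list[str]:
    text = re.sub(r"<[^>]+>", " ", raw_html)             # strip HTML tags
    tokens = re.findall(r"[a-z]+", text.lower())          # keep Latin-alphabet words only
    tokens = [t for t in tokens if t not in STOPWORDS]    # remove stop-words
    return [LEMMATIZER.lemmatize(t) for t in tokens]      # lemmatize

# Near-duplicate detection: MinHash over word shingles (datasketch estimates Jaccard similarity).
from datasketch import MinHash

def shingles(tokens: list[str], k: int = 3) -> set[str]:
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}

def minhash(shingle_set: set[str], num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for s in shingle_set:
        m.update(s.encode("utf8"))
    return m

# Two texts are near-duplicates if the estimated Jaccard similarity exceeds a threshold:
# sim = minhash(shingles(clean_text(a))).jaccard(minhash(shingles(clean_text(b))))
```

Varying `k` (shingle size) and `num_perm` (MinHash length) while counting detected duplicates gives the two plots requested above.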
The provided data contain chunked stories by Edgar Allan Poe (EAP), Mary Shelley (MWS), and HP Lovecraft (HPL).
The dataset can be downloaded here: https://drive.google.com/file/d/14tAjAzHr6UmFVFV7ABTyNHBh-dWHAaLH/view?usp=sharing
- Preprocess the dataset with the functions from Part 1
- Implement the following three quality functions: coherence (or tf-idf coherence), normalized PMI, and a coherence based on distributed word representations (you can use pretrained w2v vectors or some other model). You are free to use any libraries (for instance, gensim) and components.
- Read and preprocess the dataset, and divide it into train and test parts with sklearn.model_selection.train_test_split. The test part will be used in the classification part. For simplicity we do not perform cross-validation here, but you should remember about it.
- Implement topic modeling with NMF (you can use sklearn.decomposition.NMF) and print out the resulting topics. Try to change the hyperparameters to better fit the dataset. A short sketch of the split, NMF, and NPMI coherence is given after this list.
- Implement topic modeling with LDA (you can use the gensim implementation) and print out the resulting topics. Try to change the hyperparameters to better fit the dataset.
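A minimal sketch of the train/test split, NMF topics, and an NPMI coherence check is shown below; it assumes `docs` is a list of preprocessed documents from Part 1 (whitespace-separated strings), and the vectorizer settings and number of topics are placeholders to be tuned.

```python
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Split once; the test part is reused later for classification.
train_docs, test_docs = train_test_split(docs, test_size=0.2, random_state=42)
train_tokens = [d.split() for d in train_docs]

# NMF on tf-idf features.
vectorizer = TfidfVectorizer(max_features=5000)
X_train = vectorizer.fit_transform(train_docs)
nmf = NMF(n_components=10, random_state=42)
nmf.fit(X_train)
terms = vectorizer.get_feature_names_out()

# Print the top words of each topic.
topics = [[terms[i] for i in comp.argsort()[-10:][::-1]] for comp in nmf.components_]
for idx, topic in enumerate(topics):
    print(f"Topic {idx}: {' '.join(topic)}")

# Normalized PMI coherence of the topics via gensim.
dictionary = Dictionary(train_tokens)
npmi = CoherenceModel(topics=topics, texts=train_tokens,
                      dictionary=dictionary, coherence="c_npmi").get_coherence()
print("NPMI coherence:", npmi)
```

The same `topics`/`texts` pattern can be reused for the other coherence variants and for the LDA topics.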
For this task, the well-known Quora duplicate detection dataset will be used.
- Do a little exploratory analysis. Find how many duplicates and non-duplicates there are in the train part, and take any other actions of your interest to better understand the data.
- Build the siamese network
- Measure the quality. To calculate the loss, use Triplet Loss.
- Train the model
- Implement the Self-Attention mechanism (a sketch of the shared encoder, the triplet loss, and self-attention is given after this list)
- Analyse a pre-trained BERT model for text processing
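Below is a minimal PyTorch sketch of the shared (siamese) encoder, Triplet Loss, and a simplified scaled dot-product self-attention helper; the architecture, dimensions, and margin are illustrative assumptions rather than the required design.

```python
import torch
import torch.nn as nn

class SiameseEncoder(nn.Module):
    """One shared encoder; the 'siamese' part is that anchor, positive and negative reuse its weights."""
    def __init__(self, vocab_size: int, emb_dim: int = 128, hidden: int = 128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.rnn = nn.LSTM(emb_dim, hidden, batch_first=True)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        _, (h, _) = self.rnn(self.emb(token_ids))
        return h[-1]                                    # (batch, hidden) sentence embedding

def self_attention(x: torch.Tensor) -> torch.Tensor:
    """Simplified scaled dot-product self-attention where x serves as queries, keys and values."""
    scores = x @ x.transpose(-2, -1) / x.size(-1) ** 0.5
    return torch.softmax(scores, dim=-1) @ x

encoder = SiameseEncoder(vocab_size=30_000)
triplet_loss = nn.TripletMarginLoss(margin=1.0)

# anchor/positive come from a duplicate pair, negative from a non-duplicate question:
# loss = triplet_loss(encoder(anchor_ids), encoder(positive_ids), encoder(negative_ids))
```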
Understand the fine-tuning procedure and get acquainted with the Huggingface Datasets library.
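For instance, loading a dataset from the Huggingface hub takes a single call; the dataset name used here ("imdb") is only an illustration, not the assignment's dataset.

```python
from datasets import load_dataset

ds = load_dataset("imdb")   # a DatasetDict with "train" and "test" splits
print(ds["train"][0])       # a single example as a plain dict
```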
- Load tokenizer and model
- Look at the predictions of the model as-is before any fine-tuning
- Define the optimizer and a scheduler (optional)
- Fine-tune the model (write the training loop), plot the loss changes, and measure the results in terms of the weighted F1 score; a sketch of such a loop is given after this list
- Get the masked word predictions (for the sample sentences above) from the fine-tuned model. Why are the results as they are, and what should be done in order to change that? (Write down your answer.)
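A minimal sketch of the fine-tuning loop with the transformers library, assuming a sequence-classification setup; the checkpoint name, hyperparameters, and the `train_loader` of (texts, labels) batches are assumptions, not the assignment's exact configuration.

```python
import torch
from sklearn.metrics import f1_score
from transformers import AutoModelForSequenceClassification, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

losses = []
model.train()
for epoch in range(3):
    for texts, labels in train_loader:                  # assumed DataLoader of (texts, labels)
        batch = tokenizer(list(texts), padding=True, truncation=True,
                          return_tensors="pt").to(device)
        out = model(**batch, labels=labels.to(device))  # loss is computed internally
        out.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        losses.append(out.loss.item())                  # plot these to see the loss curve

# Weighted F1 on the held-out test part:
# preds = model(**tokenizer(test_texts, padding=True, truncation=True,
#                           return_tensors="pt").to(device)).logits.argmax(-1)
# print(f1_score(test_labels, preds.cpu(), average="weighted"))
```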