- Description in English: https://medium.com/@phoenixilya/news-aggregator-in-2-weeks-5b38783b95e3
- Description in Russian: https://habr.com/ru/post/487324/
- Russian: https://ilyagusev.github.io/tgcontest/ru/main.html
- English: https://ilyagusev.github.io/tgcontest/en/main.html
Prerequisites: CMake, Boost
$ sudo apt-get install cmake libboost-all-dev build-essential libjsoncpp-dev uuid-dev protobuf-compiler libprotobuf-dev
For macOS:
$ brew install boost jsoncpp ossp-uuid protobuf
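Before building, it can help to confirm that the command-line tools are on `PATH`. The following is a minimal sketch, not part of the repository: `missing_tools` is a hypothetical helper name, and it only checks executables (header-only or library dependencies such as Boost and jsoncpp cannot be detected this way).

```shell
# Print the name of every tool that is NOT found on PATH.
# missing_tools is a hypothetical helper, not shipped with tgcontest.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Usage: prints nothing if all three executables are installed.
#   missing_tools cmake make protoc
```

Empty output means all listed tools were found; otherwise each missing tool is printed on its own line.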
If you have the zip archive, skip straight to building the binary.
To download code and models:
$ git clone https://github.com/IlyaGusev/tgcontest
$ cd tgcontest
$ git submodule update --init --recursive
$ bash download_models.sh
$ wget https://download.pytorch.org/libtorch/cpu/libtorch-cxx11-abi-shared-with-deps-1.5.0%2Bcpu.zip
$ unzip libtorch-cxx11-abi-shared-with-deps-1.5.0+cpu.zip
For macOS, use https://download.pytorch.org/libtorch/cpu/libtorch-macos-1.5.0.zip instead.
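The choice between the two libtorch archives listed above can be scripted. This is a sketch under assumptions: `libtorch_url` is a hypothetical helper, it expects the output of `uname -s` as its argument, and it covers only the two platforms mentioned here.

```shell
# Map a platform name (as printed by `uname -s`) to the libtorch 1.5.0 CPU archive.
# libtorch_url is a hypothetical helper; only the two URLs from this README are used.
libtorch_url() {
    case "$1" in
        Linux)
            echo "https://download.pytorch.org/libtorch/cpu/libtorch-cxx11-abi-shared-with-deps-1.5.0%2Bcpu.zip" ;;
        Darwin)
            echo "https://download.pytorch.org/libtorch/cpu/libtorch-macos-1.5.0.zip" ;;
        *)
            echo "unsupported platform: $1" >&2
            return 1 ;;
    esac
}

# Usage (performs the download, so it is left commented out here):
#   url="$(libtorch_url "$(uname -s)")" && wget -O libtorch.zip "$url" && unzip libtorch.zip
```

Saving the archive under a fixed name with `wget -O` avoids dealing with the URL-encoded `%2B` in the Linux file name.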
To build the binary (in the "tgcontest" directory):
$ mkdir build && cd build && Torch_DIR="../libtorch" cmake -DCMAKE_BUILD_TYPE=Release .. && make -j4
To download datasets:
$ bash download_data.sh
Run on a sample:
$ ./build/tgnews top data --ndocs 10000
- Russian FastText vectors training: VectorsRu.ipynb
- Russian fasttext category classifier training: CatTrainRu.ipynb
- Russian sentence embedder training (v2): SimilarityRu.ipynb
- English FastText vectors training: VectorsEn.ipynb
- English fasttext category classifier training: CatTrainEn.ipynb
- English sentence embedder training: SimilarityEn.ipynb
- PageRank rating calculation: PageRankRating.ipynb
- Russian ELMo-based sentence embedder training (not used):
- Russian sentence embedder with triplet loss training (v3):
- XLM-RoBERTa pseudo-labeling for categorization:
- Language detection model (round 2): lang_detect_v10.ftz
- Russian FastText vectors (round 2): ru_vectors_v3.bin
- Russian categories detection model (round 2): ru_cat_v5.ftz
- English FastText vectors (round 2): en_vectors_v3.bin
- English categories detection model (round 2): en_cat_v5.ftz
- PageRank-based agency rating: pagerank_rating.txt
- Alexa agency rating: alexa_rating_4_fixed.txt
- XLM-RoBERTa for categorization (pytorch-lightning checkpoint): xlmr_en_ru_cat_v1.tar.gz
- Russian news from all archives, except 1117: ru_tg_1101_0510.jsonl.tar.gz
- Russian news from 1117 archive: ru_tg_0511_0517.jsonl.tar.gz
- English news from 1821, 2225, 29 and 09 archives: en_tg_test.tar.gz
- Data for training Russian vectors: ru_unsupervised_train.tar.gz
- Data for training English vectors: en_unsupervised_train.tar.gz
- Russian categories raw train markup: ru_cat_v4_train_raw_markup.tsv
- Russian categories aggregated train markup: ru_cat_v4_train_annot.json
- Russian categories aggregated train markup in fastText format: ft_ru_cat_v4_train.txt
- Russian categories manual train markup: ru_cat_v4_train_manual_annot.json
- Russian categories manual train markup in fastText format: ft_ru_cat_v4_train_manual.txt
- Russian categories raw test markup: ru_cat_v4_test_raw_markup.tsv
- Russian categories aggregated test markup: ru_cat_v4_test_annot.json
- Russian categories aggregated test markup in fastText format: ft_ru_cat_v4_test.txt
- English categories aggregated train markup: en_cat_v4_train_annot.json
- English categories aggregated train markup in fastText format: ft_en_cat_v4_train.txt
- English categories aggregated test markup: en_cat_v4_test_annot.json
- English categories aggregated test markup in fastText format: ft_en_cat_v4_test.txt
- Russian clustering pairs: ru_pairs_raw_markup.tsv
- English clustering pairs: en_pairs_raw_markup.tsv
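Several of the markup files above are provided in fastText format. In fastText's supervised format, each line holds one example: a `__label__<category>` prefix followed by the text. The sketch below illustrates that format only. It assumes a hypothetical two-column, tab-separated input (label, then text), which is not necessarily the layout of the raw markup files listed here, and `tsv_to_fasttext` is a hypothetical helper name.

```shell
# Convert "label<TAB>text" lines on stdin into fastText supervised format.
# The two-column input layout is an assumption made for illustration.
tsv_to_fasttext() {
    awk -F'\t' '{ printf "__label__%s %s\n", $1, $2 }'
}

# Usage with a made-up example line ("sports" is a hypothetical label):
#   printf 'sports\tTeam wins final\n' | tsv_to_fasttext
```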
- Round 2
- II place
- III place
- IV place
- Bossy Gnu: https://github.com/maxoodf/tgnews
- Other:
- Large Crab: https://github.com/ilya-ustinov/tgcontest
- Round 1
- III place
- Kooky Dragon: https://github.com/nick-baliesnyi/tgnews
- IV place
- Sharp Sloth: https://github.com/thehemen/telegram-data-clustering
- Other
- Desert Python: https://github.com/crazyleg/telegram_data_clustering_2019
- Funky Peacock: https://github.com/Stepka/telegram_clustering_contest
- Unknown animal: https://github.com/roman-rybalko/telegram-data-clustering-contest
- Unknown animal: https://github.com/MarcoBuster/data-clustering-contest
- Unknown animal: https://github.com/sudevschiz/tgnews
- Unknown animal: https://github.com/crazyleg/telegram_data_clustering_2019
- Unknown animal: https://github.com/77ph/tgnews
- Unknown animal: https://github.com/akash-joshi/telegram-cluster
- Unknown animal: https://github.com/dremovd/telegram-clustering
- III place
- Telegram: @YallenGusev