20news

This directory contains the 20 Newsgroups dataset, pre-converted into Annif vocabulary and document corpus format.

The conversion script, fetch-20news.py, is also included. It uses the scikit-learn fetch_20newsgroups function, which is a convenient way of accessing the dataset.
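As a rough illustration of what such a conversion involves, the sketch below turns messages and their newsgroup labels into tab-separated lines. It is not the actual script: the URI prefix, the function names, and the exact line layouts (text, tab, subject URI in angle brackets for the corpus; URI in angle brackets, tab, label for the vocabulary) are assumptions made for this example. The `texts`, `targets` and `target_names` arguments correspond to the `data`, `target` and `target_names` fields of the object returned by fetch_20newsgroups.

```python
import re

# Hypothetical URI prefix, chosen for illustration only; the real
# vocabulary file may use a different scheme.
URI_PREFIX = "http://example.org/20news/"


def to_corpus_line(text, group):
    """Format one message as 'text<TAB><subject URI>' (assumed layout)."""
    # Collapse newlines, tabs and repeated spaces so the message fits on
    # a single line and cannot be confused with the column separator.
    flat = re.sub(r"\s+", " ", text).strip()
    return f"{flat}\t<{URI_PREFIX}{group}>"


def to_vocab_line(group):
    """Format one newsgroup as '<subject URI><TAB>label' (assumed layout)."""
    return f"<{URI_PREFIX}{group}>\t{group}"


def convert(texts, targets, target_names):
    """Return (corpus_lines, vocab_lines) for already-loaded data.

    texts        -- list of message strings
    targets      -- list of integer class indices, one per message
    target_names -- list of newsgroup names, indexed by class
    """
    corpus = [to_corpus_line(t, target_names[i]) for t, i in zip(texts, targets)]
    vocab = [to_vocab_line(name) for name in target_names]
    return corpus, vocab
```

Keeping the formatting helpers separate from the download step means they can be tried on a handful of strings without fetching the whole dataset.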

This is the bydate flavor of the dataset, which is split by posting date into train (n=11314) and test (n=7532) subsets. All header information and quote headers have been stripped, because they could provide non-topical hints about the newsgroup a message was posted in.
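A minimal sketch of how the two subsets might be loaded with scikit-learn follows. The `subset` and `remove` parameters of fetch_20newsgroups exist in scikit-learn, but the value `remove=("headers", "quotes")` is only a guess at how the stripping described above maps onto those options; the helper names are hypothetical. Calling `load_subset` downloads the dataset on first use.

```python
# Subset sizes as stated in this README.
EXPECTED_SIZES = {"train": 11314, "test": 7532}


def load_subset(subset):
    """Fetch one bydate subset ('train' or 'test') with headers stripped."""
    # Imported lazily so the size check below works without scikit-learn.
    from sklearn.datasets import fetch_20newsgroups

    # Assumption: 'headers' and 'quotes' approximate "all header
    # information as well as quote headers".
    return fetch_20newsgroups(subset=subset, remove=("headers", "quotes"))


def has_expected_size(subset, n_docs):
    """Return True if n_docs matches the size stated for the subset."""
    return EXPECTED_SIZES[subset] == n_docs
```

With network access, `has_expected_size("train", len(load_subset("train").data))` compares the fetched training subset against the stated count.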

Files in this directory:

20news-test.tsv
20news-train.tsv
20news-vocab.tsv
README.md
fetch-20news.py
