This directory contains the 20 Newsgroups dataset, pre-converted into Annif vocabulary and document corpus format.
The script used for conversion is also available. It makes use of the scikit-learn fetch_20newsgroups function which is a convenient way of accessing the dataset.
This is the bydate
flavor of the dataset, which has been split into
train
(n=11314) and test
(n=7532) subsets by date. All header
information as well as quote headers, which could provide non-topical hints about
the newsgroup a message was posted in, have been stripped.