GitHub - piki/wikipedia-film-database: Make a structured movie database from Wikipedia

Overview

This is a little Ruby script to generate a totally free movie database from Wikipedia.

The input is the bzipped XML file containing all English-language Wikipedia pages. The output is a log file containing one JSON blob per line. Each JSON blob is data from a single film, with keys for the film's title, cast, director(s), producer(s), production and distribution companies, and release year. Any lines in the output that are not JSON are debugging information (sometimes indicating a possible error or omission) and can be safely ignored.
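As an illustration of consuming this output, here is a minimal Ruby sketch that keeps the JSON lines and drops the debugging lines. The method name parse_film_lines is hypothetical and not part of this repository, and the sketch assumes every film record is a JSON object, so its line starts with "{".

```ruby
require "json"

# Hypothetical helper, not part of this repository: collect film records
# from find-movies output, skipping the non-JSON debugging lines.
# Accepts anything that responds to each_line (a File, an IO, or a String).
def parse_film_lines(io)
  films = []
  io.each_line do |line|
    line = line.strip
    # Assumption: each film record is a JSON object, so it starts with "{".
    next unless line.start_with?("{")
    begin
      record = JSON.parse(line)
    rescue JSON::ParserError
      next # looked like JSON but was not; treat it as debugging output
    end
    films << record if record.is_a?(Hash)
  end
  films
end
```

For a file produced as shown under Usage, something like `films = File.open("filmdb.txt") { |f| parse_film_lines(f) }` would load all records into memory; for very large outputs, processing each record inside the loop may be preferable.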

The output can be found at https://oracleofbacon.org/data.txt.bz2.

Licensing

The data this script produces inherits its license from Wikipedia, namely the CC BY-SA 3.0 license.

Limitations

Wikipedia is written in prose form, for human readers. Although there are some formatting conventions and structured elements, they vary widely from page to page. This script parses the majority of film-related pages in Wikipedia successfully, but it will probably never parse them all. Typical mistakes include omitting part or all of a film's cast list, confusing multiple actors with the same name, and sometimes even treating prose or HTML tags as if they were part of actors' names. Pull requests for test cases and/or bug fixes are welcome.

Wikipedia is also a general-interest encyclopedia, not a comprehensive list of every actor in every film. In particular, it has about 5% as much data as the IMDb. Most popular films and prominent actors and actresses are covered, but the long tail of old, foreign, and obscure films is not.

This script makes no attempt to parse forms of video entertainment other than films. In particular, TV shows and video games are ignored.

Usage

Download a snapshot from https://dumps.wikimedia.org/enwiki/latest/. The file you are looking for will have a name like enwiki-20181220-pages-articles-multistream.xml.bz2. If you also want to use the get-article script, download the index file from the same directory as well; it is named something like enwiki-20181220-pages-articles-multistream-index.txt.bz2.

Then decompress the file (in a pipe, not to a file) and pipe its output to ./find-movies. For example:

bzip2 -cd enwiki-20181220-pages-articles-multistream.xml.bz2 | ./find-movies | tee filmdb.txt

If you want to extract a single article from Wikipedia, use the ./get-article command. The command searches linearly through the index file, then uses that result to seek directly to the right block in the multistream file. Because the search is linear, articles near the beginning of the index are extracted much faster than those near the end. A title that is not in the index forces a scan of the entire index before the not-found error is reported, so be sure you're extracting an article that actually exists.

./get-article "Army of Darkness" > "Army of Darkness.xml"
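The lookup described above can be sketched in Ruby as follows. This is not the get-article source, only an illustration of the technique. It assumes the index has already been decompressed into lines of the form offset:page_id:title, which is the layout of the Wikipedia multistream index; the method names find_block_range and read_block are hypothetical.

```ruby
# Hypothetical sketch, not the actual get-article implementation.
# index_lines: any enumerable of decompressed index lines, "offset:page_id:title".
# Returns [start, stop] byte offsets of the block holding the title,
# [start, nil] if it is in the last block, or nil if the title is absent.
def find_block_range(index_lines, title)
  start = nil
  index_lines.each do |line|
    next if line.strip.empty?
    # Titles may contain colons, so split into at most three fields.
    offset, _page_id, name = line.chomp.split(":", 3)
    offset = Integer(offset)
    if start
      # Many pages share one block; the first larger offset is the next block.
      return [start, offset] if offset > start
    elsif name == title
      start = offset
    end
  end
  start ? [start, nil] : nil
end

# Read the raw bytes of one block from the multistream dump.
# The bytes form a standalone bzip2 stream and still need decompressing.
def read_block(dump_path, start, stop)
  File.open(dump_path, "rb") do |f|
    f.seek(start)
    stop ? f.read(stop - start) : f.read
  end
end
```

The bytes returned by read_block can be decompressed with bzip2 -dc, which yields the XML for every page in that block; the wanted article then has to be picked out by its title. A missing title walks the whole index before returning nil, which is why not-found lookups are slow.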

Development

To run the unit tests, run:

make -C test

If you find a movie that doesn't get parsed properly, extract its article with ./get-article "Name of Movie" > test/fixtures/name-of-movie.xml, and add a test case for it to test/tc_fixtures.rb. Bonus points for fixing the bug, but even clear test cases are appreciated.

