Script and instructions to load CORD-19 citation data into Neo4j #176
The script prepares the CSV files that the Cypher queries use to load the citation data into Neo4j. Originally used `py2neo` to load the articles directly from the script, but that was very slow.

Currently the only reliable citation data is the article titles. The script first creates nodes for the articles in the dataset, then merges cited articles, creating a new node only when the title has not been seen before. This leaves the possibility that distinct articles sharing the same title collapse into a single node. We could try to use the additional citation metadata to separate these, but that metadata is unreliable.
A quick script shows there are 59839 articles in total, and 57099 remain after removing duplicate titles.
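A rough version of that check can also be run inside Neo4j after the load, again assuming the `Article` label from the sketch above:

```cypher
// Compare total Article nodes with the number of distinct titles.
MATCH (a:Article)
RETURN count(a) AS articles, count(DISTINCT a.title) AS distinct_titles;
```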