Anserini is a Lucene toolkit for reproducible information retrieval research

kevinros/anserini

Anserini

Getting Started

Build using Maven:

mvn clean package appassembler:assemble

The eval/ directory contains evaluation tools and scripts, including trec_eval, gdeval.pl, and ndeval. Before using trec_eval, unpack and compile it as follows:

tar xvfz trec_eval.9.0.tar.gz && cd trec_eval.9.0 && make

Before using ndeval, compile it as follows:

cd ndeval && make
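
Once trec_eval is compiled, it can be driven from a script. The following is a minimal sketch, not part of Anserini: it assumes the binary ended up at eval/trec_eval.9.0/trec_eval (i.e., the tarball was unpacked inside eval/), and the helper names, file paths, and choice of measures are hypothetical.

```python
import subprocess

# Assumed location of the compiled binary (tarball unpacked inside eval/).
TREC_EVAL = "eval/trec_eval.9.0/trec_eval"


def build_trec_eval_cmd(qrels_path, run_path, measures=("map", "P.30")):
    """Build the argument list: trec_eval -m <measure> ... <qrels> <run>."""
    cmd = [TREC_EVAL]
    for measure in measures:
        cmd += ["-m", measure]
    cmd += [qrels_path, run_path]
    return cmd


def run_trec_eval(qrels_path, run_path, measures=("map", "P.30")):
    """Run trec_eval and return its raw text output; raises if it fails."""
    result = subprocess.run(
        build_trec_eval_cmd(qrels_path, run_path, measures),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Calling run_trec_eval with the path to a qrels file and a run file returns whatever trec_eval prints for the requested measures.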

Running Standard IR Experiments

Anserini is designed to support experiments on various standard TREC collections out of the box:

  • Newswire
  • Web
  • Tweets

Tools

  • IndexUtils is a command-line utility for interacting with an index, e.g., printing index statistics. Run target/appassembler/bin/IndexUtils -h for more details.

  • Axiomatic Reranking

  • MapCollections is a generic mapper framework that processes file segments in parallel. Developers can build their own mapper by extending DocumentMapper. One example is our CountDocumentMapper, which counts the number of documents in the whole collection:

    nohup target/appassembler/bin/MapCollections -collection TrecCollection -threads 16 -input /tuna1/collections/newswire/disk12/ -mapper CountDocumentMapper &> log.disk12.count &
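
To illustrate the idea behind MapCollections (apply a mapper to each file segment in parallel, then combine the results), here is a conceptual Python sketch. It is not Anserini's Java implementation or the DocumentMapper API: the function names are hypothetical, and it assumes uncompressed TREC-style segments in which every document starts with a `<DOC>` tag.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# Assumption: each document in a segment opens with a literal <DOC> tag.
DOC_OPEN = re.compile(rb"<DOC>")


def count_documents(segment_path):
    """Mapper: count the documents in a single file segment."""
    return len(DOC_OPEN.findall(Path(segment_path).read_bytes()))


def map_collection(input_dir, mapper, threads=16):
    """Apply `mapper` to every file under `input_dir` in parallel and sum the results."""
    segments = sorted(p for p in Path(input_dir).rglob("*") if p.is_file())
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return sum(pool.map(mapper, segments))
```

Here map_collection(some_dir, count_documents) plays the role of the CountDocumentMapper run above: each segment is counted independently, so the per-segment work can be spread across threads.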

Python Interface

Anserini was designed with Python integration in mind, for connecting with popular deep learning toolkits such as PyTorch. This is accomplished via pyjnius. The SimpleSearcher class provides a simple Python/Java bridge, shown below:

import jnius_config
jnius_config.set_classpath("target/anserini-0.1.1-SNAPSHOT-fatjar.jar")

from jnius import autoclass
JString = autoclass('java.lang.String')
JSearcher = autoclass('io.anserini.search.SimpleSearcher')

searcher = JSearcher(JString('lucene-index.robust04.pos+docvectors+rawdocs'))
hits = searcher.search(JString('hubble space telescope'))

# the docid of the 1st hit
hits[0].docid

# the internal Lucene docid of the 1st hit
hits[0].ldocid

# the score of the 1st hit
hits[0].score

# the full document of the 1st hit
hits[0].content
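
A common next step is to write the ranked hits in the standard TREC run format (topic, Q0, docid, rank, score, tag) so they can be scored with trec_eval. The sketch below is an assumption-laden illustration rather than part of SimpleSearcher: to_trec_run is a hypothetical helper, and it only assumes that hits is a sequence of objects exposing the docid and score attributes shown above. The stand-in hits use made-up docids and scores so the snippet runs without a JVM or an index.

```python
from types import SimpleNamespace


def to_trec_run(topic_id, hits, run_tag="demo"):
    """Format hits as TREC run lines: <topic> Q0 <docid> <rank> <score> <tag>."""
    lines = []
    for rank, hit in enumerate(hits, start=1):
        lines.append(f"{topic_id} Q0 {hit.docid} {rank} {hit.score:.4f} {run_tag}")
    return lines


# Stand-in hits with made-up values; real hits would come from searcher.search(...).
demo_hits = [
    SimpleNamespace(docid="DOC-A", score=14.2),
    SimpleNamespace(docid="DOC-B", score=13.75),
]

for line in to_trec_run("301", demo_hits):
    print(line)
```

With real results, the same call would be to_trec_run(topic_id, hits), with the lines written to a run file instead of printed.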

Release History
