
Commit

Update installation instructions (#921)
lintool authored Jan 7, 2022
1 parent 937ec63 commit ddef893
Showing 3 changed files with 106 additions and 52 deletions.
30 changes: 3 additions & 27 deletions README.md
@@ -16,9 +16,9 @@ Our toolkit is self-contained as a standard Python package and comes with querie
With Pyserini, it's easy to [reproduce](docs/pypi-reproduction.md) runs on a number of standard IR test collections!
A low-effort way to try things out is to look at our [online notebooks](https://github.com/castorini/anserini-notebooks), which will allow you to get started with just a few clicks.

## Package Installation
## Installation

Install via PyPI (requires Python 3.6+):
Install via PyPI (requires Python 3.8+):

```
pip install pyserini
@@ -34,33 +34,9 @@ We leave the installation of these packages to you.
The software ecosystem is rapidly evolving and a potential source of frustration is incompatibility among different versions of underlying dependencies.
We provide additional detailed installation instructions [here](./docs/installation.md).

## Development Installation

If you're planning on just _using_ Pyserini, then the `pip` instructions above are fine.
However, if you're planning on contributing to the codebase or want to work with the latest not-yet-released features, you'll need a development installation.
For this, clone our repo with the `--recurse-submodules` option to make sure the `tools/` submodule also gets cloned.

The `tools/` directory, which contains evaluation tools and scripts, is actually [this repo](https://github.com/castorini/anserini-tools), integrated as a [Git submodule](https://git-scm.com/book/en/v2/Git-Tools-Submodules) (so that it can be shared across related projects).
Build as follows (you might get warnings, but okay to ignore):

```bash
cd tools/eval && tar xvfz trec_eval.9.0.4.tar.gz && cd trec_eval.9.0.4 && make && cd ../../..
cd tools/eval/ndeval && make && cd ../../..
```

Next, you'll need to clone and build [Anserini](http://anserini.io/).
It makes sense to put both `pyserini/` and `anserini/` in a common folder.
After you've successfully built Anserini, copy the fatjar, which will be `target/anserini-X.Y.Z-SNAPSHOT-fatjar.jar` into `pyserini/resources/jars/`.
As with the `pip` installation, a potential source of frustration is incompatibility among different versions of underlying dependencies.
For these and other issues, we provide additional detailed installation instructions [here](./docs/installation.md).

You can confirm everything is working by running the unit tests:

```bash
python -m unittest
```

Assuming all tests pass, you should be ready to go!
Instructions are provided [here](./docs/installation.md#development-installation).

## Quick Links

126 changes: 102 additions & 24 deletions docs/installation.md
@@ -1,13 +1,15 @@
# Pyserini: Detailed Installation Guide

Pyserini has a number of important dependencies.
We recommend Python 3.8 for Pyserini.
At a high level, we try to keep our [`requirements.txt`](../requirements.txt) up to date.
Pyserini has a number of important dependencies:

For sparse retrieval, Pyserini depends on [Anserini](http://anserini.io/), which is built on Lucene.
[PyJNIus](https://github.com/kivy/pyjnius) is used to interact with the JVM.

For dense retrieval (since it involves neural networks), we need the [🤗 Transformers library](https://github.com/huggingface/transformers), [PyTorch](https://pytorch.org/), and [Faiss](https://github.com/facebookresearch/faiss) (specifically `faiss-cpu`).
A `pip` installation will automatically pull in the first to satisfy the package requirements, but since the other two may require platform-specific custom configuration, they are _not_ explicitly listed in the package requirements.
We leave the installation of these packages to you.
We leave the installation of these packages to you (but provide detailed instructions below).

In general, our development team tries to keep dependent packages at the same versions and upgrade in lockstep.
As of Pyserini v0.14.0, our "reference" configuration is a Linux machine running Ubuntu 18.04 with `faiss-cpu==1.7.0`, `transformers==4.6.0`, and `torch==1.8.1`.
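To see how a local environment compares with the reference configuration quoted above, a minimal sketch along the following lines may help. It is not part of Pyserini; the helper names `installed_version` and `compare_to_reference` are hypothetical, and the version numbers are simply the ones listed in the text for v0.14.0.

```python
from importlib import metadata

# Reference versions quoted in the text above (Pyserini v0.14.0).
REFERENCE = {"faiss-cpu": "1.7.0", "transformers": "4.6.0", "torch": "1.8.1"}


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def compare_to_reference(reference=REFERENCE, lookup=installed_version):
    """Map each package name to a tuple (expected, found, matches)."""
    report = {}
    for package, expected in reference.items():
        found = lookup(package)
        # Assumption: a local build tag such as "1.8.1+cpu" counts as a match.
        matches = found is not None and found.split("+")[0] == expected
        report[package] = (expected, found, matches)
    return report


if __name__ == "__main__":
    for package, (expected, found, ok) in compare_to_reference().items():
        status = "ok" if ok else "differs"
        print(f"{package}: expected {expected}, found {found} ({status})")
```

A mismatch does not necessarily mean something is broken; it only indicates that the environment differs from the configuration the developers test against.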
@@ -17,72 +19,148 @@ With other versions of the dependent packages, as they say, your mileage may var

## Preliminaries

Below is a step-by-step Pyserini installation guide.
We assume you have [Anaconda](https://www.anaconda.com/) installed.
Below is a step-by-step Pyserini installation guide based on Python 3.8.
We recommend using [Anaconda](https://www.anaconda.com/) and assume you have already installed it.

Create new environment:

```bash
$ conda create -n pyserini python=3.6
$ conda create -n pyserini python=3.8
$ conda activate pyserini
```

Install JDK 11 via conda:
If you do not already have JDK 11 installed, install via `conda`:

```bash
$ conda install -c conda-forge openjdk=11
```

If your system already has JDK 11 installed, the above step can be skipped.
Use `java --version` to check one way or the other.
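If you would rather script the check, the following is a rough sketch that runs `java --version` and extracts the major version. The function names are hypothetical, and the parsing assumes the usual output shapes (for example `openjdk 11.0.13 ...`, or the legacy `1.8.0_292` scheme); other JDK vendors may print something different.

```python
import re
import subprocess


def parse_java_major(version_output):
    """Extract the major Java version from version output, or None if not found."""
    match = re.search(r"(\d+)(?:\.(\d+))?", version_output)
    if match is None:
        return None
    major = int(match.group(1))
    # Legacy numbering: "1.8.0_292" means Java 8.
    if major == 1 and match.group(2) is not None:
        return int(match.group(2))
    return major


def java_major_version():
    """Run `java --version` and return the major version, or None if unavailable."""
    try:
        result = subprocess.run(
            ["java", "--version"], capture_output=True, text=True, check=False
        )
    except FileNotFoundError:
        return None  # no `java` executable on the PATH
    return parse_java_major(result.stdout + result.stderr)


if __name__ == "__main__":
    major = java_major_version()
    if major == 11:
        print("JDK 11 found; the conda install step can be skipped.")
    else:
        print(f"Detected Java major version: {major}; consider installing JDK 11.")
```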

## Pip Installation

If you're just _using_ Pyserini, a `pip` installation will suffice; this contrasts with a _development_ installation (details below).

```bash
$ pip install pyserini
$ pip install transformers==4.6.0 # https://github.com/castorini/pyserini/issues/734
$ pip install onnxruntime
$ conda install -c conda-forge pyjnius
```

Install Pytorch based on environment (see [this guide](https://pytorch.org/get-started/locally/) for additional details):
As discussed above, installation of PyTorch can be a bit tricky, so we ask you to do it separately:

```bash
$ pip3 install torch==1.8.1 torchvision==0.9.1 torchaudio===0.8.1 -f https://download.pytorch.org/whl/torch_stable.html
$ pip install torch==1.8.1 torchvision==0.9.1 torchaudio===0.8.1 -f https://download.pytorch.org/whl/torch_stable.html
```

Install Faiss based on environment
And installing Faiss:

```bash
$ conda install faiss-cpu -c pytorch
```

## Development Installation
By this point, Pyserini should have been installed.
For the impatient, that's it!

However, it might be worthwhile to do a bit of sanity checking, per below.
Be warned, though, that these represent "real" retrieval experiments and may take some time to run.

To confirm that bag-of-words retrieval is working correctly, you can run the BM25 baseline on the MS MARCO passage ranking task:

First follow the steps [here](#development-installation) but run
```bash
$ pip install -e . # use this
$ python -m pyserini.search \
--topics msmarco-passage-dev-subset \
--index msmarco-passage \
--output run.msmarco-passage.txt \
--output-format msmarco \
--bm25

$ python -m pyserini.eval.msmarco_passage_eval msmarco-passage-dev-subset run.msmarco-passage.txt
#####################
MRR @10: 0.18741227770955546
QueriesRanked: 6980
#####################
```
instead of

To confirm that dense retrieval is working correctly, you can run our TCT-ColBERT (v2) model on the MS MARCO passage ranking task:

```bash
$ pip install pyserini # do NOT use this
$ python -m pyserini.dsearch \
--topics msmarco-passage-dev-subset \
--index msmarco-passage-tct_colbert-v2-bf \
--encoded-queries tct_colbert-v2-msmarco-passage-dev-subset \
--batch-size 36 \
--threads 12 \
--output runs/run.msmarco-passage.tct_colbert-v2.bf.tsv \
--output-format msmarco

$ python -m pyserini.eval.msmarco_passage_eval msmarco-passage-dev-subset runs/run.msmarco-passage.tct_colbert-v2.bf.tsv
#####################
MRR @10: 0.3440
QueriesRanked: 6980
#####################
```

If everything is working properly, you should be able to reproduce the results above.
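For readers unfamiliar with the metric reported above, here is a small sketch of how MRR@10 is computed. This is not the code behind `pyserini.eval.msmarco_passage_eval`; the function `mrr_at_k` and the in-memory data layout are assumptions made for illustration, and averaging over the queries that have relevance judgments is likewise an assumption.

```python
def mrr_at_k(run, qrels, k=10):
    """Mean reciprocal rank at cutoff k.

    run:   dict mapping query id -> list of doc ids, best first
    qrels: dict mapping query id -> set of relevant doc ids
    """
    if not qrels:
        return 0.0
    total = 0.0
    for qid, relevant in qrels.items():
        ranked = run.get(qid, [])[:k]
        for rank, docid in enumerate(ranked, start=1):
            if docid in relevant:
                # Only the first relevant hit counts toward the score.
                total += 1.0 / rank
                break
    # Queries with no relevant hit in the top k contribute zero.
    return total / len(qrels)


if __name__ == "__main__":
    toy_run = {"q1": ["d3", "d1"], "q2": ["d9"]}
    toy_qrels = {"q1": {"d1"}, "q2": {"d2"}}
    print(mrr_at_k(toy_run, toy_qrels))
```

In the toy data, `q1` has its first relevant passage at rank 2 and `q2` has none, so the two queries contribute 1/2 and 0 respectively.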

Install Maven via conda:
## Development Installation

If you're planning on just _using_ Pyserini, then the `pip` instructions above are fine.
However, if you're planning on contributing to the codebase or want to work with the latest not-yet-released features, you'll need a development installation.

Start with creating a new `conda` environment:

```
$ conda create -n pyserini-dev python=3.8
$ conda activate pyserini-dev
```

In addition to JDK 11, you'll also need Maven.
If Maven isn't already installed, you can install with `conda` as follows:

```bash
$ conda install -c conda-forge maven
```

Clone Anserini repo and build:
Clone the Pyserini repo with the `--recurse-submodules` option to make sure the `tools/` submodule also gets cloned:

```bash
$ git clone git@github.com:castorini/pyserini.git --recurse-submodules
```
The `tools/` directory, which contains evaluation tools and scripts, is actually [this repo](https://github.com/castorini/anserini-tools), integrated as a [Git submodule](https://git-scm.com/book/en/v2/Git-Tools-Submodules) (so that it can be shared across related projects).
Change into the `pyserini` subdirectory and build as follows (you might get warnings, but okay to ignore):

```bash
$ cd pyserini
$ cd tools/eval && tar xvfz trec_eval.9.0.4.tar.gz && cd trec_eval.9.0.4 && make && cd ../../..
$ cd tools/eval/ndeval && make && cd ../../..
```

Use `pip` to "install" the checked out code in "editable" mode:

```bash
$ cd ..
$ git clone https://github.com/castorini/anserini.git
$ cd anserini
$ mvn clean package appassembler:assemble -Dmaven.test.skip=true
$ pip install -e .
```

Copy the fatjar to `pyserini/pyserini/resources/jars`.
You'll still need to install the other packages separately:

```bash
$ pip install torch==1.8.1 torchvision==0.9.1 torchaudio===0.8.1 -f https://download.pytorch.org/whl/torch_stable.html
$ conda install faiss-cpu -c pytorch
```

Next, you'll need to clone and build [Anserini](http://anserini.io/).
It makes sense to put both `pyserini/` and `anserini/` in a common folder.
After you've successfully built Anserini, copy the fatjar, which will be `target/anserini-X.Y.Z-SNAPSHOT-fatjar.jar` into `pyserini/resources/jars/`.
As with the `pip` installation, a potential source of frustration is incompatibility among different versions of underlying dependencies.
For these and other issues, we provide additional detailed installation instructions [here](./docs/installation.md).
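The fatjar copy step described above could be scripted roughly as follows. This is only a sketch: `copy_fatjar` is a hypothetical helper, and it assumes that `anserini/` and `pyserini/` are sibling checkouts and that the destination is the `pyserini/resources/jars/` directory inside the Pyserini checkout.

```python
import shutil
from pathlib import Path


def copy_fatjar(anserini_dir, pyserini_dir):
    """Copy the built Anserini fatjar into the Pyserini checkout; return the new path."""
    target_dir = Path(anserini_dir) / "target"
    jars = list(target_dir.glob("anserini-*-fatjar.jar"))
    if not jars:
        raise FileNotFoundError(
            f"No fatjar found in {target_dir}; has Anserini been built?"
        )
    # If several builds are present, take the most recently modified one.
    newest = max(jars, key=lambda p: p.stat().st_mtime)
    # Assumption: jars live under <checkout>/pyserini/resources/jars/.
    dest_dir = Path(pyserini_dir) / "pyserini" / "resources" / "jars"
    dest_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(newest, dest_dir))
```

From the common parent folder, a call such as `copy_fatjar("anserini", "pyserini")` would perform the copy, assuming the directory layout matches the one sketched here.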

You can confirm everything is working by running the unit tests:

```bash
python -m unittest
```

Assuming all tests pass, you should be ready to go!

## Troubleshooting tips

2 changes: 1 addition & 1 deletion docs/pypi-reproduction.md
@@ -5,7 +5,7 @@ The following results can be reproduced with v0.10.1.0 or anything later, includ

## Robust04

BM25 baseline from the [TREC 2004 Robust Track](https://github.com/castorini/anserini/blob/master/docs/regressions-robust04.md) on TREC Disks 4 & 5:
BM25 baseline from the [TREC 2004 Robust Track](https://github.com/castorini/anserini/blob/master/docs/regressions-disk45.md) on TREC Disks 4 & 5:

```bash
$ python -m pyserini.search --topics robust04 --index robust04 --output run.robust04.txt --bm25
