Skip to content

Commit

Permalink
Fix HNSW stats (facebookresearch#3309)
Browse files Browse the repository at this point in the history
Summary:
Pull Request resolved: facebookresearch#3309

Make sure that the HNSW search stats work, remove stats for deprecated functionality.
Remove code of the link and code paper that is not supported anymore.

Reviewed By: kuarora, junjieqi

Differential Revision: D55247802

fbshipit-source-id: 03f176be092bff6b2db359cc956905d8646ea702
mdouze authored and facebook-github-bot committed Mar 22, 2024
1 parent b77061f commit fa1f39e
Showing 9 changed files with 27 additions and 947 deletions.
137 changes: 2 additions & 135 deletions benchs/link_and_code/README.md
Original file line number Diff line number Diff line change
@@ -21,138 +21,5 @@ graph to improve the reconstruction. It is described in

ArXiV [here](https://arxiv.org/abs/1804.09996)

Code structure
--------------

The test runs with 3 files:

- `bench_link_and_code.py`: driver script

- `datasets.py`: code to load the datasets. The example code runs on the
deep1b and bigann datasets. See the [toplevel README](../README.md)
on how to download them. They should be put in a directory, edit
datasets.py to set the path.

- `neighbor_codec.py`: this is where the representation is trained.

The code runs on top of Faiss. The HNSW index can be extended with a
`ReconstructFromNeighbors` C++ object that refines the distances. The
training is implemented in Python.

Update: 2023-12-28: the current Faiss dropped support for reconstruction with
this method.

Reproducing Table 2 in the paper
--------------------------------

The results of table 2 (accuracy on deep100M) in the paper can be
obtained with:

```bash
python bench_link_and_code.py \
--db deep100M \
--M0 6 \
--indexkey OPQ36_144,HNSW32_PQ36 \
--indexfile $bdir/deep100M_PQ36_L6.index \
--beta_nsq 4 \
--beta_centroids $bdir/deep100M_PQ36_L6_nsq4.npy \
--neigh_recons_codes $bdir/deep100M_PQ36_L6_nsq4_codes.npy \
--k_reorder 0,5 --efSearch 1,1024
```

Set `bdir` to a scratch directory.

Explanation of the flags:

- `--db deep1M`: dataset to process

- `--M0 6`: number of links on the base level (L6)

- `--indexkey OPQ36_144,HNSW32_PQ36`: Faiss index key to construct the
HNSW structure. It means that vectors are transformed by OPQ and
encoded with PQ 36x8 (with an intermediate size of 144D). The HNSW
level>0 nodes have 32 links (theses ones are "cheap" to store
because there are fewer nodes in the upper levels.

- `--indexfile $bdir/deep1M_PQ36_M6.index`: name of the index file
(without information for the L&C extension)

- `--beta_nsq 4`: number of bytes to allocate for the codes (M in the
paper)

- `--beta_centroids $bdir/deep1M_PQ36_M6_nsq4.npy`: filename to store
the trained beta centroids

- `--neigh_recons_codes $bdir/deep1M_PQ36_M6_nsq4_codes.npy`: filename
for the encoded weights (beta) of the combination

- `--k_reorder 0,5`: number of results to reorder. 0 = baseline
without reordering, 5 = value used throughout the paper

- `--efSearch 1,1024`: number of nodes to visit (T in the paper)

The script will proceed with the following steps:

0. load dataset (and possibly compute the ground-truth if the
ground-truth file is not provided)

1. train the OPQ encoder

2. build the index and store it

3. compute the residuals and train the beta vocabulary to do the reconstruction

4. encode the vertices

5. search and evaluate the search results.

With option `--exhaustive` the results of the exhaustive column can be
obtained.

The run above should output:
```bash
...
setting k_reorder=5
...
efSearch=1024 0.3132 ms per query, R@1: 0.4283 R@10: 0.6337 R@100: 0.6520 ndis 40941919 nreorder 50000

```
which matches the paper's table 2.

Note that in multi-threaded mode, the building of the HNSW structure
is not deterministic. Therefore, the results across runs may not be exactly the same.

Reproducing Figure 5 in the paper
---------------------------------

Figure 5 just evaluates the combination of HNSW and PQ. For example,
the operating point L6&OPQ40 can be obtained with

```bash
python bench_link_and_code.py \
--db deep1M \
--M0 6 \
--indexkey OPQ40_160,HNSW32_PQ40 \
--indexfile $bdir/deep1M_PQ40_M6.index \
--beta_nsq 1 --beta_k 1 \
--beta_centroids $bdir/deep1M_PQ40_M6_nsq0.npy \
--neigh_recons_codes $bdir/deep1M_PQ36_M6_nsq0_codes.npy \
--k_reorder 0 --efSearch 16,64,256,1024
```

The arguments are similar to the previous table. Note that nsq = 0 is
simulated by setting beta_nsq = 1 and beta_k = 1 (ie a code with a single
reproduction value).

The output should look like:

```bash
setting k_reorder=0
efSearch=16 0.0147 ms per query, R@1: 0.3409 R@10: 0.4388 R@100: 0.4394 ndis 2629735 nreorder 0
efSearch=64 0.0122 ms per query, R@1: 0.4836 R@10: 0.6490 R@100: 0.6509 ndis 4623221 nreorder 0
efSearch=256 0.0344 ms per query, R@1: 0.5730 R@10: 0.7915 R@100: 0.7951 ndis 11090176 nreorder 0
efSearch=1024 0.2656 ms per query, R@1: 0.6212 R@10: 0.8722 R@100: 0.8765 ndis 33501951 nreorder 0
```

The results with k_reorder=5 are not reported in the paper, they
represent the performance of a "free coding" version of the algorithm.
The necessary code for this paper was removed from Faiss in version 1.8.0.
For a functioning verinsion, use Faiss 1.7.4.
Loading
Oops, something went wrong.

0 comments on commit fa1f39e

Please sign in to comment.