Add tutorials for datatools (asreview#20)
gimoAI authored Nov 16, 2022
1 parent b40f26a commit c505188
Showing 2 changed files with 249 additions and 0 deletions.
4 changes: 4 additions & 0 deletions README.md
@@ -267,6 +267,10 @@ In case any duplicate ambiguously labeled records exist, either within a dataset

If there are conflicting/contradictory labels, the user is warned, records with inconsistent labels are shown, and the script is aborted.

### Tutorials

Several [tutorials](Tutorials.md) are available that show how `compose` can be used in different scenarios.

## License

This extension is published under the [MIT license](/LICENSE).
245 changes: 245 additions & 0 deletions Tutorials.md
@@ -0,0 +1,245 @@
# Tutorials

---
Below are several tutorials that illustrate how to use `datatools`. Make
sure you have installed
[asreview-datatools](https://github.com/asreview/asreview-datatools) and
[ASReview LAB](https://asreview.nl/download/) v1.1 or higher.

Overview of the tutorials:
1. [Update systematic review](#1-update-systematic-review)
2. [Add prior knowledge](#2-add-prior-knowledge)
3. [Prepare a dataset for a simulation study in ASReview](#prepare-a-dataset-for-a-simulation-study-in-asreview)


Allowed data formats are described in the [ASReview
documentation](https://asreview.readthedocs.io/en/latest/data_format.html).

---

## 1. Update Systematic Review

Assume you are working on a systematic review and you want to update the
review with newly available records. The original data is stored in
`MY_LABELED_DATASET.csv`, which has a
[column](https://asreview.readthedocs.io/en/latest/data_labeled.html#label-format)
containing the labeling decisions. To update the systematic review, you run
the original search query again with a new date range and save the newly
found records in `SEARCH_UPDATE.ris`.


In the command line interface (CLI), navigate to the directory where the dataset(s) are stored:
```bash
cd Parent_directory
```

### Preparing your data

The original data and the newly found records are stored in different file
formats. You can convert a file to another format using the `convert`
script. For example, to convert `SEARCH_UPDATE.ris` to CSV format, run the
following command from the directory where the dataset(s) are stored:

```bash
asreview data convert SEARCH_UPDATE.ris SEARCH_UPDATE.csv
```
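
If a search update produces several RIS files, the same `convert` call can be repeated in a loop. The following is a minimal sketch, assuming all files sit in one directory and use the `.ris` extension. The function name `convert_all_ris` and the `DRY_RUN` variable are hypothetical helpers introduced for illustration and are not part of `datatools`. With `DRY_RUN=1` the function only prints the commands it would run.

```bash
# Hypothetical helper (not part of asreview-datatools): convert every
# .ris file in a directory to CSV with `asreview data convert`.
# Set DRY_RUN=1 to print the commands instead of running them.
convert_all_ris() {
    local dir="${1:-.}"
    local ris csv
    for ris in "$dir"/*.ris; do
        # Skip the unexpanded pattern when no .ris file is present.
        [ -e "$ris" ] || continue
        csv="${ris%.ris}.csv"
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "asreview data convert $ris $csv"
        else
            asreview data convert "$ris" "$csv"
        fi
    done
}

# Example: preview the conversions for the current directory.
# DRY_RUN=1 convert_all_ris .
```

Previewing the commands first makes it easy to check the output file names before any file is written.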

Duplicate records can be removed with the `dedup` script. The algorithm
removes duplicates using the Digital Object Identifier
([DOI](https://www.doi.org/)) and the title plus abstract.

```bash
asreview data dedup SEARCH_UPDATE.csv -o SEARCH_UPDATE_DEDUP.csv
```

### Describe input

If you want to see descriptive info on your input datasets, run these commands:

```bash
asreview data describe MY_LABELED_DATASET.csv -o MY_LABELED_DATASET_description.json
asreview data describe SEARCH_UPDATE_DEDUP.csv -o SEARCH_UPDATE_description.json
```
The results will be exported to `MY_LABELED_DATASET_description.json` and `SEARCH_UPDATE_description.json`.

### Compose datasets

Use the `compose` script to add `SEARCH_UPDATE_DEDUP.csv` to `MY_LABELED_DATASET.csv`:

```bash
asreview data compose updated_search.csv -l MY_LABELED_DATASET.csv -u SEARCH_UPDATE_DEDUP.csv
```
The flag `-l` means the labels in `MY_LABELED_DATASET.csv` will be kept.

The flag `-u` means all records from `SEARCH_UPDATE_DEDUP.csv` will be
added as unlabeled to the composed dataset.

If a record exists in both datasets, the record containing a label is kept
by default; see the default [conflict resolving
strategy](https://github.com/asreview/asreview-datatools#resolving-conflicting-labels).
To keep both records (with and without label), use

```bash
asreview data compose updated_search.csv -l MY_LABELED_DATASET.csv -u SEARCH_UPDATE_DEDUP.csv -c keep
```

The composed dataset will be exported to `updated_search.csv`.

### Describe output

To see descriptive info on the composed dataset:

```bash
asreview data describe updated_search.csv -o updated_search_description.json
```
The result will be exported to `updated_search_description.json`.
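
The steps of this tutorial (convert, dedup, compose, describe) can also be collected in one small script. The sketch below is one possible arrangement, not an official workflow. The `run` wrapper, the `update_review` function and the `DRY_RUN` variable are hypothetical names, and the intermediate file names are derived from the input name in the same way as in the commands above. Each step stops the function if the previous command fails.

```bash
# Hypothetical helpers (not part of asreview-datatools).

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Chain the tutorial steps; abort as soon as one of them fails.
update_review() {
    local labeled="$1"   # e.g. MY_LABELED_DATASET.csv
    local update="$2"    # e.g. SEARCH_UPDATE.ris
    local output="$3"    # e.g. updated_search.csv
    local base="${update%.ris}"

    run asreview data convert "$update" "$base.csv" || return 1
    run asreview data dedup "$base.csv" -o "${base}_DEDUP.csv" || return 1
    run asreview data compose "$output" -l "$labeled" -u "${base}_DEDUP.csv" || return 1
    run asreview data describe "$output" -o "${output%.csv}_description.json" || return 1
}

# Example: preview the commands without running them.
# DRY_RUN=1 update_review MY_LABELED_DATASET.csv SEARCH_UPDATE.ris updated_search.csv
```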

### Continue screening in ASReview lab

The [partly
labelled](https://asreview.readthedocs.io/en/latest/data_labeled.html#partially-labeled-data)
data, `updated_search.csv`, can be uploaded to [ASReview LAB - Oracle
mode](https://asreview.readthedocs.io/en/latest/project_create.html). The
labels will be recognized by ASReview and used to train the first iteration
of the model, so you can continue screening all unlabeled records found in the
new search.

---
## 2. Add prior knowledge

Assume you have just executed a search query for a systematic review and you
want to use a pre-defined set of relevant and irrelevant records as training
data. The search results are stored in `SEARCH_RESULTS.ris`, and the records
you already know to be relevant/irrelevant are saved in
`PRIOR_RELEVANT.ris` and `PRIOR_IRRELEVANT.ris` respectively.


In the command line interface (CLI), navigate to the directory where the dataset(s) are stored:
```bash
cd Parent_directory
```
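
Before running any `datatools` command, it can help to confirm that all three input files are present in the current directory. The sketch below is a small, hypothetical helper (`require_files` is not part of `datatools`) that reports each missing file and returns a non-zero status if any is absent.

```bash
# Hypothetical helper (not part of asreview-datatools): report every
# missing input file and return non-zero if any is absent.
require_files() {
    local missing=0 f
    for f in "$@"; do
        if [ ! -f "$f" ]; then
            echo "missing: $f" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example:
# require_files SEARCH_RESULTS.ris PRIOR_RELEVANT.ris PRIOR_IRRELEVANT.ris
```
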
### Describe input
If you want to see descriptive info on your input datasets, run these commands:
```bash
asreview data describe SEARCH_RESULTS.ris -o SEARCH_RESULTS_description.json
asreview data describe PRIOR_RELEVANT.ris -o PRIOR_RELEVANT_description.json
asreview data describe PRIOR_IRRELEVANT.ris -o PRIOR_IRRELEVANT_description.json
```

The results will be exported to `SEARCH_RESULTS_description.json`,
`PRIOR_RELEVANT_description.json` and `PRIOR_IRRELEVANT_description.json`.


### Compose datasets
To create one dataset with labels only for the training data to be used in ASReview, run:

```bash
asreview data compose search_with_priors.ris -u SEARCH_RESULTS.ris -r PRIOR_RELEVANT.ris -i PRIOR_IRRELEVANT.ris
```

The flag `-r` means all records from `PRIOR_RELEVANT.ris` will be added as
relevant records to the composed dataset.

The flag `-i` means all records from `PRIOR_IRRELEVANT.ris` will be added
as irrelevant.

The flag `-u` means all other records from `SEARCH_RESULTS.ris` will be
added as unlabeled.

If any duplicate records exist across the datasets, by default the order of
keeping labels is:
1. relevant
2. irrelevant
3. unlabeled
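
The default order above can be read as a simple priority rule. The following toy function illustrates that rule only; it is not how `datatools` is implemented, and `resolve_label` is a hypothetical name. Given the labels found for one duplicated record, it prints the label that would win under the default order.

```bash
# Toy illustration of the default label order; this is not the
# datatools implementation. `resolve_label` is a hypothetical helper.
resolve_label() {
    local result="unlabeled" label
    for label in "$@"; do
        case "$label" in
            relevant)
                # Highest priority: nothing can override it.
                echo "relevant"
                return 0
                ;;
            irrelevant)
                # Beats unlabeled, but a later "relevant" still wins.
                result="irrelevant"
                ;;
        esac
    done
    echo "$result"
}

# A record found both in the search results and in the relevant priors:
# resolve_label unlabeled relevant
```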

You can configure how conflicting labels are resolved, as explained in the
[README](README.md#Resolving-conflicting-labels).


The composed dataset will be exported to `search_with_priors.ris`.

### Describe output
To see descriptive info on the composed dataset:

```bash
asreview data describe search_with_priors.ris -o search_with_priors_description.json
```

The result will be exported to `search_with_priors_description.json` in the
output folder.


### Start screening in ASReview lab

The [partly
labelled](https://asreview.readthedocs.io/en/latest/data_labeled.html#partially-labeled-data)
data, `search_with_priors.ris`, can be uploaded to [ASReview LAB - Oracle
mode](https://asreview.readthedocs.io/en/latest/project_create.html). The
labels will be recognized by ASReview and used to train the first iteration
of the model, after which you can start screening the unlabeled records from
the search.

---
## Prepare a dataset for a simulation study in ASReview

Assume you want to use the [simulation
mode](https://asreview.readthedocs.io/en/latest/simulation_overview.html) of
ASReview but the data is not stored in one single file containing the metadata
and labeling decisions, as required by ASReview.

Suppose the following files are available:

- `SCREENED.ris`: all records that were screened
- `RELEVANT.ris`: the subset of relevant records after manually screening all the records.

You need to compose the files into a single file where all records from
`RELEVANT.ris` are relevant and all other records are irrelevant.

In the command line interface (CLI), navigate to the directory where the dataset(s) are stored:
```bash
cd Parent_directory
```

### Describe input

If you want to see descriptive info on your input datasets, run these commands:

```bash
asreview data describe SCREENED.ris -o SCREENED_description.json
asreview data describe RELEVANT.ris -o RELEVANT_description.json
```
The results will be exported to `SCREENED_description.json` and `RELEVANT_description.json`.



### Compose datasets

Use the `compose` script to compose a new dataset from `SCREENED.ris` and `RELEVANT.ris`:

```bash
asreview data compose screened_with_labels.ris -i SCREENED.ris -r RELEVANT.ris
```

The flag `-r` means all records from `RELEVANT.ris` will be added as
relevant to the composed dataset.

The flag `-i` means all other records from `SCREENED.ris` will be added as
irrelevant.

The composed dataset will be exported to `screened_with_labels.ris`.
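
A quick sanity check is to estimate how many records should end up labeled irrelevant. The sketch below counts records by relying on the RIS convention that each record ends with an `ER  -` line, and it assumes that every record in `RELEVANT.ris` also occurs in `SCREENED.ris` and that neither file contains duplicates. Both function names are hypothetical helpers, not part of `datatools`.

```bash
# Hypothetical helpers (not part of asreview-datatools). They rely on the
# RIS convention that every record ends with an "ER  -" line.
count_ris_records() {
    # grep -c prints 0 but exits non-zero when nothing matches, so the
    # exit status is ignored here.
    grep -c '^ER  -' "$1" || true
}

# Assumes the relevant records are a subset of the screened records.
expected_irrelevant() {
    local screened relevant
    screened="$(count_ris_records "$1")"
    relevant="$(count_ris_records "$2")"
    echo $((screened - relevant))
}

# Example:
# expected_irrelevant SCREENED.ris RELEVANT.ris
```

The printed number can be compared with the label counts reported for the composed dataset; a mismatch suggests duplicates or relevant records that are missing from the screened set.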

### Describe output

To see descriptive info on the composed dataset:

```bash
asreview data describe screened_with_labels.ris -o screened_with_labels_description.json
```
The result will be exported to `screened_with_labels_description.json`.

### Run simulation in ASReview lab

The resulting file `screened_with_labels.ris` can be uploaded to [ASReview lab Simulation mode](https://asreview.readthedocs.io/en/latest/simulation_webapp.html). This
allows you to simulate the screening procedure of the systematic review as if
it were carried out using ASReview lab.
