ASReview Datatools

ASReview Datatools is an extension to ASReview LAB that can be used to:

  • Describe basic properties of a dataset (e.g., number of papers, number of inclusions, the amount of missing data and duplicates)
  • Convert file formats via the command line
  • Deduplicate data based on properties of the data
  • Stack multiple datasets on top of each other to create a single dataset
  • Compose a single (labeled, partly labeled, or unlabeled) dataset from multiple datasets.

ASReview Datatools is available for ASReview LAB version 1 or later. If you are using ASReview LAB version 0.x, use ASReview-statistics instead.

Installation

ASReview Datatools requires Python 3.7+ and ASReview LAB version 1.1 or later.

The easiest way to install the extension is to install from PyPI:

pip install asreview-datatools

After installation of the datatools extension, asreview should automatically detect it. Test this with the following command:

asreview --help

The extension is successfully installed if it lists asreview data.

Getting started

ASReview Datatools is a command line tool that extends ASReview LAB. Each subsection below describes one of the tools. The structure is

asreview data NAME_OF_TOOL

where NAME_OF_TOOL is the name of one of the tools below (describe, convert, dedup, vstack, or compose), followed by positional and optional arguments.

Each tool has its own help description which is available with

asreview data NAME_OF_TOOL -h

Tools

Data Describe

Describe the content of a dataset

asreview data describe MY_DATASET.csv

Export the results to a file (output.json)

asreview data describe MY_DATASET.csv -o output.json

Describe the van_de_schoot_2017 dataset from the benchmark platform.

asreview data describe benchmark:van_de_schoot_2017 -o output.json

The resulting statistics:

{
  "asreviewVersion": "1.0",
  "apiVersion": "1.0",
  "data": {
    "items": [
      {
        "id": "n_records",
        "title": "Number of records",
        "description": "The number of records in the dataset.",
        "value": 6189
      },
      {
        "id": "n_relevant",
        "title": "Number of relevant records",
        "description": "The number of relevant records in the dataset.",
        "value": 43
      },
      {
        "id": "n_irrelevant",
        "title": "Number of irrelevant records",
        "description": "The number of irrelevant records in the dataset.",
        "value": 6146
      },
      {
        "id": "n_unlabeled",
        "title": "Number of unlabeled records",
        "description": "The number of unlabeled records in the dataset.",
        "value": 0
      },
      {
        "id": "n_missing_title",
        "title": "Number of records with missing title",
        "description": "The number of records in the dataset with missing title.",
        "value": 5
      },
      {
        "id": "n_missing_abstract",
        "title": "Number of records with missing abstract",
        "description": "The number of records in the dataset with missing abstract.",
        "value": 764
      },
      {
        "id": "n_duplicates",
        "title": "Number of duplicate records (basic algorithm)",
        "description": "The number of duplicate records in the dataset based on similar text.",
        "value": 104
      }
    ]
  }
}
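
The JSON report is easy to post-process. As a minimal sketch, the snippet below flattens the data.items list into a mapping from statistic id to value, using only the Python standard library. It parses a trimmed copy of the output shown above; the helper name load_stats is hypothetical and not part of ASReview Datatools.

```python
import json

# Trimmed copy of the describe output shown above (descriptions omitted).
EXAMPLE_OUTPUT = """
{
  "asreviewVersion": "1.0",
  "apiVersion": "1.0",
  "data": {
    "items": [
      {"id": "n_records", "title": "Number of records", "value": 6189},
      {"id": "n_relevant", "title": "Number of relevant records", "value": 43},
      {"id": "n_irrelevant", "title": "Number of irrelevant records", "value": 6146},
      {"id": "n_unlabeled", "title": "Number of unlabeled records", "value": 0}
    ]
  }
}
"""


def load_stats(json_text):
    """Return a dict mapping each statistic id to its value (hypothetical helper)."""
    report = json.loads(json_text)
    return {item["id"]: item["value"] for item in report["data"]["items"]}


stats = load_stats(EXAMPLE_OUTPUT)

# Sanity check: every record is relevant, irrelevant, or unlabeled.
labeled_total = stats["n_relevant"] + stats["n_irrelevant"] + stats["n_unlabeled"]
print(stats["n_records"], labeled_total)
```

To read a real export, pass the contents of output.json to load_stats instead of the embedded string.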

Data Convert

Convert the format of a dataset. For example, convert a RIS dataset into a CSV, Excel, or TAB dataset.

asreview data convert MY_DATASET.ris MY_OUTPUT.csv

Data Dedup

Remove duplicate records with a simple deduplication algorithm (see source code). The algorithm first removes all duplicates based on a persistent identifier (PID), which is doi by default. It then concatenates the title and abstract of each record, strips all non-alphanumeric tokens, and removes the records whose resulting text duplicates that of an earlier record.

asreview data dedup MY_DATASET.ris

Export the deduplicated dataset to a file (output.csv)

asreview data dedup MY_DATASET.ris -o output.csv

By default, the PID is set to 'doi'. The dedup function offers the option to use a different PID. For example, if a dataset contains PubMed identifiers (PMID), that identifier can be used for deduplication:

asreview data dedup MY_DATASET.csv -o output.csv --pid PMID
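
The two-step procedure described at the start of this section (PID first, then normalized title and abstract) can be sketched in plain Python. This is an illustration under stated assumptions, not the extension's implementation: records are plain dicts, the function names are hypothetical, and lowercasing as well as never treating empty values as duplicates are choices made for this sketch.

```python
import re


def normalized_text(record):
    """Concatenate title and abstract and keep only letters and digits."""
    text = (record.get("title") or "") + " " + (record.get("abstract") or "")
    # Lowercasing is an assumption of this sketch.
    return re.sub(r"[^a-z0-9]", "", text.lower())


def drop_duplicates(records, key):
    """Keep the first record for every non-empty key value."""
    seen = set()
    kept = []
    for record in records:
        value = key(record)
        if value and value in seen:
            continue  # duplicate of an earlier record
        if value:
            seen.add(value)
        kept.append(record)
    return kept


def dedup(records, pid="doi"):
    """Step 1: deduplicate on the PID. Step 2: deduplicate on normalized text."""
    by_pid = drop_duplicates(records, lambda r: (r.get(pid) or "").strip().lower())
    return drop_duplicates(by_pid, normalized_text)


records = [
    {"doi": "10.1/a", "title": "A Study", "abstract": "Text."},
    {"doi": "10.1/A", "title": "Same DOI, other title", "abstract": ""},
    {"doi": None, "title": "A study!", "abstract": "text"},
    {"doi": "10.1/b", "title": "Other", "abstract": ""},
]
unique = dedup(records)
print([r["title"] for r in unique])
```

Passing pid="PMID" to this sketch mirrors the --pid PMID option shown above.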

Using the van_de_schoot_2017 dataset from the benchmark platform.

asreview data dedup benchmark:van_de_schoot_2017 -o van_de_schoot_2017_dedup.csv

Data Vstack (Experimental)

Vertical stacking: combine as many datasets as you want into a single dataset.

❗ Vstack is an experimental feature. We would love to hear your feedback. Please keep in mind that this feature can change in the future.

Your datasets can be in any ASReview-compatible data format. All input files should be in the same format, and the output path should use that format as well.

Stack several datasets on top of each other:

asreview data vstack output.csv MY_DATASET_1.csv MY_DATASET_2.csv MY_DATASET_3.csv

Here, three datasets are stacked into a single dataset, output.csv. The output path can be followed by any number of datasets to be stacked.

Note

Vstack does not do any deduplication. For deduplication you might want to use the deduplication tool. If you wish to create a single (labeled, partly labeled, or unlabeled) dataset from multiple datasets containing labeling decisions while having control over duplicates and labels, use compose instead.
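
Conceptually, vertical stacking appends the rows of each input to the rows of the previous one. The sketch below illustrates this with the standard csv module on in-memory strings. The function name vstack_csv is hypothetical, and taking the union of all columns when the inputs differ is an assumption of this sketch; how the tool itself handles differing columns is not described here.

```python
import csv
import io


def vstack_csv(*csv_texts):
    """Stack CSV documents on top of each other (hypothetical helper)."""
    fieldnames = []
    rows = []
    for text in csv_texts:
        reader = csv.DictReader(io.StringIO(text))
        # Collect column names in order of first appearance (assumption: union).
        for name in reader.fieldnames or []:
            if name not in fieldnames:
                fieldnames.append(name)
        rows.extend(reader)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


first = "title,label\nA,1\n"
second = "title,abstract\nA,x\n"
stacked = vstack_csv(first, second)
print(stacked)
```

Both rows titled A survive in the result, which matches the note above: stacking alone does not remove duplicates.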

Data Compose (Experimental)

Compose assembles datasets with different labels (or no labels) into a single dataset.

❗ Compose is an experimental feature. We would love to hear your feedback. Please keep in mind that this feature can change in the future.

Data format

Your data files need to be in tabular or RIS file format. The output file and all input files should be in the same format.

  • Tabular file format: Supported tabular file formats are .csv, .tab, .tsv or .xlsx. Ensure the column names adhere to the predetermined set of accepted column names.

  • RIS file format: A RIS file has .ris or .txt as an extension. Read how to format your RIS files. ASReview converts the labeling decisions in RIS files to a binary variable: irrelevant as 0 and relevant as 1.

Records marked as unseen or with missing labeling decisions are converted to -1 by ASReview.
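
The label encoding described above can be summarized in a few lines. This is only a sketch of the mapping: the function name encode_label and the plain strings "relevant" and "irrelevant" are assumptions for illustration, and the way ASReview stores decisions inside a RIS file is not shown here.

```python
def encode_label(decision):
    """Map a labeling decision to 1 (relevant), 0 (irrelevant), or -1 (otherwise)."""
    if decision is None:
        return -1  # missing labeling decision
    mapping = {"relevant": 1, "irrelevant": 0}
    # Anything else, such as "unseen", is treated as unlabeled.
    return mapping.get(decision.strip().lower(), -1)


print([encode_label(d) for d in ["relevant", "irrelevant", "unseen", None]])
```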

Run script

Assume you have records in MY_DATASET_1.ris from which you want to keep all existing labels and records in MY_DATASET_2.ris which you want to keep unlabeled. Both datasets can be composed into a single dataset using:

asreview data compose composed_output.ris -l MY_DATASET_1.ris -u MY_DATASET_2.ris

The resulting dataset is exported to composed_output.ris.

The output path (composed_output.ris in the example) should always be specified. Optional arguments are available for:

  • Input files
  • Persistent identifier (PID) used for deduplication
  • Resolving conflicting labels

Input files

The table below lists the possible input files and their corresponding properties. Use at least one of the following arguments:

| Arguments | Action |
|---|---|
| --relevant, -r | Label all records from this dataset as relevant in the composed dataset. |
| --irrelevant, -i | Label all records from this dataset as irrelevant in the composed dataset. |
| --labeled, -l | Use existing labels from this dataset in the composed dataset. |
| --unlabeled, -u | Remove all labels from this dataset in the composed dataset. |

Persistent identifier

Duplicate checking is based on title/abstract and a persistent identifier (PID) like the digital object identifier (DOI). By default, doi is used as PID. It is possible to use the flag --pid to specify a persistent identifier other than doi.

Resolving conflicting labels

Each record is marked as relevant, irrelevant, or unlabeled. In case of a duplicate record, it may be labeled ambiguously (e.g., one record with two different labels). --hierarchy is used to specify a hierarchy of labels. Pass the letters r (relevant), i (irrelevant), and u (unlabeled) in any order to set label hierarchy. By default, the order is riu which means that:

  • Relevant labels are prioritized over irrelevant and unlabeled.
  • Irrelevant labels are prioritized over unlabeled ones.

If compose runs into conflicting labels, the user is warned, and the conflicting records are shown. To specify what happens in case of conflicts, use the --conflict_resolve/-c flag. It is set to keep_one by default. The options are:

| Resolve method | Action in case of conflict |
|---|---|
| keep_one | Keep one label, using --hierarchy to determine which label to keep |
| keep_all | Keep conflicting records as duplicates in the composed dataset (ignoring --hierarchy) |
| abort | Abort |
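
The interplay between the label hierarchy and the conflict resolution methods can be sketched as follows. This is an illustration, not the extension's code: the function name resolve_labels is hypothetical, labels are written as the letters r, i, and u, and the sketch assumes that a conflict means one record carrying both a relevant and an irrelevant label. The tool's exact definition of a conflict may differ.

```python
def resolve_labels(labels, hierarchy="riu", conflict_resolve="keep_one"):
    """Decide which label(s) a group of duplicate records ends up with."""
    if conflict_resolve not in ("keep_one", "keep_all", "abort"):
        raise ValueError("unknown resolve method: " + conflict_resolve)
    present = set(labels)
    # Assumption of this sketch: a conflict is relevant and irrelevant together.
    conflict = {"r", "i"} <= present
    if conflict and conflict_resolve == "abort":
        raise ValueError("conflicting labels found: " + ", ".join(labels))
    if conflict and conflict_resolve == "keep_all":
        return list(labels)  # keep the records as duplicates, hierarchy ignored
    # keep_one, or no conflict: the label that comes first in the hierarchy wins.
    return [min(present, key=hierarchy.index)]


print(resolve_labels(["u", "r"]))                    # default hierarchy riu
print(resolve_labels(["i", "r"], hierarchy="uir"))   # irrelevant before relevant
print(resolve_labels(["i", "r"], conflict_resolve="keep_all"))
```

With conflict_resolve="abort", the same conflicting input raises an error instead of returning a label.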

Example

asreview data compose composed_output.ris -l MY_DATASET_1.ris -u MY_DATASET_2.ris --hierarchy uir -c abort

The command above composes a dataset from MY_DATASET_1.ris and MY_DATASET_2.ris. The labels from MY_DATASET_1.ris are kept, and all records from MY_DATASET_2.ris are marked as unlabeled. If any duplicate records with ambiguous labels exist, either within a dataset or across the datasets:

  • Unlabeled is prioritized over irrelevant and relevant labels.
  • Irrelevant labels are prioritized over relevant labels.

If there are conflicting/contradictory labels, the user is warned, records with inconsistent labels are shown, and the script is aborted.

Tutorials

Several tutorials are available that show how compose can be used in different scenarios.

License

This extension is published under the MIT license.

Contact

This extension is part of the ASReview project (asreview.ai). It is maintained by the maintainers of ASReview LAB. See ASReview LAB for contact information and more resources.
