Add snowballing functionality #37

Merged on Mar 28, 2024 (20 commits)

PeterLombaers
Member

This pull request adds snowballing functionality to ASReview Datatools. Snowballing means finding incoming (forward) and outgoing (backward) citations of works in a dataset. This implementation takes an ASReview dataset as input, looks up the references in OpenAlex, and writes the results to a separate output file that ASReview can read again.
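To make the two directions concrete, here is a minimal sketch on a toy in-memory citation graph. The graph, the work IDs, and the function names are hypothetical and only illustrate the idea; the pull request itself queries OpenAlex rather than a local dictionary.

```python
from typing import Dict, List, Set

# Hypothetical toy citation graph: each work ID maps to the works it references.
REFERENCES: Dict[str, List[str]] = {
    "W1": ["W2", "W3"],
    "W2": ["W3"],
    "W3": [],
    "W4": ["W1"],
}


def backward_snowball(work_ids: List[str]) -> Set[str]:
    """Return works referenced by the given works (outgoing citations)."""
    found: Set[str] = set()
    for work_id in work_ids:
        found.update(REFERENCES.get(work_id, []))
    return found


def forward_snowball(work_ids: List[str]) -> Set[str]:
    """Return works that cite any of the given works (incoming citations)."""
    targets = set(work_ids)
    return {
        citing
        for citing, refs in REFERENCES.items()
        if targets.intersection(refs)
    }
```

In this toy graph, backward snowballing from W1 yields W2 and W3, while forward snowballing from W1 yields W4.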

@J535D165 I'd be happy to hear what you think of this pull request! I added type hinting to my code, but I'm happy to remove it to align it with the rest of the code. There is also no Ruff settings file in this repository yet, so I used the settings I found in asreview main.

@PeterLombaers PeterLombaers requested a review from J535D165 March 14, 2024 13:44
@MvanSteenbergen

The snowballing argument to asreview is inconsistent with the snowball argument found in the docs.

@MvanSteenbergen

Got the following errors for this datafile:

Backward:

asreview data snowballing asreview_result_spatial-and-temporal-patterning-of-emergency-reactive-police-demand.csv snowballed.csv --backward --all
Found OpenAlex identifiers for 1861 out of 3103 records. Performing snowballing for those records.
Starting backward snowballing
Found 0 records
Traceback (most recent call last):
  File "/home/mvansteenbergen/miniconda3/bin/asreview", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/mvansteenbergen/miniconda3/lib/python3.11/site-packages/asreview/__main__.py", line 43, in main
    entry.load()().execute(sys.argv[2:])
  File "/home/mvansteenbergen/miniconda3/lib/python3.11/site-packages/asreviewcontrib/datatools/entrypoint.py", line 103, in execute
    snowball(**args_snowballing)
  File "/home/mvansteenbergen/miniconda3/lib/python3.11/site-packages/asreviewcontrib/datatools/snowball.py", line 264, in snowball
    output_data = ASReviewData(output_data)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mvansteenbergen/miniconda3/lib/python3.11/site-packages/asreview/data/base.py", line 112, in __init__
    self.max_idx = max(df.index.values) + 1
                   ^^^^^^^^^^^^^^^^^^^^
ValueError: max() arg is an empty sequence

and for forward snowballing:

(base) mvansteenbergen@ideapad:~/asreview_irr/data$ asreview data snowballing asreview_result_spatial-and-temporal-patterning-of-emergency-reactive-police-demand.csv snowballed.csv --forward
Found OpenAlex identifiers for 1576 out of 2729 records. Performing snowballing for those records.
Starting forward snowballing
Traceback (most recent call last):
  File "/home/mvansteenbergen/miniconda3/bin/asreview", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/mvansteenbergen/miniconda3/lib/python3.11/site-packages/asreview/__main__.py", line 43, in main
    entry.load()().execute(sys.argv[2:])
  File "/home/mvansteenbergen/miniconda3/lib/python3.11/site-packages/asreviewcontrib/datatools/entrypoint.py", line 103, in execute
    snowball(**args_snowballing)
  File "/home/mvansteenbergen/miniconda3/lib/python3.11/site-packages/asreviewcontrib/datatools/snowball.py", line 264, in snowball
    output_data = ASReviewData(output_data)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mvansteenbergen/miniconda3/lib/python3.11/site-packages/asreview/data/base.py", line 112, in __init__
    self.max_idx = max(df.index.values) + 1
                   ^^^^^^^^^^^^^^^^^^^^
ValueError: max() arg is an empty sequence
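
Both tracebacks above end in the same place: max() is called on an empty index because no records were found. A minimal sketch of that failure mode and one possible guard follows. The helper name next_index is hypothetical and is not part of ASReview; it only illustrates handling the empty case before calling max().

```python
from typing import Sequence


def next_index(index_values: Sequence[int]) -> int:
    """Return one past the largest index, or 0 when there are no rows."""
    if len(index_values) == 0:
        # max() on an empty sequence raises ValueError, as in the
        # tracebacks above, so handle the empty result explicitly.
        return 0
    return max(index_values) + 1
```

An alternative design would be to stop early with a clear message when snowballing finds no records, instead of constructing an empty dataset at all.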

@PeterLombaers
Member Author

Thanks for testing! I fixed the first bug, and I'll have a look at what is going wrong for that dataset.

@MvanSteenbergen

MvanSteenbergen commented Mar 21, 2024

This also happens for me on dataset_1.ris and dataset_2.ris from the example data in asreview-datatools. I'll try to see if I can find some other files.

(base) mvansteenbergen@ideapad:~/asreview_test_files$ asreview data snowballing dataset_1.ris snowballed.csv --forward
Found OpenAlex identifiers for 2 out of 3 records. Performing snowballing for those records.
Starting forward snowballing
Traceback (most recent call last):
  File "/home/mvansteenbergen/miniconda3/bin/asreview", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/mvansteenbergen/miniconda3/lib/python3.11/site-packages/asreview/__main__.py", line 43, in main
    entry.load()().execute(sys.argv[2:])
  File "/home/mvansteenbergen/miniconda3/lib/python3.11/site-packages/asreviewcontrib/datatools/entrypoint.py", line 103, in execute
    snowball(**args_snowballing)
  File "/home/mvansteenbergen/miniconda3/lib/python3.11/site-packages/asreviewcontrib/datatools/snowball.py", line 264, in snowball
    output_data = ASReviewData(output_data)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mvansteenbergen/miniconda3/lib/python3.11/site-packages/asreview/data/base.py", line 112, in __init__
    self.max_idx = max(df.index.values) + 1
                   ^^^^^^^^^^^^^^^^^^^^
ValueError: max() arg is an empty sequence
(base) mvansteenbergen@ideapad:~/asreview_test_files$ asreview data snowballing dataset_2.ris snowballed.csv --forward
Found OpenAlex identifiers for 6 out of 8 records. Performing snowballing for those records.
Starting forward snowballing
Traceback (most recent call last):
  File "/home/mvansteenbergen/miniconda3/bin/asreview", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/mvansteenbergen/miniconda3/lib/python3.11/site-packages/asreview/__main__.py", line 43, in main
    entry.load()().execute(sys.argv[2:])
  File "/home/mvansteenbergen/miniconda3/lib/python3.11/site-packages/asreviewcontrib/datatools/entrypoint.py", line 103, in execute
    snowball(**args_snowballing)
  File "/home/mvansteenbergen/miniconda3/lib/python3.11/site-packages/asreviewcontrib/datatools/snowball.py", line 264, in snowball
    output_data = ASReviewData(output_data)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mvansteenbergen/miniconda3/lib/python3.11/site-packages/asreview/data/base.py", line 112, in __init__
    self.max_idx = max(df.index.values) + 1
                   ^^^^^^^^^^^^^^^^^^^^
ValueError: max() arg is an empty sequence

This is my version information:

(base) mvansteenbergen@ideapad:~/asreview_test_files$ asreview --help
usage: asreview [-h] [-V] [subcommand]

ASReview LAB - A tool for AI-assisted systematic reviews

positional arguments:
  subcommand     The subcommand to launch. Available commands:


                 [asreview 1.3.4] - ASReview LAB - A tool for AI-assisted systematic reviews
                        algorithms
                        auth-tool
                        lab
                        simulate
                        state-inspect

                 [asreview-datatools 0+untagged.83.g0c487f3] - Powerful command line tools for data handling in ASReview
                        data

@PeterLombaers
Member Author

Thanks! I think I found the bug and made a fix, but I still need to test it a bit.

@MvanSteenbergen

MvanSteenbergen commented Mar 21, 2024

Excellent! I just ran it and it works for me at least for the example datasets from ASReview itself.

For the dataset that was generated using an old version of ASReview, I'm getting this error. It seems to be because of column names:

(base) mvansteenbergen@ideapad:~/asreview_test_files$ asreview data snowball asreview_result_spatial-and-temporal-patterning-of-emergency-reactive-police-demand.csv snowballed.csv --forward
/home/mvansteenbergen/miniconda3/lib/python3.11/site-packages/asreview/datasets.py:749: UserWarning: The use of 'benchmark' datasets is deprecated, use SYNERGY dataset instead. For more information, see https://github.com/asreview/synergy-dataset.
  warnings.warn(
Traceback (most recent call last):
  File "/home/mvansteenbergen/miniconda3/bin/asreview", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/mvansteenbergen/miniconda3/lib/python3.11/site-packages/asreview/__main__.py", line 43, in main
    entry.load()().execute(sys.argv[2:])
  File "/home/mvansteenbergen/miniconda3/lib/python3.11/site-packages/asreviewcontrib/datatools/entrypoint.py", line 103, in execute
    snowball(**args_snowballing)
  File "/home/mvansteenbergen/miniconda3/lib/python3.11/site-packages/asreviewcontrib/datatools/snowball.py", line 211, in snowball
    data = load_data(input_path)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/mvansteenbergen/miniconda3/lib/python3.11/site-packages/asreview/data/base.py", line 63, in load_data
    raise FileNotFoundError(f"File, URL, or dataset does not exist: '{name}'")
FileNotFoundError: File, URL, or dataset does not exist: 'asreview_result_spatial-and-temporal-patterning-of-emergency-reactive-police-demand.csv'
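
Note that the traceback above ends in FileNotFoundError, and the command was run from ~/asreview_test_files whereas the earlier runs on this file were from ~/asreview_irr/data, so the path may simply not have resolved from the working directory. A minimal sketch of an early path check that reports where it looked is shown below. The helper resolve_input_path is hypothetical and not part of ASReview or Datatools.

```python
from pathlib import Path


def resolve_input_path(name: str) -> Path:
    """Resolve a dataset path and fail early, naming the directory searched."""
    path = Path(name).expanduser().resolve()
    if not path.is_file():
        raise FileNotFoundError(
            f"No such file: '{name}' (looked in {path.parent})"
        )
    return path
```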

@MvanSteenbergen

MvanSteenbergen commented Mar 21, 2024

Another thing I noticed is that there's no warning when overwriting a file. Maybe good to implement.

Here's an example of what I mean:

(base) mvansteenbergen@ideapad:~/asreview_test_files$ asreview data snowball dataset_2.ris snowballed.csv --forward
Found OpenAlex identifiers for 6 out of 8 records. Performing snowballing for those records.
Starting forward snowballing
0. Getting works citing https://openalex.org/W2241411979
1. Getting works citing https://openalex.org/W2103172754
2. Getting works citing https://openalex.org/W3128349626
3. Getting works citing https://openalex.org/W3014512586
4. Getting works citing https://openalex.org/W2271587360
Saved dataset
(base) mvansteenbergen@ideapad:~/asreview_test_files$ asreview data snowball dataset_1.ris snowballed.csv --forward
Found OpenAlex identifiers for 2 out of 3 records. Performing snowballing for those records.
Starting forward snowballing
0. Getting works citing https://openalex.org/W2241411979
1. Getting works citing https://openalex.org/W2103172754
Saved dataset
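
In the log above, the second run silently replaces the snowballed.csv written by the first. One way to guard against that could look like the sketch below. The function check_output_path and its overwrite parameter are hypothetical; the pull request does not show such an option.

```python
from pathlib import Path


def check_output_path(output_path: str, overwrite: bool = False) -> Path:
    """Refuse to clobber an existing file unless overwrite is requested."""
    path = Path(output_path)
    if path.exists() and not overwrite:
        raise FileExistsError(
            f"Output file '{path}' already exists; "
            "pass overwrite=True to replace it."
        )
    return path
```

Raising an error is the strict variant; printing a warning and continuing would match the "warning" wording of the suggestion more literally.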

@laurens88
Contributor

laurens88 commented Mar 25, 2024

I am getting the following error when I try to run forward snowballing with asreview 1.6:

Traceback (most recent call last):
  File "C:\Users\laure\AppData\Local\Programs\Python\Python38\lib\runpy.py", line 192, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\laure\AppData\Local\Programs\Python\Python38\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\laure\Documents\Work related\Snowballing_PR\asreview-datatools\pr_testing\Scripts\asreview.exe\__main__.py", line 9, in <module>
    sys.exit(main())
  File "c:\users\laure\documents\work related\snowballing_pr\asreview-datatools\pr_testing\lib\site-packages\asreview\__main__.py", line 49, in main
    entry = base_entries[sys.argv[1]].load()
  File "c:\users\laure\documents\work related\snowballing_pr\asreview-datatools\pr_testing\lib\site-packages\importlib_metadata\__init__.py", line 184, in load
    module = import_module(match.group('module'))
  File "C:\Users\laure\AppData\Local\Programs\Python\Python38\lib\importlib\__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 783, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "c:\users\laure\documents\work related\snowballing_pr\asreview-datatools\asreviewcontrib\datatools\entrypoint.py", line 12, in <module>
    from asreviewcontrib.datatools.snowball import _parse_arguments_snowball
  File "c:\users\laure\documents\work related\snowballing_pr\asreview-datatools\asreviewcontrib\datatools\snowball.py", line 26, in <module>
    def forward_snowballing(identifiers: list[str]) -> dict[str, list[dict]]:
TypeError: 'type' object is not subscriptable

Not sure if this error is caused by me or the code?

@PeterLombaers
Member Author

This has to do with the type annotations I added, in combination with older Python versions. Could you try it again with the fix I just made? If it still doesn't work, maybe try it with Python 3.11?
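
For context: built-in generics such as list[str] and dict[str, ...] only became subscriptable in Python 3.9 (PEP 585), so on Python 3.8 the annotation fails when the function is defined. The fix actually made in this pull request is not shown in the thread; the sketch below is one common way to write the same annotation shape so that it also works on 3.8. The function citing_works_stub is a hypothetical placeholder, not the real forward_snowballing.

```python
from typing import Dict, List


def citing_works_stub(identifiers: List[str]) -> Dict[str, List[dict]]:
    """Placeholder with a 3.8-compatible version of the annotation shape."""
    # typing.List / typing.Dict are subscriptable on Python 3.8,
    # unlike the built-in list / dict types.
    return {identifier: [] for identifier in identifiers}
```

Another option is adding `from __future__ import annotations` at the top of the module, which defers evaluation of annotations (PEP 563).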

@laurens88
Contributor

With the fix I get the following error:

Traceback (most recent call last):
  File "C:\Users\laure\AppData\Local\Programs\Python\Python38\lib\runpy.py", line 192, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\laure\AppData\Local\Programs\Python\Python38\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\laure\Documents\Work related\Snowballing_PR\asreview-datatools\pr_testing\Scripts\asreview.exe\__main__.py", line 9, in <module>
    sys.exit(main())
  File "c:\users\laure\documents\work related\snowballing_pr\asreview-datatools\pr_testing\lib\site-packages\asreview\__main__.py", line 50, in main
    _execute_entry_point(entry, sys.argv[2:])
  File "c:\users\laure\documents\work related\snowballing_pr\asreview-datatools\pr_testing\lib\site-packages\asreview\__main__.py", line 28, in _execute_entry_point
    entry().execute(args)
  File "c:\users\laure\documents\work related\snowballing_pr\asreview-datatools\asreviewcontrib\datatools\entrypoint.py", line 103, in execute
    snowball(**args_snowballing)
  File "c:\users\laure\documents\work related\snowballing_pr\asreview-datatools\asreviewcontrib\datatools\snowball.py", line 217, in snowball
    data = data.df.loc[data.included.astype(bool)]
AttributeError: 'NoneType' object has no attribute 'astype'

Will try Python 3.11 next.

@PeterLombaers
Member Author

Thanks for testing, this is useful! This bug is unrelated to the previous one; it happens because your test dataset doesn't contain inclusion labels (the included attribute is None). I made a fix so that your test dataset should also work.
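
The actual fix is not shown in this thread. As an illustration only, a guard for missing labels could look like the sketch below, assuming it is acceptable to fall back to all records when no labels exist. The function select_included and the plain-list representation are hypothetical; the real code operates on an ASReviewData object.

```python
from typing import List, Optional, Sequence


def select_included(
    records: List[dict], included: Optional[Sequence[int]]
) -> List[dict]:
    """Keep only included records, or all records when no labels exist."""
    if included is None:
        # No label column: calling .astype on None is what failed above,
        # so fall back to using every record.
        return list(records)
    return [
        record for record, label in zip(records, included) if bool(label)
    ]
```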

@J535D165
Member

Can you rebase/merge master?

@PeterLombaers
Member Author

Done!

@J535D165 J535D165 merged commit 3a01d45 into asreview:master Mar 28, 2024
1 check passed