The 5Pils dataset contains the annotated meta-context of 1,676 fact-checked images. The dataset is split into three JSON files containing the train, val, and test sets. 5Pils is made available under a CC-BY-SA-4.0 license. Instructions to download the images are provided below.
$ conda create --name 5Pils python=3.9
$ conda activate 5Pils
$ pip install -r requirements.txt
$ python -m spacy download en_core_web_lg
$ python scripts/build_dataset_from_url.py
The dataset is also available on TUdatalib.
Each record contains the following items:
Main items
- URL: URL leading to the fact-checking article
- image URL: URL leading to the fact-checked image
- image path: local path to the image
- publication date: publication date of the fact-checking article
- was the photo used before?: answer to the Provenance pillar
- source: answer to the Source pillar
- date: answer to the Date pillar
- date numeric label: answer to the Date pillar in numeric format
- location: answer to the Location pillar
- motivation: answer to the Motivation pillar
Metadata
The following items are provided as metadata for analysis purposes:
- org: fact-checking organization from which the data was collected
- claim: claim verified in the fact-checking article
- type of image: the type of image claim being verified (True, Out-of-Context, Manipulated, or Fake)
- verification strategy: the list of strategies used by fact-checkers to verify the image
- verification tool: the list of tools used by fact-checkers to verify the image
- claimed location: the location of the image, according to the claim
- claimed date: the date of the image, according to the claim
{
"URL": "https://factly.in/2013-evacuation-image-from-typhoon-hit-philippines-is-passed-off-as-iaf-airlifting-800-people-from-kabul",
"image path": "dataset/processed_img/2013-evacuation-image-from-typhoon-hit-philippines-is-passed-off-as-iaf-airlifting-800-people-from-kabul.png",
"org": "factly",
"publication date": "2021-08-17T12:00:55+00:00",
"claim": "The image shows 800 people airlifted by the Indian Air Force from Kabul.",
"was the photo used before?": "yes",
"source": "US Airforce's official website",
"date": "2013",
"date numeric label": "['2013-01-14T00:00:00+00:00']",
"location": "Tacloban city, Philippines",
"motivation": "To document the US Airforce\u2019s operation Damayan evacuation in 2013",
"type of image": "out-of-context",
"verification strategy": "['reverse image search']",
"verification tool": "['Other/Unspecified']",
"claimed location": "Kabul, Afghanistan",
"claimed date": "2021",
"image URL": "https://factly.in/wp-content/uploads//2021/08/IAF-Airlift-Afghanistan-FB-Post.jpg"
}
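As a quick sanity check, a split can be loaded with the standard json module. The file name below is an assumption based on the split names; adapt it to the actual files in the dataset folder.

import json

# Assumption: each split is a JSON file containing a list of records with the
# fields described above; adapt the path to the actual train/val/test file names.
with open("dataset/train.json", "r", encoding="utf-8") as f:
    train = json.load(f)

record = train[0]
print(record["claim"])
print(record["location"], "|", record["date"], "|", record["motivation"])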
The following files can be found in the dataset folder:
- geonames_results.json: the mapping from the locations of 5Pils to GeoNames coordinates and hierarchies
- image_processing_instructions.txt: instructions to automate the cropping of the raw images collected from FC articles
- manipulation_detection_test.json: the manipulation detection results on the test set, obtained with a fine-tuned Vision Transformer (ViT)
- dataset/url/article_urls.txt: the URLs of the fact-checking articles from which the annotations were extracted
- dataset/retrieval_results/evidence_url.json: the URLs of the evidence retrieved with Google Reverse Image Search
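For instance, the GeoNames mapping could be loaded as follows; the schema used here is an assumption (a dictionary keyed by the 5Pils location strings) and may differ from the actual file.

import json

# Assumption: geonames_results.json maps each 5Pils location string to the
# GeoNames result (coordinates and hierarchy) retrieved for it.
with open("dataset/geonames_results.json", "r", encoding="utf-8") as f:
    geonames = json.load(f)

print(geonames.get("Tacloban city, Philippines"))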
We provide the code to collect FC articles and images from the 🏛️ Wayback Machine for a specific fact-checking organization.
$ python scripts/data_collection_waybackmachine.py --url_domain factly.in/ --org factly --file_path dataset/url/factly.txt --parser factly_parser --scrape_image 1 --process_image 0
The script can be extended to cover additional fact-checking organizations. For that, you will need to create a custom scraper function and place it in dataset_collection/scrape_utils.py. If you want to process the images, e.g. by cropping them to remove social media sidebars, you will need to provide your own processing instruction file or update dataset/image_processing_instructions.txt with instructions for the new images.
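A new parser could follow the shape sketched below. The function name, signature, and return fields are hypothetical; mirror the existing parsers in dataset_collection/scrape_utils.py (e.g., factly_parser) for the exact interface expected by the collection script.

from bs4 import BeautifulSoup

# Hypothetical sketch of a parser for a new fact-checking organization.
# The real signature and return format should follow the existing parsers
# in dataset_collection/scrape_utils.py.
def myorg_parser(html):
    soup = BeautifulSoup(html, "html.parser")
    title = soup.find("h1")
    paragraphs = [p.get_text(" ", strip=True) for p in soup.find_all("p")]
    return {
        "title": title.get_text(strip=True) if title else "",
        "text": "\n".join(paragraphs),
    }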
Parameters:
- url_domain: The domain to query on the Wayback Machine API
- org: Organization from which articles are collected
- file_path: Path to the file that stores the URLs
- parser: A parser function. Each fact-checking organization has a dedicated one, found in dataset_collection/scrape_utils.py. By default, we provide parsers for Factly, Pesacheck, and 211check. If you collect data from another organization, you will need to create your own parser and add it to dataset_collection/scrape_utils.py
- scrape_image: If True, download the fact-checking images in addition to the article text
- process_image: If True, process the images by cropping them to remove social media sidebars. This requires an instruction file that is, by default, only available for the 1,676 images of 5Pils
- image_processing_script: Script with automated instructions to crop and clean the images. Needs to be a valid file with instructions if process_image is True. If you collect new data, you will need to update the instructions in the file
- start_date: Start date for the collection of URLs
- end_date: End date for the collection of URLs
- max_count: Maximum number of URLs to collect
- chunk_size: Size of each chunk to query the Wayback Machine API. It is not recommended to set it higher than 5000
- sleep: Waiting time between two calls of the Wayback Machine API. It is recommended to use a sufficiently high value; the default is 5
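For example, a collection run restricted to a time window and a URL budget could look as follows; the date format shown is an assumption (the YYYYMMDD convention of the Wayback Machine CDX API) and should be checked against the script's argument parser.

$ python scripts/data_collection_waybackmachine.py --url_domain factly.in/ --org factly --file_path dataset/url/factly.txt --parser factly_parser --scrape_image 1 --process_image 0 --start_date 20210101 --end_date 20211231 --max_count 1000 --chunk_size 5000 --sleep 5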
The next step is to automatically extract the annotations from the FC articles using an LLM. Our code uses GPT-4, but it can easily be modified to leverage other LLMs. Using GPT-4 requires an Azure account with access to the Azure OpenAI service. You will need to provide your API key and API endpoint as parameters.
$ python scripts/data_labeling_gpt4.py
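If you want to adapt this step to your own setup, a minimal sketch of an Azure OpenAI call is shown below. The deployment name, API version, and prompt are placeholders; the repository's script may structure the call and the output parsing differently.

import os
from openai import AzureOpenAI  # requires the openai Python package (>=1.0)

# Placeholders: deployment name, API version, and prompt are assumptions and
# may differ from the ones used in scripts/data_labeling_gpt4.py.
client = AzureOpenAI(
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_version="2024-02-01",
)

article_text = "..."  # text of a collected fact-checking article
response = client.chat.completions.create(
    model="gpt-4",  # name of your Azure OpenAI deployment
    messages=[
        {"role": "system", "content": "Extract the provenance, source, date, location, and motivation of the fact-checked image from the article."},
        {"role": "user", "content": article_text},
    ],
)
print(response.choices[0].message.content)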
The final step is to apply the preprocessing functions. Locations are mapped to coordinates and hierarchies using 🌍 GeoNames. You will need to create a (free) account and provide your account name as input.
$ python scripts/preprocess_dataset.py
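For reference, a location string can be resolved with the public GeoNames searchJSON endpoint roughly as follows; the preprocessing script may query GeoNames differently, and the username placeholder should be replaced with your account name.

import requests

# Query the public GeoNames searchJSON API for a 5Pils location string.
# Replace "your_geonames_username" with the account name created on geonames.org.
def geocode(location, username="your_geonames_username"):
    resp = requests.get(
        "http://api.geonames.org/searchJSON",
        params={"q": location, "maxRows": 1, "username": username},
        timeout=10,
    )
    resp.raise_for_status()
    results = resp.json().get("geonames", [])
    return results[0] if results else None

print(geocode("Tacloban city, Philippines"))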