The 5Pils dataset 📸

The 5Pils dataset contains the annotated meta-context of 1,676 fact-checked images. The dataset is split into three JSON files containing the train, val, and test sets. 5Pils is made available under a CC-BY-SA-4.0 license. Instructions to download the images are provided below.

$ conda create --name 5Pils python=3.9
$ conda activate 5Pils
$ pip install -r requirements.txt
$ python -m spacy download en_core_web_lg
$ python scripts/build_dataset_from_url.py

The dataset is also available on TUdatalib.
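For reference, a minimal loading sketch in Python, assuming the three splits are stored as dataset/train.json, dataset/val.json, and dataset/test.json (adjust the file names to the actual split files):

    import json

    # Assumed layout: one JSON file per split inside the dataset folder.
    splits = {}
    for name in ("train", "val", "test"):
        with open(f"dataset/{name}.json", encoding="utf-8") as f:
            splits[name] = json.load(f)

    print(len(splits["train"]), "training records")
    print(splits["train"][0]["URL"])  # each record is a dict of the items listed below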

Each record contains the following items:

Main items

  • URL: URL leading to the fact-checking article
  • image URL: URL leading to the fact-checked image
  • image path: local path to the image
  • publication date: publication date of the fact-checking article
  • was the photo used before?: answer to the Provenance pillar
  • source: answer to the Source pillar
  • date: answer to the Date pillar
  • date numeric label: answer to the Date pillar in numeric format
  • location: answer to the Location pillar
  • motivation: answer to the Motivation pillar

Metadata

The following items are provided as metadata for analysis purposes:

  • org: fact-checking organization from which the data was collected
  • claim: claim verified in the fact-checking article
  • type of image: the type of image claim being verified (True, Out-of-Context, Manipulated, or Fake)
  • verification strategy: the list of strategies used by fact-checkers to verify the image
  • verification tool: the list of tools used by fact-checkers to verify the image
  • claimed location: the location of the image, according to the claim
  • claimed date: the date of the image, according to the claim

Example

    {
        "URL": "https://factly.in/2013-evacuation-image-from-typhoon-hit-philippines-is-passed-off-as-iaf-airlifting-800-people-from-kabul",
        "image path": "dataset/processed_img/2013-evacuation-image-from-typhoon-hit-philippines-is-passed-off-as-iaf-airlifting-800-people-from-kabul.png",
        "org": "factly",
        "publication date": "2021-08-17T12:00:55+00:00",
        "claim": "The image shows 800 people airlifted by the Indian Air Force from Kabul.",
        "was the photo used before?": "yes",
        "source": "US Airforce's official website",
        "date": "2013",
        "date numeric label": "['2013-01-14T00:00:00+00:00']",
        "location": "Tacloban city, Philippines",
        "motivation": "To document the US Airforce\u2019s operation Damayan evacuation in 2013",
        "type of image": "out-of-context",
        "verification strategy": "['reverse image search']",
        "verification tool": "['Other/Unspecified']",
        "claimed location": "Kabul, Afghanistan",
        "claimed date": "2021",
        "image URL": "https://factly.in/wp-content/uploads//2021/08/IAF-Airlift-Afghanistan-FB-Post.jpg"
    }
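In the example above, the list-valued fields ("date numeric label", "verification strategy", "verification tool") are stored as Python-style list strings. A minimal parsing sketch based on this example (not a documented guarantee for every record):

    import ast

    record = {
        "date numeric label": "['2013-01-14T00:00:00+00:00']",
        "verification strategy": "['reverse image search']",
        "verification tool": "['Other/Unspecified']",
    }

    # ast.literal_eval safely converts the list strings into actual Python lists.
    dates = ast.literal_eval(record["date numeric label"])
    strategies = ast.literal_eval(record["verification strategy"])
    print(dates[0], strategies[0])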

Other files 🗃️

The following files can be found in the dataset folder:

  • geonames_results.json: the mapping from the locations of 5Pils to GeoNames coordinates and hierarchies
  • image_processing_instructions.txt: instructions to automate the cropping of the raw images collected from fact-checking (FC) articles
  • manipulation_detection_test.json: the manipulation-detection classification results on the test set, obtained with a fine-tuned ViT model
  • dataset/url/article_urls.txt: the URLs of the fact-checking articles from which the annotations were extracted
  • dataset/retrieval_results/evidence_url.json: the URLs of the evidence retrieved with Google Reverse Image Search

Collect your own data 🔎

Wayback Machine scraping

We provide the code to collect FC articles and images from the 🏛️ Wayback Machine for a specific fact-checking organization.

$ python scripts/data_collection_waybackmachine.py --url_domain factly.in/ --org factly --file_path dataset/url/factly.txt --parser factly_parser --scrape_image 1 --process_image 0

The script can be extended to cover additional fact-checking organizations: create a custom scraper function and place it in dataset_collection/scrape_utils.py. If you want to process the images, e.g., by cropping them to remove social media sidebars, you will need to provide your own processing instruction file or update dataset/image_processing_instructions.txt with instructions for the new images.
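As a rough illustration, a new parser could look like the sketch below. The function name, signature, and returned fields are placeholders for illustration only; align them with the existing parsers in dataset_collection/scrape_utils.py when adding a real one.

    from bs4 import BeautifulSoup

    def example_org_parser(html):
        """Hypothetical parser: extract title, publication date, and body text
        from one fact-checking article of a fictional organization."""
        soup = BeautifulSoup(html, "html.parser")
        title_tag = soup.find("h1")
        title = title_tag.get_text(strip=True) if title_tag else ""
        date_tag = soup.find("meta", attrs={"property": "article:published_time"})
        date = date_tag["content"] if date_tag else ""
        paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]
        return {"title": title, "publication date": date, "text": "\n".join(paragraphs)}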

Parameters:

  • url_domain: The domain to query on the Wayback Machine API
  • org: Organization from which articles are collected
  • file_path: Path to the file that stores the URLs
  • parser: A parser function; each fact-checking organization has a dedicated one in dataset_collection/scrape_utils.py. By default, we provide parsers for Factly, Pesacheck, and 211check. To collect data from another organization, create your own parser and add it to dataset_collection/scrape_utils.py (see the sketch above)
  • scrape_image: If True, download the fact-checking images in addition to the article text
  • process_image: If True, process the images by cropping them to remove social media sidebars. This requires an instruction file that is, by default, only available for the 1,676 images of 5Pils
  • image_processing_script: Script with automated instructions to crop and clean the images. Needs to be a valid file with instructions if process_image is True. If you collect new data, you will need to update the instructions in the file.
  • start_date: Start date for the collection of URLs
  • end_date: End date for the collection of URLs
  • max_count: Maximum number of URLs to collect
  • chunk_size: Size of each chunk when querying the Wayback Machine API. Setting it higher than 5000 is not recommended (see the sketch after this list)
  • sleep: Waiting time between two calls to the Wayback Machine API. Use a sufficiently high value; the default is 5
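For context, chunked querying of the Wayback Machine CDX API with a pause between calls can be sketched as follows. This is a generic illustration, not the repository's script; the query parameters are standard CDX options and the domain is only an example.

    import time
    import requests

    CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

    def fetch_archived_urls(url_domain, chunk_size=5000, sleep=5, max_count=20000):
        """Collect archived URLs for a domain in chunks, pausing between API calls."""
        collected = []
        offset = 0
        while offset < max_count:
            params = {
                "url": url_domain + "*",  # match every page under the domain
                "output": "json",
                "fl": "original",         # return only the original URL field
                "limit": chunk_size,
                "offset": offset,
            }
            rows = requests.get(CDX_ENDPOINT, params=params, timeout=60).json()
            if len(rows) <= 1:            # first row is the header; stop when no results remain
                break
            collected.extend(row[0] for row in rows[1:])
            offset += chunk_size
            time.sleep(sleep)             # be polite to the API between chunks
        return collected

    urls = fetch_archived_urls("factly.in/", chunk_size=5000, sleep=5)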

Automated labeling

The next step is to automatically extract the annotations from the FC articles using an LLM. Our code uses GPT-4, but it can easily be modified to leverage other LLMs. Using GPT-4 requires an Azure account with access to the Azure OpenAI Service. You will need to provide your API key and API endpoint as parameters.

$ python scripts/data_labeling_gpt4.py
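For orientation, extracting one annotation with GPT-4 on Azure might look roughly like the sketch below. The environment variables, deployment name, and prompt are placeholders; the actual prompts and parameters live in scripts/data_labeling_gpt4.py.

    import os
    from openai import AzureOpenAI

    # Placeholder credentials: pass your own Azure OpenAI key and endpoint.
    client = AzureOpenAI(
        api_key=os.environ["AZURE_OPENAI_KEY"],
        azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
        api_version="2024-02-01",
    )

    article_text = "..."  # text of one fact-checking article
    prompt = (
        "Based on the fact-checking article below, where was the photo taken?\n\n"
        + article_text
    )

    response = client.chat.completions.create(
        model="gpt-4",  # name of your Azure OpenAI deployment
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    print(response.choices[0].message.content)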

Dataset preprocessing

The final step is to apply the preprocessing functions. Locations are mapped to coordinates and hierarchies using 🌍 GeoNames. You will need to create a (free) GeoNames account and provide your account name as input.

$ python scripts/preprocess_dataset.py
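As an illustration of the GeoNames step, the public search API can be queried as below. This is a generic sketch rather than the repository's preprocessing code; replace the username placeholder with your own GeoNames account name.

    import requests

    GEONAMES_SEARCH = "http://api.geonames.org/searchJSON"

    def geonames_lookup(location, username):
        """Return the top GeoNames match (name, coordinates, country) for a location string."""
        params = {"q": location, "maxRows": 1, "username": username}
        data = requests.get(GEONAMES_SEARCH, params=params, timeout=30).json()
        if not data.get("geonames"):
            return None
        top = data["geonames"][0]
        return {
            "name": top["name"],
            "lat": top["lat"],
            "lng": top["lng"],
            "country": top.get("countryName"),
        }

    print(geonames_lookup("Tacloban city, Philippines", username="your_geonames_username"))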