ORKG Synthesis Dataset

This work has been accepted for publication at the JCDL 2024 conference.

What is the ORKG Synthesis Dataset?

We develop a methodology to collect and process scientific papers into a format ready for synthesis using the Open Research Knowledge Graph (ORKG), a multidisciplinary platform that facilitates the comparison of scientific contributions. We then introduce three new synthesis types (paper-wise, methodological, and thematic) that focus on different aspects of the extracted insights. Using Mistral-7B and GPT-4, we generate a large-scale dataset of these syntheses. We also establish nine quality criteria for evaluating the syntheses, which are assessed by both an automated LLM evaluator (GPT-4) and a crowdsourced human survey.

Directories

corpus: Contains the ORKG Synthesis dataset for both GPT-4 and Mistral-7B across the three synthesis objectives (paper-wise, methodological, and thematic), as well as the Prolific human survey results.
gpt-4 synthesis-evaluator: Contains the evaluation system prompt and the evaluator script.
orkg-comparison-data-gen-scripts: Synthesis generation scripts.
synthesis-generation-prompts: Synthesis generation prompts for paper-wise, methodological, and thematic objectives.

Prolific Survey

The Prolific survey participant demographics are available in Table 1 in the corpus/prolific directory.

The average human and automatic (LLM) evaluation scores are available in Table 2 in the corpus/prolific directory, broken down by domain and characteristic. For each domain/characteristic pair, the human score is an average of 18 judgements (6 syntheses (2 samples x 3 synthesis types) x 3 participants), while the automatic score is an average of 6 judgements (6 syntheses (2 samples x 3 synthesis types) x 1 LLM evaluation).
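The judgement counts behind each averaged score can be checked with a small sketch. This is an illustration only, not a script from the repository: the constant and function names are hypothetical, and the score lists passed in are placeholders rather than survey data.

```python
from statistics import mean

# Counts taken from the description above; the names are illustrative.
SAMPLES_PER_DOMAIN = 2
SYNTHESIS_TYPES = ("paper-wise", "methodological", "thematic")
HUMAN_RATERS = 3  # participants per synthesis
LLM_RATERS = 1    # one LLM evaluation per synthesis


def judgements_per_cell(raters: int) -> int:
    """Number of judgements averaged for one domain/characteristic pair."""
    return SAMPLES_PER_DOMAIN * len(SYNTHESIS_TYPES) * raters


def average_score(scores: list[float], raters: int) -> float:
    """Average the scores after checking that the expected number is present."""
    expected = judgements_per_cell(raters)
    if len(scores) != expected:
        raise ValueError(f"expected {expected} judgements, got {len(scores)}")
    return mean(scores)
```

Calling `judgements_per_cell(HUMAN_RATERS)` and `judgements_per_cell(LLM_RATERS)` reproduces the 18 and 6 judgements stated above. The length check in `average_score` is a design choice for catching incomplete cells before averaging.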

LLMs4Synthesis

The LLMs4Synthesis framework on top of this dataset is available at https://github.com/HamedBabaei/LLMs4Synthesis.

Citation

Preprint:

@misc{giglou2024llms4synthesisleveraginglargelanguage,
      title={LLMs4Synthesis: Leveraging Large Language Models for Scientific Synthesis},
      author={Hamed Babaei Giglou and Jennifer D'Souza and Sören Auer},
      year={2024},
      eprint={2409.18812},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2409.18812},
}
