A repository containing Python scripts for collating code content from the public repositories of `huggingface` on GitHub.
Resultant dataset: https://huggingface.co/datasets/sayakpaul/hf-codegen-v2.
Update: Sourab and I published a blog post utilizing a part of this dataset to train a custom coding assistant. While I focused on data collection efforts, Sourab led the rest of the project, running numerous experiments. Check out our blog post here: Personal Copilot: Train Your Own Coding Assistant.
Make sure you have at least 50 GB of disk space.
1. Clone the repo and change to the `data` directory.
2. Install the Python dependencies: `pip install -r requirements.txt`.
3. Run `python parallel_clone_repos.py` to locally clone the public repositories situated under the `huggingface` GitHub org. You'd need to set up `GH_ACCESS_TOKEN` as the env variable (it can be your GitHub personal access token). A rough sketch of what this step does is shown right after this list.
4. Log in to your HF account: `huggingface-cli login`.
5. Prepare the dataset, serialize it in Feather files, and upload them to the Hugging Face Hub: `python prepare_dataset.py`.
6. To finally have the dataset compatible with 🤗 Datasets (helps with downstream training), run `python push_to_hub.py`. A sketch of this step appears further below.
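For reference, here is a minimal sketch of what the parallel cloning step could look like. The org name comes from above, but the output directory, worker count, and API pagination details are assumptions rather than the exact contents of `parallel_clone_repos.py`:

```python
# A minimal sketch of parallel cloning of the `huggingface` org's public repos.
# Output directory and worker count are assumptions, not the script's exact logic.
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor

import requests

GH_ACCESS_TOKEN = os.environ["GH_ACCESS_TOKEN"]  # GitHub personal access token
ORG = "huggingface"
CLONE_DIR = "hf_public_repos"  # hypothetical output directory


def list_public_repos(org: str) -> list[str]:
    """Page through the GitHub API and collect all public repo names for the org."""
    repos, page = [], 1
    while True:
        response = requests.get(
            f"https://api.github.com/orgs/{org}/repos",
            params={"per_page": 100, "page": page, "type": "public"},
            headers={"Authorization": f"Bearer {GH_ACCESS_TOKEN}"},
        )
        response.raise_for_status()
        batch = response.json()
        if not batch:
            break
        repos.extend(repo["name"] for repo in batch)
        page += 1
    return repos


def clone(repo_name: str) -> None:
    """Shallow-clone a single repository into CLONE_DIR."""
    subprocess.run(
        ["git", "clone", "--depth", "1", f"https://github.com/{ORG}/{repo_name}.git"],
        cwd=CLONE_DIR,
        check=False,
    )


if __name__ == "__main__":
    os.makedirs(CLONE_DIR, exist_ok=True)
    with ThreadPoolExecutor(max_workers=8) as executor:
        executor.map(clone, list_public_repos(ORG))
```

Shallow clones (`--depth 1`) help keep the disk footprint down, since only the code content is needed, not the full git history.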
💡 Note that Step 6 was run on a separate machine with lots of RAM (240 GB). Steps 5 and 6 could have been clubbed together had we used a more capable machine from the get-go.
The final dataset can be found here: [sayakpaul/hf-codegen-v2](https://huggingface.co/datasets/sayakpaul/hf-codegen-v2).
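As a rough illustration of what the final step does, the sketch below loads serialized Feather files back into `pandas`, builds a 🤗 `Dataset`, and pushes it to the Hub. The directory name is hypothetical, and the actual `push_to_hub.py` may differ:

```python
# A minimal sketch of pushing the collated data to the Hub as a 🤗 Dataset.
import glob

import pandas as pd
from datasets import Dataset

# Hypothetical directory holding the Feather files produced in the previous step.
feather_files = sorted(glob.glob("feather_files/*.feather"))
df = pd.concat((pd.read_feather(path) for path in feather_files), ignore_index=True)

dataset = Dataset.from_pandas(df)
dataset.push_to_hub("sayakpaul/hf-codegen-v2")  # requires `huggingface-cli login` beforehand
```

This is also the step that benefits from a lot of RAM, since the concatenated dataframe holds every file's content in memory at once.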
Initially, we tried to also parallelize the reading and processing of the code contents using `multiprocessing`, but couldn't succeed in doing so: the process kept running out of memory.
So, we decided to process the files of each repository (115 repositories in total) sequentially. The utility returns a dictionary per file, which we were appending to a `pandas` dataframe that was initialized empty, containing just the columns. Our plan was to construct a final big `pandas` dataframe and serialize that.
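A minimal sketch of that sequential approach, under the assumption that the cloned repositories live in a flat directory; the column names and file handling are illustrative, not the exact logic of `prepare_dataset.py`:

```python
# Sequentially walk each cloned repository and append its files' contents
# to a single running dataframe (the approach described above).
import os

import pandas as pd

MIRROR_DIRECTORY = "hf_public_repos"  # hypothetical location of the cloned repos


def read_file_content(file_path: str, repo_id: str) -> dict:
    """Return a single row (as a dict) for one file of a repository."""
    with open(file_path, "r", encoding="utf-8", errors="ignore") as f:
        content = f.read()
    return {"repo_id": repo_id, "file_path": file_path, "content": content}


# Start from an empty dataframe that only has the columns defined.
df = pd.DataFrame(columns=["repo_id", "file_path", "content"])

for repo_id in os.listdir(MIRROR_DIRECTORY):
    rows = []
    for root, _, files in os.walk(os.path.join(MIRROR_DIRECTORY, repo_id)):
        for name in files:
            rows.append(read_file_content(os.path.join(root, name), repo_id))
    # Append this repository's rows to the running dataframe.
    df = pd.concat([df, pd.DataFrame(rows)], ignore_index=True)
```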
That plan also failed to run to completion, as we ran out of memory in this case too.
So, we decided to serialize multiple dataframes in chunks so as not to exhaust the memory. Initially, we serialized the dataframes in `.csv` format, but the resultant CSV files were several GBs in size. So, we finally settled on the Feather format for serialization, which resulted in much lighter files (from about a GB down to 300 MB, for example).