methylprep is a Python package for processing Illumina methylation array data.
View on ReadTheDocs.
methylprep is part of a methyl-suite of Python packages that provide functions to process and analyze DNA methylation data from Illumina arrays (27k, 450k, and EPIC/850k supported). The methylprep package contains functions for processing raw data files from arrays, and for downloading (and processing) public data sets from GEO (the NIH Gene Expression Omnibus database repository) or from ArrayExpress. It contains both a command line interface (CLI) for processing data from local files, and a set of functions for building a custom pipeline in a Jupyter notebook or Python scripting environment. The aim is to offer a standard process, with flexibility for those who want it.
You should install all three components, as they work together.
methylcheck includes:
- quality control (QC) functions for filtering out unreliable probes, based on the published literature and outlier detection
- sample outlier detection
- array-level QC plots, based on Genome Studio functions
- data visualization functions based on the seaborn and matplotlib graphics libraries
- sex prediction for human samples from probes
- an interactive method for assigning samples to groups, based on array data, in a Jupyter notebook
methylize provides analysis functions:
- differentially methylated probe statistics (between treatment and control samples)
- volcano plots (which probes are the most different)
- manhattan plots (where in the genome the differences are)
methylprep maintains configuration files for your Python package manager of choice: pipenv or pip. Conda install is coming soon.
pip install methylprep
The most common use case is processing .idat files on a computer within a command line interface. This can also be done in a Jupyter notebook, but large data sets take hours to run, and Jupyter will take longer to run them than the command line.
python -m methylprep -v process -d <filepath> --all
The --all option applies the most common settings. Here are some specific options:
Argument | Type | Default | Description |
---|---|---|---|
data_dir | str, Path | REQUIRED | Base directory of the sample sheet and associated IDAT files |
array_type | str | None | Code of the array type being processed. Possible values are custom, 27k, 450k, epic, and epic+. If not provided, the package will attempt to determine the array type based on the number of probes in the raw data. If the batch contains samples from different array types, this may not work. Our data download function attempts to split different arrays into separate batches for processing to accommodate this. |
manifest_filepath | str, Path | None | File path for the array's manifest file. If not provided, this file will be downloaded from a Life Epigenetics archive. |
no_sample_sheet | bool | None | Pass "--no_sample_sheet" on the command line to trigger sample sheet auto-generation. Sample names will be based on idat filenames. Useful for public GEO data sets that lack sample sheets. |
sample_sheet_filepath | str, Path | None | File path of the project's sample sheet. If not provided, the package will try to find one based on the supplied data directory path. |
sample_name | str to list | None | List of sample names to process, in the CLI format of -n sample1 sample2 sample3 etc. If provided, only the samples specified will be processed. Otherwise all samples found in the sample sheet will be processed. |
export | bool | False | Add flag to export the processed data to CSV. |
betas | bool | False | Add flag to output a pickled dataframe of beta values of sample probe values. |
m_value | bool | False | Add flag to output a pickled dataframe of M-values of sample probe values. |
batch_size | int | None | Optional: splits the batch into smaller sets for processing. Useful when processing hundreds of samples that can't fit into memory. Produces multiple output files. This is also used by the package to process batches that come from different array types. |
data_dir is the one required field. If you do not provide the file path for the project's sample sheet, the package will find one based on the supplied data directory path. It will also auto-detect the array type and download the corresponding manifest file for you.
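To launch the same command from a Python script, one option is to assemble the argument list and hand it to `subprocess`. The sketch below uses only the flags shown above (`-v`, `process`, `-d`, `--all`); the helper names `build_process_command` and `run_process`, and the example path, are hypothetical and not part of methylprep.

```python
import subprocess
import sys
from pathlib import Path


def build_process_command(data_dir, verbose=True, use_all=True):
    """Build the argv list for `python -m methylprep process` (hypothetical helper)."""
    cmd = [sys.executable, "-m", "methylprep"]
    if verbose:
        cmd.append("-v")  # top-level flag, so it goes before the sub-command
    cmd += ["process", "-d", str(Path(data_dir))]
    if use_all:
        cmd.append("--all")  # apply the most common settings
    return cmd


def run_process(data_dir):
    """Run the CLI; raises CalledProcessError if it exits with a non-zero status."""
    return subprocess.run(build_process_command(data_dir), check=True)


if __name__ == "__main__":
    # Only prints the command; call run_process(...) to actually execute it.
    print(" ".join(build_process_command("path/to/idats")))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when the data directory contains spaces.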
Run the complete methylation processing pipeline for the given project directory, optionally exporting the results to file.
Returns: A list of SampleDataContainer objects, one for each processed sample
from methylprep import run_pipeline
data_containers = run_pipeline(data_dir, array_type=None, export=False, manifest_filepath=None, sample_sheet_filepath=None, sample_names=None)
Note: All the same input parameters from the command line apply to run_pipeline, except --all. Type help(methylprep.run_pipeline) in an interactive Python session to see details.
Note: By default, if run_pipeline is called as a function in a script, a list of SampleDataContainer objects is returned. However, if you specify betas=True or m_value=True, a dataframe of beta values or M-values is returned instead. All methylcheck functions are designed to work on a dataframe or on a folder containing the processed data generated by run_pipeline.
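As a minimal sketch of the second note, the wrapper below requests beta values directly. It assumes methylprep is installed and relies only on the `betas=True` behaviour described above; the function name `load_beta_values` is hypothetical, and the `.shape` attribute assumes the returned object is a pandas-style dataframe.

```python
def load_beta_values(data_dir):
    """Process a folder of IDATs and return beta values (hypothetical wrapper).

    Per the note above, betas=True makes run_pipeline return a dataframe
    instead of a list of SampleDataContainer objects.
    """
    # Imported lazily so this module can be imported without methylprep installed.
    from methylprep import run_pipeline

    return run_pipeline(data_dir, betas=True)


if __name__ == "__main__":
    import sys

    if len(sys.argv) > 1:
        # Usage: python this_script.py <folder with IDATs and sample sheet>
        betas = load_beta_values(sys.argv[1])
        print(betas.shape)  # assumes a pandas-style dataframe
```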
methylprep provides a command line interface (CLI) so the package can be used directly in bash/batchfile or windows/cmd scripts as part of building your custom processing pipeline.
All invocations of the methylprep CLI will provide contextual help, supplying the possible arguments and/or options available based on the invoked command. If you specify verbose logging, the package will emit log output of DEBUG level and above.
python -m methylprep
usage: methylprep [-h] [-v] {process,sample_sheet} ...
Utility to process methylation data from Illumina IDAT files
positional arguments:
{process,sample_sheet}
process process help
sample_sheet sample sheet help
optional arguments:
-h, --help show this help message and exit
-v, --verbose Enable verbose logging
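A usage message of this shape is what Python's standard `argparse` module produces for a parser with sub-commands. The following is a minimal sketch of such a layout, not methylprep's actual source: it models only the two sub-commands and the `-v` and `-d` options shown in this document, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser():
    """Build a toy parser mirroring the usage text above (illustrative only)."""
    parser = argparse.ArgumentParser(
        prog="methylprep",
        description="Utility to process methylation data from Illumina IDAT files",
    )
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="Enable verbose logging")
    subparsers = parser.add_subparsers(dest="command")

    # Each sub-command gets its own parser, and therefore its own --help text.
    process = subparsers.add_parser("process", help="process help")
    process.add_argument("-d", "--data_dir", required=True)

    sample_sheet = subparsers.add_parser("sample_sheet", help="sample sheet help")
    sample_sheet.add_argument("-d", "--data_dir", required=True)
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["-v", "process", "-d", "my_idats"])
    print(args.command, args.data_dir, args.verbose)
```

Because every sub-command owns a separate parser, asking for help after a sub-command name lists only the options that apply to that command.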
The methylprep CLI provides these top-level commands:
- process: the main function, processing methylation data from idat files. Covered already.
- sample_sheet: find/read/validate a sample sheet and output its contents.
- download: download and process public data sets in NIH GEO or ArrayExpress collections. Provide the public accession ID and it will handle the rest.
- composite: download a set of datasets from a list of GEO IDs, process them all, and combine them into one large dataset.
- alert: scan the GEO database and construct a CSV / dataframe of sample meta data and phenotypes for all studies matching a keyword.
There are thousands of publicly accessible DNA methylation data sets available via the GEO (US NCBI NIH) https://www.ncbi.nlm.nih.gov/geo/ and ArrayExpress (UK) https://www.ebi.ac.uk/arrayexpress/ websites. This function makes it easy to import them and build a reference library of methylation data.
The CLI now includes a download option. Supply the GEO ID or ArrayExpress ID and it will locate the files, download the idats, process them, and build a dataframe of the associated meta data. This dataframe format should be compatible with methylcheck and methylize.
Argument | Type | Default | Description |
---|---|---|---|
-h, --help | | | show this help message and exit |
-d, --data_dir | str | [required path] | Path to where the data series will be saved. Folder must exist already. |
-i ID, --id ID | str | [required ID] | The dataset's reference ID (starts with GSE for GEO or E-MTAB- for ArrayExpress) |
-l LIST, --list LIST | multiple strings | optional | List of series IDs (can be either GEO or ArrayExpress), for partial downloading |
-o, --dict_only | True | pass flag only | If passed, will only create dictionaries and not process any samples |
-b BATCH_SIZE, --batch_size BATCH_SIZE | int | optional | Number of samples to process at a time, 100 by default. |
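Since the two repositories use different ID prefixes, a script that prepares IDs for --id or --list can sort them by source first. The sketch below relies only on the prefixes stated in the table (GSE for GEO, E-MTAB- for ArrayExpress). The function names are hypothetical, the case-insensitive matching is an assumption, and `E-MTAB-0000` is a placeholder rather than a real accession.

```python
def classify_series_id(series_id):
    """Return 'GEO', 'ArrayExpress', or None for an unrecognised ID."""
    cleaned = series_id.strip().upper()
    if cleaned.startswith("GSE"):
        return "GEO"
    if cleaned.startswith("E-MTAB-"):
        return "ArrayExpress"
    return None


def group_by_source(series_ids):
    """Group IDs by repository; anything unrecognised lands under 'unknown'."""
    groups = {"GEO": [], "ArrayExpress": [], "unknown": []}
    for sid in series_ids:
        source = classify_series_id(sid)
        groups[source or "unknown"].append(sid.strip())
    return groups


if __name__ == "__main__":
    # GSE133062 appears elsewhere in this document; the other IDs are placeholders.
    print(group_by_source(["GSE133062", "E-MTAB-0000", "not-an-id"]))
```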
When processing large batches of raw .idat files, specify --batch_size to break the processing up into smaller batches so the computer's memory won't overload. This is off by default when using process, but is on when using download, where it is set to a batch_size of 100. Set it to 0 to force processing everything as one batch. The output will be split into multiple files, and you can recombine them using methylcheck.load.
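The batching rule described above can be sketched in a few lines. This is an illustration of the idea, not methylprep's internal code; `split_into_batches` is a hypothetical helper, and a batch size of 0 (or None) is treated as "one batch", matching the behaviour described for --batch_size.

```python
def split_into_batches(samples, batch_size=100):
    """Split samples into consecutive batches; 0 or None means a single batch."""
    samples = list(samples)
    if not batch_size:
        return [samples] if samples else []
    if batch_size < 0:
        raise ValueError("batch_size must be non-negative")
    return [samples[i:i + batch_size] for i in range(0, len(samples), batch_size)]


if __name__ == "__main__":
    names = [f"sample_{i}" for i in range(250)]
    print([len(batch) for batch in split_into_batches(names)])     # [100, 100, 50]
    print([len(batch) for batch in split_into_batches(names, 0)])  # [250]
```

Each batch would produce its own output file, which is why the results need to be recombined afterwards.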
Find and parse the sample sheet in a given directory and emit the details of each sample. This is not required for actually processing data.
>>> python -m methylprep sample_sheet
usage: methylprep sample_sheet -d DATA_DIR
optional arguments:
Argument | Type | Description |
---|---|---|
-h, --help | | show this help message and exit |
-d, --data_dir | string | Base directory of the sample sheet and associated IDAT files |
-c, --create | bool | If specified, this creates a sample sheet from idats instead of parsing an existing sample sheet. The output file will be called "samplesheet.csv". |
-o OUTPUT_FILE, --output_file OUTPUT_FILE | string | If creating a sample sheet, you can provide an optional output filename (CSV). |
~/methylprep$ python -m methylprep -v sample_sheet -d ~/GSE133062/GSE133062 --create
INFO:methylprep.files.sample_sheets:[!] Created sample sheet: ~/GSE133062/GSE133062/samplesheet.csv with 70 GSM_IDs
INFO:methylprep.files.sample_sheets:Searching for sample_sheet in ~/GSE133062/GSE133062
INFO:methylprep.files.sample_sheets:Found sample sheet file: ~/GSE133062/GSE133062/samplesheet.csv
INFO:methylprep.files.sample_sheets:Parsing sample_sheet
200861170112_R01C01
200882160083_R03C01
200861170067_R02C01
200498360027_R04C01
200498360027_R08C01
200861170067_R01C01
200861170072_R05C01
200498360027_R06C01
200861170072_R01C01
200861170067_R03C01
200882160070_R02C01
...
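The IDs printed above combine a chip barcode and a position on the chip. A sample sheet can be derived from file names alone when the IDATs follow the common Illumina naming pattern `<barcode>_<position>_Grn.idat` / `_Red.idat`, optionally prefixed with a GEO `GSM` ID. The sketch below assumes that pattern; it is not methylprep's implementation, and the dictionary keys and the `GSM0000001` placeholder are chosen for illustration only.

```python
import re

# Assumed pattern: [GSM<digits>_]<barcode>_<RxxCyy>_<Grn|Red>.idat
IDAT_NAME = re.compile(
    r"^(?:(?P<gsm>GSM\d+)_)?"
    r"(?P<sentrix_id>\d+)_(?P<position>R\d{2}C\d{2})_"
    r"(?P<channel>Grn|Red)\.idat$"
)


def samples_from_filenames(filenames):
    """Collect one entry per sample from a list of IDAT file names."""
    samples = {}
    for name in filenames:
        match = IDAT_NAME.match(name)
        if match is None:
            continue  # skip files that do not follow the assumed pattern
        key = f"{match['sentrix_id']}_{match['position']}"
        entry = samples.setdefault(key, {
            "GSM_ID": match["gsm"],
            "Sentrix_ID": match["sentrix_id"],
            "Sentrix_Position": match["position"],
            "channels": set(),
        })
        entry["channels"].add(match["channel"])
    return samples


if __name__ == "__main__":
    files = [
        "GSM0000001_200861170112_R01C01_Grn.idat",  # placeholder GSM ID
        "GSM0000001_200861170112_R01C01_Red.idat",
        "200882160083_R03C01_Grn.idat",
    ]
    for sample_id, info in samples_from_filenames(files).items():
        print(sample_id, sorted(info["channels"]))
```

Tracking the channels per sample makes it easy to spot a sample whose green or red file is missing before processing starts.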
A tool to build a data set from a list of public datasets.
optional arguments:
Argument | Type | Description |
---|---|---|
-h, --help | | show this help message and exit |
-l LIST, --list LIST | filepath | A text file containing several GEO/ArrayExpress series IDs. One ID per line in file. Note: The GEO Accession Viewer lets you export search results in this format. |
-d DATA_DIR, --data_dir DATA_DIR | filepath | Folder where to save data (and read the ID list file). |
-c, --control | bool | If flagged, this will only save samples that have the word "control" in their meta data. |
-k KEYWORD, --keyword KEYWORD | string | Only retain samples that include this keyword (e.g. blood) somewhere in their meta data. |
-e, --export | bool | If passed, saves raw processing file data for each sample. (unlike meth-process, this is off by default) |
-b, --betas | bool | If passed, output returns a dataframe of beta values for samples x probes. Local file beta_values.npy is also created. |
-m, --m_value | bool | If passed, output returns a dataframe of M-values for samples x probes. Local file m_values.npy is also created. |
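The effect of the --control and --keyword filters can be pictured as a text search across each sample's meta data. The sketch below is one plausible reading of the descriptions in the table, not the package's actual logic: it assumes the meta data is available as a dictionary per sample and that matching is case-insensitive. The function name and the example fields are hypothetical.

```python
def keep_sample(meta, keyword=None, control_only=False):
    """Decide whether a sample's meta data passes the filters (illustrative)."""
    # Flatten all meta data values into one lower-case string to search in.
    text = " ".join(str(value) for value in meta.values()).lower()
    if control_only and "control" not in text:
        return False
    if keyword and keyword.lower() not in text:
        return False
    return True


if __name__ == "__main__":
    # Made-up meta data records, for illustration only.
    samples = [
        {"title": "Control 1", "source": "whole blood"},
        {"title": "Case 7", "source": "whole blood"},
        {"title": "Control 2", "source": "saliva"},
    ]
    kept = [s["title"] for s in samples
            if keep_sample(s, keyword="blood", control_only=True)]
    print(kept)  # ['Control 1']
```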
Function to check for new datasets on GEO and update a CSV each time it is run. Usable as a weekly cron command line function. Saves data to a local CSV to compare with old datasets in _meta.csv. Saves the dates of each dataset from GEO, adds any new ones as new rows, and updates the CSV.
Argument | Type | Description |
---|---|---|
keyword | string | Specify a word or phrase to narrow the search, such as "spleen blood". |
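The compare-and-append step described for this command can be sketched with the standard `csv` module. This is only an illustration of the idea: the column names `series_id` and `date`, the helper names, and the IDs `GSE1` and `GSE2` are placeholders, and the real layout of the CSV that methylprep writes may differ.

```python
import csv
import io


def find_new_rows(old_csv_text, fetched_rows, id_field="series_id"):
    """Return the fetched rows whose ID is not yet present in the old CSV."""
    known = {row[id_field] for row in csv.DictReader(io.StringIO(old_csv_text))}
    return [row for row in fetched_rows if row[id_field] not in known]


def append_rows(old_csv_text, new_rows, fieldnames):
    """Return the CSV text with new_rows appended (header added if empty)."""
    out = io.StringIO()
    out.write(old_csv_text)
    if old_csv_text and not old_csv_text.endswith("\n"):
        out.write("\n")
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    if not old_csv_text:
        writer.writeheader()
    writer.writerows(new_rows)
    return out.getvalue()


if __name__ == "__main__":
    old = "series_id,date\nGSE1,2019-01-01\n"  # placeholder contents
    fetched = [
        {"series_id": "GSE1", "date": "2019-01-01"},
        {"series_id": "GSE2", "date": "2019-02-01"},
    ]
    new = find_new_rows(old, fetched)
    print(append_rows(old, new, ["series_id", "date"]))
```

Keying the comparison on the series ID keeps the update idempotent, so running the job twice in a week does not duplicate rows.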