-
Notifications
You must be signed in to change notification settings - Fork 14
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
* added batch_size parameter to run_pipeline * added CLI functionality * batch_size python/CLI and tests * removed test; changed default behavior: won't raise error if file-to-be-downloaded already exists * Update setup.py * Update test_batch_size.py * Rename test_batch_size.py to test_pipeline_batch_size.py * dropped redundant tests and sped up one * Feature/public data (#21) * download command, as well as some batch_size adjustments * fixed string issue * renaming update and removed redundant tests * bs4 required for Array ingester * tests * workaround to return objects with batch size changes * workaround to return objects with batch size changes * bug * tests pass for batch_size * version 1.1 (#22) * download command, as well as some batch_size adjustments * fixed string issue * renaming update and removed redundant tests * bs4 required for Array ingester * tests * workaround to return objects with batch size changes * workaround to return objects with batch size changes * bug * tests pass for batch_size * progress bars * documenting `download` * Update cli.py * restore sample_name filter * added rawMetaDataset class and moved get_sample_sheet_s3 to more logical place here (#24) * updated docs for 1.1.1 * Update README.md * Update setup.py * exposed create_sample_sheet and download no_clean options * manifest file download in lambda * manifest file download in lambda * manifest file download in lambda * v1.1.3 bump bug fix * handles blank sample_name and ensures names are unique. * Update setup.py * geo downloader tweaks, fixed docs * minor tweaks to sample_sheet parser * v1.1.8: CLI retain --uncorrected mean prob values; sample_sheet sample_type sample_sub_type; sample_sheet accepts alt sentrix column headers * v1.1.8: CLI retain --uncorrected mean prob values; sample_sheet sample_type sample_sub_type; sample_sheet accepts alt sentrix column headers * v1.1.8 * v1.1.9 minor bug fix to alt filename * bug fix: sample QC control status * v1.1.11 generates meta_data pickle * Update config.yml * Update config.yml * Update config.yml * Update config.yml * coveralls * bug fix * v1.1.14 smarter meta_data cli option * Create faq.md * Update faq.md * Update faq.md * Update faq.md * reworking download and meta_data to be more robust * Update faq.md * Update faq.md * downloader warns if idats aren't there; smarter meta_data * minor
- Loading branch information
Marc Maxmeister
authored
Nov 13, 2019
1 parent
59d8691
commit 406dbb1
Showing
13 changed files
with
842 additions
and
115 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,193 @@ | ||
# Tips and FAQs for the methylation suite | ||
|
||
## loading processed data | ||
|
||
If you have a small data set (under 200 samples), and did not use the `batch_size` option: | ||
|
||
```python | ||
import pandas as pd | ||
data = pd.read_pickle('{path_to_file}/beta_values.pkl') | ||
``` | ||
|
||
Otherwise, loading hundreds of samples can be slow. But there's a helper function that loads a bunch of smaller | ||
batches into one dataframe, if they were processed with methylprep's `batch_size` option: | ||
|
||
```python | ||
import methylize | ||
data = methylize.load(path_to_file) | ||
``` | ||
That will load 1600 samples in 8 seconds, compared to taking minutes if you `read_pickle` them and `concat` them using pandas. | ||
|
||
### Processing Notes | ||
|
||
1. DataFrames are the format that methylcheck/methylize functions expect. For saving and loading, the probe names belong in columns and the sample names belong in the index. | ||
But when processing data, some functions auto-transpose this to decrease processing time. | ||
2. By default, `methylprep process` also creates a file called `sample_sheet_meta_data.pkl` from various data sources | ||
- if there is a GEO series MINiML xml file, it reads this data preferentially | ||
- if there is a samplesheet, it will convert this to a meta DataFrame | ||
- if these sources are missing required data, such as Sentrix_Position and Sentrix_ID, it will look for idat files and | ||
read these from the filenames. | ||
|
||
## idat filenames | ||
- There are two acceptable formats: | ||
- `<GSM_ID>_<Sentrix_ID>_<Sentrix_Position>_<Red|Grn>.idat<.gz>` | ||
- `<Sentrix_ID>_<Sentrix_Position>_<Red|Grn>.idat<.gz>` | ||
- Methylprep will convert `.gz` files to `.idat` uncompressed files when processing. | ||
- methylprep does not recognize the older 27k array filename format: | ||
`<GSM_ID>_<Sentrix_ID>_<Sentrix_Position>_<SOME_LETTER>.idat` | ||
|
||
## Why didn't the `methylprep download` function work for GEO dataset GSEnnn? | ||
A significant number of GEO datasets do not store their data in a consistent format. Here are some reasons a GEO dataset fails to download: | ||
|
||
1. `idat` filename format is off (missing R00C00 position) | ||
2. no raw idats in zip, only processed data | ||
3. The meta data in MINiML format xml file is incomplete. (There are ways to make it work without meta data) | ||
4. the NIH GEO FTP server is down (yes, we've experienced this whilst testing too) | ||
5. `idat` files in dataset have varying number of probes. (If a dataset combines results from two array types (EPIC and 450k), it can sometimes split the data into two sample sheets and you can process each folder separately.) | ||
|
||
We've put a lot of effort into giving you clear error messages, detailing why each one failed. | ||
|
||
## What if `methylprep download` fails. How do I use the data anyway? | ||
1. Download the raw idat zipfile manually to a folder. | ||
2. Uncompress it. | ||
3. Confirm that there are `.idat` files present. | ||
4. If the files end in `.idat.gz`, gunzip them first. (In a mac/linux bash window you can navigate to the folder and type `gunzip *` to uncompress all those files. On a PC, use some software like `7zip` or `winzip`.) | ||
5. IF the filenames don't include Sentrix IDs and Sentrix array positions, like `<GSM ID>_<some long number>_<R01C01>_<Red or Grn>.idat`, you'll need to manually edit the samplesheet to match the files. | ||
6. Run `methylprep process` on the folder, possibly with the `--no_sample_sheet` option. It should work, but you won't have any of the sample meta data bundled for you for analysis with datasets in nonstandard formats. | ||
|
||
## How to process only part of a GEO dataset | ||
|
||
The `methylprep meta_data` command line interface (CLI) option allows you to specify which samples to process using keyword matching against the GEO meta data. You can use this before `methylprep download` and `methylprep process` to create smaller data sets, faster. | ||
|
||
### Examples | ||
|
||
#### (1) Grab a samplesheet for a GEO data set | ||
|
||
```python | ||
python -m methylprep meta_data -i GSE125105 | ||
``` | ||
Yields three files on local disk: | ||
|
||
- GSE125105_family.xml (the original GEO meta data file in MINiML format) | ||
- GSE125105_GPL13534_samplesheet.csv (used for processing) | ||
- GSE125105_GPL13534_meta_data.pkl (used in analysis to describe samples) | ||
|
||
| GSM_ID | Sample_Name | Sentrix_ID | Sentrix_Position | source | diagnosis | age | Sex | tissue | cellcount-cd8t | cellcount-cd4t | cellcount-nk | cellcount-bcell | cellcount-mono | cellcount-gran | description | | ||
|------------|----------------------------|------------|------------------|---------------------|-----------|-----|-----|-------------|----------------|----------------|--------------|-----------------|----------------|----------------|----------------------------| | ||
| GSM3562834 | genomic DNA from sample291 | 3999840035 | R01C01 | control_whole blood | control | 73 | F | whole blood | 0.07679 | 0.09099 | 0.06041 | 0.08542 | 0.09072 | 0.62266 | whole blood control sample | | ||
| GSM3562835 | genomic DNA from sample612 | 3999840035 | R01C02 | case_whole blood | case | 32 | M | whole blood | 0.05544 | 0.07946 | 0.0159 | 0.09557 | 0.05515 | 0.72663 | whole blood case sample | | ||
| GSM3562836 | genomic DNA from sample611 | 3999840035 | R02C01 | case_whole blood | case | 51 | F | whole blood | 0.08279 | 0.22216 | 0.03107 | 0.0769 | 0.07915 | 0.54165 | whole blood case sample | | ||
| GSM3562837 | genomic DNA from sample375 | 3999840035 | R02C02 | case_whole blood | case | 30 | M | whole blood | 0.03779 | 0.07368 | 0.00385 | 0.07548 | 0.0891 | 0.74809 | whole blood case sample | | ||
|
||
#### (2) Grab just the samplesheet and create a csv and dataframe pkl of it. | ||
|
||
```python | ||
python -m methylprep -v meta_data -i GSE84727 -d GSE84727 | ||
``` | ||
|
||
Verbose CLI output | ||
```python | ||
INFO:methylprep.download.miniml:Downloading GSE84727_family.xml | ||
INFO:methylprep.download.miniml:Downloaded GSE84727_family.xml | ||
INFO:methylprep.download.miniml:Unpacking GSE84727_family.xml | ||
INFO:methylprep.download.miniml:Downloaded and unpacked GSE84727 | ||
INFO:methylprep.download.miniml:MINiML file does not provide `methylprep_name` (sentrix_id_R00C00) for 847/847 samples. | ||
INFO:methylprep.download.miniml:dropped Sentrix_ID (empty) | ||
INFO:methylprep.download.miniml:dropped Sentrix_Position (empty) | ||
INFO:methylprep.download.miniml:source == description; dropping source | ||
INFO:methylprep.download.miniml:dropped `platform` ({'GPL13534'}) | ||
INFO:methylprep.download.miniml:title == Sample_Name; dropping title | ||
INFO:methylprep.download.miniml:Final samplesheet contains 847 rows and 7 columns | ||
INFO:methylprep.download.miniml:['GSM_ID', 'Sample_Name', 'sentrixids', 'Sex', 'age', 'disease_status', 'description'] | ||
``` | ||
Resulting samplesheet | ||
|
||
|GSM_ID |Sample_Name |sentrixids |Sex |age |disease_status |description| | ||
|-------|---------------|-------|--------|-------|-----------|--------------| | ||
|GSM2250273 |33262604 |3998567027_R01C01 |M |47.3 |1 |control blood| | ||
|GSM2250274 |33261623 |3998567027_R02C01 |M |60.4 |1 |control blood| | ||
|GSM2250275 |33262614 |3998567027_R04C01 |M |30.1 |1 |control blood| | ||
|
||
|
||
You'll notice that one column `sentrixids` is misnamed. It should be split into `Sentrix_Position` and `Sentrix_ID` columns for processing to work on this GEO series. | ||
You can edit the csv and fix that prior to running the pipeline with `methylprep process`. If you don't, you'll get an error saying, "methylprep could not find the samplesheet." This error is caused by the researcher putting an arbitrary name into the GSE84727_family.xml `MINiML` meta data file: | ||
|
||
``` | ||
<Characteristics tag="sentrixids"> | ||
3998567027_R02C02 | ||
</Characteristics> | ||
``` | ||
If you use the `methylprep download` option by itself, it can generally avoid this type of XML parsing error, but it will download everything. Doing analysis on just part of a dataset requires some debugging like this. | ||
|
||
|
||
#### (3) Samplesheet with only "normal" samples | ||
```python | ||
python -m methylprep -v meta_data -i GSE52270 -d GSE52270 -k normal | ||
|
||
``` | ||
|
||
`-k` is shorthand for `--keyword`. The resulting sample sheet only includes samples that include the keyword `normal` | ||
|
||
|GSM_ID | Sample_Name |source |disease state |description| | ||
|-------|---------------|-------|---------------|-----------| | ||
|GSM1278809 |Colon 61 |Large Intestine |normal |1011N| | ||
|GSM1278812 |Colon 64 |Large Intestine |normal |1082N| | ||
|GSM1278823 |Colon 75 |Large Intestine |normal |1184N| | ||
|GSM1278825 |White matter 77 |Central Nervous System |normal |12_03| | ||
|
||
|
||
|
||
#### (4) Generate filtered samplesheet with only control samples from blood | ||
|
||
``` | ||
python -m methylprep meta_data -i GSE125105 -d GSE125105 --control -k blood | ||
``` | ||
|
||
| GSM_ID | Sample_Name | Sentrix_ID | Sentrix_Position | source | diagnosis | age | Sex | tissue | cellcount-cd8t | cellcount-cd4t | cellcount-nk | cellcount-bcell | cellcount-mono | cellcount-gran | description | | ||
|------------|----------------------------|------------|------------------|---------------------|-----------|-----|-----|-------------|----------------|----------------|--------------|-----------------|----------------|----------------|----------------------------| | ||
| GSM3562834 | genomic DNA from sample291 | 3999840035 | R01C01 | control_whole blood | control | 73 | F | whole blood | 0.07679 | 0.09099 | 0.06041 | 0.08542 | 0.09072 | 0.62266 | whole blood control sample | | ||
| GSM3562839 | genomic DNA from sample176 | 3999840035 | R03C02 | control_whole blood | control | 43 | M | whole blood | 0.06946 | 0.12989 | 0.04703 | 0.09808 | 0.14105 | 0.54662 | whole blood control sample | | ||
| GSM3562842 | genomic DNA from sample161 | 3999840035 | R05C01 | control_whole blood | control | 44 | F | whole blood | 0.10986 | 0.13565 | 0.07657 | 0.09125 | 0.12521 | 0.49317 | whole blood control sample | | ||
| GSM3562846 | genomic DNA from sample270 | 3999840037 | R01C01 | control_whole blood | control | 64 | M | whole blood | 0.11508 | 0.14116 | 0.0679 | 0.09415 | 0.15311 | 0.47829 | whole blood control sample | | ||
| GSM3562855 | genomic DNA from sample162 | 3999840037 | R06C01 | control_whole blood | control | 65 | M | whole blood | 0.01668 | 0.14318 | 0.1096 | 0.05545 | 0.09695 | 0.60283 | whole blood control sample | | ||
|
||
This only retains 211 of the 699 samples in a samplesheet. Next, you download the `.idat` files with with `methylprep download` and then remove the `idat` files you won't need like this: | ||
|
||
```python | ||
python -m methylprep -v download -i GSE125105 -d GSE125105 | ||
python -m methylprep -v meta_data -i GSE125105 -d GSE125105 --sync_idats --control -k blood | ||
``` | ||
|
||
And then process the 6.1GB file using this samplesheet, like this: | ||
```python | ||
python -m methylprep -v process -d GSE125105 --betas --m_value -e | ||
``` | ||
Those options will create two big files. One is a dataframe of beta_values for each sample. The other, m_values for each sample (kind of the same thing, but sometimes you want m_values). the `-e` or `--no_export` option will suppress the function creating files of probe values for each sample, as these are not needed by most `methylize` and `methylcheck` functions. There is also a `--save_uncorrected` option that prevents any sort of background and NOOB signal enhancement during processing. Uncorrected files are needed for a few analysis functions, namely `p-value probe detection`. | ||
|
||
In general, partial-dataset processing fails because the meta data for a GEO dataset is incomplete. Either the array positions are missing, or misnamed. Careful checking can allow one to fix this and build a large data set from multiple GEO datasets. | ||
|
||
#### (4b) Another condensed example of downloading GEO data and only processing control samples | ||
|
||
```python | ||
python -m methylprep -v download -i GSE130030 -d GSE130030 | ||
# next, remove the treatment samples using `-c` and remove extra idats with `-s` | ||
python -m methylprep -v meta_data -i GSE130030 -d GSE130030 --control -s | ||
# finally, process it | ||
python -m methylprep -v process -d GSE130030 --betas --m_value --no_export | ||
``` | ||
|
||
This creates two files, `beta_values.pkl` and `GSE130030_GPL13534_meta_data.pkl`, that you can work with in `methylize` like this: | ||
|
||
Navigate to the `GSE130030` folder created by `methylrep`, and start a python interpreter: | ||
```python | ||
>>>import methylize | ||
>>>data,meta = methylize.load_both() | ||
INFO:methylize.helpers:Found several meta_data files; using: GSE130030_GPL13534_meta_data.pkl) | ||
INFO:methylize.helpers:loaded data (485512, 14) from 1 pickled files (0.159s) | ||
INFO:methylize.helpers:meta.Sample_IDs match data.index (OK) | ||
``` | ||
Or if you are running in a notebook, specify the full path: | ||
```python | ||
import methylize | ||
data,meta = methylize.load_both('<path_to...>/GSE105018') | ||
``` |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.