Feature/geometa (#37)
* added batch_size parameter to run_pipeline

* added CLI functionality

* batch_size python/CLI and tests

* removed test; changed default behavior: won't raise error if file-to-be-downloaded already exists

* Update setup.py

* Update test_batch_size.py

* Rename test_batch_size.py to test_pipeline_batch_size.py

* dropped redundant tests and sped up one

* Feature/public data (#21)

* download command, as well as some batch_size adjustments

* fixed string issue

* renaming update and removed redundant tests

* bs4 required for Array ingester

* tests

* workaround to return objects with batch size changes

* workaround to return objects with batch size changes

* bug

* tests pass for batch_size

* version 1.1 (#22)

* download command, as well as some batch_size adjustments

* fixed string issue

* renaming update and removed redundant tests

* bs4 required for Array ingester

* tests

* workaround to return objects with batch size changes

* workaround to return objects with batch size changes

* bug

* tests pass for batch_size

* progress bars

* documenting `download`

* Update cli.py

* restore sample_name filter

* added rawMetaDataset class and moved get_sample_sheet_s3 to more logical place here (#24)

* updated docs for 1.1.1

* Update README.md

* Update setup.py

* exposed create_sample_sheet and download no_clean options

* manifest file download in lambda

* manifest file download in lambda

* manifest file download in lambda

* v1.1.3 bump bug fix

* handles blank sample_name and ensures names are unique.

* Update setup.py

* geo downloader tweaks, fixed docs

* minor tweaks to sample_sheet parser

* v1.1.8: CLI retain --uncorrected mean prob values; sample_sheet sample_type sample_sub_type; sample_sheet accepts alt sentrix column headers

* v1.1.8: CLI retain --uncorrected mean prob values; sample_sheet sample_type sample_sub_type; sample_sheet accepts alt sentrix column headers

* v1.1.8

* v1.1.9 minor bug fix to alt filename

* bug fix: sample QC control status

* v1.1.11 generates meta_data pickle

* Update config.yml

* Update config.yml

* Update config.yml

* Update config.yml

* coveralls

* bug fix

* v1.1.14 smarter meta_data cli option

* Create faq.md

* Update faq.md

* Update faq.md

* Update faq.md

* reworking download and meta_data to be more robust

* Update faq.md

* Update faq.md

* downloader warns if idats aren't there; smarter meta_data

* minor
Marc Maxmeister authored Nov 13, 2019
1 parent 59d8691 commit 406dbb1
Showing 13 changed files with 842 additions and 115 deletions.
193 changes: 193 additions & 0 deletions docs/faq.md
@@ -0,0 +1,193 @@
# Tips and FAQs for the methylation suite

## loading processed data

If you have a small data set (under 200 samples) and did not use the `batch_size` option:

```python
import pandas as pd
data = pd.read_pickle('{path_to_file}/beta_values.pkl')
```

Otherwise, loading hundreds of samples can be slow. If they were processed with methylprep's `batch_size` option, a helper
function in `methylize` loads the smaller batches into one DataFrame:

```python
import methylize
data = methylize.load(path_to_file)
```
That loads 1600 samples in about 8 seconds, compared to the minutes it takes to `read_pickle` and `concat` the batches with pandas.
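
For comparison, the manual pandas route looks roughly like this (a sketch only; the batch filenames `beta_values_1.pkl`, `beta_values_2.pkl`, ... and the samples-as-columns layout are assumptions, not guaranteed by methylprep):

```python
import pandas as pd
from pathlib import Path

# gather the batch pickles written when batch_size was set (names assumed)
batches = sorted(Path('{path_to_file}').glob('beta_values_*.pkl'))
# each batch holds a different set of samples, so join along the sample axis
data = pd.concat([pd.read_pickle(b) for b in batches], axis=1)
```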

### Processing Notes

1. DataFrames are the format that methylcheck/methylize functions expect. For saving and loading, the probe names belong in columns and the sample names belong in the index.
But when processing data, some functions auto-transpose this to decrease processing time.
2. By default, `methylprep process` also creates a file called `sample_sheet_meta_data.pkl` from various data sources:
- if there is a GEO series MINiML xml file, it reads this data preferentially
- if there is a samplesheet, it will convert this to a meta DataFrame
- if these sources are missing required data, such as Sentrix_Position and Sentrix_ID, it will look for idat files and
read these from the filenames.
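
If you are unsure which orientation a saved DataFrame has, a quick check and transpose looks like this (a sketch; it assumes Illumina CpG probe IDs, which start with `cg`):

```python
import pandas as pd

data = pd.read_pickle('beta_values.pkl')
# if probe IDs (e.g. 'cg00000029') are in the index, transpose so that
# probes end up in the columns and sample names in the index
if str(data.index[0]).startswith('cg'):
    data = data.T
```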

## idat filenames
- There are two acceptable formats:
- `<GSM_ID>_<Sentrix_ID>_<Sentrix_Position>_<Red|Grn>.idat<.gz>`
- `<Sentrix_ID>_<Sentrix_Position>_<Red|Grn>.idat<.gz>`
- methylprep will uncompress `.gz` files to `.idat` when processing.
- methylprep does not recognize the older 27k array filename format:
`<GSM_ID>_<Sentrix_ID>_<Sentrix_Position>_<SOME_LETTER>.idat`
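
For reference, both accepted patterns can be written as one regular expression. This is an illustrative sketch, not methylprep's actual parser:

```python
import re

IDAT_RE = re.compile(
    r'^(?:GSM\d+_)?'       # optional <GSM_ID>_
    r'\d+_'                # <Sentrix_ID>_
    r'R\d{2}C\d{2}_'       # <Sentrix_Position>_
    r'(?:Red|Grn)'         # channel
    r'\.idat(?:\.gz)?$'
)

assert IDAT_RE.match('GSM3562834_3999840035_R01C01_Grn.idat')
assert IDAT_RE.match('3999840035_R01C01_Red.idat.gz')
```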

## Why didn't the `methylprep download` function work for GEO dataset GSEnnn?
A significant number of GEO datasets do not store their data in a consistent format. Here are some reasons a GEO dataset fails to download:

1. `idat` filename format is off (missing R00C00 position)
2. no raw idats in zip, only processed data
3. The meta data in the MINiML-format XML file is incomplete. (There are ways to make it work without meta data.)
4. the NIH GEO FTP server is down (yes, we've experienced this whilst testing too)
5. `idat` files in the dataset have varying numbers of probes. (If a dataset combines results from two array types, such as EPIC and 450k, the data can sometimes be split into two sample sheets so each folder can be processed separately.)

We've put a lot of effort into clear error messages that explain why each download failed.

## What if `methylprep download` fails? How do I use the data anyway?
1. Download the raw idat zipfile manually to a folder.
2. Uncompress it.
3. Confirm that there are `.idat` files present.
4. If the files end in `.idat.gz`, gunzip them first. (In a mac/linux bash window you can navigate to the folder and type `gunzip *` to uncompress all those files. On a PC, use some software like `7zip` or `winzip`.)
5. If the filenames don't include Sentrix IDs and array positions (that is, they don't match the pattern `<GSM_ID>_<Sentrix_ID>_<R01C01>_<Red|Grn>.idat`), you'll need to manually edit the samplesheet to match the files.
6. Run `methylprep process` on the folder, possibly with the `--no_sample_sheet` option. It should work, but with datasets in nonstandard formats you won't get the sample meta data bundled for analysis.
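
For example, a minimal invocation (the folder name is a placeholder):

```
python -m methylprep -v process -d <folder_with_idats> --no_sample_sheet --betas
```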

## How to process only part of a GEO dataset

The `methylprep meta_data` command line interface (CLI) option allows you to specify which samples to process using keyword matching against the GEO meta data. You can use this before `methylprep download` and `methylprep process` to create smaller data sets, faster.

### Examples

#### (1) Grab a samplesheet for a GEO data set

```
python -m methylprep meta_data -i GSE125105
```
Yields three files on local disk:

- GSE125105_family.xml (the original GEO meta data file in MINiML format)
- GSE125105_GPL13534_samplesheet.csv (used for processing)
- GSE125105_GPL13534_meta_data.pkl (used in analysis to describe samples)

| GSM_ID | Sample_Name | Sentrix_ID | Sentrix_Position | source | diagnosis | age | Sex | tissue | cellcount-cd8t | cellcount-cd4t | cellcount-nk | cellcount-bcell | cellcount-mono | cellcount-gran | description |
|------------|----------------------------|------------|------------------|---------------------|-----------|-----|-----|-------------|----------------|----------------|--------------|-----------------|----------------|----------------|----------------------------|
| GSM3562834 | genomic DNA from sample291 | 3999840035 | R01C01 | control_whole blood | control | 73 | F | whole blood | 0.07679 | 0.09099 | 0.06041 | 0.08542 | 0.09072 | 0.62266 | whole blood control sample |
| GSM3562835 | genomic DNA from sample612 | 3999840035 | R01C02 | case_whole blood | case | 32 | M | whole blood | 0.05544 | 0.07946 | 0.0159 | 0.09557 | 0.05515 | 0.72663 | whole blood case sample |
| GSM3562836 | genomic DNA from sample611 | 3999840035 | R02C01 | case_whole blood | case | 51 | F | whole blood | 0.08279 | 0.22216 | 0.03107 | 0.0769 | 0.07915 | 0.54165 | whole blood case sample |
| GSM3562837 | genomic DNA from sample375 | 3999840035 | R02C02 | case_whole blood | case | 30 | M | whole blood | 0.03779 | 0.07368 | 0.00385 | 0.07548 | 0.0891 | 0.74809 | whole blood case sample |
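
The meta_data `.pkl` is just a pickled pandas DataFrame, so you can inspect it directly (a sketch, using the filename from the listing above):

```python
import pandas as pd

meta = pd.read_pickle('GSE125105_GPL13534_meta_data.pkl')
print(meta.shape)                        # rows are samples, columns are meta data fields
print(meta['diagnosis'].value_counts())  # e.g. case vs. control counts
```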

#### (2) Grab just the samplesheet and create a csv and dataframe pkl of it.

```
python -m methylprep -v meta_data -i GSE84727 -d GSE84727
```

Verbose CLI output:
```
INFO:methylprep.download.miniml:Downloading GSE84727_family.xml
INFO:methylprep.download.miniml:Downloaded GSE84727_family.xml
INFO:methylprep.download.miniml:Unpacking GSE84727_family.xml
INFO:methylprep.download.miniml:Downloaded and unpacked GSE84727
INFO:methylprep.download.miniml:MINiML file does not provide `methylprep_name` (sentrix_id_R00C00) for 847/847 samples.
INFO:methylprep.download.miniml:dropped Sentrix_ID (empty)
INFO:methylprep.download.miniml:dropped Sentrix_Position (empty)
INFO:methylprep.download.miniml:source == description; dropping source
INFO:methylprep.download.miniml:dropped `platform` ({'GPL13534'})
INFO:methylprep.download.miniml:title == Sample_Name; dropping title
INFO:methylprep.download.miniml:Final samplesheet contains 847 rows and 7 columns
INFO:methylprep.download.miniml:['GSM_ID', 'Sample_Name', 'sentrixids', 'Sex', 'age', 'disease_status', 'description']
```
Resulting samplesheet:

|GSM_ID |Sample_Name |sentrixids |Sex |age |disease_status |description|
|-------|---------------|-------|--------|-------|-----------|--------------|
|GSM2250273 |33262604 |3998567027_R01C01 |M |47.3 |1 |control blood|
|GSM2250274 |33261623 |3998567027_R02C01 |M |60.4 |1 |control blood|
|GSM2250275 |33262614 |3998567027_R04C01 |M |30.1 |1 |control blood|


You'll notice that the column `sentrixids` is misnamed. It needs to be split into `Sentrix_ID` and `Sentrix_Position` columns for processing to work on this GEO series.
You can edit the csv to fix that before running the pipeline with `methylprep process`. If you don't, you'll get an error saying, "methylprep could not find the samplesheet." The error occurs because the researcher put an arbitrary tag name into the GSE84727_family.xml `MINiML` meta data file:

```
<Characteristics tag="sentrixids">
3998567027_R02C02
</Characteristics>
```
If you use the `methylprep download` option by itself, it generally avoids this type of XML parsing error, but it downloads everything. Doing analysis on just part of a dataset requires some debugging like this; a sketch of the fix is below.
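
A minimal sketch of that fix with pandas (the csv path follows example 2 above; the column layout is taken from the table shown):

```python
import pandas as pd

sheet = pd.read_csv('GSE84727/GSE84727_GPL13534_samplesheet.csv')
# values like '3998567027_R02C02' split into Sentrix_ID and Sentrix_Position
sheet[['Sentrix_ID', 'Sentrix_Position']] = sheet['sentrixids'].str.split('_', expand=True)
sheet = sheet.drop(columns=['sentrixids'])
sheet.to_csv('GSE84727/GSE84727_GPL13534_samplesheet.csv', index=False)
```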


#### (3) Samplesheet with only "normal" samples
```
python -m methylprep -v meta_data -i GSE52270 -d GSE52270 -k normal
```

`-k` is shorthand for `--keyword`. The resulting sample sheet only includes samples whose meta data contains the keyword `normal`:

|GSM_ID | Sample_Name |source |disease state |description|
|-------|---------------|-------|---------------|-----------|
|GSM1278809 |Colon 61 |Large Intestine |normal |1011N|
|GSM1278812 |Colon 64 |Large Intestine |normal |1082N|
|GSM1278823 |Colon 75 |Large Intestine |normal |1184N|
|GSM1278825 |White matter 77 |Central Nervous System |normal |12_03|



#### (4) Generate filtered samplesheet with only control samples from blood

```
python -m methylprep meta_data -i GSE125105 -d GSE125105 --control -k blood
```

| GSM_ID | Sample_Name | Sentrix_ID | Sentrix_Position | source | diagnosis | age | Sex | tissue | cellcount-cd8t | cellcount-cd4t | cellcount-nk | cellcount-bcell | cellcount-mono | cellcount-gran | description |
|------------|----------------------------|------------|------------------|---------------------|-----------|-----|-----|-------------|----------------|----------------|--------------|-----------------|----------------|----------------|----------------------------|
| GSM3562834 | genomic DNA from sample291 | 3999840035 | R01C01 | control_whole blood | control | 73 | F | whole blood | 0.07679 | 0.09099 | 0.06041 | 0.08542 | 0.09072 | 0.62266 | whole blood control sample |
| GSM3562839 | genomic DNA from sample176 | 3999840035 | R03C02 | control_whole blood | control | 43 | M | whole blood | 0.06946 | 0.12989 | 0.04703 | 0.09808 | 0.14105 | 0.54662 | whole blood control sample |
| GSM3562842 | genomic DNA from sample161 | 3999840035 | R05C01 | control_whole blood | control | 44 | F | whole blood | 0.10986 | 0.13565 | 0.07657 | 0.09125 | 0.12521 | 0.49317 | whole blood control sample |
| GSM3562846 | genomic DNA from sample270 | 3999840037 | R01C01 | control_whole blood | control | 64 | M | whole blood | 0.11508 | 0.14116 | 0.0679 | 0.09415 | 0.15311 | 0.47829 | whole blood control sample |
| GSM3562855 | genomic DNA from sample162 | 3999840037 | R06C01 | control_whole blood | control | 65 | M | whole blood | 0.01668 | 0.14318 | 0.1096 | 0.05545 | 0.09695 | 0.60283 | whole blood control sample |

This retains only 211 of the 699 samples in the samplesheet. Next, download the `.idat` files with `methylprep download`, then remove the `idat` files you won't need, like this:

```
python -m methylprep -v download -i GSE125105 -d GSE125105
python -m methylprep -v meta_data -i GSE125105 -d GSE125105 --sync_idats --control -k blood
```

And then process the 6.1GB file using this samplesheet, like this:
```
python -m methylprep -v process -d GSE125105 --betas --m_value -e
```
Those options create two large files: a DataFrame of beta values for each sample, and another of m-values for each sample (essentially the same information on a different scale, but some analyses want m-values). The `-e` (`--no_export`) option suppresses creating per-sample files of probe values, which most `methylize` and `methylcheck` functions don't need. There is also a `--save_uncorrected` option that disables background and NOOB signal corrections during processing; uncorrected values are needed by a few analysis functions, namely p-value probe detection.
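
For reference, m-values are the base-2 logit of beta values, M = log2(β / (1 − β)), so you can derive one from the other yourself (a sketch, assuming the beta pickle produced above):

```python
import numpy as np
import pandas as pd

betas = pd.read_pickle('GSE125105/beta_values.pkl')
# m-value is the base-2 logit of beta: M = log2(beta / (1 - beta))
m_values = np.log2(betas / (1 - betas))
```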

In general, partial-dataset processing fails because the meta data for a GEO dataset is incomplete: the array positions are missing or misnamed. Careful checking lets you fix this and build a large data set from multiple GEO datasets.

#### (4b) Another condensed example of downloading GEO data and only processing control samples

```
python -m methylprep -v download -i GSE130030 -d GSE130030
# next, remove the treatment samples using `-c` and remove extra idats with `-s`
python -m methylprep -v meta_data -i GSE130030 -d GSE130030 --control -s
# finally, process it
python -m methylprep -v process -d GSE130030 --betas --m_value --no_export
```

This creates two files, `beta_values.pkl` and `GSE130030_GPL13534_meta_data.pkl`, that you can work with in `methylize` like this:

Navigate to the `GSE130030` folder created by `methylprep`, and start a python interpreter:
```python
>>> import methylize
>>> data, meta = methylize.load_both()
INFO:methylize.helpers:Found several meta_data files; using: GSE130030_GPL13534_meta_data.pkl
INFO:methylize.helpers:loaded data (485512, 14) from 1 pickled files (0.159s)
INFO:methylize.helpers:meta.Sample_IDs match data.index (OK)
```
Or if you are running in a notebook, specify the full path:
```python
import methylize
data, meta = methylize.load_both('<path_to...>/GSE130030')
```
3 changes: 2 additions & 1 deletion methylprep/__init__.py
@@ -3,7 +3,7 @@
# App
from .files import get_sample_sheet, get_sample_sheet_s3
from .processing import get_manifest, get_raw_datasets, run_pipeline, consolidate_values_for_sheet
from .download import run_series, run_series_list
from .download import run_series, run_series_list, convert_miniml


getLogger(__name__).addHandler(NullHandler())
@@ -17,4 +17,5 @@
'consolidate_values_for_sheet',
'run_series',
'run_series_list',
'convert_miniml'
]
60 changes: 59 additions & 1 deletion methylprep/cli.py
@@ -9,7 +9,8 @@
from .processing import run_pipeline
from .download import (
run_series,
run_series_list
run_series_list,
convert_miniml
)


@@ -48,6 +49,9 @@ def build_parser():
download_parser = subparsers.add_parser('download', help='Downloads the specified series from GEO or ArrayExpress.')
download_parser.set_defaults(func=cli_download)

meta_parser = subparsers.add_parser('meta_data', help='Creates a meta_data dataframe from GEO MINiML XML file. Specify the GEO id.')
meta_parser.set_defaults(func=cli_meta_data)

parsed_args, func_args = parser.parse_known_args(sys.argv[1:])
if parsed_args.verbose:
logging.basicConfig(level=logging.DEBUG)
@@ -302,5 +306,59 @@ def cli_download(cmd_args):
else:
run_series_list(args.list, args.data_dir, dict_only=args.dict_only)


def cli_meta_data(cmd_args):
parser = DefaultParser(
prog='methylprep meta_data',
description="""A more feature-rich meta data parser for public MINiML GEO datasets.
Run this after downloading the dataset using `download` command.
This reads all the meta data from MINiML into a samplesheet.csv and meta data dataframe.
You can identify 'control' or samples containing a specific keyword (e.g. blood, tumor, etc) and remove
any samples from sheet that lack these criteria, and delete the associated idats that don't have these keywords.
After, run `process` on the rest, saving time. You can effectively ignore the parts of datasets that you don't need
based on the associated meta data."""
)
parser.add_argument(
'-i', '--id',
required=True,
help='Unique ID of the series (the GEO GSExxxx ID)',
)
parser.add_argument(
'-d', '--data_dir',
required=False,
type=Path,
default='.',
help='Directory to search for MINiML file.',
)
parser.add_argument(
'-c', '--control',
required=False,
action="store_true",
help='[experimental]: If flagged, this will look at the sample sheet and only save samples that appear to be "controls".',
)
parser.add_argument(
'-k', '--keyword',
required=False,
default=None,
type=str,
help='[experimental]: Retain samples that include this keyword (e.g. blood, case insensitive) somewhere in samplesheet values.',
)
parser.add_argument(
'-s', '--sync_idats',
required=False,
action="store_true",
help="[experimental]: If flagged, this will scan the `data_dir` and remove all idat files that are not in the filtered samplesheet, so they won't be processed.",
)
args = parser.parse_args(cmd_args)
if not args.id:
raise KeyError("You must supply a GEO id like `GSE123456`.")
convert_miniml(
args.id,
data_dir=args.data_dir,
extract_controls=args.control,
require_keyword=args.keyword,
sync_idats=args.sync_idats)


def cli_app():
build_parser()
4 changes: 3 additions & 1 deletion methylprep/download/__init__.py
@@ -2,8 +2,10 @@
run_series,
run_series_list
)
from .miniml import convert_miniml

__all__ = [
'run_series',
'run_series_list',
]
'convert_miniml',
]
13 changes: 8 additions & 5 deletions methylprep/download/array_express.py
@@ -26,6 +26,7 @@ def ae_download(ae_id, series_path, ae_platforms, clean=True):
the list of supported ArrayExpress platforms
clean
whether or not to delete files once they are no longer need (True by default)"""
success = True
series_dir = Path(series_path)
sdrf_filename = f"{ae_id}.sdrf.txt"

@@ -36,9 +37,9 @@
if not os.path.exists(f"{series_path}/{platform}"):
os.mkdir(f"{series_path}/{platform}")

ftp = FTP('ftp.ebi.ac.uk')
ftp = FTP('ftp.ebi.ac.uk', timeout=59)
ftp.login()
ae_split = ae_id.split("-")[1]
ae_split = ae_id.split("-")[1] # MTAB, GEOD, etc
ftp.cwd(f"/pub/databases/arrayexpress/data/experiment/{ae_split}/{ae_id}")

#LOGGER.info(f"Downloading {ae_id}")
@@ -48,14 +49,15 @@
if not os.path.exists(f"{series_path}/{file}"):
#LOGGER.info(f"Downloading {file} from ArrayExpress")
raw_file = open(f"{series_path}/{file}", 'wb')
filesize = ftp.size(f"suppl/{raw_filename}")
filesize = ftp.size(file)
try:
with tqdm(unit = 'b', unit_scale = True, leave = False, miniters = 1, desc = geo_id, total = filesize) as tqdm_instance:
def tqdm_callback(data):
tqdm_instance.update(len(data))
raw_file.write(data)
ftp.retrbinary(f"RETR {file}", tqdm_callback)
except Exception as e:
print(e)
LOGGER.info('tqdm: Failed to create a progress bar, but it is downloading...')
ftp.retrbinary(f"RETR {file}", raw_file.write)
raw_file.close()
@@ -76,8 +78,8 @@ def tqdm_callback(data):
sdrf_file.close()

LOGGER.info(f"Downloaded and unpacked {ae_id}")

ftp.quit()
return success


def ae_metadata(ae_id, series_path, ae_platforms, path):
Expand All @@ -95,6 +97,7 @@ def ae_metadata(ae_id, series_path, ae_platforms, path):
Returns:
A list of platforms that the series contains samples of"""
pipeline_kwargs = {}
with open(f"{series_path}/{ae_id}.sdrf.txt") as fp:
reader = csv.reader(fp, delimiter="\t")
d = list(reader)
@@ -139,7 +142,7 @@ def ae_metadata(ae_id, series_path, ae_platforms, path):
if platform not in seen_platforms:
seen_platforms.append(platform)

return seen_platforms
return seen_platforms, pipeline_kwargs


def sample_sheet_from_sdrf(ae_id, series_path, platform, platform_samples_dict):