Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/mouse #53

Merged
merged 13 commits into from
Apr 4, 2020
Prev Previous commit
Next Next commit
v1.2.4 tweaks
  • Loading branch information
marcmaxson committed Mar 10, 2020
commit ed823c67362ee4e0983d54804499ee6380b0f7a7
41 changes: 35 additions & 6 deletions docs/methylprep_tutorial.md
Original file line number Diff line number Diff line change
Expand Up @@ -24,7 +24,7 @@ The 27K array measures more than 27,000 CpG positions, the 450K array measures m

### Some definitions

Two common measures are used to report the methylation levels: Beta values and M values [[3]](#du).
Two common measures of methylation are **Beta values** and **M values** [[3]](#du). M values range from 0 to any number (they are unbounded), so they are more sensitive for statistical analysis (ie, no ceiling/floor effects), but they're impossible to interpret. Normalized M-values can be used for statistical analyses (see https://www.nature.com/articles/s41598-020-60758-0). Beta values are easier to understand, practically speaking. They range from 0 to 1, and represent the proportion of methylated CpG sites in the sample.

**Beta value:**

Expand Down Expand Up @@ -78,7 +78,14 @@ $ python3 -m methylprep process -d "docs/example_data/GSE69852/"
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:36<00:00, 18.08s/it]
```

The `process` command by default outputs each sample into its own CSV file. Each row of the CSV files contains one of the sample's probes and the columns are the probe's Illumina probe ID, NOOB adjusted methyled value, NOOB adjusted unmethylated value, M value, and Beta value.
The `process` command by default outputs each sample into its own CSV file. Each row of the CSV files contains one of the sample's probes and the columns are the probe's Illumina probe ID, NOOB adjusted methylated value, NOOB adjusted unmethylated value, M value, and Beta value.

There are situations when researchers want additional measures from array data. If they want to run the quality control analysis in the `methylcheck` package, they will need data from Illumina's internal control probes and the uncorrected fluorescence intensity, representing the ratio of methylated to unmethylated sites. The easiest way to collect everything is to ask for everything with the `--all` flag:

```bash
$ python3 -m methylprep process -d "path/to/example/GSE123456/" --all
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 168/168 [51:36<00:00, 18.08s/it]
```

For the remaining **methylprep** examples we provide the `-v` option, which stands for verbose. This causes the program to output additional information about the processing of samples.

Expand Down Expand Up @@ -111,7 +118,7 @@ INFO:methylprep.processing.raw_dataset:Preprocessing Red foreground controls dat
INFO:methylprep.processing.pipeline:[!] Exported results (csv) to: {'docs/example_data/GSE69852/9247377093/9247377093_R02C01_processed.csv', 'docs/example_data/GSE69852/9247377085/9247377085_R04C02_processed.csv'}
```

To perform quality control using **methylprep** however, a `pandas` data frame where the rows contain different probes and the columns represent each of the samples is required; either beta or M values are stored for each probe/sample pair. To obtain this data frame, the user adds either a `--betas` or `--m_value` argument to `process`.
Most filtering and quality control steps in **methylcheck** require a `pandas` data frame, where the rows contain different probes and the columns represent each of the samples. **methylprep** generates these dataframes, if the user appends either a `--betas` or `--m_value` argument to the `process` command.

```bash
$ python3 -m methylprep -v process -d "docs/example_data/GSE69852/" --betas
Expand Down Expand Up @@ -171,11 +178,33 @@ INFO:methylprep.processing.raw_dataset:Preprocessing Red foreground controls dat
INFO:methylprep.processing.pipeline:saved m_values.pkl
```

The processed data is saved as a python pickle `pkl` file, which must be loaded for **methylcheck** to run. We use pickled dataframes instead of `csv`s or numpy arrays because the structure efficiently tracks meta data and allows functions to perform matrix operations on the whole array faster. [DataFrames](https://pandas.pydata.org/pandas-docs/stable/getting_started/dsintro.html) are part of both python and `R`.
The `--uncorrected` and `--save_control` flags work the same way, generating `unmeth_values.pkl`, `meth_values.pkl` and `control_probes.pkl` pickled pandas dataframes in the working folder, respectively. Also, the processed `csv` files will contain these additional columns for each sample. The output appears in two formats and two places to ensure that it can be exported into any format (via the CSV) and be further processed and analyzed using the most efficient high-performance data format (in this case, a pickled dataframe of numpy arrays).

#### file outputs and loading

The processed data is saved as a python pickle `pkl` file, which can be loaded in **methylcheck**. We recommend using pickled dataframes instead of `csv`s or numpy arrays because the structure efficiently tracks meta data and allows functions to perform matrix operations on the whole array faster. [DataFrames](https://pandas.pydata.org/pandas-docs/stable/getting_started/dsintro.html) are support both by python's `pandas`) and `R`'s' `tidyverse`.

Some large series of samples have caused **methylprep** to fail when available memory runs out. To avoid this, **methylprep** automatically turns on "batch processing" when there are more than 200 samples. If your computer has very little memory (4GB of RAM or less), you may need to run **methylprep process** with the `--batch_size` option set to something lower, like 25 samples. If you encounter any issues, please let us know by adding an issue to our [GitHub](https://github.com/LifeEGX/methylprep/issues).

The effect of running samples in small batches is that you get several dataframe pickles as output. To recombine them for analysis, use `methylprep.load` instead of the usual `pandas` function:

```python
# for a single file, use this:
import pandas as pd
df = pd.read_pickle(<filename>)

# for batches of samples, use this:
import methylprep
df = methylprep.load(<path_to_files>)

# for processed samples saved as `csv`s, use this:
import methylprep
df = methylprep.load(<path_to_csv_files>)
```

Some large series of samples have caused **methylprep** to fail when memory runs out. If you encounter any issues, please let us know by adding an issue to our [GitHub](https://github.com/LifeEGX/methylprep/issues). If your computer has very little memory (less than 4GB of RAM), you may not be able to run **methylprep**. However, if it crashes because of a memory error, use the --batch_size option with a small number, like 30, to force it to save to disk more often and use less memory.
#### advanced options

For additional arguments for `process`, pr more information on the structure of **methylprep**'s classes, and a guide to manually processing data using internal functions, see the **[Developers Notes section](https://life-epigenetics-methylprep.readthedocs-hosted.com/en/latest/docs/methylprep_tutorial.html#developers-notes)]**.
For additional arguments for `process`, or more information on the structure of **methylprep**'s classes, and a guide to manually processing data using internal functions, see the **[Developers Notes section](https://life-epigenetics-methylprep.readthedocs-hosted.com/en/latest/docs/methylprep_tutorial.html#developers-notes)]**.


## `methylcheck` command line interface (CLI)
Expand Down
50 changes: 49 additions & 1 deletion methylprep/processing/pipeline.py
Original file line number Diff line number Diff line change
Expand Up @@ -11,8 +11,8 @@
from ..utils import ensure_directory_exists
from .postprocess import (
calculate_beta_value,
calculate_copy_number,
calculate_m_value,
calculate_copy_number,
consolidate_values_for_sheet,
consolidate_control_snp,
)
Expand Down Expand Up @@ -72,6 +72,12 @@ def run_pipeline(data_dir, array_type=None, export=False, manifest_filepath=None
if True, saves a pickle (beta_values.pkl) of beta values for all samples
m_value
if True, saves a pickle (m_values.pkl) of beta values for all samples
Note on meth/unmeth:
if either betas or m_value is True, this will also save two additional files:
'meth_values.pkl' and 'unmeth_values.pkl' with the same dataframe structure,
representing raw, uncorrected meth probe intensities for all samples. These are useful
in some methylcheck functions and load/produce results 100X faster than loading from
processed CSV output.
manifest_filepath [optional]
if you want to provide a custom manifest, provide the path. Otherwise, it will download
the appropriate one for you.
Expand Down Expand Up @@ -225,6 +231,48 @@ def run_pipeline(data_dir, array_type=None, export=False, manifest_filepath=None
df = df.transpose() # put probes as columns for faster loading.
pd.to_pickle(df, Path(data_dir,pkl_name))
LOGGER.info(f"saved {pkl_name}")
if betas or m_value:
df = consolidate_values_for_sheet(batch_data_containers, postprocess_func_colname='noob_meth', bit=bit)
if not batch_size:
pkl_name = 'noob_meth_values.pkl'
else:
pkl_name = f'noob_meth_values_{batch_num}.pkl'
if df.shape[1] > df.shape[0]:
df = df.transpose() # put probes as columns for faster loading.
pd.to_pickle(df, Path(data_dir,pkl_name))
LOGGER.info(f"saved {pkl_name}")
# TWO PARTS
df = consolidate_values_for_sheet(batch_data_containers, postprocess_func_colname='noob_unmeth', bit=bit)
if not batch_size:
pkl_name = 'noob_unmeth_values.pkl'
else:
pkl_name = f'noob_unmeth_values_{batch_num}.pkl'
if df.shape[1] > df.shape[0]:
df = df.transpose() # put probes as columns for faster loading.
pd.to_pickle(df, Path(data_dir,pkl_name))
LOGGER.info(f"saved {pkl_name}")

if (betas or m_value) and save_uncorrected:
df = consolidate_values_for_sheet(batch_data_containers, postprocess_func_colname='meth', bit=bit)
if not batch_size:
pkl_name = 'meth_values.pkl'
else:
pkl_name = f'meth_values_{batch_num}.pkl'
if df.shape[1] > df.shape[0]:
df = df.transpose() # put probes as columns for faster loading.
pd.to_pickle(df, Path(data_dir,pkl_name))
LOGGER.info(f"saved {pkl_name}")
# TWO PARTS
df = consolidate_values_for_sheet(batch_data_containers, postprocess_func_colname='unmeth', bit=bit)
if not batch_size:
pkl_name = 'unmeth_values.pkl'
else:
pkl_name = f'unmeth_values_{batch_num}.pkl'
if df.shape[1] > df.shape[0]:
df = df.transpose() # put probes as columns for faster loading.
pd.to_pickle(df, Path(data_dir,pkl_name))
LOGGER.info(f"saved {pkl_name}")

if export:
export_path_parents = list(set([str(Path(e).parent) for e in export_paths]))
LOGGER.info(f"[!] Exported results (csv) to: {export_path_parents}")
Expand Down