dmeth - documentation for reproducible DNA methylation analysis¶
User Guide¶
A fast, statistically rigorous Python toolkit for DNA methylation analysis, from raw beta matrices to biomarkers and functional interpretation. dmeth implements the full modern differential methylation pipeline used in high-impact epigenome-wide association studies (EWAS), with performance and correctness on par with established R/Bioconductor tools, in pure Python.
Key Features¶
| Feature | Implementation | Performance |
|---|---|---|
| Empirical Bayes moderated t-tests | limma-style (Smyth 2004) with exact replication | Numba-accelerated (10–100× faster) |
| Memory-efficient chunked analysis | Automatic fallback for >1M probes | <4 GB RAM typical |
| Cell-type deconvolution | Reference-based NNLS (Houseman/Horvath-style) | Parallel joblib |
| DMR discovery | Sliding-window clustering + gap merging | Vectorized |
| Gene annotation & pathway enrichment | IntervalTree + Fisher’s exact (FDR) | Sub-second on 450k/EPIC |
| Coordinate liftover (hg19 ↔ hg38) | pyliftover integration | Per-region tracking |
| Biomarker panel discovery & validation | RF / Elastic Net + stratified CV | Built-in |
| Robust preprocessing & QC | Missingness, group representation, imputation | Production-safe |
Fully supports Illumina 450K, EPIC (850K), and any custom CpG × sample matrix.
Quick Start¶
pip install "dmeth[full]"
import numpy as np
import pandas as pd
from dmeth.io.readers import load_methylation_data
from dmeth.core.analysis.preparation import filter_cpgs_by_missingness, impute_missing_values
from dmeth.core.analysis.validation import build_design, validate_contrast
from dmeth.core.analysis.core_analysis import fit_differential
from dmeth.core.downstream.annotation import find_dmrs_by_sliding_window
# 1. Load your data
# beta: CpG x samples matrix
# pheno: sample metadata with a 'group' column
beta = pd.read_csv("beta_matrix.csv", index_col=0)
pheno = pd.read_csv("phenotype.csv", index_col=0)
# 2. Preprocessing
# Drop CpGs with too much missingness
beta_clean, _, _ = filter_cpgs_by_missingness(beta, max_missing_rate=0.2)
# Impute remaining missing values (kNN)
beta_imp = impute_missing_values(beta_clean, method="knn", k=10)
# 3. Differential analysis (case vs control)
# Build design matrix from phenotype
design = build_design(pheno["group"], categorical=["group"])
contrast = [0] * (design.shape[1] - 1) + [1]
contrast = validate_contrast(design, contrast)
# Extract group labels
group_labels = pd.Series(
np.where(design["group"], "B", "A"),
index=beta_imp.columns,
name="group",
)
# Fit - use fit_differential_chunked() for larger datasets
res = fit_differential(
data=beta_imp,
design=design,
contrast=contrast,
group_labels=group_labels,
shrink="smyth",
robust=True,
)
# 4. Discover DMRs
ann = pd.read_csv("cpg_annotation.csv", index_col=0) # must include chromosome/position columns (here CHR, MAPINFO)
dmrs = find_dmrs_by_sliding_window(
dms=res[res["padj"] < 0.05],
annotation=ann,
chr_col="CHR",
pos_col="MAPINFO",
max_gap=500,
min_cpgs=3,
)
print(f"Found {len(dmrs)} DMRs")
print(dmrs.head())
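The gap-merging idea behind find_dmrs_by_sliding_window can be illustrated with a small standalone sketch (a simplified re-implementation for intuition, not dmeth's actual code; sketch_dmrs is our name):

```python
import pandas as pd

def sketch_dmrs(sig: pd.DataFrame, chr_col: str = "CHR", pos_col: str = "MAPINFO",
                max_gap: int = 500, min_cpgs: int = 3) -> pd.DataFrame:
    """Cluster significant CpGs into regions: consecutive CpGs on the same
    chromosome are merged while the gap between neighbours is <= max_gap."""
    df = sig.sort_values([chr_col, pos_col])
    # start a new cluster on every chromosome change or oversized gap
    new_cluster = (df[chr_col] != df[chr_col].shift()) | \
                  (df[pos_col] - df[pos_col].shift() > max_gap)
    df = df.assign(cluster=new_cluster.cumsum())
    dmrs = (df.groupby("cluster")
              .agg(chrom=(chr_col, "first"), start=(pos_col, "min"),
                   end=(pos_col, "max"), n_cpgs=(pos_col, "size")))
    return dmrs[dmrs["n_cpgs"] >= min_cpgs].reset_index(drop=True)
```

With max_gap=500 and min_cpgs=3, three CpGs at positions 100/300/450 on one chromosome form a single region, while isolated CpGs are discarded.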
Installation¶
# Minimal (no speed, annotation, or other extras)
pip install dmeth
# Recommended: full scientific environment
pip install "dmeth[full]"
# Development
pip install "dmeth[full,dev]"
Optional extras (dmeth[full]):
- speed: numba, combat (highly recommended)
- annotation: intervaltree, pyliftover
- parallel: joblib
- format: PyYAML, toml, h5py, xlrd
- plotting: plotly, umap-learn
- io: pyarrow, tables, openpyxl, xlsxwriter
Optional dev extras (dmeth[dev]):
pytest, pytest-cov, black, isort, flake8, flake8-pyproject, flake8-bugbear, bandit, mkdocs, mkdocs-material
Reference for all modules, classes, and functions¶
Study Planning¶
High-level planning API for DNA methylation studies.
This module provides a clean, user-friendly interface to the full study planning
capabilities of DMeth. It exposes all configuration queries, cost/timeline
estimators, and sample size calculators as simple top-level functions.
All functions automatically use the global PlannerConfig singleton (managed in dmeth.config.config_manager), so configuration loaded once is immediately available everywhere.
Features¶
- Direct function access: list_platforms(), calculate_sample_size(), etc.
- Zero boilerplate: no need to manually instantiate or pass config objects
- Real-time updates: any config change (e.g., regional pricing) instantly affects all calculations
add_custom_platform¶
add_custom_platform(platform_id: str, platform_info: Dict[str, Any]) -> None
Register a new custom platform after schema validation.
Parameters¶
- platform_id (str): Unique identifier for the new platform.
- platform_info (dict): Platform data conforming to PlatformSchema.
Raises¶
ValueError If the provided data fails validation.
calculate_sample_size¶
calculate_sample_size(*args, **kwargs)
Compute required sample size per group using power analysis.
The calculation accounts for the chosen experimental design, platform CpG count (for Bonferroni correction), and any design-specific power adjustment factor.
Parameters¶
- design_id (str): Identifier of the experimental design.
- platform_id (str): Identifier of the methylation platform.
- effect_size (float): Expected Cohen's d.
- alpha (float, default 0.05): Nominal type-I error rate.
- power (float, default 0.8): Desired statistical power.
- mcp_method (str, optional): Multiple-comparison correction method (bonferroni, fdr, none).
- max_n (int, default 500): Upper bound on the calculated sample size per group.
Returns¶
dict
Contains n_per_group, total_samples, effective alpha, etc.
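As a rough sanity check on such a calculation, the standard normal-approximation formula for a two-sample comparison with Bonferroni-corrected alpha can be sketched with the standard library alone (sketch_n_per_group is our name, not part of the dmeth API):

```python
from math import ceil
from statistics import NormalDist

def sketch_n_per_group(effect_size: float, alpha: float = 0.05,
                       power: float = 0.8, n_tests: int = 1,
                       max_n: int = 500) -> int:
    """Normal-approximation sample size per group for a two-sample test
    of Cohen's d, with Bonferroni correction over n_tests CpGs."""
    a = alpha / n_tests                 # Bonferroni-adjusted alpha
    z = NormalDist()
    z_a = z.inv_cdf(1 - a / 2)          # two-sided critical value
    z_b = z.inv_cdf(power)
    n = 2 * ((z_a + z_b) / effect_size) ** 2
    return min(ceil(n), max_n)
```

For d = 0.5 at alpha 0.05 and 80% power this gives 63 per group; correcting over hundreds of thousands of CpGs quickly hits the max_n cap.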
estimate_study_timeline¶
estimate_study_timeline(*args, **kwargs)
Produce a detailed timeline estimate for the entire study.
The estimate respects phase-specific scaling, batch adjustments, parallelisability, and an optional global contingency buffer.
Parameters¶
- n_samples (int): Total number of samples to be processed.
- platform_id (str): Platform identifier (affects array-processing duration).
- contingency (bool, default True): Apply the global contingency buffer percentage.
Returns¶
DataFrame
One row per phase plus a summary TOTAL row.
estimate_total_cost¶
estimate_total_cost(*args, **kwargs)
Calculate the complete study cost and provide a detailed breakdown.
Handles per-sample, per-CpG, and fixed cost components, with optional
inclusion of components marked as optional.
Parameters¶
- n_samples (int): Number of samples.
- platform_id (str): Platform identifier.
- include_optional (bool, default True): Include optional cost components.
- validate (bool, default True): Perform unit validation (kept for backward compatibility).
Returns¶
dict
Contains a components DataFrame, total cost,
per_sample cost, and metadata.
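The shape of this breakdown can be illustrated with a toy numeric sketch (all figures and the function name are invented for illustration, not real platform costs):

```python
def sketch_total_cost(n_samples: int, per_sample: float,
                      fixed: float = 0.0, optional: float = 0.0,
                      include_optional: bool = True) -> dict:
    """Toy cost breakdown: per-sample, fixed, and optional components."""
    components = {
        "array_processing": per_sample * n_samples,  # per-sample component
        "fixed_setup": fixed,                        # fixed component
    }
    if include_optional:
        components["optional_extras"] = optional * n_samples
    total = sum(components.values())
    return {"components": components, "total": total,
            "per_sample": total / n_samples}
```

For example, 10 samples at 300 per sample with 1000 fixed and 20 optional per sample totals 4200, i.e. 420 per sample.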
get_config¶
get_config() -> dmeth.config.config_manager.PlannerConfig
Get the global PlannerConfig singleton instance.
Returns¶
PlannerConfig The global configuration instance.
get_cost_components¶
get_cost_components(platform: Optional[str] = None, include_optional: bool = True) -> Dict[str, Dict]
Return cost components, optionally filtered by platform or optionality.
Parameters¶
- platform (str, optional): Restrict to components whose applies_to list contains this platform.
- include_optional (bool, default True): Whether to include components marked as optional.
Returns¶
dict Mapping from component identifier to component dictionary.
get_design¶
get_design(design_id: str) -> Dict[str, Any]
Retrieve the configuration dictionary for a specific experimental design.
Parameters¶
- design_id (str): Identifier of the design.
Returns¶
dict Design configuration.
Raises¶
KeyError If the design identifier is unknown.
get_platform¶
get_platform(platform_id: str) -> Dict[str, Any]
Retrieve the configuration dictionary for a specific platform.
Parameters¶
- platform_id (str): Identifier of the platform.
Returns¶
dict Platform configuration.
Raises¶
KeyError If the platform identifier is unknown.
get_platform_by_budget¶
get_platform_by_budget(max_cost_per_sample: float) -> DataFrame
Return platforms whose per-sample cost does not exceed a budget limit.
Parameters¶
- max_cost_per_sample (float): Maximum acceptable cost per sample.
Returns¶
DataFrame Subset of platforms meeting the budget constraint.
list_designs¶
list_designs() -> DataFrame
Return a table of all configured experimental designs.
Returns¶
DataFrame One row per design.
list_platforms¶
list_platforms(recommended_only: bool = False) -> DataFrame
Return a DataFrame containing all configured platforms.
Parameters¶
- recommended_only (bool, default False): If True, only platforms with recommended=True are returned.
Returns¶
DataFrame One row per platform, indexed by platform identifier.
update_platform_cost¶
update_platform_cost(platform_id: str, new_cost: float) -> None
Change the per-sample cost of an existing platform.
Parameters¶
- platform_id (str): Platform identifier.
- new_cost (float): New cost per sample (must be non-negative).
Raises¶
KeyError If the platform does not exist.
update_regional_pricing¶
update_regional_pricing(multiplier: float, region: Optional[str] = None) -> None
Apply a uniform multiplicative adjustment to all platform costs.
Parameters¶
- multiplier (float): Factor by which current costs are multiplied (must be > 0).
- region (str, optional): Descriptive label for logging purposes.
Raises¶
ValueError
If multiplier is not positive.
Input/Output¶
Data Utils¶
Core data structures and validation utilities for the DMeth pipeline.
This module defines the central ProcessedData container used throughout the package and provides helper functions for ensuring data integrity (index alignment and type safety).
Key Components¶
- ProcessedData: The standard container holding the aligned methylation matrix, sample metadata, and probe annotation. All analysis functions expect this object.
- Validation helpers:
  - _ensure_index_alignment(M, pheno, ann): Checks that samples and probes are consistently indexed across components.
  - _ensure_index_strings(df): Safely converts DataFrame indices, but not columns, to strings (required for HDF5/h5py compatibility).
All ProcessedData objects are automatically validated and string-indexed upon creation.
ProcessedData¶
ProcessedData(M: DataFrame, pheno: DataFrame, ann: DataFrame, meta: Dict[str, Any] = <factory>) -> None
Central container for aligned methylation data used throughout the DMeth pipeline.
Guarantees that:
- All components (methylation matrix, sample metadata, probe annotation) share consistent string indices.
- Sample and probe alignment is validated on construction.
- A meta dictionary tracks processing history (normalisation, batch correction, QC metrics, etc.).
All downstream functions expect or return instances of this class.
Parameters¶
- M (pd.DataFrame): Methylation matrix with CpG sites as rows and samples as columns. Values are typically beta (0–1) or M-values (-inf to +inf).
- pheno (pd.DataFrame): Sample metadata table. Must be indexed by sample IDs (strings after construction).
- ann (pd.DataFrame or None): Optional probe/CpG annotation table (e.g., Illumina manifest). Must be indexed by CpG IDs. If None, probe-level alignment checks are skipped.
- meta (dict, optional): Free-form dictionary storing pipeline provenance and parameters. Pre-populated with sensible defaults if not provided.
Attributes¶
- M (pd.DataFrame): Methylation matrix (string-indexed rows and columns).
- pheno (pd.DataFrame): Sample metadata (string-indexed).
- ann (pd.DataFrame or None): Probe annotation (string-indexed if present).
- meta (dict): Processing metadata (e.g., {"matrix_type": "beta", "normalized": True, ...}).
Notes¶
The __post_init__ method automatically:
- Converts all relevant indices to strings.
- Validates alignment between components.
- Raises an informative KeyError on mismatch.
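The alignment guarantee can be illustrated with a standalone check in the spirit of the private helpers (a simplified stand-in for illustration, not their actual code):

```python
from typing import Optional
import pandas as pd

def check_alignment(M: pd.DataFrame, pheno: pd.DataFrame,
                    ann: Optional[pd.DataFrame] = None) -> None:
    """Raise KeyError when samples or probes are inconsistently indexed."""
    sample_ids = M.columns.astype(str)
    if not sample_ids.equals(pheno.index.astype(str)):
        raise KeyError("sample IDs in M columns and pheno index differ")
    if ann is not None:
        # every probe in M must be covered by the annotation
        probe_ids = M.index.astype(str)
        if not probe_ids.isin(ann.index.astype(str)).all():
            raise KeyError("some CpG IDs in M are missing from the annotation")
```

Reordered or mismatched sample IDs fail fast rather than silently producing misaligned results.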
Readers¶
Input utilities for loading methylation datasets in the DMeth pipeline.
This module provides a Python API for loading preprocessed or raw methylation data into the standard ProcessedData container used throughout the package.
Features¶
- Robust multi-format reading (CSV, TSV, Excel, pickle, HDF5, Parquet, Feather)
- Automatic coercion of methylation matrix to numeric values
- Smart HDF5 key selection (prefers /M or M)
- Safe sample alignment between methylation matrix and phenotype data
- Two high-level entry points:
  - load_processed_data(): load a previously saved ProcessedData object
  - load_methylation_data(): construct a new ProcessedData from raw files
ProcessedData¶
ProcessedData(M: DataFrame, pheno: DataFrame, ann: DataFrame, meta: Dict[str, Any] = <factory>) -> None
Central container for aligned methylation data; see the ProcessedData reference under Data Utils for full parameters, attributes, and notes.
load_methylation_data¶
load_methylation_data(methylation_input: Union[str, pathlib.Path, DataFrame], pheno_input: Union[str, pathlib.Path, DataFrame, NoneType] = None, ann_input: Union[str, pathlib.Path, DataFrame, NoneType] = None, index_col_probe: Optional[int] = 0, index_col_sample: Optional[int] = 0) -> ProcessedData
High-level constructor that builds a ProcessedData object from raw matrix/metadata files.
- Accepts file paths, already-loaded DataFrames, or a mix of both.
- Performs automatic type coercion, sample alignment, and validation.
Parameters¶
- methylation_input (str | Path | pd.DataFrame): CpG × sample methylation matrix (beta or M-values).
- pheno_input (str | Path | pd.DataFrame, optional): Sample phenotype/metadata table. If None, an empty placeholder indexed by the matrix columns is created.
- ann_input (str | Path | pd.DataFrame, optional): CpG probe annotation (e.g., Illumina manifest). If None, annotation is omitted.
- index_col_probe (int, default 0): Column containing CpG identifiers when reading the methylation matrix.
- index_col_sample (int, default 0): Column containing sample identifiers when reading the phenotype table.
Returns¶
ProcessedData Fully validated container ready for downstream analysis.
Notes¶
- Non-numeric entries in the methylation matrix are forcibly coerced to NaN (with a warning indicating how many values were affected).
- The phenotype table is reindexed to exactly match the sample order in
M. - All indices are converted to strings and alignment is checked by
ProcessedData.__post_init__.
load_processed_data¶
load_processed_data(path: Union[str, pathlib.Path], trusted: bool = False) -> ProcessedData
Load a previously saved ProcessedData object from disk.
Supports the two canonical DMeth output formats:
- .pkl / .pickle → direct pandas pickle of the ProcessedData instance
- .h5 / .hdf5 → HDF5 store containing the keys M, pheno, and optionally ann
Parameters¶
- path (str or Path): Path to the saved ProcessedData file.
- trusted (bool, default False): Only allow .pkl/.pickle files from trusted sources.
Returns¶
ProcessedData Fully reconstructed and validated ProcessedData container.
Raises¶
ValueError If the pickle file is from an untrusted source.
ValueError If the file extension is not one of the supported formats.
Writers¶
Output utilities for saving processed methylation data and analysis results.
This module provides a unified interface (Python API + Typer CLI) for exporting:
- ProcessedData objects in multiple formats
- Differential methylation results (filtered and clean)
- IDAT-style HDF5 archives (compatible with downstream tools)
- Self-contained HTML analysis reports with embedded plots
All export functions include overwrite protection, automatic directory creation, and informative logging.
Features¶
- Smart format detection and suffix handling
- Optional compression (blosc:zstd or gzip)
- Safe export of differential results (core columns + group means)
- IDAT-compatible HDF5 layout via h5py (fallback to pandas HDFStore)
- One-call HTML report generation from summary dict + matplotlib figures
ProcessedData¶
ProcessedData(M: DataFrame, pheno: DataFrame, ann: DataFrame, meta: Dict[str, Any] = <factory>) -> None
Central container for aligned methylation data; see the ProcessedData reference under Data Utils for full parameters, attributes, and notes.
export_idat_hdf5¶
export_idat_hdf5(beta: DataFrame, mvals: Optional[DataFrame], sample_sheet: Optional[DataFrame], filepath: Union[str, pathlib.Path], compress: bool = True) -> pathlib.Path
Write methylation data to an IDAT-compatible HDF5 archive using h5py (preferred) or pandas fallback.
Layout mirrors common downstream tools: groups /beta, /mvals (optional),
and /samples containing one JSON dataset per sample.
Parameters¶
- beta (pd.DataFrame): Beta-value matrix (CpGs × samples).
- mvals (pd.DataFrame or None, optional): M-value matrix (same dimensions).
- sample_sheet (pd.DataFrame or None, optional): Sample metadata (index = sample IDs).
- filepath (str or Path): Destination HDF5 file.
- compress (bool, default True): Apply gzip compression to large datasets.
Returns¶
Path Path to the created HDF5 file.
Raises¶
RuntimeError If neither h5py nor pandas.HDFStore is available.
export_results¶
export_results(res: DataFrame, output_path: str, format: str = 'csv', include_all: bool = False, verbose: bool = True, overwrite: bool = True) -> None
Export differential methylation results to CSV, TSV, or Excel with optional column filtering.
- By default only core statistical columns and per-group mean columns are exported.
- Overwrite protection is applied if overwrite=False.
Parameters¶
- res (pd.DataFrame): Differential results table (as returned by the analysis pipeline).
- output_path (str): Destination file path (its extension determines the format if format is omitted).
- format ({"csv", "tsv", "excel"}, default "csv"): Output file format.
- include_all (bool, default False): If True, export every column; if False, only core + mean columns are kept.
- verbose (bool, default True): Log a confirmation message on success.
- overwrite (bool, default True): If False, raise FileExistsError when the file exists.
Raises¶
FileExistsError When overwrite=False and the target file already exists.
ValueError For an unsupported format.
RuntimeError If writing fails (e.g., missing openpyxl for Excel).
save_processed_data¶
save_processed_data(data: 'ProcessedData', path: Union[str, pathlib.Path], format: str = 'hdf5', compression: str | None = 'blosc:zstd', complib_level: int = 9, overwrite: bool = True) -> None
Persist a ProcessedData object to disk in one of several convenient formats.
Supported formats¶
- csv / tsv → three separate files (M, pheno, ann)
- xlsx → single workbook with sheets "M", "pheno", "ann"
- pickle / pkl → direct pickle of the ProcessedData instance
- hdf5 / h5 → pandas HDFStore with compressed tables
Parameters¶
- data (ProcessedData): The container to save.
- path (str or Path): Output path; the suffix is automatically corrected to match format.
- format ({"csv", "tsv", "xlsx", "pickle", "pkl", "hdf5", "h5"}, default "hdf5"): Desired export format.
- compression (str or None, default "blosc:zstd"): Compression library for HDF5 (only used when format is hdf5).
- complib_level (int, default 9): Compression level (1–9) for HDF5.
- overwrite (bool, default True): Allow overwriting existing files.
Raises¶
ValueError
If an unsupported format is requested.
FileExistsError
When overwrite=False and the target exists.
Preprocessing Utilities¶
Core preprocessing utilities for DNA methylation array data.
- This module implements quality control, normalization, batch correction, and filtering operations for methylation matrices (beta or M-values).
- It provides both standard and high-performance implementations with memory-efficient processing for large datasets.
Features¶
- Sample-level QC: Flag/remove samples with excessive missing data
- CpG-level QC: Remove probes with high missingness or on sex chromosomes
- Normalization: Beta quantile normalization with optional M-value conversion
- Batch correction: Regression-based or ComBat-style correction
- Filtering: Remove low-variance probes
ProcessedData¶
ProcessedData(M: DataFrame, pheno: DataFrame, ann: DataFrame, meta: Dict[str, Any] = <factory>) -> None
Central container for aligned methylation data; see the ProcessedData reference under Data Utils for full parameters, attributes, and notes.
batch_correction¶
batch_correction(data: ProcessedData, batch_col: str, covariates: Optional[List[str]] = None, method: str = 'qr', weights: Optional[ndarray] = None, block_size: int = 50000, robust: bool = False, return_diagnostics: bool = False) -> Union[ProcessedData, Tuple[ProcessedData, Dict[str, Any]]]
Regression-based batch correction using full linear model (intercept + covariates + batch).
Numerically stable QR/pinv solver with optional per-probe weights and block-wise processing.
Parameters¶
- batch_col (str): Column in pheno identifying the batch.
- covariates (list[str] or None): Additional technical or biological covariates to preserve.
- method ({"qr", "pinv"}, default "qr"): Solver backend.
- weights (ndarray or None): Per-probe inverse-variance weights.
- robust (bool, default False): Use Huber robust regression (much slower).
- return_diagnostics (bool, default False): Return detailed effect sizes, residuals, and variance explained.
Returns¶
ProcessedData or (ProcessedData, dict) Corrected data and optional diagnostics.
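The regression idea can be sketched in a few lines of numpy (batch dummies only, no covariates, weights, or blocking; sketch_batch_correct is our name, not the dmeth API):

```python
import numpy as np
import pandas as pd

def sketch_batch_correct(M: pd.DataFrame, batch: pd.Series) -> pd.DataFrame:
    """Per-probe linear regression on batch indicators; the fitted batch
    component is subtracted so that batch means are equalised."""
    X = pd.get_dummies(batch.astype(str), drop_first=True).astype(float)
    X.insert(0, "intercept", 1.0)
    Xv = X.to_numpy()                          # samples x coefficients
    # least-squares fit for all probes at once: coef is (n_coef, n_probes)
    coef, *_ = np.linalg.lstsq(Xv, M.to_numpy().T, rcond=None)
    batch_effect = Xv[:, 1:] @ coef[1:]        # fitted batch component
    # centre the batch component so the grand mean is preserved
    batch_effect -= batch_effect.mean(axis=0, keepdims=True)
    return pd.DataFrame(M.to_numpy() - batch_effect.T,
                        index=M.index, columns=M.columns)
```

After correction, the per-batch means of each probe coincide while the overall mean is unchanged.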
batch_correction_combat¶
batch_correction_combat(data: ProcessedData, batch_col: str, covariates: Optional[List[str]] = None, parametric: bool = True, return_diagnostics: bool = False) -> Union[ProcessedData, Tuple[ProcessedData, Dict[str, Any]]]
Wrapper for ComBat (or safe fallback) batch correction.
- Uses pycombat if installed (parametric or nonparametric).
- Falls back to conservative mean-centering + variance rescaling with a clear warning.
- Automatically handles beta ↔ M-value conversion.
Parameters¶
- parametric (bool, default True): Use parametric ComBat (faster, assumes normality).
- return_diagnostics (bool, default False): Return metadata about which method was actually used.
Returns¶
ProcessedData or (ProcessedData, dict) Batch-corrected data and diagnostics.
filter_low_variance_cpgs¶
filter_low_variance_cpgs(data: ProcessedData, min_percentile: float = 10.0, inplace: bool = True) -> ProcessedData
Remove CpGs below a variance percentile threshold (e.g., bottom 10%).
Parameters¶
- data : ProcessedData
- min_percentile : float, default 10.0 Keep only probes with variance ≥ this percentile.
- inplace : bool, default True Modify input object or return a copy.
Returns¶
ProcessedData Dataset with low-variance probes removed.
Notes¶
The number of removed probes is recorded in data.meta["qc"]["low_variance_removed"].
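The filtering rule itself is a one-liner over per-probe variances; a minimal standalone sketch (illustrative, without the ProcessedData bookkeeping):

```python
import numpy as np
import pandas as pd

def sketch_filter_low_variance(M: pd.DataFrame,
                               min_percentile: float = 10.0) -> pd.DataFrame:
    """Drop probes whose variance falls below the given percentile."""
    variances = M.var(axis=1, skipna=True)           # per-probe variance
    threshold = np.percentile(variances.dropna(), min_percentile)
    return M.loc[variances >= threshold]
```

With the default of 10.0, roughly the least-variable tenth of probes is removed before downstream testing.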
normalize_methylation¶
normalize_methylation(data: ProcessedData, method: str = 'beta_quantile', convert_to: Optional[str] = None, copy: bool = False, q_chunk_threshold: int = 100000000.0, q_block_probes: int = 50000) -> ProcessedData
Perform quantile normalization across samples with optional beta ↔ M-value conversion.
- Implements memory-efficient beta quantile normalization using column-wise sorting and block-wise rank mapping.
- Automatically falls back to disk-backed memmap for very large datasets.
Parameters¶
- data (ProcessedData): Input methylation data.
- method ({"beta_quantile", "none"}, default "beta_quantile"): Normalization method.
- convert_to ({"beta", "m"} or None, optional): Convert matrix type after normalization (e.g., "m" for M-values).
- copy (bool, default False): Work on a deep copy of the input.
- q_chunk_threshold (int, default 1e8): Element count threshold (n_probes × n_samples) above which memmap is used.
- q_block_probes (int, default 50_000): Number of probes processed per block during rank-to-target assignment.
Returns¶
ProcessedData
Normalized dataset with updated meta["normalized"] and detailed provenance.
Notes¶
Preserves original NaNs. Records full normalization metadata including memory strategy.
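The core quantile step (sort each sample, average the sorted profiles into a target distribution, map ranks back to the target) can be sketched as follows; this is a textbook re-implementation for intuition, not dmeth's memory-managed code, and it skips NaN handling:

```python
import numpy as np
import pandas as pd

def sketch_quantile_normalize(beta: pd.DataFrame) -> pd.DataFrame:
    """Classic quantile normalization: every sample is mapped onto the
    mean of the per-sample sorted value distributions."""
    X = beta.to_numpy(dtype=float)
    order = np.argsort(X, axis=0)              # per-sample sort order
    target = np.sort(X, axis=0).mean(axis=1)   # mean sorted profile
    out = np.empty_like(X)
    for j in range(X.shape[1]):
        out[order[:, j], j] = target           # map ranks -> target values
    return pd.DataFrame(out, index=beta.index, columns=beta.columns)
```

After normalization every sample shares the same value distribution. Beta → M conversion, when requested via convert_to="m", is the usual logit-like transform M = log2(beta / (1 - beta)).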
normalize_methylation_highperf¶
normalize_methylation_highperf(data: ProcessedData, method: str = 'beta_quantile', convert_to: Optional[str] = None, copy: bool = False, memmap_threshold: int = 200000000.0, memmap_dir: Optional[str] = None, n_workers: Optional[int] = None, sample_block: int = 16, random_state: Optional[int] = None) -> ProcessedData
High-performance quantile normalization with multiprocessing and automatic memmap handling.
Designed for >850k × 500+ datasets. Uses disk-backed arrays when needed and parallelizes the rank-mapping stage.
Parameters¶
- memmap_threshold (int, default 2e8): Element count above which temporary memmap files are created.
- memmap_dir (str or None): Directory for memmap files (defaults to the system temp directory).
- n_workers (int or None): Number of processes for parallel assignment (defaults to CPU count − 1).
- sample_block (int, default 16): Number of samples per parallel job.
- random_state (int or None): Currently unused (kept for API consistency).
Returns¶
ProcessedData Quantile-normalized data with comprehensive metadata.
Notes¶
Significantly faster and lower peak RAM than normalize_methylation on large cohorts.
qc_cpg_level¶
qc_cpg_level(data: ProcessedData, max_missing_fraction: float = 0.1, drop_sex_chr: bool = True, chr_col: str = 'chromosome') -> ProcessedData
Remove CpGs with high missingness across samples and optionally drop sex-chromosome probes.
Parameters¶
- data (ProcessedData): Input dataset.
- max_missing_fraction (float, default 0.10): Maximum fraction of samples allowed to be missing for a CpG (10% by default).
- drop_sex_chr (bool, default True): If True and annotation is available, remove all probes on chromosomes X and Y.
- chr_col (str, default "chromosome"): Column name in data.ann containing chromosome information.
Returns¶
ProcessedData Updated container with low-quality and/or sex-chromosome CpGs removed.
Notes¶
Summary statistics are recorded in data.meta["qc"]["cpg_missing"].
qc_sample_level¶
qc_sample_level(data: ProcessedData, max_missing_fraction: float = 0.05, min_nonmissing_probes: Optional[int] = None, remove_samples: bool = True) -> ProcessedData
Identify and optionally remove samples with excessive missing methylation data.
Parameters¶
- data (ProcessedData): Input methylation dataset.
- max_missing_fraction (float, default 0.05): Maximum allowed fraction of missing CpGs per sample (5% by default).
- min_nonmissing_probes (int or None, optional): Alternative absolute threshold: minimum number of detected (non-missing) probes required.
- remove_samples (bool, default True): If True, flagged samples are removed from both M and pheno; if False, only metadata is recorded.
Returns¶
ProcessedData Updated container (samples removed if requested).
Notes¶
Flagged sample list and thresholds are stored in data.meta["qc"]["sample_missing"].
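The two missingness filters (sample-level, then CpG-level) reduce to fraction-of-NaN thresholds; a minimal combined sketch, mirroring the default thresholds above but without the ProcessedData bookkeeping:

```python
import pandas as pd

def sketch_missingness_qc(M: pd.DataFrame,
                          max_sample_missing: float = 0.05,
                          max_cpg_missing: float = 0.10) -> pd.DataFrame:
    """Drop samples, then CpGs, whose missing fraction exceeds the
    thresholds (in the spirit of qc_sample_level / qc_cpg_level)."""
    sample_miss = M.isna().mean(axis=0)        # fraction missing per sample
    M = M.loc[:, sample_miss <= max_sample_missing]
    cpg_miss = M.isna().mean(axis=1)           # fraction missing per CpG
    return M.loc[cpg_miss <= max_cpg_missing]
```

Filtering samples first matters: dropping a badly missing sample can rescue CpGs that would otherwise exceed the probe-level threshold.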
Core Differential Analysis¶
Core Analysis¶
Statistical engines for differential methylation analysis.
This module implements empirical Bayes moderated statistics following Smyth (2004) limma methodology, with extensions for paired designs, robust estimation, and Numba-accelerated computation. It provides both full-matrix and chunked processing for memory-constrained environments.
Features¶
- Empirical Bayes variance shrinkage (Smyth method)
- Robust variance estimation via winsorization
- Numba JIT compilation for 10–100× speedup
- Memory-efficient chunked processing
- Paired and multi-group designs
- F-tests for multi-coefficient contrasts
- Automatic handling of missing data
- Group mean computation for interpretability
SmythPrior¶
SmythPrior(df_prior: float, var_prior: float)
Container for the empirical Bayes prior: prior degrees of freedom (df_prior) and prior variance (var_prior).
check_analysis_memory¶
check_analysis_memory(M: 'pd.DataFrame', warn_threshold_gb: 'float' = 8.0)
Conservatively estimate RAM requirements for a full-matrix differential analysis.
- Calculates the current data footprint, projects peak usage (approximately 4× the input size), and compares it against available system memory.
- Raises MemoryError early when the risk is high, forcing use of chunked mode.
Parameters¶
- M (pd.DataFrame): Methylation matrix (CpGs × samples).
- warn_threshold_gb (float, default 8.0): Issue a warning if the estimated peak exceeds this value.
Returns¶
dict
Keys: data_gb, peak_gb, available_gb.
Raises¶
MemoryError
If projected peak consumption exceeds ~80% of available RAM. In such cases fit_differential_chunked should be used instead.
fit_differential¶
fit_differential(M: 'pd.DataFrame', design: 'pd.DataFrame', contrast: 'Optional[np.ndarray]' = None, contrast_matrix: 'Optional[np.ndarray]' = None, shrink: 'Union[str, float]' = 'auto', robust: 'bool' = True, eps: 'float' = 1e-08, return_residuals: 'bool' = False, min_count: 'int' = 3, max_d0: 'float' = 50.0, winsor_lower: 'float' = 0.05, winsor_upper: 'float' = 0.95, group_labels: 'Optional[pd.Series]' = None, use_numba: 'bool' = True) -> 'Union[pd.DataFrame, Tuple[pd.DataFrame, pd.DataFrame]]'
Core limma-style differential methylation analysis with empirical Bayes moderation.
- Features:
- Numba-accelerated feature-wise fitting (10–100× speedup when no missing data)
- Automatic handling of missing values
- Robust winsorization of residuals
- Multiple shrinkage strategies (Smyth, median, fixed d₀, none)
- Single-contrast t-tests or multi-coefficient F-tests
- Optional group mean beta/M-value columns for interpretability
- Optional residuals return for diagnostics
Parameters¶
- M: pd.DataFrame. CpG × sample matrix of M-values (beta values transformed to the log-odds scale).
- design: pd.DataFrame. Sample × covariate design matrix (output from `patsy` or manual construction).
- contrast / contrast_matrix: np.ndarray or None. Contrast vector (t-test) or matrix (F-test).
- shrink: {"auto", "smyth", "median", "none"} or float. Variance shrinkage method or fixed prior df.
- robust: bool, default True. Winsorize residuals per feature.
- group_labels: pd.Series, optional. Explicit sample grouping for mean columns; otherwise heuristically detected.
- use_numba: bool, default True. Enable JIT-compiled loop when possible.
Returns¶
pd.DataFrame
Results table with columns including logFC, t, pval, padj (or F, df1, df2), moderated variances (s2_post), prior df (d0), and group means.
(pd.DataFrame, pd.DataFrame), optional
Second element contains per-feature residuals if return_residuals=True.
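The variance moderation underlying these statistics can be sketched in a few lines. This illustrates the Smyth (2004) shrinkage formula only; `d0` and `s0_sq` are fixed here for illustration, whereas dmeth estimates them from the data under the "smyth" strategy:

```python
import numpy as np

# Per-CpG residual variances s2 are shrunk toward a prior variance s0_sq
# with prior degrees of freedom d0 (empirical Bayes moderation).
rng = np.random.default_rng(0)
df_resid = 8                                  # residual df per CpG
s2 = rng.chisquare(df=df_resid, size=5) / df_resid
d0, s0_sq = 4.0, 1.0                          # assumed prior df and variance
s2_post = (d0 * s0_sq + df_resid * s2) / (d0 + df_resid)
```

Each moderated variance `s2_post` lies between the raw estimate and the prior, which stabilizes t-statistics when per-group sample sizes are small.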
fit_differential_chunked¶
fit_differential_chunked(M: 'pd.DataFrame', design: 'pd.DataFrame', chunk_size: 'int' = 10000, verbose: 'bool' = True, group_labels: 'Optional[pd.Series]' = None, **kwargs) -> 'pd.DataFrame'
Memory-efficient version of fit_differential that processes features in chunks.
Useful for >1M probe arrays on limited RAM. P-values are combined globally and FDR-corrected once at the end.
Parameters¶
- chunk_size: int, default 10000. Number of CpGs per processing chunk.
- **kwargs: All additional arguments forwarded to `fit_differential`.
Returns¶
pd.DataFrame Complete results table with globally adjusted p-values and group means.
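Correcting once at the end matters because per-chunk FDR adjustment would use the wrong denominator. A minimal Benjamini–Hochberg sketch applied after pooling chunk p-values (an illustrative re-implementation of the idea, not dmeth's internals):

```python
import numpy as np

def bh_adjust(p):
    # Benjamini-Hochberg step-up: rank p-values, scale by n/rank, then
    # enforce monotonicity from the largest p downwards.
    p = np.asarray(p, dtype=float)
    n = p.size
    order = np.argsort(p)
    scaled = p[order] * n / (np.arange(n) + 1)
    monotone = np.minimum.accumulate(scaled[::-1])[::-1]
    adj = np.empty(n)
    adj[order] = np.clip(monotone, 0.0, 1.0)
    return adj

# Chunks are analysed separately, but their p-values are pooled first:
chunk_pvals = [np.array([0.01, 0.04]), np.array([0.02, 0.03])]
padj = bh_adjust(np.concatenate(chunk_pvals))
```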
Postprocessing¶
Postprocessing and interpretation utilities for differential methylation analysis results.
This module provides robust, publication-ready tools for summarizing, filtering, and extracting biologically meaningful differentially methylated CpGs from statistical output. It supports flexible significance criteria, effect-size thresholding, directional filtering, and delta-beta constraints, with comprehensive summary statistics for reporting.
Features¶
- Comprehensive summary statistics including significance counts, directionality, effect sizes, and empirical Bayes shrinkage metrics
- Flexible extraction of significant CpGs with adjustable logFC, adjusted p-value, and delta-beta thresholds
- Directional filtering for hyper- and hypomethylated sites
- Automatic detection or explicit specification of group mean beta-value columns for delta-beta filtering
- Graceful handling of missing columns and empty result sets with informative warnings
- Optional detailed summary dictionaries for downstream reporting or visualization
get_significant_cpgs¶
get_significant_cpgs(res: DataFrame, lfc_col: str = 'logFC', pval_col: str = 'padj', lfc_thresh: float = 0.0, pval_thresh: float = 0.05, delta_beta_thresh: Optional[float] = None, direction: Optional[str] = None, delta_beta_cols: Optional[List[str]] = None, return_summary: bool = False, verbose: bool = True) -> Union[List[str], Dict]
Extract biologically meaningful significant CpGs using flexible, multi-layer filtering.
- Combines adjusted p-value, log fold-change, directionality, and optional absolute delta-beta criteria.
- Ideal for generating final candidate lists for downstream validation or pathway analysis.
Parameters¶
- res: pd.DataFrame. Full differential methylation results table.
- lfc_col: str, default "logFC". Column containing log₂ fold change.
- pval_col: str, default "padj". Column containing adjusted p-values.
- lfc_thresh: float, default 0.0. Minimum absolute |logFC| required (0 means no LFC filtering).
- pval_thresh: float, default 0.05. Maximum adjusted p-value for significance.
- delta_beta_thresh: float or None, optional. Minimum absolute difference in mean beta values between groups.
- direction: {"hyper", "hypo", None}, optional. Restrict to hypermethylated (logFC > 0), hypomethylated (logFC < 0), or both.
- delta_beta_cols: list[str] or None, optional. Exactly two column names containing group mean beta values (e.g. ["meanB_case", "meanB_control"]). If None and delta-beta filtering is requested, columns starting with `meanB_` are detected automatically.
- return_summary: bool, default False. If True, return a detailed dictionary instead of just the CpG list.
- verbose: bool, default True. Warn when no significant CpGs are found.
Returns¶
list[str] or dict. List of significant CpG IDs (default), or a summary dictionary containing counts, directional breakdown, mean |logFC|, and the CpG list.
Raises¶
KeyError
If required columns (lfc_col or pval_col) are missing.
ValueError
If delta_beta_thresh is used but exactly two mean-beta columns cannot be identified.
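The default criteria can be reproduced with plain pandas. This is an illustration of the filtering logic only, using the default column names documented above:

```python
import pandas as pd

# Toy results table with the default column names.
res = pd.DataFrame({
    "logFC": [1.2, -0.8, 0.05, -2.0],
    "padj":  [0.01, 0.20, 0.01, 0.001],
}, index=["cg01", "cg02", "cg03", "cg04"])

# Significance mask: adjusted p-value and minimum |logFC|.
sig = (res["padj"] <= 0.05) & (res["logFC"].abs() >= 0.5)
hyper = res.index[sig & (res["logFC"] > 0)].tolist()   # hypermethylated
hypo = res.index[sig & (res["logFC"] < 0)].tolist()    # hypomethylated
```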
summarize_differential_results¶
summarize_differential_results(res: DataFrame, pval_thresh: float = 0.05, lfc_thresh: float = 0.0, verbose: bool = True) -> Dict[str, Union[int, float]]
Produce a comprehensive, publication-ready summary of differential methylation results.
Handles missing columns gracefully and returns key metrics for reporting: significance counts, directionality, effect-size statistics, and variance-shrinkage diagnostics.
Parameters¶
- res: pd.DataFrame. Differential methylation results containing at minimum `logFC` and `padj`. May also include `pval`, `s2`, `s2_post`, and `d0` (prior degrees of freedom).
- pval_thresh: float, default 0.05. Adjusted p-value threshold defining statistical significance.
- lfc_thresh: float, default 0.0. Minimum absolute |logFC| required (0 means no LFC filtering).
- verbose: bool, default True. Emit warnings for empty input or missing optional columns.
Returns¶
dict
Summary dictionary with the following keys:
total_tested: total CpGs tested
significant: number of significant CpGs (padj < threshold)
pct_significant: percentage of significant CpGs
hypermethylated / hypomethylated: directional counts
mean_abs_logFC_sig / median_abs_logFC_sig: effect size summaries among significant sites
max_abs_logFC: largest absolute log fold change
min_pval: smallest raw p-value
shrinkage_factor: median ratio of moderated to original variance
d0: median prior degrees of freedom (empirical Bayes)
Notes¶
Missing optional columns are safely ignored with fallback values. Empty input returns a zero-filled summary.
Preparation¶
Data preparation and preprocessing utilities for differential methylation analysis.
This module provides robust, production-ready tools for cleaning and imputing CpG-level methylation matrices prior to statistical modeling. It implements stringent quality-control filters based on missingness patterns and group representation, alongside flexible imputation strategies optimized for high-dimensional epigenomic data.
Features¶
- Global and per-group missingness filtering with configurable thresholds
- Minimum representation enforcement across experimental groups
- Multiple imputation methods: row-wise mean/median and K-nearest neighbors
- Sample-wise KNN option for dramatic speed gains when samples ≪ CpGs
- Comprehensive input validation and clear diagnostic reporting
- Memory-efficient operations using NumPy-based vectorization
filter_cpgs_by_missingness¶
filter_cpgs_by_missingness(M: 'pd.DataFrame', max_missing_rate: 'float' = 0.2, min_samples_per_group: 'Optional[int]' = None, groups: 'Optional[pd.Series]' = None) -> 'Tuple[pd.DataFrame, int, int]'
Remove CpGs exceeding a global missingness threshold and/or failing per-group representation.
Parameters¶
- M: pd.DataFrame. Methylation matrix (CpGs × samples) containing possible NaN values.
- max_missing_rate: float, default 0.2. Maximum allowable fraction of missing values across all samples (0–1).
- min_samples_per_group: int or None, optional. Minimum number of non-missing observations required in every experimental group. Ignored if None.
- groups: pd.Series or None, optional. Sample group labels indexed by `M.columns`. Required when `min_samples_per_group` is set.
Returns¶
- filtered: pd.DataFrame. Subset of CpGs passing both filters.
- n_removed: int. Number of CpGs removed.
- n_kept: int. Number of CpGs retained.
Raises¶
ValueError
If min_samples_per_group is supplied without groups or if group labels are misaligned.
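The global filter reduces to a one-line pandas expression. An illustrative sketch of the missingness criterion (not the per-group branch):

```python
import numpy as np
import pandas as pd

# Drop CpGs whose NaN fraction across samples exceeds max_missing_rate (0.2).
M = pd.DataFrame(
    [[0.10, 0.20, np.nan, 0.30],
     [np.nan, np.nan, np.nan, 0.50],
     [0.70, 0.80, 0.90, 0.60]],
    index=["cg01", "cg02", "cg03"],
    columns=["s1", "s2", "s3", "s4"],
)
missing_rate = M.isna().mean(axis=1)          # per-CpG NaN fraction
filtered = M[missing_rate <= 0.2]
n_removed, n_kept = len(M) - len(filtered), len(filtered)
```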
filter_min_per_group¶
filter_min_per_group(M: 'pd.DataFrame', groups: 'pd.Series', min_per_group: 'int' = 5, verbose: 'bool' = True) -> 'pd.DataFrame'
Retain only CpGs with at least min_per_group non-missing values in one experimental group.
Useful for ensuring sufficient representation before differential analysis.
Parameters¶
- M: pd.DataFrame. Methylation matrix (CpGs × samples) possibly containing NaN.
- groups: pd.Series. Group membership for each sample (index must match `M.columns`).
- min_per_group: int, default 5. Minimum number of observed (non-missing) values required per group.
- verbose: bool, default True. Log a concise summary of filtering results.
Returns¶
pd.DataFrame Filtered matrix containing only qualifying CpGs.
Raises¶
ValueError
If group labels do not cover all samples in M.
impute_missing_values¶
impute_missing_values(M: 'pd.DataFrame', method: 'str' = 'mean', k: 'int' = 5, use_sample_knn: 'bool' = False) -> 'pd.DataFrame'
Impute missing methylation values using row-wise statistics or K-nearest neighbours.
Parameters¶
- M: pd.DataFrame. CpG × sample matrix with possible NaN entries.
- method: {"mean", "median", "knn"}, default "mean". Imputation strategy.
- k: int, default 5. Number of neighbours used for KNN imputation.
- use_sample_knn: bool, default False. If True, KNN is performed across samples (samples ≪ CpGs) rather than across CpGs. Dramatically faster for typical array datasets.
Returns¶
pd.DataFrame Imputed matrix with the same index/columns as input.
Notes¶
- `mean` and `median` are applied row-wise (per CpG).
- `knn` uses scikit-learn's `KNNImputer` with distance weighting.
- No imputation is performed if the matrix contains no missing values.
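The default row-wise strategy is easy to reproduce by hand. An illustrative sketch of "mean" imputation (the KNN path delegates to scikit-learn and is omitted here):

```python
import numpy as np
import pandas as pd

# Row-wise mean imputation: each CpG's missing entries are filled with
# that CpG's observed mean across samples.
M = pd.DataFrame([[0.2, np.nan, 0.4],
                  [0.8, 0.9, np.nan]],
                 index=["cg01", "cg02"], columns=["s1", "s2", "s3"])
imputed = M.apply(lambda row: row.fillna(row.mean()), axis=1)
```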
Validation¶
Input validation and integrity checks for differential methylation analysis.
This module implements rigorous pre-analysis validation of data matrices, experimental designs, contrasts, sample alignment, and system resources. It ensures statistical estimability, correct pairing structure, and sufficient memory before launching computationally intensive analyses, preventing silent failures and ambiguous results.
Features¶
- Accurate memory footprint estimation with conservative peak usage prediction and automatic MemoryError on insufficient RAM
- Strict validation of two-group design vectors with automatic construction of intercept + indicator design matrix
- Comprehensive contrast validation including shape, zero-vector detection, and QR-based estimability checking
- Flexible string contrast syntax ("treatment-control") for simple two-column designs
- Thorough alignment verification between methylation data columns, design matrix rows, and optional sample identifiers
- Robust paired-sample validation ensuring exactly two observations per subject and balanced group representation within pairs
- Clear, actionable error messages and logging for rapid debugging in production pipelines
build_design¶
build_design(data: 'pd.DataFrame', categorical: 'list[str]' = None, add_intercept: 'bool' = True, drop_first: 'bool' = True) -> 'pd.DataFrame'
Build a design matrix for linear modeling.
Parameters¶
- data: pd.DataFrame. Columns are variables used in the design (condition, patient, batch, age, sex, ...).
- categorical: list of str, optional. Which columns should be treated as categorical. If None, infer automatically.
- add_intercept: bool, default True. Whether to add an intercept column of 1s.
- drop_first: bool, default True. Drop the first dummy level to avoid collinearity.
Returns¶
pd.DataFrame Numeric design matrix.
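For a simple two-group comparison the resulting matrix can be sketched by hand (illustration only; column naming here is assumed, not dmeth's exact output):

```python
import pandas as pd

# Intercept plus a 0/1 indicator for the non-reference level; the first
# dummy level is dropped to avoid collinearity with the intercept.
pheno = pd.DataFrame({"group": ["control", "control", "case", "case"]},
                     index=["s1", "s2", "s3", "s4"])
design = pd.DataFrame({
    "intercept": 1.0,
    "group_case": (pheno["group"] == "case").astype(float),
}, index=pheno.index)
```

With this design, the contrast vector [0, 1] tests the case-vs-control coefficient.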
check_analysis_memory¶
check_analysis_memory(M: 'pd.DataFrame', warn_threshold_gb: 'float' = 8.0)
Conservatively estimate RAM requirements for a full-matrix differential analysis.
- Calculates current data footprint, projects peak usage (approximately 4 times input size), and compares against available system memory.
- Raises MemoryError early when the projected peak approaches available memory, forcing use of chunked mode.
Parameters¶
- M: pd.DataFrame. Methylation matrix (CpGs × samples).
- warn_threshold_gb: float, default 8.0. Issue a warning if estimated peak exceeds this value.
Returns¶
dict
Keys: data_gb, peak_gb, available_gb.
Raises¶
MemoryError
If projected peak consumption exceeds ~80% of available RAM. In such cases fit_differential_chunked should be used instead.
validate_alignment¶
validate_alignment(data: 'pd.DataFrame', design_matrix: 'Union[np.ndarray, pd.DataFrame]', sample_names: 'Optional[Sequence]' = None, paired_ids: 'Optional[Sequence]' = None, group_col_idx: 'int' = 1) -> 'Tuple[pd.Index, Union[pd.DataFrame, np.ndarray], Optional[np.ndarray]]'
Ensure perfect alignment between methylation data, design matrix, and optional pairing information.
Verifies sample counts, column ordering, duplicate names, and (if supplied) paired-subject structure.
Parameters¶
- data: pd.DataFrame. Methylation matrix with samples in columns.
- design_matrix: np.ndarray or pd.DataFrame. Corresponding design matrix.
- sample_names: Sequence or None. Explicit ordered list of sample identifiers (required only if column names differ).
- paired_ids: Sequence or None. Subject/block identifiers for paired designs; each ID must appear exactly twice.
- group_col_idx: int, default 1. Column in design_matrix encoding group membership (used to verify balanced pairing).
Returns¶
tuple (ordered_data_columns, validated_design_matrix, validated_paired_array_or_None)
Raises¶
ValueError On any misalignment, duplicate column names, or invalid pairing structure (e.g., a subject with both samples in the same group).
validate_contrast¶
validate_contrast(design_matrix: 'Union[np.ndarray, pd.DataFrame]', contrast: 'Union[np.ndarray, Sequence[float]]') -> 'np.ndarray'
Validate and normalize a numeric contrast vector against a design matrix.
Parameters¶
- design_matrix: np.ndarray or pd.DataFrame. Full design matrix (n_samples × n_coefficients).
- contrast: np.ndarray or Sequence[float]. Contrast vector (length must equal n_coefficients).
Returns¶
np.ndarray Validated contrast vector of shape (n_coefficients,) and dtype float64.
Raises¶
TypeError
If inputs are not numeric or not the correct type.
ValueError
If contrast is zero, wrong length, or non-estimable.
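The estimability check mentioned above can be sketched with a pseudo-inverse instead of QR (an equivalent illustration, not dmeth's exact code):

```python
import numpy as np

# A contrast c is estimable when it lies in the row space of the design
# matrix X, i.e. the row-space projector pinv(X) @ X maps c onto itself.
X = np.array([[1, 0], [1, 0], [1, 1], [1, 1]], dtype=float)  # intercept + case
c = np.asarray([0.0, 1.0], dtype=float)                      # case vs control

if c.shape[0] != X.shape[1]:
    raise ValueError("contrast length must equal number of coefficients")
if not c.any():
    raise ValueError("contrast must not be the zero vector")
estimable = bool(np.allclose(np.linalg.pinv(X) @ X @ c, c))
```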
Downstream Analysis¶
Annotation¶
Annotation and functional interpretation utilities for DNA methylation analysis.
This module delivers a complete suite of tools for translating CpG- and region-level differential methylation results into biological context. It enables rapid gene annotation, pathway- and gene-set enrichment analysis, computation of pathway-level methylation activity, correlation with gene expression, and reliable coordinate liftover across genome assemblies — all implemented with performance and robustness suitable for genome-wide studies.
Features¶
- Ultra-fast nearest-gene annotation using IntervalTree with graceful fallback to distance-based mapping
- Fisher’s exact test gene-set enrichment with automatic background correction, size filtering, and FDR adjustment
- Flexible pathway-level methylation scoring (mean/median/sum) from user-provided gene-to-pathway mappings
- Sample-aware Pearson/Spearman correlation between CpG methylation and gene expression with robust overlap handling
- High-accuracy genomic liftover via pyliftover (hg19 ↔ hg38 and other builds) with per-region success tracking
- Comprehensive handling of missing dependencies, empty inputs, and chromosome naming conventions
- Preservation of original indices and seamless integration with DMeth result tables
adjust_pvalues¶
adjust_pvalues(pvals: Union[pandas.core.series.Series, ndarray, List[float]], method: str = 'fdr_bh') -> pandas.core.series.Series
Apply multiple-testing correction to raw p-values using statsmodels.
Handles NaN values robustly (treated as non-significant) and preserves the original index when input is a pandas Series.
Parameters¶
- pvals: array-like. Raw p-values (0 ≤ p ≤ 1).
- method: str, default "fdr_bh". Correction method passed to `statsmodels.stats.multitest.multipletests`. Supported: "bonferroni", "holm", "fdr_bh", "fdr_by", "sidak", "none".
Returns¶
pd.Series Adjusted p-values with identical index/order as input.
correlate_methylation_expression¶
correlate_methylation_expression(beta: 'pd.DataFrame', expression: 'pd.DataFrame', gene_map: 'Optional[Dict[str, str]]' = None, method: 'str' = 'pearson', override_index_alignment: 'bool' = False) -> 'pd.DataFrame'
Compute sample-wise correlation between CpG methylation and gene expression.
Supports both one-to-one (same index) and many-to-one (custom CpG→gene mapping) scenarios.
Parameters¶
beta, expression : pd.DataFrame
Methylation (CpGs × samples) and expression (genes × samples) matrices.
- gene_map: dict or None
Explicit mapping from CpG ID → gene symbol (required for cis-analysis).
- method: {"pearson", "spearman"}, default "pearson"
Correlation coefficient to compute.
- override_index_alignment: bool, default False
Proceed even with <2 overlapping samples (useful for exploratory checks).
Returns¶
pd.DataFrame
Correlation results with columns r and pval (indexed by CpG and optionally gene).
gene_set_enrichment¶
gene_set_enrichment(gene_list: 'List[str]', background: 'Optional[List[str]]' = None, gene_sets: 'Optional[Dict[str, List[str]]]' = None, method: 'str' = 'fisher', pval_cutoff: 'float' = 0.05, min_set_size: 'int' = 5, max_set_size: 'int' = 500) -> 'pd.DataFrame'
Perform over-representation analysis (Fisher’s exact test) on a list of genes against predefined gene sets (e.g., GO, KEGG, Reactome).
Automatically filters gene sets by size and applies FDR correction.
Parameters¶
- gene_list: list[str]. Genes of interest (e.g., nearest genes of significant DMS/DMRs).
- background: list[str] or None. Background gene universe. Defaults to union of all genes in `gene_sets`.
- gene_sets: dict[str, list[str]]. Mapping from pathway/term name to member genes.
- min_set_size, max_set_size: int. Exclude overly small or large gene sets (default 5–500).
- pval_cutoff: float, default 0.05. Return only terms with adjusted p-value ≤ this threshold.
Returns¶
pd.DataFrame
Enriched terms with columns: term, pvalue, padj, oddsratio, overlap, set_size.
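Per gene set, the test reduces to a single 2×2 Fisher's exact test over the background universe. An illustrative sketch with toy gene sets (names here are arbitrary examples):

```python
from scipy.stats import fisher_exact

# 2x2 table: in-list/out-of-list vs. in-set/out-of-set counts.
gene_list = {"TP53", "BRCA1", "MLH1"}
gene_set = {"TP53", "BRCA1", "ATM", "CHEK2"}
background = {"TP53", "BRCA1", "MLH1", "ATM", "CHEK2", "EGFR", "KRAS", "MYC"}

a = len(gene_list & gene_set)                   # in list, in set
b = len(gene_list - gene_set)                   # in list, not in set
c = len((background - gene_list) & gene_set)    # not in list, in set
d = len((background - gene_list) - gene_set)    # neither
oddsratio, pvalue = fisher_exact([[a, b], [c, d]], alternative="greater")
```

Repeating this per term and FDR-adjusting the resulting p-values yields the enrichment table described above.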
liftover_coordinates¶
liftover_coordinates(regions: 'pd.DataFrame', from_build: 'str' = 'hg19', to_build: 'str' = 'hg38', chr_col: 'str' = 'chr', start_col: 'str' = 'start', end_col: 'str' = 'end') -> 'pd.DataFrame'
Convert genomic regions between genome assemblies using pyliftover (e.g., hg19 ↔ hg38).
Handles both single-base and interval coordinates with per-row success reporting.
Parameters¶
- regions: pd.DataFrame. Table containing chromosome, start, and end columns.
- from_build, to_build: str. Source and target genome builds (default: hg19 → hg38).
- chr_col, start_col, end_col: str. Column names for genomic coordinates.
Returns¶
pd.DataFrame
Original table augmented with:
lifted_chr, lifted_start, lifted_end
lifted (boolean indicating successful conversion)
Raises¶
RuntimeError
If pyliftover is not installed.
map_dms_to_genes¶
map_dms_to_genes(dms: 'pd.DataFrame', genes: 'pd.DataFrame', cpg_chr_col: 'str' = 'chr', cpg_pos_col: 'str' = 'pos', gene_chr_col: 'str' = 'chr', gene_start_col: 'str' = 'start', gene_end_col: 'str' = 'end', gene_name_col: 'str' = 'gene_symbol', promoter_upstream: 'int' = 1500, proximal_cutoff: 'int' = 200, promoter_downstream: 'int' = 500, distal_promoter_max: 'int' = 5000, max_distance: 'int' = 1000000) -> 'pd.DataFrame'
Annotate CpGs with nearest gene and regulatory context.
Parameters¶
- dms: pd.DataFrame. CpG positions with chromosome and position columns.
- genes: pd.DataFrame. Gene annotations with chromosome, start, end, gene_symbol, and optional strand.
- cpg_chr_col, cpg_pos_col: str. Column names for CpG chromosome and position.
- gene_chr_col, gene_start_col, gene_end_col, gene_name_col: str. Column names for gene annotation.
- promoter_upstream: int. Upstream region from TSS considered promoter.
- proximal_cutoff: int. Distance near TSS classified as proximal promoter.
- promoter_downstream: int. Downstream region from TSS considered promoter.
- distal_promoter_max: int. Maximum distance upstream of TSS classified as distal promoter.
- max_distance: int. Maximum distance to assign nearest gene if no overlap.
Returns¶
pd.DataFrame
Original dms with added columns:
- nearest_gene : str or NaN
- relation : {"proximal_promoter", "distal_promoter", "promoter_downstream",
"distal_upstream", "gene_body", "intergenic"}
- distance_bp : int or NaN
pathway_methylation_scores¶
pathway_methylation_scores(beta: 'pd.DataFrame', annotation: 'pd.DataFrame', pathway_db: 'Dict[str, List[str]]', method: 'str' = 'mean') -> 'pd.DataFrame'
Collapse CpG-level beta values into pathway-level methylation scores per sample.
Useful for downstream pathway-activity modeling or visualization.
Parameters¶
- beta: pd.DataFrame. Beta-value matrix (CpGs × samples).
- annotation: pd.DataFrame. Mapping from CpG identifiers to gene symbols (index → gene).
- pathway_db: dict[str, list[str]]. Dictionary of pathways → list of associated genes.
- method: {"mean", "median", "sum"}, default "mean". Aggregation function applied across CpGs belonging to each pathway.
Returns¶
pd.DataFrame Pathways × samples matrix of aggregated methylation scores.
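The two-level mapping (CpG → gene → pathway) followed by aggregation can be sketched directly (illustrative re-implementation of the "mean" method; pathway and gene names are toy examples):

```python
import pandas as pd

# CpGs map to genes, genes map to pathways; per-pathway beta values are
# averaged per sample.
beta = pd.DataFrame({"s1": [0.2, 0.4, 0.8], "s2": [0.3, 0.5, 0.9]},
                    index=["cg01", "cg02", "cg03"])
cpg_to_gene = {"cg01": "TP53", "cg02": "TP53", "cg03": "EGFR"}
pathway_db = {"p53_signaling": ["TP53"], "EGFR_signaling": ["EGFR"]}

rows = {}
for pathway, genes in pathway_db.items():
    cpgs = [c for c, g in cpg_to_gene.items() if g in genes]
    rows[pathway] = beta.loc[cpgs].mean()        # aggregate across CpGs
scores = pd.DataFrame(rows).T                    # pathways x samples
```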
Deconvolution¶
Cell-type deconvolution utilities for bulk DNA methylation data.
This module provides fast, reference-based estimation of cell-type proportions from epigenome-wide methylation profiles using constrained non-negative least squares (NNLS). It supports integration of published reference matrices (e.g., blood, brain, tumor microenvironment) and enables accurate correction for cellular heterogeneity in differential methylation studies.
Features¶
- High-performance NNLS deconvolution with automatic feature alignment between bulk and reference
- Optional parallel processing via joblib for large cohorts (hundreds to thousands of samples)
- Robust normalization of estimated proportions to sum-to-1 per sample
- Graceful handling of missing joblib dependency with automatic fallback to serial execution
- Clear error reporting when no overlapping CpGs exist between dataset and reference
- Direct compatibility with standard beta-value matrices (CpGs × samples) used throughout DMeth
- Extensible design for future reference-free (e.g., EpiDISH HEpiDISH, MeDeCom) or projective methods
estimate_cell_composition¶
estimate_cell_composition(beta: DataFrame, ref_profiles: DataFrame, method: str = 'nnls', n_jobs: int = 1) -> DataFrame
Perform reference-based cell-type deconvolution of bulk DNA methylation data using constrained non-negative least squares (NNLS).
Estimates the relative proportions of predefined cell types in each sample by solving the linear mixture model: β_bulk ≈ R · w
where R contains cell-type-specific methylation reference profiles and w ≥ 0 are the unknown proportions.
Parameters¶
- beta: pd.DataFrame. Bulk beta-value matrix with CpGs as rows (index) and samples as columns. Values must be in [0, 1]; missing values are not supported.
- ref_profiles: pd.DataFrame. Reference matrix of pure cell-type methylation profiles. Must have the same CpG index as `beta` (or a subset thereof) and cell types as columns.
- method: str, default "nnls". Deconvolution algorithm. Currently only "nnls" (scipy.optimize.nnls) is implemented.
- n_jobs: int, default 1. Number of parallel processes for sample-wise deconvolution. Requires `joblib`; automatically falls back to serial execution if unavailable or set to 1.
Returns¶
pd.DataFrame
Estimated cell-type proportions with:
rows = samples (same order and names as beta.columns)
columns = cell types (same as ref_profiles.columns)
values constrained to ≥0 and normalized to sum to 1 per sample
Raises¶
ValueError
If no overlapping CpG features exist between beta and ref_profiles.
Notes¶
- Automatic feature alignment: only CpGs present in both matrices are used.
- Proportions are forcibly normalized to sum to 1 (with protection against division by zero).
- Highly efficient for typical blood, brain, or tumor microenvironment references (e.g., FlowSorted.Blood.EPIC, Houseman, etc.).
- Designed for seamless integration with DMeth preprocessing pipelines (beta-value input expected).
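The mixture model β_bulk ≈ R · w solves per sample with a single scipy call. A minimal noise-free sketch with a synthetic two-cell-type reference:

```python
import numpy as np
from scipy.optimize import nnls

# Build a synthetic bulk profile as R @ w and recover the proportions.
rng = np.random.default_rng(1)
R = rng.uniform(size=(200, 2))     # 200 CpGs x 2 reference cell types
w_true = np.array([0.7, 0.3])
bulk = R @ w_true

w_hat, _ = nnls(R, bulk)           # non-negative least squares fit
w_hat = w_hat / w_hat.sum()        # normalize to sum to 1 per sample
```

With real data the system is noisy and over-determined, so the recovered proportions are approximate rather than exact as here.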
Downstream Stats¶
Core statistical utilities for DNA methylation analysis downstream of differential testing.
This module provides a comprehensive suite of high-performance, publication-grade functions for p-value adjustment, effect-size computation, delta-beta calculation, reproducible DMR calling via sliding-window clustering, regional summarization, and cross-dataset reproducibility assessment. Designed for seamless integration into DMeth pipelines, all functions operate efficiently on large epigenome-wide datasets while preserving genomic coordinates and CpG identifiers.
Features¶
- Multiple p-value correction methods via statsmodels with robust NaN handling
- Stouffer’s Z-score method for meta-analysis across studies or batches
- Cohen’s d and Hedges’ g effect size estimation with proper small-sample correction
- Flexible delta-beta computation with optional absolute values and index alignment
- Threshold-based filtering of DMS results supporting logFC, delta-beta, and directional constraints
- Fast, vectorized sliding-window DMR discovery with configurable gap merging and minimum CpG requirements
- Comprehensive DMR summarization and cross-dataset reproducibility metrics (Jaccard, concordance, Spearman correlation)
- Full preservation of CpG identifiers and genomic coordinates throughout all operations
adjust_pvalues¶
adjust_pvalues(pvals: Union[pandas.core.series.Series, ndarray, List[float]], method: str = 'fdr_bh') -> pandas.core.series.Series
Apply multiple-testing correction to raw p-values using statsmodels.
Handles NaN values robustly (treated as non-significant) and preserves the original index when input is a pandas Series.
Parameters¶
- pvals: array-like. Raw p-values (0 ≤ p ≤ 1).
- method: str, default "fdr_bh". Correction method passed to `statsmodels.stats.multitest.multipletests`. Supported: "bonferroni", "holm", "fdr_bh", "fdr_by", "sidak", "none".
Returns¶
pd.Series Adjusted p-values with identical index/order as input.
compute_delta_beta¶
compute_delta_beta(mean_beta_group1: Union[pandas.core.series.Series, ndarray], mean_beta_group2: Union[pandas.core.series.Series, ndarray], as_abs: bool = False) -> pandas.core.series.Series
Calculate per-CpG difference in mean beta values between two groups.
Automatically aligns inputs by index; supports absolute differences.
Parameters¶
mean_beta_group1, mean_beta_group2 : Series-like
Group-wise mean beta values.
- as_abs: bool, default False
Return absolute delta-beta if True.
Returns¶
pd.Series Delta-beta values (positive = group1 > group2).
compute_dms_reproducibility¶
compute_dms_reproducibility(res1: DataFrame, res2: DataFrame, id_col: Optional[str] = None, effect_col: str = 'logFC', pval_col: str = 'padj', pval_thresh: float = 0.05) -> Dict[str, Any]
Quantify reproducibility of differential methylation signals across two independent analyses or cohorts.
Metrics include:
- Total/overlap feature counts
- Jaccard index
- Number of overlapping significant CpGs
- Directional concordance
- Spearman correlation of effect sizes
Parameters¶
res1, res2 : pd.DataFrame
Two differential result tables (same CpG index preferred).
effect_col, pval_col : str
Columns for effect size and adjusted p-value.
- pval_thresh: float, default 0.05. Significance threshold applied to `pval_col` when counting significant and overlapping CpGs.
Returns¶
dict Comprehensive reproducibility statistics.
compute_effect_size¶
compute_effect_size(beta_group1: DataFrame, beta_group2: DataFrame, method: str = 'cohens_d') -> pandas.core.series.Series
Compute standardized effect size (Cohen’s d or Hedges’ g) for each CpG between two groups.
Uses pooled standard deviation with proper small-sample correction for Hedges’ g.
Parameters¶
beta_group1, beta_group2 : pd.DataFrame
Beta matrices (CpGs × samples) for each group.
- method: {"cohens_d", "hedges_g"}, default "cohens_d". Effect-size metric to compute.
Returns¶
pd.Series Effect size per CpG (positive = higher in group1).
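The pooled-SD computation can be sketched with numpy (illustrative; the Hedges' g correction factor is noted in a comment):

```python
import numpy as np

# Cohen's d per CpG: mean difference divided by the pooled SD.
g1 = np.array([[0.80, 0.82, 0.78],     # CpGs x samples, group 1
               [0.30, 0.32, 0.28]])
g2 = np.array([[0.60, 0.62, 0.58],     # group 2
               [0.31, 0.29, 0.33]])
n1, n2 = g1.shape[1], g2.shape[1]
pooled_var = ((n1 - 1) * g1.var(axis=1, ddof=1)
              + (n2 - 1) * g2.var(axis=1, ddof=1)) / (n1 + n2 - 2)
d = (g1.mean(axis=1) - g2.mean(axis=1)) / np.sqrt(pooled_var)
# Hedges' g multiplies d by the small-sample correction
# 1 - 3 / (4 * (n1 + n2) - 9).
```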
filter_dms¶
filter_dms(res: DataFrame, lfc_col: str = 'logFC', pval_col: str = 'padj', delta_beta_col: Optional[str] = None, pval_thresh: float = 0.05, lfc_thresh: float = 0.0, delta_beta_thresh: Optional[float] = None, direction: Optional[str] = None) -> DataFrame
Apply multi-criterion filtering to differential methylation results.
- Supports:
- Adjusted p-value threshold
- Minimum |logFC|
- Minimum |Δβ|
- Directional filtering (“hyper” or “hypo”)
Parameters¶
- res: pd.DataFrame. Differential results table.
- lfc_col, pval_col, delta_beta_col: str. Column names (defaults: "logFC", "padj", optional).
- pval_thresh, lfc_thresh, delta_beta_thresh: Numeric thresholds.
- direction: {"hyper", "hypo", None}. Restrict to hypermethylated, hypomethylated, or both.
Returns¶
pd.DataFrame Subset of rows passing all specified criteria.
find_dmrs_by_sliding_window¶
find_dmrs_by_sliding_window(dms: DataFrame, annotation: DataFrame, chr_col: str = 'chr', pos_col: str = 'pos', pval_col: str = 'padj', pval_thresh: float = 0.05, delta_beta_col: Optional[str] = 'delta_beta', max_gap: int = 500, min_cpgs: int = 3, merge_distance: Optional[int] = None, use_intervaltree: bool = True) -> DataFrame
Identify differentially methylated regions (DMRs) by clustering spatially proximate significant CpGs.
Uses a fast sliding-window/gap-merging approach per chromosome.
Parameters¶
- dms: pd.DataFrame. Significant DMS results (after filtering).
- annotation: pd.DataFrame. CpG annotation with `chr` and `pos` columns.
- max_gap: int, default 500. Maximum distance (bp) to bridge adjacent significant CpGs.
- min_cpgs: int, default 3. Minimum number of significant CpGs required to call a DMR.
- merge_distance: int or None. If set, merge DMRs closer than this distance.
Returns¶
pd.DataFrame
One row per DMR with columns:
chr, start, end, n_cpgs, mean_delta_beta, mean_logFC, min_padj, cpgs (list).
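The per-chromosome gap-merging logic can be sketched in a few lines (an illustration of the clustering idea, not dmeth's vectorized implementation):

```python
# Cluster sorted significant CpG positions on one chromosome, starting a
# new cluster whenever the gap exceeds max_gap; clusters with at least
# min_cpgs members become DMRs.
positions = [100, 350, 600, 5000, 5200, 9000]   # sorted bp positions
max_gap, min_cpgs = 500, 3

clusters, current = [], [positions[0]]
for pos in positions[1:]:
    if pos - current[-1] <= max_gap:
        current.append(pos)
    else:
        clusters.append(current)
        current = [pos]
clusters.append(current)

dmrs = [(c[0], c[-1], len(c)) for c in clusters if len(c) >= min_cpgs]
```

Here only the first cluster has enough CpGs: the pair at 5000/5200 and the singleton at 9000 are discarded.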
stouffer_combined_pvalue¶
stouffer_combined_pvalue(pvals)
Combine independent p-values across studies or replicates using Stouffer’s Z-score method (equal weighting).
Parameters¶
- pvals: array-like. List or array of p-values to combine.
Returns¶
float Single combined p-value.
Raises¶
ValueError If input is empty or contains values outside [0, 1].
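Stouffer's method with equal weights is a three-line computation. An illustrative sketch using scipy's normal distribution:

```python
import numpy as np
from scipy.stats import norm

# p -> Z via the inverse survival function, sum, rescale by sqrt(k),
# then map the combined Z back to a single p-value.
pvals = np.array([0.04, 0.10, 0.02])
z = norm.isf(pvals)                          # per-study Z-scores
z_comb = z.sum() / np.sqrt(len(pvals))
p_comb = float(norm.sf(z_comb))
```

Because the individual signals agree, the combined p-value is smaller than any single input.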
summarize_regions¶
summarize_regions(dmrs: DataFrame, summary_cols: Optional[List[str]] = None) -> DataFrame
Generate a concise summary table of discovered DMRs.
Reports total count, median length, average CpG density, mean delta-beta, and strongest significance.
Parameters¶
- dmrs: pd.DataFrame. Output from `find_dmrs_by_sliding_window`.
- summary_cols: list[str] or None. Columns to summarize (defaults to key metrics).
Returns¶
pd.DataFrame Single-row summary suitable for manuscript tables.
Helpers¶
General-purpose utilities for DNA methylation analysis workflows.
This lightweight module contains essential helper functions used across the DMeth package for consistent data handling and summarization. It provides robust chromosome string normalization and flexible group-wise aggregation of beta-value matrices, ensuring compatibility and reproducibility in downstream statistical and annotation steps.
Features¶
- Robust chromosome identifier normalization with automatic 'chr' prefix addition and preservation of X/Y conventions
- Safe handling of mixed chromosome formats (e.g., '1', 'chr1', 'CHR1' → 'chr1', 'chrX')
- Efficient group-wise summarization of per-sample beta values with support for arbitrary aggregation functions
- Automatic computation of both mean and variance per group for interpretability and modeling
- Graceful handling of missing groups, empty inputs, and non-overlapping sample sets
- Index-preserving operations fully compatible with CpG × sample matrix conventions
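The chromosome normalization behaviour described above can be sketched as follows. `normalize_chr` is an illustrative name, not necessarily the exact dmeth helper signature:

```python
def normalize_chr(value):
    """Normalize mixed chromosome labels to the UCSC 'chr' convention."""
    s = str(value).strip()
    # Strip any existing prefix case-insensitively, then re-add 'chr'.
    if s.lower().startswith("chr"):
        s = s[3:]
    # Preserve X/Y/M conventions in upper case; numerals pass through.
    if s.upper() in {"X", "Y", "M", "MT"}:
        s = s.upper()
    return f"chr{s}"

print([normalize_chr(v) for v in ["1", "chr1", "CHR1", "x", "chrY"]])
# ['chr1', 'chr1', 'chr1', 'chrX', 'chrY']
```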
summarize_groups¶
summarize_groups(beta: DataFrame, groups: Series, summary_func: Callable = numpy.mean) -> DataFrame
Compute group-wise summary statistics (mean and variance) across samples for a beta-value matrix.
Designed for rapid interpretation of methylation levels across experimental conditions or cell types.
Parameters¶
- beta: pd.DataFrame
  Beta-value matrix with CpGs as rows (index) and samples as columns.
- groups: pd.Series
  Sample-to-group mapping. Index must align with `beta.columns`.
- summary_func: callable, default np.mean
  Aggregation function applied per group (e.g., `np.mean`, `np.median`).
Returns¶
pd.DataFrame
New DataFrame with columns:
mean_{group} – group-wise summary using summary_func
var_{group} – sample variance within each group (ddof=1)
Missing groups or empty intersections yield `NaN` columns.
Notes¶
- Fully preserves CpG index.
- Handles partial or missing group overlap gracefully.
- Ideal for generating input tables for delta-beta calculation, visualization, or reporting.
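The per-group mean/variance aggregation can be illustrated without pandas. The dict-based `summarize_groups` below is a dependency-free sketch of the same logic (the real function operates on DataFrames and preserves the CpG index):

```python
from statistics import mean, variance

def summarize_groups(beta_rows, groups):
    """beta_rows: {cpg: {sample: beta}}; groups: {sample: group}.
    Returns {cpg: {"mean_<g>": ..., "var_<g>": ...}} per group,
    with sample variance (ddof=1) and NaN for empty/singleton groups."""
    out = {}
    group_names = sorted(set(groups.values()))
    for cpg, values in beta_rows.items():
        row = {}
        for g in group_names:
            vals = [v for s, v in values.items() if groups.get(s) == g]
            row[f"mean_{g}"] = mean(vals) if vals else float("nan")
            row[f"var_{g}"] = variance(vals) if len(vals) > 1 else float("nan")
        out[cpg] = row
    return out

beta = {"cg001": {"s1": 0.2, "s2": 0.4, "s3": 0.8}}
groups = {"s1": "ctrl", "s2": "ctrl", "s3": "case"}
print(summarize_groups(beta, groups))
```

A singleton group (here `case`) gets a `NaN` variance, mirroring the documented graceful handling of sparse groups.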
Signature¶
Biomarker signature discovery and predictive modeling for DNA methylation data.
This module provides a streamlined, production-ready framework for translating differential methylation results into high-performance diagnostic, prognostic, or predictive biomarker panels. It supports feature selection from statistical outputs, cross-validated model training with state-of-the-art algorithms, and rigorous independent validation of classification or regression performance.
Features¶
- Simple yet effective signature selection via top-ranked CpGs by p-value or moderated t-statistic
- Flexible predictive modeling with Random Forest and Elastic Net (with built-in hyperparameter tuning via CV)
- Automatic task detection and appropriate stratification (classification vs regression)
- Comprehensive cross-validation and held-out test set evaluation with standard metrics (AUC, accuracy, R², RMSE)
- Full integration with DMeth result tables and sample-level beta matrices (features × samples orientation)
- Reproducible training through fixed random seeds and stratified splitting
- Extensible architecture for future stability selection, recursive feature elimination, or multi-omics integration
model_dms_for_prediction¶
model_dms_for_prediction(beta: 'pd.DataFrame', labels: 'pd.Series', method: 'str' = 'random_forest', n_splits: 'int' = 5, random_state: 'int' = 42, task: 'Optional[str]' = None) -> 'Dict[str, Any]'
Train and evaluate a predictive model using DNA methylation signatures.
Automatically detects classification vs regression tasks and applies appropriate modeling and evaluation strategies.
Parameters¶
- beta: pd.DataFrame
  Beta-value matrix with CpGs as rows and samples as columns (features × samples orientation).
- labels: pd.Series
  Target variable aligned with `beta.columns`.
- method: {"random_forest", "elasticnet"}, default "random_forest"
  Predictive algorithm:
  - random_forest: 500 trees with parallel training
  - elasticnet: LogisticRegressionCV or ElasticNetCV with built-in cross-validated regularization
- n_splits: int, default 5
  Number of cross-validation folds for hyperparameter tuning and performance estimation.
- random_state: int, default 42
  Seed for reproducible splitting and model initialization.
- task: {"classification", "regression"} or None, optional
  Force task type. If None, inferred from the label distribution.
Returns¶
dict
Contains:
estimator: final fitted model (trained on full data)
cv_results: detailed cross-validation scores
test_auc / test_accuracy (classification) or test_rmse / test_r2 (regression) on a stratified 20% held-out set
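When `task=None`, the task type is inferred from the label distribution. dmeth's exact rule is not documented here; the sketch below shows one plausible heuristic (`infer_task` and `max_classes` are assumptions for illustration, not dmeth API): string labels or few distinct integer values suggest classification, anything else regression.

```python
def infer_task(labels, max_classes=10):
    """Heuristic sketch of classification-vs-regression detection."""
    unique = set(labels)
    # Non-numeric labels can only be class names.
    if any(isinstance(v, str) for v in unique):
        return "classification"
    # Few distinct integer-valued labels look like encoded classes.
    if len(unique) <= max_classes and all(float(v).is_integer() for v in unique):
        return "classification"
    return "regression"

print(infer_task(["case", "control", "case"]))  # classification
print(infer_task([1.3, 2.7, 0.4, 5.9]))        # regression
```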
select_signature_panel¶
select_signature_panel(res: 'pd.DataFrame', method: 'str' = 'top', top_n: 'int' = 100, importance_col: 'str' = 't') -> 'List[str]'
Extract a candidate biomarker panel from differential methylation results.
Parameters¶
- res: pd.DataFrame
  Differential analysis results table (e.g., output from `fit_differential`). Must contain at least one of `pval` or the column specified in `importance_col`.
- method: str, default "top"
  Feature selection strategy. Currently only "top" is implemented.
- top_n: int, default 100
  Number of top-ranked CpGs to retain.
- importance_col: str, default "t"
  Column used for ranking when `pval` is unavailable (e.g., moderated t-statistic).
Returns¶
List[str]
Ordered list of selected CpG identifiers (row names from res).
Notes¶
- Prioritizes `pval` for ranking if present.
- Falls back to descending order of `importance_col` otherwise.
- Future extensions will include stability selection and recursive feature elimination.
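The documented ranking rule is straightforward; the dict-based sketch below mirrors it without pandas (a sketch of the logic, not the dmeth implementation): rank by ascending `pval` when every row has one, otherwise by descending `importance_col`.

```python
def select_signature_panel(res, top_n=3, importance_col="t"):
    """res: {cpg_id: {"pval": ..., "t": ...}} -> top-ranked CpG IDs."""
    rows = list(res.items())
    if all("pval" in stats for _, stats in rows):
        rows.sort(key=lambda kv: kv[1]["pval"])        # ascending p-value
    else:
        rows.sort(key=lambda kv: -kv[1][importance_col])  # descending stat
    return [cpg for cpg, _ in rows[:top_n]]

res = {
    "cg01": {"pval": 0.20, "t": 1.1},
    "cg02": {"pval": 0.001, "t": 4.2},
    "cg03": {"pval": 0.04, "t": -2.5},
}
print(select_signature_panel(res, top_n=2))  # ['cg02', 'cg03']
```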
validate_signature¶
validate_signature(X_train: 'pd.DataFrame', y_train: 'Union[pd.Series, np.ndarray]', X_test: 'pd.DataFrame', y_test: 'Union[pd.Series, np.ndarray]', features: 'Sequence[str]', method: 'str' = 'elasticnet') -> 'Dict[str, Any]'
Independent validation of a pre-selected methylation signature on held-out data.
Re-trains the specified model on training data using only the provided features and reports performance on a separate test set.
Parameters¶
X_train, X_test : pd.DataFrame
Training and test methylation matrices (samples × features).
y_train, y_test : array-like
Corresponding labels.
- features: Sequence[str]
Subset of CpGs constituting the signature.
- method: str, default "elasticnet"
Model used for validation (same options as model_dms_for_prediction).
Returns¶
dict
Performance metrics:
Binary classification → auc and accuracy
Multi-class/regression → accuracy or rmse + r2
Notes¶
Provides an unbiased estimate of clinical/translational performance when the signature was derived on a separate discovery cohort.
Utilities¶
Logger¶
Centralised logging utilities for the DMeth DNA-methylation analysis suite.
This module configures a unified logger with the following features:
Features¶
- Timestamped log files automatically saved to `<output_dir>/log/`
- Simultaneous console (stdout) and file output
- Consistent formatting across all package modules
- A custom `ProgressAwareLogger` that seamlessly integrates tqdm progress bars:
  - `logger.progress("Processing samples", total=n)` starts a progress bar
  - `logger.progress_update(k)` advances it
  - Any regular log call (info/warning/error/etc.) automatically closes the active bar so that log messages are never corrupted by overlapping tqdm output
All other dmeth modules import the logger via get_logger().
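The "close the bar before logging" idea can be sketched with the stdlib alone; here tqdm is replaced by a plain counter so the snippet stays dependency-free (a minimal sketch, not dmeth's actual logger):

```python
import logging
import sys

class ProgressAwareLogger(logging.Logger):
    """Any normal log call first clears the active progress state,
    so log output never interleaves with a progress bar."""
    def __init__(self, name):
        super().__init__(name)
        self._active = None  # [description, done, total] or None

    def progress(self, desc, total):
        self._active = [desc, 0, total]

    def progress_update(self, k=1):
        if self._active:
            self._active[1] += k

    def handle(self, record):
        if self._active:          # close the "bar" before emitting
            self._active = None
        super().handle(record)

logging.setLoggerClass(ProgressAwareLogger)
log = logging.getLogger("dmeth_demo")
log.addHandler(logging.StreamHandler(sys.stdout))
log.setLevel(logging.INFO)

log.progress("Processing samples", total=10)
log.progress_update(3)
log.info("done")  # clears the progress state, then logs normally
```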
ProgressAwareLogger¶
ProgressAwareLogger(name) -> 'None'
Custom logger class that supports a temporary progress bar. The progress bar stays active until the next normal log call.
get_logger¶
get_logger(name: 'str' = 'dmeth') -> 'logging.Logger'
Return the central DMeth logger instance.
Parameters¶
- name: str, default "dmeth"
  Logger name. Typically left as the default.
Returns¶
logging.Logger The configured ProgressAwareLogger instance.
Plotting¶
Plotting utilities for downstream DNA methylation analysis.
This module provides a comprehensive set of visualisation functions for quality control, differential methylation analysis, dimensionality reduction, variance shrinkage diagnostics, and publication-ready summary figures. Functions operate on pandas DataFrames, ProcessedData objects, or differential methylation results and return standardised matplotlib (or optionally Plotly) figures.
Features¶
- Standardised figure creation
- Enhanced volcano plots with automatic significance colouring and annotation
- P-value QQ plots and histograms
- Sample-level PCA, t-SNE, and UMAP embeddings (static or interactive)
- Variance shrinkage diagnostics (limma-style)
- Mean-difference bar plots for top loci
- Multi-panel QC reports via plot_stage()
- Combined DMS/DMR summary visualisation (volcano + manhattan + heatmap)
ProcessedData¶
ProcessedData(M: DataFrame, pheno: DataFrame, ann: DataFrame, meta: Dict[str, Any] = <factory>) -> None
Central container for aligned methylation data used throughout the DMeth pipeline.
Guarantees that:
- All components (methylation matrix, sample metadata, probe annotation) share consistent string indices.
- Sample and probe alignment is validated on construction.
- A `meta` dictionary tracks processing history (normalisation, batch correction, QC metrics, etc.).
All downstream functions expect or return instances of this class.
Parameters¶
- M: pd.DataFrame
  Methylation matrix with CpG sites as rows and samples as columns. Values are typically beta (0–1) or M-values (−inf to +inf).
- pheno: pd.DataFrame
  Sample metadata table. Must be indexed by sample IDs (strings after construction).
- ann: pd.DataFrame or None
  Optional probe/CpG annotation table (e.g., Illumina manifest). Must be indexed by CpG IDs. If None, probe-level alignment checks are skipped.
- meta: dict, optional
  Free-form dictionary storing pipeline provenance and parameters. Pre-populated with sensible defaults if not provided.
Attributes¶
- M: pd.DataFrame
  Methylation matrix (string-indexed rows and columns).
- pheno: pd.DataFrame
  Sample metadata (string-indexed).
- ann: pd.DataFrame or None
  Probe annotation (string-indexed if present).
- meta: dict
  Processing metadata (e.g., {"matrix_type": "beta", "normalized": True, ...}).
Notes¶
The __post_init__ method automatically:
- Converts all relevant indices to strings.
- Validates alignment between components.
- Raises an informative `KeyError` on mismatch.
correlate_methylation_expression¶
correlate_methylation_expression(beta: 'pd.DataFrame', expression: 'pd.DataFrame', gene_map: 'Optional[Dict[str, str]]' = None, method: 'str' = 'pearson', override_index_alignment: 'bool' = False) -> 'pd.DataFrame'
Compute sample-wise correlation between CpG methylation and gene expression.
Supports both one-to-one (same index) and many-to-one (custom CpG→gene mapping) scenarios.
Parameters¶
beta, expression : pd.DataFrame
Methylation (CpGs × samples) and expression (genes × samples) matrices.
- gene_map: dict or None
Explicit mapping from CpG ID → gene symbol (required for cis-analysis).
- method: {"pearson", "spearman"}, default "pearson"
Correlation coefficient to compute.
- override_index_alignment: bool, default False
Proceed even with <2 overlapping samples (useful for exploratory checks).
Returns¶
pd.DataFrame
Correlation results with columns r and pval (indexed by CpG and optionally gene).
methylation_expression_heatmap¶
methylation_expression_heatmap(beta: DataFrame, expression: DataFrame, genes: Optional[List[str]] = None, top_n: int = 50, sample_metadata: Optional[DataFrame] = None, save_path: Union[str, pathlib.Path, NoneType] = None, method: str = 'pearson') -> matplotlib.figure.Figure
Heatmap of methylation-expression correlation coefficients for selected genes/CpGs.
Parameters¶
- beta: pd.DataFrame
  Methylation beta matrix (features × samples).
- expression: pd.DataFrame
  Gene expression matrix (genes × samples).
- genes: list[str], optional
  Specific features to display; if None, the `top_n` most correlated are used.
- top_n: int, default 50
  Number of top correlations shown when `genes` is None.
- sample_metadata: pd.DataFrame, optional
  Currently unused (reserved for future sample-side annotation).
- save_path: str or Path, optional
  Destination for saving the figure.
- method: {"pearson", "spearman"}, default "pearson"
  Correlation method.
Returns¶
plt.Figure Horizontal heatmap of correlation coefficients.
pca_plot¶
pca_plot(data: ProcessedData, color_col: str, title: str = 'PCA of Methylation Data', figsize: tuple[int, int] = (7, 6)) -> None
Display a simple PCA projection of samples coloured by a phenotype column.
Parameters¶
- data: ProcessedData
  Container holding the methylation matrix and phenotype table.
- color_col: str
  Column in `data.pheno` used for colouring points.
- title: str, default "PCA of Methylation Data"
  Title shown on the plot.
- figsize: tuple[int, int], default (7, 6)
  Size of the figure in inches.
Notes¶
The plot is shown immediately with plt.show() and is not returned.
plot_mean_difference¶
plot_mean_difference(beta_group1: DataFrame, beta_group2: DataFrame, top_n: int = 50, save_path: Union[str, pathlib.Path, NoneType] = None) -> matplotlib.figure.Figure
Bar plot of the largest absolute mean beta differences between two groups.
Parameters¶
beta_group1, beta_group2 : pd.DataFrame
Beta-value matrices (features × samples) for each group.
- top_n: int, default 50
Number of top differentially methylated loci to display.
- save_path: str or Path, optional
Destination for saving the figure.
Returns¶
plt.Figure Bar chart of absolute mean differences.
plot_pvalue_qq¶
plot_pvalue_qq(res: DataFrame, pval_col: str = 'pval', dpi: int = 300, save_path: Union[str, pathlib.Path, NoneType] = None) -> matplotlib.figure.Figure
Produce a quantile-quantile plot comparing observed versus expected -log10(p-values).
Parameters¶
- res: pd.DataFrame
  DataFrame containing the p-value column.
- pval_col: str, default "pval"
  Column name of the p-values.
- dpi: int, default 300
  Figure resolution.
- save_path: str or Path, optional
  Destination path for saving the figure.
Returns¶
plt.Figure Q-Q plot figure.
plot_shrinkage_diagnostics¶
plot_shrinkage_diagnostics(s2: pandas.core.series.Series, s2_post: pandas.core.series.Series, d0: Optional[float] = None, save_path: Union[str, pathlib.Path, NoneType] = None) -> matplotlib.figure.Figure
Visualise the effect of empirical Bayes variance shrinkage.
Parameters¶
- s2: pd.Series
  Original (unmoderated) variance estimates per feature.
- s2_post: pd.Series
  Moderated (shrunk) variance estimates.
- d0: float, optional
  Prior degrees of freedom from the shrinkage procedure (displayed in the title if given).
- save_path: str or Path, optional
  If provided, the figure is saved to this location.
Returns¶
plt.Figure Scatter plot comparing log10(original) vs log10(shrunk) variances.
plot_stage¶
plot_stage(stage: str, M: DataFrame, res: Optional[DataFrame] = None, metadata: Optional[DataFrame] = None, groups_col: str = 'Type', top_n: int = 10, embedding: str = 'pca', interactive: bool = False, save_path: Union[str, pathlib.Path, Mapping[str, str], NoneType] = None, dpi: int = 300, **kwargs: Any) -> Dict[str, Union[matplotlib.figure.Figure, ForwardRef('PlotlyFigure')]]
Orchestrate a complete multi-panel QC or analysis visualisation for a given processing stage.
Supported stages: "qc", "variance", "differential", "top_hits", "correlation".
Parameters¶
- stage: str
  One of the recognised analysis stages.
- M: pd.DataFrame
  Methylation matrix (features × samples), typically M-values or beta-values.
- res: pd.DataFrame, optional
  Differential methylation results (required for the variance, differential, and top_hits stages).
- metadata: pd.DataFrame, optional
  Sample annotation table; must contain the grouping column.
- groups_col: str, default "Type"
  Column in `metadata` used for colour-coding samples.
- top_n: int, default 10
  Number of top features to display in relevant panels.
- embedding: {"pca", "tsne", "umap"}, default "pca"
  Dimensionality-reduction method for sample embedding.
- interactive: bool, default False
  Return Plotly interactive figures where possible.
- save_path: str, Path or dict, optional
  Path, or mapping of figure names to paths, for automatic saving.
- dpi: int, default 300
  Resolution for static figures.
- **kwargs:
  Additional keyword arguments passed to the underlying reducer.
Returns¶
dict[str, plt.Figure | plotly.graph_objects.Figure] Mapping from panel name to the generated figure(s).
plot_volcano¶
plot_volcano(res: DataFrame, lfc_col: str = 'logFC', pval_col: str = 'pval', alpha: float = 0.7, lfc_thresh: float = 1.0, pval_thresh: float = 0.05, top_n: int = 10, dpi: int = 300, save_path: Union[str, pathlib.Path, NoneType] = None) -> matplotlib.figure.Figure
Create an enhanced volcano plot with top hits annotated.
Parameters¶
- res: pd.DataFrame
  Differential results with logFC and p-values.
- lfc_col: str, default "logFC"
  Column name for log fold change.
- pval_col: str, default "pval"
  Column name for p-value.
- alpha: float, default 0.7
  Point transparency.
- lfc_thresh: float, default 1.0
  Threshold for logFC significance.
- pval_thresh: float, default 0.05
  Threshold for p-value significance.
- top_n: int, default 10
  Number of top hits to annotate.
- dpi: int, default 300
  Figure resolution.
- save_path: str or Path, optional
  Path to save the figure.
Returns¶
plt.Figure Volcano plot figure
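The significance colouring behind such a plot reduces to a simple rule: a point is a hit only when it passes both thresholds. `classify` is a hypothetical helper illustrating that rule, not part of the dmeth API:

```python
def classify(logfc, pval, lfc_thresh=1.0, pval_thresh=0.05):
    """Volcano-style point classification: both thresholds must pass."""
    if pval < pval_thresh and abs(logfc) >= lfc_thresh:
        return "up" if logfc > 0 else "down"
    return "ns"  # not significant

print([classify(2.1, 0.001),   # significant hypermethylation
       classify(-1.5, 0.01),   # significant hypomethylation
       classify(0.3, 0.2)])    # fails both thresholds
```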
pvalue_histogram¶
pvalue_histogram(pvals: Union[pandas.core.series.Series, ndarray[Any, numpy.dtype[numpy.float64]], List[float]], bins: int = 50, save_path: Union[str, pathlib.Path, NoneType] = None) -> matplotlib.figure.Figure
Histogram of p-values with a flat null-expectation line for QC assessment.
Parameters¶
- pvals: array-like
  Collection of p-values.
- bins: int, default 50
  Number of histogram bins.
- save_path: str or Path, optional
  Path to save the figure.
Returns¶
plt.Figure Histogram figure.
visualize_dms¶
visualize_dms(res: DataFrame, beta: Optional[DataFrame] = None, top_n: int = 50, volcano: bool = True, manhattan: bool = True, heatmap: bool = True, sample_metadata: Optional[DataFrame] = None, save_dir: Union[str, pathlib.Path, NoneType] = None) -> Dict[str, Optional[matplotlib.figure.Figure]]
Produce a standard set of summary plots for differentially methylated sites/regions.
Optionally creates volcano, Manhattan, and heatmap visualisations.
Parameters¶
- res: pd.DataFrame
  Differential methylation results.
- beta: pd.DataFrame, optional
  Beta matrix required for the heatmap panel.
- top_n: int, default 50
  Number of top CpGs shown in the heatmap.
- volcano / manhattan / heatmap: bool, default True
  Toggle creation of each panel.
- sample_metadata: pd.DataFrame, optional
  Sample annotation (currently unused but reserved for future clustering).
- save_dir: str or Path, optional
  Directory where individual PNG files are written.
Returns¶
dict[str, plt.Figure | None] Figures for "volcano", "manhattan", and "heatmap" (None if not generated).
Configuration¶
Config Manager¶
Planner configuration manager for DNA methylation studies.
Provides a thread-safe singleton that centralizes platforms, experimental designs, cost components, timeline phases, and global settings with full validation and multiple format support.
CostComponentSchema¶
CostComponentSchema(*, cost: Annotated[float, Ge(ge=0.0)], unit: Annotated[str, _PydanticGeneralMetadata(pattern='^(per_sample|per_cpg|fixed)$')], description: Optional[str] = None, optional: bool = False, applies_to: Optional[List[str]] = None) -> None
Configuration schema for cost components.
DesignSchema¶
DesignSchema(*, name: str, description: Optional[str] = None, n_groups: Annotated[int, Ge(ge=1)], paired: bool = False, complexity: Optional[str] = None, min_n_recommended: Annotated[int, Ge(ge=1)], power_adjustment: Annotated[float, Gt(gt=0.0), Le(le=2.0)], analysis_method: Optional[str] = None, example_uses: Optional[List[str]] = None) -> None
Configuration schema for experimental design.
GlobalSettingsSchema¶
GlobalSettingsSchema(*, contingency_buffer_percent: Annotated[float, Ge(ge=0.0), Le(le=100.0)] = 10.0, default_mcp_method: Annotated[str, _PydanticGeneralMetadata(pattern='^(bonferroni|fdr|none)$')] = 'fdr') -> None
Global configuration settings.
PlannerConfig¶
PlannerConfig(config_file: Union[str, pathlib.Path, NoneType] = None)
Thread-safe singleton for planner configuration management.
This class manages all configuration for DNA methylation study planning, including platforms, experimental designs, costs, and timelines.
Parameters¶
- config_file: str or Path, optional
  Configuration file to load on initialization.
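The thread-safe singleton pattern can be sketched with double-checked locking. `PlannerConfigSketch` is illustrative only; the real `PlannerConfig` additionally loads, validates, and migrates configuration files:

```python
import threading

class PlannerConfigSketch:
    """Thread-safe singleton via double-checked locking."""
    _instance = None
    _lock = threading.Lock()

    def __new__(cls):
        if cls._instance is None:              # fast path, no lock
            with cls._lock:
                if cls._instance is None:      # re-check inside the lock
                    cls._instance = super().__new__(cls)
                    # Hypothetical default mirroring GlobalSettingsSchema.
                    cls._instance.settings = {"default_mcp_method": "fdr"}
        return cls._instance

a, b = PlannerConfigSketch(), PlannerConfigSketch()
print(a is b)  # True: every call returns the same instance
```

The second `is None` check inside the lock is what makes the pattern safe: two threads may both pass the unlocked check, but only one can construct the instance.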
PlannerConfigModel¶
PlannerConfigModel(*, platforms: Dict[str, dmeth.config.config_manager.PlatformSchema], designs: Dict[str, dmeth.config.config_manager.DesignSchema], cost_components: Dict[str, dmeth.config.config_manager.CostComponentSchema], timeline_phases: Dict[str, dmeth.config.config_manager.TimelinePhaseSchema], global_settings: dmeth.config.config_manager.GlobalSettingsSchema = <factory>) -> None
Complete planner configuration model.
PlatformSchema¶
PlatformSchema(*, name: str, manufacturer: Optional[str] = None, n_cpgs: Annotated[int, Ge(ge=1)], cost_per_sample: Annotated[float, Ge(ge=0.0)], processing_days: Annotated[int, Ge(ge=0)], dna_required_ng: Annotated[Optional[float], Ge(ge=0)] = None, coverage: Optional[str] = None, release_year: Annotated[Optional[int], Ge(ge=1990), Le(le=2030)] = None, status: Optional[str] = None, recommended: bool = False, notes: Optional[str] = None) -> None
Configuration schema for a methylation array platform.
TimelinePhaseSchema¶
TimelinePhaseSchema(*, name: Optional[str] = None, base_duration_days: Annotated[float, Gt(gt=0)], scaling_factor: Annotated[float, Ge(ge=0.0)] = 0.0, batch_adjustment: Annotated[float, Ge(ge=0.0)] = 0.0, description: Optional[str] = None, critical: bool = False, optional: bool = False, parallelizable: bool = False) -> None
Configuration schema for timeline phases.
get_config¶
get_config() -> dmeth.config.config_manager.PlannerConfig
Get the global PlannerConfig singleton instance.
Returns¶
PlannerConfig The global configuration instance.
load_file¶
load_file(path: Union[str, pathlib.Path], allow_excel: bool = True) -> None
Load configuration from a file. See PlannerConfig.load_file().
migrate¶
migrate(migrator: Callable[[Dict], Dict]) -> None
Apply a migration function. See PlannerConfig.migrate().
reload¶
reload() -> None
Reload configuration. See PlannerConfig.reload().
reset_config¶
reset_config() -> None
Reset the global configuration instance (primarily for testing).
save_file¶
save_file(path: Union[str, pathlib.Path], fmt: Optional[str] = None) -> None
Save configuration to a file. See PlannerConfig.save_file().
Citation¶
If you use dmeth in your research, please cite:
@software{dmeth2025,
author = {Afolabi, Dare},
title = {dmeth: A comprehensive Python toolkit for differential DNA methylation analysis with empirical Bayes moderation and biomarker discovery},
version = {0.2.0},
year = {2025},
publisher = {GitHub},
doi = {10.5281/zenodo.17777501},
url = {https://doi.org/10.5281/zenodo.17777501},
}
References¶
- Smyth, G. K. (2004). Linear models and empirical bayes methods for assessing differential expression in microarray experiments. Statistical Applications in Genetics and Molecular Biology, 3(1).
- Liu, P., & Hwang, J.T.G. (2007). Quick calculation for sample size while controlling false discovery rate with application to microarray analysis. Bioinformatics, 23(6), 739–746.
- Du, P., Zhang, X., Huang, C.-C., Jafari, N., Kibbe, W.A., Hou, L., & Lin, S. (2010). Comparison of Beta-value and M-value methods for quantifying methylation levels by microarray analysis. BMC Bioinformatics, 11:587.
- Jung, S.H., Young, S.S. (2012). Power and sample size calculation for microarray studies. Journal of Biopharmaceutical Statistics, 22(1):30-42.
- Phipson, B. et al. (2016). missMethyl: an R package for analyzing data from Illumina’s HumanMethylation450 platform. Bioinformatics, 32(2), 286-288.
Support¶
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Email: dare.afolabi@outlook.com
Auto-generated on 2025-12-13 02:56:16