API Reference¶
consenrich.core¶
The core module implements the main features of Consenrich and defines key parameter classes for a consistent namespace.
Note
Many parameters are documented here for completeness and do not need to be tuned in common use-cases. See Notation for some definitions of key variables and parameters used throughout documentation.
- class consenrich.core.processParams(deltaF, minQ, maxQ, offDiagQ)[source]¶
Parameters related to the process model of Consenrich.
The consensus epigenomic signal is modeled explicitly with a simple ‘level + slope’ process.
- Parameters:
deltaF (float) – Controls the integration step size between the signal ‘slope’ \(\dot{x}_{[i]}\) and the signal ‘level’ \(x_{[i]}\).
minQ (float) – Minimum process noise scale (diagonal in \(\mathbf{Q}_{[i]}\)) on the primary state variable (signal level). If minQ < 0 (default), a small value scaled from the minimum observation noise level (observationParams.minR) is used for numerical stability.
maxQ (float) – Maximum process noise scale. If maxQ < 0 (default), no effective upper bound is enforced.
offDiagQ (float) – Off-diagonal value in the process noise covariance \(\mathbf{Q}_{[i,01]}\).
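The 'level + slope' process corresponds to a constant-velocity state-space model. As a minimal sketch (assuming the standard form of the transition matrix, with deltaF as the integration step):

```python
import numpy as np

deltaF = 1.0  # integration step between slope and level (processParams.deltaF)

# State x_i = [level, slope]; F advances the level by deltaF * slope per interval.
F = np.array([[1.0, deltaF],
              [0.0, 1.0]], dtype=np.float32)

x = np.array([2.0, 0.5], dtype=np.float32)  # level = 2.0, slope = 0.5
x_next = F @ x  # predicted state at the next genomic interval
```

With deltaF = 1.0, a level of 2.0 and slope of 0.5 predict a level of 2.5 at the next interval; smaller deltaF weakens the slope's influence on the level.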
- class consenrich.core.plotParams(plotPrefix=None, plotStateEstimatesHistogram=False, plotMWSRHistogram=False, plotHeightInches=6.0, plotWidthInches=8.0, plotDPI=300, plotDirectory=None)[source]¶
(Experimental) Parameters related to plotting filter results and diagnostics.
- Parameters:
plotPrefix (str or None) – Prefix for output plot filenames.
plotStateEstimatesHistogram (bool) – If True, plot a histogram of post-fit primary state estimates.
plotMWSRHistogram (bool) – If True, plot a histogram of post-fit weighted squared residuals (MWSR).
plotHeightInches (float) – Height of output plots in inches.
plotWidthInches (float) – Width of output plots in inches.
plotDPI (int) – DPI of output plots (PNG).
plotDirectory (str or None) – Directory where plots will be written.
- Seealso:
plotStateEstimatesHistogram(), plotMWSRHistogram(), outputParams
- class consenrich.core.observationParams(minR, maxR, samplingIters, samplingBlockSizeBP, binQuantileCutoff, EB_minLin, EB_use, EB_setNu0, EB_setNuL, pad)[source]¶
Parameters related to the observation model of Consenrich.
The observation model is used to integrate sequence alignment data from multiple replicates while accounting for both region- and replicate-specific noise.
- Parameters:
minR (float | None) – Genome-wide lower bound for replicate-specific observation noise scale.
maxR (float | None) – Genome-wide upper bound for the replicate-specific observation noise scale.
samplingIters (int | None) – Number of blocks (within-contig) to sample while building the empirical absMean-variance trend in consenrich.core.fitVarianceFunction().
samplingBlockSizeBP (int | None) – Expected size (in bp) of contiguous blocks sampled when fitting AR(1) parameters to estimate \((\lvert \mu_b \rvert, \sigma^2_b)\) pairs. During sampling, each block's size (unit: genomic intervals) is drawn from a truncated \(\textsf{Geometric}(p=1/\textsf{samplingBlockSize})\) distribution to reduce artifacts from fixed-size blocks. If None or < 1, this value is inferred using consenrich.core.getContextSize().
binQuantileCutoff (float | None) – When fitting the variance function, pairs \((\lvert \mu_b \rvert, \sigma^2_b)\) are binned by their absolute means. This parameter sets the quantile of variances within each bin used when fitting the global mean-variance trend. Increasing this value toward 1.0 can raise the prior trend for measurement uncertainty and yield stiffer signal estimates overall.
EB_minLin (float | None) – Require that the fitted trend in consenrich.core.getMuncTrack() satisfy \(\textsf{variance} \geq \textsf{minLin} \cdot |\textsf{mean}|\). See fitVarianceFunction().
EB_use (bool | None) – If True, shrink 'local' noise estimates toward a prior trend dependent on amplitude. See consenrich.core.getMuncTrack().
EB_setNu0 (int | None) – If provided, manually set \(\nu_0\) to this value (rather than computing it via consenrich.core.EB_computePriorStrength()).
EB_setNuL (int | None) – If provided, manually set the local model degrees of freedom, \(\nu_L\), to this value.
pad (float | None) – A small constant added to the observation noise variance estimates for conditioning.
- Seealso:
consenrich.core.getMuncTrack(), consenrich.core.fitVarianceFunction(), consenrich.core.EB_computePriorStrength(), consenrich.cconsenrich.cblockScaleEM()
- class consenrich.core.stateParams(stateInit, stateCovarInit, boundState, stateLowerBound, stateUpperBound, conformalRescale, conformalAlpha, conformal_numIters, conformalFinalRefit)[source]¶
Parameters related to state variables and covariances.
- Parameters:
stateInit (float) – Initial value of the ‘primary’ state/signal at the first genomic interval: \(x_{[1]}\)
stateCovarInit (float) – Initial state covariance scale. Note, the initial state uncertainty \(\mathbf{P}_{[1]}\) is a multiple of the identity matrix \(\mathbf{I}\). Final results are typically insensitive to this parameter, since the filter effectively 'forgets' its initialization after processing a moderate number of intervals and backward smoothing.
boundState (bool) – If True, the primary state estimate for \(x_{[i]}\) is reported within stateLowerBound and stateUpperBound. Note that the internal filtering is unaffected.
stateLowerBound (float) – Lower bound for the state estimate.
stateUpperBound (float) – Upper bound for the state estimate.
conformalRescale (bool | None) – If True, perform replicate-heldout split-conformal calibration of uncertainty for a hypothetical new replicate at each genomic interval i. This is useful if the usual posterior uncertainty estimate is unsatisfying due to dependence on the assumed model. Calibration scores are formed by holding out one set of replicates (calibration replicates), fitting the model on the remaining replicates (proper-training replicates), and comparing held-out observations to the fitted predictor. Under replicate exchangeability, this routine can yield a finite-sample marginal coverage guarantee for the corresponding prediction intervals at the requested miscoverage level.
conformalAlpha (float | None) – Target split-conformal miscoverage level \(\alpha\) in (0, 1). When conformalRescale is True, the inflation factor is chosen from the \((1-\alpha)\) empirical quantile of calibration scores so that the induced prediction intervals for a hypothetical new replicate have marginal coverage at least \(1-\alpha\).
conformal_numIters (int | None) – Number of replicate-heldout calibration rounds used to estimate the inflation factor. If set > 1, the routine repeats the replicate split multiple times and aggregates calibration scores across rounds to reduce Monte Carlo variability in the estimated quantile. Note, however, that aggregating across rounds violates the exact single-split reasoning for finite-sample marginal coverage.
conformalFinalRefit (bool | None) – If True, refit the final model after selecting the conformal inflation factor. Note that this breaks the standard split-conformal protocol, which requires that the predictor used at test time be the same 'frozen' predictor used to generate calibration scores. If strict guarantees are important and you have enough replicates to allocate a meaningful calibration set, we suggest setting conformalFinalRefit=False. In all cases, the Consenrich implementation can only inflate the original posterior uncertainty, avoiding anti-conservative reporting.
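The split-conformal calibration above can be sketched as follows. This is a simplified illustration, not the Consenrich implementation: the scores are hypothetical, there is a single split, and no refit:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.05  # target miscoverage (conformalAlpha)

# Hypothetical standardized calibration scores: |held-out observation - prediction|
# divided by the posterior SD, formed on the calibration replicates.
scores = np.abs(rng.normal(size=200))

# Finite-sample conformal quantile: the ceil((n+1)(1-alpha))-th order statistic.
n = scores.size
k = int(np.ceil((n + 1) * (1 - alpha)))
inflation = max(1.0, np.sort(scores)[min(k, n) - 1])  # only inflate, never shrink

posteriorSD = 0.8
calibratedSD = inflation * posteriorSD  # half-width scale of the prediction interval
```

Clipping the inflation factor at 1.0 mirrors the documented behavior: calibration can only widen, never narrow, the reported uncertainty.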
- class consenrich.core.inputParams(bamFiles, bamFilesControl, pairedEnd)[source]¶
Parameters related to the input data for Consenrich.
- Parameters:
bamFiles (List[str]) – A list of paths to distinct coordinate-sorted and indexed BAM files.
bamFilesControl (List[str], optional) – A list of paths to distinct coordinate-sorted and indexed control BAM files, e.g., IgG control inputs for ChIP-seq.
pairedEnd (Optional[bool]) – Deprecated: Paired-end/Single-end is inferred automatically from the alignment flags in input BAM files.
- class consenrich.core.outputParams(convertToBigWig, roundDigits, writeUncertainty, writeMWSR, writeJackknifeSE, applyJackknife)[source]¶
Parameters related to output files.
- Parameters:
convertToBigWig (bool) – If True, output bedGraph files are converted to bigWig format.
roundDigits (int) – Number of decimal places to round output values (bedGraph)
writeUncertainty (bool) – If True, write the posterior state uncertainty \(\sqrt{\widetilde{P}_{i,(11)}}\) to bedGraph.
writeMWSR (bool) –
If True, write a per-interval mean weighted squared studentized post-fit residual (MWSR), computed using the smoothed state and its posterior variance from the final filter/smoother pass.
Let \(r_{[j,i]} = \texttt{matrixData}_{[j,i]} - b_j - \widetilde{x}_{[i]}\) denote the post-fit residual for replicate \(j\) at interval \(i\), where \(\widetilde{x}_{[i]}\) is the consensus epigenomic signal level estimate. Let \(\widetilde{P}_{[00,i]}\) be the posterior variance of the first state variable (signal level) and let \(R_{[j,i]}\) be the observation noise variance for replicate \(j\) at interval \(i\). The studentized squared residuals \(u^2_{[j,i]}\) and the mean weighted squared residuals \(\textsf{MWSR}[i]\) are recorded as:
\[u^2_{[j,i]} = \frac{r_{[j,i]}^2 + \widetilde{P}_{[00,i]}}{R_{[j,i]}}, \qquad \textsf{MWSR}[i] = \frac{1}{m}\sum_{j=1}^{m} u^2_{[j,i]},\]
which is consistent with the EM routine in consenrich.cconsenrich.cblockScaleEM().
writeJackknifeSE (bool) – If True, write the standard error of the signal level estimates across jackknife replicates to bedGraph. This is only relevant if applyJackknife is True.
applyJackknife (bool) – If True, estimate replicate-level sampling variability in the signal level estimates with the jackknife.
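The MWSR definition above maps directly onto array operations. A minimal numpy sketch with made-up values (the variable names mirror the symbols, not the internal implementation):

```python
import numpy as np

m, n = 3, 5  # replicates, intervals
rng = np.random.default_rng(1)

matrixData = rng.normal(loc=2.0, scale=0.5, size=(m, n))
b = np.zeros(m)                    # replicate bias terms b_j
xTilde = matrixData.mean(axis=0)   # stand-in for the smoothed level estimates
P00 = np.full(n, 0.05)             # posterior variance of the signal level
R = np.full((m, n), 0.25)          # observation noise variances R_{j,i}

r = matrixData - b[:, None] - xTilde[None, :]  # post-fit residuals r_{j,i}
u2 = (r**2 + P00[None, :]) / R                 # studentized squared residuals
MWSR = u2.mean(axis=0)                         # per-interval mean over replicates
```

Values near 1 indicate residual dispersion consistent with the modeled noise; much larger values flag intervals where the observation model underestimates the variability.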
- class consenrich.core.genomeParams(genomeName, chromSizesFile, blacklistFile, sparseBedFile, chromosomes, excludeChroms, excludeForNorm)[source]¶
Specify assembly-specific resources, parameters.
- Parameters:
genomeName (str) – If supplied, default resources for the assembly (sizes file, blacklist, and 'sparse' regions) in src/consenrich/data are used. ce10, ce11, dm6, hg19, hg38, mm10, and mm39 have default resources available.
chromSizesFile (str) – A two-column TSV-like file with chromosome names and sizes (in base pairs).
blacklistFile (str, optional) – A BED file with regions to exclude.
sparseBedFile (str, optional) – A BED file with 'sparse regions' that are mutually exclusive with or devoid of the targeted signal. Used to estimate noise levels. See getMuncTrack().
chromosomes (List[str]) – A list of chromosome names to analyze. If None, all chromosomes in chromSizesFile are used.
excludeChroms (List[str]) – A list of chromosome names to exclude from analysis.
excludeForNorm (List[str]) – A list of chromosome names to exclude when summing up the 'effective genome size' during normalization. This can be useful to avoid bias from poorly assembled, highly repetitive, and/or sex-specific chromosomes (e.g., chrM, chrUn). For reference, see effective genome size.
- class consenrich.core.countingParams(intervalSizeBP, backgroundBlockSizeBP, smoothSpanBP, scaleFactors, scaleFactorsControl, normMethod, fragmentLengths, fragmentLengthsControl, useTreatmentFragmentLengths, fixControl, globalWeight, asymPos, logOffset, logMult)[source]¶
Parameters related to counting aligned reads
- Parameters:
intervalSizeBP (int) – Length (bp) of the genomic intervals \(i=1\ldots n\) that comprise the larger genomic region (contig, chromosome, etc.). The default value is generally robust, but users may consider increasing this value when the expected feature size is large and/or sequencing depth is low (less than \(\approx 5\) million reads, depending on assay).
backgroundBlockSizeBP (int) – Length (bp) of blocks used to estimate local statistics (background, noise, etc.). If a negative value is provided (default), this value is inferred from the data using consenrich.core.getContextSize().
normMethod (str) –
Method used to normalize read counts for sequencing depth / library size.
EGS: Effective Genome Size normalization (see consenrich.detrorm.getScaleFactor1x())
SF: Median-of-ratios-style scale factors (see consenrich.cconsenrich.cSF()). Restricted to analyses with >= 3 samples (no input control).
RPKM: Scale factors based on Reads Per Kilobase per Million mapped reads (see consenrich.detrorm.getScaleFactorPerMillion())
fragmentLengths (List[int], optional) – List of fragment lengths (bp) to use for extending reads from 5’ ends when counting single-end data.
fragmentLengthsControl (List[int], optional) – List of fragment lengths (bp) to use for extending reads from 5’ ends when counting single-end with control data.
useTreatmentFragmentLengths (bool, optional) – If True, use fragment lengths estimated from treatment BAM files for control BAM files, too.
fixControl (bool, optional) – If True, treatment samples are not upscaled, and control samples are not downscaled.
globalWeight (float, optional) – Relative weight assigned to the global 'dense' baseline when combining with local baseline estimates. Higher values increase the influence of the global baseline. For instance, globalWeight = 2 results in a weighted average where the global baseline contributes 2/3 of the final baseline estimate, whereas globalWeight = 1 results in equal weighting between global and local baselines. Users with input control samples may consider increasing this value to avoid redundancy (artificial local trends have presumably been accounted for in the control, leaving less signal to be modeled locally).
asymPos (float, optional) – Relative weight assigned to positive residuals to induce asymmetry in reweighting. Used during IRLS for the local baseline computation. Smaller values near 0.0 downweight peaks more and reduce the risk of removing true signal. Typical range is (0, 0.75].
logOffset (float, optional) – A small constant (pseudocount) added to normalized read counts before log-transforming. For example, \(\log(x + 1)\) for logOffset = 1. Default is 1.0.
logMult (float, optional) – Multiplicative factor applied to log-scaled and normalized counts. For example, setting logMult = 1 / \(\log(2)\) yields log2-scaled counts after transformation, and logMult = 1.0 yields natural log-scaled counts.
smoothSpanBP (int | None)
scaleFactors (List[float] | None)
scaleFactorsControl (List[float] | None)
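As a quick illustration of logOffset and logMult, the transform's arithmetic reduces to a scaled, offset logarithm (a sketch of the formula only, not the cTransform() implementation):

```python
import numpy as np

counts = np.array([0.0, 1.0, 3.0, 7.0])  # normalized counts for one replicate

logOffset = 1.0            # pseudocount
logMult = 1.0 / np.log(2)  # change of base: yields log2-scaled counts

transformed = logMult * np.log(counts + logOffset)  # equivalent to log2(counts + 1)
```

Setting logMult back to 1.0 leaves the counts on the natural-log scale, as described above.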
- Seealso:
Treatment vs. Control Fragment Lengths in Single-End Data
For single-end data, cross-correlation-based estimates for fragment length in control inputs can be biased due to a comparative lack of structure in strand-specific coverage tracks.
This can create artifacts during counting, so it is common to use the estimated treatment fragment length for both treatment and control samples. The countingParams.useTreatmentFragmentLengths argument enables this behavior.
- class consenrich.core.samParams(samThreads, samFlagExclude, oneReadPerBin, chunkSize, offsetStr='0,0', maxInsertSize=1000, pairedEndMode=0, inferFragmentLength=0, countEndsOnly=False, minMappingQuality=0, minTemplateLength=-1, fragmentLengths=None)[source]¶
Parameters related to reading BAM files
- Parameters:
samThreads (int) – The number of threads to use for reading BAM files.
samFlagExclude (int) – The SAM flag to exclude certain reads.
oneReadPerBin (int) – If 1, only the interval with the greatest read overlap is incremented.
chunkSize (int) – Maximum number of intervals' data to hold in memory before flushing to disk.
offsetStr (str) – A string of two comma-separated integers – first for the 5' shift on forward strand, second for the 5' shift on reverse strand.
maxInsertSize (int) – Maximum fragment length/insert to consider when estimating fragment length.
inferFragmentLength (int) – Intended for single-end data: if > 0, the maximum correlation lag (avg.) between strand-specific read tracks is taken as the fragment length estimate and used to extend reads from the 5' end. Ignored if the data is paired-end, countEndsOnly is set, or fragmentLengths is provided. Can be important when targeting broader marks (e.g., ChIP-seq H3K27me3).
countEndsOnly (Optional[bool]) – If True, only the 5' read ends contribute to counting.
minMappingQuality (Optional[int]) – Minimum mapping quality (MAPQ) for reads to be counted.
fragmentLengths (List[int] | None) – If supplied, a list of estimated fragment lengths for each BAM file. These values are used to extend reads. Note, these values will override TLEN attributes in paired-end data.
pairedEndMode (int | None)
minTemplateLength (int | None)
Tip
For an overview of SAM flags, see https://broadinstitute.github.io/picard/explain-flags.html
- class consenrich.core.matchingParams(templateNames, cascadeLevels, iters, alpha, useScalingFunction, minMatchLengthBP, maxNumMatches, minSignalAtMaxima, merge, mergeGapBP, excludeRegionsBedFile, penalizeBy, randSeed, eps, massQuantileCutoff, methodFDR)[source]¶
Parameters related to the matching algorithm.
See `matching` for an overview of the approach.
- Parameters:
templateNames (List[str]) – A list of str values – each entry references a mother wavelet (or its corresponding scaling function). e.g., [haar, db2]
cascadeLevels (List[int]) – Number of cascade iterations, or 'levels', used to define wavelet-based templates. Must have the same length as templateNames, with each entry aligned to the corresponding template. e.g., given templateNames [haar, db2], [2, 2] would use 2 cascade levels for both templates.
iters (int) – Number of randomly drawn contiguous blocks used to build an empirical null for significance evaluation. See cconsenrich.csampleBlockStats().
alpha (float) – Primary significance threshold on detected matches. Specifically, the minimum corrected empirical p-value approximated from randomly sampled blocks in the response sequence.
minMatchLengthBP (Optional[int]) – Within a window of minMatchLengthBP length (bp), relative maxima in the signal-template convolution \(\mathcal{R}_{[\ast]}\) must be greater in value than all others to qualify as matches. If set to a value less than 1, the minimum length is determined via consenrich.matching.autoMinLengthIntervals() (default behavior).
minSignalAtMaxima (Optional[str | float]) – Secondary/optional threshold coupled with alpha. Requires the signal value, \(\widetilde{x}_{[i^*]}\), at relative maxima in the response sequence, \(\mathcal{R}_{[i^*]}\), to be greater than this threshold. If a str value is provided, the expected format is 'q:quantileValue', e.g., 'q:0.90'; the threshold is then set to the corresponding quantile of the non-zero signal estimates in the distribution of transformed values.
useScalingFunction (bool) – If True, use (only) the scaling function to build the matching template. If False, use (only) the wavelet function.
eps (float) – Tolerance parameter for relative maxima detection in the response sequence. Set to zero to enforce strict inequalities when identifying discrete relative maxima.
massQuantileCutoff (float) – Quantile cutoff for filtering initial (unmerged) matches based on their 'mass' ((avgSignal*length)/intervalLength). To disable, set < 0.
maxNumMatches (int | None)
merge (bool | None)
mergeGapBP (int | None)
excludeRegionsBedFile (str | None)
penalizeBy (str | None)
randSeed (int | None)
methodFDR (str | None)
- Seealso:
cconsenrich.csampleBlockStats(), `matching`, outputParams
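A minimal sketch of the template-matching idea: build a Haar-based template via the cascade algorithm and scan a signal with it. This is illustrative only — the actual routine lives in consenrich.matching and couples the response maxima with the alpha and minSignalAtMaxima thresholds described above:

```python
import numpy as np

# Haar scaling function via the cascade algorithm: repeatedly upsample and
# convolve with the lowpass filter h = [1/sqrt(2), 1/sqrt(2)].
h = np.array([1.0, 1.0]) / np.sqrt(2.0)
template = h.copy()
for _ in range(2):  # cf. cascadeLevels = [2] for templateNames = [haar]
    up = np.zeros(2 * template.size)
    up[::2] = template
    template = np.convolve(up, h)
template /= np.linalg.norm(template)  # unit-norm template

# Toy signal with one block of elevated values; relative maxima in the
# signal-template convolution (the 'response sequence') are candidate matches.
signal = np.zeros(200)
signal[90:110] = 1.0
response = np.convolve(signal, template, mode="same")
iStar = int(np.argmax(response))  # location of the strongest candidate match
```

The response peaks where the signal shape aligns with the template; in the full pipeline, csampleBlockStats() supplies the empirical null against which these maxima are tested.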
- consenrich.core.readBamSegments(bamFiles, chromosome, start, end, intervalSizeBP, readLengths, scaleFactors, oneReadPerBin, samThreads, samFlagExclude, offsetStr='0,0', maxInsertSize=1000, pairedEndMode=0, inferFragmentLength=0, countEndsOnly=False, minMappingQuality=0, minTemplateLength=-1, fragmentLengths=None)[source]¶
Calculate coverage tracks for each BAM file.
- Parameters:
bamFiles (List[str]) – See inputParams.
chromosome (str) – Chromosome to read.
start (int) – Start position of the genomic segment.
end (int) – End position of the genomic segment.
readLengths (List[int]) – List of read lengths for each BAM file.
scaleFactors (List[float]) – List of scale factors for each BAM file.
intervalSizeBP (int) – See countingParams.
oneReadPerBin (int) – See samParams.
samThreads (int) – See samParams.
samFlagExclude (int) – See samParams.
offsetStr (str) – See samParams.
maxInsertSize (int) – See samParams.
pairedEndMode (int) – See samParams.
inferFragmentLength (int) – See samParams.
minMappingQuality (int) – See samParams.
minTemplateLength (Optional[int]) – See samParams.
fragmentLengths (Optional[List[int]]) – If supplied, a list of estimated fragment lengths for each BAM file. These values are used to extend reads. Note, these values will override TLEN attributes in paired-end data.
countEndsOnly (bool | None)
- Return type:
ndarray[tuple[Any, …], dtype[float32]]
- consenrich.core.getMuncTrack(chromosome, intervals, values, intervalSizeBP, samplingBlockSizeBP=None, samplingIters=25000, randomSeed=42, excludeMask=None, useEMA=True, excludeFitCoefs=None, binQuantileCutoff=0.5, EB_minLin=1.0, EB_use=True, EB_setNu0=None, EB_setNuL=None, EB_localQuantile=0.0, verbose=False, eps=0.01)[source]¶
Approximate initial sample-specific (M)easurement (unc)ertainty tracks
For an individual experimental sample (replicate), quantify positional data uncertainty over genomic intervals \(i=1,2,\ldots n\) spanning chromosome. These tracks (per-sample) comprise the matrixMunc input to runConsenrich(), \(\mathbf{R}[:,:] \in \mathbb{R}^{m \times n}\).
Variance is modeled as a function of the absolute mean signal level. For EB_use=True, local variance estimates are shrunk toward a signal-level-dependent global variance fit.
- Parameters:
chromosome (str) – chromosome/contig name
values (np.ndarray) – normalized/transformed signal measurements over genomic intervals (e.g., consenrich.cconsenrich.cTransform() output)
intervals (np.ndarray) – genomic interval positions (start positions)
intervalSizeBP (int)
samplingBlockSizeBP (int | None)
samplingIters (int)
randomSeed (int)
excludeMask (ndarray | None)
useEMA (bool | None)
excludeFitCoefs (Tuple[int, ...] | None)
binQuantileCutoff (float)
EB_minLin (float)
EB_use (bool)
EB_setNu0 (int | None)
EB_setNuL (int | None)
EB_localQuantile (float)
verbose (bool)
eps (float)
- Return type:
tuple[ndarray[tuple[Any, …], dtype[float32]], float]
See consenrich.core.observationParams for other parameters.
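The empirical-Bayes shrinkage can be pictured as a precision-weighted average of local and global variance estimates. A sketch assuming the standard degrees-of-freedom weighting, with \(\nu_0\) and \(\nu_L\) as in EB_setNu0 / EB_setNuL (the exact combination rule inside getMuncTrack() may differ):

```python
import numpy as np

nu0, nuL = 10.0, 5.0  # prior strength and local degrees of freedom

localVar = np.array([0.2, 1.5, 0.9])   # 'local' noise estimates (e.g., AR(1) innovations)
globalVar = np.array([0.5, 0.6, 0.8])  # fitted absMean-variance trend at the same intervals

# Shrink local estimates toward the amplitude-dependent global trend;
# larger nu0 (stronger prior) pulls harder toward globalVar.
shrunk = (nu0 * globalVar + nuL * localVar) / (nu0 + nuL)
```

Each shrunken value lies between the local and global estimates, so noisy local outliers are tempered without being discarded.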
- consenrich.core.fitVarianceFunction(jointlySortedMeans, jointlySortedVariances, eps=0.01, binQuantileCutoff=0.5, EB_minLin=1.0)[source]¶
- Parameters:
jointlySortedMeans (ndarray)
jointlySortedVariances (ndarray)
eps (float)
binQuantileCutoff (float)
EB_minLin (float)
- Return type:
ndarray
- consenrich.core.EB_computePriorStrength(localModelVariances, globalModelVariances, Nu_local)[source]¶
Compute \(\nu_0\) to determine ‘prior strength’
The prior model strength is determined by ‘excess’ dispersion beyond sampling noise at the local level.
- Parameters:
localModelVariances (np.ndarray) – Local model variance estimates (e.g., rolling AR(1) innovation variances from consenrich.cconsenrich.crolling_AR1_IVar()).
globalModelVariances (np.ndarray) – Global model variance estimates from the absMean-variance trend fit (consenrich.core.fitVarianceFunction()).
Nu_local (float) – Effective sample size/degrees of freedom for the local model.
- Returns:
Estimated prior strength \(\nu_{0}\).
- Return type:
float
- Seealso:
consenrich.core.getMuncTrack(), consenrich.core.fitVarianceFunction()
- consenrich.core.runConsenrich(matrixData, matrixMunc, deltaF, minQ, maxQ, offDiagQ, stateInit, stateCovarInit, boundState, stateLowerBound, stateUpperBound, blockLenIntervals, projectStateDuringFiltering=False, pad=0.0001, disableCalibration=False, EM_maxIters=50, EM_rtol=0.0001, EM_scaleToMedian=False, EM_tNu=10.0, EM_alphaEMA=0.1, EM_scaleLOW=0.1, EM_scaleHIGH=10.0, EM_useObsBlockScale=True, EM_useProcBlockScale=True, EM_useObsPrecReweight=True, EM_useProcPrecReweight=True, EM_useReplicateBias=True, EM_useReplicateScale=True, EM_repBiasShrink=0.0, EM_repScaleLOW=0.25, EM_repScaleHIGH=4.0, returnScales=True, returnReplicateOffsets=False, applyJackknife=False, jackknifeEM_maxIters=5, jackknifeEM_rtol=0.01, useWhiteAccel=False, useDiscreteConstAccel=False, autoDeltaF=True, autoDeltaF_low=0.0001, autoDeltaF_high=2.0, autoDeltaF_init=0.01, autoDeltaF_maxEvals=25, conformalRescale=False, conformalAlpha=0.05, conformal_numIters=3, conformalFinalRefit=True)[source]¶
Run Consenrich over a contiguous genomic region (chromosome)
- We estimate a position-dependent consensus epigenomic signal \(\widetilde{x}_{[i,0]}\) from multiple replicate tracks' HTS data and provide positional uncertainty quantification by propagating variance in a linear filter-smoother setup.
This wrapper ties together several fundamental routines written in Cython:
consenrich.cconsenrich.cforwardPass(): Forward filter (predict, update)
consenrich.cconsenrich.cbackwardPass(): Backward fixed-interval smoother
consenrich.cconsenrich.cblockScaleEM(): Joint optimization of process and observation model noise scales with robust studentized reweighting
- Seealso:
consenrich.core.getMuncTrack(), consenrich.cconsenrich.cTransform(), consenrich.cconsenrich.cforwardPass(), consenrich.cconsenrich.cbackwardPass(), consenrich.cconsenrich.cblockScaleEM()
- Parameters:
matrixData (ndarray)
matrixMunc (ndarray)
deltaF (float)
minQ (float)
maxQ (float)
offDiagQ (float)
stateInit (float)
stateCovarInit (float)
boundState (bool)
stateLowerBound (float)
stateUpperBound (float)
blockLenIntervals (int)
projectStateDuringFiltering (bool)
pad (float)
disableCalibration (bool)
EM_maxIters (int)
EM_rtol (float)
EM_scaleToMedian (bool)
EM_tNu (float)
EM_alphaEMA (float)
EM_scaleLOW (float)
EM_scaleHIGH (float)
EM_useObsBlockScale (bool)
EM_useProcBlockScale (bool)
EM_useObsPrecReweight (bool)
EM_useProcPrecReweight (bool)
EM_useReplicateBias (bool)
EM_useReplicateScale (bool)
EM_repBiasShrink (float)
EM_repScaleLOW (float)
EM_repScaleHIGH (float)
returnScales (bool)
returnReplicateOffsets (bool)
applyJackknife (bool)
jackknifeEM_maxIters (int)
jackknifeEM_rtol (float)
useWhiteAccel (bool)
useDiscreteConstAccel (bool)
autoDeltaF (bool)
autoDeltaF_low (float)
autoDeltaF_high (float)
autoDeltaF_init (float)
autoDeltaF_maxEvals (int)
conformalRescale (bool)
conformalAlpha (float)
conformal_numIters (int)
conformalFinalRefit (bool)
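The forward pass follows the standard Kalman predict/update recursion for the two-state model. A generic numpy sketch (illustrative only — it omits the EM rescaling, state bounds, and backward smoothing that runConsenrich() applies):

```python
import numpy as np

def forwardPass(y, R, F, Q, x0, P0):
    """Generic Kalman filter: y is (m, n) observations, R is (m, n) noise variances."""
    m, n = y.shape
    H = np.tile([1.0, 0.0], (m, 1))  # each replicate observes the signal level only
    x, P = x0.copy(), P0.copy()
    levels = np.empty(n)
    for i in range(n):
        # Predict: propagate state and covariance through the process model
        x = F @ x
        P = F @ P @ F.T + Q
        # Update: fuse the m replicate observations at interval i
        S = H @ P @ H.T + np.diag(R[:, i])
        K = P @ H.T @ np.linalg.inv(S)
        x = x + K @ (y[:, i] - H @ x)
        P = (np.eye(2) - K @ H) @ P
        levels[i] = x[0]
    return levels

F = np.array([[1.0, 1.0], [0.0, 1.0]])  # 'level + slope' transition (deltaF = 1)
Q = np.diag([1e-2, 1e-3])               # small process noise
rng = np.random.default_rng(2)
y = 2.0 + rng.normal(scale=0.3, size=(3, 50))  # 3 replicates, 50 intervals
levels = forwardPass(y, np.full((3, 50), 0.09), F, Q, np.zeros(2), np.eye(2))
```

Despite the arbitrary initialization at zero, the level estimate converges toward the shared mean of the replicate tracks, illustrating why results are typically insensitive to stateInit and stateCovarInit.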
- consenrich.core.constructMatrixF(deltaF)[source]¶
Build the state transition matrix for the process model
- Parameters:
deltaF (float) – See processParams.
- Returns:
The state transition matrix \(\mathbf{F}\)
- Return type:
npt.NDArray[np.float32]
- consenrich.core.constructMatrixQ(minDiagQ, offDiagQ=0.0, Q00=None, Q01=None, Q10=None, Q11=None, useIdentity=-1.0, tol=1e-08, useWhiteAccel=False, useDiscreteConstAccel=False, deltaF=None)[source]¶
Build the (base) process noise covariance matrix \(\mathbf{Q}\).
- Parameters:
minDiagQ (float) – Minimum value for diagonal entries of \(\mathbf{Q}\).
offDiagQ (float) – Value for off-diagonal entries of \(\mathbf{Q}\).
Q00 (Optional[float]) – Optional value for entry (0,0) of \(\mathbf{Q}\).
Q01 (Optional[float]) – Optional value for entry (0,1) of \(\mathbf{Q}\).
Q10 (Optional[float]) – Optional value for entry (1,0) of \(\mathbf{Q}\).
Q11 (Optional[float]) – Optional value for entry (1,1) of \(\mathbf{Q}\).
useIdentity (float) – If > 0.0, use a scaled identity matrix for \(\mathbf{Q}\). Overrides other parameters.
tol (float) – Tolerance for positive definiteness check.
useWhiteAccel (bool)
useDiscreteConstAccel (bool)
deltaF (float | None)
- Returns:
The process noise covariance matrix \(\mathbf{Q}\).
- Return type:
npt.NDArray[np.float32]
- consenrich.core.constructMatrixH(m, coefficients=None)[source]¶
Build the observation model matrix \(\mathbf{H}\).
- Parameters:
m (int) – Number of observations.
coefficients (Optional[np.ndarray]) – Optional coefficients for the observation model, which can be used to weight the observations manually.
- Returns:
The observation model matrix \(\mathbf{H}\).
- Return type:
npt.NDArray[np.float32]
- Seealso:
observationParams, inputParams
consenrich.detrorm¶
- consenrich.detrorm.getScaleFactor1x(bamFile, effectiveGenomeSize, readLength, excludeChroms, chromSizesFile, samThreads)[source]¶
Generic normalization factor based on effective genome size and number of mapped reads in non-excluded chromosomes.
- Parameters:
bamFile (str) – See consenrich.core.inputParams.
effectiveGenomeSize (int) – Effective genome size in base pairs. See consenrich.constants.getEffectiveGenomeSize().
readLength (int) – Read length or fragment length.
excludeChroms (List[str]) – List of chromosomes to exclude from the analysis.
chromSizesFile (str) – Path to the chromosome sizes file.
samThreads (int) – See consenrich.core.samParams.
- Returns:
Scale factor for 1x normalization.
- Return type:
float
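For intuition, 1x ('EGS') normalization scales coverage so that the genome-wide mean coverage is about 1. A sketch of the arithmetic under that assumption (the exact computation lives in getScaleFactor1x(); the depth and genome size below are made-up):

```python
effectiveGenomeSize = 2_913_022_398  # e.g., approximate hg38 effective genome size
mappedReads = 50_000_000             # mapped reads in non-excluded chromosomes
readLength = 100                     # read or fragment length (bp)

# Total sequenced bases / effective genome size = mean genome-wide coverage;
# its reciprocal rescales the track to ~1x mean coverage.
meanCoverage = mappedReads * readLength / effectiveGenomeSize
scaleFactor = 1.0 / meanCoverage
```

Deeper libraries yield larger meanCoverage and hence a smaller scale factor, putting all replicates on a comparable coverage scale before filtering.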
- consenrich.detrorm.getScaleFactorPerMillion(bamFile, excludeChroms, intervalSizeBP)[source]¶
Generic normalization factor based on number of mapped reads in non-excluded chromosomes.
- Parameters:
bamFile (str) – See consenrich.core.inputParams.
excludeChroms (List[str]) – List of chromosomes to exclude when counting mapped reads.
intervalSizeBP (int)
- Returns:
Scale factor accounting for number of mapped reads (only).
- Return type:
float
- consenrich.detrorm.getPairScaleFactors(bamFileA, bamFileB, effectiveGenomeSizeA, effectiveGenomeSizeB, readLengthA, readLengthB, excludeChroms, chromSizesFile, samThreads, intervalSizeBP, normMethod='EGS', fixControl=True)[source]¶
Scale treatment:control data based on effective genome size or reads per million.
- Parameters:
bamFileA (str) – Alignment file for the ‘treatment’ sample.
bamFileB (str) – Alignment file for the ‘control’ sample (e.g., input).
effectiveGenomeSizeA (int) – Effective genome size for the treatment sample.
effectiveGenomeSizeB (int) – Effective genome size for the control sample.
readLengthA (int) – Read or fragment length for the treatment sample.
readLengthB (int) – Read or fragment length for the control sample.
excludeChroms (List[str]) – List of chromosomes to exclude from the analysis.
chromSizesFile (str) – Path to the chromosome sizes file.
samThreads (int) – See consenrich.core.samParams.
intervalSizeBP (int) – Step size for coverage calculation.
normMethod (str) – Normalization method to use ("EGS" or "RPKM").
fixControl (bool)
- Returns:
Tuple of scale factors for treatment and control samples.
- Return type:
Tuple[float, float]
consenrich.constants¶
- consenrich.constants.getEffectiveGenomeSize(genome, readLength)[source]¶
Get the effective genome size for a given genome and read length.
- Parameters:
genome (str) – Name of the genome. See consenrich.constants.resolveGenomeName() and consenrich.core.genomeParams.
readLength (int) – Length of the reads. See consenrich.core.getReadLength().
- Raises:
ValueError – If the genome is not recognized or if the read length is not available for the genome.
- Returns:
Effective genome size in base pairs.
- Return type:
int
- consenrich.constants.getGenomeResourceFile(genome, fileType, dir_='data')[source]¶
Get the path to a genome resource file.
- Parameters:
genome (str) – The genome assembly. See consenrich.constants.resolveGenomeName() and consenrich.core.genomeParams.
fileType (str) – One of 'sizes', 'blacklist', 'sparse'.
dir_ (str)
- Returns:
Path to the resource file.
- Return type:
str
- Raises:
ValueError – If not a sizes, blacklist, or sparse file.
FileNotFoundError – If the resource file does not exist.
Cython functions: consenrich.cconsenrich¶
Several computationally burdensome tasks are implemented in Cython:
- consenrich.cconsenrich.creadBamSegment(bamFile, chromosome, start, end, intervalSizeBP, readLength, oneReadPerBin, samThreads, samFlagExclude, shiftForwardStrand53=0, shiftReverseStrand53=0, extendBP=0, maxInsertSize=1000, pairedEndMode=0, inferFragmentLength=0, minMappingQuality=0, minTemplateLength=-1, weightByOverlap=1, ignoreTLEN=1)¶
Count reads in a BAM file for a given chromosome.
- consenrich.cconsenrich.csampleBlockStats(intervals, values, expectedBlockSize, iters, randSeed, excludeIdxMask, eps=0.0)¶
Sample contiguous blocks in the response sequence (xCorr), record maxima, and repeat.
Used to build an empirical null distribution and determine significance of response outputs. Block sizes are drawn from a truncated geometric distribution, keeping sizes roughly equal in expectation while allowing enough variability to sample across different phases of the response sequence.
- Parameters:
values (cnp.ndarray[cnp.float64_t, ndim=1]) – The response sequence to sample from.
expectedBlockSize (int) – The expected size (geometric) of the blocks to sample.
iters (int) – The number of blocks to sample.
randSeed (int) – Random seed for reproducibility.
- Returns:
An array of sampled block maxima.
- Return type:
cnp.ndarray[cnp.float64_t, ndim=1]
- Seealso:
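The block-sampling scheme can be sketched in NumPy as follows. `sample_block_maxima` is a hypothetical name, and this simplified version omits the exclusion mask and `eps` handling of the Cython routine:

```python
import numpy as np

def sample_block_maxima(values, expected_block_size, iters, rand_seed=42):
    """Sketch of empirical-null construction via block maxima.

    Block lengths are drawn from a geometric distribution with mean
    ``expected_block_size`` (truncated to at least 1 and to fit in the
    sequence), mimicking the variable-phase sampling described above.
    """
    rng = np.random.default_rng(rand_seed)
    n = values.shape[0]
    maxima = np.empty(iters, dtype=np.float64)
    for k in range(iters):
        # Geometric block size with mean ~ expected_block_size, truncated.
        size = min(max(1, rng.geometric(1.0 / expected_block_size)), n)
        start = rng.integers(0, n - size + 1)
        maxima[k] = values[start:start + size].max()
    return maxima
```

An empirical upper-tail p-value for a candidate maximum \(r^*\) then follows as \((1 + \#\{\text{maxima} \geq r^*\}) / (B + 1)\), consistent with the lower bound \(1/(B+1)\) noted in consenrich.matching.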
- consenrich.cconsenrich.cgetFragmentLength(bamFile, samThreads=0, samFlagExclude=3844, maxInsertSize=1000, iters=1000, blockSize=5000, fallBack=147, rollingChunkSize=250, lagStep=5, earlyExit=250, randSeed=42)¶
- consenrich.cconsenrich.cTransform(x, blockLength, disableLocalBackground=False, w_local=1.0, w_global=2.0, verbose=False, rseed=0, useIRLS=True, asymPos=0.4, logOffset=1.0, logMult=1.0)¶
- consenrich.cconsenrich.cDenseMean(x, blockLenTarget=250, itersEM=1000, seed=0, verbose=False)¶
Use a Gaussian mixture to estimate a ‘dense’ baseline to subtract from each replicate’s transformed count track.
Note: this step is potentially redundant if using replicate-level bias terms in the observation model.
- consenrich.cconsenrich.clocalBaseline(x, blockSize=101, useIRLS=True, asymPos=0.4, noiseMult=1.0, cauchyScale=3.0, maxIters=25, minWeight=1e-06, tol=0.001)¶
Estimate a local baseline on x (a smooth lower envelope) via IRLS.
Compute a locally smooth baseline \(\hat{b}\) for an input signal \(x\), using a second-order penalized smoother (Whittaker) with asymmetric iteratively reweighted least squares (IRLS) to reduce the influence of peaks.
- Parameters:
x (np.ndarray) – Signal measurements over fixed-length genomic intervals
asymPos (float) – Relative weight assigned to positive residuals to induce asymmetry in reweighting during IRLS for the local baseline computation. Smaller values downweight peaks more strongly and pose less risk of removing true signal. Typical range is (0, 0.75].
cauchyScale (float) – Controls how quickly weights decay with normalized residual magnitude. Smaller values downweight outliers more strongly.
- Returns:
Baseline estimate as a float32 NumPy array of shape (n,).
- Return type:
numpy.ndarray
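A minimal dense sketch of the asymmetric IRLS Whittaker smoother described above may be helpful. It assumes a smoothness penalty `lam` in place of the library's block-size parameterization; names and defaults are illustrative, not the library API:

```python
import numpy as np

def local_baseline(x, lam=100.0, asym_pos=0.4, cauchy_scale=3.0,
                   max_iters=25, min_weight=1e-6, tol=1e-3):
    """Sketch: asymmetric IRLS around a second-order Whittaker smoother."""
    n = x.size
    # Second-order difference penalty D^T D (Whittaker smoother).
    D = np.diff(np.eye(n), n=2, axis=0)
    P = lam * (D.T @ D)
    w = np.ones(n)
    b = x.copy()
    for _ in range(max_iters):
        # Weighted penalized least squares: (W + lam D^T D) b = W x
        b_new = np.linalg.solve(np.diag(w) + P, w * x)
        r = x - b_new
        scale = np.median(np.abs(r)) + 1e-12
        u = r / (cauchy_scale * scale)
        w_new = 1.0 / (1.0 + u * u)      # Cauchy-type decay with residual size
        w_new[r > 0] *= asym_pos         # downweight positive residuals (peaks)
        w_new = np.maximum(w_new, min_weight)
        if np.max(np.abs(b_new - b)) < tol:
            b = b_new
            break
        b, w = b_new, w_new
    return b.astype(np.float32)
```

The asymmetry (`asym_pos < 1`) pulls the fit toward a lower envelope, so enriched regions sit above the estimated baseline.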
- consenrich.cconsenrich.cPAVA(x, postWeight)¶
PAVA for isotonic regression
This code follows the notation and algorithm of Busing (2022, JSS, DOI: 10.18637/jss.v102.c01). From Busing:
“Observe that the violation 8 = x3 > x4 = 2 is solved by combining two values, 8 and 2, resulting in a (new) block value of 5, i.e., (8 + 2)/2 = 5. Instead of immediately turning around and start solving down block violation, we may first look ahead for the next value in the sequence, k-up, for if this element is smaller than or equal to 5, the next value can immediately be pooled into the current block, i.e., (8 + 2 + 2)/3 = 4. Looking ahead can be continued until the next element is larger than the current block value or if we reach the end of the sequence.”
- Parameters:
x (cnp.ndarray, (either f32 or f64)) – 1D array to be fitted as nondecreasing
postWeight (cnp.ndarray) – 1D array of weights corresponding to each observed value. These are the number of ‘observations’ associated to each ‘unique’ value in x. Intuition: more weight to values with more observations.
- Returns:
PAVA-fitted values
- Return type:
cnp.ndarray, (either f32 or f64)
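The classic PAVA block-merging scheme can be sketched in pure Python (without Busing's look-ahead refinement, which the Cython version follows):

```python
def pava(x, weights=None):
    """Minimal pool-adjacent-violators sketch: nondecreasing weighted fit.

    Each block stores [pooled value, total weight, member count]; adjacent
    blocks are merged (weighted mean) while they violate monotonicity.
    """
    n = len(x)
    w = list(weights) if weights is not None else [1.0] * n
    blocks = []
    for xi, wi in zip(x, w):
        blocks.append([float(xi), float(wi), 1])
        # Merge while the last two blocks violate nondecreasing order.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            v1, w1, c1 = blocks.pop()
            v0, w0, c0 = blocks.pop()
            wt = w0 + w1
            blocks.append([(w0 * v0 + w1 * v1) / wt, wt, c0 + c1])
    out = []
    for v, _, c in blocks:
        out.extend([v] * c)
    return out
```

On Busing's example the sequence [8, 2, 2] pools to the block value (8 + 2 + 2)/3 = 4, matching the quoted look-ahead result.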
- consenrich.cconsenrich.cforwardPass(matrixData, matrixPluginMuncInit, matrixF, matrixQ0, intervalToBlockMap, rScale, qScale, blockCount, stateInit, stateCovarInit, pad=0.0001, projectStateDuringFiltering=False, stateLowerBound=0.0, stateUpperBound=0.0, chunkSize=1000000, stateForward=None, stateCovarForward=None, pNoiseForward=None, vectorD=None, progressBar=None, progressIter=25000, returnNLL=False, storeNLLInD=False, lambdaExp=None, processPrecExp=None, replicateBias=None, replicateScale=None, EM_useObsBlockScale=True, EM_useProcBlockScale=True, EM_useObsPrecReweight=True, EM_useProcPrecReweight=True, EM_useReplicateBias=False, EM_useReplicateScale=False)¶
Run the forward pass (filter) for state estimation
See consenrich.cconsenrich.cblockScaleEM(), where this routine is applied within the filter, smooth, update loop.
- consenrich.cconsenrich.cbackwardPass(matrixData, matrixF, stateForward, stateCovarForward, pNoiseForward, chunkSize=1000000, stateSmoothed=None, stateCovarSmoothed=None, lagCovSmoothed=None, postFitResiduals=None, replicateBias=None, progressBar=None, progressIter=10000, EM_useReplicateBias=False)¶
Run the backward pass (smoother)
This function executes the smoothing phase of Consenrich’s forward–backward state estimation. It operates on the forward-filtered outputs (those returned by consenrich.cconsenrich.cforwardPass()). That is, given the filtered estimates over genomic intervals \(i = 1, \dots, n\),
\[\mathbf{x}_{[i|i]}, \qquad \mathbf{P}_{[i|i]}, \qquad \mathbf{Q}_{[i]},\]this routine computes the backward-smoothed state estimates \(\widetilde{\mathbf{x}}_{[i]}\) and the backward-smoothed covariances \(\widetilde{\mathbf{P}}_{[i]}\).
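A self-contained sketch of the forward filter and RTS backward smoother for the level+slope model may clarify what cforwardPass and cbackwardPass compute. This simplified version assumes a single replicate, scalar observations (\(\mathbf{H} = [1, 0]\)), and fixed noise covariances; it is illustrative, not the library implementation:

```python
import numpy as np

def forward_backward(z, R, F, Q, x0, P0):
    """Kalman forward pass + RTS backward smoother, one scalar observation
    per interval, for the 2-state level+slope process model."""
    n = z.shape[0]
    H = np.array([1.0, 0.0])
    xf = np.zeros((n, 2)); Pf = np.zeros((n, 2, 2))
    x, P = x0.copy(), P0.copy()
    for i in range(n):
        # Predict
        x = F @ x
        P = F @ P @ F.T + Q
        # Update with scalar observation z[i]
        S = H @ P @ H + R[i]
        K = (P @ H) / S
        x = x + K * (z[i] - H @ x)
        P = P - np.outer(K, H @ P)
        xf[i], Pf[i] = x, P
    # RTS backward pass (smoothed moments)
    xs, Ps = xf.copy(), Pf.copy()
    for i in range(n - 2, -1, -1):
        Pp = F @ Pf[i] @ F.T + Q                 # one-step-ahead covariance
        G = Pf[i] @ F.T @ np.linalg.inv(Pp)      # smoother gain
        xs[i] = xf[i] + G @ (xs[i + 1] - F @ xf[i])
        Ps[i] = Pf[i] + G @ (Ps[i + 1] - Pp) @ G.T
    return xs, Ps
```

With \(F_{01} = \Delta_F = 1\), the smoothed level \(x_{[i,0]}\) tracks a ramp once the slope state is learned.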
- consenrich.cconsenrich.cblockScaleEM(matrixData, matrixPluginMuncInit, matrixF, matrixQ0, intervalToBlockMap, blockCount, stateInit, stateCovarInit, EM_maxIters=50, EM_rtol=0.0001, pad=0.0001, EM_scaleLOW=0.1, EM_scaleHIGH=10.0, EM_alphaEMA=0.1, EM_alphaEMA_Q=1.0, EM_scaleToMedian=False, EM_tNu=10.0, EM_useObsBlockScale=True, EM_useProcBlockScale=True, EM_useObsPrecReweight=True, EM_useProcPrecReweight=True, EM_useReplicateBias=True, EM_useReplicateScale=True, EM_repBiasShrink=0.0, EM_repScaleLOW=0.25, EM_repScaleHIGH=4.0, t_innerIters=3, returnIntermediates=False)¶
Run the Consenrich filter-smoother estimation loop with iteratively updated observation and process noise [co]variances.
Take observation and process noise [co]variances:
\[\widetilde{R}_{[j,i]}=\frac{a_j\,r_{b(i)}\,v_{[j,i]}}{\lambda_{[j,i]}}, \qquad \widetilde{\mathbf{Q}}_{[i]}=\frac{q_{b(i)}\,\mathbf{Q}_0}{\kappa_{[i]}}.\]
We also include replicate-level additive offsets in the observation mean:
\[z_{[j,i]} = x_{[i,0]} + b_j + \epsilon_{[j,i]}.\]
Here \(r_b>0\) and \(q_b>0\) are block-level scale multipliers, \(a_j>0\) and \(b_j\) are replicate-level multipliers/offsets, and \(\lambda_{[j,i]}\) and \(\kappa_{[i]}\) are Student-t precision multipliers.
Estimation loop¶
Repeat until convergence:
1. Filter-smoother estimation:
Run the forward filter and backward smoother under the current (given) effective noises \(\widetilde{R}\) and \(\widetilde{\mathbf{Q}}\). This yields smoothed moments \(\widetilde{\mathbf{x}}_{[i]}\), \(\widetilde{\mathbf{P}}_{[i]}\), and lag-one covariances \(\widetilde{\mathbf{C}}_{[i,i+1]}\).
2. Studentized precision reweighting:
Observation weights \(\lambda_{[j,i]}\) (EM_useObsPrecReweight):
\[u^2_{[j,i]}=\frac{(z_{[j,i]}-b_j-\widetilde{x}_{[i,0]})^2+\widetilde{P}_{[i,0,0]}} {\widetilde{R}_{[j,i]}} \quad\Rightarrow\quad \lambda_{[j,i]} \leftarrow \frac{\nu_R+1}{\nu_R+u^2_{[j,i]}}.\]
In code, EM_tNu corresponds to \(\nu_R\).
Process weights \(\kappa_{[i]}\) (EM_useProcPrecReweight):
Let \(\mathbf{w}_{[i]}=\mathbf{x}_{[i]}-\mathbf{F}\mathbf{x}_{[i-1]}\) and define
\[\Delta_{[i]}=\textsf{Trace}\!\left(\mathbf{Q}_0^{-1}\,\mathbb{E}\left[\mathbf{w}_{[i]}\mathbf{w}_{[i]}^\top\right]\right).\]
Then
\[\kappa_{[i]} \leftarrow \frac{\nu_Q+d}{\nu_Q+\Delta_{[i]}/q_{b(i)}},\]
where \(d=2\).
3. Block-level scaling: update \(r_b\) and/or \(q_b\) using blockwise closed-form averages of sufficient statistics from the E-step.
Observation block scale update:
\[r_b \leftarrow \frac{1}{N_b}\sum_{(j,i):\,b(i)=b} \lambda_{[j,i]}\, \frac{(z_{[j,i]}-b_j-\widetilde{x}_{[i,0]})^2+\widetilde{P}_{[i,0,0]}}{a_j\,v_{[j,i]}}.\]
Process block scale update:
\[q_b \leftarrow \frac{1}{d\,M_b}\sum_{i:\,b(i)=b}\kappa_{[i+1]}\,\Delta_{[i]}.\]
4. Replicate-level updates: update \(b_j\) and \(a_j\) from weighted residual moments at replicate resolution while holding block- and interval-level terms fixed.
If EM_alphaEMA is supplied, we smooth updates with an exponential moving average (EMA) in the log domain:
\[\log r_b \leftarrow (1-\alpha)\log r_b^{(\mathrm{sm})} + \alpha \log r_b, \qquad \log q_b \leftarrow (1-\alpha_Q)\log q_b^{(\mathrm{sm})} + \alpha_Q \log q_b.\]
Objective Function¶
Let \(x_{1:n}=\{\mathbf{x}_{[i]}\}_{i=1}^n\), \(\Lambda=\{\lambda_{[j,i]}\}\), and \(\kappa=\{\kappa_{[i]}\}\). Collecting process and observation terms and mixing penalties yields:
\begin{align} \mathcal{J}(x,\Lambda,\kappa,r,q,a,b) &= \frac12\sum_{i=2}^{n} \left[ \log\left|\frac{q_{b(i)}}{\kappa_{[i]}}\mathbf{Q}_0\right| + (\mathbf{x}_{[i]}-\mathbf{F}\mathbf{x}_{[i-1]})^\top \left(\frac{\kappa_{[i]}}{q_{b(i)}}\mathbf{Q}_0^{-1}\right) (\mathbf{x}_{[i]}-\mathbf{F}\mathbf{x}_{[i-1]}) \right] \\ &\quad+ \frac12\sum_{i=1}^{n}\sum_{j=1}^m \left[ \log\!\left(\frac{a_j r_{b(i)}v_{[j,i]}}{\lambda_{[j,i]}}\right) + (z_{[j,i]}-b_j-x_{[i,0]})^2\,\frac{\lambda_{[j,i]}}{a_j r_{b(i)}v_{[j,i]}} \right] \\ &\quad+ \sum_{i=1}^{n}\sum_{j=1}^m \left[ -\left(\frac{\nu_R+1}{2}-1\right)\log\lambda_{[j,i]} +\frac{\nu_R+1}{2}\lambda_{[j,i]} \right] \\ &\quad+ \sum_{i=2}^{n} \left[ -\left(\frac{\nu_Q+d}{2}-1\right)\log\kappa_{[i]} +\frac{\nu_Q+d}{2}\kappa_{[i]} \right]. \end{align}
The estimation loop maximizing this objective may thus be viewed as coordinate ascent: the filter-smoother solves the quadratic subproblem conditional on the current estimates of \(\Lambda\), \(\kappa\), \(r\), \(q\), \(a\), and \(b\); the reweighting and scale/offset updates then optimize over those same quantities.
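The Studentized observation reweighting admits a compact vectorized sketch. A minimal NumPy illustration, assuming fixed effective variances and replicates-by-intervals array shapes; the function name and signature are illustrative, not the library API:

```python
import numpy as np

def obs_precision_weights(z, b, x_level, P00, R_eff, nu_R=10.0):
    """Sketch of the Student-t observation reweighting step
    (EM_useObsPrecReweight): lambda <- (nu_R + 1) / (nu_R + u^2),
    with u^2 = ((z - b_j - x_level)^2 + P00) / R_eff.

    z: (m, n) observations; b: (m,) replicate offsets; x_level: (n,)
    smoothed signal levels; P00: (n,) level variances; R_eff: (m, n).
    """
    u2 = ((z - b[:, None] - x_level[None, :]) ** 2 + P00[None, :]) / R_eff
    return (nu_R + 1.0) / (nu_R + u2)
```

Large standardized residuals yield \(\lambda < 1\), which inflates the effective variance \(\widetilde{R}_{[j,i]} = a_j r_{b(i)} v_{[j,i]} / \lambda_{[j,i]}\) and downweights the outlying observation in the next filter-smoother pass.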
- param matrixData:
Replicate track data (rows: replicates, columns: genomic intervals).
- type matrixData:
numpy.ndarray[numpy.float32]
- param matrixPluginMuncInit:
Data-derived observation noise variances \(v_{[j,i]}\). Same per-replicate/per-interval shape as matrixData.
- type matrixPluginMuncInit:
numpy.ndarray[numpy.float32]
- param matrixF:
Transition matrix \(\mathbf{F}\), shape (2, 2).
- type matrixF:
numpy.ndarray[numpy.float32]
- param matrixQ0:
Base process noise covariance: \(\mathbf{Q}_0 \in \mathbb{R}^{2 \times 2}\)
- type matrixQ0:
numpy.ndarray[numpy.float32]
- param intervalToBlockMap:
Mapping from interval index \(i\) to block index \(b(i)\)
- type intervalToBlockMap:
numpy.ndarray[numpy.int32]
- param blockCount:
Number of blocks that are rescaled during optimization. Larger values optimize at a finer resolution, but this introduces risk of overfitting, increased runtime, and potential ambiguity with the precision reweighting procedure.
- type blockCount:
int
- param stateInit:
Initial state value for the signal-level (first component) of the state vector \(\mathbf{x}_{[0]}\)
- type stateInit:
float
- param stateCovarInit:
Initial state covariance scale
- type stateCovarInit:
float
- param EM_maxIters:
Maximum outer EM iterations.
- type EM_maxIters:
int
- param EM_rtol:
Relative improvement tolerance on the NLL used to assess convergence.
- type EM_rtol:
float
- param EM_scaleLOW:
Lower bound for block scales \(r_b\) and \(q_b\).
- type EM_scaleLOW:
float
- param EM_scaleHIGH:
Upper bound for block scales \(r_b\) and \(q_b\)
- type EM_scaleHIGH:
float
- param EM_alphaEMA:
If in (0, 1], enables log-domain EMA smoothing for block scale updates with smoothing factor \(\alpha\).
- type EM_alphaEMA:
float
- param EM_scaleToMedian:
If True, rescales \(r_b\) and \(q_b\) by their medians after each M-step. This can help preserve between-block scale differences for interpretability, but may interfere with convergence of the EM algorithm since it changes the effective objective.
- type EM_scaleToMedian:
bool
- param EM_tNu:
Student-t df for reweighting strengths (smaller = stronger reweighting)
- type EM_tNu:
float
- param EM_useObsBlockScale:
If True, estimate block observation scales \(r_b\); otherwise fix \(r_b\equiv 1\).
- type EM_useObsBlockScale:
bool
- param EM_useProcBlockScale:
If True, estimate block process scales \(q_b\); otherwise fix \(q_b\equiv 1\).
- type EM_useProcBlockScale:
bool
- param EM_useObsPrecReweight:
If True, update observation precision multipliers \(\lambda_{[j,i]}\) (Student-t reweighting); otherwise \(\lambda\equiv 1\).
- type EM_useObsPrecReweight:
bool
- param EM_useProcPrecReweight:
If True, update process precision multipliers \(\kappa_{[i]}\) (Student-t reweighting); otherwise \(\kappa\equiv 1\).
- type EM_useProcPrecReweight:
bool
- param EM_useReplicateBias:
If True, update replicate-level additive offsets \(b_j\).
- type EM_useReplicateBias:
bool
- param EM_useReplicateScale:
If True, update replicate-level observation variance scales \(a_j\).
- type EM_useReplicateScale:
bool
- param t_innerIters:
Number of inner filter/smoother + reweighting updates per EM outer iteration.
- type t_innerIters:
int
- param returnIntermediates:
If True, also return smoothed states/covariances, residuals, and (if enabled) precision multipliers.
- type returnIntermediates:
bool
- returns:
A tuple (rScale, qScale, itersDone, finalNLL). If returnIntermediates=True, additionally returns (stateSmoothed, stateCovarSmoothed, lagCovSmoothed, postFitResiduals, lambdaExp, processPrecExp, replicateBias, replicateScale).
- rtype:
tuple
References
Shumway, R. H. & Stoffer, D. S. (1982): An approach to time series smoothing and forecasting using the EM algorithm. DOI: 10.1111/j.1467-9892.1982.tb00349.x
West, M. (1987): On scale mixtures of normal distributions. DOI: 10.1093/biomet/74.3.646
- consenrich.cconsenrich.cEMA(x, alpha)¶
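cEMA's semantics are not documented here; the following pure-Python recursion is a plausible reading of a standard exponential moving average (an assumption, not the library's definition):

```python
def ema(x, alpha):
    """Assumed EMA recursion: y[0] = x[0];
    y[i] = alpha * x[i] + (1 - alpha) * y[i-1]."""
    y = [float(x[0])]
    for v in x[1:]:
        y.append(alpha * float(v) + (1.0 - alpha) * y[-1])
    return y
```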
- consenrich.cconsenrich.cmeanVarPairs(intervals, values, blockSize, iters, randSeed, excludeIdxMask, zeroPenalty=0.0, zeroThresh=0.0, useInnovationVar=True, useSampleVar=False)¶
- consenrich.cconsenrich.projectToBox(vectorX, matrixP, stateLowerBound, stateUpperBound, eps)¶
consenrich.matching¶
(Experimental) Detect genomic regions showing both enrichment and non-random structure
Denote a noisy signal over fixed-length genomic intervals as \(x_{[1:n]}\).
Aim: Determine a set of structured peaks, i.e., genomic regions where \(x_{[a:b]}\) exhibits both:
Enrichment (large relative amplitude)
Non-random structure defined by a robust template (polynomial, oscillatory, etc.)
Prioritizing genomic regions that are both enriched and agree with a prescribed structure may be appealing for several reasons. Namely,
Targeted detection of biologically relevant signal patterns in a given assay (e.g., see related works analyzing peak-shape Cremona et al., 2015, Parodi et al., 2017)
Improved confidence that the identified genomic regions are not due to stochastic noise, which is characteristically unstructured.
Speed: Runs in seconds/minutes using efficient numerical methods to compute large chromosome-scale convolutions (fast Fourier transform (FFT)-based, overlap-add (OA), etc.)
Algorithm Overview¶
To detect structured peaks, we run an approach akin to matched filtering, with templates derived from approximated discrete wavelets (or their corresponding scaling functions).
Note: other bases for defining templates may be considered in the future, e.g., splines, polynomials, etc. Wavelet-based templates are convenient and possess several desirable properties for detecting structured peaks: multi-scale representation, compact support, orthogonality, etc. For relevant background, Fournier et al. ’95 and/or Mallat and Hwang ’92 may be useful reading.
Denote the cross-correlation between the signal track and a matching template \(\boldsymbol{\xi}\) (length \(T\)) as:
\[\mathcal{R}_{[i]} = \sum_{t=1}^{T} \xi_{[t]}\, x_{[i+t-1]}.\]
We refer to \(\mathcal{R}\) over \(i=1 \ldots n\) as the response sequence. The response will be greatest in genomic regions with a high read density and structural similarity to the template \(\boldsymbol{\xi}\).
To detect ‘significant’ hits,
We first construct an empirical null distribution from randomly sampled genomic blocks. Specifically, we sample \(B\) blocks and record each block’s maximum, \(\max(\mathcal{R}_{[b_1]}, \ldots, \mathcal{R}_{[b_K]})\). To reduce leakage and selection bias, the chromosome is split into block-level null/test subsets, with guard zones near split boundaries.
Relative maxima in the response sequence, i.e., \(i^*\) such that \(\mathcal{R}_{[i^* - T/2 \,:\, i^* - 1]}\, \leq \, \mathcal{R}_{[i^*]} \, \geq \, \mathcal{R}_{[i^* + 1 \,:\, i^* + T/2]}\), are retained as candidate matches.
Each candidate is assigned an empirical \(p\)-value based on its upper-tail ECDF under the null subset maxima (with lower bound \(1/(B+1)\)). The split is repeated several times, and per-candidate split-level \(p\)-values are aggregated with a Cauchy combination rule. Those satisfying \(p < \alpha\) are deemed ‘significant’. Note that \(p\)-values are with respect to chromosome-specific empirical distributions.
Additional criteria for matching: require the signal values at candidate peaks/matches, \({x}_{[i^*]}\), to exceed a cutoff (matchingParams.minSignalAtMaxima), and/or require the length of the matched feature to exceed a minimum size (matchingParams.minMatchLengthBP).
Overlapping/adjacent matches can be merged.
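The Cauchy combination rule used to aggregate split-level p-values can be sketched as follows (this is the standard form of the statistic; parameter names are illustrative):

```python
import math

def cauchy_combine(pvals, weights=None):
    """Cauchy combination: T = sum_k w_k * tan(pi * (0.5 - p_k));
    combined p = 0.5 - atan(T) / pi. Weights default to uniform."""
    m = len(pvals)
    w = weights if weights is not None else [1.0 / m] * m
    t = sum(wk * math.tan(math.pi * (0.5 - pk)) for wk, pk in zip(w, pvals))
    return 0.5 - math.atan(t) / math.pi
```

A useful property: combining identical p-values with uniform weights returns the same p-value, and the statistic is robust to dependence between splits.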
Note
Blocked Split + Repeated Combination
The response sequence is partitioned into coarse genomic blocks, stratified by robust block summary statistics, then assigned approximately 50/50 into null/test subsets within each stratum.
For each split, null maxima are sampled only from null blocks and candidates are evaluated only on test blocks.
Split-level candidate \(p\)-values are then combined with a Cauchy combination statistic.
Generic Defaults
matchingParams.templateNames: [haar, haar, db2, db2]
matchingParams.cascadeLevels: [1,2,1,2]
matchingParams.minMatchLengthBP: -1 # select via `consenrich.core.getContextSize`
matchingParams.mergeGapBP: -1
matchingParams.alpha: 0.05
- consenrich.matching.matchWavelet(chromosome, intervals, values, templateNames, cascadeLevels, iters, alpha=0.05, minMatchLengthBP=-1, maxNumMatches=100000, minSignalAtMaxima=0.1, randSeed=42, recenterAtPointSource=True, useScalingFunction=True, excludeRegionsBedFile=None, weights=None, eps=0.001, methodFDR=None)[source]¶
Detect structured peaks in Consenrich tracks by matching wavelet- or scaling-function–based templates.
- Parameters:
chromosome (str) – Chromosome name for the input intervals and values.
values (npt.NDArray[np.float64]) – A 1D array of signal-like values. In this documentation, we refer to values derived from Consenrich, but other continuous-valued tracks at evenly spaced genomic intervals may be suitable, too.
templateNames (List[str]) – A list of str values – each entry references a mother wavelet (or its corresponding scaling function). e.g., [haar, db2]
cascadeLevels (List[int]) – Number of cascade iterations used to approximate each template (wavelet or scaling function). Must have the same length as templateNames, with each entry aligned to the corresponding template. e.g., given templateNames [haar, db2], then [2,2] would use 2 cascade levels for both templates.
iters (int) – Number of random blocks to sample in the response sequence while building an empirical null to test significance within chromosomes. See cconsenrich.csampleBlockStats().
alpha (float) – Primary significance threshold on detected matches. Specifically, the minimum corrected empirical p-value approximated from randomly sampled blocks in the response sequence.
minMatchLengthBP (Optional[int]) – Within a window of minMatchLengthBP length (bp), relative maxima in the signal-template convolution must be greater in value than others to qualify as matches. If set to a value less than 1, the minimum length is determined via consenrich.matching.autoMinLengthIntervals(), a simple wrapper around consenrich.core.getContextSize(). If set to None, defaults to 250 bp.
minSignalAtMaxima (Optional[str | float]) – Secondary significance threshold coupled with \(\alpha\). Requires the signal value at relative maxima in the response sequence to be greater than a threshold \(\pm \epsilon\).
useScalingFunction (bool) – If True, use (only) the scaling function to build the matching template. If False, use (only) the wavelet function.
excludeRegionsBedFile (Optional[str]) – A BED file with regions to exclude from matching
recenterAtPointSource (bool) – If True, recenter detected matches at the point source (max value)
weights (Optional[npt.NDArray[np.float64]]) – Optional weights to apply to values prior to matching. Must have the same length as values.
eps (float) – Tolerance parameter for relative maxima detection in the response sequence. Set to zero to enforce strict inequalities when identifying discrete relative maxima.
intervals (ndarray[tuple[Any, ...], dtype[int]])
maxNumMatches (int | None)
randSeed (int)
methodFDR (str | None)
- Seealso:
consenrich.core.matchingParams, cconsenrich.csampleBlockStats(), `matching`
- Returns:
A pandas DataFrame with detected matches
- Return type:
pd.DataFrame
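A toy version of the matching step, with a hand-built Haar-like template and no significance testing, might look like this. Names are illustrative; the library builds templates via wavelet cascade approximation instead:

```python
import numpy as np

def match_template(values, template, min_signal=0.1):
    """Toy matched filter: correlate a track with a unit-norm template and
    return indices of relative maxima in the response that also clear a
    signal cutoff (block-maxima significance testing omitted here)."""
    t = np.asarray(template, dtype=np.float64)
    t = t / (np.linalg.norm(t) + 1e-12)
    resp = np.correlate(values, t, mode="same")   # response sequence R
    hits = []
    for i in range(1, len(resp) - 1):
        if resp[i] >= resp[i - 1] and resp[i] >= resp[i + 1] \
                and values[i] > min_signal:
            hits.append(i)
    return resp, hits
```

The response is largest where the track is both enriched and shaped like the template, mirroring the two criteria described in the module overview.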
Notation¶
\(m\) = number of replicate tracks (rows)
\(n\) = number of genomic intervals (columns)
\(B\) = number of genomic blocks partitioning the main contig (blockCount)
\(b(i)\in\{0,\dots,B-1\}\) maps interval \(i\) to block index (intervalToBlockMap)
\(z_{[j,i]}\) is the observation for replicate \(j\) at genomic interval \(i\)
\(v_{[j,i]}\) is the observation variance for replicate \(j\) at genomic interval \(i\) (consenrich.core.getMuncTrack())
\(\mathbf{x}_{[i]}\in\mathbb{R}^2\) is the state vector at interval \(i\), where
\(x_{[i,0]}\) denotes the latent signal level at genomic interval \(i\)
\(x_{[i,1]}\) denotes the latent signal slope at genomic interval \(i\)
\(\mathbf{P}_{[i]}\in\mathbb{R}^{2\times 2}\) is the state covariance matrix at interval \(i\), where
\(P_{[i,0,0]}\) is the variance of the signal level at interval \(i\)
\(P_{[i,1,1]}\) is the variance of the signal slope at interval \(i\)
\(P_{[i,0,1]}=P_{[i,1,0]}\) is the covariance between signal level and slope at interval \(i\)
\(\mathbf{F}\) is the process model matrix (matrixF), where
\(F_{00}=1\) and \(F_{01}=\Delta_F\)
\(F_{10}=0\) and \(F_{11}=1\)
\(\mathbf{Q}_0\) is the initial process model noise covariance matrix (matrixQ0)
\(\mathbf{H}\) is the observation model matrix (\(\mathbf{H}=[1, 0]\))
\(\mathbf{R}_{[i]}=\textsf{diag}\{v_{[:,i]}\}\) is the base observation-noise covariance at interval \(i\)
\(\lambda_{[j,i]}\) is the latent precision multiplier for observation \(z_{[j,i]}\) in a Student-t Gaussian scale-mixture model
\(\kappa_{[i]}\) is the latent precision multiplier for the process transition at interval \(i\) in a Student-t Gaussian scale-mixture model
\(\nu_R\) is the degrees-of-freedom parameter controlling the heaviness of Student-t tails for observation noise
\(\nu_Q\) is the degrees-of-freedom parameter controlling the heaviness of Student-t tails for process noise
\(r_b\) is the block-level observation noise scale multiplier for block \(b\)
\(q_b\) is the block-level process noise scale multiplier for block \(b\)
\(a_j\) is the replicate-level observation variance scale multiplier for replicate \(j\)
\(b_j\) is the replicate-level additive observation offset for replicate \(j\)
The effective observation model used in EM is:
\[z_{[j,i]} = x_{[i,0]} + b_j + \epsilon_{[j,i]}, \qquad \epsilon_{[j,i]} \sim \mathcal{N}\!\left(0,\; \widetilde{R}_{[j,i]}\right), \qquad \widetilde{R}_{[j,i]}=\frac{a_j\,r_{b(i)}\,v_{[j,i]}}{\lambda_{[j,i]}}.\]
Resolution of scaling terms:
\(v_{[j,i]}\) and \(\lambda_{[j,i]}\) operate at replicate-interval resolution
\(r_b\) and \(q_b\) operate at block resolution via \(b(i)\)
\(a_j\) and \(b_j\) operate at replicate resolution