API Reference

consenrich.core

The core module implements the main features of Consenrich and defines key parameter classes for a consistent namespace.

Note

Many parameters are documented here for completeness and do not need to be tuned in common use-cases. See Notation for definitions of key variables and parameters used throughout the documentation.

class consenrich.core.processParams(deltaF, minQ, maxQ, offDiagQ)[source]

Parameters related to the process model of Consenrich.

The consensus epigenomic signal is modeled explicitly with a simple ‘level + slope’ process.

Parameters:
  • deltaF (float) – Controls the integration step size between the signal ‘slope’ \(\dot{x}_{[i]}\) and the signal ‘level’ \(x_{[i]}\).

  • minQ (float) – Minimum process noise scale (diagonal in \(\mathbf{Q}_{[i]}\)) on the primary state variable (signal level). If minQ < 0 (default), a small value scales the minimum observation noise level (observationParams.minR) and is used for numerical stability.

  • maxQ (float) – Maximum process noise scale. If maxQ < 0 (default), no effective upper bound is enforced.

  • offDiagQ (float) – Off-diagonal value in the process noise covariance, \(\mathbf{Q}_{[i,01]}\).

Seealso:

consenrich.core.runConsenrich()

class consenrich.core.plotParams(plotPrefix=None, plotStateEstimatesHistogram=False, plotMWSRHistogram=False, plotHeightInches=6.0, plotWidthInches=8.0, plotDPI=300, plotDirectory=None)[source]

(Experimental) Parameters related to plotting filter results and diagnostics.

Parameters:
  • plotPrefix (str or None) – Prefix for output plot filenames.

  • plotStateEstimatesHistogram (bool) – If True, plot a histogram of post-fit primary state estimates.

  • plotMWSRHistogram (bool) – If True, plot a histogram of post-fit weighted squared residuals (MWSR).

  • plotHeightInches (float) – Height of output plots in inches.

  • plotWidthInches (float) – Width of output plots in inches.

  • plotDPI (int) – DPI of output plots (PNG).

  • plotDirectory (str or None) – Directory where plots will be written.

Seealso:

plotStateEstimatesHistogram(), plotMWSRHistogram(), outputParams

class consenrich.core.observationParams(minR, maxR, samplingIters, samplingBlockSizeBP, binQuantileCutoff, EB_minLin, EB_use, EB_setNu0, EB_setNuL, pad)[source]

Parameters related to the observation model of Consenrich.

The observation model is used to integrate sequence alignment data from multiple replicates while accounting for both region- and replicate-specific noise.

Parameters:
  • minR (float | None) – Genome-wide lower bound for replicate-specific observation noise scale.

  • maxR (float | None) – Genome-wide upper bound for the replicate-specific observation noise scale.

  • samplingIters (int | None) – Number of blocks (within-contig) to sample while building the empirical absMean-variance trend in consenrich.core.fitVarianceFunction().

  • samplingBlockSizeBP (int | None) – Expected size (in bp) of contiguous blocks that are sampled when fitting AR1 parameters to estimate \((\lvert \mu_b \rvert, \sigma^2_b)\) pairs. Note, during sampling, each block’s size (unit: genomic intervals) is drawn from a truncated \(\textsf{Geometric}(p=1/\textsf{samplingBlockSize})\) distribution to reduce artifacts from fixed-size blocks. If None or < 1, this value is inferred using consenrich.core.getContextSize().

  • binQuantileCutoff (float | None) – When fitting the variance function, pairs \((\lvert \mu_b \rvert, \sigma^2_b)\) are binned by their (absolute) means. This parameter sets the quantile of variances within each bin to use when fitting the global mean-variance trend. Increasing this value toward 1.0 can raise the prior trend for measurement uncertainty and yield stiffer signal estimates overall.

  • EB_minLin (float | None) – Require that the fitted trend in consenrich.core.getMuncTrack() satisfy: \(\textsf{variance} \geq \textsf{minLin} \cdot |\textsf{mean}|\). See fitVarianceFunction().

  • EB_use (bool | None) – If True, shrink ‘local’ noise estimates to a prior trend dependent on amplitude. See consenrich.core.getMuncTrack().

  • EB_setNu0 (int | None) – If provided, manually set \(\nu_0\) to this value (rather than computing via consenrich.core.EB_computePriorStrength()).

  • EB_setNuL (int | None) – If provided, manually set local model df, \(\nu_L\), to this value.

  • pad (float | None) – A small constant added to the observation noise variance estimates for conditioning.

Seealso:

consenrich.core.getMuncTrack(), consenrich.core.fitVarianceFunction(), consenrich.core.EB_computePriorStrength(), consenrich.cconsenrich.cblockScaleEM()
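As an illustration of the truncated-geometric block sampling described for samplingBlockSizeBP (a minimal NumPy sketch; sample_block_sizes is a hypothetical helper, not part of the library):

```python
import numpy as np

def sample_block_sizes(expected_size, n_blocks, max_size=None, seed=42):
    """Draw block sizes (in intervals) from a truncated Geometric(p=1/expected_size)."""
    rng = np.random.default_rng(seed)
    p = 1.0 / expected_size
    sizes = rng.geometric(p, size=n_blocks)
    if max_size is not None:
        sizes = np.minimum(sizes, max_size)  # truncate to avoid extreme block sizes
    return np.maximum(sizes, 1)

sizes = sample_block_sizes(expected_size=50, n_blocks=10000)
# sample mean is near the expected block size (1/p = 50 intervals)
```

Drawing sizes rather than fixing them reduces artifacts from a rigid block grid while keeping the expected block size intact.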

class consenrich.core.stateParams(stateInit, stateCovarInit, boundState, stateLowerBound, stateUpperBound, conformalRescale, conformalAlpha, conformal_numIters, conformalFinalRefit)[source]

Parameters related to state variables and covariances.

Parameters:
  • stateInit (float) – Initial value of the ‘primary’ state/signal at the first genomic interval: \(x_{[1]}\)

  • stateCovarInit (float) – Initial state covariance scale. Note, the initial state uncertainty \(\mathbf{P}_{[1]}\) is a multiple of the identity matrix \(\mathbf{I}\). Final results are typically insensitive to this parameter, since the filter effectively ‘forgets’ its initialization after processing a moderate number of intervals and applying backward smoothing.

  • boundState (bool) – If True, the primary state estimate for \(x_{[i]}\) is reported within stateLowerBound and stateUpperBound. Note that the internal filtering is unaffected.

  • stateLowerBound (float) – Lower bound for the state estimate.

  • stateUpperBound (float) – Upper bound for the state estimate.

  • conformalRescale (bool | None) – If True, perform replicate-heldout split-conformal calibration of uncertainty for a hypothetical new replicate at each genomic interval i. This is useful if the usual posterior uncertainty estimate is unsatisfactory due to its dependence on the assumed model. Calibration scores are formed by holding out one set of replicates (calibration replicates), fitting the model on the remaining replicates (proper-training replicates), and comparing held-out observations to the fitted predictor. Under replicate exchangeability, this routine can yield a finite-sample marginal coverage guarantee for the corresponding prediction intervals at the requested miscoverage level.

  • conformalAlpha (float | None) – Target split-conformal miscoverage level \(\alpha\) in \((0, 1)\). When conformalRescale is True, the inflation factor is chosen from the \((1-\alpha)\) empirical quantile of calibration scores so that the induced prediction intervals for a hypothetical new replicate have marginal coverage at least \(1-\alpha\).

  • conformal_numIters (int | None) – Number of replicate-heldout calibration rounds used to estimate the inflation factor. If set > 1, the routine repeats the replicate split multiple times and aggregates calibration scores across rounds to reduce Monte Carlo variability in the estimated quantile. Note, however, this aggregation across rounds violates the exact single-split reasoning for finite-sample marginal coverage.

  • conformalFinalRefit (bool | None) – If True, refit the final model after selecting the conformal inflation factor. Note that this breaks the standard split-conformal protocol, which requires that the predictor used at test time be the same “frozen” predictor used to generate calibration scores. If strict guarantees are important and you have enough replicates to allocate a meaningful calibration set, we suggest setting conformalFinalRefit=False. In all cases, the Consenrich implementation can only inflate the original posterior uncertainty, avoiding anti-conservative reporting.
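The split-conformal machinery above can be sketched as follows, under the stated assumptions (standardized calibration scores, the finite-sample quantile adjustment, and inflation-only behavior); conformal_inflation is a hypothetical helper, not the library’s API:

```python
import numpy as np

def conformal_inflation(residuals, posterior_sd, alpha=0.05):
    """Sketch: standardized calibration scores |r|/sd; the finite-sample-adjusted
    (1 - alpha) empirical quantile gives an inflation factor applied to the
    posterior uncertainty (never deflating below 1)."""
    scores = np.abs(residuals) / posterior_sd
    n = scores.size
    # finite-sample quantile level used in split conformal prediction
    q_level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    factor = np.quantile(scores, q_level)
    return max(float(factor), 1.0)  # only inflate, matching the note above

rng = np.random.default_rng(0)
r = rng.normal(0.0, 2.0, size=500)                 # stand-in held-out residuals
factor = conformal_inflation(r, posterior_sd=np.ones(500))
```

Multiplying the posterior standard deviation by this factor widens the reported intervals until roughly \(1-\alpha\) of held-out observations would have been covered.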

class consenrich.core.inputParams(bamFiles, bamFilesControl, pairedEnd)[source]

Parameters related to the input data for Consenrich.

Parameters:
  • bamFiles (List[str]) – A list of paths to distinct coordinate-sorted and indexed BAM files.

  • bamFilesControl (List[str], optional) – A list of paths to distinct coordinate-sorted and indexed control BAM files. e.g., IgG control inputs for ChIP-seq.

  • pairedEnd (Optional[bool]) – Deprecated: Paired-end/Single-end is inferred automatically from the alignment flags in input BAM files.

class consenrich.core.outputParams(convertToBigWig, roundDigits, writeUncertainty, writeMWSR, writeJackknifeSE, applyJackknife)[source]

Parameters related to output files.

Parameters:
  • convertToBigWig (bool) – If True, output bedGraph files are converted to bigWig format.

  • roundDigits (int) – Number of decimal places to round output values (bedGraph)

  • writeUncertainty (bool) – If True, write the posterior state uncertainty \(\sqrt{\widetilde{P}_{[00,i]}}\) to bedGraph.

  • writeMWSR (bool) –

    If True, write a per-interval mean weighted, squared, studentized post-fit residual (MWSR), computed using the smoothed state and its posterior variance from the final filter/smoother pass.

    Let \(r_{[j,i]} = \texttt{matrixData}_{[j,i]} - b_j - \widetilde{x}_{[i]}\) denote the post-fit residual for replicate \(j\) at interval \(i\), where \(\widetilde{x}_{[i]}\) is the consensus epigenomic signal level estimate. Let \(\widetilde{P}_{[00,i]}\) be the posterior variance of the first state variable (signal level) and let \(R_{[j,i]}\) be the observation noise variance for replicate \(j\) at interval \(i\). Then, the studentized squared residuals \(u^2_{[j,i]}\) and the mean weighted squared residuals \(\textsf{MWSR}[i]\) are recorded as:

    \[u^2_{[j,i]} = \frac{r_{[j,i]}^2 + \widetilde{P}_{[00,i]}}{R_{[j,i]}}, \qquad \textsf{MWSR}[i] = \frac{1}{m}\sum_{j=1}^{m} u^2_{[j,i]}.\]

    This is consistent with the EM routine in consenrich.cconsenrich.cblockScaleEM().

  • writeJackknifeSE (bool) – If True, write the standard error of the signal level estimates across jackknife replicates to bedGraph. This is only relevant if applyJackknife is True.

  • applyJackknife (bool) – If True, estimate replicate-level sampling variability in the signal level estimates with the jackknife.
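The MWSR definition given for writeMWSR can be reproduced directly in NumPy (an illustrative sketch; array names here are hypothetical):

```python
import numpy as np

def mwsr(matrix_data, bias, x_smooth, P00, R):
    """Mean weighted squared (studentized) post-fit residual per interval.
    matrix_data: (m, n) replicate observations; bias: (m,) replicate offsets;
    x_smooth: (n,) smoothed signal level; P00: (n,) posterior variance of the
    level; R: (m, n) observation noise variances."""
    r = matrix_data - bias[:, None] - x_smooth[None, :]   # post-fit residuals
    u2 = (r ** 2 + P00[None, :]) / R                      # studentized squares
    return u2.mean(axis=0)                                # average over replicates

m, n = 3, 4
data = np.zeros((m, n))
out = mwsr(data, np.zeros(m), np.zeros(n), np.ones(n), np.full((m, n), 2.0))
# r = 0 everywhere, so each u2 = (0 + 1)/2 = 0.5 and MWSR = 0.5 at every interval
```

Values far above 1 indicate replicate observations that disagree with the consensus more than the observation noise model predicts.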

class consenrich.core.genomeParams(genomeName, chromSizesFile, blacklistFile, sparseBedFile, chromosomes, excludeChroms, excludeForNorm)[source]

Specify assembly-specific resources and parameters.

Parameters:
  • genomeName (str) – If supplied, default resources for the assembly (sizes file, blacklist, and ‘sparse’ regions) in src/consenrich/data are used. ce10, ce11, dm6, hg19, hg38, mm10, mm39 have default resources available.

  • chromSizesFile (str) – A two-column TSV-like file with chromosome names and sizes (in base pairs).

  • blacklistFile (str, optional) – A BED file with regions to exclude.

  • sparseBedFile (str, optional) – A BED file with ‘sparse regions’ that are mutually exclusive with or devoid of the targeted signal. Used to estimate noise levels. See getMuncTrack().

  • chromosomes (List[str]) – A list of chromosome names to analyze. If None, all chromosomes in chromSizesFile are used.

  • excludeChroms (List[str]) – A list of chromosome names to exclude from analysis.

  • excludeForNorm (List[str]) – A list of chromosome names to exclude when summing up the ‘effective genome size’ during normalization. This can be useful to avoid bias from poorly assembled, highly repetitive, and/or sex-specific chromosomes (e.g., chrM, chrUn, etc.). For reference, see effective genome size.

class consenrich.core.countingParams(intervalSizeBP, backgroundBlockSizeBP, smoothSpanBP, scaleFactors, scaleFactorsControl, normMethod, fragmentLengths, fragmentLengthsControl, useTreatmentFragmentLengths, fixControl, globalWeight, asymPos, logOffset, logMult)[source]

Parameters related to counting aligned reads.

Parameters:
  • intervalSizeBP (int) – Length (bp) of each genomic interval \(i=1,\ldots,n\) comprising the larger genomic region (contig, chromosome, etc.). The default value is generally robust, but users may consider increasing this value when the expected feature size is large and/or sequencing depth is low (less than \(\approx 5\) million reads, depending on assay).

  • backgroundBlockSizeBP (int) – Length (bp) of blocks used to estimate local statistics (background, noise, etc.). If a negative value is provided (default), this value is inferred from the data using consenrich.core.getContextSize().

  • normMethod (str) – Method used to normalize read counts for sequencing depth / library size.

  • fragmentLengths (List[int], optional) – List of fragment lengths (bp) to use for extending reads from 5’ ends when counting single-end data.

  • fragmentLengthsControl (List[int], optional) – List of fragment lengths (bp) to use for extending reads from 5’ ends when counting single-end with control data.

  • useTreatmentFragmentLengths (bool, optional) – If True, use fragment lengths estimated from treatment BAM files for control BAM files, too.

  • fixControl (bool, optional) – If True, treatment samples are not upscaled, and control samples are not downscaled.

  • globalWeight (float, optional) – Relative weight assigned to the global ‘dense’ baseline when combining with local baseline estimates. Higher values increase the influence of the global baseline. For instance, globalWeight = 2 results in a weighted average where the global baseline contributes 2/3 of the final baseline estimate; whereas globalWeight = 1 results in equal weighting between global and local baselines. Users with input control samples may consider increasing this value to avoid redundancy (artificial local trends have presumably been accounted for in the control, leaving less signal to be modeled locally).

  • asymPos (float, optional) – Relative weight assigned to positive residuals to induce asymmetry in reweighting. Used during IRLS for the local baseline computation. Using smaller values near 0.0 will downweight peaks more and reduce the risk of removing true signal. Typical range is (0, 0.75].

  • logOffset (float, optional) – A small constant added to normalized read counts before log-transforming (pseudocount). For example, \(\log(x + 1)\) for logOffset = 1. Default is 1.0.

  • logMult (float, optional) – Multiplicative factor applied to log-scaled and normalized counts. For example, setting logMult = \(1/\log(2)\) yields log2-scaled counts after transformation, and setting logMult = 1.0 yields natural-log-scaled counts.

  • smoothSpanBP (int | None)

  • scaleFactors (List[float] | None)

  • scaleFactorsControl (List[float] | None)

Seealso:

consenrich.cconsenrich.cTransform()
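The logOffset/logMult transform described above amounts to \(\textsf{logMult} \cdot \log(x + \textsf{logOffset})\); a minimal NumPy sketch (not the Cython implementation in cTransform):

```python
import numpy as np

def log_transform(x, log_offset=1.0, log_mult=1.0):
    """Apply log_mult * log(x + log_offset) to normalized counts."""
    return log_mult * np.log(np.asarray(x, dtype=float) + log_offset)

x = np.array([0.0, 1.0, 3.0, 7.0])
natural = log_transform(x)                         # natural-log scale
log2 = log_transform(x, log_mult=1.0 / np.log(2))  # log2 scale: [0, 1, 2, 3]
```

The pseudocount keeps zero-count intervals finite, and logMult only changes the logarithm’s base, not the relative ordering of values.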

Treatment vs. Control Fragment Lengths in Single-End Data

For single-end data, cross-correlation-based estimates for fragment length in control inputs can be biased due to a comparative lack of structure in strand-specific coverage tracks.

This can create artifacts during counting, so it is common to use the estimated treatment fragment length for both treatment and control samples. The argument countingParams.useTreatmentFragmentLengths enables this behavior.

class consenrich.core.samParams(samThreads, samFlagExclude, oneReadPerBin, chunkSize, offsetStr='0,0', maxInsertSize=1000, pairedEndMode=0, inferFragmentLength=0, countEndsOnly=False, minMappingQuality=0, minTemplateLength=-1, fragmentLengths=None)[source]

Parameters related to reading BAM files

Parameters:
  • samThreads (int) – The number of threads to use for reading BAM files.

  • samFlagExclude (int) – Bitwise SAM flag; reads with any of these flag bits set are excluded.

  • oneReadPerBin (int) – If 1, only the interval with the greatest read overlap is incremented.

  • chunkSize (int) – Maximum number of intervals’ data to hold in memory before flushing to disk.

  • offsetStr (str) – A string of two comma-separated integers – first for the 5’ shift on forward strand, second for the 5’ shift on reverse strand.

  • maxInsertSize (int) – Maximum fragment length/insert size to consider when estimating fragment length.

  • inferFragmentLength (int) – Intended for single-end data: if > 0, the maximum-correlation lag (averaged) between strand-specific read tracks is taken as the fragment length estimate and used to extend reads from their 5’ ends. Ignored if the data are paired-end, countEndsOnly is True, or fragmentLengths is provided. This can be important when targeting broader marks (e.g., ChIP-seq H3K27me3).

  • countEndsOnly (Optional[bool]) – If True, only the 5’ ends of reads contribute to counting.

  • minMappingQuality (Optional[int]) – Minimum mapping quality (MAPQ) for reads to be counted.

  • fragmentLengths (List[int] | None) – If supplied, a list of estimated fragment lengths for each BAM file. These values are used to extend reads. Note, these values will override TLEN attributes in paired-end data.

  • pairedEndMode (int | None)

  • minTemplateLength (int | None)

Tip

For an overview of SAM flags, see https://broadinstitute.github.io/picard/explain-flags.html
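For example, reads can be screened against samFlagExclude with a bitwise AND; the common default 3844 excludes unmapped, secondary, QC-fail, duplicate, and supplementary alignments:

```python
# 3844 = 4 + 256 + 512 + 1024 + 2048
# (unmapped, secondary, QC fail, PCR/optical duplicate, supplementary)
SAM_FLAG_EXCLUDE = 3844

def is_excluded(flag, exclude=SAM_FLAG_EXCLUDE):
    """True if the read's SAM flag has any excluded bit set."""
    return (flag & exclude) != 0

kept = not is_excluded(99)       # 99 = paired, proper pair, mate reverse, first in pair
dropped = is_excluded(1024)      # PCR/optical duplicate
```

This is the standard SAM FLAG convention; the sketch just makes the bit arithmetic explicit.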

class consenrich.core.matchingParams(templateNames, cascadeLevels, iters, alpha, useScalingFunction, minMatchLengthBP, maxNumMatches, minSignalAtMaxima, merge, mergeGapBP, excludeRegionsBedFile, penalizeBy, randSeed, eps, massQuantileCutoff, methodFDR)[source]

Parameters related to the matching algorithm.

See `matching` for an overview of the approach.

Parameters:
  • templateNames (List[str]) – A list of str values – each entry references a mother wavelet (or its corresponding scaling function). e.g., [haar, db2]

  • cascadeLevels (List[int]) – Number of cascade iterations, or ‘levels’, used to define wavelet-based templates. Must have the same length as templateNames, with each entry aligned to the corresponding template. e.g., given templateNames [haar, db2], [2,2] would use 2 cascade levels for both templates.

  • iters (int) – Number of randomly drawn contiguous blocks used to build an empirical null for significance evaluation. See cconsenrich.csampleBlockStats().

  • alpha (float) – Primary significance threshold on detected matches. Specifically, the minimum corrected empirical p-value approximated from randomly sampled blocks in the response sequence.

  • minMatchLengthBP (Optional[int]) – Within a window of minMatchLengthBP length (bp), relative maxima in the signal-template convolution \(\mathcal{R}_{[\ast]}\) must be greater in value than others to qualify as matches. If set to a value less than 1, the minimum length is determined via consenrich.matching.autoMinLengthIntervals() (default behavior).

  • minSignalAtMaxima (Optional[str | float]) – Secondary/optional threshold coupled with alpha. Requires the signal value, \(\widetilde{x}_{[i^*]}\), at relative maxima in the response sequence, \(\mathcal{R}_{[i^*]}\), to be greater than this threshold. If a str value is provided, looks for ‘q:quantileValue’, e.g., ‘q:0.90’. The threshold is then set to the corresponding quantile of the non-zero signal estimates in the distribution of transformed values.

  • useScalingFunction (bool) – If True, use (only) the scaling function to build the matching template. If False, use (only) the wavelet function.

  • eps (float) – Tolerance parameter for relative maxima detection in the response sequence. Set to zero to enforce strict inequalities when identifying discrete relative maxima.

  • massQuantileCutoff (float) – Quantile cutoff for filtering initial (unmerged) matches based on their ‘mass’ ((avgSignal*length)/intervalLength). To disable, set to a value < 0.

  • maxNumMatches (int | None)

  • merge (bool | None)

  • mergeGapBP (int | None)

  • excludeRegionsBedFile (str | None)

  • penalizeBy (str | None)

  • randSeed (int | None)

  • methodFDR (str | None)

Seealso:

cconsenrich.csampleBlockStats(), `matching`, outputParams.
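To illustrate the matching idea (a simplified sketch: a block-shaped scaling-function template and strict relative maxima of the response; the library’s templates come from wavelet cascade iterations, and matches are additionally significance-tested):

```python
import numpy as np

def match_template(signal, template):
    """Cross-correlate signal with a template and locate strict relative
    maxima of the response (simplified; no significance testing here)."""
    response = np.correlate(signal, template, mode="same")
    interior = response[1:-1]
    is_max = (interior > response[:-2]) & (interior > response[2:])
    return response, np.flatnonzero(is_max) + 1

sig = np.zeros(20)
sig[8:12] = 1.0                  # a block-shaped feature
haar_scaling = np.ones(4) / 2.0  # constant-block template (Haar scaling function)
resp, maxima = match_template(sig, haar_scaling)
```

The response peaks where the template best overlaps the feature; thresholds like minSignalAtMaxima and the empirical null then decide which maxima qualify as matches.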

consenrich.core.readBamSegments(bamFiles, chromosome, start, end, intervalSizeBP, readLengths, scaleFactors, oneReadPerBin, samThreads, samFlagExclude, offsetStr='0,0', maxInsertSize=1000, pairedEndMode=0, inferFragmentLength=0, countEndsOnly=False, minMappingQuality=0, minTemplateLength=-1, fragmentLengths=None)[source]

Calculate coverage tracks for each BAM file.

Parameters:
  • bamFiles (List[str]) – See inputParams.

  • chromosome (str) – Chromosome to read.

  • start (int) – Start position of the genomic segment.

  • end (int) – End position of the genomic segment.

  • readLengths (List[int]) – List of read lengths for each BAM file.

  • scaleFactors (List[float]) – List of scale factors for each BAM file.

  • intervalSizeBP (int) – See countingParams.

  • oneReadPerBin (int) – See samParams.

  • samThreads (int) – See samParams.

  • samFlagExclude (int) – See samParams.

  • offsetStr (str) – See samParams.

  • maxInsertSize (int) – See samParams.

  • pairedEndMode (int) – See samParams.

  • inferFragmentLength (int) – See samParams.

  • minMappingQuality (int) – See samParams.

  • minTemplateLength (Optional[int]) – See samParams.

  • fragmentLengths (Optional[List[int]]) – If supplied, a list of estimated fragment lengths for each BAM file. These values are used to extend reads. Note, these values will override TLEN attributes in paired-end data.

  • countEndsOnly (bool | None)

Return type:

ndarray[tuple[Any, …], dtype[float32]]

consenrich.core.getMuncTrack(chromosome, intervals, values, intervalSizeBP, samplingBlockSizeBP=None, samplingIters=25000, randomSeed=42, excludeMask=None, useEMA=True, excludeFitCoefs=None, binQuantileCutoff=0.5, EB_minLin=1.0, EB_use=True, EB_setNu0=None, EB_setNuL=None, EB_localQuantile=0.0, verbose=False, eps=0.01)[source]

Approximate initial sample-specific (M)easurement (unc)ertainty tracks

For an individual experimental sample (replicate), quantify positional data uncertainty over genomic intervals \(i=1,2,\ldots n\) spanning chromosome. These tracks (per-sample) comprise the matrixMunc input to runConsenrich(), \(\mathbf{R}[:,:] \in \mathbb{R}^{m \times n}\).

Variance is modeled as a function of the absolute mean signal level. For EB_use=True, local variance estimates are shrunk toward a signal-level-dependent global variance fit.

Parameters:
  • chromosome (str) – chromosome/contig name

  • values (np.ndarray) – normalized/transformed signal measurements over genomic intervals (e.g., consenrich.cconsenrich.cTransform() output)

  • intervals (np.ndarray) – genomic interval positions (start positions)

  • intervalSizeBP (int)

  • samplingBlockSizeBP (int | None)

  • samplingIters (int)

  • randomSeed (int)

  • excludeMask (ndarray | None)

  • useEMA (bool | None)

  • excludeFitCoefs (Tuple[int, ...] | None)

  • binQuantileCutoff (float)

  • EB_minLin (float)

  • EB_use (bool)

  • EB_setNu0 (int | None)

  • EB_setNuL (int | None)

  • EB_localQuantile (float)

  • verbose (bool)

  • eps (float)

Return type:

tuple[ndarray[tuple[Any, …], dtype[float32]], float]

See consenrich.core.observationParams for other parameters.
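The EB shrinkage behind getMuncTrack can be sketched as a degrees-of-freedom-weighted average of local and global variance estimates (an assumption-laden illustration of the \(\nu_0\)/\(\nu_L\) weighting, not the exact internal formula):

```python
import numpy as np

def eb_shrink(local_var, global_var, nu_local, nu0):
    """Shrink local variance estimates toward a global trend with prior
    strength nu0 (illustrative df-weighted average)."""
    return (nu0 * global_var + nu_local * local_var) / (nu0 + nu_local)

local = np.array([4.0, 0.5, 2.0])   # noisy local estimates
trend = np.array([1.0, 1.0, 1.0])   # global absMean-variance trend at these intervals
shrunk = eb_shrink(local, trend, nu_local=10.0, nu0=10.0)
# equal df weights -> midpoint of local and global: [2.5, 0.75, 1.5]
```

Larger \(\nu_0\) (strong prior) pulls estimates toward the trend; larger \(\nu_L\) (well-determined local estimates) preserves local variation.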

consenrich.core.fitVarianceFunction(jointlySortedMeans, jointlySortedVariances, eps=0.01, binQuantileCutoff=0.5, EB_minLin=1.0)[source]
Parameters:
  • jointlySortedMeans (ndarray)

  • jointlySortedVariances (ndarray)

  • eps (float)

  • binQuantileCutoff (float)

  • EB_minLin (float)

Return type:

ndarray

consenrich.core.EB_computePriorStrength(localModelVariances, globalModelVariances, Nu_local)[source]

Compute \(\nu_0\) to determine ‘prior strength’

The prior model strength is determined by ‘excess’ dispersion beyond sampling noise at the local level.

Parameters:
  • localModelVariances (np.ndarray) – Local model variance estimates (e.g., rolling AR(1) innovation variances consenrich.cconsenrich.crolling_AR1_IVar()).

  • globalModelVariances (np.ndarray) – Global model variance estimates from the absMean-variance trend fit (consenrich.core.fitVarianceFunction()).

  • Nu_local (float) – Effective sample size/degrees of freedom for the local model.

Returns:

Estimated prior strength \(\nu_{0}\).

Return type:

float

Seealso:

consenrich.core.getMuncTrack(), consenrich.core.fitVarianceFunction()

consenrich.core.runConsenrich(matrixData, matrixMunc, deltaF, minQ, maxQ, offDiagQ, stateInit, stateCovarInit, boundState, stateLowerBound, stateUpperBound, blockLenIntervals, projectStateDuringFiltering=False, pad=0.0001, disableCalibration=False, EM_maxIters=50, EM_rtol=0.0001, EM_scaleToMedian=False, EM_tNu=10.0, EM_alphaEMA=0.1, EM_scaleLOW=0.1, EM_scaleHIGH=10.0, EM_useObsBlockScale=True, EM_useProcBlockScale=True, EM_useObsPrecReweight=True, EM_useProcPrecReweight=True, EM_useReplicateBias=True, EM_useReplicateScale=True, EM_repBiasShrink=0.0, EM_repScaleLOW=0.25, EM_repScaleHIGH=4.0, returnScales=True, returnReplicateOffsets=False, applyJackknife=False, jackknifeEM_maxIters=5, jackknifeEM_rtol=0.01, useWhiteAccel=False, useDiscreteConstAccel=False, autoDeltaF=True, autoDeltaF_low=0.0001, autoDeltaF_high=2.0, autoDeltaF_init=0.01, autoDeltaF_maxEvals=25, conformalRescale=False, conformalAlpha=0.05, conformal_numIters=3, conformalFinalRefit=True)[source]

Run Consenrich over a contiguous genomic region (chromosome).

We estimate a position-dependent consensus epigenomic signal \(\widetilde{x}_{[i,0]}\) from multiple replicate tracks’ HTS data, and provide positional uncertainty quantification by propagating variance in a linear filter-smoother setup.

This wrapper ties together several fundamental routines written in Cython:

  1. consenrich.cconsenrich.cforwardPass(): Forward filter (predict, update)

  2. consenrich.cconsenrich.cbackwardPass(): Backward fixed-interval smoother

  3. consenrich.cconsenrich.cblockScaleEM(): Joint optimization of process and observation model noise scales with robust studentized reweighting

Seealso:

consenrich.core.getMuncTrack(), consenrich.cconsenrich.cTransform(), consenrich.cconsenrich.cforwardPass(), consenrich.cconsenrich.cbackwardPass(), consenrich.cconsenrich.cblockScaleEM()

Parameters:
  • matrixData (ndarray)

  • matrixMunc (ndarray)

  • deltaF (float)

  • minQ (float)

  • maxQ (float)

  • offDiagQ (float)

  • stateInit (float)

  • stateCovarInit (float)

  • boundState (bool)

  • stateLowerBound (float)

  • stateUpperBound (float)

  • blockLenIntervals (int)

  • projectStateDuringFiltering (bool)

  • pad (float)

  • disableCalibration (bool)

  • EM_maxIters (int)

  • EM_rtol (float)

  • EM_scaleToMedian (bool)

  • EM_tNu (float)

  • EM_alphaEMA (float)

  • EM_scaleLOW (float)

  • EM_scaleHIGH (float)

  • EM_useObsBlockScale (bool)

  • EM_useProcBlockScale (bool)

  • EM_useObsPrecReweight (bool)

  • EM_useProcPrecReweight (bool)

  • EM_useReplicateBias (bool)

  • EM_useReplicateScale (bool)

  • EM_repBiasShrink (float)

  • EM_repScaleLOW (float)

  • EM_repScaleHIGH (float)

  • returnScales (bool)

  • returnReplicateOffsets (bool)

  • applyJackknife (bool)

  • jackknifeEM_maxIters (int)

  • jackknifeEM_rtol (float)

  • useWhiteAccel (bool)

  • useDiscreteConstAccel (bool)

  • autoDeltaF (bool)

  • autoDeltaF_low (float)

  • autoDeltaF_high (float)

  • autoDeltaF_init (float)

  • autoDeltaF_maxEvals (int)

  • conformalRescale (bool)

  • conformalAlpha (float)

  • conformal_numIters (int)

  • conformalFinalRefit (bool)

consenrich.core.constructMatrixF(deltaF)[source]

Build the state transition matrix for the process model

Parameters:

deltaF (float) – See processParams.

Returns:

The state transition matrix \(\mathbf{F}\)

Return type:

npt.NDArray[np.float32]

Seealso:

processParams
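Assuming the standard ‘level + slope’ (constant-velocity) form implied by processParams, the transition matrix can be sketched as:

```python
import numpy as np

def construct_F(delta_f):
    """Constant-velocity transition: level += delta_f * slope; slope unchanged."""
    return np.array([[1.0, delta_f],
                     [0.0, 1.0]], dtype=np.float32)

F = construct_F(0.5)
x = np.array([2.0, 1.0], dtype=np.float32)  # state: (level, slope)
x_next = F @ x                              # level 2 + 0.5*1 = 2.5; slope 1.0
```

Here deltaF plays the role of the integration step size between \(\dot{x}_{[i]}\) and \(x_{[i]}\) described under processParams.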

consenrich.core.constructMatrixQ(minDiagQ, offDiagQ=0.0, Q00=None, Q01=None, Q10=None, Q11=None, useIdentity=-1.0, tol=1e-08, useWhiteAccel=False, useDiscreteConstAccel=False, deltaF=None)[source]

Build the (base) process noise covariance matrix \(\mathbf{Q}\).

Parameters:
  • minDiagQ (float) – Minimum value for diagonal entries of \(\mathbf{Q}\).

  • offDiagQ (float) – Value for off-diagonal entries of \(\mathbf{Q}\).

  • Q00 (Optional[float]) – Optional value for entry (0,0) of \(\mathbf{Q}\).

  • Q01 (Optional[float]) – Optional value for entry (0,1) of \(\mathbf{Q}\).

  • Q10 (Optional[float]) – Optional value for entry (1,0) of \(\mathbf{Q}\).

  • Q11 (Optional[float]) – Optional value for entry (1,1) of \(\mathbf{Q}\).

  • useIdentity (float) – If > 0.0, use a scaled identity matrix for \(\mathbf{Q}\). Overrides other parameters.

  • tol (float) – Tolerance for positive definiteness check.

  • useWhiteAccel (bool)

  • useDiscreteConstAccel (bool)

  • deltaF (float | None)

Returns:

The process noise covariance matrix \(\mathbf{Q}\).

Return type:

npt.NDArray[np.float32]

Seealso:

processParams
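A minimal sketch of building a base \(\mathbf{Q}\) with a positive-definiteness check (illustrative only; the actual constructMatrixQ supports more options, e.g., per-entry overrides and acceleration models):

```python
import numpy as np

def construct_Q(min_diag_q, off_diag_q=0.0, tol=1e-8):
    """Base 2x2 process noise covariance with a positive-definiteness check."""
    Q = np.array([[min_diag_q, off_diag_q],
                  [off_diag_q, min_diag_q]], dtype=np.float64)
    # symmetric matrix: check the smallest eigenvalue exceeds the tolerance
    if np.min(np.linalg.eigvalsh(Q)) <= tol:
        raise ValueError("Q is not positive definite; reduce offDiagQ")
    return Q.astype(np.float32)

Q = construct_Q(0.1, 0.01)  # eigenvalues 0.09 and 0.11 -> valid covariance
```

For a 2x2 matrix of this form, positive definiteness requires |offDiagQ| < minDiagQ, which the eigenvalue check enforces.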

consenrich.core.constructMatrixH(m, coefficients=None)[source]

Build the observation model matrix \(\mathbf{H}\).

Parameters:
  • m (int) – Number of observations.

  • coefficients (Optional[np.ndarray]) – Optional coefficients for the observation model, which can be used to weight the observations manually.

Returns:

The observation model matrix \(\mathbf{H}\).

Return type:

npt.NDArray[np.float32]

Seealso:

observationParams, inputParams

consenrich.detrorm

consenrich.detrorm.getScaleFactor1x(bamFile, effectiveGenomeSize, readLength, excludeChroms, chromSizesFile, samThreads)[source]

Generic normalization factor based on effective genome size and number of mapped reads in non-excluded chromosomes.

Parameters:
  • bamFile (str) – See consenrich.core.inputParams.

  • effectiveGenomeSize (int) – Effective genome size (bp) for the sample.

  • readLength (int) – Read or fragment length (bp).

  • excludeChroms (List[str]) – Chromosomes to exclude when counting mapped reads.

  • chromSizesFile (str) – Path to the chromosome sizes file.

  • samThreads (int) – See consenrich.core.samParams.

Returns:

Scale factor for 1x normalization.

Return type:

float
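The 1x normalization arithmetic presumably follows the usual effective-genome-size coverage convention (a hedged sketch, not necessarily the exact internal computation): scaling counts by effectiveGenomeSize / (mappedReads × readLength) makes average coverage over the effective genome approximately 1x.

```python
def scale_factor_1x(effective_genome_size, mapped_reads, read_length):
    """Scale counts so average coverage over the effective genome is ~1x."""
    return effective_genome_size / (mapped_reads * read_length)

# e.g., 20M mapped 100 bp reads over a 2e9 bp effective genome:
s = scale_factor_1x(2e9, 20e6, 100)  # raw coverage is already 1x -> factor 1.0
```

Deeper sequencing yields a smaller factor (coverage is scaled down), shallower sequencing a larger one.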

consenrich.detrorm.getScaleFactorPerMillion(bamFile, excludeChroms, intervalSizeBP)[source]

Generic normalization factor based on number of mapped reads in non-excluded chromosomes.

Parameters:
  • bamFile (str) – See consenrich.core.inputParams.

  • excludeChroms (List[str]) – List of chromosomes to exclude when counting mapped reads.

  • intervalSizeBP (int)

Returns:

Scale factor accounting for number of mapped reads (only).

Return type:

float

consenrich.detrorm.getPairScaleFactors(bamFileA, bamFileB, effectiveGenomeSizeA, effectiveGenomeSizeB, readLengthA, readLengthB, excludeChroms, chromSizesFile, samThreads, intervalSizeBP, normMethod='EGS', fixControl=True)[source]

Scale treatment:control data based on effective genome size or reads per million.

Parameters:
  • bamFileA (str) – Alignment file for the ‘treatment’ sample.

  • bamFileB (str) – Alignment file for the ‘control’ sample (e.g., input).

  • effectiveGenomeSizeA (int) – Effective genome size for the treatment sample.

  • effectiveGenomeSizeB (int) – Effective genome size for the control sample.

  • readLengthA (int) – Read or fragment length for the treatment sample.

  • readLengthB (int) – Read or fragment length for the control sample.

  • excludeChroms (List[str]) – List of chromosomes to exclude from the analysis.

  • chromSizesFile (str) – Path to the chromosome sizes file.

  • samThreads (int) – See consenrich.core.samParams.

  • intervalSizeBP (int) – Step size for coverage calculation.

  • normMethod (str) – Normalization method to use (“EGS” or “RPKM”).

  • fixControl (bool) – See consenrich.core.countingParams.

Returns:

Tuple of scale factors for treatment and control samples.

Return type:

Tuple[float, float]

consenrich.constants

consenrich.constants.getEffectiveGenomeSize(genome, readLength)[source]

Get the effective genome size for a given genome and read length.

Parameters:
  • genome (str) – Name of the genome. See consenrich.constants.resolveGenomeName() and consenrich.core.genomeParams.

  • readLength (int) – Length of the reads. See consenrich.core.getReadLength().

Raises:

ValueError – If the genome is not recognized or if the read length is not available for the genome.

Returns:

Effective genome size in base pairs.

Return type:

int

consenrich.constants.getGenomeResourceFile(genome, fileType, dir_='data')[source]

Get the path to a genome resource file.

Parameters:
  • genome (str) – the genome assembly. See consenrich.constants.resolveGenomeName() and consenrich.core.genomeParams.

  • fileType (str) – One of ‘sizes’, ‘blacklist’, ‘sparse’.

  • dir_ (str)

Returns:

Path to the resource file.

Return type:

str

Raises:
  • ValueError – If not a sizes, blacklist, or sparse file.

  • FileNotFoundError – If the resource file does not exist.

Cython functions: consenrich.cconsenrich

Several computationally intensive tasks are implemented in Cython:

consenrich.cconsenrich.creadBamSegment(bamFile, chromosome, start, end, intervalSizeBP, readLength, oneReadPerBin, samThreads, samFlagExclude, shiftForwardStrand53=0, shiftReverseStrand53=0, extendBP=0, maxInsertSize=1000, pairedEndMode=0, inferFragmentLength=0, minMappingQuality=0, minTemplateLength=-1, weightByOverlap=1, ignoreTLEN=1)

Count reads in a BAM file for a given chromosome.

consenrich.cconsenrich.csampleBlockStats(intervals, values, expectedBlockSize, iters, randSeed, excludeIdxMask, eps=0.0)

Sample contiguous blocks in the response sequence (xCorr), record maxima, and repeat.

Used to build an empirical null distribution and assess the significance of response outputs. Block sizes are drawn from a truncated geometric distribution: the expected block size is preserved, while the variability accounts for sampling across different phases of the response sequence.

Parameters:
  • values (cnp.ndarray[cnp.float64_t, ndim=1]) – The response sequence to sample from.

  • expectedBlockSize (int) – The expected size (geometric) of the blocks to sample.

  • iters (int) – The number of blocks to sample.

  • randSeed (int) – Random seed for reproducibility.

Returns:

An array of sampled block maxima.

Return type:

cnp.ndarray[cnp.float64_t, ndim=1]

Seealso:

consenrich.matching.matchWavelet()
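The block-maxima sampling can be sketched in plain NumPy. This is an illustrative stand-in for csampleBlockStats (the helper name sampleBlockMaxima and its defaults are hypothetical), ignoring the exclusion mask and truncation details:

```python
import numpy as np

def sampleBlockMaxima(values, expectedBlockSize=50, iters=1000, seed=42):
    """Sample contiguous blocks (geometric lengths, mean ~expectedBlockSize)
    from a response sequence and record each block's maximum."""
    rng = np.random.default_rng(seed)
    n = len(values)
    maxima = np.empty(iters)
    for k in range(iters):
        # Geometric block length, truncated to stay within the sequence
        size = min(max(1, rng.geometric(1.0 / expectedBlockSize)), n)
        start = rng.integers(0, n - size + 1)
        maxima[k] = values[start:start + size].max()
    return maxima

rng = np.random.default_rng(0)
response = rng.normal(size=100_000)
nullMaxima = sampleBlockMaxima(response, expectedBlockSize=50, iters=2000)
# Empirical p-value for an observed response maximum (upper-tail ECDF,
# lower-bounded by 1/(B+1) as described in the matching documentation):
pval = max((nullMaxima >= 4.0).mean(), 1.0 / (len(nullMaxima) + 1))
```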

consenrich.cconsenrich.cgetFragmentLength(bamFile, samThreads=0, samFlagExclude=3844, maxInsertSize=1000, iters=1000, blockSize=5000, fallBack=147, rollingChunkSize=250, lagStep=5, earlyExit=250, randSeed=42)
consenrich.cconsenrich.cTransform(x, blockLength, disableLocalBackground=False, w_local=1.0, w_global=2.0, verbose=False, rseed=0, useIRLS=True, asymPos=0.4, logOffset=1.0, logMult=1.0)
consenrich.cconsenrich.cDenseMean(x, blockLenTarget=250, itersEM=1000, seed=0, verbose=False)

Use a Gaussian mixture to estimate a ‘dense’ baseline to subtract from each replicate’s transformed count track.

Note that this step is potentially redundant when replicate-level bias terms are used in the observation model.

consenrich.cconsenrich.clocalBaseline(x, blockSize=101, useIRLS=True, asymPos=0.4, noiseMult=1.0, cauchyScale=3.0, maxIters=25, minWeight=1e-06, tol=0.001)

Estimate a local baseline for x as a smooth lower envelope via IRLS.

Compute a locally smooth baseline \(\hat{b}\) for an input signal \(y\), using a second-order penalized smoother (Whittaker) with asymmetric iteratively reweighted least squares (IRLS) to reduce influence from peaks.

Parameters:
  • x (np.ndarray) – Signal measurements over fixed-length genomic intervals

  • asymPos (float) – Relative weight assigned to positive residuals to induce asymmetry in reweighting. Used during IRLS for the local baseline computation. Smaller values will downweight peaks more and pose less risk of removing true signal. Typical range is (0, 0.75].

  • cauchyScale (float) – Controls how quickly weights decay with normalized residual magnitude. Smaller values downweight outliers strongly.

Returns:

Baseline estimate as a float32 NumPy array of shape (n,).

Return type:

numpy.ndarray
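The asymmetric IRLS idea can be sketched with a dense second-order Whittaker smoother. This is a minimal illustration of the approach described above, not the cconsenrich implementation; the helper name and the dense linear solve are assumptions (a production routine would exploit the banded structure):

```python
import numpy as np

def localBaseline(y, lam=100.0, asymPos=0.4, maxIters=25, tol=1e-3):
    """Second-order Whittaker smoother with asymmetric IRLS weights:
    positive residuals (y above the baseline, i.e., peaks) get weight
    asymPos < 1, negative residuals get weight 1, so the fit hugs the
    lower envelope."""
    n = len(y)
    D = np.diff(np.eye(n), n=2, axis=0)         # second-difference operator
    P = lam * D.T @ D                           # smoothness penalty
    w = np.ones(n)
    b = y.copy()
    for _ in range(maxIters):
        W = np.diag(w)
        bNew = np.linalg.solve(W + P, w * y)    # weighted penalized LS
        wNew = np.where(y > bNew, asymPos, 1.0) # downweight peaks
        if np.abs(bNew - b).max() < tol:
            b = bNew
            break
        b, w = bNew, wNew
    return b

x = np.linspace(0, 1, 200)
y = 0.5 + 0.1 * x + 5.0 * np.exp(-((x - 0.5) ** 2) / 0.001)  # flat + peak
base = localBaseline(y)
```

The baseline tracks the flat background closely while staying well below the narrow peak, which is the desired lower-envelope behavior.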

consenrich.cconsenrich.cPAVA(x, postWeight)

PAVA for isotonic regression

This code follows the notation and algorithm of Busing (2022; JSS, DOI: 10.18637/jss.v102.c01).

From Busing:

> Observe that the violation 8 = x3 > x4 = 2 is solved by combining two values, 8 and 2, resulting in a (new) block value of 5, i.e., (8 + 2)/2 = 5. Instead of immediately turning around and start solving down block violation, we may first look ahead for the next value in the sequence, k-up, for if this element is smaller than or equal to 5, the next value can immediately be pooled into the current block, i.e., (8 + 2 + 2)/3 = 4. Looking ahead can be continued until the next element is larger than the current block value or if we reach the end of the sequence.

Parameters:
  • x (cnp.ndarray, (either f32 or f64)) – 1D array to be fitted as nondecreasing

  • postWeight (cnp.ndarray) – 1D array of weights corresponding to each observed value. These are the number of ‘observations’ associated to each ‘unique’ value in x. Intuition: more weight to values with more observations.

Returns:

PAVA-fitted values

Return type:

cnp.ndarray, (either f32 or f64)
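A plain-Python sketch of weighted PAVA (without Busing's look-ahead refinement) may help fix ideas; the helper name pava is hypothetical:

```python
import numpy as np

def pava(x, w):
    """Weighted pool-adjacent-violators: return the nondecreasing fit
    minimizing sum_i w_i * (x_i - fit_i)^2."""
    # Each block stores [weighted mean, total weight, element count]
    blocks = []
    for xi, wi in zip(map(float, x), map(float, w)):
        blocks.append([xi, wi, 1])
        # Pool backward while the last block violates monotonicity
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            wTot = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / wTot, wTot, c1 + c2])
    out = []
    for m, _, c in blocks:
        out.extend([m] * c)
    return np.array(out)

# Busing's example: 8 and 2 pool to 5; with the following 2, to 4
fit = pava([1, 3, 8, 2, 2, 9], np.ones(6))
```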

consenrich.cconsenrich.cforwardPass(matrixData, matrixPluginMuncInit, matrixF, matrixQ0, intervalToBlockMap, rScale, qScale, blockCount, stateInit, stateCovarInit, pad=0.0001, projectStateDuringFiltering=False, stateLowerBound=0.0, stateUpperBound=0.0, chunkSize=1000000, stateForward=None, stateCovarForward=None, pNoiseForward=None, vectorD=None, progressBar=None, progressIter=25000, returnNLL=False, storeNLLInD=False, lambdaExp=None, processPrecExp=None, replicateBias=None, replicateScale=None, EM_useObsBlockScale=True, EM_useProcBlockScale=True, EM_useObsPrecReweight=True, EM_useProcPrecReweight=True, EM_useReplicateBias=False, EM_useReplicateScale=False)

Run the forward pass (filter) for state estimation

See consenrich.cconsenrich.cblockScaleEM(), where this routine is applied within the filter, smooth, update loop.

Seealso:

consenrich.cconsenrich.cbackwardPass(), consenrich.cconsenrich.cblockScaleEM(), consenrich.core.runConsenrich()

consenrich.cconsenrich.cbackwardPass(matrixData, matrixF, stateForward, stateCovarForward, pNoiseForward, chunkSize=1000000, stateSmoothed=None, stateCovarSmoothed=None, lagCovSmoothed=None, postFitResiduals=None, replicateBias=None, progressBar=None, progressIter=10000, EM_useReplicateBias=False)

Run the backward pass (smoother)

This function executes the smoothing phase of Consenrich’s forward–backward state estimation. It operates on the forward-filtered outputs (those returned by consenrich.cconsenrich.cforwardPass()).

That is, given the forward-pass, filtered estimates over genomic intervals \(i = 1, \dots, n\),

\[\mathbf{x}_{[i|i]}, \qquad \mathbf{P}_{[i|i]}, \qquad \mathbf{Q}_{[i]},\]

this routine computes the backward-smoothed state estimates \(\widetilde{\mathbf{x}}_{[i]}\) and the backward-smoothed covariances \(\widetilde{\mathbf{P}}_{[i]}\).

Seealso:

consenrich.cconsenrich.cforwardPass(), consenrich.cconsenrich.cblockScaleEM(), consenrich.core.runConsenrich()
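A minimal NumPy sketch of this filter/smoother pairing for the two-state ‘level + slope’ model with \(\mathbf{H} = [1, 0]\) (scalar observations; no EM scaling, chunking, or state projection; all names are illustrative):

```python
import numpy as np

def forwardBackward(z, R, F, Q, x0, P0):
    """Kalman forward pass followed by an RTS backward smoother."""
    n = len(z)
    H = np.array([1.0, 0.0])
    xf = np.zeros((n, 2)); Pf = np.zeros((n, 2, 2))   # filtered moments
    xp = np.zeros((n, 2)); Pp = np.zeros((n, 2, 2))   # predicted moments
    x, P = x0, P0
    for i in range(n):                       # forward pass (filter)
        x, P = F @ x, F @ P @ F.T + Q        # predict
        xp[i], Pp[i] = x, P
        S = H @ P @ H + R[i]                 # innovation variance (scalar)
        K = P @ H / S                        # Kalman gain
        x = x + K * (z[i] - H @ x)           # update
        P = P - np.outer(K, H @ P)
        xf[i], Pf[i] = x, P
    xs, Ps = xf.copy(), Pf.copy()
    for i in range(n - 2, -1, -1):           # backward pass (RTS smoother)
        G = Pf[i] @ F.T @ np.linalg.inv(Pp[i + 1])
        xs[i] = xf[i] + G @ (xs[i + 1] - xp[i + 1])
        Ps[i] = Pf[i] + G @ (Ps[i + 1] - Pp[i + 1]) @ G.T
    return xs, Ps

F = np.array([[1.0, 1.0], [0.0, 1.0]])       # level + slope transition
Q = np.array([[0.01, 0.001], [0.001, 0.01]])
rng = np.random.default_rng(1)
z = np.sin(np.linspace(0, 3, 300)) + 0.3 * rng.normal(size=300)
xs, Ps = forwardBackward(z, np.full(300, 0.09), F, Q, np.zeros(2), np.eye(2))
```

Smoothing uses information from both directions, so the smoothed level estimates are less noisy than the raw observations.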

consenrich.cconsenrich.cblockScaleEM(matrixData, matrixPluginMuncInit, matrixF, matrixQ0, intervalToBlockMap, blockCount, stateInit, stateCovarInit, EM_maxIters=50, EM_rtol=0.0001, pad=0.0001, EM_scaleLOW=0.1, EM_scaleHIGH=10.0, EM_alphaEMA=0.1, EM_alphaEMA_Q=1.0, EM_scaleToMedian=False, EM_tNu=10.0, EM_useObsBlockScale=True, EM_useProcBlockScale=True, EM_useObsPrecReweight=True, EM_useProcPrecReweight=True, EM_useReplicateBias=True, EM_useReplicateScale=True, EM_repBiasShrink=0.0, EM_repScaleLOW=0.25, EM_repScaleHIGH=4.0, t_innerIters=3, returnIntermediates=False)

Run the Consenrich filter-smoother estimation loop with iteratively updated observation and process noise [co]variances.

Take observation and process noise [co]variances:

\[\widetilde{R}_{[j,i]}=\frac{a_j\,r_{b(i)}\,v_{[j,i]}}{\lambda_{[j,i]}}, \qquad \widetilde{\mathbf{Q}}_{[i]}=\frac{q_{b(i)}\,\mathbf{Q}_0}{\kappa_{[i]}}.\]

We also include replicate-level additive offsets in the observation mean:

\[z_{[j,i]} = x_{[i,0]} + b_j + \epsilon_{[j,i]}.\]

Here \(r_b>0\) and \(q_b>0\) are block-level scale multipliers, \(a_j>0\) and \(b_j\) are replicate-level multipliers/offsets, and \(\lambda_{[j,i]}\) and \(\kappa_{[i]}\) are Student-t precision multipliers.

Estimation loop

Repeat until convergence:

  1. Filter-Smoother estimation

Run the forward filter and backward smoother under the current (given) effective noises \(\widetilde{R}\) and \(\widetilde{\mathbf{Q}}\). This yields smoothed moments \(\widetilde{\mathbf{x}}_{[i]}\), \(\widetilde{\mathbf{P}}_{[i]}\), and lag-one covariances \(\widetilde{\mathbf{C}}_{[i,i+1]}\).

  2. Studentized precision reweighting:

Observation weights \(\lambda_{[j,i]}\) (EM_useObsPrecReweight):

\[u^2_{[j,i]}=\frac{(z_{[j,i]}-b_j-\widetilde{x}_{[i,0]})^2+\widetilde{P}_{[i,0,0]}} {\widetilde{R}_{[j,i]}} \quad\Rightarrow\quad \lambda_{[j,i]} \leftarrow \frac{\nu_R+1}{\nu_R+u^2_{[j,i]}}.\]

In code, EM_tNu corresponds to \(\nu_R\).

Process weights \(\kappa_{[i]}\):

Let \(\mathbf{w}_{[i]}=\mathbf{x}_{[i]}-\mathbf{F}\mathbf{x}_{[i-1]}\) and define

\[\Delta_{[i]}=\textsf{Trace}\!\left(\mathbf{Q}_0^{-1}\,\mathbb{E}\left[\mathbf{w}_{[i]}\mathbf{w}_{[i]}^\top\right]\right).\]

Then

\[\kappa_{[i]} \leftarrow \frac{\nu_Q+d}{\nu_Q+\Delta_{[i]}/q_{b(i)}},\]

where \(d=2\).

  3. Block-level scaling: update \(r_b\) and/or \(q_b\) using blockwise closed-form averages of sufficient statistics from the E-step.

Observation block scale update:

\[r_b \leftarrow \frac{1}{N_b}\sum_{(j,i):\,b(i)=b} \lambda_{[j,i]}\, \frac{(z_{[j,i]}-b_j-\widetilde{x}_{[i,0]})^2+\widetilde{P}_{[i,0,0]}}{a_j\,v_{[j,i]}}.\]

Process block scale update:

\[q_b \leftarrow \frac{1}{d\,M_b}\sum_{i:\,b(i)=b}\kappa_{[i+1]}\,\Delta_{[i]}.\]

  4. Replicate-level updates: update \(b_j\) and \(a_j\) from weighted residual moments at replicate resolution while holding block and interval-level terms fixed.

If EM_alphaEMA is supplied, we smooth updates with an exponential moving average (EMA) in the log domain:

\[\log r_b \leftarrow (1-\alpha)\log r_b^{(\mathrm{sm})} + \alpha \log r_b, \qquad \log q_b \leftarrow (1-\alpha_Q)\log q_b^{(\mathrm{sm})} + \alpha_Q \log q_b.\]
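The reweighting and scale updates above can be traced numerically. The snippet evaluates the formulas from this section for a single interval with three replicates (toy values; \(a_j = 1\), a single block, and zero offsets are assumed):

```python
import numpy as np

nuR = 10.0                          # EM_tNu, i.e., nu_R
z = np.array([2.0, 2.2, 8.0])       # observations (one interval, 3 reps)
bj = np.zeros(3)                    # replicate offsets b_j
xTilde, pTilde = 2.1, 0.05          # smoothed level and its variance
Reff = np.array([0.5, 0.5, 0.5])    # current effective variances R~

# Studentized precision reweighting (EM_useObsPrecReweight):
u2 = ((z - bj - xTilde) ** 2 + pTilde) / Reff
lam = (nuR + 1.0) / (nuR + u2)      # the outlier (8.0) gets a small weight

# Observation block scale update (single block; a_j = 1, v = Reff here):
rB = np.mean(lam * ((z - bj - xTilde) ** 2 + pTilde) / Reff)

# Log-domain EMA smoothing of the scale update (EM_alphaEMA):
alpha, rBPrev = 0.1, 1.0
rBSm = np.exp((1 - alpha) * np.log(rBPrev) + alpha * np.log(rB))
```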

Objective Function

Let \(x_{1:n}=\{\mathbf{x}_{[i]}\}_{i=1}^n\), \(\Lambda=\{\lambda_{[j,i]}\}\), and \(\kappa=\{\kappa_{[i]}\}\). Collecting process and observation terms and mixing penalties yields:

\begin{align} \mathcal{J}(x,\Lambda,\kappa,r,q,a,b) &= \frac12\sum_{i=2}^{n} \left[ \log\left|\frac{q_{b(i)}}{\kappa_{[i]}}\mathbf{Q}_0\right| + (\mathbf{x}_{[i]}-\mathbf{F}\mathbf{x}_{[i-1]})^\top \left(\frac{\kappa_{[i]}}{q_{b(i)}}\mathbf{Q}_0^{-1}\right) (\mathbf{x}_{[i]}-\mathbf{F}\mathbf{x}_{[i-1]}) \right] \\ &\quad+ \frac12\sum_{i=1}^{n}\sum_{j=1}^m \left[ \log\!\left(\frac{a_j r_{b(i)}v_{[j,i]}}{\lambda_{[j,i]}}\right) + (z_{[j,i]}-b_j-x_{[i,0]})^2\,\frac{\lambda_{[j,i]}}{a_j r_{b(i)}v_{[j,i]}} \right] \\ &\quad+ \sum_{i=1}^{n}\sum_{j=1}^m \left[ -\left(\frac{\nu_R+1}{2}-1\right)\log\lambda_{[j,i]} +\frac{\nu_R+1}{2}\lambda_{[j,i]} \right] \\ &\quad+ \sum_{i=2}^{n} \left[ -\left(\frac{\nu_Q+d}{2}-1\right)\log\kappa_{[i]} +\frac{\nu_Q+d}{2}\kappa_{[i]} \right]. \end{align}

The estimation loop minimizing this objective may thus be viewed as coordinate descent: the filter-smoother solves the quadratic subproblem in \(x_{1:n}\) conditional on the current estimates of \(\Lambda\), \(\kappa\), \(r\), \(q\), \(a\), and \(b\), while the reweighting and scale/offset updates in turn optimize over those remaining variables.

Parameters:
  • matrixData (numpy.ndarray[numpy.float32]) – Replicate track data (rows: replicates, columns: genomic intervals).

  • matrixPluginMuncInit (numpy.ndarray[numpy.float32]) – Data-derived observation noise variances \(v_{[j,i]}\). Same per-replicate/per-interval shape as matrixData.

  • matrixF (numpy.ndarray[numpy.float32]) – Transition matrix \(\mathbf{F}\), shape (2, 2).

  • matrixQ0 (numpy.ndarray[numpy.float32]) – Base process noise covariance \(\mathbf{Q}_0 \in \mathbb{R}^{2 \times 2}\).

  • intervalToBlockMap (numpy.ndarray[numpy.int32]) – Mapping from interval index \(i\) to block index \(b(i)\).

  • blockCount (int) – Number of blocks that are rescaled during optimization. Larger values optimize at a finer resolution, but introduce a risk of overfitting, increased runtime, and potential ambiguity with the precision reweighting procedure.

  • stateInit (float) – Initial value for the signal level (first component) of the state vector \(\mathbf{x}_{[0]}\).

  • stateCovarInit (float) – Initial state covariance scale.

  • EM_maxIters (int) – Maximum number of outer EM iterations.

  • EM_rtol (float) – Relative improvement tolerance on the NLL used to assess convergence.

  • EM_scaleLOW (float) – Lower bound for block scales \(r_b\) and \(q_b\).

  • EM_scaleHIGH (float) – Upper bound for block scales \(r_b\) and \(q_b\).

  • EM_alphaEMA (float) – If in (0, 1], enables log-domain EMA smoothing for block scale updates with smoothing factor \(\alpha\).

  • EM_scaleToMedian (bool) – If True, rescales \(r_b\) and \(q_b\) by their medians after each M-step. This can help preserve between-block scale differences for interpretability, but may interfere with convergence of the EM algorithm since it changes the effective objective.

  • EM_tNu (float) – Student-t degrees of freedom controlling reweighting strength (smaller = stronger reweighting).

  • EM_useObsBlockScale (bool) – If True, estimate block observation scales \(r_b\); otherwise fix \(r_b\equiv 1\).

  • EM_useProcBlockScale (bool) – If True, estimate block process scales \(q_b\); otherwise fix \(q_b\equiv 1\).

  • EM_useObsPrecReweight (bool) – If True, update observation precision multipliers \(\lambda_{[j,i]}\) (Student-t reweighting); otherwise \(\lambda\equiv 1\).

  • EM_useProcPrecReweight (bool) – If True, update process precision multipliers \(\kappa_{[i]}\) (Student-t reweighting); otherwise \(\kappa\equiv 1\).

  • EM_useReplicateBias (bool) – If True, update replicate-level additive offsets \(b_j\).

  • EM_useReplicateScale (bool) – If True, update replicate-level observation variance scales \(a_j\).

  • t_innerIters (int) – Number of inner filter/smoother + reweighting updates per outer EM iteration.

  • returnIntermediates (bool) – If True, also return smoothed states/covariances, residuals, and (if enabled) precision multipliers.

Returns:

A tuple (rScale, qScale, itersDone, finalNLL). If returnIntermediates=True, additionally returns (stateSmoothed, stateCovarSmoothed, lagCovSmoothed, postFitResiduals, lambdaExp, processPrecExp, replicateBias, replicateScale).

Return type:

tuple

References

  • Shumway, R. H. & Stoffer, D. S. (1982): An approach to time series smoothing and forecasting using the EM algorithm. DOI: 10.1111/j.1467-9892.1982.tb00349.x

  • West, M. (1987): On scale mixtures of normal distributions. DOI: 10.1093/biomet/74.3.646

consenrich.cconsenrich.cEMA(x, alpha)
consenrich.cconsenrich.cmeanVarPairs(intervals, values, blockSize, iters, randSeed, excludeIdxMask, zeroPenalty=0.0, zeroThresh=0.0, useInnovationVar=True, useSampleVar=False)
consenrich.cconsenrich.projectToBox(vectorX, matrixP, stateLowerBound, stateUpperBound, eps)

consenrich.matching

(Experimental) Detect genomic regions showing both enrichment and non-random structure

Denote a noisy signal over fixed-length genomic intervals,

\[{\mathbf{x}} = \{{x}_{[i]}\}_{i=1}^{i=n}.\]

Aim: Determine a set of structured peaks, i.e., genomic regions where \(x_{[a:b]}\) exhibits both:

  1. Enrichment (large relative amplitude)

  2. Non-random structure defined by a robust template (polynomial, oscillatory, etc.)

Prioritizing genomic regions that are both enriched and agree with a prescribed structure may be appealing for several reasons. Namely,

  • Targeted detection of biologically relevant signal patterns in a given assay (e.g., see related works analyzing peak-shape Cremona et al., 2015, Parodi et al., 2017)

  • Improved confidence that the identified genomic regions are not due to stochastic noise, which is characteristically unstructured.

  • Speed: Runs in seconds/minutes using efficient numerical methods to compute large chromosome-scale convolutions (fast Fourier transform (FFT)-based, overlap-add (OA), etc.)

Algorithm Overview

To detect structured peaks, we run an approach akin to matched filtering, with templates derived from approximated discrete wavelets (or their corresponding scaling functions).

Note, other bases for defining templates may be considered in the future, e.g., splines, polynomials, etc. Wavelet-based templates are convenient and possess several desirable properties for detecting structured peaks: multi-scale representation, compact support, orthogonality, etc. For relevant background, Fournier et al. ’95 and/or Mallat and Hwang ’92 may be useful reading.

Denote the cross-correlation between the signal track and a matching template \(\boldsymbol{\xi}\) as:

\[\{\mathcal{R}_{[i]}\}_{i=1}^{i=n} = {\mathbf{x}} \star \boldsymbol{\xi},\]
\[\mathcal{R}_{[i]} = \sum_{t=1}^{t=T} {x}_{[i+t-1]} \cdot \xi_{[t]}.\]

We refer to \(\mathcal{R}\) over \(i=1 \ldots n\) as the response sequence. The response will be greatest in genomic regions with a high read density and structural similarity to the template \(\boldsymbol{\xi}\).
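The response sequence is a cross-correlation, so it can be computed chromosome-wide via the FFT. A minimal sketch (the step-like template and data here are illustrative, not Consenrich’s actual templates):

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(size=50_000)
x[20_000:20_008] += 3.0                      # an enriched, step-like region
xi = np.ones(8) / np.sqrt(8)                 # unit-norm template (T = 8)

# Cross-correlation via FFT: R[i] = sum_t x[i+t-1] * xi[t]
nFFT = 1 << int(np.ceil(np.log2(len(x) + len(xi) - 1)))
R = np.fft.irfft(np.fft.rfft(x, nFFT) * np.conj(np.fft.rfft(xi, nFFT)), nFFT)
R = R[: len(x) - len(xi) + 1]                # keep valid lags only
# The response peaks where the data matches the template's shape and
# amplitude -- here, at the start of the enriched region.
```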

To detect ‘significant’ hits,

  • We first construct an observed empirical distribution from randomly-sampled genomic blocks. Specifically, we sample \(B\) blocks and record each \(\max(\mathcal{R}_{[b_1]}, \ldots, \mathcal{R}_{[b_K]})\). To reduce leakage and selection bias, the chromosome is split into block-level null/test subsets, with guard zones near split boundaries.

  • Relative maxima in the response sequence, i.e., \(i^*\) such that \(\mathcal{R}_{[i^* - T/2 \,:\, i^* - 1]}\, \leq \, \mathcal{R}_{[i^*]} \, \geq \, \mathcal{R}_{[i^* + 1 \,:\, i^* + T/2]}\), are retained as candidate matches

  • Each candidate is assigned an empirical \(p\)-value based on its upper-tail ECDF under the null subset maxima (with lower bound \(1/(B+1)\)). The split is repeated several times, and per-candidate split-level \(p\)-values are aggregated with a Cauchy combination rule. Those satisfying \(p < \alpha\) are deemed ‘significant’. Note that \(p\)-values are with respect to chromosome-specific empirical distributions.

    • Additional criteria for matching: require the signal values at candidate peaks/matches, \({x}_{[i^*]}\), to exceed a cutoff (matchingParams.minSignalAtMaxima), and/or require the length of the matched feature to exceed a minimum size (matchingParams.minMatchLengthBP).

    • Overlapping/adjacent matches can be merged.
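The Cauchy combination rule used to aggregate split-level \(p\)-values has a simple closed form: transform each \(p\) to a standard Cauchy tail statistic, average, and map back. A sketch (the helper name cauchyCombine is hypothetical; equal weights are assumed):

```python
import numpy as np

def cauchyCombine(pvals, weights=None):
    """Cauchy combination of p-values (ACAT-style)."""
    p = np.asarray(pvals, dtype=float)
    w = np.full(p.shape, 1.0 / p.size) if weights is None else weights
    t = np.sum(w * np.tan((0.5 - p) * np.pi))   # Cauchy tail statistics
    return 0.5 - np.arctan(t) / np.pi           # map back to a p-value

pUniform = cauchyCombine([0.5, 0.5, 0.5])   # no evidence in any split
pMixed = cauchyCombine([0.001, 0.4, 0.6])   # one strong split dominates
```

A notable property: a single very small split-level \(p\)-value can drive the combined value, which suits detection settings where a signal may appear in only some splits.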

Note

Blocked Split + Repeated Combination

  • The response sequence is partitioned into coarse genomic blocks, stratified by robust block summary statistics, then assigned approximately 50/50 into null/test subsets within each stratum.

  • For each split, null maxima are sampled only from null blocks and candidates are evaluated only on test blocks.

  • Split-level candidate \(p\)-values are then combined with a Cauchy combination statistic.

Generic Defaults

matchingParams.templateNames: [haar, haar, db2, db2]
matchingParams.cascadeLevels: [1,2,1,2]
matchingParams.minMatchLengthBP: -1 # select via `consenrich.core.getContextSize`
matchingParams.mergeGapBP: -1
matchingParams.alpha: 0.05
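Templates of this kind can be approximated with the cascade algorithm. The sketch below iterates upsample-and-filter steps for the Haar scaling filter (other templates such as db2 would substitute different filter coefficients); this illustrates the idea, not the package’s exact template construction:

```python
import numpy as np

h = np.array([1.0, 1.0]) / np.sqrt(2.0)      # Haar scaling filter

def cascade(h, levels):
    """Approximate the scaling function by iterated upsample-and-filter."""
    phi = h.copy()
    for _ in range(levels):
        up = np.zeros(2 * len(phi) - 1)
        up[::2] = phi                        # upsample by 2
        phi = np.sqrt(2.0) * np.convolve(up, h)
    return phi

template = cascade(h, levels=2)
template = template / np.linalg.norm(template)   # unit norm for matching
# For Haar, the cascade converges to a box (constant) template.
```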

consenrich.matching.matchWavelet(chromosome, intervals, values, templateNames, cascadeLevels, iters, alpha=0.05, minMatchLengthBP=-1, maxNumMatches=100000, minSignalAtMaxima=0.1, randSeed=42, recenterAtPointSource=True, useScalingFunction=True, excludeRegionsBedFile=None, weights=None, eps=0.001, methodFDR=None)[source]

Detect structured peaks in Consenrich tracks by matching wavelet- or scaling-function–based templates.

Parameters:
  • chromosome (str) – Chromosome name for the input intervals and values.

  • values (npt.NDArray[np.float64]) – A 1D array of signal-like values. In this documentation, we refer to values derived from Consenrich, but other continuous-valued tracks at evenly spaced genomic intervals may be suitable, too.

  • templateNames (List[str]) – A list of str values – each entry references a mother wavelet (or its corresponding scaling function). e.g., [haar, db2]

  • cascadeLevels (List[int]) – Number of cascade iterations used to approximate each template (wavelet or scaling function). Must have the same length as templateNames, with each entry aligned to the corresponding template. e.g., given templateNames [haar, db2], then [2,2] would use 2 cascade levels for both templates.

  • iters (int) – Number of random blocks to sample in the response sequence while building an empirical null to test significance within chromosomes. See cconsenrich.csampleBlockStats().

  • alpha (float) – Primary significance threshold on detected matches. Specifically, the minimum corrected empirical p-value approximated from randomly sampled blocks in the response sequence.

  • minMatchLengthBP (Optional[int]) – Within a window of minMatchLengthBP length (bp), relative maxima in the signal-template convolution must be greater in value than others to qualify as matches. If set to a value less than 1, the minimum length is determined via consenrich.matching.autoMinLengthIntervals(), a simple wrapper around consenrich.core.getContextSize(). If set to None, defaults to 250 bp.

  • minSignalAtMaxima (Optional[str | float]) – Secondary significance threshold coupled with \(\alpha\). Requires the signal value at relative maxima in the response sequence to be greater than a threshold \(\pm \epsilon\).

  • useScalingFunction (bool) – If True, use (only) the scaling function to build the matching template. If False, use (only) the wavelet function.

  • excludeRegionsBedFile (Optional[str]) – A BED file with regions to exclude from matching

  • recenterAtPointSource (bool) – If True, recenter detected matches at the point source (max value)

  • weights (Optional[npt.NDArray[np.float64]]) – Optional weights to apply to values prior to matching. Must have the same length as values.

  • eps (float) – Tolerance parameter for relative maxima detection in the response sequence. Set to zero to enforce strict inequalities when identifying discrete relative maxima.

  • intervals (ndarray[tuple[Any, ...], dtype[int]])

  • maxNumMatches (int | None)

  • randSeed (int)

  • methodFDR (str | None)

Seealso:

consenrich.core.matchingParams, cconsenrich.csampleBlockStats(), and the Algorithm Overview above.

Returns:

A pandas DataFrame with detected matches

Return type:

pd.DataFrame

Notation

  • \(m\) = number of replicate tracks (rows)

  • \(n\) = number of genomic intervals (columns)

  • \(B\) = number of genomic blocks partitioning the main contig (blockCount)

  • \(b(i)\in\{0,\dots,B-1\}\) maps interval \(i\) to block index (intervalToBlockMap)

  • \(z_{[j,i]}\) is the observation for replicate \(j\) at genomic interval \(i\)

  • \(v_{[j,i]}\) is the observation variance for replicate \(j\) at genomic interval \(i\) (consenrich.core.getMuncTrack())

  • \(\mathbf{x}_{[i]}\in\mathbb{R}^2\) is the state vector at interval \(i\), where

    • \(x_{[i,0]}\) denotes the latent signal level at genomic interval \(i\)

    • \(x_{[i,1]}\) denotes the latent signal slope at genomic interval \(i\)

  • \(\mathbf{P}_{[i]}\in\mathbb{R}^{2\times 2}\) is the state covariance matrix at interval \(i\), where

    • \(P_{[i,0,0]}\) is the variance of the signal level at interval \(i\)

    • \(P_{[i,1,1]}\) is the variance of the signal slope at interval \(i\)

    • \(P_{[i,0,1]}=P_{[i,1,0]}\) is the covariance between signal level and slope at interval \(i\)

  • \(\mathbf{F}\) is the process model matrix (matrixF), where

    • \(F_{00}=1\) and \(F_{01}=\Delta_F\)

    • \(F_{10}=0\) and \(F_{11}=1\)

  • \(\mathbf{Q}_0\) is the initial process model noise covariance matrix (matrixQ0)

  • \(\mathbf{H}\) is the observation model matrix (\(\mathbf{H}=[1, 0]\))

  • \(\mathbf{R}_{[i]}=\textsf{diag}\{v_{[:,i]}\}\) is the base observation-noise covariance at interval \(i\)

  • \(\lambda_{[j,i]}\) is the latent precision multiplier for observation \(z_{[j,i]}\) in a Student-t Gaussian scale-mixture model

  • \(\kappa_{[i]}\) is the latent precision multiplier for the process transition at interval \(i\) in a Student-t Gaussian scale-mixture model

  • \(\nu_R\) is the degrees-of-freedom parameter controlling the heaviness of Student-t tails for observation noise

  • \(\nu_Q\) is the degrees-of-freedom parameter controlling the heaviness of Student-t tails for process noise

  • \(r_b\) is the block-level observation noise scale multiplier for block \(b\)

  • \(q_b\) is the block-level process noise scale multiplier for block \(b\)

  • \(a_j\) is the replicate-level observation variance scale multiplier for replicate \(j\)

  • \(b_j\) is the replicate-level additive observation offset for replicate \(j\)

The effective observation model used in EM is:

\[z_{[j,i]} = x_{[i,0]} + b_j + \varepsilon_{[j,i]}, \qquad \mathrm{Var}(\varepsilon_{[j,i]}) = \frac{a_j\,r_{b(i)}\,v_{[j,i]}}{\lambda_{[j,i]}}.\]

Resolution of scaling terms:

  • \(v_{[j,i]}\) and \(\lambda_{[j,i]}\) operate at replicate-interval resolution

  • \(r_b\) and \(q_b\) operate at block resolution via \(b(i)\)

  • \(a_j\) and \(b_j\) operate at replicate resolution
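Putting the notation together for a single observation (toy values, purely illustrative):

```python
# Effective variance of eps[j, i] = a_j * r_{b(i)} * v_{j,i} / lambda_{j,i}
aJ, bJ = 1.2, 0.3   # replicate-level scale a_j and additive offset b_j
rB = 0.8            # block-level observation scale r_{b(i)}
v = 2.0             # data-derived variance v[j, i]
lam = 0.5           # Student-t precision multiplier (downweights outliers)

effVar = aJ * rB * v / lam   # variance of eps[j, i]

# The modeled observation mean is x[i, 0] + b_j:
xLevel = 4.0
zMean = xLevel + bJ
```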