API Reference¶
consenrich.core
¶
The core module implements the main aspects of Consenrich and defines key parameter classes.
- class consenrich.core.processParams(deltaF, minQ, maxQ, offDiagQ, dStatAlpha, dStatd, dStatPC)[source]¶
Parameters related to the process model of Consenrich.
The process model governs the signal and variance propagation through the state transition \(\mathbf{F} \in \mathbb{R}^{2 \times 2}\) and process noise covariance \(\mathbf{Q}_{[i]} \in \mathbb{R}^{2 \times 2}\) matrices.
- Parameters:
deltaF (float) – Scales the signal and variance propagation between adjacent genomic intervals.
minQ (float) – Minimum process noise level (diagonal in \(\mathbf{Q}_{[i]}\)) for each state variable. Adjust relative to data quality (more reliable data –> lower minQ).
dStatAlpha (float) – Threshold on the deviation between the data and estimated signal – used to determine whether the process noise is scaled up.
dStatd (float) – Constant \(d\) in the scaling expression \(\sqrt{d|D_{[i]} - \alpha_D| + c}\) that is used to up/down-scale the process noise covariance in the event of a model mismatch.
dStatPC (float) – Constant \(c\) in the scaling expression \(\sqrt{d|D_{[i]} - \alpha_D| + c}\) that is used to up/down-scale the process noise covariance in the event of a model mismatch.
maxQ (float)
offDiagQ (float)
- class consenrich.core.observationParams(minR, maxR, useALV, useConstantNoiseLevel, noGlobal, numNearest, localWeight, globalWeight, approximationWindowLengthBP, lowPassWindowLengthBP, lowPassFilterType, returnCenter)[source]¶
Parameters related to the observation model of Consenrich. The observation model is used to integrate sequence alignment count data from the multiple input samples and account for region-and-sample-specific noise processes corrupting data. The observation model matrix \(\mathbf{H} \in \mathbb{R}^{m \times 2}\) maps from the state dimension (2) to the dimension of measurements/data (\(m\)).
- Parameters:
minR (float) – The minimum observation noise level for each sample \(j=1\ldots m\) in the observation noise covariance matrix \(\mathbf{R}_{[i, (11:mm)]}\).
numNearest (int) – The number of nearest nearby sparse features to use for local variance calculation. Ignored if useALV is True.
localWeight (float) – The coefficient for the local noise level (based on the local surrounding window / numNearest features) used in the weighted sum measuring sample-specific noise level at the current interval.
globalWeight (float) – The coefficient for the global noise level (based on all genomic intervals \(i=1\ldots n\)) used in the weighted sum measuring sample-specific noise level at the current interval.
approximationWindowLengthBP (int) – The length of the local approximation window in base pairs (BP) for the local variance calculation.
sparseBedFile (str, optional) – The path to a BED file of ‘sparse’ regions for the local variance calculation.
noGlobal (bool) – If True, only the ‘local’ variances are used to approximate observation noise covariance \(\mathbf{R}_{[:, (11:mm)]}\).
useALV (bool) – Whether to use average local variance (ALV) to approximate observation noise covariances per-sample, per-interval. Recommended for estimating signals associated with repressive/heterochromatic elements.
useConstantNoiseLevel (bool) – Whether to use a constant noise level in the observation model.
lowPassFilterType (Optional[str]) – The type of low-pass filter to use (e.g., ‘median’, ‘mean’).
maxR (float)
lowPassWindowLengthBP (int)
returnCenter (bool)
- class consenrich.core.stateParams(stateInit, stateCovarInit, boundState, stateLowerBound, stateUpperBound)[source]¶
Parameters related to state and uncertainty bounds and initialization.
- Parameters:
stateInit (float) – Initial value of the ‘primary’ state/signal at the first genomic interval: \(x_{[1]}\)
stateCovarInit (float) – Initial state covariance (covariance) scale. Note, the initial state uncertainty \(\mathbf{P}_{[1]}\) is a multiple of the identity matrix \(\mathbf{I}\)
boundState (bool) – If True, the primary state estimate for \(x_{[i]}\) is constrained within stateLowerBound and stateUpperBound.
stateLowerBound (float) – Lower bound for the state estimate.
stateUpperBound (float) – Upper bound for the state estimate.
- class consenrich.core.detrendParams(useOrderStatFilter, usePolyFilter, detrendTrackPercentile, detrendSavitzkyGolayDegree, detrendWindowLengthBP)[source]¶
Parameters related detrending and background-removal
- Parameters:
useOrderStatFilter (bool) – Whether to use a local/moving order statistic (percentile filter) to model and remove trends in the read density data.
usePolyFilter (bool) – Whether to use a low-degree polynomial fit to model and remove trends in the read density data.
detrendSavitzkyGolayDegree (int) – The polynomial degree of the Savitzky-Golay filter to use for detrending
detrendTrackPercentile (float) – The percentile to use for the local/moving order-statistic filter. Decrease for broad marks + sparse data if useOrderStatFilter is True.
detrendWindowLengthBP (int) – The length of the window in base pairs for detrending. Increase for broader marks + sparse data.
- class consenrich.core.inputParams(bamFiles, bamFilesControl)[source]¶
Parameters related to the input data for Consenrich.
- Parameters:
bamFiles (List[str]) – A list of paths to distinct coordinate-sorted and indexed BAM files.
bamFilesControl (List[str], optional) – A list of paths to distinct coordinate-sorted and indexed control BAM files. e.g., IgG control inputs for ChIP-seq.
- class consenrich.core.genomeParams(genomeName, chromSizesFile, blacklistFile, sparseBedFile, chromosomes, excludeChroms, excludeForNorm)[source]¶
Specify assembly-specific resources, parameters.
- Parameters:
genomeName (str) – If supplied, default resources for the assembly (sizes file, blacklist, and ‘sparse’ regions) in src/consenrich/data are used.
ce10, ce11, dm6, hg19, hg38, mm10, mm39
have default resources available.chromSizesFile (str) – A two-column TSV-like file with chromosome names and sizes (in base pairs).
blacklistFile (str, optional) – A BED file with regions to exclude.
sparseBedFile (str, optional) – A BED file with sparse regions used to estimate noise levels – ignored if observationParams.useALV is True.
chromosomes (List[str]) – A list of chromosome names to analyze. If None, all chromosomes in chromSizesFile are used.
excludeChroms (List[str])
excludeForNorm (List[str])
- class consenrich.core.countingParams(stepSize, scaleDown, scaleFactors, scaleFactorsControl, numReads, applyAsinh, applyLog, rescaleToTreatmentCoverage=False)[source]¶
Parameters related to counting reads in genomic intervals.
- Parameters:
stepSize (int) – Step size (bp) for the genomic intervals (AKA bin size, interval length, width, etc.)
scaleDown (bool, optional) – If using paired treatment and control BAM files, whether to scale down the larger of the two before computing the difference/ratio
scaleFactors (List[float], optional) – Scale factors for the read counts.
scaleFactorsControl (List[float], optional) – Scale factors for the control read counts.
numReads (int) – Number of reads to sample.
applyAsinh (bool, optional) – If true, \(\textsf{arsinh}(x)\) applied to counts \(x\) (log-like for large values and linear near the origin)
applyLog (bool, optional) – If true, \(\textsf{log}(x + 1)\) applied to counts \(x\)
rescaleToTreatmentCoverage (bool | None)
- class consenrich.core.samParams(samThreads, samFlagExclude, oneReadPerBin, chunkSize, offsetStr='0,0', extendBP=[], maxInsertSize=1000, pairedEndMode=0, inferFragmentLength=0)[source]¶
Parameters related to reading BAM files
- Parameters:
samThreads (int) – The number of threads to use for reading BAM files.
samFlagExclude (int) – The SAM flag to exclude certain reads.
oneReadPerBin (int) – If 1, only the interval with the greatest read overlap is incremented.
chunkSize (int) – maximum number of intervals’ data to hold in memory before flushing to disk.
offsetStr (str) – A string of two comma-separated integers – first for the 5’ shift on forward strand, second for the 5’ shift on reverse strand.
extendBP (List[int]) – A list of integers specifying the number of base pairs to extend reads for each BAM file after shifting per offsetStr. If all BAM files share the same expected frag. length, can supply a single numeric value to be broadcasted. Ignored for PE reads.
maxInsertSize (int) – Maximum frag length/insert for paired-end reads.
pairedEndMode (int) – If > 0, only proper pairs are counted subject to maxInsertSize.
inferFragmentLength (int) – Intended for single-end data: if > 0, the maximum correlation lag (avg.) between strand-specific read tracks is taken as the fragment length estimate and used to extend reads from 5’. Ignored if pairedEndMode > 0 or extendBP set. This parameter is particularly important when targeting broader marks (e.g., ChIP-seq H3K27me3).
Tip
For an overview of SAM flags, see https://broadinstitute.github.io/picard/explain-flags.html
- class consenrich.core.matchingParams(templateNames, cascadeLevels, iters, alpha, minMatchLengthBP, maxNumMatches, minSignalAtMaxima='q:0.75', merge=False, mergeGapBP=25, useScalingFunction=True, excludeRegionsBedFile=None)[source]¶
Parameters related to the matching algorithm packaged with this software.
See `matching` for details.
- Parameters:
templateNames (List[str]) – A list of str values – wavelet bases used for matching, e.g., [haar, db2, sym4]
cascadeLevels (List[int]) – A list of int values – the number of cascade iterations used for approximating the scaling/wavelet functions.
iters (int) – Number of random blocks to sample in the response sequence while building an empirical null to test significance. See
cconsenrich.csampleBlockStats()
.alpha (float) – Primary significance threshold on detected matches. Specifically, the \(1 - \alpha\) quantile of an empirical null distribution. The empirical null distribution is built from cross-correlation values over randomly sampled blocks.
minMatchLengthBP (int) – Within a window of minMatchLengthBP length (bp), relative maxima in the signal-template convolution must be greater in value than others to qualify as matches. Set to a negative value to disable this filter.
minSignalAtMaxima (Optional[str | float]) – Secondary significance threshold coupled with alpha. Require the signal value at relative maxima in the response sequence to be greater than this threshold. Comparisons are made in log-scale. If a float value is provided, the minimum signal value must be greater than this (absolute) value. Set to a negative value to disable the threshold. If a str value is provided, looks for ‘q:quantileValue’, e.g., ‘q:0.75’. The threshold is then set to the corresponding quantile of the non-zero signal estimates. Defaults to str value ‘q:0.75’ — the 75th percentile of signal values.
useScalingFunction (bool) – If True, use (only) the scaling function to build the matching template. If False, use (only) the wavelet function.
excludeRegionsBedFile (Optional[str]) – A BED file with regions to exclude from matching
maxNumMatches (int | None)
merge (bool)
mergeGapBP (int)
- Seealso:
consenrich.core.matchingParams
,cconsenrich.csampleBlockStats()
, `matching`
- consenrich.core.getChromRanges(bamFile, chromosome, chromLength, samThreads, samFlagExclude)[source]¶
Get the start and end positions of reads in a chromosome from a BAM file.
- Parameters:
bamFile (str) – See
inputParams
.chromosome (str) – the chromosome to read in bamFile.
chromLength (int) – Base pair length of the chromosome.
samThreads (int) – See
samParams
.samFlagExclude (int) – See
samParams
.
- Returns:
Tuple of start and end positions (nucleotide coordinates) in the chromosome.
- Return type:
Tuple[int, int]
- Seealso:
getChromRangesJoint()
,cconsenrich.cgetFirstChromRead()
,cconsenrich.cgetLastChromRead()
- consenrich.core.getChromRangesJoint(bamFiles, chromosome, chromSize, samThreads, samFlagExclude)[source]¶
For multiple BAM files, reconcile a single start and end position over which to count reads, where the start and end positions are defined by the first and last reads across all BAM files.
- Parameters:
bamFiles (List[str]) – List of BAM files to read.
chromosome (str) – Chromosome to read.
chromSize (int) – Size of the chromosome.
samThreads (int) – Number of threads to use for reading the BAM files.
samFlagExclude (int) – SAM flag to exclude certain reads.
- Returns:
Tuple of start and end positions.
- Return type:
Tuple[int, int]
- Seealso:
getChromRanges()
,cconsenrich.cgetFirstChromRead()
,cconsenrich.cgetLastChromRead()
- consenrich.core.getReadLength(bamFile, numReads, maxIterations, samThreads, samFlagExclude)[source]¶
Infer read length from mapped reads in a BAM file.
Samples at least numReads reads passing criteria given by samFlagExclude and returns the median read length.
- Parameters:
bamFile (str) – See
inputParams
.numReads (int) – Number of reads to sample.
maxIterations (int) – Maximum number of iterations to perform.
samThreads (int) – See
samParams
.samFlagExclude (int) – See
samParams
.
- Returns:
The median read length.
- Return type:
int
- Raises:
ValueError – If the read length cannot be determined after scanning maxIterations reads.
- Seealso:
cconsenrich.cgetReadLength()
- consenrich.core.readBamSegments(bamFiles, chromosome, start, end, stepSize, readLengths, scaleFactors, oneReadPerBin, samThreads, samFlagExclude, offsetStr='0,0', applyAsinh=False, applyLog=False, extendBP=[], maxInsertSize=1000, pairedEndMode=0, inferFragmentLength=0)[source]¶
Calculate tracks of read counts (or a function thereof) for each BAM file.
See
cconsenrich.creadBamSegment()
for the underlying implementation in Cython.- Parameters:
bamFiles (List[str]) – See
inputParams
.chromosome (str) – Chromosome to read.
start (int) – Start position of the genomic segment.
end (int) – End position of the genomic segment.
readLengths (List[int]) – List of read lengths for each BAM file.
scaleFactors (List[float]) – List of scale factors for each BAM file.
stepSize (int) – See
countingParams
.oneReadPerBin (int) – See
samParams
.samThreads (int) – See
samParams
.samFlagExclude (int) – See
samParams
.offsetStr (str) – See
samParams
.extendBP (int) – See
samParams
.maxInsertSize (int) – See
samParams
.pairedEndMode (int) – See
samParams
.inferFragmentLength (int) – See
samParams
.applyAsinh (bool | None)
applyLog (bool | None)
- Return type:
ndarray[tuple[Any, …], dtype[float32]]
- consenrich.core.getAverageLocalVarianceTrack(values, stepSize, approximationWindowLengthBP, lowPassWindowLengthBP, minR, maxR, lowPassFilterType='median')[source]¶
Approximate a positional/local noise level track for a single sample’s read-density-based values.
First computes a moving average of
values
using a bp-length windowapproximationWindowLengthBP
and a moving average ofvalues**2
over the same window. Their difference is used to approximate the local variance. A low-pass filter (median or mean) with windowlowPassWindowLengthBP
then smooths the variance track. Finally, the track is clipped to[minR, maxR]
to yield the local noise level track.- Parameters:
values (np.ndarray) – 1D array of read-density-based values for a single sample.
stepSize (int) – Bin size (bp).
approximationWindowLengthBP (int) – Window (bp) for local mean and second-moment. See
observationParams
.lowPassWindowLengthBP (int) – Window (bp) for the low-pass filter on the variance track. See
observationParams
.minR (float) – Lower clip for the returned noise level. See
observationParams
.maxR (float) – Upper clip for the returned noise level. See
observationParams
.lowPassFilterType (Optional[str]) –
"median"
(default) or"mean"
. Type of low-pass filter to use for smoothing the local variance track. SeeobservationParams
.
- Returns:
Local noise level per interval.
- Return type:
npt.NDArray[np.float32]
- Seealso:
- consenrich.core.constructMatrixF(deltaF)[source]¶
Build the state transition matrix for the process model
- Parameters:
deltaF (float) – See
processParams
.- Returns:
The state transition matrix \(\mathbf{F}\)
- Return type:
npt.NDArray[np.float32]
- Seealso:
- consenrich.core.constructMatrixQ(minDiagQ, offDiagQ=0.0)[source]¶
Build the initial process noise covariance matrix \(\mathbf{Q}_{[1]}\).
- Parameters:
minDiagQ (float) – See
processParams
.offDiagQ (float) – See
processParams
.
- Returns:
The initial process noise covariance matrix \(\mathbf{Q}_{[1]}\).
- Return type:
npt.NDArray[np.float32]
- Seealso:
- consenrich.core.constructMatrixH(m, coefficients=None)[source]¶
Build the observation model matrix \(\mathbf{H}\).
- Parameters:
m (int) – Number of observations.
coefficients (Optional[np.ndarray]) – Optional coefficients for the observation model, which can be used to weight the observations manually.
- Returns:
The observation model matrix \(\mathbf{H}\).
- Return type:
npt.NDArray[np.float32]
- Seealso:
observationParams
, class:inputParams
- consenrich.core.runConsenrich(matrixData, matrixMunc, deltaF, minQ, maxQ, offDiagQ, dStatAlpha, dStatd, dStatPC, stateInit, stateCovarInit, boundState, stateLowerBound, stateUpperBound, chunkSize, progressIter, coefficientsH=None, residualCovarInversionFunc=None, adjustProcessNoiseFunc=None)[source]¶
Run consenrich on a contiguous segment (e.g. a chromosome) of read-density-based data. Completes the forward and backward passes given data and approximated observation noise covariance matrices \(\mathbf{R}_{[1:n, (11:mm)]}\).
- Parameters:
matrixData (np.ndarray) – Read density data for a single chromosome or general contiguous segment, possibly preprocessed. Two-dimensional array of shape \(m \times n\) where \(m\) is the number of samples/tracks and \(n\) the number of genomic intervals.
matrixMunc (np.ndarray) – Uncertainty estimates for the read coverage data. Two-dimensional array of shape \(m \times n\) where \(m\) is the number of samples/tracks and \(n\) the number of genomic intervals. See
getMuncTrack()
.deltaF (float) – See
processParams
.minQ (float) – See
processParams
.maxQ (float) – See
processParams
.offDiagQ (float) – See
processParams
.dStatAlpha (float) – See
processParams
.dStatd (float) – See
processParams
.dStatPC (float) – See
processParams
.stateInit (float) – See
stateParams
.stateCovarInit (float) – See
stateParams
.chunkSize (int) – Number of genomic intervals’ data to keep in memory before flushing to disk.
progressIter (int) – The number of iterations after which to log progress.
coefficientsH (Optional[np.ndarray]) – Optional coefficients for the observation model matrix \(\mathbf{H}\). If None, the coefficients are set to 1.0 for all samples.
residualCovarInversionFunc (Optional[Callable]) – Callable function to invert the observation covariance matrix \(\mathbf{E}_{[i]}\). If None, defaults to
cconsenrich.cinvertMatrixE()
.adjustProcessNoiseFunc (Optional[Callable]) – Function to adjust the process noise covariance matrix \(\mathbf{Q}_{[i]}\). If None, defaults to
cconsenrich.updateProcessNoiseCovariance()
.boundState (bool)
stateLowerBound (float)
stateUpperBound (float)
- Returns:
Tuple of three numpy arrays: - state estimates \(\widetilde{\mathbf{x}}_{[i]}\) of shape \(n \times 2\) - state covariance estimates \(\widetilde{\mathbf{P}}_{[i]}\) of shape \(n \times 2 \times 2\) - post-fit residuals \(\widetilde{\mathbf{y}}_{[i]}\) of shape \(n \times m\)
- Return type:
Tuple[np.ndarray, np.ndarray, np.ndarray]
- Raises:
ValueError – If the number of samples in matrixData is not equal to the number of samples in matrixMunc.
- Seealso:
- consenrich.core.getPrimaryState(stateVectors, roundPrecision=3)[source]¶
Get the primary state estimate from each vector after running Consenrich.
- Parameters:
stateVectors (npt.NDArray[np.float32]) – State vectors from
runConsenrich()
.roundPrecision (int)
- Returns:
A one-dimensional numpy array of the primary state estimates.
- Return type:
npt.NDArray[np.float32]
- consenrich.core.getStateCovarTrace(stateCovarMatrices, roundPrecision=3)[source]¶
Get a one-dimensional array of state covariance traces after running Consenrich
- Parameters:
stateCovarMatrices (np.ndarray) – Estimated state covariance matrices \(\widetilde{\mathbf{P}}_{[i]}\)
roundPrecision (int)
- Returns:
A one-dimensional numpy array of the traces of the state covariance matrices.
- Return type:
npt.NDArray[np.float32]
- consenrich.core.getPrecisionWeightedResidual(postFitResiduals, matrixMunc, roundPrecision=3, stateCovarSmoothed=None)[source]¶
Get a one-dimensional precision-weighted array residuals after running Consenrich.
Applies an inverse-variance weighting of the post-fit residuals \(\widetilde{\mathbf{y}}_{[i]}\) and returns a one-dimensional array of “precision-weighted residuals”. The state-level uncertainty can also be incorporated given stateCovarSmoothed.
- Parameters:
postFitResiduals (np.ndarray) – Post-fit residuals \(\widetilde{\mathbf{y}}_{[i]}\) from
runConsenrich()
.matrixMunc (np.ndarray) – An \(m \times n\) sample-by-interval matrix – At genomic intervals \(i = 1,2,\ldots,n\), the respective length-\(m\) column is \(\mathbf{R}_{[i,11:mm]}\). That is, the observation noise levels for each sample \(j=1,2,\ldots,m\) at interval \(i\). To keep memory usage minimal matrixMunc is not returned in full or computed in in
runConsenrich()
. If using Consenrich programmatically, runconsenrich.core.getMuncTrack()
for each sample’s count data (rows in the matrix output ofreadBamSegments()
).stateCovarSmoothed (Optional[np.ndarray]) – Smoothed state covariance matrices \(\widetilde{\mathbf{P}}_{[i]}\) from
runConsenrich()
.roundPrecision (int)
- Returns:
A one-dimensional array of “precision-weighted residuals”
- Return type:
npt.NDArray[np.float32]
- consenrich.core.getMuncTrack(chromosome, intervals, stepSize, rowValues, minR, maxR, useALV, useConstantNoiseLevel, noGlobal, localWeight, globalWeight, approximationWindowLengthBP, lowPassWindowLengthBP, returnCenter, sparseMap=None, lowPassFilterType='median')[source]¶
Get observation noise variance \(R_{[:,jj]}\) for the sample \(j\).
Combines a local ALV estimate (see
getAverageLocalVarianceTrack()
) with an optional global component. IfuseALV
is True, only the ALV is used. IfuseConstantNoiseLevel
is True, a constant track set to the global mean is used. When asparseMap
is provided, local values are aggregated over nearby ‘sparse’ regions before mixing with the global component.For heterochromatic or repressive marks (H3K9me3, H3K27me3, MNase-seq, etc.), consider setting useALV=True to prevent inflated sample-level noise estimates.
- Parameters:
chromosome (str) – Tracks are approximated for this chromosome.
intervals (ndarray) – Genomic intervals for which to compute the noise track.
stepSize (int) – See
countingParams
.rowValues (np.ndarray) – Read-density-based values for the sample \(j\) at the genomic intervals \(i=1,2,\ldots,n\).
minR (float) – See
observationParams
.maxR (float) – See
observationParams
.useALV (bool) – See
observationParams
.useConstantNoiseLevel (bool) – See
observationParams
.noGlobal (bool) – See
observationParams
.localWeight (float) – See
observationParams
.globalWeight (float) – See
observationParams
.approximationWindowLengthBP (int) – See
observationParams
and/orgetAverageLocalVarianceTrack()
.lowPassWindowLengthBP (int) – See
observationParams
and/orgetAverageLocalVarianceTrack()
.sparseMap (Optional[dict[int, int]]) – Optional mapping (dictionary) of interval indices to the nearest sparse regions. See
getSparseMap()
.lowPassFilterType (Optional[str]) – The type of low-pass filter to use in average local variance track (e.g., ‘median’, ‘mean’).
returnCenter (bool)
- Returns:
A one-dimensional numpy array of the observation noise track for the sample \(j\).
- Return type:
npt.NDArray[np.float32]
- consenrich.core.sparseIntersection(chromosome, intervals, sparseBedFile)[source]¶
Returns intervals in the chromosome that overlap with the sparse features.
Not relevant if observationParams.useALV is True.
- Parameters:
chromosome (str) – The chromosome name.
intervals (np.ndarray) – The genomic intervals to consider.
sparseBedFile (str) – Path to the sparse BED file.
- Returns:
A numpy array of start positions of the sparse features that overlap with the intervals
- Return type:
np.ndarray[Tuple[Any], np.dtype[Any]]
- consenrich.core.adjustFeatureBounds(feature, stepSize)[source]¶
Adjust the start and end positions of a BED feature to be centered around a step.
- Parameters:
feature (Interval)
stepSize (int)
- Return type:
Interval
- consenrich.core.getSparseMap(chromosome, intervals, numNearest, sparseBedFile)[source]¶
Build a map between each genomic interval and numNearest sparse features
- Parameters:
chromosome (str) – The chromosome name. Note, this function only needs to be run once per chromosome.
intervals (np.ndarray) – The genomic intervals to map.
numNearest (int) – The number of nearest sparse features to consider
sparseBedFile (str) – path to the sparse BED file.
- Returns:
A dictionary mapping each interval index to the indices of the nearest sparse regions.
- Return type:
dict[int, np.ndarray]
- consenrich.core.getBedMask(chromosome, bedFile, intervals)[source]¶
Return a 1/0 mask for intervals overlapping a sorted and merged BED file.
This function is a wrapper for
cconsenrich.cbedMask()
.- Parameters:
chromosome (str) – The chromosome name.
intervals (np.ndarray) – chromosome-specific, sorted, non-overlapping start positions of genomic intervals. Each interval is assumed stepSize.
bedFile (str) – Path to a sorted and merged BED file
- Returns:
An intervals-length mask s.t. True indicates the interval overlaps a feature in the BED file.
- Return type:
np.ndarray
consenrich.detrorm
¶
- consenrich.detrorm.getScaleFactor1x(bamFile, effectiveGenomeSize, readLength, excludeChroms, chromSizesFile, samThreads)[source]¶
Generic normalization factor based on effective genome size and number of mapped reads in non-excluded chromosomes.
- Parameters:
bamFile (str) – See
consenrich.core.inputParams
.effectiveGenomeSize (int) – Effective genome size in base pairs. See
consenrich.constants.getEffectiveGenomeSize()
.readLength (int) – Read length (base pairs). See
consenrich.core.getReadLength()
.excludeChroms (List[str]) – List of chromosomes to exclude from the analysis.
chromSizesFile (str) – Path to the chromosome sizes file.
samThreads (int) – See
consenrich.core.samParams
.
- Returns:
Scale factor for 1x normalization.
- Return type:
float
- consenrich.detrorm.getScaleFactorPerMillion(bamFile, excludeChroms)[source]¶
Generic normalization factor based on number of mapped reads in non-excluded chromosomes.
- Parameters:
bamFile (str) – See
consenrich.core.inputParams
.excludeChroms (List[str]) – List of chromosomes to exclude when counting mapped reads.
- Returns:
Scale factor accounting for number of mapped reads (only).
- Return type:
float
- consenrich.detrorm.getPairScaleFactors(bamFileA, bamFileB, effectiveGenomeSizeA, effectiveGenomeSizeB, readLengthA, readLengthB, excludeChroms, chromSizesFile, samThreads, scaleDown=True)[source]¶
Get scaling constants that normalize two alignment files to each other (e.g. ChIP-seq treatment and control) with respect to sequence coverage.
- Parameters:
bamFileA (str) – Path to the first BAM file.
bamFileB (str) – Path to the second BAM file.
effectiveGenomeSizeA (int) – Effective genome size for the first BAM file.
effectiveGenomeSizeB (int) – Effective genome size for the second BAM file.
readLengthA (int) – Read length for the first BAM file.
readLengthB (int) – Read length for the second BAM file.
excludeChroms (List[str]) – List of chromosomes to exclude from the analysis.
chromSizesFile (str) – Path to the chromosome sizes file.
samThreads (int) – Number of threads to use for reading BAM files.
scaleDown (bool)
- Returns:
A tuple containing the scale factors for the first and second BAM files.
- Return type:
Tuple[float, float]
- consenrich.detrorm.detrendTrack(values, stepSize, detrendWindowLengthBP, useOrderStatFilter, usePolyFilter, detrendTrackPercentile, detrendSavitzkyGolayDegree)[source]¶
Detrend tracks using either an order statistic filter or a polynomial filter.
- Parameters:
values (np.ndarray) – Values to detrend.
stepSize (int) – see
consenrich.core.countingParams
.detrendWindowLengthBP (int) – See
consenrich.core.detrendParams
.useOrderStatFilter (bool) – Whether to use a sliding order statistic filter.
usePolyFilter (bool) – Whether to use a sliding polynomial/least squares filter.
detrendTrackPercentile (float) – Percentile to use for the order statistic filter.
detrendSavitzkyGolayDegree (int) – Degree of the polynomial for the Savitzky-Golay/Polynomial filter.
- Returns:
Detrended values.
- Return type:
np.ndarray
- Raises:
ValueError – If the detrend window length is not greater than 3 times the step size or if the values length is less than the detrend window length.
consenrich.constants
¶
Important
This module is provided for convenience. If a genome is not listed here, users can still specify resources (e.g., sizes file, blacklist) manually.
- consenrich.constants.getEffectiveGenomeSize(genome, readLength)[source]¶
Get the effective genome size for a given genome and read length.
- Parameters:
genome (str) – Name of the genome. See
consenrich.constants.resolveGenomeName()
andconsenrich.core.genomeParams
.readLength (int) – Length of the reads. See
consenrich.core.getReadLength()
.
- Raises:
ValueError – If the genome is not recognized or if the read length is not available for the genome.
- Returns:
Effective genome size in base pairs.
- Return type:
int
- consenrich.constants.getGenomeResourceFile(genome, fileType, dir_='data')[source]¶
Get the path to a genome resource file.
- Parameters:
genome (str) – the genome assembly. See
consenrich.constants.resolveGenomeName()
andconsenrich.core.genomeParams
.fileType (str) – One of ‘sizes’, ‘blacklist’, ‘sparse’.
dir_ (str)
- Returns:
Path to the resource file.
- Return type:
str
- Raises:
ValueError – If not a sizes, blacklist, or sparse file.
FileNotFoundError – If the resource file does not exist.
- consenrich.constants.resolveGenomeName(genome)[source]¶
Standardize the genome name for consistency :param genome: Name of the genome. See
consenrich.core.genomeParams
. :type genome: str :return: Standardized genome name. :rtype: str :raises ValueError: If the genome is not recognized.- Parameters:
genome (str)
- Return type:
str
consenrich.matching
¶
(Experimental) Detect genomic regions showing both enrichment and non-random structure
Take a set of genomic intervals \(i=1,2,\ldots,n\), each spanning \(L\) base pairs,
and a ‘consensus’ signal track defined over the genomic intervals, estimated from multiple independent samples’ high-throughput functional genomics sequencing data:
In this documentation, we assume \(\widetilde{\mathbf{x}}\) is the Consenrich ‘primary state estimate’.
Aim: Determine a set of ‘structured’ peak-like genomic regions where the consensus signal track \(\widetilde{\mathbf{x}}\) exhibits both:
Enrichment (large relative amplitude)
Non-random structure (polynomial, oscillatory, etc.)
Why: Prioritizing genomic regions that are both enriched and show a prescribed level of structure is appealing for several reasons. Namely,
Improved confidence that the identified genomic regions are not due to stochastic noise, which is characteristically unstructured.
Targeted detection of biologically relevant signal patterns in a given assay (e.g., see related works analyzing peak-shape Cremona et al., 2015, Parodi et al., 2017)
Speed: Runs genome-wide in seconds/minutes using an efficient Fast Fourier Transform (FFT)-based implementation.
In the case of Consenrich, that the primary signal estimates in \(\{\widetilde{x}_{[i]}\}^{n}\) are reinforced by multiple samples and account for relevant sources of uncertainty is particularly advantageous–it provides a more reliable basis for evaluating legitimate structure and identifying high-resolution features.
Algorithm Overview¶
To detect structured peaks, we run an approach akin to matched filtering, with templates derived from discrete wavelet or scaling functions.
Define the response sequence as the cross-correlation between the consensus signal \(\widetilde{\mathbf{x}}\) and the template \(\boldsymbol{\xi}\):
(Equivalently, the convolution of \(\widetilde{\mathbf{x}}\) with the time-reversed template \(\boldsymbol{\xi}'\), where \(\xi'_{[t]} = \xi_{[T-t+1]}\).)
Where \(\mathcal{R}_{[i]}\) is large, there is greater evidence that the signal \(\widetilde{\mathbf{x}}\) is enriched and exhibits structure that is agreeable to the template \(\boldsymbol{\xi}\).
Detection Thresholds
matchingParams.alpha
defines the (\(1 - \alpha\))-quantile threshold of the block-maxima null on the response sequence, i.e., the cross-correlation between the Consenrich track and template (default
0.05
).This null is constructed by sampling blocks in the response sequence and recording the maximum value within each block.
The size of each sampled block is drawn from a (truncated) geometric distribution with an expected value equal to the template length/minimum feature size
matchingParams.minSignalAtMaxima
(Optional)Can be an absolute value (float) or string
"q:<quantileValue>"
to require the value at response-maxima to exceed the given quantile of non-zero values (defaultq:0.75
).This threshold is applied after tempering the dynamic range with an arsinh transform (i.e., \(\sinh^{-1}(x)\)).
By default, the upper quartile of non-zero values is used: q:0.75
To disable: set to a negative numeric value.
matchingParams.minMatchLengthBP
: (Optional)Minimum feature length in bp to qualify as a match (default
250
).To disable: set to a negative numeric value.
See also
Sections Getting Started and/or Additional Examples and Benchmarking which include browser shots demonstrating qualitative behavior of this feature.
In the following browser snapshot, we sweep several key matching parameters.
As opposed to the configs in Additional Examples and Benchmarking, here, we set matchingParams.merge: false
to clearly illustrate contrasting results.

—
Generic Defaults
matchingParams.templateNames: [haar, db2]
matchingParams.cascadeLevels: [2]
matchingParams.minMatchLengthBP: 250
matchingParams.mergeGapBP: 50
matchingParams.alpha: 0.05
The matching config above is not encompassing but should provide a strong starting point for standard use cases.
Some basic guidelines:
For broad marks, consider larger more-symmetric templates and/or cascadeLevels` (e.g., sym4) and a larger minimum feature length (e.g.,
500
)For sharp marks, consider smaller templates and/or lower cascadeLevels (e.g., haar, db2) and a smaller minimum feature length (e.g.,
150
)
See Additional Examples and Benchmarking for more specific guidance.
—
Command-Line Usage
The matching algorithm can be run directly at the command-line on existing Consenrich-generated bedGraph files. For instance, in the ChIP-seq experiment from Getting Started: Minimal Example,
% consenrich \
--match-bedGraph consenrichOutput_demoHistoneChIPSeq_state.bedGraph \
--match-template db3 \
--match-level 2 \
--match-alpha 0.01
This avoids a full run of Consenrich and requires minimal runtime: See consenrich -h for more details.
- consenrich.matching.matchWavelet(chromosome, intervals, values, templateNames, cascadeLevels, iters, alpha=0.05, minMatchLengthBP=250, maxNumMatches=100000, minSignalAtMaxima='q:0.75', randSeed=42, recenterAtPointSource=True, useScalingFunction=True, excludeRegionsBedFile=None, weights=None)[source]¶
Detect structured peaks by cross-correlating Consenrich tracks with wavelet- or scaling-function templates.
See `matching` for an overview of the approach.
- Parameters:
chromosome (str) – Chromosome name for the input intervals and values.
values (npt.NDArray[np.float64]) – ‘Consensus’ signal estimates derived from multiple samples, e.g., from Consenrich.
templateNames (List[str]) – A list of str values – wavelet bases used for matching, e.g., [haar, db2, sym4]
cascadeLevels (List[int]) – A list of int values – the number of cascade iterations used for approximating the scaling/wavelet functions.
iters (int) – Number of random blocks to sample in the response sequence while building an empirical null to test significance. See
cconsenrich.csampleBlockStats()
.alpha (float) – Primary significance threshold on detected matches. Specifically, the \(1 - \alpha\) quantile of an empirical null distribution. The empirical null distribution is built from cross-correlation values over randomly sampled blocks.
minMatchLengthBP (int) – Within a window of minMatchLengthBP length (bp), relative maxima in the signal-template convolution must be greater in value than others to qualify as matches. Set to a negative value to disable this filter.
minSignalAtMaxima (Optional[str | float]) – Secondary significance threshold coupled with alpha. Require the signal value at relative maxima in the response sequence to be greater than this threshold. Comparisons are made in log-scale. If a float value is provided, the minimum signal value must be greater than this (absolute) value. Set to a negative value to disable the threshold. If a str value is provided, looks for ‘q:quantileValue’, e.g., ‘q:0.75’. The threshold is then set to the corresponding quantile of the non-zero signal estimates. Defaults to str value ‘q:0.75’ — the 75th percentile of signal values.
useScalingFunction (bool) – If True, use (only) the scaling function to build the matching template. If False, use (only) the wavelet function.
excludeRegionsBedFile (Optional[str]) – A BED file with regions to exclude from matching
intervals (ndarray[tuple[Any, ...], dtype[int]])
maxNumMatches (int | None)
randSeed (int)
recenterAtPointSource (bool)
weights (ndarray[tuple[Any, ...], dtype[float64]] | None)
- Seealso:
consenrich.core.matchingParams
,cconsenrich.csampleBlockStats()
, `matching`- Return type:
DataFrame
- consenrich.matching.mergeMatches(filePath, mergeGapBP=50)[source]¶
Merge overlapping or nearby structured peaks (matches) in a narrowPeak file.
Where an overlap occurs within mergeGapBP base pairs, the feature with the greatest signal defines the new summit/pointSource
- Parameters:
filePath (str) – narrowPeak file containing matches detected with
consenrich.matching.matchWavelet()
mergeGapBP (int) – Maximum gap size (in base pairs) to consider for merging
- Seealso:
Cython functions: consenrich.cconsenrich
¶
Several computationally burdensome tasks are written in Cython for improved efficiency.
- consenrich.cconsenrich.creadBamSegment(bamFile, chromosome, start, end, stepSize, readLength, oneReadPerBin, samThreads, samFlagExclude, shiftForwardStrand53=0, shiftReverseStrand53=0, extendBP=0, maxInsertSize=1000, pairedEndMode=0, inferFragmentLength=0)¶
Count reads in a BAM file for a given chromosome
- consenrich.cconsenrich.cinvertMatrixE(muncMatrixIter, priorCovarianceOO)¶
Invert the residual covariance matrix during the forward pass.
- Parameters:
muncMatrixIter (cnp.ndarray[cnp.float32_t, ndim=1]) – The diagonal elements of the covariance matrix at a given genomic interval.
priorCovarianceOO (cnp.float32_t) – The a priori ‘primary’ state variance \(P_{[i|i-1,11]}\).
- Returns:
The inverted covariance matrix.
- Return type:
cnp.ndarray[cnp.float32_t, ndim=2]
- consenrich.cconsenrich.updateProcessNoiseCovariance(matrixQ, matrixQCopy, dStat, dStatAlpha, dStatd, dStatPC, inflatedQ, maxQ, minQ)¶
Adjust process noise covariance matrix \(\mathbf{Q}_{[i]}\)
- Parameters:
matrixQ – Current process noise covariance
matrixQCopy – A copy of the initial original covariance matrix \(\mathbf{Q}_{[.]}\)
inflatedQ – Flag indicating if the process noise covariance is inflated
- Returns:
Updated process noise covariance matrix and inflated flag
- Return type:
tuple
- consenrich.cconsenrich.csampleBlockStats(intervals, values, expectedBlockSize, iters, randSeed, excludeIdxMask)¶
Sample contiguous blocks in the response sequence (xCorr), record maxima, and repeat.
Used to build an empirical null distribution and determine significance of response outputs. The size of blocks is drawn from a truncated geometric distribution, preserving rough equality in expectation but allowing for variability to account for the sampling across different phases in the response sequence.
- Parameters:
values (cnp.ndarray[cnp.float64_t, ndim=1]) – The response sequence to sample from.
expectedBlockSize (int) – The expected size (geometric) of the blocks to sample.
iters (int) – The number of blocks to sample.
randSeed (int) – Random seed for reproducibility.
- Returns:
An array of sampled block maxima.
- Return type:
cnp.ndarray[cnp.float64_t, ndim=1]
- Seealso:
- consenrich.cconsenrich.cSparseAvg(trackALV, sparseMap)¶
Fast access and average of numNearest sparse elements.
See
consenrich.core.getMuncTrack()
- Parameters:
trackALV (float[::1]) – See
consenrich.core.getAverageLocalVarianceTrack()
sparseMap (dict[int, np.ndarray]) – See
consenrich.core.getSparseMap()
- Returns:
array of mena(‘nearest local variances’) same length as trackALV
- Return type:
cnp.ndarray[cnp.float32_t, ndim=1]
- consenrich.cconsenrich.cgetFragmentLength(bamFile, chromosome, start, end, samThreads=0, samFlagExclude=3844, maxInsertSize=1000, minInsertSize=50, iters=1000, blockSize=5000, fallBack=147, rollingChunkSize=250, lagStep=5, earlyExit=100, randSeed=42)¶
- consenrich.cconsenrich.cbedMask(chromosome, bedFile, intervals, stepSize)¶
Return a 1/0 mask for intervals overlapping a sorted and merged BED file.
- Parameters:
chromosome (str) – Chromosome name.
bedFile (str) – Path to a sorted and merged BED file.
intervals (cnp.ndarray[cnp.uint32_t, ndim=1]) – Array of sorted, non-overlapping start positions of genomic intervals. Each interval is assumed stepSize.
stepSize (int32_t) – Step size between genomic positions in intervals.
- Returns:
A mask s.t. 1 indicates the corresponding interval overlaps a BED region.
- Return type:
cnp.ndarray[cnp.uint8_t, ndim=1]