API Reference

consenrich.core

The core module implements the main features of Consenrich and defines key parameter classes for a consistent namespace.

Note

Many parameters are documented here for completeness and do not need to be tuned in common use-cases. See Notation for definitions of key variables and parameters used throughout the documentation.

class consenrich.core.processParams(deltaF, minQ, maxQ, offDiagQ)[source]

Parameters related to the process model of Consenrich.

The consensus epigenomic signal is modeled explicitly with a simple ‘level + slope’ process.

Parameters:
  • deltaF (float) – Controls the integration step size between the signal ‘slope’ \(\dot{x}_{[i]}\) and the signal ‘level’ \(x_{[i]}\).

  • minQ (float) – Minimum process noise scale (diagonal in \(\mathbf{Q}_{[i]}\)) on the primary state variable (signal level). If minQ < 0 (default), a small value scales the minimum observation noise level (observationParams.minR) and is used for numerical stability.

  • maxQ (float) – Maximum process noise scale. If maxQ < 0 (default), no effective upper bound is enforced.

  • offDiagQ (float) – Off-diagonal value in the process noise covariance, \(\mathbf{Q}_{[i,01]}\).

Seealso:

consenrich.core.runConsenrich()

class consenrich.core.plotParams(plotPrefix=None, plotStateEstimatesHistogram=False, plotMWSRHistogram=False, plotHeightInches=6.0, plotWidthInches=8.0, plotDPI=300, plotDirectory=None)[source]

(Experimental) Parameters related to plotting filter results and diagnostics.

Parameters:
  • plotPrefix (str or None) – Prefix for output plot filenames.

  • plotStateEstimatesHistogram (bool) – If True, plot a histogram of post-fit primary state estimates.

  • plotMWSRHistogram (bool) – If True, plot a histogram of post-fit weighted squared residuals (MWSR).

  • plotHeightInches (float) – Height of output plots in inches.

  • plotWidthInches (float) – Width of output plots in inches.

  • plotDPI (int) – DPI of output plots (PNG).

  • plotDirectory (str or None) – Directory where plots will be written.

Seealso:

plotStateEstimatesHistogram(), plotMWSRHistogram(), outputParams

class consenrich.core.observationParams(minR, maxR, samplingIters, samplingBlockSizeBP, binQuantileCutoff, EB_minLin, EB_use, EB_setNu0, EB_setNuL, pad)[source]

Parameters related to the observation model of Consenrich.

The observation model is used to integrate sequence alignment data from multiple replicates while accounting for both region- and replicate-specific noise.

Parameters:
  • minR (float | None) – Genome-wide lower bound for replicate-specific observation noise scale.

  • maxR (float | None) – Genome-wide upper bound for the replicate-specific observation noise scale.

  • samplingIters (int | None) – Number of blocks (within-contig) to sample while building the empirical absMean-variance trend in consenrich.core.fitVarianceFunction().

  • samplingBlockSizeBP (int | None) – Expected size (in bp) of contiguous blocks that are sampled when fitting AR1 parameters to estimate \((\lvert \mu_b \rvert, \sigma^2_b)\) pairs. Note, during sampling, each block’s size (unit: genomic intervals) is drawn from a truncated \(\textsf{Geometric}(p=1/\textsf{samplingBlockSize})\) distribution to reduce artifacts from fixed-size blocks. If None or < 1, this value is inferred using consenrich.core.getContextSize().

  • binQuantileCutoff (float | None) – When fitting the variance function, pairs \((\lvert \mu_b \rvert, \sigma^2_b)\) are binned by their (absolute) means. This parameter sets the quantile of variances within each bin to use when fitting the global mean-variance trend. Increasing this value toward 1.0 can raise the prior trend for measurement uncertainty and yield stiffer signal estimates overall.

  • EB_minLin (float | None) – Require that the fitted trend in consenrich.core.getMuncTrack() satisfy: \(\textsf{variance} \geq \textsf{minLin} \cdot |\textsf{mean}|\). See fitVarianceFunction().

  • EB_use (bool | None) – If True, shrink ‘local’ noise estimates to a prior trend dependent on amplitude. See consenrich.core.getMuncTrack().

  • EB_setNu0 (int | None) – If provided, manually set \(\nu_0\) to this value (rather than computing via consenrich.core.EB_computePriorStrength()).

  • EB_setNuL (int | None) – If provided, manually set local model df, \(\nu_L\), to this value.

  • pad (float | None) – A small constant added to the observation noise variance estimates for conditioning.

Seealso:

consenrich.core.getMuncTrack(), consenrich.core.fitVarianceFunction(), consenrich.core.EB_computePriorStrength(), consenrich.cconsenrich.cblockScaleEM()
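As an illustration of the truncated-geometric block sampling described for samplingBlockSizeBP (a minimal NumPy sketch; sample_block_sizes is a hypothetical helper, not part of the library):

```python
import numpy as np

def sample_block_sizes(expected_size, n_blocks, max_size=None, seed=42):
    """Draw block sizes (in intervals) from a truncated Geometric(p=1/expected_size)."""
    rng = np.random.default_rng(seed)
    p = 1.0 / expected_size
    sizes = rng.geometric(p, size=n_blocks)
    if max_size is not None:
        sizes = np.minimum(sizes, max_size)  # truncate to avoid extreme block sizes
    return np.maximum(sizes, 1)

sizes = sample_block_sizes(expected_size=50, n_blocks=10000)
# sample mean is near the expected block size (1/p = 50 intervals)
```

Drawing sizes rather than fixing them reduces artifacts from a rigid block grid while keeping the expected block size intact.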

class consenrich.core.stateParams(stateInit, stateCovarInit, boundState, stateLowerBound, stateUpperBound, conformalRescale, conformalAlpha, conformal_numIters, conformalFinalRefit)[source]

Parameters related to state variables and covariances.

Parameters:
  • stateInit (float) – Initial value of the ‘primary’ state/signal at the first genomic interval: \(x_{[1]}\)

  • stateCovarInit (float) – Initial state covariance scale. Note, the initial state uncertainty \(\mathbf{P}_{[1]}\) is a multiple of the identity matrix \(\mathbf{I}\). Final results are typically insensitive to this parameter, since the filter effectively ‘forgets’ its initialization after processing a moderate number of intervals and applying backward smoothing.

  • boundState (bool) – If True, the primary state estimate for \(x_{[i]}\) is reported within stateLowerBound and stateUpperBound. Note that the internal filtering is unaffected.

  • stateLowerBound (float) – Lower bound for the state estimate.

  • stateUpperBound (float) – Upper bound for the state estimate.

  • conformalRescale (bool | None) – If True, perform replicate-heldout split-conformal calibration of uncertainty for a hypothetical new replicate at each genomic interval i. This is useful if the usual posterior uncertainty estimate is unsatisfactory due to its dependence on the assumed model. Calibration scores are formed by holding out one set of replicates (calibration replicates), fitting the model on the remaining replicates (proper-training replicates), and comparing held-out observations to the fitted predictor. Under replicate exchangeability, this routine can yield a finite-sample marginal coverage guarantee for the corresponding prediction intervals at the requested miscoverage level.

  • conformalAlpha (float | None) – Target split-conformal miscoverage level \(\alpha\) in \((0, 1)\). When conformalRescale is True, the inflation factor is chosen from the \((1-\alpha)\) empirical quantile of calibration scores so that the induced prediction intervals for a hypothetical new replicate have marginal coverage at least \(1-\alpha\).

  • conformal_numIters (int | None) – Number of replicate-heldout calibration rounds used to estimate the inflation factor. If set > 1, the routine repeats the replicate split multiple times and aggregates calibration scores across rounds to reduce Monte Carlo variability in the estimated quantile. Note, however, this aggregation across rounds violates the exact single-split reasoning for finite-sample marginal coverage.

  • conformalFinalRefit (bool | None) – If True, refit the final model after selecting the conformal inflation factor. Note that this breaks the standard split-conformal protocol, which requires that the predictor used at test time be the same “frozen” predictor used to generate calibration scores. If strict guarantees are important and you have enough replicates to allocate a meaningful calibration set, we suggest setting conformalFinalRefit=False. In all cases, the Consenrich implementation can only inflate the original posterior uncertainty, avoiding anti-conservative reporting.
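The split-conformal machinery above can be sketched as follows, under the stated assumptions (standardized calibration scores, the finite-sample quantile adjustment, and inflation-only behavior); conformal_inflation is a hypothetical helper, not the library’s API:

```python
import numpy as np

def conformal_inflation(residuals, posterior_sd, alpha=0.05):
    """Sketch: standardized calibration scores |r|/sd; the finite-sample-adjusted
    (1 - alpha) empirical quantile gives an inflation factor applied to the
    posterior uncertainty (never deflating below 1)."""
    scores = np.abs(residuals) / posterior_sd
    n = scores.size
    # finite-sample quantile level used in split conformal prediction
    q_level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    factor = np.quantile(scores, q_level)
    return max(float(factor), 1.0)  # only inflate, matching the note above

rng = np.random.default_rng(0)
r = rng.normal(0.0, 2.0, size=500)                 # stand-in held-out residuals
factor = conformal_inflation(r, posterior_sd=np.ones(500))
```

Multiplying the posterior standard deviation by this factor widens the reported intervals until roughly \(1-\alpha\) of held-out observations would have been covered.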

class consenrich.core.inputParams(bamFiles, bamFilesControl, pairedEnd)[source]

Parameters related to the input data for Consenrich.

Parameters:
  • bamFiles (List[str]) – A list of paths to distinct coordinate-sorted and indexed BAM files.

  • bamFilesControl (List[str], optional) – A list of paths to distinct coordinate-sorted and indexed control BAM files. e.g., IgG control inputs for ChIP-seq.

  • pairedEnd (Optional[bool]) – Deprecated: Paired-end/Single-end is inferred automatically from the alignment flags in input BAM files.

class consenrich.core.outputParams(convertToBigWig, roundDigits, writeUncertainty, writeMWSR, writeJackknifeSE, applyJackknife)[source]

Parameters related to output files.

Parameters:
  • convertToBigWig (bool) – If True, output bedGraph files are converted to bigWig format.

  • roundDigits (int) – Number of decimal places to round output values (bedGraph)

  • writeUncertainty (bool) – If True, write the posterior state uncertainty \(\sqrt{\widetilde{P}_{[00,i]}}\) to bedGraph.

  • writeMWSR (bool) –

    If True, write a per-interval mean weighted, squared, studentized post-fit residual (MWSR), computed using the smoothed state and its posterior variance from the final filter/smoother pass.

    Let \(r_{[j,i]} = \texttt{matrixData}_{[j,i]} - b_j - \widetilde{x}_{[i]}\) denote the post-fit residual for replicate \(j\) at interval \(i\), where \(\widetilde{x}_{[i]}\) is the consensus epigenomic signal level estimate. Let \(\widetilde{P}_{[00,i]}\) be the posterior variance of the first state variable (signal level) and let \(R_{[j,i]}\) be the observation noise variance for replicate \(j\) at interval \(i\). Then, the studentized squared residuals \(u^2_{[j,i]}\) and the mean weighted squared residuals \(\textsf{MWSR}[i]\) are recorded as:

    \[u^2_{[j,i]} = \frac{r_{[j,i]}^2 + \widetilde{P}_{[00,i]}}{R_{[j,i]}}, \qquad \textsf{MWSR}[i] = \frac{1}{m}\sum_{j=1}^{m} u^2_{[j,i]}.\]

    This is consistent with the EM routine in consenrich.cconsenrich.cblockScaleEM().

  • writeJackknifeSE (bool) – If True, write the standard error of the signal level estimates across jackknife replicates to bedGraph. This is only relevant if applyJackknife is True.

  • applyJackknife (bool) – If True, estimate replicate-level sampling variability in the signal level estimates with the jackknife.
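The MWSR definition given for writeMWSR can be reproduced directly in NumPy (an illustrative sketch; array names here are hypothetical):

```python
import numpy as np

def mwsr(matrix_data, bias, x_smooth, P00, R):
    """Mean weighted squared (studentized) post-fit residual per interval.
    matrix_data: (m, n) replicate observations; bias: (m,) replicate offsets;
    x_smooth: (n,) smoothed signal level; P00: (n,) posterior variance of the
    level; R: (m, n) observation noise variances."""
    r = matrix_data - bias[:, None] - x_smooth[None, :]   # post-fit residuals
    u2 = (r ** 2 + P00[None, :]) / R                      # studentized squares
    return u2.mean(axis=0)                                # average over replicates

m, n = 3, 4
data = np.zeros((m, n))
out = mwsr(data, np.zeros(m), np.zeros(n), np.ones(n), np.full((m, n), 2.0))
# r = 0 everywhere, so each u2 = (0 + 1)/2 = 0.5 and MWSR = 0.5 at every interval
```

Values far above 1 indicate replicate observations that disagree with the consensus more than the observation noise model predicts.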

class consenrich.core.genomeParams(genomeName, chromSizesFile, blacklistFile, sparseBedFile, chromosomes, excludeChroms, excludeForNorm)[source]

Specify assembly-specific resources and parameters.

Parameters:
  • genomeName (str) – If supplied, default resources for the assembly (sizes file, blacklist, and ‘sparse’ regions) in src/consenrich/data are used. ce10, ce11, dm6, hg19, hg38, mm10, mm39 have default resources available.

  • chromSizesFile (str) – A two-column TSV-like file with chromosome names and sizes (in base pairs).

  • blacklistFile (str, optional) – A BED file with regions to exclude.

  • sparseBedFile (str, optional) – A BED file with ‘sparse regions’ that are mutually exclusive with or devoid of the targeted signal. Used to estimate noise levels. See getMuncTrack().

  • chromosomes (List[str]) – A list of chromosome names to analyze. If None, all chromosomes in chromSizesFile are used.

  • excludeChroms (List[str]) – A list of chromosome names to exclude from analysis.

  • excludeForNorm (List[str]) – A list of chromosome names to exclude when summing up the ‘effective genome size’ during normalization. This can be useful to avoid bias from poorly assembled, highly repetitive, and/or sex-specific chromosomes (e.g., chrM, chrUn, etc.). For reference, see effective genome size.

class consenrich.core.countingParams(intervalSizeBP, backgroundBlockSizeBP, smoothSpanBP, scaleFactors, scaleFactorsControl, normMethod, fragmentLengths, fragmentLengthsControl, useTreatmentFragmentLengths, fixControl, globalWeight, asymPos, logOffset, logMult)[source]

Parameters related to counting aligned reads.

Parameters:
  • intervalSizeBP (int) – Length (bp) of each genomic interval \(i=1,\ldots,n\) comprising the larger genomic region (contig, chromosome, etc.). The default value is generally robust, but users may consider increasing this value when the expected feature size is large and/or sequencing depth is low (less than \(\approx 5\) million reads, depending on assay).

  • backgroundBlockSizeBP (int) – Length (bp) of blocks used to estimate local statistics (background, noise, etc.). If a negative value is provided (default), this value is inferred from the data using consenrich.core.getContextSize().

  • normMethod (str) – Method used to normalize read counts for sequencing depth / library size.

  • fragmentLengths (List[int], optional) – List of fragment lengths (bp) to use for extending reads from 5’ ends when counting single-end data.

  • fragmentLengthsControl (List[int], optional) – List of fragment lengths (bp) to use for extending reads from 5’ ends when counting single-end with control data.

  • useTreatmentFragmentLengths (bool, optional) – If True, use fragment lengths estimated from treatment BAM files for control BAM files, too.

  • fixControl (bool, optional) – If True, treatment samples are not upscaled, and control samples are not downscaled.

  • globalWeight (float, optional) – Relative weight assigned to the global ‘dense’ baseline when combining with local baseline estimates. Higher values increase the influence of the global baseline. For instance, globalWeight = 2 results in a weighted average where the global baseline contributes 2/3 of the final baseline estimate; whereas globalWeight = 1 results in equal weighting between global and local baselines. Users with input control samples may consider increasing this value to avoid redundancy (artificial local trends have presumably been accounted for in the control, leaving less signal to be modeled locally).

  • asymPos (float, optional) – Relative weight assigned to positive residuals to induce asymmetry in reweighting. Used during IRLS for the local baseline computation. Using smaller values near 0.0 will downweight peaks more and reduce the risk of removing true signal. Typical range is (0, 0.75].

  • logOffset (float, optional) – A small constant added to normalized read counts before log-transforming (pseudocount). For example, \(\log(x + 1)\) for logOffset = 1. Default is 1.0.

  • logMult (float, optional) – Multiplicative factor applied to log-scaled and normalized counts. For example, setting logMult = \(1/\log(2)\) yields log2-scaled counts after transformation, and setting logMult = 1.0 yields natural-log-scaled counts.

  • smoothSpanBP (int | None)

  • scaleFactors (List[float] | None)

  • scaleFactorsControl (List[float] | None)

Seealso:

consenrich.cconsenrich.cTransform()
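The logOffset/logMult transform described above amounts to \(\textsf{logMult} \cdot \log(x + \textsf{logOffset})\); a minimal NumPy sketch (not the Cython implementation in cTransform):

```python
import numpy as np

def log_transform(x, log_offset=1.0, log_mult=1.0):
    """Apply log_mult * log(x + log_offset) to normalized counts."""
    return log_mult * np.log(np.asarray(x, dtype=float) + log_offset)

x = np.array([0.0, 1.0, 3.0, 7.0])
natural = log_transform(x)                         # natural-log scale
log2 = log_transform(x, log_mult=1.0 / np.log(2))  # log2 scale: [0, 1, 2, 3]
```

The pseudocount keeps zero-count intervals finite, and logMult only changes the logarithm’s base, not the relative ordering of values.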

Treatment vs. Control Fragment Lengths in Single-End Data

For single-end data, cross-correlation-based estimates for fragment length in control inputs can be biased due to a comparative lack of structure in strand-specific coverage tracks.

This can create artifacts during counting, so it is common to use the estimated treatment fragment length for both treatment and control samples. The argument countingParams.useTreatmentFragmentLengths enables this behavior.

class consenrich.core.samParams(samThreads, samFlagExclude, oneReadPerBin, chunkSize, offsetStr='0,0', maxInsertSize=1000, pairedEndMode=0, inferFragmentLength=0, countEndsOnly=False, minMappingQuality=0, minTemplateLength=-1, fragmentLengths=None)[source]

Parameters related to reading BAM files

Parameters:
  • samThreads (int) – The number of threads to use for reading BAM files.

  • samFlagExclude (int) – Bitwise SAM flag; reads with any of these flag bits set are excluded.

  • oneReadPerBin (int) – If 1, only the interval with the greatest read overlap is incremented.

  • chunkSize (int) – Maximum number of intervals’ data to hold in memory before flushing to disk.

  • offsetStr (str) – A string of two comma-separated integers – first for the 5’ shift on forward strand, second for the 5’ shift on reverse strand.

  • maxInsertSize (int) – Maximum fragment length/insert size to consider when estimating fragment length.

  • inferFragmentLength (int) – Intended for single-end data: if > 0, the maximum-correlation lag (averaged) between strand-specific read tracks is taken as the fragment length estimate and used to extend reads from their 5’ ends. Ignored if the data are paired-end, countEndsOnly is True, or fragmentLengths is provided. This can be important when targeting broader marks (e.g., ChIP-seq H3K27me3).

  • countEndsOnly (Optional[bool]) – If True, only the 5’ ends of reads contribute to counting.

  • minMappingQuality (Optional[int]) – Minimum mapping quality (MAPQ) for reads to be counted.

  • fragmentLengths (List[int] | None) – If supplied, a list of estimated fragment lengths for each BAM file. These values are used to extend reads. Note, these values will override TLEN attributes in paired-end data.

  • pairedEndMode (int | None)

  • minTemplateLength (int | None)

Tip

For an overview of SAM flags, see https://broadinstitute.github.io/picard/explain-flags.html
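For example, reads can be screened against samFlagExclude with a bitwise AND; the common default 3844 excludes unmapped, secondary, QC-fail, duplicate, and supplementary alignments:

```python
# 3844 = 4 + 256 + 512 + 1024 + 2048
# (unmapped, secondary, QC fail, PCR/optical duplicate, supplementary)
SAM_FLAG_EXCLUDE = 3844

def is_excluded(flag, exclude=SAM_FLAG_EXCLUDE):
    """True if the read's SAM flag has any excluded bit set."""
    return (flag & exclude) != 0

kept = not is_excluded(99)       # 99 = paired, proper pair, mate reverse, first in pair
dropped = is_excluded(1024)      # PCR/optical duplicate
```

This is the standard SAM FLAG convention; the sketch just makes the bit arithmetic explicit.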

class consenrich.core.matchingParams(templateNames, cascadeLevels, iters, alpha, useScalingFunction, minMatchLengthBP, maxNumMatches, minSignalAtMaxima, merge, mergeGapBP, excludeRegionsBedFile, penalizeBy, randSeed, eps, massQuantileCutoff, methodFDR)[source]

Parameters related to the matching algorithm.

See `matching` for an overview of the approach.

Parameters:
  • templateNames (List[str]) – A list of str values – each entry references a mother wavelet (or its corresponding scaling function). e.g., [haar, db2]

  • cascadeLevels (List[int]) – Number of cascade iterations, or ‘levels’, used to define wavelet-based templates. Must have the same length as templateNames, with each entry aligned to the corresponding template. e.g., given templateNames [haar, db2], [2,2] would use 2 cascade levels for both templates.

  • iters (int) – Number of randomly drawn contiguous blocks used to build an empirical null for significance evaluation. See cconsenrich.csampleBlockStats().

  • alpha (float) – Primary significance threshold on detected matches. Specifically, the minimum corrected empirical p-value approximated from randomly sampled blocks in the response sequence.

  • minMatchLengthBP (Optional[int]) – Within a window of minMatchLengthBP length (bp), relative maxima in the signal-template convolution \(\mathcal{R}_{[\ast]}\) must be greater in value than others to qualify as matches. If set to a value less than 1, the minimum length is determined via consenrich.matching.autoMinLengthIntervals() (default behavior).

  • minSignalAtMaxima (Optional[str | float]) – Secondary/optional threshold coupled with alpha. Requires the signal value, \(\widetilde{x}_{[i^*]}\), at relative maxima in the response sequence, \(\mathcal{R}_{[i^*]}\), to be greater than this threshold. If a str value is provided, looks for ‘q:quantileValue’, e.g., ‘q:0.90’. The threshold is then set to the corresponding quantile of the non-zero signal estimates in the distribution of transformed values.

  • useScalingFunction (bool) – If True, use (only) the scaling function to build the matching template. If False, use (only) the wavelet function.

  • eps (float) – Tolerance parameter for relative maxima detection in the response sequence. Set to zero to enforce strict inequalities when identifying discrete relative maxima.

  • massQuantileCutoff (float) – Quantile cutoff for filtering initial (unmerged) matches based on their ‘mass’ ((avgSignal*length)/intervalLength). To disable, set to a value < 0.

  • maxNumMatches (int | None)

  • merge (bool | None)

  • mergeGapBP (int | None)

  • excludeRegionsBedFile (str | None)

  • penalizeBy (str | None)

  • randSeed (int | None)

  • methodFDR (str | None)

Seealso:

cconsenrich.csampleBlockStats(), `matching`, outputParams.
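To illustrate the matching idea (a simplified sketch: a block-shaped scaling-function template and strict relative maxima of the response; the library’s templates come from wavelet cascade iterations, and matches are additionally significance-tested):

```python
import numpy as np

def match_template(signal, template):
    """Cross-correlate signal with a template and locate strict relative
    maxima of the response (simplified; no significance testing here)."""
    response = np.correlate(signal, template, mode="same")
    interior = response[1:-1]
    is_max = (interior > response[:-2]) & (interior > response[2:])
    return response, np.flatnonzero(is_max) + 1

sig = np.zeros(20)
sig[8:12] = 1.0                  # a block-shaped feature
haar_scaling = np.ones(4) / 2.0  # constant-block template (Haar scaling function)
resp, maxima = match_template(sig, haar_scaling)
```

The response peaks where the template best overlaps the feature; thresholds like minSignalAtMaxima and the empirical null then decide which maxima qualify as matches.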

consenrich.core.readBamSegments(bamFiles, chromosome, start, end, intervalSizeBP, readLengths, scaleFactors, oneReadPerBin, samThreads, samFlagExclude, offsetStr='0,0', maxInsertSize=1000, pairedEndMode=0, inferFragmentLength=0, countEndsOnly=False, minMappingQuality=0, minTemplateLength=-1, fragmentLengths=None)[source]

Calculate coverage tracks for each BAM file.

Parameters:
  • bamFiles (List[str]) – See inputParams.

  • chromosome (str) – Chromosome to read.

  • start (int) – Start position of the genomic segment.

  • end (int) – End position of the genomic segment.

  • readLengths (List[int]) – List of read lengths for each BAM file.

  • scaleFactors (List[float]) – List of scale factors for each BAM file.

  • intervalSizeBP (int) – See countingParams.

  • oneReadPerBin (int) – See samParams.

  • samThreads (int) – See samParams.

  • samFlagExclude (int) – See samParams.

  • offsetStr (str) – See samParams.

  • maxInsertSize (int) – See samParams.

  • pairedEndMode (int) – See samParams.

  • inferFragmentLength (int) – See samParams.

  • minMappingQuality (int) – See samParams.

  • minTemplateLength (Optional[int]) – See samParams.

  • fragmentLengths (Optional[List[int]]) – If supplied, a list of estimated fragment lengths for each BAM file. These values are used to extend reads. Note, these values will override TLEN attributes in paired-end data.

  • countEndsOnly (bool | None)

Return type:

ndarray[tuple[Any, …], dtype[float32]]

consenrich.core.getMuncTrack(chromosome, intervals, values, intervalSizeBP, samplingBlockSizeBP=None, samplingIters=25000, randomSeed=42, excludeMask=None, useEMA=True, excludeFitCoefs=None, binQuantileCutoff=0.5, EB_minLin=1.0, EB_use=True, EB_setNu0=None, EB_setNuL=None, EB_localQuantile=0.0, verbose=False, eps=0.01)[source]

Approximate initial sample-specific (M)easurement (unc)ertainty tracks

For an individual experimental sample (replicate), quantify positional data uncertainty over genomic intervals \(i=1,2,\ldots n\) spanning chromosome. These tracks (per-sample) comprise the matrixMunc input to runConsenrich(), \(\mathbf{R}[:,:] \in \mathbb{R}^{m \times n}\).

Variance is modeled as a function of the absolute mean signal level. For EB_use=True, local variance estimates are shrunk toward a signal-level-dependent global variance fit.

Parameters:
  • chromosome (str) – chromosome/contig name

  • values (np.ndarray) – normalized/transformed signal measurements over genomic intervals (e.g., consenrich.cconsenrich.cTransform() output)

  • intervals (np.ndarray) – genomic interval positions (start positions)

  • intervalSizeBP (int)

  • samplingBlockSizeBP (int | None)

  • samplingIters (int)

  • randomSeed (int)

  • excludeMask (ndarray | None)

  • useEMA (bool | None)

  • excludeFitCoefs (Tuple[int, ...] | None)

  • binQuantileCutoff (float)

  • EB_minLin (float)

  • EB_use (bool)

  • EB_setNu0 (int | None)

  • EB_setNuL (int | None)

  • EB_localQuantile (float)

  • verbose (bool)

  • eps (float)

Return type:

tuple[ndarray[tuple[Any, …], dtype[float32]], float]

See consenrich.core.observationParams for other parameters.
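The EB shrinkage behind getMuncTrack can be sketched as a degrees-of-freedom-weighted average of local and global variance estimates (an assumption-laden illustration of the \(\nu_0\)/\(\nu_L\) weighting, not the exact internal formula):

```python
import numpy as np

def eb_shrink(local_var, global_var, nu_local, nu0):
    """Shrink local variance estimates toward a global trend with prior
    strength nu0 (illustrative df-weighted average)."""
    return (nu0 * global_var + nu_local * local_var) / (nu0 + nu_local)

local = np.array([4.0, 0.5, 2.0])   # noisy local estimates
trend = np.array([1.0, 1.0, 1.0])   # global absMean-variance trend at these intervals
shrunk = eb_shrink(local, trend, nu_local=10.0, nu0=10.0)
# equal df weights -> midpoint of local and global: [2.5, 0.75, 1.5]
```

Larger \(\nu_0\) (strong prior) pulls estimates toward the trend; larger \(\nu_L\) (well-determined local estimates) preserves local variation.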

consenrich.core.fitVarianceFunction(jointlySortedMeans, jointlySortedVariances, eps=0.01, binQuantileCutoff=0.5, EB_minLin=1.0)[source]
Parameters:
  • jointlySortedMeans (ndarray)

  • jointlySortedVariances (ndarray)

  • eps (float)

  • binQuantileCutoff (float)

  • EB_minLin (float)

Return type:

ndarray

consenrich.core.EB_computePriorStrength(localModelVariances, globalModelVariances, Nu_local)[source]

Compute \(\nu_0\) to determine ‘prior strength’

The prior model strength is determined by ‘excess’ dispersion beyond sampling noise at the local level.

Parameters:
  • localModelVariances (np.ndarray) – Local model variance estimates (e.g., rolling AR(1) innovation variances consenrich.cconsenrich.crolling_AR1_IVar()).

  • globalModelVariances (np.ndarray) – Global model variance estimates from the absMean-variance trend fit (consenrich.core.fitVarianceFunction()).

  • Nu_local (float) – Effective sample size/degrees of freedom for the local model.

Returns:

Estimated prior strength \(\nu_{0}\).

Return type:

float

Seealso:

consenrich.core.getMuncTrack(), consenrich.core.fitVarianceFunction()

consenrich.core.runConsenrich(matrixData, matrixMunc, deltaF, minQ, maxQ, offDiagQ, stateInit, stateCovarInit, boundState, stateLowerBound, stateUpperBound, blockLenIntervals, projectStateDuringFiltering=False, pad=0.0001, disableCalibration=False, EM_maxIters=50, EM_rtol=0.0001, EM_scaleToMedian=False, EM_tNu=10.0, EM_alphaEMA=0.1, EM_scaleLOW=0.1, EM_scaleHIGH=10.0, EM_useObsBlockScale=True, EM_useProcBlockScale=True, EM_useObsPrecReweight=True, EM_useProcPrecReweight=True, EM_useReplicateBias=True, EM_useReplicateScale=True, EM_repBiasShrink=0.0, EM_repScaleLOW=0.25, EM_repScaleHIGH=4.0, returnScales=True, returnReplicateOffsets=False, applyJackknife=False, jackknifeEM_maxIters=5, jackknifeEM_rtol=0.01, useWhiteAccel=False, useDiscreteConstAccel=False, autoDeltaF=True, autoDeltaF_low=0.0001, autoDeltaF_high=2.0, autoDeltaF_init=0.01, autoDeltaF_maxEvals=25, conformalRescale=False, conformalAlpha=0.05, conformal_numIters=3, conformalFinalRefit=True)[source]

Run Consenrich over a contiguous genomic region (chromosome).

We estimate a position-dependent consensus epigenomic signal \(\widetilde{x}_{[i,0]}\) from multiple replicate tracks’ HTS data, and provide positional uncertainty quantification by propagating variance in a linear filter-smoother setup.

This wrapper ties together several fundamental routines written in Cython:

  1. consenrich.cconsenrich.cforwardPass(): Forward filter (predict, update)

  2. consenrich.cconsenrich.cbackwardPass(): Backward fixed-interval smoother

  3. consenrich.cconsenrich.cblockScaleEM(): Joint optimization of process and observation model noise scales with robust studentized reweighting

Seealso:

consenrich.core.getMuncTrack(), consenrich.cconsenrich.cTransform(), consenrich.cconsenrich.cforwardPass(), consenrich.cconsenrich.cbackwardPass(), consenrich.cconsenrich.cblockScaleEM()

Parameters:
  • matrixData (ndarray)

  • matrixMunc (ndarray)

  • deltaF (float)

  • minQ (float)

  • maxQ (float)

  • offDiagQ (float)

  • stateInit (float)

  • stateCovarInit (float)

  • boundState (bool)

  • stateLowerBound (float)

  • stateUpperBound (float)

  • blockLenIntervals (int)

  • projectStateDuringFiltering (bool)

  • pad (float)

  • disableCalibration (bool)

  • EM_maxIters (int)

  • EM_rtol (float)

  • EM_scaleToMedian (bool)

  • EM_tNu (float)

  • EM_alphaEMA (float)

  • EM_scaleLOW (float)

  • EM_scaleHIGH (float)

  • EM_useObsBlockScale (bool)

  • EM_useProcBlockScale (bool)

  • EM_useObsPrecReweight (bool)

  • EM_useProcPrecReweight (bool)

  • EM_useReplicateBias (bool)

  • EM_useReplicateScale (bool)

  • EM_repBiasShrink (float)

  • EM_repScaleLOW (float)

  • EM_repScaleHIGH (float)

  • returnScales (bool)

  • returnReplicateOffsets (bool)

  • applyJackknife (bool)

  • jackknifeEM_maxIters (int)

  • jackknifeEM_rtol (float)

  • useWhiteAccel (bool)

  • useDiscreteConstAccel (bool)

  • autoDeltaF (bool)

  • autoDeltaF_low (float)

  • autoDeltaF_high (float)

  • autoDeltaF_init (float)

  • autoDeltaF_maxEvals (int)

  • conformalRescale (bool)

  • conformalAlpha (float)

  • conformal_numIters (int)

  • conformalFinalRefit (bool)

consenrich.core.constructMatrixF(deltaF)[source]

Build the state transition matrix for the process model

Parameters:

deltaF (float) – See processParams.

Returns:

The state transition matrix \(\mathbf{F}\)

Return type:

npt.NDArray[np.float32]

Seealso:

processParams
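Assuming the standard ‘level + slope’ (constant-velocity) form implied by processParams, the transition matrix can be sketched as:

```python
import numpy as np

def construct_F(delta_f):
    """Constant-velocity transition: level += delta_f * slope; slope unchanged."""
    return np.array([[1.0, delta_f],
                     [0.0, 1.0]], dtype=np.float32)

F = construct_F(0.5)
x = np.array([2.0, 1.0], dtype=np.float32)  # state: (level, slope)
x_next = F @ x                              # level 2 + 0.5*1 = 2.5; slope 1.0
```

Here deltaF plays the role of the integration step size between \(\dot{x}_{[i]}\) and \(x_{[i]}\) described under processParams.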

consenrich.core.constructMatrixQ(minDiagQ, offDiagQ=0.0, Q00=None, Q01=None, Q10=None, Q11=None, useIdentity=-1.0, tol=1e-08, useWhiteAccel=False, useDiscreteConstAccel=False, deltaF=None)[source]

Build the (base) process noise covariance matrix \(\mathbf{Q}\).

Parameters:
  • minDiagQ (float) – Minimum value for diagonal entries of \(\mathbf{Q}\).

  • offDiagQ (float) – Value for off-diagonal entries of \(\mathbf{Q}\).

  • Q00 (Optional[float]) – Optional value for entry (0,0) of \(\mathbf{Q}\).

  • Q01 (Optional[float]) – Optional value for entry (0,1) of \(\mathbf{Q}\).

  • Q10 (Optional[float]) – Optional value for entry (1,0) of \(\mathbf{Q}\).

  • Q11 (Optional[float]) – Optional value for entry (1,1) of \(\mathbf{Q}\).

  • useIdentity (float) – If > 0.0, use a scaled identity matrix for \(\mathbf{Q}\). Overrides other parameters.

  • tol (float) – Tolerance for positive definiteness check.

  • useWhiteAccel (bool)

  • useDiscreteConstAccel (bool)

  • deltaF (float | None)

Returns:

The process noise covariance matrix \(\mathbf{Q}\).

Return type:

npt.NDArray[np.float32]

Seealso:

processParams
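A minimal sketch of building a base \(\mathbf{Q}\) with a positive-definiteness check (illustrative only; the actual constructMatrixQ supports more options, e.g., per-entry overrides and acceleration models):

```python
import numpy as np

def construct_Q(min_diag_q, off_diag_q=0.0, tol=1e-8):
    """Base 2x2 process noise covariance with a positive-definiteness check."""
    Q = np.array([[min_diag_q, off_diag_q],
                  [off_diag_q, min_diag_q]], dtype=np.float64)
    # symmetric matrix: check the smallest eigenvalue exceeds the tolerance
    if np.min(np.linalg.eigvalsh(Q)) <= tol:
        raise ValueError("Q is not positive definite; reduce offDiagQ")
    return Q.astype(np.float32)

Q = construct_Q(0.1, 0.01)  # eigenvalues 0.09 and 0.11 -> valid covariance
```

For a 2x2 matrix of this form, positive definiteness requires |offDiagQ| < minDiagQ, which the eigenvalue check enforces.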

consenrich.core.constructMatrixH(m, coefficients=None)[source]

Build the observation model matrix \(\mathbf{H}\).

Parameters:
  • m (int) – Number of observations.

  • coefficients (Optional[np.ndarray]) – Optional coefficients for the observation model, which can be used to weight the observations manually.

Returns:

The observation model matrix \(\mathbf{H}\).

Return type:

npt.NDArray[np.float32]

Seealso:

observationParams, inputParams

consenrich.detrorm

consenrich.detrorm.getScaleFactor1x(bamFile, effectiveGenomeSize, readLength, excludeChroms, chromSizesFile, samThreads)[source]

Generic normalization factor based on effective genome size and number of mapped reads in non-excluded chromosomes.

Parameters:
  • bamFile (str) – See consenrich.core.inputParams.

  • effectiveGenomeSize (int) – Effective genome size (bp) for the sample.

  • readLength (int) – Read or fragment length (bp).

  • excludeChroms (List[str]) – Chromosomes to exclude when counting mapped reads.

  • chromSizesFile (str) – Path to the chromosome sizes file.

  • samThreads (int) – See consenrich.core.samParams.

Returns:

Scale factor for 1x normalization.

Return type:

float
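The 1x normalization arithmetic presumably follows the usual effective-genome-size coverage convention (a hedged sketch, not necessarily the exact internal computation): scaling counts by effectiveGenomeSize / (mappedReads × readLength) makes average coverage over the effective genome approximately 1x.

```python
def scale_factor_1x(effective_genome_size, mapped_reads, read_length):
    """Scale counts so average coverage over the effective genome is ~1x."""
    return effective_genome_size / (mapped_reads * read_length)

# e.g., 20M mapped 100 bp reads over a 2e9 bp effective genome:
s = scale_factor_1x(2e9, 20e6, 100)  # raw coverage is already 1x -> factor 1.0
```

Deeper sequencing yields a smaller factor (coverage is scaled down), shallower sequencing a larger one.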

consenrich.detrorm.getScaleFactorPerMillion(bamFile, excludeChroms, intervalSizeBP)[source]

Generic normalization factor based on number of mapped reads in non-excluded chromosomes.

Parameters:
  • bamFile (str) – See consenrich.core.inputParams.

  • excludeChroms (List[str]) – List of chromosomes to exclude when counting mapped reads.

  • intervalSizeBP (int)

Returns:

Scale factor accounting for number of mapped reads (only).

Return type:

float

consenrich.detrorm.getPairScaleFactors(bamFileA, bamFileB, effectiveGenomeSizeA, effectiveGenomeSizeB, readLengthA, readLengthB, excludeChroms, chromSizesFile, samThreads, intervalSizeBP, normMethod='EGS', fixControl=True)[source]

Scale treatment:control data based on effective genome size or reads per million.

Parameters:
  • bamFileA (str) – Alignment file for the ‘treatment’ sample.

  • bamFileB (str) – Alignment file for the ‘control’ sample (e.g., input).

  • effectiveGenomeSizeA (int) – Effective genome size for the treatment sample.

  • effectiveGenomeSizeB (int) – Effective genome size for the control sample.

  • readLengthA (int) – Read or fragment length for the treatment sample.

  • readLengthB (int) – Read or fragment length for the control sample.

  • excludeChroms (List[str]) – List of chromosomes to exclude from the analysis.

  • chromSizesFile (str) – Path to the chromosome sizes file.

  • samThreads (int) – See consenrich.core.samParams.

  • intervalSizeBP (int) – Step size for coverage calculation.

  • normMethod (str) – Normalization method to use (“EGS” or “RPKM”).

  • fixControl (bool) – See consenrich.core.countingParams.

Returns:

Tuple of scale factors for treatment and control samples.

Return type:

Tuple[float, float]

consenrich.constants

consenrich.constants.getEffectiveGenomeSize(genome, readLength)[source]

Get the effective genome size for a given genome and read length.

Parameters:
  • genome (str) – Name of the genome. See consenrich.constants.resolveGenomeName() and consenrich.core.genomeParams.

  • readLength (int) – Length of the reads. See consenrich.core.getReadLength().

Raises:

ValueError – If the genome is not recognized or if the read length is not available for the genome.

Returns:

Effective genome size in base pairs.

Return type:

int

consenrich.constants.getGenomeResourceFile(genome, fileType, dir_='data')[source]

Get the path to a genome resource file.

Parameters:
  • genome (str) – the genome assembly. See consenrich.constants.resolveGenomeName() and consenrich.core.genomeParams.

  • fileType (str) – One of ‘sizes’, ‘blacklist’, ‘sparse’.

  • dir_ (str)

Returns:

Path to the resource file.

Return type:

str

Raises:
  • ValueError – If not a sizes, blacklist, or sparse file.

  • FileNotFoundError – If the resource file does not exist.

Cython functions: consenrich.cconsenrich

Several computationally intensive tasks are implemented in Cython:

consenrich.cconsenrich.creadBamSegment(bamFile, chromosome, start, end, intervalSizeBP, readLength, oneReadPerBin, samThreads, samFlagExclude, shiftForwardStrand53=0, shiftReverseStrand53=0, extendBP=0, maxInsertSize=1000, pairedEndMode=0, inferFragmentLength=0, minMappingQuality=0, minTemplateLength=-1, weightByOverlap=1, ignoreTLEN=1)

Count reads in a BAM file for a given chromosome.

consenrich.cconsenrich.csampleBlockStats(intervals, values, expectedBlockSize, iters, randSeed, excludeIdxMask, eps=0.0)

Sample contiguous blocks in the response sequence (xCorr), record maxima, and repeat.

Used to build an empirical null distribution and assess the significance of response outputs. Block sizes are drawn from a truncated geometric distribution: the expected block size is preserved, while the variability accounts for sampling across different phases of the response sequence.

Parameters:
  • values (cnp.ndarray[cnp.float64_t, ndim=1]) – The response sequence to sample from.

  • expectedBlockSize (int) – The expected size (geometric) of the blocks to sample.

  • iters (int) – The number of blocks to sample.

  • randSeed (int) – Random seed for reproducibility.

Returns:

An array of sampled block maxima.

Return type:

cnp.ndarray[cnp.float64_t, ndim=1]

Seealso:

consenrich.matching.matchWavelet()
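The block-maxima sampling can be sketched in plain NumPy. This is an illustrative stand-in for csampleBlockStats (the helper name sampleBlockMaxima and its defaults are hypothetical), ignoring the exclusion mask and truncation details:

```python
import numpy as np

def sampleBlockMaxima(values, expectedBlockSize=50, iters=1000, seed=42):
    """Sample contiguous blocks (geometric lengths, mean ~expectedBlockSize)
    from a response sequence and record each block's maximum."""
    rng = np.random.default_rng(seed)
    n = len(values)
    maxima = np.empty(iters)
    for k in range(iters):
        # Geometric block length, truncated to stay within the sequence
        size = min(max(1, rng.geometric(1.0 / expectedBlockSize)), n)
        start = rng.integers(0, n - size + 1)
        maxima[k] = values[start:start + size].max()
    return maxima

rng = np.random.default_rng(0)
response = rng.normal(size=100_000)
nullMaxima = sampleBlockMaxima(response, expectedBlockSize=50, iters=2000)
# Empirical p-value for an observed response maximum (upper-tail ECDF,
# lower-bounded by 1/(B+1) as described in the matching documentation):
pval = max((nullMaxima >= 4.0).mean(), 1.0 / (len(nullMaxima) + 1))
```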

consenrich.cconsenrich.cgetFragmentLength(bamFile, samThreads=0, samFlagExclude=3844, maxInsertSize=1000, iters=1000, blockSize=5000, fallBack=147, rollingChunkSize=250, lagStep=5, earlyExit=250, randSeed=42)
consenrich.cconsenrich.cTransform(x, blockLength, disableLocalBackground=False, w_local=1.0, w_global=2.0, verbose=False, rseed=0, useIRLS=True, asymPos=0.4, logOffset=1.0, logMult=1.0)
consenrich.cconsenrich.cDenseMean(x, blockLenTarget=250, itersEM=1000, seed=0, verbose=False)

Use a Gaussian mixture to estimate a ‘dense’ baseline to subtract from each replicate’s transformed count track.

Note that this step is potentially redundant when replicate-level bias terms are used in the observation model.

consenrich.cconsenrich.clocalBaseline(x, blockSize=101, useIRLS=True, asymPos=0.4, noiseMult=1.0, cauchyScale=3.0, maxIters=25, minWeight=1e-06, tol=0.001)

Estimate a local baseline for x as a smooth lower envelope via IRLS.

Compute a locally smooth baseline \(\hat{b}\) for an input signal \(y\), using a second-order penalized smoother (Whittaker) with asymmetric iteratively reweighted least squares (IRLS) to reduce influence from peaks.

Parameters:
  • x (np.ndarray) – Signal measurements over fixed-length genomic intervals

  • asymPos (float) – Relative weight assigned to positive residuals to induce asymmetry in reweighting. Used during IRLS for the local baseline computation. Smaller values will downweight peaks more and pose less risk of removing true signal. Typical range is (0, 0.75].

  • cauchyScale (float) – Controls how quickly weights decay with normalized residual magnitude. Smaller values downweight outliers strongly.

Returns:

Baseline estimate as a float32 NumPy array of shape (n,).

Return type:

numpy.ndarray
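The asymmetric IRLS idea can be sketched with a dense second-order Whittaker smoother. This is a minimal illustration of the approach described above, not the cconsenrich implementation; the helper name and the dense linear solve are assumptions (a production routine would exploit the banded structure):

```python
import numpy as np

def localBaseline(y, lam=100.0, asymPos=0.4, maxIters=25, tol=1e-3):
    """Second-order Whittaker smoother with asymmetric IRLS weights:
    positive residuals (y above the baseline, i.e., peaks) get weight
    asymPos < 1, negative residuals get weight 1, so the fit hugs the
    lower envelope."""
    n = len(y)
    D = np.diff(np.eye(n), n=2, axis=0)         # second-difference operator
    P = lam * D.T @ D                           # smoothness penalty
    w = np.ones(n)
    b = y.copy()
    for _ in range(maxIters):
        W = np.diag(w)
        bNew = np.linalg.solve(W + P, w * y)    # weighted penalized LS
        wNew = np.where(y > bNew, asymPos, 1.0) # downweight peaks
        if np.abs(bNew - b).max() < tol:
            b = bNew
            break
        b, w = bNew, wNew
    return b

x = np.linspace(0, 1, 200)
y = 0.5 + 0.1 * x + 5.0 * np.exp(-((x - 0.5) ** 2) / 0.001)  # flat + peak
base = localBaseline(y)
```

The baseline tracks the flat background closely while staying well below the narrow peak, which is the desired lower-envelope behavior.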

consenrich.cconsenrich.cPAVA(x, postWeight)

PAVA for isotonic regression

This code follows the notation and algorithm of Busing (2022; JSS, DOI: 10.18637/jss.v102.c01).

From Busing:

> Observe that the violation 8 = x3 > x4 = 2 is solved by combining two values, 8 and 2, resulting in a (new) block value of 5, i.e., (8 + 2)/2 = 5. Instead of immediately turning around and start solving down block violation, we may first look ahead for the next value in the sequence, k-up, for if this element is smaller than or equal to 5, the next value can immediately be pooled into the current block, i.e., (8 + 2 + 2)/3 = 4. Looking ahead can be continued until the next element is larger than the current block value or if we reach the end of the sequence.

Parameters:
  • x (cnp.ndarray, (either f32 or f64)) – 1D array to be fitted as nondecreasing

  • postWeight (cnp.ndarray) – 1D array of weights corresponding to each observed value. These are the number of ‘observations’ associated to each ‘unique’ value in x. Intuition: more weight to values with more observations.

Returns:

PAVA-fitted values

Return type:

cnp.ndarray, (either f32 or f64)
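A plain-Python sketch of weighted PAVA (without Busing's look-ahead refinement) may help fix ideas; the helper name pava is hypothetical:

```python
import numpy as np

def pava(x, w):
    """Weighted pool-adjacent-violators: return the nondecreasing fit
    minimizing sum_i w_i * (x_i - fit_i)^2."""
    # Each block stores [weighted mean, total weight, element count]
    blocks = []
    for xi, wi in zip(map(float, x), map(float, w)):
        blocks.append([xi, wi, 1])
        # Pool backward while the last block violates monotonicity
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            wTot = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / wTot, wTot, c1 + c2])
    out = []
    for m, _, c in blocks:
        out.extend([m] * c)
    return np.array(out)

# Busing's example: 8 and 2 pool to 5; with the following 2, to 4
fit = pava([1, 3, 8, 2, 2, 9], np.ones(6))
```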

consenrich.cconsenrich.cforwardPass(matrixData, matrixPluginMuncInit, matrixF, matrixQ0, intervalToBlockMap, rScale, qScale, blockCount, stateInit, stateCovarInit, pad=0.0001, projectStateDuringFiltering=False, stateLowerBound=0.0, stateUpperBound=0.0, chunkSize=1000000, stateForward=None, stateCovarForward=None, pNoiseForward=None, vectorD=None, progressBar=None, progressIter=25000, returnNLL=False, storeNLLInD=False, lambdaExp=None, processPrecExp=None, replicateBias=None, replicateScale=None, EM_useObsBlockScale=True, EM_useProcBlockScale=True, EM_useObsPrecReweight=True, EM_useProcPrecReweight=True, EM_useReplicateBias=False, EM_useReplicateScale=False)

Run the forward pass (filter) for state estimation

See consenrich.cconsenrich.cblockScaleEM(), where this routine is applied within the filter, smooth, update loop.

Seealso:

consenrich.cconsenrich.cbackwardPass(), consenrich.cconsenrich.cblockScaleEM(), consenrich.core.runConsenrich()

consenrich.cconsenrich.cbackwardPass(matrixData, matrixF, stateForward, stateCovarForward, pNoiseForward, chunkSize=1000000, stateSmoothed=None, stateCovarSmoothed=None, lagCovSmoothed=None, postFitResiduals=None, replicateBias=None, progressBar=None, progressIter=10000, EM_useReplicateBias=False)

Run the backward pass (smoother)

This function executes the smoothing phase of Consenrich’s forward–backward state estimation. It operates on the forward-filtered outputs (those returned by consenrich.cconsenrich.cforwardPass()).

That is, given the forward-pass, filtered estimates over genomic intervals \(i = 1, \dots, n\),

\[\mathbf{x}_{[i|i]}, \qquad \mathbf{P}_{[i|i]}, \qquad \mathbf{Q}_{[i]},\]

this routine computes the backward-smoothed state estimates \(\widetilde{\mathbf{x}}_{[i]}\) and the backward-smoothed covariances \(\widetilde{\mathbf{P}}_{[i]}\).

Seealso:

consenrich.cconsenrich.cforwardPass(), consenrich.cconsenrich.cblockScaleEM(), consenrich.core.runConsenrich()
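A minimal NumPy sketch of this filter/smoother pairing for the two-state ‘level + slope’ model with \(\mathbf{H} = [1, 0]\) (scalar observations; no EM scaling, chunking, or state projection; all names are illustrative):

```python
import numpy as np

def forwardBackward(z, R, F, Q, x0, P0):
    """Kalman forward pass followed by an RTS backward smoother."""
    n = len(z)
    H = np.array([1.0, 0.0])
    xf = np.zeros((n, 2)); Pf = np.zeros((n, 2, 2))   # filtered moments
    xp = np.zeros((n, 2)); Pp = np.zeros((n, 2, 2))   # predicted moments
    x, P = x0, P0
    for i in range(n):                       # forward pass (filter)
        x, P = F @ x, F @ P @ F.T + Q        # predict
        xp[i], Pp[i] = x, P
        S = H @ P @ H + R[i]                 # innovation variance (scalar)
        K = P @ H / S                        # Kalman gain
        x = x + K * (z[i] - H @ x)           # update
        P = P - np.outer(K, H @ P)
        xf[i], Pf[i] = x, P
    xs, Ps = xf.copy(), Pf.copy()
    for i in range(n - 2, -1, -1):           # backward pass (RTS smoother)
        G = Pf[i] @ F.T @ np.linalg.inv(Pp[i + 1])
        xs[i] = xf[i] + G @ (xs[i + 1] - xp[i + 1])
        Ps[i] = Pf[i] + G @ (Ps[i + 1] - Pp[i + 1]) @ G.T
    return xs, Ps

F = np.array([[1.0, 1.0], [0.0, 1.0]])       # level + slope transition
Q = np.array([[0.01, 0.001], [0.001, 0.01]])
rng = np.random.default_rng(1)
z = np.sin(np.linspace(0, 3, 300)) + 0.3 * rng.normal(size=300)
xs, Ps = forwardBackward(z, np.full(300, 0.09), F, Q, np.zeros(2), np.eye(2))
```

Smoothing uses information from both directions, so the smoothed level estimates are less noisy than the raw observations.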

consenrich.cconsenrich.cblockScaleEM(matrixData, matrixPluginMuncInit, matrixF, matrixQ0, intervalToBlockMap, blockCount, stateInit, stateCovarInit, EM_maxIters=50, EM_rtol=0.0001, pad=0.0001, EM_scaleLOW=0.1, EM_scaleHIGH=10.0, EM_alphaEMA=0.1, EM_alphaEMA_Q=1.0, EM_scaleToMedian=False, EM_tNu=10.0, EM_useObsBlockScale=True, EM_useProcBlockScale=True, EM_useObsPrecReweight=True, EM_useProcPrecReweight=True, EM_useReplicateBias=True, EM_useReplicateScale=True, EM_repBiasShrink=0.0, EM_repScaleLOW=0.25, EM_repScaleHIGH=4.0, t_innerIters=3, returnIntermediates=False)

Run the Consenrich filter-smoother estimation loop with iteratively updated observation and process noise [co]variances.

Take observation and process noise [co]variances:

\[\widetilde{R}_{[j,i]}=\frac{a_j\,r_{b(i)}\,v_{[j,i]}}{\lambda_{[j,i]}}, \qquad \widetilde{\mathbf{Q}}_{[i]}=\frac{q_{b(i)}\,\mathbf{Q}_0}{\kappa_{[i]}}.\]

We also include replicate-level additive offsets in the observation mean:

\[z_{[j,i]} = x_{[i,0]} + b_j + \epsilon_{[j,i]}.\]

Here \(r_b>0\) and \(q_b>0\) are block-level scale multipliers, \(a_j>0\) and \(b_j\) are replicate-level multipliers/offsets, and \(\lambda_{[j,i]}\) and \(\kappa_{[i]}\) are Student-t precision multipliers.

Estimation loop

Repeat until convergence:

  1. Filter-Smoother estimation

Run the forward filter and backward smoother under the current (given) effective noises \(\widetilde{R}\) and \(\widetilde{\mathbf{Q}}\). This yields smoothed moments \(\widetilde{\mathbf{x}}_{[i]}\), \(\widetilde{\mathbf{P}}_{[i]}\), and lag-one covariances \(\widetilde{\mathbf{C}}_{[i,i+1]}\).

  2. Studentized precision reweighting:

Observation weights \(\lambda_{[j,i]}\) (EM_useObsPrecReweight):

\[u^2_{[j,i]}=\frac{(z_{[j,i]}-b_j-\widetilde{x}_{[i,0]})^2+\widetilde{P}_{[i,0,0]}} {\widetilde{R}_{[j,i]}} \quad\Rightarrow\quad \lambda_{[j,i]} \leftarrow \frac{\nu_R+1}{\nu_R+u^2_{[j,i]}}.\]

In code, EM_tNu corresponds to \(\nu_R\).

Process weights \(\kappa_{[i]}\):

Let \(\mathbf{w}_{[i]}=\mathbf{x}_{[i]}-\mathbf{F}\mathbf{x}_{[i-1]}\) and define

\[\Delta_{[i]}=\textsf{Trace}\!\left(\mathbf{Q}_0^{-1}\,\mathbb{E}\left[\mathbf{w}_{[i]}\mathbf{w}_{[i]}^\top\right]\right).\]

Then

\[\kappa_{[i]} \leftarrow \frac{\nu_Q+d}{\nu_Q+\Delta_{[i]}/q_{b(i)}},\]

where \(d=2\).

  3. Block-level scaling: update \(r_b\) and/or \(q_b\) using blockwise closed-form averages of sufficient statistics from the E-step.

Observation block scale update:

\[r_b \leftarrow \frac{1}{N_b}\sum_{(j,i):\,b(i)=b} \lambda_{[j,i]}\, \frac{(z_{[j,i]}-b_j-\widetilde{x}_{[i,0]})^2+\widetilde{P}_{[i,0,0]}}{a_j\,v_{[j,i]}}.\]

Process block scale update:

\[q_b \leftarrow \frac{1}{d\,M_b}\sum_{i:\,b(i)=b}\kappa_{[i+1]}\,\Delta_{[i]}.\]

  4. Replicate-level updates: update \(b_j\) and \(a_j\) from weighted residual moments at replicate resolution while holding block and interval-level terms fixed.

If EM_alphaEMA is supplied, we smooth updates with an exponential moving average (EMA) in the log domain:

\[\log r_b \leftarrow (1-\alpha)\log r_b^{(\mathrm{sm})} + \alpha \log r_b, \qquad \log q_b \leftarrow (1-\alpha_Q)\log q_b^{(\mathrm{sm})} + \alpha_Q \log q_b.\]
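The reweighting and scale updates above can be traced numerically. The snippet evaluates the formulas from this section for a single interval with three replicates (toy values; \(a_j = 1\), a single block, and zero offsets are assumed):

```python
import numpy as np

nuR = 10.0                          # EM_tNu, i.e., nu_R
z = np.array([2.0, 2.2, 8.0])       # observations (one interval, 3 reps)
bj = np.zeros(3)                    # replicate offsets b_j
xTilde, pTilde = 2.1, 0.05          # smoothed level and its variance
Reff = np.array([0.5, 0.5, 0.5])    # current effective variances R~

# Studentized precision reweighting (EM_useObsPrecReweight):
u2 = ((z - bj - xTilde) ** 2 + pTilde) / Reff
lam = (nuR + 1.0) / (nuR + u2)      # the outlier (8.0) gets a small weight

# Observation block scale update (single block; a_j = 1, v = Reff here):
rB = np.mean(lam * ((z - bj - xTilde) ** 2 + pTilde) / Reff)

# Log-domain EMA smoothing of the scale update (EM_alphaEMA):
alpha, rBPrev = 0.1, 1.0
rBSm = np.exp((1 - alpha) * np.log(rBPrev) + alpha * np.log(rB))
```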

Objective Function

Let \(x_{1:n}=\{\mathbf{x}_{[i]}\}_{i=1}^n\), \(\Lambda=\{\lambda_{[j,i]}\}\), and \(\kappa=\{\kappa_{[i]}\}\). Collecting process and observation terms and mixing penalties yields:

\begin{align} \mathcal{J}(x,\Lambda,\kappa,r,q,a,b) &= \frac12\sum_{i=2}^{n} \left[ \log\left|\frac{q_{b(i)}}{\kappa_{[i]}}\mathbf{Q}_0\right| + (\mathbf{x}_{[i]}-\mathbf{F}\mathbf{x}_{[i-1]})^\top \left(\frac{\kappa_{[i]}}{q_{b(i)}}\mathbf{Q}_0^{-1}\right) (\mathbf{x}_{[i]}-\mathbf{F}\mathbf{x}_{[i-1]}) \right] \\ &\quad+ \frac12\sum_{i=1}^{n}\sum_{j=1}^m \left[ \log\!\left(\frac{a_j r_{b(i)}v_{[j,i]}}{\lambda_{[j,i]}}\right) + (z_{[j,i]}-b_j-x_{[i,0]})^2\,\frac{\lambda_{[j,i]}}{a_j r_{b(i)}v_{[j,i]}} \right] \\ &\quad+ \sum_{i=1}^{n}\sum_{j=1}^m \left[ -\left(\frac{\nu_R+1}{2}-1\right)\log\lambda_{[j,i]} +\frac{\nu_R+1}{2}\lambda_{[j,i]} \right] \\ &\quad+ \sum_{i=2}^{n} \left[ -\left(\frac{\nu_Q+d}{2}-1\right)\log\kappa_{[i]} +\frac{\nu_Q+d}{2}\kappa_{[i]} \right]. \end{align}

The estimation loop minimizing this objective may thus be viewed as coordinate descent: the filter-smoother solves the quadratic subproblem in \(x_{1:n}\) conditional on the current estimates of \(\Lambda\), \(\kappa\), \(r\), \(q\), \(a\), and \(b\), while the reweighting and scale/offset updates in turn optimize over those remaining variables.

Parameters:
  • matrixData (numpy.ndarray[numpy.float32]) – Replicate track data (rows: replicates, columns: genomic intervals).

  • matrixPluginMuncInit (numpy.ndarray[numpy.float32]) – Data-derived observation noise variances \(v_{[j,i]}\). Same per-replicate/per-interval shape as matrixData.

  • matrixF (numpy.ndarray[numpy.float32]) – Transition matrix \(\mathbf{F}\), shape (2, 2).

  • matrixQ0 (numpy.ndarray[numpy.float32]) – Base process noise covariance \(\mathbf{Q}_0 \in \mathbb{R}^{2 \times 2}\).

  • intervalToBlockMap (numpy.ndarray[numpy.int32]) – Mapping from interval index \(i\) to block index \(b(i)\).

  • blockCount (int) – Number of blocks that are rescaled during optimization. Larger values optimize at a finer resolution, but introduce a risk of overfitting, increased runtime, and potential ambiguity with the precision reweighting procedure.

  • stateInit (float) – Initial value for the signal level (first component) of the state vector \(\mathbf{x}_{[0]}\).

  • stateCovarInit (float) – Initial state covariance scale.

  • EM_maxIters (int) – Maximum number of outer EM iterations.

  • EM_rtol (float) – Relative improvement tolerance on the NLL used to assess convergence.

  • EM_scaleLOW (float) – Lower bound for block scales \(r_b\) and \(q_b\).

  • EM_scaleHIGH (float) – Upper bound for block scales \(r_b\) and \(q_b\).

  • EM_alphaEMA (float) – If in (0, 1], enables log-domain EMA smoothing for block scale updates with smoothing factor \(\alpha\).

  • EM_scaleToMedian (bool) – If True, rescales \(r_b\) and \(q_b\) by their medians after each M-step. This can help preserve between-block scale differences for interpretability, but may interfere with convergence of the EM algorithm since it changes the effective objective.

  • EM_tNu (float) – Student-t degrees of freedom controlling reweighting strength (smaller = stronger reweighting).

  • EM_useObsBlockScale (bool) – If True, estimate block observation scales \(r_b\); otherwise fix \(r_b\equiv 1\).

  • EM_useProcBlockScale (bool) – If True, estimate block process scales \(q_b\); otherwise fix \(q_b\equiv 1\).

  • EM_useObsPrecReweight (bool) – If True, update observation precision multipliers \(\lambda_{[j,i]}\) (Student-t reweighting); otherwise \(\lambda\equiv 1\).

  • EM_useProcPrecReweight (bool) – If True, update process precision multipliers \(\kappa_{[i]}\) (Student-t reweighting); otherwise \(\kappa\equiv 1\).

  • EM_useReplicateBias (bool) – If True, update replicate-level additive offsets \(b_j\).

  • EM_useReplicateScale (bool) – If True, update replicate-level observation variance scales \(a_j\).

  • t_innerIters (int) – Number of inner filter/smoother + reweighting updates per outer EM iteration.

  • returnIntermediates (bool) – If True, also return smoothed states/covariances, residuals, and (if enabled) precision multipliers.

Returns:

A tuple (rScale, qScale, itersDone, finalNLL). If returnIntermediates=True, additionally returns (stateSmoothed, stateCovarSmoothed, lagCovSmoothed, postFitResiduals, lambdaExp, processPrecExp, replicateBias, replicateScale).

Return type:

tuple

References

  • Shumway, R. H. & Stoffer, D. S. (1982): An approach to time series smoothing and forecasting using the EM algorithm. DOI: 10.1111/j.1467-9892.1982.tb00349.x

  • West, M. (1987): On scale mixtures of normal distributions. DOI: 10.1093/biomet/74.3.646

consenrich.cconsenrich.cEMA(x, alpha)
consenrich.cconsenrich.cmeanVarPairs(intervals, values, blockSize, iters, randSeed, excludeIdxMask, zeroPenalty=0.0, zeroThresh=0.0, useInnovationVar=True, useSampleVar=False)
consenrich.cconsenrich.projectToBox(vectorX, matrixP, stateLowerBound, stateUpperBound, eps)

consenrich.matching

(Experimental) Detect genomic regions showing both enrichment and non-random structure

Denote a noisy signal over fixed-length genomic intervals,

\[{\mathbf{x}} = \{{x}_{[i]}\}_{i=1}^{i=n}.\]

Aim: Determine a set of structured peaks, i.e., genomic regions where \(x_{[a:b]}\) exhibits both:

  1. Enrichment (large relative amplitude)

  2. Non-random structure defined by a robust template (polynomial, oscillatory, etc.)

Prioritizing genomic regions that are both enriched and agree with a prescribed structure may be appealing for several reasons. Namely,

  • Targeted detection of biologically relevant signal patterns in a given assay (e.g., see related works analyzing peak-shape Cremona et al., 2015, Parodi et al., 2017)

  • Improved confidence that the identified genomic regions are not due to stochastic noise, which is characteristically unstructured.

  • Speed: Runs in seconds/minutes using efficient numerical methods to compute large chromosome-scale convolutions (fast Fourier transform (FFT)-based, overlap-add (OA), etc.)

Algorithm Overview

To detect structured peaks, we run an approach akin to matched filtering, with templates derived from approximated discrete wavelets (or their corresponding scaling functions).

Note, other bases for defining templates may be considered in the future, e.g., splines, polynomials, etc. Wavelet-based templates are convenient and possess several desirable properties for detecting structured peaks: multi-scale representation, compact support, orthogonality, etc. For relevant background, Fournier et al. ’95 and/or Mallat and Hwang ’92 may be useful reading.

Denote the cross-correlation between the signal track and a matching template \(\boldsymbol{\xi}\) as:

\[\{\mathcal{R}_{[i]}\}_{i=1}^{i=n} = {\mathbf{x}} \star \boldsymbol{\xi},\]
\[\mathcal{R}_{[i]} = \sum_{t=1}^{t=T} {x}_{[i+t-1]} \cdot \xi_{[t]}.\]

We refer to \(\mathcal{R}\) over \(i=1 \ldots n\) as the response sequence. The response will be greatest in genomic regions with a high read density and structural similarity to the template \(\boldsymbol{\xi}\).
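The response sequence is a cross-correlation, so it can be computed chromosome-wide via the FFT. A minimal sketch (the step-like template and data here are illustrative, not Consenrich’s actual templates):

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(size=50_000)
x[20_000:20_008] += 3.0                      # an enriched, step-like region
xi = np.ones(8) / np.sqrt(8)                 # unit-norm template (T = 8)

# Cross-correlation via FFT: R[i] = sum_t x[i+t-1] * xi[t]
nFFT = 1 << int(np.ceil(np.log2(len(x) + len(xi) - 1)))
R = np.fft.irfft(np.fft.rfft(x, nFFT) * np.conj(np.fft.rfft(xi, nFFT)), nFFT)
R = R[: len(x) - len(xi) + 1]                # keep valid lags only
# The response peaks where the data matches the template's shape and
# amplitude -- here, at the start of the enriched region.
```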

To detect ‘significant’ hits,

  • We first construct an observed empirical distribution from randomly-sampled genomic blocks. Specifically, we sample \(B\) blocks and record each \(\max(\mathcal{R}_{[b_1]}, \ldots, \mathcal{R}_{[b_K]})\). To reduce leakage and selection bias, the chromosome is split into block-level null/test subsets, with guard zones near split boundaries.

  • Relative maxima in the response sequence, i.e., \(i^*\) such that \(\mathcal{R}_{[i^* - T/2 \,:\, i^* - 1]}\, \leq \, \mathcal{R}_{[i^*]} \, \geq \, \mathcal{R}_{[i^* + 1 \,:\, i^* + T/2]}\), are retained as candidate matches

  • Each candidate is assigned an empirical \(p\)-value based on its upper-tail ECDF under the null subset maxima (with lower bound \(1/(B+1)\)). The split is repeated several times, and per-candidate split-level \(p\)-values are aggregated with a Cauchy combination rule. Those satisfying \(p < \alpha\) are deemed ‘significant’. Note that \(p\)-values are with respect to chromosome-specific empirical distributions.

    • Additional criteria for matching: require the signal values at candidate peaks/matches, \({x}_{[i^*]}\), to exceed a cutoff (matchingParams.minSignalAtMaxima), and/or require the length of the matched feature to exceed a minimum size (matchingParams.minMatchLengthBP).

    • Overlapping/adjacent matches can be merged.
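The Cauchy combination rule used to aggregate split-level \(p\)-values has a simple closed form: transform each \(p\) to a standard Cauchy tail statistic, average, and map back. A sketch (the helper name cauchyCombine is hypothetical; equal weights are assumed):

```python
import numpy as np

def cauchyCombine(pvals, weights=None):
    """Cauchy combination of p-values (ACAT-style)."""
    p = np.asarray(pvals, dtype=float)
    w = np.full(p.shape, 1.0 / p.size) if weights is None else weights
    t = np.sum(w * np.tan((0.5 - p) * np.pi))   # Cauchy tail statistics
    return 0.5 - np.arctan(t) / np.pi           # map back to a p-value

pUniform = cauchyCombine([0.5, 0.5, 0.5])   # no evidence in any split
pMixed = cauchyCombine([0.001, 0.4, 0.6])   # one strong split dominates
```

A notable property: a single very small split-level \(p\)-value can drive the combined value, which suits detection settings where a signal may appear in only some splits.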

Note

Blocked Split + Repeated Combination

  • The response sequence is partitioned into coarse genomic blocks, stratified by robust block summary statistics, then assigned approximately 50/50 into null/test subsets within each stratum.

  • For each split, null maxima are sampled only from null blocks and candidates are evaluated only on test blocks.

  • Split-level candidate \(p\)-values are then combined with a Cauchy combination statistic.

Generic Defaults

matchingParams.templateNames: [haar, haar, db2, db2]
matchingParams.cascadeLevels: [1,2,1,2]
matchingParams.minMatchLengthBP: -1 # select via `consenrich.core.getContextSize`
matchingParams.mergeGapBP: -1
matchingParams.alpha: 0.05
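Templates of this kind can be approximated with the cascade algorithm. The sketch below iterates upsample-and-filter steps for the Haar scaling filter (other templates such as db2 would substitute different filter coefficients); this illustrates the idea, not the package’s exact template construction:

```python
import numpy as np

h = np.array([1.0, 1.0]) / np.sqrt(2.0)      # Haar scaling filter

def cascade(h, levels):
    """Approximate the scaling function by iterated upsample-and-filter."""
    phi = h.copy()
    for _ in range(levels):
        up = np.zeros(2 * len(phi) - 1)
        up[::2] = phi                        # upsample by 2
        phi = np.sqrt(2.0) * np.convolve(up, h)
    return phi

template = cascade(h, levels=2)
template = template / np.linalg.norm(template)   # unit norm for matching
# For Haar, the cascade converges to a box (constant) template.
```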

consenrich.matching.matchWavelet(chromosome, intervals, values, templateNames, cascadeLevels, iters, alpha=0.05, minMatchLengthBP=-1, maxNumMatches=100000, minSignalAtMaxima=0.1, randSeed=42, recenterAtPointSource=True, useScalingFunction=True, excludeRegionsBedFile=None, weights=None, eps=0.001, methodFDR=None)[source]

Detect structured peaks in Consenrich tracks by matching wavelet- or scaling-function–based templates.

Parameters:
  • chromosome (str) – Chromosome name for the input intervals and values.

  • values (npt.NDArray[np.float64]) – A 1D array of signal-like values. In this documentation, we refer to values derived from Consenrich, but other continuous-valued tracks at evenly spaced genomic intervals may be suitable, too.

  • templateNames (List[str]) – A list of str values – each entry references a mother wavelet (or its corresponding scaling function). e.g., [haar, db2]

  • cascadeLevels (List[int]) – Number of cascade iterations used to approximate each template (wavelet or scaling function). Must have the same length as templateNames, with each entry aligned to the corresponding template. e.g., given templateNames [haar, db2], then [2,2] would use 2 cascade levels for both templates.

  • iters (int) – Number of random blocks to sample in the response sequence while building an empirical null to test significance within chromosomes. See cconsenrich.csampleBlockStats().

  • alpha (float) – Primary significance threshold on detected matches. Specifically, the minimum corrected empirical p-value approximated from randomly sampled blocks in the response sequence.

  • minMatchLengthBP (Optional[int]) – Within a window of minMatchLengthBP length (bp), relative maxima in the signal-template convolution must be greater in value than others to qualify as matches. If set to a value less than 1, the minimum length is determined via consenrich.matching.autoMinLengthIntervals(), a simple wrapper around consenrich.core.getContextSize(). If set to None, defaults to 250 bp.

  • minSignalAtMaxima (Optional[str | float]) – Secondary significance threshold coupled with \(\alpha\). Requires the signal value at relative maxima in the response sequence to be greater than a threshold \(\pm \epsilon\).

  • useScalingFunction (bool) – If True, use (only) the scaling function to build the matching template. If False, use (only) the wavelet function.

  • excludeRegionsBedFile (Optional[str]) – A BED file with regions to exclude from matching

  • recenterAtPointSource (bool) – If True, recenter detected matches at the point source (max value)

  • weights (Optional[npt.NDArray[np.float64]]) – Optional weights to apply to values prior to matching. Must have the same length as values.

  • eps (float) – Tolerance parameter for relative maxima detection in the response sequence. Set to zero to enforce strict inequalities when identifying discrete relative maxima.

  • intervals (ndarray[tuple[Any, ...], dtype[int]])

  • maxNumMatches (int | None)

  • randSeed (int)

  • methodFDR (str | None)

Seealso:

consenrich.core.matchingParams, cconsenrich.csampleBlockStats(), and the Algorithm Overview above.

Returns:

A pandas DataFrame with detected matches

Return type:

pd.DataFrame

Notation

  • \(m\) = number of replicate tracks (rows)

  • \(n\) = number of genomic intervals (columns)

  • \(B\) = number of genomic blocks partitioning the main contig (blockCount)

  • \(b(i)\in\{0,\dots,B-1\}\) maps interval \(i\) to block index (intervalToBlockMap)

  • \(z_{[j,i]}\) is the observation for replicate \(j\) at genomic interval \(i\)

  • \(v_{[j,i]}\) is the observation variance for replicate \(j\) at genomic interval \(i\) (consenrich.core.getMuncTrack())

  • \(\mathbf{x}_{[i]}\in\mathbb{R}^2\) is the state vector at interval \(i\), where

    • \(x_{[i,0]}\) denotes the latent signal level at genomic interval \(i\)

    • \(x_{[i,1]}\) denotes the latent signal slope at genomic interval \(i\)

  • \(\mathbf{P}_{[i]}\in\mathbb{R}^{2\times 2}\) is the state covariance matrix at interval \(i\), where

    • \(P_{[i,0,0]}\) is the variance of the signal level at interval \(i\)

    • \(P_{[i,1,1]}\) is the variance of the signal slope at interval \(i\)

    • \(P_{[i,0,1]}=P_{[i,1,0]}\) is the covariance between signal level and slope at interval \(i\)

  • \(\mathbf{F}\) is the process model matrix (matrixF), where

    • \(F_{00}=1\) and \(F_{01}=\Delta_F\)

    • \(F_{10}=0\) and \(F_{11}=1\)

  • \(\mathbf{Q}_0\) is the initial process model noise covariance matrix (matrixQ0)

  • \(\mathbf{H}\) is the observation model matrix (\(\mathbf{H}=[1, 0]\))

  • \(\mathbf{R}_{[i]}=\textsf{diag}\{v_{[:,i]}\}\) is the base observation-noise covariance at interval \(i\)

  • \(\lambda_{[j,i]}\) is the latent precision multiplier for observation \(z_{[j,i]}\) in a Student-t Gaussian scale-mixture model

  • \(\kappa_{[i]}\) is the latent precision multiplier for the process transition at interval \(i\) in a Student-t Gaussian scale-mixture model

  • \(\nu_R\) is the degrees-of-freedom parameter controlling the heaviness of Student-t tails for observation noise

  • \(\nu_Q\) is the degrees-of-freedom parameter controlling the heaviness of Student-t tails for process noise

  • \(r_b\) is the block-level observation noise scale multiplier for block \(b\)

  • \(q_b\) is the block-level process noise scale multiplier for block \(b\)

  • \(a_j\) is the replicate-level observation variance scale multiplier for replicate \(j\)

  • \(b_j\) is the replicate-level additive observation offset for replicate \(j\)

The effective observation model used in EM is:

\[z_{[j,i]} = x_{[i,0]} + b_j + \varepsilon_{[j,i]}, \qquad \mathrm{Var}(\varepsilon_{[j,i]}) = \frac{a_j\,r_{b(i)}\,v_{[j,i]}}{\lambda_{[j,i]}}.\]

Resolution of scaling terms:

  • \(v_{[j,i]}\) and \(\lambda_{[j,i]}\) operate at replicate-interval resolution

  • \(r_b\) and \(q_b\) operate at block resolution via \(b(i)\)

  • \(a_j\) and \(b_j\) operate at replicate resolution
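Putting the notation together for a single observation (toy values, purely illustrative):

```python
# Effective variance of eps[j, i] = a_j * r_{b(i)} * v_{j,i} / lambda_{j,i}
aJ, bJ = 1.2, 0.3   # replicate-level scale a_j and additive offset b_j
rB = 0.8            # block-level observation scale r_{b(i)}
v = 2.0             # data-derived variance v[j, i]
lam = 0.5           # Student-t precision multiplier (downweights outliers)

effVar = aJ * rB * v / lam   # variance of eps[j, i]

# The modeled observation mean is x[i, 0] + b_j:
xLevel = 4.0
zMean = xLevel + bJ
```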