ROCCO: [R]obust [O]pen [C]hromatin Detection via [C]onvex [O]ptimization

What

ROCCO is an efficient algorithm for detection of “consensus peaks” in large datasets comprising multiple HTS samples (e.g., ATAC-seq), where an enrichment in read counts/densities is observed in a nontrivial subset of samples.

Input/Output

  • Input: Samples’ BAM alignments or BigWig tracks

  • Output: BED file of consensus peak regions

Note: if BigWig input is used, no alignment-level preprocessing options can be applied.

How

ROCCO models consensus peak calling as a constrained optimization problem with an upper-bound on the total proportion of the genome selected as open/accessible and a fragmentation penalty to promote spatial consistency in active regions and sparsity elsewhere.
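
In rough terms (a sketch, not the paper’s exact formulation), with per-position scores \(\mathcal{S}(i)\) (see the scoring functions in the API reference below), relaxed decision variables \(\ell_i \in [0,1]\), budget \(b\), and fragmentation coefficient \(\gamma\), the relaxed problem takes a form like:

\[\max_{\boldsymbol{\ell} \in [0,1]^n} \sum_{i=1}^{n} \mathcal{S}(i)\,\ell_i - \gamma \sum_{i=1}^{n-1} |\ell_i - \ell_{i+1}| \quad \text{subject to} \quad \sum_{i=1}^{n} \ell_i \leq \lfloor nb \rfloor\]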

Why

ROCCO offers several attractive features:

  1. Consideration of enrichment and spatial characteristics of open chromatin signals

  2. Scaling to large sample sizes (100+) with an asymptotic time complexity independent of sample size

  3. No required training data or a heuristically determined set of initial candidate peak regions

  4. No rigid thresholds on the minimum number/width of supporting samples/replicates

  5. Mathematically tractable model permitting worst-case analysis of runtime and performance

Example Behavior

Input

  • ENCODE lymphoblastoid data (BEST5, WORST5): 10 real ATAC-seq alignments of varying TSS enrichment (SNR-like)

  • Synthetic noisy data (NOISY5)

We run ROCCO twice, once without the noisy samples and once with them, to compare results.

rocco -i *.BEST5.bam *.WORST5.bam -g hg38 -o rocco_output_without_noise.bed
rocco -i *.BEST5.bam *.WORST5.bam *.NOISY5.bam -g hg38 -o rocco_output_with_noise.bed

Note: users may run ROCCO with the --narrowPeak flag to generate 10-column output with various statistics for comparison.

Output

  • ROCCO is unaffected by the NOISY5 samples and effectively identifies true signal across multiple samples

  • ROCCO simultaneously detects both wide and narrow consensus peaks

Paper/Citation

If using ROCCO in your research, please cite the original paper in Bioinformatics (DOI: 10.1093/bioinformatics/btad725):

Nolan H Hamilton, Terrence S Furey, ROCCO: a robust method for detection of open chromatin via convex optimization,
Bioinformatics, Volume 39, Issue 12, December 2023

Documentation

For additional details, usage examples, etc. please see ROCCO’s documentation: https://nolan-h-hamilton.github.io/ROCCO/

Installation

PyPI (pip)

python -m pip install rocco --upgrade

If you lack administrative privileges, you may need to append --user to the above command.

Build from Source

If preferred, ROCCO can easily be built from source:

  • Clone or download this repository

    git clone https://github.com/nolan-h-hamilton/ROCCO.git
    cd ROCCO
    python setup.py sdist bdist_wheel
    pip install -e .
    

System-Level Dependencies:

ROCCO utilizes the popular bioinformatics tools Samtools (http://www.htslib.org) and bedtools (https://bedtools.readthedocs.io/en/latest/).

If these are not already available, you can install them system-wide with a package manager, e.g.:

For Homebrew (macOS):

brew install samtools
brew install bedtools

For Ubuntu/Debian:

sudo apt-get install samtools
sudo apt-get install bedtools

Conda:

conda install bioconda::bedtools
conda install bioconda::samtools

or build them from source (see Samtools (http://www.htslib.org) and bedtools (https://bedtools.readthedocs.io/en/latest/)).

As a supplementary resource, see docs/environments in the GitHub repository for environment files with detailed package version information.

chrom_solution_to_bed(chromosome, intervals, solution, ID=None, check_gaps_intervals=True, min_length_bp=None) → str

Convert the ROCCO-generated vector of decision variables for a given chromosome to a BED file

Parameters:
  • chromosome (str) – Chromosome name

  • intervals (np.ndarray) – Intervals for the chromosome

  • solution (np.ndarray) – Solution vector for the chromosome

  • ID (str) – Unique identifier for the solution

  • check_gaps_intervals (bool) – Check if intervals are contiguous and fixed width

  • min_length_bp (int) – Minimum length of a region to be included in the output BED file

Returns:

Output BED file

Return type:

str
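
For illustration, a hypothetical call, assuming the function is exposed at the package level as in the parsig() example below; the 50 bp interval width and chromosome are arbitrary stand-ins:

import numpy as np
import rocco  # assumed package-level exposure, as in the parsig() example

intervals = np.arange(0, 50 * 200, 50)  # 200 fixed-width 50 bp intervals
solution = np.zeros(200)
solution[40:60] = 1.0                   # one contiguous selected region -> one BED feature
bed_file = rocco.chrom_solution_to_bed('chr21', intervals, solution, ID='demo', min_length_bp=100)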

combine_chrom_results(chrom_bed_files: list, output_file: str, name_features: bool = False) → str

Combine the results from individual chromosome solutions into a single BED file after running ROCCO on each chromosome

Parameters:
  • chrom_bed_files (list) – List of BED files for each chromosome

  • output_file (str) – Output BED file

  • name_features (bool) – Name the features in the output BED file

Returns:

Output BED file

Return type:

str

cscores_quantiles(chrom_scores: numpy.ndarray, quantiles: numpy.ndarray | None = None, add_newlines=True) → str

Returns a formatted string of quantiles for a given array of ‘locus scores’ (score_chrom_linear())

Parameters:
  • chrom_scores (np.ndarray) – locus scores (float) for a given chromosome

  • quantiles (np.ndarray, optional) – array of quantiles in [0.0,1.0] to compute.

Returns:

Formatted string of quantiles

Return type:

str

See also: score_chrom_linear()

get_floor_eps_sol(chrom_lp_sol: numpy.ndarray, budget: float, int_tol: float = 1e-06, eps_mult: float = 1.01) → numpy.ndarray

Compute the floor_eps heuristic from the relaxation

Adds a small \(\epsilon\) to each decision variable before applying the floor function. This addresses solutions in the interior of the feasible region near integer corner-point solutions. The floor_eps approach is a crude heuristic for obtaining an initial integer-feasible solution from the LP relaxation, but it is well-suited as an initial reference point for the ROCCO-RR heuristic. In many cases, the relaxed LP solution is nearly integral, and the floor_eps heuristic and randomized ROCCO-RR procedure have little effect.

Parameters:
  • chrom_lp_sol (np.ndarray) – Solution vector from the LP relaxation

  • budget (float) – \(b\) upper bounds the proportion of the chromosome that can be selected as open/accessible

  • int_tol (float) – If a decision variable is within int_tol of 0 or 1, it is considered integral

  • eps_mult (float) – Value by which to divide the initial epsilon value

Returns:

Initial integer-feasible solution

Return type:

np.ndarray
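
For intuition, a minimal sketch of the floor_eps idea consistent with the parameter descriptions above (the library routine may differ in details such as how the initial \(\epsilon\) is chosen):

import numpy as np

def floor_eps_sketch(chrom_lp_sol, budget, eps=0.01, eps_mult=1.01, max_iter=100):
    # Round up any variable within eps of the next integer; if the budget is
    # violated, divide eps by eps_mult and retry (cf. eps_mult above).
    n = len(chrom_lp_sol)
    for _ in range(max_iter):
        candidate = np.floor(chrom_lp_sol + eps)
        if candidate.sum() <= np.floor(budget * n):
            return candidate
        eps /= eps_mult
    return np.floor(chrom_lp_sol)  # plain floor always satisfies the budget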

get_rround_sol(chrom_lp_sol, scores, budget, gamma, rand_iter=1000, int_tol=1e-06, eps_mult: float = 1.01) → Tuple[numpy.ndarray, float]

Get the ROCCO-RR solution from the solution to the relaxation

Parameters:
  • chrom_lp_sol (np.ndarray) – Solution vector from the LP relaxation

  • scores (np.ndarray) – Scores for each genomic position within a given chromosome

  • budget (float) – \(b\) upper bounds the proportion of the chromosome that can be selected as open/accessible

  • gamma (float) – \(\gamma\) is the coefficient for the fragmentation penalty used to promote spatial consistency in distinct open genomic regions and sparsity elsewhere.

  • rand_iter (int) – Number of randomizations to obtain the ROCCO-RR solution

  • int_tol (float) – If a decision variable is within int_tol of 0 or 1, it is considered integral and ignored in the randomization step

  • eps_mult (float) – Value by which to divide the initial get_floor_eps_sol() epsilon value

Returns:

ROCCO-RR solution and optimal value

Return type:

Tuple[numpy.ndarray, float]
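
A hypothetical sketch of the randomized-rounding loop described above: Bernoulli draws on fractional entries with success probability equal to the LP value, keeping the best budget-feasible solution over rand_iter draws (the library implementation may differ):

import numpy as np

def rr_sketch(chrom_lp_sol, scores, budget, gamma, rand_iter=1000, int_tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    n = len(chrom_lp_sol)
    frac = (chrom_lp_sol > int_tol) & (chrom_lp_sol < 1.0 - int_tol)  # near-integral entries are fixed

    def objective(x):  # score term minus fragmentation penalty, per the model description
        return float(scores @ x - gamma * np.abs(np.diff(x)).sum())

    best_sol, best_val = np.round(chrom_lp_sol), -np.inf
    for _ in range(rand_iter):
        x = np.round(chrom_lp_sol)
        x[frac] = (rng.random(frac.sum()) < chrom_lp_sol[frac]).astype(float)
        if x.sum() <= np.floor(budget * n) and objective(x) > best_val:
            best_sol, best_val = x, objective(x)
    return best_sol, best_val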

get_warm_idx(scores, budget, gamma, warm_thresh=None) → Tuple[numpy.ndarray, numpy.ndarray, float]

Prior to solving, identify ‘warm’ indices: those with scores greater than the worst-case fragmentation penalty that could be incurred by their selection assuming integrality (2*gamma). Note that this threshold loses meaning if gamma is scaled.

Parameters:
  • scores (np.ndarray) – Scores for each genomic position within a given chromosome

  • budget (float) – \(b\) upper bounds the proportion of the chromosome that can be selected as open/accessible

  • gamma (float) – \(\gamma\) is the coefficient for the fragmentation penalty used to promote spatial consistency in distinct open genomic regions and sparsity elsewhere. The ideal value of gamma for a given dataset depends on the user’s preference. Increase it to encourage more ‘block-like’ solutions.

  • warm_thresh (float) – Threshold for warm indices. If None, the threshold is set to 2*gamma

Returns:

Warm indices, warm scores, and the proportion of warm indices to the maximum number of selections

Return type:

Tuple[np.ndarray, np.ndarray, float]

minmax_scale_scores(vector: numpy.ndarray, min_val: float, max_val: float) → numpy.ndarray

Scale a vector (scores) to the range [min_val,max_val]

Note that even if data is already in the range [min_val,max_val], this function will still scale the data such that the minimum value is min_val and the maximum value is max_val.

Parameters:
  • vector (np.ndarray) – Vector to scale

  • min_val (float) – minimum value of the scaled vector (should be near-zero for gamma-scaling)

  • max_val (float) – maximum value of the scaled vector

Returns:

Scaled vector

Return type:

np.ndarray
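
The scaling is the standard affine map; a minimal sketch:

import numpy as np

def minmax_sketch(vector, min_val, max_val):
    # affine map sending min(vector) -> min_val and max(vector) -> max_val
    v = np.asarray(vector, dtype=float)
    return min_val + (max_val - min_val) * (v - v.min()) / (v.max() - v.min())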

parsig(scores, gamma=None, parsig_B=None, parsig_M=None, parsig_R=None, scale_quantile: float = 0.5, remove_min=True) → numpy.ndarray

Applies a smooth step function mapping input scores to [0,parsig_M)

This transformation of scores can be used to promote integral solutions for the relaxed optimization: the relative gradients for roughly the top 100*(1 - parsig_B) percent of scores are amplified, which pushes these decision variables closer to their integral bounds in the optimal LP solution. The remaining scores below this inflection point are pushed toward zero such that their respective decision variables’ gradients are relatively small. Ideally, this yields a more efficient optimization procedure that is also less likely to produce an excess of fractional decision variables.

For each value \(x\) in the input array scores, the transformation is given by:

\[\mathsf{Parsig}(x) = \frac{M}{1 + \exp(-R(x - B_q))}\]

where

\[B_q = \mathsf{Quantile}(scores, q = parsig\_B)/2\]

so that the inflection point occurs at quantile(scores, parsig_B)/2.

Implementation Notes:

  • If parsig_B is None, it defaults to 0.95.

  • parsig_R determines the ‘steepness’ of the function, approaching a discontinuous step function in the limit (only two distinct values). If parsig_R is None, it defaults to 2.0. Setting very large parsig_R and parsig_M may result in trivial solutions.

  • parsig_M determines the upper bound of the transformed scores. If parsig_M is None, it is determined using the observed distribution of scores and \(\gamma\).

Example:

>>> import rocco
>>> import numpy as np
>>> np.set_printoptions(formatter={'float_kind': lambda x: '{0:.3f}'.format(x)})
>>> a = np.zeros(50)
>>> # First 25 elements ~ Poisson(λ=1)
>>> a[:25] = sorted(np.random.poisson(1,size=25))
>>> # Second 25 elements ~ Poisson(λ=10)
>>> a[25:] = sorted(np.random.poisson(10,size=25))
>>> a # before parsig rescaling
array([0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000,
    0.000, 0.000, 0.000, 1.000, 1.000, 1.000, 1.000, 1.000, 1.000,
    1.000, 2.000, 2.000, 2.000, 3.000, 4.000, 4.000, 5.000, 6.000,
    6.000, 7.000, 7.000, 8.000, 8.000, 9.000, 9.000, 10.000, 10.000,
    10.000, 11.000, 12.000, 12.000, 12.000, 12.000, 13.000, 13.000,
    14.000, 14.000, 15.000, 15.000, 16.000, 16.000])
>>> rocco.parsig(a) # after parsig rescaling
rocco.parsig -  INFO - Parsig transformation applied with M=11.0, B=15.0, R=2.0
array([0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000,
    0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000,
    0.000, 0.000, 0.000, 0.000, 0.001, 0.010, 0.010, 0.074, 0.522,
    0.522, 2.958, 2.958, 8.042, 8.042, 10.478, 10.478, 10.926, 10.926,
    10.926, 10.990, 10.999, 10.999, 10.999, 10.999, 11.000, 11.000,
    11.000, 11.000, 11.000, 11.000, 11.000, 11.000])
Parameters:
  • scores (np.ndarray) – Scores for each genomic position within a given chromosome. Assumed to be scores generated with score_chrom_linear(), but can work for others as well.

  • gamma (float) – \(\gamma\) is the coefficient for the fragmentation penalty (See solve_relaxation_chrom_pdlp() or solve_relaxation_chrom_glop())

  • parsig_M (float) – Upper bound (sup.) of scores under the transformation. Determined with scale_quantile and gamma if parsig_M is None.

  • parsig_B (float) – Defined such that an inflection occurs at quantile(scores, parsig_B)/2. For a given parsig_R, larger parsig_B value will result in more scores being pushed toward the minimum value of the transformed scores. Defaults to 0.95 if None.

  • parsig_R (float) – Scales the rate of change around the inflection point at quantile(scores, parsig_B)/2. Higher values of parsig_R will result in a steeper sigmoid curve, approaching a step function in the limit with two distinct values. Defaults to 2.0 if None.

  • scale_quantile (float) – Not used in this version.

  • remove_min (bool) – If True, subtract the minimum value of the transformed scores to ensure values are non-negative

Returns:

transformed scores

Return type:

np.ndarray

resolve_config(args)

Resolve command-line arguments with a JSON configuration file

Parameters:

args (dict) – Command-line arguments obtained with argparse

Returns:

Resolved command-line arguments

Return type:

dict

Note

  • Arguments specified explicitly in the JSON file modify/override the corresponding command-line arguments

  • For boolean arguments, use true or false in the JSON file rather than True or False

Example JSON config file

{
    "input_files": ["sample1.bam", "sample2.bam"],
    "output": "rocco_peaks_output.bed",
    "genome": "hg38",
    "chroms": ["chr21", "chr22"],
    "int_tol": 0.01,
    "verbose": true,
    "solver": "pdlp"
}

You can then run rocco --config config.json […].

resolve_transformations(args: dict)

Check whether both savgol and medfilt transformations are invoked via the CLI; resolve to medfilt if so.

Parameters:

args (dict) – Command-line arguments

Returns:

Resolved command-line arguments

Return type:

dict

score_boundary_chrom(signal_vector: numpy.ndarray, denom: float = 1.0, power: float = 1.0) → numpy.ndarray

Calculate boundary scores for a chromosome’s signal (maximum absolute difference in either direction)

Parameters:
  • signal_vector (np.ndarray) – Signal vector for a given chromosome (assumed to contain central tendency statistics)

  • denom (float) – Denominator for boundary stats

  • power (float) – Power to raise the boundary stats to

Returns:

Boundary scores

Return type:

np.ndarray
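
One hypothetical reading of ‘maximum absolute difference in either direction’ as code (edge positions compare only to their single neighbor):

import numpy as np

def boundary_sketch(signal_vector, denom=1.0, power=1.0):
    left = np.abs(np.diff(signal_vector, prepend=signal_vector[0]))   # |x_i - x_{i-1}|
    right = np.abs(np.diff(signal_vector, append=signal_vector[-1]))  # |x_i - x_{i+1}|
    return (np.maximum(left, right) / denom) ** power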

score_central_tendency_chrom(chrom_matrix, method='quantile', quantile=0.5, tprop=0.05, power=1.0) → numpy.ndarray

Calculate the central tendency of read measures across samples (rows) at each genomic position (column) in a given chrom.

Parameters:
  • chrom_matrix (np.ndarray) – Matrix of values for a given chromosome

  • method (str) – Method to calculate central tendency. Options are ‘quantile’, ‘tmean’ (trimmed-mean), and ‘mean’. Default is quantile, q=.50 (median)

  • quantile (float) – Quantile to use if method is ‘quantile’

  • tprop (float) – Proportion of values to trim if method is ‘tmean’

  • power (float) – Power to raise the central tendency value to

Returns:

Central tendency score

Return type:

np.ndarray

score_chrom_linear(central_tendency_vec: numpy.ndarray, dispersion_vec: numpy.ndarray, boundary_vec: numpy.ndarray, gamma=None, c_1=1.0, c_2=-1.0, c_3=1.0, eps_neg=-0.001, transform_parsig=False, parsig_B=None, parsig_M=None, parsig_R=None) → numpy.ndarray

Compute the array of default scores \((\mathbf{G}\boldsymbol{\alpha})^{\top}\), where \(\mathbf{G}\) is the \(n \times 3\) matrix of central tendency scores, dispersion scores, and boundary scores for a given chromosome and \(\boldsymbol{\alpha}\) is the length-3 vector of coefficients for each.

Parameters:
  • central_tendency_vec (np.ndarray) – Central tendency scores for a given chromosome

  • dispersion_vec (np.ndarray) – Dispersion scores for a given chromosome

  • boundary_vec (np.ndarray) – Boundary scores for a given chromosome

  • gamma (float) – \(\gamma\) is the coefficient for the fragmentation penalty used to promote spatial consistency in distinct open genomic regions and sparsity elsewhere. The ideal value of gamma for a given dataset depends on the user’s preference. Increase it to encourage more ‘block-like’ solutions.

  • c_1 (float) – Coefficient for central tendency scores (\(c_1 g_1(i),~i=1,\ldots,n\) in the paper)

  • c_2 (float) – Coefficient for dispersion scores (\(c_2 g_2(i),~i=1,\ldots,n\) in the paper)

  • c_3 (float) – Coefficient for boundary scores (\(c_3 g_3(i),~i=1,\ldots,n\) in the paper)

  • transform_parsig (bool) – If True, apply the parametric sigmoid transformation to the scores: parsig()

  • parsig_M (float) – Upper bound (sup.) of scores under the transformation

  • parsig_B (float) – An inflection point occurs in the transformation at quantile(scores, parsig_B)/2. For a given parsig_R, larger parsig_B value will result in more scores being pushed toward the minimum value of the transformed scores.

  • parsig_R (float) – Scales the rate of change around the inflection point at quantile(scores, parsig_B)/2. Higher values of parsig_R will result in a steeper sigmoid curve, approaching a step function in the limit with two distinct values.

  • eps_neg (float) – Negative epsilon value to add to the scores

Returns:

chromosome scores

Return type:

np.ndarray
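
A minimal sketch of the \(\mathbf{G}\boldsymbol{\alpha}\) composition with the default coefficients, omitting the optional parsig transformation:

import numpy as np

def linear_score_sketch(ct_vec, disp_vec, bound_vec, c_1=1.0, c_2=-1.0, c_3=1.0, eps_neg=-0.001):
    G = np.column_stack([ct_vec, disp_vec, bound_vec])  # n x 3 matrix of component scores
    alpha = np.array([c_1, c_2, c_3])
    return G @ alpha + eps_neg  # eps_neg nudges otherwise-zero scores slightly negative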

score_dispersion_chrom(chrom_matrix: numpy.ndarray, method: str = 'mad', rng=(25, 75), tprop=0.05, power: float = 1.0) → numpy.ndarray

Calculate the dispersion of read measures across samples (rows) at each genomic position (column) in a given chrom.

Parameters:
  • chrom_matrix (np.ndarray) – Matrix of values for a given chromosome

  • method (str) – Method to calculate dispersion. Options are ‘mad’, ‘iqr’, ‘tstd’ (trimmed std.), and ‘std’. Default is ‘mad’

  • tprop (float) – Proportion of values to trim if method is ‘tstd’

  • rng (Tuple[int,int]) – Range of quantiles to use if method is ‘iqr’

  • power (float) – Power to raise the dispersion value to

Returns:

Dispersion scores

Return type:

np.ndarray
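
For illustration, the default summaries for these two scoring functions (‘quantile’ with q=0.5 and ‘mad’) can be approximated with numpy/scipy, where chrom_matrix is samples × positions:

import numpy as np
from scipy.stats import median_abs_deviation

chrom_matrix = np.random.poisson(5, size=(10, 1000)).astype(float)  # 10 samples x 1000 positions
central = np.quantile(chrom_matrix, 0.5, axis=0)         # per-position median across samples
dispersion = median_abs_deviation(chrom_matrix, axis=0)  # per-position MAD across samples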

solve_relaxation_chrom_glop(scores, budget: float = 0.035, gamma: float = 1.0, beta: float | None = None, denom_const: float | None = None, glop_parameters=None, glop_dual_feasibility_tolerance: float = 1e-06, glop_primal_feasibility_tolerance: float = 1e-06, glop_use_scaling: bool = True, glop_initial_basis: str = 'TRIANGULAR', glop_push_to_vertex: bool = True, glop_use_dual_simplex=None, glop_allow_simplex_algorithm_change: bool = True, scale_gamma: bool = False, hint=None, threads: int = 0, verbose: bool = False, save_model: str | None = None) → Tuple[numpy.ndarray, float]

Solve the relaxation for a specific chromosome using the simplex-based method, glop

See the OR-Tools linear programming resources and documentation.

A simplex-based method that yields corner-point feasible (CPF, vertex) solutions.

Parameters:
  • scores (np.ndarray) – Scores for each genomic position within a given chromosome

  • budget (float) – \(b\) upper bounds the proportion of the chromosome that can be selected as open/accessible

  • gamma (float) – \(\gamma\) is the coefficient for the fragmentation penalty used to promote spatial consistency in distinct open genomic regions and sparsity elsewhere. The ideal value of gamma for a given dataset depends on the user’s preference. Increase it to encourage more ‘block-like’ solutions.

  • beta (float) – Exponent for the denominator in the fragmentation penalty. If None, defaults to 1/2 when scale_gamma is True.

  • denom_const (float) – Constant to add to the denominator in the fragmentation penalty. If None, defaults to 0.5 when scale_gamma is True.

  • scale_gamma (bool) – If True, scale the fragmentation penalty (\(\gamma\)) positionally based on the difference between adjacent scores: \(\gamma/(|\mathrm{Score}_i - \mathrm{Score}_{i+1}| + \epsilon)^{\beta}\). This yields a more nuanced fragmentation penalty that does not discourage differences in adjacent decision variables if their corresponding scores are dissimilar and thus reflect a true change in state that should affect the solution. See beta and denom_const for more details.

  • glop_parameters (parameters_pb2.GlopParameters) – glop-specific protocol buffer. If this is not None, the explicit solver arguments in this function definition are ignored. See https://protobuf.dev for more information on protocol buffers and parameters.proto for Google’s glop-specific protocol buffer.

  • glop_dual_feasibility_tolerance (float) – Dual feasibility tolerance for glop.

  • glop_primal_feasibility_tolerance (float) – Primal feasibility tolerance for glop.

  • glop_use_scaling (bool) – Use scaling for glop. If True, the scaling method is set to EQUILIBRATION and the cost scaling is set to CONTAIN_ONE_COST_SCALING. Recommended for most cases unless the user is confident the problem is not poorly-scaled. Future releases of ROCCO should support additional options.

  • glop_initial_basis (str) – Determines which of the glop-supported heuristics to identify an initial basis is executed. Options are ‘NONE’, ‘TRIANGULAR’, and ‘MAROS’. ‘TRIANGULAR’ is the default and recommended option.

  • threads (int) – Number of threads to use for optimization. Default is 0 (use default number of threads defined in solvers.proto). If threads is negative, the number of threads used will be the maximum of the number of available CPUs minus 1, or 1 in the case of a single-core machine.

  • verbose (bool) – Enable verbose output

  • save_model (str) – Save the model as an MPS file

  • glop_push_to_vertex (bool)

  • glop_allow_simplex_algorithm_change (bool)

Returns:

Solution vector and optimal value

Return type:

Tuple[np.ndarray, float]

solve_relaxation_chrom_pdlp(scores, budget: float = 0.035, gamma: float = 1.0, beta: float | None = None, denom_const: float | None = None, pdlp_proto=None, pdlp_presolve_options_use_glop: bool = True, pdlp_termination_criteria_eps_optimal_absolute: float = 1e-08, pdlp_termination_criteria_eps_optimal_relative: float = 1e-08, pdlp_use_low_precision: bool = False, scale_gamma: bool = False, hint=None, threads: int = 0, verbose: bool = False, save_model: str | None = None) → Tuple[numpy.ndarray, float]

Solve the relaxation for a specific chromosome using the first-order method, pdlp

See the full paper for PDLP (Applegate et al., 2021)

Parameters:
  • scores (np.ndarray) – Scores for each genomic position within a given chromosome

  • budget (float) – \(b\) upper bounds the proportion of the chromosome that can be selected as open/accessible

  • gamma (float) – \(\gamma\) is the coefficient for the fragmentation penalty used to promote spatial consistency in distinct open genomic regions and sparsity elsewhere. The ideal value of gamma for a given dataset depends on the user’s preference. Increase it to encourage more ‘block-like’ solutions.

  • beta (float) – Exponent for the denominator in the fragmentation penalty. If None, defaults to 1/2 when scale_gamma is True.

  • denom_const (float) – Constant to add to the denominator in the fragmentation penalty. If None, defaults to 1.0 when scale_gamma is True.

  • scale_gamma (bool) – If True, scale the fragmentation penalty (\(\gamma\)) positionally based on the difference between adjacent scores: \(\gamma/(|\mathrm{Score}_i - \mathrm{Score}_{i+1}| + \epsilon)^{\beta}\). This yields a more nuanced fragmentation penalty that does not discourage differences in adjacent decision variables if their corresponding scores are dissimilar and thus reflect a true change in state that should affect the solution. See beta and denom_const for more details.

  • pdlp_proto (solvers_pb2.PrimalDualHybridGradientParams) – pdlp-specific protocol buffer. If this is not None, the explicit solver arguments in this function definition are ignored. See https://protobuf.dev for more information on protocol buffers and solvers.proto for Google’s pdlp-specific protocol buffer.

  • pdlp_presolve_options_use_glop (bool) – Use glop’s presolve routines but solve with pdlp. Recommended for most cases unless computational resources are limited or the user has already formulated the problem to exploit context-specific structure.

  • pdlp_termination_criteria_eps_optimal_absolute (float) – Appealing to strong duality for LPs, the duality gap (difference between the primal objective function and the dual objective function at a given iteration) must be less than this value plus a scaling of the primal/dual objective values (See solvers.proto linked above for exact details). If computational resources are limited, consider using 1.0e-4 per the ortools documentation.

  • pdlp_termination_criteria_eps_optimal_relative (float) – Relative termination criterion for the pdlp solver. If computational resources are limited, consider using 1.0e-4 per the ortools documentation.

  • pdlp_use_low_precision (bool) – Use “loose solve”/low precision mode. This will override the termination arguments to weakened criteria (1.0e-4).

  • threads (int) – Number of threads to use for optimization. Default is 0 (use default number of threads defined in solvers.proto). If threads is negative, the number of threads used will be the maximum of the number of available CPUs minus 1, or 1 in the case of a single-core machine.

  • verbose (bool) – Enable verbose output

  • save_model (str) – Save the model as an MPS file

Returns:

Solution vector and optimal value

Return type:

Tuple[np.ndarray, float]
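
A hypothetical end-to-end call for a single chromosome’s scores, assuming package-level exposure as in the parsig() example (the scores here are random stand-ins):

import numpy as np
import rocco

scores = np.random.gamma(2.0, 2.0, size=10_000)  # stand-in locus scores
lp_sol, lp_val = rocco.solve_relaxation_chrom_pdlp(scores, budget=0.035, gamma=1.0)
rr_sol, rr_val = rocco.get_rround_sol(lp_sol, scores, budget=0.035, gamma=1.0)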

ROCCO: Readtracks

Basic suite of functions for calculating read density tracks from BAM files and collating data for downstream analysis.

apply_filter(intervals: numpy.ndarray, count_matrix: numpy.ndarray, filter_type: str | None = None, savgol_window_bp: int | None = None, savgol_window_steps: int = 5, savgol_order: int | None = None, medfilt_kernel_bp: int | None = None, medfilt_kernel_steps: int = 5) → Tuple[numpy.ndarray, numpy.ndarray]

Apply filtering to the count matrix.

Parameters:
  • intervals (np.ndarray) – The genomic intervals.

  • count_matrix (np.ndarray) – The count matrix.

  • filter_type (str) – The type of filter to apply. Options are ‘savitzky-golay’ or ‘median’.

  • savgol_window_bp (int) – Window size for the Savitzky-Golay filter in base pairs.

  • savgol_window_steps (int) – Window size for the Savitzky-Golay filter in steps.

  • savgol_order (int) – Polynomial order for the Savitzky-Golay filter.

  • medfilt_kernel_bp (int) – Kernel size for the median filter in base pairs.

  • medfilt_kernel_steps (int) – Kernel size for the median filter in steps.

Return type:

Tuple[numpy.ndarray, numpy.ndarray]
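
The filter options mirror standard scipy routines; a hypothetical row-wise application with window/kernel sizes in steps:

import numpy as np
from scipy.signal import savgol_filter, medfilt

count_matrix = np.random.poisson(5, size=(4, 1000)).astype(float)  # 4 samples x 1000 steps
smoothed = savgol_filter(count_matrix, window_length=5, polyorder=2, axis=1)  # 'savitzky-golay'
median_filtered = np.apply_along_axis(medfilt, 1, count_matrix, 5)            # 'median', kernel=5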

apply_transformation(intervals: numpy.ndarray, count_matrix: numpy.ndarray, transform_log_pc: bool = False, log_const: float | None = None, transform_local_ratio: bool = False, local_ratio_window_bp: int | None = None, local_ratio_window_steps: int | None = None, local_ratio_pc: float | None = None, nonnegative: bool = True) → Tuple[numpy.ndarray, numpy.ndarray]

Transform the count matrix after computing read densities.

Parameters:
  • intervals (np.ndarray) – The genomic intervals.

  • count_matrix (np.ndarray) – The count matrix.

  • transform_log_pc (bool) – If True, apply a log2(x + c) transformation to data.

  • log_const (float) – Constant to add before taking the log.

  • transform_local_ratio (bool) – If True, apply local ratio transformation to the values: x/(median_filter(x,row,kernel=local_ratio_window_bp) + ε).

  • local_ratio_window_bp (int) – Window size for the local ratio transformation in base pairs.

  • local_ratio_window_steps (int) – Window size for the local ratio transformation in steps.

  • local_ratio_pc (float) – Constant to add to the local reference before taking the ratio.

  • nonnegative (bool) – If True, the transformation will compute enrichment with respect to a non-negative reference. Otherwise, the transformation will allow negative reference values to divide the signal values.

Returns:

A tuple containing two numpy arrays. The first array contains genomic intervals, and the second is a matrix (2d array) containing the corresponding values.

Return type:

tuple(numpy.ndarray, numpy.ndarray)
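
For reference, a hypothetical row-wise version of the two transformations described above (the constants are stand-ins):

import numpy as np
from scipy.ndimage import median_filter

x = np.random.poisson(5, size=(4, 1000)).astype(float)
log_t = np.log2(x + 1.0)                     # transform_log_pc with log_const=1.0
local_ref = median_filter(x, size=(1, 101))  # per-row local reference (window in steps)
ratio_t = x / (local_ref + 0.5)              # transform_local_ratio with local_ratio_pc=0.5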

decompress_features_vals(intervals: numpy.ndarray, vals: numpy.ndarray, step: int)

Reformat data from BedGraph-like intervals to fixed-width, contiguous intervals. To avoid redundancy, methods generating BedGraph-like output often create a single entry for repeated values over multiple adjacent intervals. For the use cases in scope of this module, the savings achieved by this compressed representation are of little practical value, and the representation can be awkward to work with downstream, so this function decompresses the intervals and values.

Parameters:
  • intervals (np.ndarray) – list-like of integer values corresponding to genomic positions

  • vals (np.ndarray) – list-like of floats representing the measured signal value at each interval

  • step (int) – integer specifying the step size in base pairs. ‘step’ is used synonymously with ‘bin size’.

Returns:

A tuple of np.ndarray arrays: the first is the gap-filled intervals, the second is the signal values over the new intervals

Return type:

tuple

See also: get_chrom_reads(), generate_chrom_matrix()
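
A minimal sketch of the decompression, assuming intervals holds the (sorted) start coordinates of the compressed runs:

import numpy as np

def decompress_sketch(intervals, vals, step):
    starts = np.arange(intervals[0], intervals[-1] + step, step)  # fixed-width, contiguous
    idx = np.searchsorted(intervals, starts, side='right') - 1    # run containing each start
    return starts, np.asarray(vals)[idx]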

generate_bigwig(bam_file: str, step: int = 50, effective_genome_size: float = -1, norm_method: str = 'RPGC', min_mapping_score: int = -1, flag_include: int = 66, flag_exclude: int = 1284, extend_reads: int = -1, center_reads: bool = False, ignore_for_norm: list = ['chrX', 'chrY', 'chrM'], scale_factor: float = 1.0, ouput_file: str | None = None, num_processors: int = -1, overwrite: bool = False) → str

Generate a BigWig file from a BAM file.

Parameters:
  • bam_file (str) – Path to the BAM file.

  • step (int) – Step size for the intervals. Used synonymously with ‘bin size’ throughout this documentation.

  • effective_genome_size (float) – Effective genome size for normalization. Required for RPGC normalization.

  • norm_method (str) – Normalization method. Must be one of: RPGC, RPKM, CPM, BPM. See deeptools documentation for more information.

  • min_mapping_score (int) – Minimum mapping score for reads to be included in the count track.

  • flag_include (int) – SAM flag to include in the count track. Equivalent to -f in samtools view.

  • flag_exclude (int) – SAM flag to exclude from the count track. Equivalent to -F in samtools view.

  • extend_reads (int) – Extend reads by a fixed number of base pairs. If < 0, ignore. If 0, extend reads to ‘estimated fragment length’. If > 0, extend reads by that number of base pairs.

  • center_reads (bool) – Center reads at the midpoint of the fragment length.

  • ignore_for_norm (list) – List of chromosome names to ignore for normalization.

  • scale_factor (float) – Factor to scale the values by.

  • ouput_file (str) – Path to the output BigWig file. If None, the file name will be generated based on the parameters used.

  • num_processors (int) – Number of processes to use for the calculation. If < 0, use all but one of the available processors.

  • overwrite (bool) – If True, overwrite the output file if it already exists. If False, skip the calculation if the output file already exists.

Returns:

Path to the output BigWig file.

Return type:

str

Raises:
  • FileNotFoundError – If the BAM file is not found.

  • ValueError – If the normalization method is not one of: RPGC, RPKM, CPM, BPM.

  • ValueError – If effective genome size is not specified and the normalization method is RPGC.
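
A hypothetical invocation (the effective genome size shown is the commonly cited hg38 value from the deeptools documentation; verify for your assembly):

import rocco  # assumed package-level exposure of the readtracks functions

bw_path = rocco.generate_bigwig('sample1.bam', step=50, norm_method='RPGC',
                                effective_genome_size=2913022398)  # hg38, per deeptools docs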

generate_bigwigs(bam_files: list, step: int = 50, effective_genome_size: float = -1, norm_method: str = 'RPGC', min_mapping_score: int = -1, flag_include: int = 66, flag_exclude: int = 1284, extend_reads: int = -1, center_reads: bool = False, ignore_for_norm: list = ['chrX', 'chrY', 'chrM'], scale_factor: float = 1.0, num_processors: int = -1, overwrite: bool = True) → list

Generate BigWig files from a list of BAM files. This function is just a wrapper for generate_bigwig()

Parameters:
  • bam_files (list) – List of paths to the BAM files.

  • step (int) – Step size for the intervals.

  • effective_genome_size (float) – Effective genome size for normalization. Required for RPGC normalization.

  • norm_method (str) – Normalization method. Must be one of: RPGC, RPKM, CPM, BPM. See deeptools documentation for more information.

  • min_mapping_score (int) – Minimum mapping score for reads to be included in the count track.

  • flag_include (int) – SAM flag to include in the count track. Equivalent to -f in samtools view.

  • flag_exclude (int) – SAM flag to exclude from the count track. Equivalent to -F in samtools view.

  • extend_reads (int) – Extend reads by a fixed number of base pairs. If < 0, ignore. If 0, extend reads to ‘estimated fragment length’. If > 0, extend reads by that number of base pairs.

  • center_reads (bool) – Center reads at the midpoint of the fragment length.

  • ignore_for_norm (list) – List of chromosome names to ignore for normalization.

  • scale_factor (float) – Factor to scale the values by.

  • num_processors (int) – Number of processes to use for the calculation. If < 0, use all but one of the available processors.

  • overwrite (bool) – If True, overwrite the output file if it already exists. If False, skip the calculation if the output file already exists.

Returns:

List of paths to the output BigWig files.

Return type:

list

Raises:
  • FileNotFoundError – If a BAM file is not found.

  • ValueError – If the normalization method is not one of: RPGC, RPKM, CPM, BPM.

  • ValueError – If effective genome size is not specified and the normalization method is RPGC.

See also: generate_bigwig()

generate_chrom_matrix(chromosome: str, bigwig_files: list, chrom_sizes_file: str, step: int, const_scale: float = 1.0, round_digits: int = 5, scale_by_step: bool = False, filter_type: str | None = None, savgol_window_bp: int | None = None, savgol_window_steps: int = 5, savgol_order: int | None = None, medfilt_kernel_bp: int | None = None, medfilt_kernel_steps: int = 5, transform_log_pc: bool = False, log_const: float | None = None, transform_local_ratio: bool = False, local_ratio_window_bp: int | None = None, local_ratio_window_steps: int | None = None, local_ratio_pc: float | None = None)

Create a matrix of read counts for a given chromosome from a list of BigWig files with potentially varying start and end positions. Provides options for transforming and filtering the tracks comprising the matrix.

Parameters:
  • chromosome (str) – The chromosome to generate the matrix for.

  • bigwig_files (list) – List of paths to the BigWig files.

  • chrom_sizes_file (str) – Path to the chromosome sizes file.

  • step (int) – Step size for the intervals.

  • const_scale (float) – Factor to scale the signal values by after bigwig generation.

  • round_digits (int) – Number of decimal places to round the wig values to (not intervals).

  • scale_by_step (bool) – If True, scale the values by the step size.

  • filter_type (str) – Type of filter to apply. Options are ‘savitzky-golay’ or ‘median’.

  • savgol_window_bp (int) – Window size for the Savitzky-Golay filter in base pairs.

  • savgol_window_steps (int) – Window size for the Savitzky-Golay filter in steps.

  • savgol_order (int) – Polynomial order for the Savitzky-Golay filter.

  • medfilt_kernel_bp (int) – Kernel size for the median filter in base pairs.

  • medfilt_kernel_steps (int) – Kernel size for the median filter in steps.

  • transform_log_pc (bool) – If True, apply a log2(x + c) transformation to data.

  • log_const (float) – Constant to add before taking the log.

  • transform_local_ratio (bool) – If True, apply local ratio transformation to the values. See apply_transformation().

  • local_ratio_window_bp (int) – Window size for the local ratio transformation in base pairs.

  • local_ratio_window_steps (int) – Window size for the local ratio transformation in steps.

  • local_ratio_pc (float) – Constant to add to the local reference before taking the ratio.

Returns:

A tuple containing two numpy arrays. The first array contains genomic intervals, and the second contains the corresponding values.

Return type:

tuple(numpy.ndarray, numpy.ndarray)

Raises:

FileNotFoundError – If a BigWig file or chromosome sizes file is not found.

See also: get_chrom_reads(), apply_transformation(), apply_filter()

get_chrom_reads(bigwig_file: str, chromosome: str, chrom_sizes_file: str, step: int, const_scale: float = 1.0, round_digits: int = 5, scale_by_step: bool = False)

Extract, decompress, and scale signal values (e.g., read counts) from a BigWig file for a specific chromosome.

Parameters:
  • bigwig_file (str) – Path to the BigWig file.

  • chromosome (str) – Chromosome name to extract data for.

  • chrom_sizes_file (str) – Path to the chromosome sizes file.

  • step (int) – Step size for the intervals.

  • const_scale (float) – Factor to scale the signal values by after bigwig generation. Note, this is not the same parameter as the scale factor used in generate_bigwig.

  • round_digits (int) – Number of decimal places to round the wig values to (not intervals).

  • scale_by_step (bool) – If True, scale the values by the step size.

Returns:

A tuple containing two np arrays. One for the genomic intervals, one for the values.

Return type:

tuple

Raises:
  • FileNotFoundError – If the BigWig file or chromosome sizes file is not found.

  • ValueError – If no non-zero values are found in the BigWig file.

See also: decompress_features_vals(), generate_chrom_matrix()

get_chroms_and_sizes(chrom_sizes_file)

Parse a chromosome sizes file and return a dictionary of chromosome names and sizes.

Parameters:

chrom_sizes_file (str) – Path to the chromosome sizes file.

Returns:

A dictionary of chromosome names and sizes.

Return type:

dict

Raises:

FileNotFoundError – If the chromosome sizes file is not found.
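
Chromosome sizes files are two-column (name, size) text files; a minimal parsing sketch:

def parse_chrom_sizes_sketch(chrom_sizes_file):
    sizes = {}
    with open(chrom_sizes_file) as f:
        for line in f:
            fields = line.split()
            if len(fields) >= 2:
                sizes[fields[0]] = int(fields[1])  # e.g., {'chr21': 46709983, ...}
    return sizes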

get_shape(matrix: numpy.ndarray) → Tuple

Helper function to get the shape of a 1D/2D numpy array

Parameters:

matrix (numpy.ndarray)

Return type:

Tuple

ROCCO: Scores

Compute post hoc peak statistics given samples’ BAM files and ROCCO peak output.

get_ecdf(bam_files, length: int, chrom_sizes_file: str, nsamples=500, sample_scaling_constants: list | None = None, seed: int | None = None)

Compute an empirical CDF of counts from randomly selected genomic regions of the given length (in bases).

In the current implementation, this function is called for each unique peak length by score_peaks() so that the empirical CDF used to approximate p-values is based on regions of the same length as the peaks.

Parameters:
  • bam_files (list or str) – List of paths to the BAM files OR a single filepath to a text file containing a list of BAM files (one filepath per line).

  • length (int) – Defines length of regions that will be randomly sampled to estimate the background distribution

  • chrom_sizes_file (str) – Filepath to chromosome sizes file.

  • nsamples (int) – Number of random ‘background’ regions to sample.

  • sample_scaling_constants (list) – List of scaling constants (floats) for each sample (provided by score_peaks() in current implementation).

  • seed (int) – Random seed supplied to bedtools random

Returns:

ECDF object with an evaluate() method.

Return type:

scipy.stats._distn_infrastructure.rv_frozen

See also: score_peaks(), multi_ecdf()

get_read_length(bam_file: str, num_reads: int = 1000, min_mapping_quality: int = 10)

Estimate the mapped read length from the first num_reads reads of a BAM file.

Parameters:
  • bam_file (str) – Path to the BAM file.

  • num_reads (int) – Number of reads to sample for the approximation.

  • min_mapping_quality (int)

Returns:

estimated mapped read length

Return type:

int

multi_ecdf(bam_files, lengths, chrom_sizes_file: str, nsamples_per_length, sample_scaling_constants=None, seed=None, proc: int | None = None)

Compute ECDFs in parallel for each unique peak length and return a corresponding dictionary of scipy ECDF objects.

See get_ecdf() for details.

Parameters:
  • chrom_sizes_file (str)

  • proc (int | None)

raw_count_matrix(bam_files: str, peak_file: str, output_file: str, bed_columns: int = 3, overwrite=True)

Generate a ‘raw’ count matrix from BAM files and a ROCCO output peak file.

Parameters:
  • bam_files (list or str) – List of paths to the BAM files OR a single filepath to a text file containing a list of BAM files (one filepath per line; order is preserved in the output matrix).

  • peak_file (str) – Path to the BED file containing peaks.

  • output_file (str) – name of file to write the count matrix in.

  • bed_columns (int)

Returns:

Name/path of the output file if successful, otherwise None.

Return type:

str

Note

Output Header.

The header of the output peak-count files (TSV) will preserve the order in which the BAM files were supplied.

Note

BED3 format.

Default expects three-column BED peak files. If the peak file contains additional columns, set bed_columns accordingly.
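
A hypothetical call producing a TSV count matrix over ROCCO peaks (file names are placeholders):

import rocco  # assumed package-level exposure, as in the parsig() example

matrix_path = rocco.raw_count_matrix(['sample1.bam', 'sample2.bam'],
                                     peak_file='rocco_peaks_output.bed',
                                     output_file='raw_counts.tsv')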

score_peaks(bam_files, chrom_sizes_file: str | None = None, peak_file: str | None = None, count_matrix_file: str | None = None, effective_genome_size: float | None = None, skip_for_norm: list = ['chrX', 'chrY', 'chrM'], row_scale=1000, ucsc_base=250, threads: int | None = None, pc=1, ecdf_nsamples=500, output_file='scored_peaks.bed', seed: int | None = None, proc: int | None = None)

Compute peak scores based on the geometric mean of 1x-normalized, length-scaled counts; p-values are based on ECDFs of read counts for each peak length.

Parameters:
  • bam_files (list or str) – List of paths to the BAM files OR a single filepath to a text file containing a list of BAM files (one filepath per line).

  • chrom_sizes_file (str) – Path to the chromosome sizes file.

  • peak_file (str) – Path to (BED) file containing peak regions.

  • ucsc_base (int) – Base value for the UCSC score column. Default is 250 so that no peaks are indiscernible in browser display.

  • count_matrix_file (str | None)

  • effective_genome_size (float | None)

  • skip_for_norm (list)

  • threads (int | None)

  • seed (int | None)

  • proc (int | None)

wrap_run_ecdf(*args)

Wrapper for get_ecdf() for compliance with multiprocessing.Pool

ROCCO: Constants

Default effective genome sizes, chromosome size files, chromosome-specific budget/gamma parameters for supported assemblies: hg38, hg19, mm10, mm39, and dm6.

Universally optimal parameters/default values cannot be established for all datasets. When possible, users may find improved results by tuning the parameters to their specific data. See the existing assembly-specific parameter files for defaults to use as a template.

Use the --params CLI argument to specify a custom parameter file.

Variables

GENOME_DICT (dict)

Dictionary of genome assemblies with effective genome size, chromosome sizes file, and parameters file. Keys are assembly names, values are dictionaries with keys ‘effective_genome_size’, ‘sizes_file’, and ‘params’. Default assemblies: hg38, hg19, mm10, mm39, and dm6.

get_file_path(filename)