ROCCO: [R]obust [O]pen [C]hromatin Detection via [C]onvex [O]ptimization

logo

Underlying ROCCO is a constrained optimization problem with solutions dictating accessible chromatin regions that are consistent across multiple samples.

GitHub (Homepage)

https://github.com/nolan-h-hamilton/ROCCO/

Paper

If using ROCCO in your research, please cite the original paper in Bioinformatics

Nolan H Hamilton, Terrence S Furey, ROCCO: a robust method for detection of open chromatin via convex optimization,
Bioinformatics, Volume 39, Issue 12, December 2023

DOI: 10.1093/bioinformatics/btad725

Demo

A brief demonstration of ROCCO using ENCODE ATAC-seq data is available as a Jupyter notebook.

Installation

pip install rocco

Input

  • A set of BAM alignment files
    • In this case, ROCCO generates coverage track files in BigWig format for each sample using deepTools bamCoverage and user-specified parameters/arguments for read filtering, normalization, etc.

OR

  • A set of BigWig coverage track files
    • Users can also manually generate BigWig files for each sample according to their experimental needs and preferred protocol and supply them them directly to ROCCO. Input BigWig tracks are treated as final and not modified by ROCCO (e.g., no normalization, etc.)

AND

  • A genome sizes file containing chromosome names and sizes separated by tabs (e.g., docs/hg38.sizes)

Output

A BED file listing the identified peak regions and scores.

Example Use

Both an API and command-line interface (CLI) are available to run ROCCO. Toy examples are provided below using both the CLI and API.

API Use

Example One

Run ROCCO with BAM input files for each sample using the default chromosome-specific budget, gamma, etc. parameters for hg38 assembly (See Rocco.HG38_PARAMS or hg38_template.csv).

>>> import rocco
>>> bamfiles = ['sample1.bam', 'sample2.bam', 'sample3.bam','sample4.bam', 'sample5.bam']
>>> rocco_obj = rocco.Rocco(
...     input_files=bamfiles,
...     genome_file='genome.sizes',
...     chrom_param_file='hg38')
>>> rocco_obj.run() # output peak annotation saved as a BED file `rocco_obj.outfile`

Example Two

Run ROCCO with BigWig input files for each sample using default chromosome-specific budget, gamma, etc. parameters for the hg38 assembly in Rocco.HG38_PARAMS

Accepting BigWig input directly allows users the option to generate the coverage tracks according to their preferred protocol, e.g., deepTools bamCoverage or another utility that produces BigWig signal files with additional features for normalization, smoothing, read extension, etc. that are useful in many experimental settings.

>>> import rocco
>>> bw_files = ['sample1.bw', 'sample2.bw', 'sample3.bw', 'sample4.bw', 'sample5.bw']
>>> rocco_obj = rocco.Rocco(
...     input_files=bw_files,
...     genome_file='genome.sizes',
...     chrom_param_file='hg38')
>>> rocco_obj.run()

Example Three

Scale coverage signals before calling peaks

>>> import rocco
>>> bamfiles = ['sample1.bam', 'sample2.bam', 'sample3.bam','sample4.bam', 'sample5.bam']
>>> rocco_obj = rocco.Rocco(
...     input_files=bamfiles,
...     genome_file='genome.sizes',
...     chrom_param_file='hg38',
...     sample_weights=[0.50, 1.0, 1.0, 1.0, 0.75]
... )
>>> rocco_obj.run()

Example Four

Use a custom parameter file to set chromosome-specific budgets, gamma, tau, etc.

With the following saved to ``custom_params.csv``

chrom,budget,gamma,tau,c_1,c_2,c_3
chr19,0.05,1.0,0,1.0,1.0,1.0
chr21,0.03,1.0,0,1.0,1.0,1.0
>>> import rocco
>>> bamfiles = ['sample1.bam', 'sample2.bam', 'sample3.bam','sample4.bam', 'sample5.bam']
>>> rocco_obj = rocco.Rocco(
...     input_files=bamfiles,
...     genome_file='genome.sizes',
...     chrom_param_file='custom_params.csv')
>>> rocco_obj.run()

For a template chromosome parameter file see hg38_template.csv.

Example Five

Use constant, default genome-wide parameters for all chromosomes. Users can modify these default genome-wide parameters via the constant_ arguments at the command line or as below

>>> import rocco
>>> bamfiles = ['sample1.bam', 'sample2.bam', 'sample3.bam','sample4.bam', 'sample5.bam']
>>> rocco_obj = rocco.Rocco(
...     input_files=bamfiles,
...     genome_file='genome.sizes',
...     chrom_param_file=None,
...     constant_budget=0.035,
...     constant_gamma=1.0,
...     constant_tau=0.0)
>>> rocco_obj.run()

CLI Use

See rocco.main() or run rocco --help at the command line for a full list of available arguments.

Example One

rocco -i sample1.bam sample2.bam sample3.bam sample4.bam sample5.bam -g genome.sizes --params hg38

ROCCO will also accept wildcard/regex filename patterns:

rocco -i *.bam -g genome.sizes --params hg38

Example Two

rocco -i *.bw -g genome.sizes --params hg38

Example Three

Explicitly list the input files and weights to ensure the correct mapping of weights to samples

rocco -i sample1.bam sample2.bam sample3.bam sample4.bam sample5.bam -g genome.sizes --params hg38 --sample_weights 0.50 1.0 1.0 1.0 0.75

Example Four

rocco -i *.bam -g genome.sizes --params custom_params.csv

Example Five

rocco -i *.bam -g genome.sizes --budget 0.035 --gamma 1.0 --tau 0.0

Testing ROCCO

Run PyTest unit tests

cd tests
pytest -v -rPA -l -k "regular" test_rocco.py

Miscellaneous

  • Parameter Tuning Default parameters generally provide strong results, but users may alter these using a custom --params (e.g., hg38_template.csv.) or by modifying the --budget/--constant_budget, --gamma/--constant_gamma, etc. arguments if wishing to use consistent parameters for all chromosomes.

  • Peak Scores Run with --plot_hist to generate a histogram of peak scores that may be useful if tuning the --peak_score_filter argument.

  • Memory Use If RAM is a special consideration, you can try increasing –step from its default of 50 to, e.g., 100 and/or using a lightweight solver for the optimization, e.g., a first-order method such as PDLP. Note that if using BigWig files as input, –step has no effect as the step size is inferred from the data.

  • Dependencies Ensure samtools and bedtools are installed and in your PATH. These tools are utilized for several auxiliary features.

To-Do Items

Todo

Rocco.run(): Update peak score histogram with subplots for each chromosome

original entry

class rocco.Rocco(input_files, genome_file, chrom_param_file=None, **kwargs)[source]

Bases: object

Parameters:
  • input_files (list) – a list of samples’ BAM files, or a list of samples’ BigWig

  • genome_file (str) – Path to the genome sizes file containing chromosome sizes

  • chrom_param_file (str) – Path to the chromosome parameter file

  • male_samples (list, optional) – If specified, peak calling over chrY is restricted to these samples. Otherwise, all samples are used for all chromosomes.

  • skip_chroms (list, optional) – List of chromosomes to skip

  • proc_num (int, optional) – Number of processes to use when computing chromosome-specific coverage tracks

  • step (int, optional) – Step size for coverage tracks. Inferred from the intervals if bigwig input is supplied.

  • sample_weights (numpy array, optional) – List/array of weights used to scale the coverage tracks for each sample. Defaults to np.ones()

  • filler_params (dict, optional) – Filler parameters used if missing from chrom_param_file

  • solver (str, optional) – Solver to use for optimization (default: ‘CLARABEL’).

  • solver_maxiter (int, optional) – Maximum number of solving iterations

  • solver_reltol (float, optional) – Relative optimality gap tolerance when solving the relaxed optimization problem

  • solver_abstol (float, optional) – Absolute optimality gap tolerance when solving the relaxed optimization problem

  • solver_feastol (float, optional) – Feasibility tolerance when solving the relaxed optimization problem

  • rand_iter (int, optional) – Number of RR iterations.

  • verbose_solving (bool, optional) – Whether to print solver logging data

  • norm_method (str, optional) – apply RPKM or RPGC normalization (see documentation for deeptools’s bamCoverage tool) on each Sample’s coverage track independently.

  • norm_ignore_chroms (list, optional) – list of chromosomes to skip for normalization

  • effective_genome_size – Effective genome size for normalization. Ignored if norm_method is not ‘RPGC’.

  • raw_counts (bool, optional) – If True, no normalization is performed when computing the coverage track from the input BAM file.

delete_tempfiles()[source]
get_Smat(chromosome, samples=None)[source]

Generates a \(K \times n\) coverage signal matrix \(\mathbf{S}_{chr}\) from samples’ data for input to ROCCO

\[\begin{split}\mathbf{S}_{chr} = \begin{pmatrix} s_{1_1} & s_{2_1} & \ldots & s_{n_1}\\ s_{1_2} & s_{2_2} & \ldots & s_{n_2}\\ \ldots &\ldots&\ldots&\ldots\\ s_{1_k} & s_{2_k} & \ldots & s_{n_K} \end{pmatrix}\end{split}\]
Parameters:
  • chromosome (str) – Chromosome over which to compute \(\mathbf{S}_{chr}\)

  • samples (list) – list of Sample objects

Returns:

a matrix representing the coverage signals of each sample over common loci

Return type:

numpy.ndarray

get_scores(chromosome, Smat_chr, c_1=None, c_2=None, c_3=None, tau=None)[source]

Compute locus scores \(\mathcal{S}(i),~ \forall i=1\ldots n\)

Parameters:
  • chromosome (str) – Name of chromosome, e.g., ‘chr19’

  • Smat_chr (np.ndarray) – Matrix of coverage signals

Returns:

An \(n\) -length array of locus scores, each of which quantifies the appeal of selecting the respective locus as open (accessible)

Return type:

np.ndarray

run(peak_score_scale=1000.0)[source]

Execute ROCCO over each given chromosome, merge, score results, and create an output BED file.

Todo

Rocco.run(): Update peak score histogram with subplots for each chromosome

run_rr_proc(lp_sol, scores, budget, gamma, rand_iter=None, eps_rand=0.0001) numpy.ndarray[source]

Execute the randomization procedure to obtain an integral solution from the solution to the relaxed problem, lp_sol

Parameters:
  • lp_sol (np.nadarray) – A solution to the continuous, relaxed optimization problem.

  • scores (np.ndarray) –

    The score for a given locus quantifies the benefit in selecting it as accessible.

    \[\mathcal{S}: \mathbb{R}^{K\times 1}_{\geq 0} \rightarrow \mathbb{R}\]

  • budget (float, optional) –

    Defines the budget constraint to upper bound the proportion of loci selected as open/accessible.

    \[\sum^{n}_{i=1} \ell_i \leq \lfloor nb \rfloor\]

    where \(\textsf{budget} := b\) and \(n\) is the number of loci (fixed-step contiguous genomic intervals). In general, a lower budget will result in more conservative results.

  • gamma (type, optional) –

    Weight for the ‘fragmentation penalty’ in decision space.

    \[\gamma \sum^{n-1}_{i=1} |\ell_i - \ell_{i+1}|\]

    In general, higher values of gamma yield fewer but broader peaks. Gamma should be set relative to the step size in the coverage signals, with gamma increasing as the step size decreases, as discussed in the Supplement of the paper.

  • rand_iter (int, optional) –

    Number of randomized integral solutions to draw as

    \[\mathbf{\ell}^{\mathsf{~rand}} \sim \text{Bernoulli}(\mathbf{\ell^{\mathsf{~LP}}})\]

  • eps_rand (float, optional) –

    Defines the initial ‘floor-epsilon’ reference solution with the following procedure

    eps_cpy = eps_rand
    # initialize as floor_eps solution
    init_sol = np.floor(lp_sol + eps_rand)
    while np.sum(init_sol) > np.floor(n*budget):
        # loop guarantees the feasibility of solutions
        eps_cpy = eps_cpy/2
        init_sol = np.floor(lp_sol + eps_cpy)
    # now start randomization procedure...
    

Return type:

numpy.ndarray

solve_chrom(chromosome, common_loci=None, Smat_chr=None, scores=None, budget=None, gamma=None, tau=None, solver=None, solver_reltol=None, solver_abstol=None, solver_feastol=None, solver_maxiter=None, verbose_solving=None, step=None, outfile=None, pr_bed=None)[source]

Executes ROCCO on a given chromosome.

Parameters:
  • chromosome (str) – Name of chromosome on which to execute Rocco

  • common_loci (np.ndarray, optional) – A list/array of loci indices in chromosome over which all samples’ coverage signals are defined. Inferred from data if None.

  • Smat_chr (np.ndarray, optional) – A \(K \times n\) np.ndarray (samples by loci) coverage signal matrix. Inferred from data if None.

  • scores (np.ndarray, optional) – A list/array of scores for each locus quantifying their appeal for selection as open/accessible.

  • budget (float, optional) –

    Defines the budget constraint to upper bound the proportion of loci selected as open/accessible.

    \[\sum^{n}_{i=1} \ell_i \leq \lfloor nb \rfloor\]

    where \(\textsf{budget} := b\) and \(n\) is the number of loci (fixed-step contiguous genomic intervals). In general, a lower budget will result in more conservative results.

  • gamma (type, optional) –

    Weight for the ‘fragmentation penalty’ in decision space.

    \[\gamma \sum^{n-1}_{i=1} |\ell_i - \ell_{i+1}|\]

    In general, higher values of gamma yield fewer but broader peaks. Gamma should be set relative to the step size in the coverage signals, with gamma increasing as the step size decreases, as discussed in the Supplement of the paper.

  • outfile (str, optional) – Name of the chromosome-specific BED output file.

Returns:

a tuple (chromosome, selected_loci)

Return type:

tuple(str, np.ndarray)

Notes:

The initial, unrelaxed optimization problem ROCCO addresses is the following integer program

\[\begin{split}\begin{aligned} & \underset{\boldsymbol{\ell}}{\text{ Minimize:}} & & f_{\mathsf{IP}}(\boldsymbol{\ell}) = \sum_{i=1}^{n}-\left(\mathcal{S}(i)\cdot\ell_i\right) + \gamma\sum_{i=1}^{n-1} |\ell_i - \ell_{i+1}| \\ & \text{Subject To:} & & \text{(i)}~~\sum_{i=1}^{n}\ell_i \leq \lfloor nb\rfloor \\ & & & \text{(ii)}~~\ell_i \in \{0,1\}, ~\forall i=1 \ldots n. \end{aligned}\end{split}\]

where \(\ell_i\) denotes the binary decision variable for the \(i\) -th locus, \(\mathcal{S}(i)\) denotes the ‘score’ for the \(i\) -th locus, etc. See paper for more details.

class rocco.Sample(input_file, genome_file, **kwargs)[source]

Bases: object

Used to generate/parse coverage tracks of samples’ BAM or BigWig files

Parameters:
  • input_file (str) – a BAM (.bam), or bigwig (.bw) file for a given sample

  • genome_file (str) – Path to the genome sizes file containing chromosome sizes

  • skip_chroms (list, optional) – List of chromosomes to exclude when parsing/generating coverage tracks

  • proc_num (int, optional) – Number of processes to use when computing chromosome-specific coverage tracks

  • step (int, optional) – Step size for coverage tracks. This is overwritten and inferred from the data if a BigWig file is used as input

  • weight (float, optional) – Weight by which to scale coverage values by. Can be used to apply scaling factor for normalization, etc. Defaults to 1.0

  • norm_method (str, optional) – use CPM, BPM, or RPKM (see documentation for deeptools’s bamCoverage tool) to normalize the sample’s coverage track. Ignored if input is a BigWig file. Defaults to RPKM for BAM files.

  • norm_ignore_chroms (list, optional) – list of chromosomes to skip for normalization

  • effective_genome_size – Effective genome size for normalization. Ignored if norm_method is not ‘RPGC’.

  • raw_counts (bool, optional) – If True, no normalization is performed when computing the coverage track from the input BAM file.

bam_to_bigwig(bamcov_cmd=None, additional_args=None)[source]

Wraps deeptools’ bamCoverage to generate a BigWig coverage track from the input BAM file

get_chrom_data(chromosome)[source]

Parse a BigWig file and return chromosome-specific coverage loci, coverage values

get_input_type()[source]

Determine if self.input_file is a BAM, or BigWig file

The file type is determined by the file extension: ‘.bam’, ‘.bw’, etc. and is not robust to incorrectly labelled files.

Raises:

ValueError – If file extension is not supported

Returns:

a string (extension) representing the file type

Return type:

str

rocco.file_basename(filename)[source]
rocco.get_chroms_and_sizes(genome_file)[source]

Parse chromosomes and their respective sizes in genome_file

Raises:

FileNotFoundError – If genome file cannot be found

Returns:

chromosome names and sizes in genome_file

Return type:

tuple(list,list)

rocco.has_chrom_reads(input_file, chromosome)[source]

Check if a given chromosome has reads in the input file

rocco.main()[source]
rocco.nearest_step_idx(val, step)[source]