ROCCO: [R]obust [O]pen [C]hromatin Detection via [C]onvex [O]ptimization¶
Underlying ROCCO is a constrained optimization problem with solutions dictating accessible chromatin regions that are consistent across multiple samples.
GitHub (Homepage)¶
Paper¶
If using ROCCO in your research, please cite the original paper in Bioinformatics
Nolan H Hamilton, Terrence S Furey, ROCCO: a robust method for detection of open chromatin via convex optimization,
Bioinformatics, Volume 39, Issue 12, December 2023
DOI: 10.1093/bioinformatics/btad725
Demo¶
A brief demonstration of ROCCO using ENCODE ATAC-seq data is available as a Jupyter notebook.
Installation¶
pip install rocco
Input¶
- A set of BAM alignment files
In this case, ROCCO generates coverage track files in BigWig format for each sample using deepTools bamCoverage and user-specified parameters/arguments for read filtering, normalization, etc.
OR
- A set of BigWig coverage track files
Users can also manually generate BigWig files for each sample according to their experimental needs and preferred protocol and supply them them directly to ROCCO. Input BigWig tracks are treated as final and not modified by ROCCO (e.g., no normalization, etc.)
AND
A genome sizes file containing chromosome names and sizes separated by tabs (e.g.,
docs/hg38.sizes
)
Output¶
A BED file listing the identified peak regions and scores.
Example Use¶
Both an API and command-line interface (CLI) are available to run ROCCO. Toy examples are provided below using both the CLI and API.
API Use¶
Example One¶
Run ROCCO with BAM input files for each sample using the default chromosome-specific budget, gamma, etc. parameters for hg38 assembly (See Rocco.HG38_PARAMS
or hg38_template.csv).
>>> import rocco
>>> bamfiles = ['sample1.bam', 'sample2.bam', 'sample3.bam','sample4.bam', 'sample5.bam']
>>> rocco_obj = rocco.Rocco(
... input_files=bamfiles,
... genome_file='genome.sizes',
... chrom_param_file='hg38')
>>> rocco_obj.run() # output peak annotation saved as a BED file `rocco_obj.outfile`
Example Two¶
Run ROCCO with BigWig input files for each sample using default chromosome-specific budget, gamma, etc. parameters for the hg38
assembly in Rocco.HG38_PARAMS
Accepting BigWig input directly allows users the option to generate the coverage tracks according to their preferred protocol, e.g., deepTools bamCoverage or another utility that produces BigWig signal files with additional features for normalization, smoothing, read extension, etc. that are useful in many experimental settings.
>>> import rocco
>>> bw_files = ['sample1.bw', 'sample2.bw', 'sample3.bw', 'sample4.bw', 'sample5.bw']
>>> rocco_obj = rocco.Rocco(
... input_files=bw_files,
... genome_file='genome.sizes',
... chrom_param_file='hg38')
>>> rocco_obj.run()
Example Three¶
Scale coverage signals before calling peaks
>>> import rocco
>>> bamfiles = ['sample1.bam', 'sample2.bam', 'sample3.bam','sample4.bam', 'sample5.bam']
>>> rocco_obj = rocco.Rocco(
... input_files=bamfiles,
... genome_file='genome.sizes',
... chrom_param_file='hg38',
... sample_weights=[0.50, 1.0, 1.0, 1.0, 0.75]
... )
>>> rocco_obj.run()
Example Four¶
Use a custom parameter file to set chromosome-specific budgets, gamma, tau, etc.
With the following saved to ``custom_params.csv``
chrom,budget,gamma,tau,c_1,c_2,c_3
chr19,0.05,1.0,0,1.0,1.0,1.0
chr21,0.03,1.0,0,1.0,1.0,1.0
>>> import rocco
>>> bamfiles = ['sample1.bam', 'sample2.bam', 'sample3.bam','sample4.bam', 'sample5.bam']
>>> rocco_obj = rocco.Rocco(
... input_files=bamfiles,
... genome_file='genome.sizes',
... chrom_param_file='custom_params.csv')
>>> rocco_obj.run()
For a template chromosome parameter file see hg38_template.csv.
Example Five¶
Use constant, default genome-wide parameters for all chromosomes. Users can modify these default genome-wide parameters via the constant_
arguments at the command line or as below
>>> import rocco
>>> bamfiles = ['sample1.bam', 'sample2.bam', 'sample3.bam','sample4.bam', 'sample5.bam']
>>> rocco_obj = rocco.Rocco(
... input_files=bamfiles,
... genome_file='genome.sizes',
... chrom_param_file=None,
... constant_budget=0.035,
... constant_gamma=1.0,
... constant_tau=0.0)
>>> rocco_obj.run()
CLI Use¶
See rocco.main()
or run rocco --help
at the command line for a full list of available arguments.
Example One¶
rocco -i sample1.bam sample2.bam sample3.bam sample4.bam sample5.bam -g genome.sizes --params hg38
ROCCO will also accept wildcard/regex filename patterns:
rocco -i *.bam -g genome.sizes --params hg38
Example Two¶
rocco -i *.bw -g genome.sizes --params hg38
Example Three¶
Explicitly list the input files and weights to ensure the correct mapping of weights to samples
rocco -i sample1.bam sample2.bam sample3.bam sample4.bam sample5.bam -g genome.sizes --params hg38 --sample_weights 0.50 1.0 1.0 1.0 0.75
Example Four¶
rocco -i *.bam -g genome.sizes --params custom_params.csv
Example Five¶
rocco -i *.bam -g genome.sizes --budget 0.035 --gamma 1.0 --tau 0.0
Testing ROCCO¶
Run PyTest unit tests
cd tests
pytest -v -rPA -l -k "regular" test_rocco.py
Miscellaneous¶
Parameter Tuning Default parameters generally provide strong results, but users may alter these using a custom
--params
(e.g., hg38_template.csv.) or by modifying the--budget/--constant_budget
,--gamma/--constant_gamma
, etc. arguments if wishing to use consistent parameters for all chromosomes.Peak Scores Run with
--plot_hist
to generate a histogram of peak scores that may be useful if tuning the--peak_score_filter
argument.Memory Use If RAM is a special consideration, you can try increasing –step from its default of 50 to, e.g., 100 and/or using a lightweight solver for the optimization, e.g., a first-order method such as PDLP. Note that if using BigWig files as input, –step has no effect as the step size is inferred from the data.
Dependencies Ensure samtools and bedtools are installed and in your PATH. These tools are utilized for several auxiliary features.
To-Do Items¶
Todo
Rocco.run()
: Update peak score histogram with subplots for each chromosome
- class rocco.Rocco(input_files, genome_file, chrom_param_file=None, **kwargs)[source]¶
Bases:
object
- Parameters:
input_files (list) – a list of samples’ BAM files, or a list of samples’ BigWig
genome_file (str) – Path to the genome sizes file containing chromosome sizes
chrom_param_file (str) – Path to the chromosome parameter file
male_samples (list, optional) – If specified, peak calling over chrY is restricted to these samples. Otherwise, all samples are used for all chromosomes.
skip_chroms (list, optional) – List of chromosomes to skip
proc_num (int, optional) – Number of processes to use when computing chromosome-specific coverage tracks
step (int, optional) – Step size for coverage tracks. Inferred from the intervals if bigwig input is supplied.
sample_weights (numpy array, optional) – List/array of weights used to scale the coverage tracks for each sample. Defaults to np.ones()
filler_params (dict, optional) – Filler parameters used if missing from chrom_param_file
solver (str, optional) – Solver to use for optimization (default: ‘CLARABEL’).
solver_maxiter (int, optional) – Maximum number of solving iterations
solver_reltol (float, optional) – Relative optimality gap tolerance when solving the relaxed optimization problem
solver_abstol (float, optional) – Absolute optimality gap tolerance when solving the relaxed optimization problem
solver_feastol (float, optional) – Feasibility tolerance when solving the relaxed optimization problem
rand_iter (int, optional) – Number of RR iterations.
verbose_solving (bool, optional) – Whether to print solver logging data
norm_method (str, optional) – apply RPKM or RPGC normalization (see documentation for deeptools’s bamCoverage tool) on each Sample’s coverage track independently.
norm_ignore_chroms (list, optional) – list of chromosomes to skip for normalization
effective_genome_size – Effective genome size for normalization. Ignored if norm_method is not ‘RPGC’.
raw_counts (bool, optional) – If
True
, no normalization is performed when computing the coverage track from the input BAM file.
- get_Smat(chromosome, samples=None)[source]¶
Generates a \(K \times n\) coverage signal matrix \(\mathbf{S}_{chr}\) from samples’ data for input to ROCCO
\[\begin{split}\mathbf{S}_{chr} = \begin{pmatrix} s_{1_1} & s_{2_1} & \ldots & s_{n_1}\\ s_{1_2} & s_{2_2} & \ldots & s_{n_2}\\ \ldots &\ldots&\ldots&\ldots\\ s_{1_k} & s_{2_k} & \ldots & s_{n_K} \end{pmatrix}\end{split}\]- Parameters:
chromosome (str) – Chromosome over which to compute \(\mathbf{S}_{chr}\)
samples (list) – list of Sample objects
- Returns:
a matrix representing the coverage signals of each sample over common loci
- Return type:
numpy.ndarray
- get_scores(chromosome, Smat_chr, c_1=None, c_2=None, c_3=None, tau=None)[source]¶
Compute locus scores \(\mathcal{S}(i),~ \forall i=1\ldots n\)
- Parameters:
chromosome (str) – Name of chromosome, e.g., ‘chr19’
Smat_chr (np.ndarray) – Matrix of coverage signals
- Returns:
An \(n\) -length array of locus scores, each of which quantifies the appeal of selecting the respective locus as open (accessible)
- Return type:
np.ndarray
- run(peak_score_scale=1000.0)[source]¶
Execute ROCCO over each given chromosome, merge, score results, and create an output BED file.
Todo
Rocco.run()
: Update peak score histogram with subplots for each chromosome
- run_rr_proc(lp_sol, scores, budget, gamma, rand_iter=None, eps_rand=0.0001) numpy.ndarray [source]¶
Execute the randomization procedure to obtain an integral solution from the solution to the relaxed problem,
lp_sol
- Parameters:
lp_sol (np.nadarray) – A solution to the continuous, relaxed optimization problem.
scores (np.ndarray) –
The score for a given locus quantifies the benefit in selecting it as accessible.
\[\mathcal{S}: \mathbb{R}^{K\times 1}_{\geq 0} \rightarrow \mathbb{R}\]budget (float, optional) –
Defines the budget constraint to upper bound the proportion of loci selected as open/accessible.
\[\sum^{n}_{i=1} \ell_i \leq \lfloor nb \rfloor\]where \(\textsf{budget} := b\) and \(n\) is the number of loci (fixed-step contiguous genomic intervals). In general, a lower budget will result in more conservative results.
gamma (type, optional) –
Weight for the ‘fragmentation penalty’ in decision space.
\[\gamma \sum^{n-1}_{i=1} |\ell_i - \ell_{i+1}|\]In general, higher values of gamma yield fewer but broader peaks. Gamma should be set relative to the step size in the coverage signals, with gamma increasing as the step size decreases, as discussed in the Supplement of the paper.
rand_iter (int, optional) –
Number of randomized integral solutions to draw as
\[\mathbf{\ell}^{\mathsf{~rand}} \sim \text{Bernoulli}(\mathbf{\ell^{\mathsf{~LP}}})\]eps_rand (float, optional) –
Defines the initial ‘floor-epsilon’ reference solution with the following procedure
eps_cpy = eps_rand # initialize as floor_eps solution init_sol = np.floor(lp_sol + eps_rand) while np.sum(init_sol) > np.floor(n*budget): # loop guarantees the feasibility of solutions eps_cpy = eps_cpy/2 init_sol = np.floor(lp_sol + eps_cpy) # now start randomization procedure...
- Return type:
numpy.ndarray
- solve_chrom(chromosome, common_loci=None, Smat_chr=None, scores=None, budget=None, gamma=None, tau=None, solver=None, solver_reltol=None, solver_abstol=None, solver_feastol=None, solver_maxiter=None, verbose_solving=None, step=None, outfile=None, pr_bed=None)[source]¶
Executes ROCCO on a given chromosome.
- Parameters:
chromosome (str) – Name of chromosome on which to execute Rocco
common_loci (np.ndarray, optional) – A list/array of loci indices in
chromosome
over which all samples’ coverage signals are defined. Inferred from data ifNone
.Smat_chr (np.ndarray, optional) – A \(K \times n\) np.ndarray (samples by loci) coverage signal matrix. Inferred from data if
None
.scores (np.ndarray, optional) – A list/array of scores for each locus quantifying their appeal for selection as open/accessible.
budget (float, optional) –
Defines the budget constraint to upper bound the proportion of loci selected as open/accessible.
\[\sum^{n}_{i=1} \ell_i \leq \lfloor nb \rfloor\]where \(\textsf{budget} := b\) and \(n\) is the number of loci (fixed-step contiguous genomic intervals). In general, a lower budget will result in more conservative results.
gamma (type, optional) –
Weight for the ‘fragmentation penalty’ in decision space.
\[\gamma \sum^{n-1}_{i=1} |\ell_i - \ell_{i+1}|\]In general, higher values of gamma yield fewer but broader peaks. Gamma should be set relative to the step size in the coverage signals, with gamma increasing as the step size decreases, as discussed in the Supplement of the paper.
outfile (str, optional) – Name of the chromosome-specific BED output file.
- Returns:
a tuple (chromosome, selected_loci)
- Return type:
tuple(str, np.ndarray)
- Notes:
The initial, unrelaxed optimization problem ROCCO addresses is the following integer program
\[\begin{split}\begin{aligned} & \underset{\boldsymbol{\ell}}{\text{ Minimize:}} & & f_{\mathsf{IP}}(\boldsymbol{\ell}) = \sum_{i=1}^{n}-\left(\mathcal{S}(i)\cdot\ell_i\right) + \gamma\sum_{i=1}^{n-1} |\ell_i - \ell_{i+1}| \\ & \text{Subject To:} & & \text{(i)}~~\sum_{i=1}^{n}\ell_i \leq \lfloor nb\rfloor \\ & & & \text{(ii)}~~\ell_i \in \{0,1\}, ~\forall i=1 \ldots n. \end{aligned}\end{split}\]where \(\ell_i\) denotes the binary decision variable for the \(i\) -th locus, \(\mathcal{S}(i)\) denotes the ‘score’ for the \(i\) -th locus, etc. See paper for more details.
- class rocco.Sample(input_file, genome_file, **kwargs)[source]¶
Bases:
object
Used to generate/parse coverage tracks of samples’ BAM or BigWig files
- Parameters:
input_file (str) – a BAM (.bam), or bigwig (.bw) file for a given sample
genome_file (str) – Path to the genome sizes file containing chromosome sizes
skip_chroms (list, optional) – List of chromosomes to exclude when parsing/generating coverage tracks
proc_num (int, optional) – Number of processes to use when computing chromosome-specific coverage tracks
step (int, optional) – Step size for coverage tracks. This is overwritten and inferred from the data if a BigWig file is used as input
weight (float, optional) – Weight by which to scale coverage values by. Can be used to apply scaling factor for normalization, etc. Defaults to 1.0
norm_method (str, optional) – use CPM, BPM, or RPKM (see documentation for deeptools’s bamCoverage tool) to normalize the sample’s coverage track. Ignored if input is a BigWig file. Defaults to RPKM for BAM files.
norm_ignore_chroms (list, optional) – list of chromosomes to skip for normalization
effective_genome_size – Effective genome size for normalization. Ignored if norm_method is not ‘RPGC’.
raw_counts (bool, optional) – If
True
, no normalization is performed when computing the coverage track from the input BAM file.
- bam_to_bigwig(bamcov_cmd=None, additional_args=None)[source]¶
Wraps deeptools’ bamCoverage to generate a BigWig coverage track from the input BAM file
- get_chrom_data(chromosome)[source]¶
Parse a BigWig file and return chromosome-specific coverage loci, coverage values
- get_input_type()[source]¶
Determine if self.input_file is a BAM, or BigWig file
The file type is determined by the file extension: ‘.bam’, ‘.bw’, etc. and is not robust to incorrectly labelled files.
- Raises:
ValueError – If file extension is not supported
- Returns:
a string (extension) representing the file type
- Return type:
str
- rocco.get_chroms_and_sizes(genome_file)[source]¶
Parse chromosomes and their respective sizes in genome_file
- Raises:
FileNotFoundError – If genome file cannot be found
- Returns:
chromosome names and sizes in genome_file
- Return type:
tuple(list,list)