https://github.com/tdhock/FLOPART/workflows/R-CMD-check/badge.svg
Functional Labeled Optimal Partitioning (FLOPART) is an optimal peak detection algorithm with label constraints. The dynamic programming algorithm computes a segmentation and corresponding set of peaks which minimizes the penalized Poisson loss (for non-negative integer count data), subject to the following constraints:
- SAME as previous PeakSeg packages: there are two states (up/peak and down/background). Changes from up/peak state must be non-increasing and go to down/background state. Changes from down/background state must be non-decreasing and go to up/peak state.
- LESS constraints relative to previous PeakSeg packages: there are no constraints for the model to start and end in the down/background state.
- MORE constraints relative to previous PeakSeg packages: there are additional constraints defined by labels (see below).
if(require("remotes"))install.packages("remotes") remotes::install_github("tdhock/FLOPART")The main driver function is FLOPART, which takes three arguments
coverageis a data table with columns chromStart, chromEnd, count.labelis a data table with columns chromStart, chromEnd, annotation.penaltyis a non-negative numeric value, larger for fewer changepoints/peaks.chromStart and chromEnd columns are integer positions on a chromosome.
count is an integer data value to segment.
annotation is a character string, one of noPeaks, peakStart, peakEnd.
The constraints for the different label types are as follows:
- noPeaks: the model must stay in the down/background state in this label.
- peakStart: at the start of the label the model must be in the down/background state, there must be exactly one non-decreasing change in this label, and the model must be in the up/peak state at the end of the label.
- peakEnd: at the start of the label the model must be in the up/peak state, there must be exactly one non-increasing change in this label, and the model must be in the down/background state at the end of the label.
For a simple real data example,
> library(data.table) # for print method.> data("Mono27ac.simple", package="FLOPART") >Mono27ac.simple$coveragechromchromStartchromEndcount1:chr1114500014676502:chr1114676514680713:chr1114680717525404:chr1117525417529615:chr111752961757380---2975:chr1132698032698152976:chr1132698132698362977:chr1132698332698552978:chr1132698532699242979:chr113269923270003$labelchromchromStartchromEndannotation1:chr11180000200000noPeaks2:chr11208000220000peakEnd3:chr11300000308250peakStart4:chr11308260320000peakEndTo run FLOPART on these data,
fit<- with(Mono27ac.simple, FLOPART::FLOPART(coverage, label, penalty=1400))The fit is a list of results; we can use head to look at the first few items in each of these results.
> lapply(fit, head) $coverage_dtchromStartchromEndcountweight1:145000146765017652:1467651468071423:1468071752540284474:1752541752961425:17529617573804426:175738175780142$label_dtchromStartchromEndannotationtypefirstRowlastRow1:180000200000noPeaks0141182:208000220000peakEnd-172513223:300000308250peakStart1272327704:308260320000peakEnd-127732871$cost_mat [,1] [,2] [1,] 0.000000000.00000000 [2,] 0.110677170.11067717 [3,] 0.010522510.01052251 [4,] 0.019097840.01909784 [5,] 0.018862800.01886280 [6,] 0.026601390.02660139$intervals_mat [,1] [,2] [1,] 11 [2,] 21 [3,] 33 [4,] 35 [5,] 44 [6,] 45$segments_dtmeanfirstRowlastRowstatechromStartchromEndstatus1:0.0655650111970145000206725background2:13.1774387819811322986206725209216peak3:0.22431609113314230209216236120background4:7.38781362142417922986236120237515peak5:0.19459495179320790237515267598background6:3.68970814208025522986267598270853peakThe main output of the algorithm is segments_dt (shown above) which is a data table of optimal segment means, given the penalty and constraints. Each row of this data table describes a segment in the optimal model:
- mean is the mean value (in units of the count column from the coverage data)
- chromStart/chromEnd give the start/end positions of each segment, in units of chromStart/chromEnd from the coverage data. firstRow/lastRow are similar but in units of data points (rows in coverage_dt).
- state/status show if the segment is in peak or background.
Other outputs which could be of interest:
cost_matis the optimal cost computed by dynamic programming at each data point (row) in each state (column).intervals_matis the number of intervals (function pieces) stored in the piecewise cost function at each data point (row) in each state (column). This is useful for empirical analysis of time complexity, since the computation time is linear in the number of intervals.coverage_dtis a modified version of the initial coverage data set. For example there will be additional rows if there are labels with start/end values that are not present in the initial coverage.label_dthas additional columns firstRow/lastRow which correspond to the start/end positions of the labels, in units of the rows incoverage_dt.
- PeakSegOptimal and PeakSegDisk implement solvers for the Poisson model without label constraints, but with an additional constraint that the model must start and end in the down/background state.
- LOPART implements a solver for the normal model with 0/1 label constraints, but no inequality constraints (each change can be in any direction, up or down).
- Repo with figures from related paper in progress.