Summary and scope
wigglescout
is an R library that allows you to calculate
summary values across bigWig files and BED files and visualize them in a
genomics-relevant manner. It is based on broadly used libraries such as
rtracklayer
and GenomicRanges
, among others
for calculation, and mostly ggplot2
for visualization. You
can look at the DESCRIPTION
file to get more information
about all the libraries that make this one possible.
There are also many other tools whose functionality overlaps a little
or much with wigglescout
, but there was no single tool that
included all that I needed. The aim of this library is therefore not to
replace any of those tools, or to provide a silver-bullet solution to
genomics data analysis, but to provide a comprehensive, yet simple
enough set of tools focused on bigWig files that can be used entirely
from the R environment without switching back and forth across
environments.
Other tools and libraries for akin purposes that you may be looking for include: deepTools, SeqPlots, bwtools, wiggletools, and the list is endless!
wigglescout
allows you to summarize and visualize the
contents of bigWig files in two main ways:
- Genome-wide. Genome is partitioned on equally-sized bins and their aggregated value is calculated. Useful to get a general idea of the signal distribution without looking at specific places.
- Across sets of loci. This can be either summarized categories, or individual values, as in genome-wide analyses.
wigglescout
functionality is built in two layers. Names
of functions that calculate values over bigWig files start with
bw_
. These return GRanges
objects when
possible, data.frame
objects otherwise (i.e. when values
are summarized over some category, genomic location is lost in this
process).
On the other hand, functions that plot such values and that usually
make internal use of bw_
functions, start with
plot_
.
About bundled data
This package comes with a set of small files to show functionality.
These have been built from a published data from Simon Elsässer’s lab
and correspond to a H3.3 and H3K9me3 ChIP + input
(GSE149080
), subset across a 500kbp genomic region:
chr15-102600000-103100000
, which overlaps with the
HOXC gene cluster. A ChromHMM annotation has also been subset
to overlap with such region.
library(ggplot2)
library(rtracklayer)
library(GenomicRanges)
library(wigglescout)
h33_chip <- system.file("extdata", "sample_H33_ChIP.bw", package = "wigglescout")
h3k9me3_chip <- system.file("extdata", "sample_H3K9me3_ChIP.bw", package = "wigglescout")
input_chip <- system.file("extdata", "sample_Input.bw", package = "wigglescout")
genes <- system.file("extdata", "sample_genes_mm9.bed", package = "wigglescout")
chromhmm <- system.file("extdata", "sample_chromhmm.bed", package = "wigglescout")
locus <- GRanges(seqnames = "chr15", IRanges(102600000, 103100000))
All these values are paths to bigWig and BED files.
Quick start
Here is a small example of what you can do with
wigglescout
. Imagine you want to take a look at how values
are distributed overall in your bigwig files, genome-wide:
plot_bw_bins_violin(
c(h33_chip, h3k9me3_chip),
bin_size = 2000,
labels = c("H3.3", "H3K9me3"),
remove_top = 0.001,
selection = locus # Plot can be subset to a certain GRanges.
)
For transparency and reproducibility, plots show relevant underlying
values by default, such as parameters used and calculated values: how
many NA
values were found, if any, how many points were
excluded from the plot due to quantile cutoff, and so on. Note also that
in this case the quantile cutoff seems low, but it is because top
elements are removed according to mean in all samples.
Note also that the bin size chosen, 2000
is somewhat
small for genome-wide analyses (it will take long runtime, see the
“Performance and runtime” section at the end of the document for more
details), but it can be done in this case due to the use of
selection
parameter, which restricts the bin analysis to a
certain locus.
You could also be interested in checking out how are certain
loci behaving in comparison with the global distribution. You
can do so providing a highlight
parameter:
plot_bw_bins_violin(
c(h33_chip, h3k9me3_chip),
bin_size = 2000,
labels = c("H3.3", "H3K9me3"),
remove_top = 0.001,
highlight = genes,
selection = locus # Plot can be subset to a certain GRanges.
)
Here you see highlighted the bins that overlap with the loci in the
genes
file.
You could also want to pairwise compare the bins. You can do so by plotting a scatterplot:
plot_bw_bins_scatter(
h33_chip,
h3k9me3_chip,
bin_size = 2000,
remove_top = 0.001,
highlight = genes,
highlight_label = "Genes",
selection = locus # Plot can be subset to a certain GRanges.
)
In real cases where you are plotting the full genome-wide set of bins, which can be in the tens or hundreds of thousands data points, there can be a lot of overplotting and it might be difficult to see where the majority of those are. For this, you can just plot a 2d histogram instead:
plot_bw_bins_density(
h33_chip,
h3k9me3_chip,
bin_size = 20000,
plot_binwidth = 0.05
)
You may be interested in looking at the signal just at the genes instead:
plot_bw_loci_scatter(
h33_chip,
h3k9me3_chip,
loci = genes
)
You can also check how these two signal behave globally in a more
meaningful way, i.e. in relation with genomic annotations. You can do
this with bw_loci_summary_heatmap
:
# ChromHMM is a genome-wide annotation according to epigenetics marks. Each
# locus is tagged by a category. And the amount of categories must be limited.
# In this case, it is fifteen.
chrom_values <- import(chromhmm, format = "BED")
head(chrom_values)
#> GRanges object with 6 ranges and 2 metadata columns:
#> seqnames ranges strand | name score
#> <Rle> <IRanges> <Rle> | <character> <numeric>
#> [1] chr15 102602601-102606600 * | 12_Heterochrom 0
#> [2] chr15 102606601-102607600 * | 11_Repressed 0
#> [3] chr15 102615401-102615800 * | 15_Insulator 0
#> [4] chr15 102615801-102616800 * | 11_Repressed 0
#> [5] chr15 102616801-102626600 * | 12_Heterochrom 0
#> [6] chr15 102626601-102627600 * | 11_Repressed 0
#> -------
#> seqinfo: 1 sequence from an unspecified genome; no seqlengths
plot_bw_loci_summary_heatmap(
c(h33_chip, h3k9me3_chip),
loci = chromhmm,
labels = c("H33", "H3K9me3")
)
In a more detailed way, you can look at the signal profile at the
given genes, using plot_bw_profile
:
plot_bw_profile(
c(h33_chip, h3k9me3_chip, input_chip),
loci = genes,
labels = c("H3.3", "H3K9me3", "Input")
)
Still, these profiles are an average of 28 loci. So you may be interested in looking at the individual profiles, which you can do with a heatmap view:
plot_bw_heatmap(
h33_chip,
loci = genes
)
In a nutshell, these are the type of things you can do with
wigglescout
. Each of these functions have many parameters
that allow you to fine tune the results. You can look at that in more
detail in the section below.
Functionality to calculate the values without plotting them is also provided, so if you want to plot something different that is not provided as an out-of-the-box function, you can still use this library for that.
On performance and runtime
All these functions work on genome-wide data, and often you will want to run these on more than one bigWig file at a time. It is possible to run all of this in a regular laptop, however if resolution is too high, waiting times will raise to minutes and even hours, depending on the amount of files and the given resolution.
In a Intel i7 laptop, bins analyses for a single bigWig file in resolution around 10000bp tend to take a few seconds. 5000bp is still reasonable interactive time. For plotting under 5000 bp resolution you will need to wait quite some time and I would recommend running these in a script outside R environment.
Locus-based analyses runtime tends to be smaller since the amount of values to be calculated is smaller than genome-wide bins. An exception to this are ChromHMM plots, since these are also genome-wide bins, if only of different lengths and labeled with categories. Keep this in mind when plotting a large set of bigWig files. It will take some time as well.
Using multiple processors
As of version 0.2.0, wigglescout
core functions support
future
specifications using furrr
library.
This means you can run the code in multiple R sessions:
# You just have to set the plan
library(future)
plan(multisession, workers = 2)
# Then run the functions just the same
bw_bins(c(h33_chip, input_chip),
bin_size = 50000,
genome = "mm9",
selection = locus,
norm_mode = "log2fc")
#> GRanges object with 11 ranges and 2 metadata columns:
#> seqnames ranges strand | sample_H33_ChIP sample_Input
#> <Rle> <IRanges> <Rle> | <numeric> <numeric>
#> [1] chr15 102550001-102600000 * | 1.423578 0.920439
#> [2] chr15 102600001-102650000 * | 1.166658 1.054456
#> [3] chr15 102650001-102700000 * | 0.734975 1.033551
#> [4] chr15 102700001-102750000 * | 0.930566 1.028487
#> [5] chr15 102750001-102800000 * | 3.198913 1.050770
#> [6] chr15 102800001-102850000 * | 2.064035 1.053755
#> [7] chr15 102850001-102900000 * | 2.240118 1.053838
#> [8] chr15 102900001-102950000 * | 1.246168 1.140229
#> [9] chr15 102950001-103000000 * | 2.212256 1.198468
#> [10] chr15 103000001-103050000 * | 3.816928 1.136733
#> [11] chr15 103050001-103100000 * | 3.168476 1.092466
#> -------
#> seqinfo: 35 sequences from an unspecified genome; no seqlengths
It is advised against implementing future planning within library
function according to future
documentation:
Please refrain from modifying the future strategy inside your packages / functions, i.e. do not call plan() in your code. Instead, leave the control on what backend to use to the end user. This idea is part of the core philosophy of the future framework - as a developer you can never know what future backends the user have access to. Moreover, by not making any assumptions about what backends are available, your code will also work automatically with any new backends developed after you wrote your code.
How to find the best configuration
Parallelization in wigglescout
setting is based on the
fact that operations on multiple files are done separately and
independently of each other. However, doing this requires some passing
of data back and forth, which means that if your bigWig files are large
(or the results you get from them, for instance, big
GRanges
objects), or you don’t have many, you may suffer
the overhead without getting any benefit from it.
An easy rule of thumb is: swap from sequential
(which is
the default) to multisession
if you are running functions
on multiple files at once. It seems like there is benefit to do this
pretty much for any number of files larger than one.
However, adding more sessions does not seem to help a lot. Best guess is going for 2 or 4 workers. With 2 workers you already see a considerable reduction in time.
My expectation is that multicore
may be faster. However
I have not benchmarked it, as it does not run within Rstudio.
sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 22.04.5 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
#>
#> locale:
#> [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
#> [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
#> [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
#> [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] future_1.34.0 purrr_1.0.2 wigglescout_0.19.0
#> [4] rtracklayer_1.64.0 GenomicRanges_1.56.1 GenomeInfoDb_1.40.1
#> [7] IRanges_2.38.1 S4Vectors_0.42.1 BiocGenerics_0.50.0
#> [10] ggplot2_3.5.1
#>
#> loaded via a namespace (and not attached):
#> [1] tidyselect_1.2.1 dplyr_1.1.4
#> [3] vipor_0.4.7 farver_2.1.2
#> [5] Biostrings_2.72.1 bitops_1.0-9
#> [7] fastmap_1.2.0 RCurl_1.98-1.16
#> [9] GenomicAlignments_1.40.0 XML_3.99-0.17
#> [11] digest_0.6.37 lifecycle_1.0.4
#> [13] Cairo_1.6-2 magrittr_2.0.3
#> [15] compiler_4.4.1 rlang_1.1.4
#> [17] sass_0.4.9 tools_4.4.1
#> [19] utf8_1.2.4 yaml_2.3.10
#> [21] knitr_1.48 S4Arrays_1.4.1
#> [23] labeling_0.4.3 curl_5.2.3
#> [25] DelayedArray_0.30.1 RColorBrewer_1.1-3
#> [27] abind_1.4-8 BiocParallel_1.38.0
#> [29] withr_3.0.1 desc_1.4.3
#> [31] grid_4.4.1 fansi_1.0.6
#> [33] colorspace_2.1-1 globals_0.16.3
#> [35] scales_1.3.0 SummarizedExperiment_1.34.0
#> [37] cli_3.6.3 rmarkdown_2.28
#> [39] crayon_1.5.3 ragg_1.3.3
#> [41] generics_0.1.3 httr_1.4.7
#> [43] rjson_0.2.23 ggbeeswarm_0.7.2
#> [45] cachem_1.1.0 stringr_1.5.1
#> [47] zlibbioc_1.50.0 parallel_4.4.1
#> [49] ggrastr_1.0.2 XVector_0.44.0
#> [51] restfulr_0.0.15 matrixStats_1.4.1
#> [53] vctrs_0.6.5 Matrix_1.7-0
#> [55] jsonlite_1.8.9 beeswarm_0.4.0
#> [57] listenv_0.9.1 systemfonts_1.1.0
#> [59] jquerylib_0.1.4 tidyr_1.3.1
#> [61] glue_1.8.0 parallelly_1.38.0
#> [63] pkgdown_2.1.1 codetools_0.2-20
#> [65] stringi_1.8.4 gtable_0.3.5
#> [67] BiocIO_1.14.0 UCSC.utils_1.0.0
#> [69] munsell_0.5.1 tibble_3.2.1
#> [71] pillar_1.9.0 furrr_0.3.1
#> [73] htmltools_0.5.8.1 GenomeInfoDbData_1.2.12
#> [75] R6_2.5.1 textshaping_0.4.0
#> [77] evaluate_1.0.0 Biobase_2.64.0
#> [79] lattice_0.22-6 highr_0.11
#> [81] Rsamtools_2.20.0 bslib_0.8.0
#> [83] SparseArray_1.4.8 xfun_0.48
#> [85] fs_1.6.4 MatrixGenerics_1.16.0
#> [87] pkgconfig_2.0.3