The xcmsViewer package provides a pipeline for analyzing and visualizing the mass spectrometry-based untargeted metabolomics data. It emphasizes the following features:
Figure 1: The xcmsViewer data analysis pipeline.
Integrated annotation database for metabolites identification The feature annotation (or identification of metabolite) is performed on both MS1 and MS2 levels. On the MS1 level, the mass of metabolites (by adding or removing certain adducts) are compared with the mass of known metabolites to annotate features. On the MS2 level, the MS2 spectra are compared with the spectra of known metabolites in annotation database. The annotation database provided along with xcmsViewer are compiled from HMDB (experimental spectra) and MSDIAL (all public MS/MS). Additionally, users can also create in-house database to annotate metabolites.
Expression matrix-centric The feature table returned by xcms is a matrix, where rows are features/metabolites, columns are samples, and the values in the matrix are the intensities of features/metabolites in each sample. Such a matrix is usually a starting point for the downstream statistical analysis, such as differential expression, feature selection, etc. Due to the immaturity of metabolomics data processing, we often need to check the chromatograms behind intensity and how well an MS2 spectrum matched a reference spectrum. In xcmsViewer, the “expression matrix” plays a central role in bridging the evidence of metabolite identification (e.g., chromatogram, spectra comparison) with statistical results (e.g., different expression).
Suitable for large sample sizes Instead of loading all files into memory, xcmsViewer processes raw files sequentially. In most of the steps, only one raw file is loaded into memory at a time for peak picking, so a large number of raw files could be handled with a less powerful computer.
Flexibility in statistical analysis and a separated viewer for statistical results visualization The xcmsViewer is seamlessly integrated into the R environment, therefore, it inherits the flexibility of statistical analysis given by R. Importantly, once the statistical analyses have been incorporated into the results, other users will be able to check it directly, with any need of coding.
Interactive visualization XcmsViewer visualizes features and samples in two-dimensional spaces defined by any statistical analyses used for prioritizing interesting candidates. Users can interactively select a subset of samples or features of interest. Downstream analyses (e.g., enrichment analysis, correlation analysis) are performed spontaneously in response to the selected features or samples. The interactive visualization enables fast hypothesis evaluation and generation.
Deliverable The processed data, together with the visualization using the Shiny app, could be delivered to the collaborator as standalone software or using shiny-server. Your collaborator will not need to install any program to explore the results.
The xcmsViewer is still under active development and available from github:
devtools::install_github("mengchen18/xcmsViewer")
In addition, the xcmsViewer package is largely dependent on the [omicsViewer] package. If you want to access the most recent version of omicsViewer, please install the github version of omicsViewer first:
devtools::install_github("mengchen18/omicsViewer")
In this vignette, we work with a simple experiment of 4 measurements (using data dependent acquisition). The files can be downloaded from Zenodo. As indicated by the file names, there are two experimental conditions A and B, each condition has two replicates. This experiment aims to find the metabolites of differential abundance between the two conditions. To achieve this goal, we will process the data to find which metabolites could be detected and how abundant they are in each sample. Next, we will use t-test to compare samples from the two conditions to find metabolites significantly different in their abundance. In the data exploration, we can visualize the t-test results using a volcano plot to prioritize the interesting metabolites.
In a highly collaborative research environment, the computational scientist will perform the data processing and statistical analysis, and the wet-lab scientist will mainly focus on biological interpretation. One of the primary purposes of xcmsViewer is to facilitate data sharing and communication between researchers of different backgrounds. Therefore, xcmsViewer consists of a separate data processing pipeline and visualization pipeline. In the data processing, users should perform feature identification and annotation, including statistical analysis needed for the data interpretation, then store the data into an object to be visualized using the Shiny interface. The Shiny user can directly explore the statistical results and focus on data interpretation without worrying about data processing details.
Therefore, before we look into how to prepare the object, we first see how we can open the Shiny interface and how to explore the results using it. The analysis mentioned in the previous section has been prepared and shipped together with xcmsViewer package. We can open it as below:
library(xcmsViewer)
xcmsViewer( system.file("extdata", package = "xcmsViewer") )
Users must first load data using the dropdown box in the top-right corner. This interface is build based on the omicsViewer package. You can find a tutorial for the interface here. There is only one object in this example. Usually, we have at least two datasets in actual experiments, one for the negative mode and one for the positive mode.
After loading the data, we see a few tabs on the left-panel, we will mainly check a few unique features at xcmsViewer (as compared to omicsViewer):
Figure 2: Feature space defined by retention time (the x-axis) and mass (the y-axis).
We can also visualize features in other two-dimensional spaces defined by any statistical results. For example, in this experiment, we used a t-test to compare groups A and B. We can use a volcano plot to visualize the results of differential expression. To show features in a volcano plot, users need to change the x- and y-axis by selecting the dropdown box of “x-axis” and “y-axis” on top, that is, the x-axis should be set to mean difference (mean.diff), and the y-axis is should be log transformed p-values (log.pvalue). Of note, the mean difference is on the log10 scale, which means 1 (or -1) on the x-axis indicate a 10-fold difference in the original scale.
Users can use the box or lasso selection function to select features from the figures. The selected feature will be shown in the Feature table and sent to the right panel for downstream analysis.
Figure 3: Feature space defined by mean difference (x-axis) and log transformed p-values (y-axis).
When a single feature is selected from this tab or Feature table tab, usually it is helpful to check the chromatogram and the match of MS2 spectra (if available). This could be done by checking the Annotation tab on the right panel (see below for more information).
The right panel reacts on the features and samples selected from the left panel. The Annotation tab (Figure 4) is specific for checking the metabolomics data. Users can use it to check chromatogram quality and MS2 peak matches. In figure 4, panel A shows the chromatogram. Panel B is a table where every row is a metabolite from the annotation database whose mass is mapped to the feature under checking. Some of these annotations will also have MS2 peaks matched to MS2 peaks detected in the experiment (Panel C). In addition, we may see some features with good MS2 peaks, but the peaks are not mapped to any know metabolites in the annotation database. In these cases, we can download the MS2 spectra in MGF format and use it to infer the compound structure, for example, using SIRIUS.
Figure 4: Usre interface of “Annotation” tab used to check chromatogram and annotation.
Now we will learn how to use xcmsViewer to prepare the object that has been explored in the previous section.
In this tutorial, we will work in a project folder in this structure:
--- projectA:
|--- R - save all rscripts
|--- mzXML - store the mzXML files
|--- Rds - store intermediate or temporary objects created during processing
|--- xcmsViewerData - save the final object to be visualzed using xcmsViewer
If you have not do so, please download the example from Zenodo and put the four files intot he mzXML folder. Then we are ready to move to the next section.
# setwd("projectA") # change to the working directory
# loading required libraries
library(parallel)
library(BiocParallel)
library(omicsViewer)
library(xcms)
library(xcmsViewer)
# loading annotation data
data("hmdb_msdial")
hmdb_msdial is the feature annotation database compiled from HMDB experimental spectra and the MS-DIAL public dataset.
Users can prepare an in-house annotation data following the format above and merge it with the public database.
This dataset contains 4 samples, two from condition A and two from condition B, each condition has two replicates. We create a data.frame to show the experimental design.
# prepare meta data
ff <- list.files("mzXML")
pd <- data.frame(file = file.path("mzXML", ff), stringsAsFactors = FALSE)
pd$label <- sub(".mzXML", "", basename(pd$file))
pd$group <- c("A", "A", "B", "B")
The experimental design data.frame should contain at least two columns named “file” and “label”. The “file” column tells which mzXML (or mzML) files will be processed and where these files are. The “label” column contains the unique label for every sample. In this tutorial, we are working on a straightforward experimental design, therefore, only one factor “group” is added to tell which sample is from which biological group. However, users should try to include all meta information helpful for data interpreting in more complex experimental designs. Doing so will make the best use of xcmsViewer, where users can quickly link any meta information to the metabolites and explore many assumptions quickly.
The next step is to process data to get an intensity matrix, where rows are features and columns are samples. The matrix values correlate with the features’ abundance in each sample. The processing consists of three steps, i.e., peak identification, alignment, and correspondence, defined in the R/Bioconductor package xcms. To know more information about the workflow, please refer to here.
# feature identification and quantification
res <- runPrunedXcmsSet(
files = pd$file, # path to the files
pheno = pd, # phenotype data
tmpdir = "Rds/temp", # folder to store temporary files, needs to be an empty folder or not existing then a new folder will be created
postfilterParam = list(
postfilter = c(5, 3000), # at least 5 peaks has intensity higher than 3000
ms1.noise = 100, # ms1 spectrum, intensity lower than this will be considered as noise
ms1.maxPeaks = Inf, # ms1 spectrum, maximum number of peaks to retain
ms1.maxIdenticalInt = 20, # noise peaks are usually same intensity, maximum number identical peaks to be considered as non-noise
ms2.noise = 30, # same but for ms2
ms2.maxPeaks = 100,
ms2.maxIdenticalInt = 6,
BPPARAM=bpparam() # parallel processing
),
keepMS1 = FALSE, # should the MS1 be kept, if yes, we can generate chromatogram give an intensity range. But it will make the data much larger.
mode = "neg", # ionization mode, "pos" or "neg"
ref = hmdb_msdial, # feature annotation database
RTAdjustParam = PeakGroupsParam(minFraction = 0.5, span = 0.5), # retention time adjustment parameter, will be passed to "adjustRtimePeakGroups"
ppmtol = 20, # mass tolerance in PPM
mclapplyParam = list(fun_parallel = mclapply, mc.cores = 4) # parallel
)
saveRDS(res, file = "Rds/xcmsViewer_intermediate_obj.RDS") # save the data
The last step in the workflow is to perform statistical analysis on the data. XcmsViewer implemented a convenient function to perform most frequently used analyses, such as t-test or PCA. The prepViewerData function will create an object can be visualized using shiny app. User should always use call this function first, then we can add results of other analyses to the object later.
# statistical analysis
expr <- log10( exprs(res@featureSet) ) # usually we log10 transform the original expression matrix.
boxplot(expr)
exprs(res@featureSet) <- normalize.nQuantiles(expr, probs = 0.5) # median centering of columns of expression matrix
tcs <- rbind(c("group", "A", "B")) # define a matrix tells which groups should be compared
viewerObj <- prepViewerData( object = res, median.center = FALSE, compare.t.test = tcs, fillNA = TRUE, paired = FALSE)
saveRDS(viewerObj, "xcmsViewerData/res.RDS")
# start viewer
xcmsViewer("xcmsViewerData/") # start the viewer
One of the advantages of xcmsViewer is that any statistical result could be incorporated into the results and, therefore, be used to visualize the data in the Shiny interface and prioritize the interesting candidates.
Imaging another experiment has more samples studied, and assuming we have a quantitative phenotypical variable, we want to know which metabolites are well correlated with it. To extract this information, we can perform a correlation analysis. Because we usually visualize features in a 2-Dimensional space, we can calculate the correlation coefficient R and (logarithm transformed) p-value, then combine them with the previously created object. In the Shiny interface, we will use the R and (logarithm transformed) p-value to visualize features. It does not make sense to perform such an analysis with data of samples, but this trivial the example is good enough to exemplify the idea.
# define a variable which should be tested with
var1 <- c(1, 1.5, 5, 6)
expr <- exprs(viewerObj@featureSet)
cors <- apply(expr, 1, function(x) {
if (sum(!is.na(x)) < 3) # if less than 3 valid values, then return NA
return(c(NA, NA))
r <- cor.test(x, var1, use = "pair")
c(r$p.value, r$estimate)
})
corstats <- data.frame(
R = cors[2, ],
p.value = cors[1, ]
)
# log transform for visualization
corstats$log.pvalue <- - log10(corstats$p.value)
# to add prefix to the column headers
colnames(corstats) <- paste("cor_test", "var1", colnames(corstats), sep = "|")
# add the correlation test results to the object
fd <- fData(viewerObj@featureSet)
fd <- cbind(fd, corstats)
fData(viewerObj@featureSet) <- fd
# save results
saveRDS(viewerObj, file = "xcmsViewerData/res_with_cortest.RDS") # save the data
Then in the Shiny app, the features can be visualized in a two-dimensional space where the x-axis is “R” and the y-axis is “log.pvalue”.
R session used to generate the vignette:
## R version 4.2.1 (2022-06-23)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.4 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] digest_0.6.29 R6_2.5.1 jsonlite_1.8.0 magrittr_2.0.3
## [5] evaluate_0.15 stringi_1.7.6 rlang_1.0.2 cli_3.3.0
## [9] rstudioapi_0.13 jquerylib_0.1.4 bslib_0.3.1 rmarkdown_2.14
## [13] tools_4.2.1 stringr_1.4.0 xfun_0.30 yaml_2.3.5
## [17] fastmap_1.1.0 compiler_4.2.1 htmltools_0.5.2 knitr_1.39
## [21] sass_0.4.1