- Section 1: Introduction
- Section 2: Installation
- Section 3: Gnerating Pseudo Bulk Data
- Section 4: Noise Analysis
- Section 5: Rare Component Analysis
- Section 6: Single Cell Related Functions
- Section 7: Well-Characterized Deconvolution Datasets
- Citation
Links:
Cell type proportion is related to phenotypes or diseases (Wang, et al., Wei, et al.). Therefore, quantifying cell or tissue proportions is important for understanding the mechanisms in biological processes.
Here, we propose a cell type deconvolution evaluating toolkit named 'Deconer' to perform comprehensive and systematic evaluation of different algorithms.
Deconer consists of 6 main functional components as described below:
- Pseudo bulk data generation (from large-scale bulk data and single-cell data).
- Stability analysis under various types of noise.
- Rare component analysis.
- Unknown component analysis.
- Comprehensive evaluation metrics, along with visually appealing figure generation.
- Well-characterized datasets for deconvolution utilities.
In the subsequent sections, we will provide a comprehensive tutorial on how to use Deconer.
Deconer is based on R and can be easily installed on Windows, Linux as well as MAC OS.
First, users should install R >= 4.1.0.
Next, install devtools and Deconer.
# install devtools
install.packages('devtools')
# install the Deconer package
devtools::install_github('Honchkrow/Deconer')
# load Deconer
library(Deconer)
Generating pseudo-bulk data is a challenging problem. Deconer takes inspiration from previous works (Francisco, et al., Wang, et al., Wei, et al., Racle, et al.) and the Tumor Deconvolution DREAM Challenge to provide various strategies for generating pseudo-bulk RNA-seq data from both large-scale bulk RNA-seq data and scRNA-seq data.
Many deconvolution methods require cell type-specific bulk data as prior knowledge during the deconvolution process. To create a more realistic simulation, we have collected 302 well-characterized bulk RNA-seq data for generating pseudo-bulk data. During the data generation process, one-third of the samples are used for generating the mixture, while the remaining samples are used for generating the external reference.
Taking inspiration from the Tumor Deconvolution DREAM Challenge, Deconer also provides a function to generate 'coarse' and 'fine' level mixture samples.
For the 'coarse' level, Deconer generates mixture samples that consist of the following 8 cell types:
- B cells
- CD4 T cells
- CD8 T cells
- endothelial cells
- macrophages
- monocytes
- neutrophils
- NK cells
For the 'fine' level, there will be 14 cell types:
- memory B cells
- naive B cells
- memory CD4 T cells
- active CD4 T cells
- regulatory T cells
- memory CD8 T cells
- naive CD8 T cells
- NK cells
- neutrophils
- monocytes
- myeloid dendritic cells
- macrophages
- fibroblasts
- endothelial cells
The following demonstration shows how to generate simulated samples. Since different deconvolution methods may be implemented in different programming languages, we provide the simulated data output as CSV files, which can be easily loaded into R, Python, as well as MATLAB.
library(Deconer)
exprSim(n_sample = 50, # generate 50 samples
type = 'coarse', # can be changed to 'fine'
transform = 'TPM',
outputPath = "./exprSim",
mix_name = "coarse_mix.csv", # mixture samples
ref_name = "coarse_ref.csv", # cell type specific reference
prop_name = "coarse_prop.csv", # groundtruth proportions
refVar_name = "coarse_refVar.csv", # expression variance for cell type specific reference
train_name = "train_data.csv") # data for generating reference, can be used for differential analysis
All the output files will be saved in the folder named 'exprSim'. The proportions will be randomly generated from a uniform distribution (Wei, et al.). Additionally, Deconer provides output data for generating the external reference (parameter 'train_name' in the function exprSim), which can be utilized for marker gene selection in differential expression analysis.
With the advancements in single-cell technologies, the prediction of cell type proportions has become more accurate. Numerous single-cell-based methods have been proposed, such as MuSiC and SCDC. Furthermore, the construction of in silico bulk data has become more straightforward.
For comprehensive simulations, Deconer provides two types of simulated bulk data.
The first type is the human PBMC data from 10X. This single-cell dataset contains more than 10,000 single cells from human blood. We processed this dataset using 10X Cell Ranger and muon. Cell type annotation was performed following the tutorial 'Processing gene expression of 10k PBMCs'. We annotated 13 cell types as follows:
- intermediate_mono
- CD8+_naĂŻve_T
- mDC
- CD4+_naĂŻve_T
- NK
- memory_B
- CD4+_memory_T
- CD16_mono
- pDC
- naĂŻve_B
- CD8+_activated_T
- CD14_mono
- MAIT
The second dataset is derived from mouse tissue. Tissue-level deconvolution is also an important aspect, such as predicting tissue origin from human liquid biopsy. Han, et al. constructed a high-quality scRNA-seq atlas for mice, which Deconer utilizes for in silico mixing. To ensure a robust simulation, Deconer incorporates data from 7 tissues of the female fetal mouse, with 1500 cells for each tissue.
- stomach
- lung
- liver
- kidney
- intestine
- brain
- gonad
The method for generating bulk data is similar with Section 3.1.
scExprSim(
n_sample = 50, # generate 50 samples
cell_number = 3000,
ref_cell_number = 1000,
p = 2/3, # the proportion of cell number of building reference for each cell type
transform = "TPM",
outputPath = "./scExprSim",
bulk_name = "scMouse_gene_expr.csv",
ref_bulk_name = "scMouse_ref_bulk.csv",
ref_sc_name = "scMouse_ref_sc.csv",
ref_sc_label = "scMouse_ref_sc_label.csv",
prop_name = "scMouse_prop.csv",
type = "mouse_tissue" ## 'mouse_tissue' or 'human_PBMC'
)
Generally, when a new deconvolution approach is proposed, it is essential to conduct comparisons among different methods. Deconer offers multi-level comparison and visualization functions. In this section, we will demonstrate how to perform a simple evaluation of deconvolution results.
For the demonstration, we have selected EpiDISH, DeconRNASeq, and FARDEEP from Bioconductor as the example methods. Users can follow the steps below to easily reproduce the results.
First, generate simulated expression data. In this example, we will use the massive RNA-seq dataset as a reference for generating the simulated data.
library(Deconer)
# generate bulk data
exprSim(n_sample = 50, # generate 50 samples
type = 'coarse', # can be changed to 'fine'
transform = 'TPM',
outputPath = "./exprSim",
mix_name = "coarse_mix.csv", # mixture samples
ref_name = "coarse_ref.csv", # cell type specific reference
prop_name = "coarse_prop.csv", # groundtruth proportions
refVar_name = "coarse_refVar.csv", # expression variance for cell type specific reference
train_name = "train_data.csv") # data for generating reference, can be used for differential analysis
Next, utilize EpiDISH, DeconRNASeq, and FARDEEP to perform deconvolution on the bulk data and save the results.
library(EpiDISH)
library(DeconRNASeq)
library(FARDEEP)
mix <- read.csv(file = "./exprSim/coarse_mix.csv", header = T, row.names = 1)
ref <- read.csv(file = "./exprSim/coarse_ref.csv", header = T, row.names = 1)
# For a fast demo, we use only 500 markers.
mix <- as.matrix(mix[1:500, ])
ref <- as.matrix(ref[1:500, ])
# deconvolute with CIBERSORT algorithm
res1 <- epidish(beta.m = mix, ref.m = ref, method = "CBS")
p1 <- t(res1$estF)
# deconvolute with RPC algorithm
res2 <- epidish(beta.m = mix, ref.m = ref, method = "RPC")
p2 <- t(res2$estF)
# deconvolute with DeconRNASeq algorithm
res3 <- DeconRNASeq(datasets = as.data.frame(mix), signatures = as.data.frame(ref))
p3 <- t(res3$out.all)
colnames(p3) <- colnames(mix)
# deconvolute with FARDEEP algorithm
res4 <- fardeep(X = ref, Y = mix)
p4 <- t(res4$relative.beta)
# save the results
write.csv(x = p1, file = paste0("./exprSim/CBS.csv"), row.names = T, quote = F)
write.csv(x = p2, file = paste0("./exprSim/RPC.csv"), row.names = T, quote = F)
write.csv(x = p3, file = paste0("./exprSim/DeconRNASeq.csv"), row.names = T, quote = F)
write.csv(x = p4, file = paste0("./exprSim/FARDEEP.csv"), row.names = T, quote = F)
Next, users can compare the deconvolution results from various perspectives.
For instance, one can directly compare the root mean square error (RMSE) of each cell type.
actual <- "./exprSim/coarse_prop.csv"
predicted <- c("./exprSim/CBS.csv",
"./exprSim/RPC.csv",
"./exprSim/DeconRNASeq.csv",
"./exprSim/FARDEEP.csv")
label <- c("CBS", "RPC", "DeconRNASeq", "FARDEEP")
plot_multiple(actual = actual,
predicted = predicted,
label = label,
method = "rmse",
type = "celltype", # can be changed to 'sample'
figure = "boxplot")
We can also compare the root mean square error (RMSE) of each sample by changing the parameter 'type' from 'celltype' to 'sample'.
Additionally, the overall RMSE can be visualized using a barplot.
plot_multiple(actual = actual,
predicted = predicted,
label = label,
method = "rmse",
type = "all",
figure = "barplot")
Furthermore, the results can be visualized in a more specific manner using a heatmap.
plot_multiple(actual = actual,
predicted = predicted,
label = label,
method = "rmse",
type = "celltype",
figure = "heatmap")
Additionally, a scatter plot with the Pearson Correlation Coefficient (PCC) for each method can also be generated.
plot_multiple(actual = actual,
predicted = predicted,
label = label,
method = "pearson",
type = "celltype", # the color is assigned for each cell type
figure = "scatterplot",
nrow = 2)
A combined heatmap with circles can be utilized to illustrate multiple metrics simultaneously.
If users wish to plot the results for a single method, they can utilize the 'plot_single' function. For more detailed information, please refer to the Deconer manual.
In expression data analysis, both technical and biological noise cannot be overlooked. The presence of noise in bulk data can have a detrimental impact on deconvolution. To assess the stability and accuracy of different methods, Deconer offers functions to introduce noise with various structures into bulk data.
Various noise models have been used in different studies, including the normal, log-normal, and negative binomial models (Jin, et al.). Deconer has implemented these models and offers a flexible interface for users to utilize.
- Normal Model
- Log-Nomal Model
In the aforementioned models,
- Negative Binomial Model
The simulation strategy was proposed by Jin, et al. and Law, et al.. The noise level, denoted as
The function "addNoiseExpr" can generate noise based on a normal, log-normal, or negative binomial model (Jin, et al.). We highly recommend utilizing the negative binomial model. Below is a simple example code:
# first, generate a simple dataset with 50 samples and 200 marker genes with function 'pseudoExpr'
res <- pseudoExpr(n_sample = 50, n_gene = 200)
# users can check the dimension of output
dim(res$mix)
dim(res$ref)
dim(res$prop)
# save the bulk data
write.csv(x = res$mix, file = "./addNoiseExpr/mix.csv", row.names = T, quote = F)
write.csv(x = res$ref, file = "./addNoiseExpr/ref.csv", row.names = T, quote = F)
write.csv(x = res$prop, file = "./addNoiseExpr/prop.csv", row.names = T, quote = F)
# the new data will be generated in a folder
addNoiseExpr(exprFile = "./addNoiseExpr/mix.csv",
Pt = seq(0.1, 1, 0.1), # parameter to control noise level
type = "NB") # "NB", "N" or "LN". 3 types of model.
A folder named "mix" will be generated in the path './addNoiseExpr', and bulk data with different levels of noise will be saved in separate files within the folder, as shown below.
mix.csv
ref.csv
prop.csv
mix/
├── mix_NL_0.csv # original data without in silico noise
├── mix_NL_0.1.csv # noise level 0.1
├── mix_NL_0.2.csv
├── mix_NL_0.3.csv
├── mix_NL_0.4.csv
├── mix_NL_0.5.csv
├── mix_NL_0.6.csv
├── mix_NL_0.7.csv
├── mix_NL_0.8.csv
├── mix_NL_0.9.csv
└── mix_NL_1.csv # noise level 1
Afterward, users can easily test the performance of different methods. The following demonstration illustrates how to analyze the stability of different methods with varying noise levels.
Comparing the deconvolution performance among different algorithms can help identify the most suitable method for a specific biological scenario.
In this example, we demonstrate how to utilize Deconer for conducting a systematic comparison of different methods using stability analysis. To provide a direct and efficient example, we have also employed EpiDISH, DeconRNAseq, and FARDEEP for deconvolution.
First, perform deconvolution using different methods.
library(EpiDISH)
library(DeconRNASeq)
library(FARDEEP)
ref <- as.matrix(read.csv(file = "./addNoiseExpr/ref.csv", header = T, row.names = 1))
for (NL in seq(0, 1, 0.1)) {
writeLines(paste("Now, processing noise level:", NL, sep = " "))
mixFile <- paste0("./addNoiseExpr/mix/mix_NL_", NL, ".csv")
mix <- as.matrix(read.csv(file = mixFile, header = T, row.names = 1))
# deconvolute with CIBERSORT algorithm
res1 <- epidish(beta.m = mix, ref.m = ref, method = "CBS")
p1 <- t(res1$estF)
# deconvolute with RPC algorithm
res2 <- epidish(beta.m = mix, ref.m = ref, method = "RPC")
p2 <- t(res2$estF)
# deconvolute with DeconRNASeq algorithm
res3 <- DeconRNASeq(datasets = as.data.frame(mix), signatures = as.data.frame(ref))
p3 <- t(res3$out.all)
colnames(p3) <- colnames(p1) # assign column name
# deconvolute with FARDEEP algorithm
res4 <- fardeep(X = ref, Y = mix)
p4 <- t(res4$relative.beta)
# save the results
write.csv(x = p1, file = paste0("./addNoiseExpr/mix/CBS_", NL, ".csv"), row.names = T, quote = F)
write.csv(x = p2, file = paste0("./addNoiseExpr/mix/RPC_", NL, ".csv"), row.names = T, quote = F)
write.csv(x = p3, file = paste0("./addNoiseExpr/mix/DeconRNASeq_", NL, ".csv"), row.names = T, quote = F)
write.csv(x = p4, file = paste0("./addNoiseExpr/mix/FARDEEP_", NL, ".csv"), row.names = T, quote = F)
}
First, we aim to assess the performance of a single method under varying levels of noise. Let's consider the CIBERSORT algorithm provided by EpiDISH as an example.
We can generate the trend of root mean square error (RMSE) across different noise levels. Additionally, we can plot cell type-specific metrics to visualize the estimation bias for each cell type.
# the real proportion
actual <- "./addNoiseExpr/prop.csv"
# predicted proportions
predicted <- paste0("./addNoiseExpr/mix/CBS_", seq(0, 1, 0.1), ".csv")
# label for each file
noise_level <- paste0("NL", seq(0, 1, 0.1))
# boxplot
plot_multiple(actual = actual,
predicted = predicted,
label = noise_level,
method = "rmse",
type = "celltype",
figure = "boxplot")
# heatmap
plot_multiple(actual = actual,
predicted = predicted,
label = noise_level,
method = "rmse",
type = "celltype",
figure = "heatmap")
The output figures are presented below.
It is evident that as the noise power increases, the quality of the deconvolution results deteriorates.
Certainly, the results specific to cell types are influenced by noise power. However, this issue has not been fully studied yet.
Next, we will demonstrate how to perform a comprehensive cross-comparison among different methods.
# the real proportion
actual <- "./addNoiseExpr/prop.csv"
# predicted proportions as a list
RPC <- paste0("./addNoiseExpr/mix/RPC_", seq(0, 1, 0.1), ".csv")
CBS <- paste0("./addNoiseExpr/mix/CBS_", seq(0, 1, 0.1), ".csv")
DeconRNASeq <- paste0("./addNoiseExpr/mix/DeconRNASeq_", seq(0, 1, 0.1), ".csv")
FARDEEP <- paste0("./addNoiseExpr/mix/FARDEEP_", seq(0, 1, 0.1), ".csv")
predicted <- list(RPC = RPC,
CBS = CBS,
DeconRNASeq = DeconRNASeq,
FARDEEP = FARDEEP)
noise_level <- paste0("NL_", seq(0, 1, 0.1))
# boxplot
plot_multiple2(actual = actual,
predicted = predicted,
condition = noise_level,
method = 'rmse', # compute rmse
type = 'celltype',
figure = 'boxplot')
# heatmap
plot_multiple2(actual = actual,
predicted = predicted,
condition = noise_level,
method = 'mape', # compute mape
type = 'celltype',
figure = 'heatmap')
The boxplot and heatmap for RMSE and MAPE are presented below.
Typically, relying on a single metric may not be adequate to assess the efficacy of deconvolution, particularly for rare components (Wei, et al.). Deconer offers a heatmap with circles that illustrates dual metrics in a single figure.
# generate figure for rmse and pearson
plot_multiple2(actual = actual,
predicted = predicted,
condition = noise_level,
method = 'rmse',
method2 = 'mape',
type = 'celltype',
figure = 'cheatmap')
In the cell type deconvolution problem, cell types with extremely small proportions are often ignored. However, these cell types can play vital roles in certain situations, such as tumor-infiltrating lymphocytes (TILs), which exhibit low fractions in many cancer tissues. To address this issue, several methods have been proposed, such as DWLS and ARIC.
Deconer offers the functions 'rareExprSim' and 'rarescExprSim' for simulating bulk data with rare components. To conduct a comprehensive analysis, Deconer iterates over all cell types as potential rare components in a loop, using a pre-defined rare proportion gradient.
Here is an example involving 6 different deconvolution algorithms.
First, generate an in silico rare proportion dataset.
# rare proportion is set to 0.001, 0.003, 0.005, 0.008, 0.01, 0.03 and 0.05
rareExprSim(p_rare = c(0.001, 0.003, 0.005, 0.008, 0.01, 0.03, 0.05),
outputPath = "./function_test/rareExprSim/",
type = "coarse",
transform = "TPM")
Next, we perform deconvolution on the simulated bulk data using different algorithms. Here, we present the deconvolution results obtained from ARIC, CIBERSORT, EPIC, dtangle, FARDEEP, and DeconRNAseq as examples.
For simplicity, the specific code for each deconvolution method is not shown here. The following plotting code is based on known deconvolution results.
We can generate scatter plots for each method, as shown below.
p_rare = c(0.001, 0.003, 0.005, 0.008, 0.01, 0.03, 0.05)
actual <- "coarse_prop.csv"
# use EPIC as an example
predicted <- "EPIC_coarse_coarse_p-0.01_fc-2.csv"
plot_rare(actual = actual,
predicted = predicted,
p_rare = p_rare,
method = "spearman",
celltype = TRUE,
figure = "scatterplot")
From the above figure, users can analyze the deconvolution performance for each cell type.
Additionally, we can generate heatmaps and circular heatmaps to facilitate cross-comparisons.
actual <- "./coarse_prop.csv"
predicted <- Sys.glob(paths = "./*_coarse_coarse_p-0.01_fc-2.csv")
label <- c("ARIC", "CBS", "EPIC", "dRNAseq", "dtangle", "fardeep")
p_rare = c(0.001, 0.003, 0.005, 0.008, 0.01, 0.03, 0.05)
plot_rare(actual = actual,
predicted = predicted,
p_rare = p_rare,
method = "rmse",
figure = "heatmap")
plot_rare(actual = actual,
predicted = predicted,
p_rare = p_rare,
method = "rmse",
method2 = "mape",
figure = "cheatmap")
Functions related to single-cell data are similar to those used for large-scale bulk data, such as 'scExprSim' and 'rarescExprSim'. For more detailed information, please consult the Deconer manual.
Additionally, we have curated 14 well-characterized deconvolution datasets for users. Some of these datasets include known cell-type proportions, while others may not have known true proportions but have associated phenotypic information. We provide these datasets in a processed format, including bulk data, reference data, and true proportions, making them readily usable. The summarized datasets are presented in the following table.
NO. | NAME | Description | Reference Type | Proportion | Sample Size | Source |
---|---|---|---|---|---|---|
1 | Abbas | Microarray | Bulk | known | 12 | Abbas, et al. |
2 | Becht | Microarray | Bulk | known | 10 | Becht, et al. |
3 | Gong | Microarray | Bulk | known | 9 | Gong, et al. |
4 | Kuhn | Microarray | Bulk | known | 10 | Kuhn, et al. |
5 | Linsley | RNA-seq | Bulk | known | 5 | Linsley, et al. |
6 | Liu | RNA-seq | Bulk | known | 24 | Liu, et al. |
7 | Parsons | RNA-seq | Bulk | known | 30 | Parsons, et al. |
8 | Shen-Orr | Microarray | Bulk | known | 33 | Shen-Orr, et al. |
9 | Shi | Microarray | Bulk | known | 60 | Shi, et al. |
10 | T2D | RNA-seq | Single Cell | unknown | 89 | Fadista, et al. |
11 | TCGA_LUSC | RNA-seq | Bulk | unknown | 130 | Vasaikar, et al. |
12 | TCGA_OV | RNA-seq | Bulk | unknown | 514 | Vasaikar, et al. |
13 | kidney_Arvaniti | RNA-seq | Single Cell | unknown | 11 | Arvaniti, et al. |
14 | kidney_Arvaniti_TPM | RNA-seq | Bulk | unknown | 11 | Arvaniti, et al. |
15 | kidney_Craciun | RNA-seq | Single Cell | unknown | 19 | Craciun, et al. |
16 | kidney_Craciun_TPM | RNA-seq | Bulk | unknown | 19 | Craciun, et al. |
17 | TCGA 35 cancer datasets | RNA-seq | Bulk | unknown | - | Vasaikar, et al. |
We have provided processed 'LUSC' and 'OV' datasets from TCGA, which have been validated by FARDEEP. Additionally, datasets related to other cancers are available in the form of expression matrix along with the original information.
Users can download these datasets from the following page.
Note: Some datasets are collected from dtangle and MuSiC.
Please cite the corresponding article when using these datasets.
Zhang, Wei, et al. "Deconer: A comprehensive and systematic evaluation toolkit for reference-based cell type deconvolution algorithms using gene expression data." bioRxiv (2023): 2023-12.