Returns a volcano plot from the output of the FindMarkers function from the Seurat package, which is a ggplot object that can be modified or plotted. To better illustrate the assumptions of the theorem, consider the case when the size factor sjcis the same for all cells in a sample j and denote the common size factor as sj*. As an example, consider a simple design in which we compare gene expression for control and treated subjects. The study by Zimmerman et al. . For example, lets pretend that DCs had merged with monocytes in the clustering, but we wanted to see what was unique about them based on their position in the tSNE plot. In a scRNA-seq study of human tracheal epithelial cells from healthy subjects and subjects with idiopathic pulmonary fibrosis (IPF), the authors found that the basal cell population contained specialized subtypes (Carraro et al., 2020). Our analysis of CF and non-CF pigs showed that the subject method better controlled the FPR of DS analysis when the expected rate of true positives is small; here, using the same animal model, we compare large and small airway ciliated cells which are expected to vary largely. Visualizing FindMarkers result in Seurat using Heatmap Our study highlights user-friendly approaches for analysis of scRNA-seq data from multiple biological replicates. However, the plot does not look well volcanic. 6b). For macrophages (Supplementary Fig. This is done by passing the Seurat object used to make the plot into CellSelector(), as well as an identity class. healthy versus disease), an additional layer of variability is introduced. Then, for each method, we defined the permutation test statistic to be the unadjusted P-value generated by the method. ## [118] sctransform_0.3.5 parallel_4.2.0 grid_4.2.0 Generally, tests for marker detection, such as the wilcox method, are sufficient if type I error rate control is less of a concern than type II error rate and in circumstances where type I error rate is most important, methods like subject and mixed can be used. First, in a simulation study, we show that when the gene expression distribution of a population of cells varies between subjects, a nave approach to differential expression analysis will inflate the FDR. ## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3 Raw gene-by-cell count matrices for pig scRNA-seq data are available as GEO accession GSE150211. For the T cells, (Supplementary Fig. First, we identified the AT2 and AM cells via clustering (Fig. FloWuenne/scFunctions source: R/DE_Seurat.R - rdrr.io I have successfully installed ggplot, normalized my datasets, merged the datasets, etc., but what I do not understand is how to transfer the sequencing data to the ggplot function. Supplementary Figure S10 shows concordance between adjusted P-values for each method. Rows correspond to different proportions of differentially expressed genes, pDE and columns correspond to different SDs of (natural) log fold change, . For each subject, gene counts are summed for all cells. Infinite p-values are set defined value of the highest -log(p) + 100. Hi, I am having difficulty in plotting the volcano plot. Comparison of methods for detection of CD66+ and CD66- basal cell markers from human trachea. ## [61] labeling_0.4.2 rlang_1.1.0 reshape2_1.4.4 In order to objectively measure the performance of our tested approaches in scRNA-seq DS analysis, we compared them to a gold standard consistent of bulk RNA-seq analysis of purified/sorted cell types. For the AM cells (Fig. In addition, it will plot either 'umap', 'tsne', or, # DoHeatmap now shows a grouping bar, splitting the heatmap into groups or clusters. ## [37] gtable_0.3.3 leiden_0.4.3 future.apply_1.10.0 ## [55] pkgconfig_2.0.3 sass_0.4.5 uwot_0.1.14 The analyses presented here have illustrated how different results could be obtained when data were analysed using different units of analysis. Whereas the pseudobulk method is a simple approach to DS analysis, it has limitations. Differential expression testing Seurat - Satija Lab In practice, often only one cutoff value for the adjusted P-value will be chosen to detect genes. The top 50 genes for each method were defined to be the 50 genes with smallest adjusted P-values. The null and alternative hypotheses for the i-th gene are H0i:i2=0 and H0i:i20, respectively. As scRNA-seq studies grow in scope, due to technological advances making these studies both less labor-intensive and less expensive, biological replication will become the norm. ", I have seen tutorials on the web, but the data there is not processed the same as how I have been doing following the Satija lab method, and, my files are not .csv, but instead are .tsv. When only 1% of genes were differentially expressed, the mixed method had a larger area under the curve than the other five methods. ## [11] hcabm40k.SeuratData_3.0.0 bmcite.SeuratData_0.3.0 Default is 0.25. FindMarkers: Finds markers (differentially expressed genes) for identified clusters. Basic volcano plot. We performed marker detection analysis of cells obtained from a study of five human skin punch biopsies (Sole-Boldo et al., 2020). ## [4] lazyeval_0.2.2 sp_1.6-0 splines_4.2.0 Department of Internal Medicine, Roy J. and Lucille A. make sure label exists on your cells in the metadata corresponding to treatment (before- and after-), You will be returned a gene list of pvalues + logFc + other statistics. Supplementary data are available at Bioinformatics online. The use of the dotplot is only meaningful when the counts matrix contains zeros representing no gene counts. The recall, also known as the true positive rate (TPR), is the fraction of differentially expressed genes that are detected. Improvements in type I and type II error rate control of the DS test could be considered by modeling cell-level gene expression adjusted for potential differences in gene expression between subjects, similar to the mixed method in Section 3. Crowell et al. The computations for each method were performed on the high-performance computing cluster at the University of Iowa. As a gold standard, results from bulk RNA-seq of isolated AT2 cells and AM comparing IPF and healthy lungs (bulk). The volcano plots for the three scRNA-seq methods have similar shapes, but the wilcox and mixed methods have inflated adjusted P-values relative to subject (Fig. (c) Volcano plots show results of three methods (subject, wilcox and mixed) used to identify CD66+ and CD66- basal cell marker genes. The marginal distribution of Kij is approximately negative binomial with mean ij=sjqij and variance ij+iij2. For this study, there were 35 distinct permutations of CF and non-CF labels between the 7 pigs. Below is a brief demonstration but please see the patchwork package website here for more details and examples. Figure 5d shows ROC and PR curves for the three scRNA-seq methods using the bulk RNA-seq as a gold standard. Among the three genes detected by subject, the genes CFTR and CD36 were detected by all methods, whereas only subject, wilcox, MAST and Monocle detected APOB. (b) CD66+ basal cells were identified via detection of CEACAM5 or CEACAM6. (d) ROC and PR curves for subject, wilcox and mixed methods using bulk RNA-seq as a gold standard. Third, the proposed model also ignores many aspects of the gene expression distribution in favor of simplicity. Single-cell RNA-sequencing (scRNA-seq) enables analysis of the effects of different conditions or perturbations on specific cell types or cellular states. Another interactive feature provided by Seurat is being able to manually select cells for further investigation. d Volcano plots showing DE between T cells from random groups of unstimulated controls drawn . For higher numbers of differentially expressed genes (pDE > 0.01), the subject method had lower NPV values when = 0.5 and similar or higher NPV values when > 0.5. Supplementary Table S1 shows performance measures derived from these curves. ## 13714 features across 2638 samples within 1 assay, ## Active assay: RNA (13714 features, 2000 variable features), ## 2 dimensional reductions calculated: pca, umap, # Ridge plots - from ggridges. ## [94] highr_0.10 desc_1.4.2 lattice_0.20-45 We also assume that cell types or states have been identified, DS analysis will be performed within each cell type of interest and henceforth, the notation corresponds to one cell type. As a gold standard, results from bulk RNA-seq comparing CD66+ and CD66- basal cells (bulk). 14.1 Basic usage. Applying themes to plots. They also thank Paul A. Reyfman and Alexander V. Misharin for sharing bulk RNA-seq data used in this study. We compared the performances of subject, wilcox and mixed for DS analysis of the scRNA-seq from healthy and IPF subjects within AT2 and AM cells using bulk RNA-seq of purified AT2 and AM cell type fractions as a gold standard, similar to the method used in Section 3.5. Therefore, as experiments that include biological replication become more common, statistical frameworks to account for multiple sources of biological variability will be critical, as recently described by Lhnemann et al. Simply add the splitting variable to object, # metadata and pass it to the split.by argument, # SplitDotPlotGG has been replaced with the `split.by` parameter for DotPlot, # DimPlot replaces TSNEPlot, PCAPlot, etc. Marker detection methods were found to have unacceptable FDR due to pseudoreplication bias, in which cells from the same individual are correlated but treated as independent replicates, and pseudobulk methods were found to be too conservative, in the sense that too many differentially expressed genes were undiscovered. Multiple methods and bioinformatic tools exist for initial scRNA-seq data processing, including normalization, dimensionality reduction, visualization, cell type identification, lineage relationships and differential gene expression (DGE) analysis (Chen et al., 2019; Hwang et al., 2018; Luecken and Theis, 2019; Vieth et al., 2019; Zaragosi et al., 2020). ## [82] pbapply_1.7-0 future_1.32.0 nlme_3.1-157 ## ## [49] htmlwidgets_1.6.2 httr_1.4.5 RColorBrewer_1.1-3 The main idea of the theorem is that if gene counts are summed across cells and the number of cells grows large for each subject, the influence of cell-level variation on the summed counts is negligible. In addition to simulated data, we analysed an animal model dataset containing large and small airway epithelia from CF and non-CF pigs (Rogers et al., 2008). a, Volcano plot of RNA-seq data from bulk hippocampal tissue from 8- to 9-month-old P301S transgenic and non-transgenic mice (Wald test). Help with Volcano plot - Biostar: S Second, we make a formal argument for the validity of a DS test with subjects as the units of analysis and discuss our development of a Bioconductor package that can be incorporated into scRNA-seq analysis workflows. The number of genes detected by wilcox, NB, MAST, DESeq2, Monocle and mixed were 6928, 7943, 7368, 4512, 5982 and 821, respectively. ## [5] ssHippo.SeuratData_3.1.4 pbmcsca.SeuratData_3.0.0 Performance measures for DS analysis of simulated data. Rows correspond to different proportions of differentially expressed genes, pDE and columns correspond to different SDs of (natural) log fold change, . On the other hand, subject had the smallest FPR (0.03) compared to wilcox and mixed (0.26 and 0.08, respectively) and had a higher PPV (0.38 compared to 0.10 and 0.23). (Lahnemann et al., 2020). Applying the assumptions Cj-1csjck1 and Cj-1csjc2k2 completes the proof. The data from pig airway epithelia underlying this article are available in GEO and can be accessed with GEO accession GSE150211. I have scoured the web but I still cannot figure out how to do this. ## Introduction. Supplementary Figure S11 shows cumulative distribution functions (CDFs) of permutation P-values and method P-values. Each panel shows results for 100 simulated datasets in one simulation setting. To use, simply make a ggplot2-based scatter plot (such as DimPlot() or FeaturePlot()) and pass the resulting plot to HoverLocator(). sessionInfo()## R version 4.2.0 (2022-04-22) Here, we introduce a mathematical framework for modeling different sources of biological variation introduced in scRNA-seq data, and we provide a mathematical justification for the use of pseudobulk methods for DS analysis. The vertical axis gives the precision (PPV) and the horizontal axis gives recall (TPR). As in Section 3.5, in the bulk RNA-seq, genes with adjusted P-values less than 0.05 and at least a 2-fold difference in gene expression between healthy and IPF are considered true positives and all others are considered true negatives. Session Info The wilcox, MAST and Monocle methods had intermediate performance in these nine settings. Step 2: Get the data ready. ## [13] magrittr_2.0.3 memoise_2.0.1 tensor_1.5 In contrast, single-cell experiments contain an additional source of biological variation between cells. Let Gammaa,b denote the gamma distribution with shape parameter a and scale parameter b, Poissonm denote the Poisson distribution with mean m and XY denote the conditional distribution of random variable X given random variable Y. ## [9] panc8.SeuratData_3.0.2 ifnb.SeuratData_3.1.0 (c and d) Volcano plots show results of three methods (subject, wilcox and mixed) used to find differentially expressed genes between IPF and healthy lungs in (c) AT2 cells and (d) AM. R: Flexible wrapper for GEX volcano plots A more powerful statistical test that yields well-controlled FDR could be constructed by considering techniques that estimate all parameters of the hierarchical model. ## [115] MASS_7.3-56 rprojroot_2.0.3 withr_2.5.0 Give feedback. ## attached base packages: Specifically, we considered a setting in which there were two groups of subjects to compare, containing four and three subjects, respectively with 21 731 genes. For each setting, 100 datasets were simulated, and we compared seven different DS methods. Differential expression testing - Satija Lab If the ident.2 parameter is omitted or set to NULL, FindMarkers () will test for differentially expressed features between the group specified by ident.1 and all other cells. Step 4: Customise it! These analyses provide guidance on strengths and weaknesses of different methods in practice. The volcano plot that is being produced after this analysis is wierd and seems not to be correct. Figure 4a shows volcano plots summarizing the DS results for the seven methods. To consider characteristics of a real dataset, we matched fixed quantities and parameters of the model to empirical values from a small airway secretory cell subset from the newborn pig data we present again in Section 3.2. Published by Oxford University Press. ## [58] deldir_1.0-6 utf8_1.2.3 tidyselect_1.2.0 Single-cell RNA-sequencing (scRNA-seq) provides more granular biological information than bulk RNA-sequencing; bulk RNA sequencing remains popular due to lower costs which allows processing more biological replicates and design more powerful studies. disease and intervention), (ii) variation between subjects, (iii) variation between cells within subjects and (iv) technical variation introduced by sampling RNA molecules, library preparation and sequencing. # Particularly useful when plotting multiple markers, # Visualize co-expression of two features simultaneously, # Split visualization to view expression by groups (replaces FeatureHeatmap), # Violin plots can also be split on some variable. ## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C If we omit DESeq2, which seems to be an outlier, the other six methods form two distinct clusters, with cluster 1 composed of wilcox, NB, MAST and Monocle, and cluster 2 composed of subject and mixed. These methods appear to form two clusters: the cell-level methods (wilcox, NB, MAST, DESeq2 and Monocle) and the subject-level method (subject), with mixed sharing modest concordance with both clusters. ## [31] progressr_0.13.0 spatstat.data_3.0-1 survival_3.3-1 These analyses suggest that a nave approach to differential expression testing could lead to many false discoveries; in contrast, an approach based on pseudobulk counts has better FDR control. To illustrate scalability and performance of various methods in real-world conditions, we show results in a porcine model of cystic fibrosis and analyses of skin, trachea and lung tissues in human sample datasets. A common use of DGE analysis for scRNA-seq data is to perform comparisons between pre-defined subsets of cells (referred to here as marker detection methods); many methods have been developed to perform this analysis (Butler et al., 2018; Delmans and Hemberg, 2016; Finak et al., 2015; Guo et al., 2015; Kharchenko et al., 2014; Korthauer et al., 2016; Miao et al., 2018; Qiu et al., 2017a, b; Wang et al., 2019; Wang and Nabavi, 2018). ## [106] cowplot_1.1.1 irlba_2.3.5.1 httpuv_1.6.9 The vertical axes give the performance measures, and the horizontal axes label each method. The intra-cluster correlations are between 0.9 and 1, whereas the inter-cluster correlations are between 0.51 and 0.62. ## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C In our simulation, the analysis focused on transcriptome-wide data simulated from the proposed model for scRNA-seq counts under different numbers of differentially expressed genes and different signal-to-noise ratios. Step-by-step guide to create your volcano plot. Introduction to Single-cell RNA-seq - ARCHIVED - GitHub Pages The negative binomial distribution has a convenient interpretation as a hierarchical model, which is particularly useful for sequencing studies. If subjects are composed of different proportions of types A and B, DS results could be due to different cell compositions rather than different mean expression levels. To whom correspondence should be addressed. The implemented methods are subject (red), wilcox (blue), NB (green), MAST (purple), DESeq2 (orange), monocle (gold) and mixed (brown). As an example, were going to select the same set of cells as before, and set their identity class to selected. ## [3] thp1.eccite.SeuratData_3.1.5 stxBrain.SeuratData_0.1.1 Figure 3a shows the area under the PR curve (AUPR) for each method and simulation setting. The subject method had the shortest average computation times, typically <1 min. We then compare multiple differential expression testing methods on scRNA-seq datasets from human samples and from animal models. Pseudobulking has been tested in real scRNA-seq studies (Kang et al., 2018) and benchmarked extensively via simulation (Crowell et al., 2020). The lists of genes detected by the other six methods likely contain many false discoveries. ## [76] goftest_1.2-3 knitr_1.42 fs_1.6.1 (b) AT2 cells and AM express SFTPC and MARCO, respectively. As increases, the width of the distribution of effect sizes increases, so that the signal-to-noise ratio for differentially expressed genes is larger. Overall, the subject and mixed methods had the highest concordance between permutation and method P-values. ## [10] digest_0.6.31 htmltools_0.5.5 fansi_1.0.4 ## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 "poisson" : Likelihood ratio test assuming an . ## [19] globals_0.16.2 matrixStats_0.63.0 pkgdown_2.0.7 Standard normalization, scaling, clustering and dimension reduction were performed using the R package Seurat version 3.1.1 (Butler et al., 2018; Satija et al., 2015; Stuart et al., 2019). Because these assumptions are difficult to validate in practice, we suggest following the guidelines for library complexity in bulk RNA-seq studies. We will also label the top 10 most significant genes with their . In the second stage, the observed data for each gene, measured as a count, is assumed to follow a Poisson distribution with mean equal to the product of a size factor, such as sequencing depth, and gene expression generated in the first stage. Third, we examine properties of DS testing in practice, comparing cells versus subjects as units of analysis in a simulation study and using available scRNA-seq data from humans and pigs. Supplementary Figure S14 shows the results of marker detection for T cells and macrophages. Compared to the T cell and macrophage marker detection analysis in Section 3.4, we note that the CD66+ and CD66-basal cells are not as transcriptionally distinct (Fig. In bulk RNA-seq studies, gene counts are often assumed to follow a negative binomial distribution (Hardcastle and Kelly, 2010; Leng et al., 2013; Love et al., 2014; Robinson et al., 2010). Default is 0.25. Nine simulation settings were considered. Supplementary Figure S12a shows volcano plots for the results of the seven DS methods described. Data for the analysis of human skin biopsies were obtained from GEO accession GSE130973. RNA-Seq Data Heatmap: Is it necessary to do a log2 . ## [121] tidyr_1.3.0 rmarkdown_2.21 Rtsne_0.16 Infinite p-values are set defined value of the highest -log(p) + 100. I used ggplot to plot the graph, but my graph is blank at the center across Log2Fc=0. The other two methods were Monocle, which utilized a negative binomial generalized additive model to test for differences in gene expression using the R package Monocle (Qiu et al., 2017a, b; Trapnell et al., 2014) and mixed, which modeled counts using a negative binomial generalized linear mixed model with a random effect to account for differences in gene expression between subjects and DS testing was performed using a Wald test. A software package, aggregateBioVar, is freely available on Bioconductor (https://www.bioconductor.org/packages/release/bioc/html/aggregateBioVar.html) to accommodate compatibility with upstream and downstream methods in scRNA-seq data analysis pipelines. Developed by Paul Hoffman, Satija Lab and Collaborators. The following differential expression tests are currently supported: "wilcox" : Wilcoxon rank sum test (default) "bimod" : Likelihood-ratio test for single cell feature expression, (McDavid et al., Bioinformatics, 2013) "roc" : Standard AUC classifier. < 10e-20) with a different symbol at the top of the graph. Here, we present a highly-configurable function that produces publication-ready volcano plots. I understand a little bit more now. ## [73] fastmap_1.1.1 yaml_2.3.7 ragg_1.2.5