Vitek-Lab · anshuman-raina · Sep 13, 2024 · Sep 9, 2024 · Sep 10, 2024 · Sep 10, 2024
diff --git a/vignettes/MSstatsUpdated.Rmd b/vignettes/MSstatsUpdated.Rmd
@@ -0,0 +1,335 @@
+---
+title: "MSstats: End to End Workflow"
+date: September 5th, 2024
+---
+
+```{r style, echo = FALSE, results = 'asis'}
+BiocStyle::markdown()
+```
+
+```{r global_options, include=FALSE}
+knitr::opts_chunk$set(fig.width=10, fig.height=7, warning=FALSE, message=FALSE)
+options(width=110)
+```
+
+```{=html}
+<!--
+%\VignetteIndexEntry{MSstats: End to End Workflow}
+%\VignetteEngine{knitr::knitr}
+-->
+```
+# __MSstats: Protein/Peptide significance analysis__
+
+Package: MSstats
+
+Author: Anshuman Raina
+
+Date: 5th Semptember 2024
+
+## __Introduction__
+
+MSstats, an R package in Bioconductor, supports protein significance analysis for statistical relative quantification of proteins and peptides in global, targeted and data-independent proteomics. It handles shotgun, label-free and label-based (universal synthetic peptide-based) SRM (selected reaction monitoring), and SWATH/DIA (data independent acquisition) experiments. It can be used for experiments with complex designs (e.g. comparing more than two experimental conditions, or a time course).
+
+This vignette summarizes the introduction and various options of all functionalities in MSstats. More details are available in `User Manual`.
+
+## __Installation__
+
+To install this package, start R (version “4.0”) and enter:
+
+``` {r code Installation}
+if (!requireNamespace("BiocManager", quietly = TRUE))
+    install.packages("BiocManager")
+
+BiocManager::install("MSstats")
+library(MSstats)
+
+```
+
+## __1. Workflow__
+
+### __1.1 Raw Data__ 
+
+To begin with, we will load sample datasets, including both annotated and plain data. The dataset you need can be found [here]( https://github.com/Vitek-Lab/MSstatsConvert/blob/devel/inst/tinytest/raw_data/PD/pd_input.csv).
+
+We will also load the Annotation Dataset using MSstatsConvert. You can access this dataset [here](https://github.com/Vitek-Lab/MSstatsConvert/blob/devel/inst/tinytest/raw_data/PD/annot_pd.csv).
+
+
+``` {r code Load Dataset}
+library(MSstats)
+
+# Load data
+pd_raw = system.file("tinytest/raw_data/PD/pd_input.csv", 
+                    package = "MSstatsConvert")
+
+annotation_raw = system.file("tinytest/raw_data/PD/annot_pd.csv", 
+                   package = "MSstatsConvert")
+
+pd = data.table::fread(pd_raw)
+annotation = data.table::fread(annotation_raw)
+
+head(pd, 5)
+head(annotation, 5)
+
+```
+
+### __1.2 Loading PD Data to MSstats__
+
+The imported data from Step 1.1. now must be converted through MSStatsConvert package's PDtoMSstatsFormat converter.
+
+This function converts the Proteome discoverer output into the required input format for MSstats.
+
+Actual Data modification can be seen below:
+
+```{r code PDtoMSstatsFormat}
+library(MSstatsConvert)
+
+pd_imported = MSstatsConvert::PDtoMSstatsFormat(pd, annotation, use_log_file = FALSE)
+
+head(pd_imported)
+```
+
+
+### __1.3 Converters__
+
+
+
+We have the following converters, which allow you to convert various types of output reports which include the feature level data to the required input format of MSstats. Further information about the converters can be found in the MSStatsConvert package.
+
+1. DIANNtoMSstatsFormat
+2. DIAUmpiretoMSstatsFormat
+3. FragPipetoMSstatsFormat
+4. MaxQtoMSstatsFormat
+5. OpenMStoMSstatsFormat
+6. OpenSWATHtoMSstatsFormat
+7. PDtoMSstatsFormat
+8. ProgenesistoMSstatsFormat
+9. SkylinetoMSstatsFormat
+10. SpectronauttoMSstatsFormat
+11. MetamorpheusToMSstatsFormat
+
+We show an example of how to use the above said Converters. 
+
+```{r Converter Files}
+
+skyline_raw = system.file("tinytest/raw_data/Skyline/skyline_input.csv", 
+                    package = "MSstatsConvert")
+
+skyline = data.table::fread(skyline_raw)
+head(skyline, 5)
+
+```
+
+```{r SkylinetoMSstatsFormat, results='hide', message=FALSE, warning=FALSE}
+
+msstats_format = MSstatsConvert::SkylinetoMSstatsFormat(skyline_raw,
+                                      qvalue_cutoff = 0.01,
+                                      useUniquePeptide = TRUE,
+                                      removeFewMeasurements = TRUE,
+                                      removeOxidationMpeptides = TRUE,
+                                      removeProtein_with1Feature = TRUE)
+
+
+```
+
+```{r SkylinetoMSstatsFormat head}
+head(msstats_format)
+```
+
+
+### __1.4 Data Process__
+
+Once we import the dataset correctly with Converter, we need to pre-process the data which is taken care by dataProcess.
+
+This step is essentially Data pre-processing and quality control of MS runs of the original raw data into quantitative data for model fitting and group comparison.
+
+There are 3 main steps that happen in background :
+
+* __Normalization__ - There are three different normalizations supported. 'equalizeMedians' (default) represents constant normalization (equalizing the medians) based on reference signals is performed. 'quantile' represents quantile normalization based on reference signals is performed. 'globalStandards' represents normalization with global standards proteins. FALSE represents no normalization is performed.
+
+* __Feature selection__ - This also has three options i.e. Select All features, Top-N features (by mean intensity) or “Best” features.
+* __Missing value imputation__ - We impute plausible values in case of missing data points. The RunLevelData can be queried to show Number of imputed intensities (censored intensities) in a RUN and Protein.
+
+
+``` {r code dataProcess}
+summarized = dataProcess(pd_imported)
+
+head(summarized$FeatureLevelData)
+
+head(summarized$ProteinLevelData)
+
+head(summarized$SummaryMethod)
+
+```
+
+### __1.4.1 Data Process Plots__
+
+After dataProcessing the input data, MSstats provides multiple plots to analyze the pre-processed data. Here we show the various types of plots we can use. Once invoked, a pdf file will be downloaded with corresponding feature level data and the Plot generated.
+
+```{r dataProcessPlots, results='hide', message=FALSE, warning=FALSE}
+
+# Profile plot
+dataProcessPlots(data=summarized, type="ProfilePlot", address = FALSE, which.Protein = "P0ABU9")
+
+# Quality control plot
+dataProcessPlots(data=summarized, type="QCPlot", address = FALSE, which.Protein = "P0ABU9")
+
+# Quantification plot for conditions
+dataProcessPlots(data=summarized, type="ConditionPlot", address = FALSE, which.Protein = "P0ABU9")
+
+```
+
+
+
+
+### __1.5 Modeling __
+
+In this step we test for significant changes in protein abundance across conditions based on a family of linear mixed-effects models in targeted Selected Reaction Monitoring (SRM), Data-Dependent Acquisition (DDA or shotgun), and Data-Independent Acquisition (DIA or SWATH-MS) experiment. 
+
+It is applicable to multiple types of sample preparation, including label-free workflows, workflows that use stable isotope labeled reference proteins and peptides, and workflows that use fractionation. Experimental design of case-control study (patients are not repeatedly measured) or time course study (patients are repeatedly measured) is automatically determined based on proper statistical model.
+
+
+``` {r code groupComparison}
+
+# Model
+
+model = groupComparison("pairwise", summarized)
+
+```
+
+Model Details
+
+``` {r Model }
+
+head(model$ModelQC)
+
+head(model$ComparisonResult)
+
+```
+
+### __1.5.1 groupComparisonPlot__
+
+Visualization for model-based analysis and summarizing differentially abundant proteins. To summarize the results of log-fold changes and adjusted p-values for differentially abundant proteins, groupComparisonPlots takes testing results from function groupComparison as input and automatically generate three types of figures in pdf files as output :
+
+* __Volcano plot__ : For each comparison separately. It illustrates actual log-fold changes and adjusted p-values for each comparison separately with all proteins. The x-axis is the log fold change. The base of logarithm transformation is the same as specified in “logTrans” from dataProcess. The y-axis is the negative log2 or log10 adjusted p-values. The horizontal dashed line represents the FDR cutoff. The points below the FDR cutoff line are non-significantly abundant proteins (colored in black). The points above the FDR cutoff line are significantly abundant proteins (colored in red/blue for up-/down-regulated). If fold change cutoff is specified (FCcutoff = specific value), the points above the FDR cutoff line but within the FC cutoff line are non-significantly abundant proteins (colored in black).
+
+* __Heatmap__ : For multiple comparisons. It illustratea up-/down-regulated proteins for multiple comparisons with all proteins. Each column represents each comparison of interest. Each row represents each protein. Color red/blue represents proteins in that specific comparison are significantly up-regulated/down-regulated proteins with FDR cutoff and/or FC cutoff. The color scheme shows the evidences of significance. The darker color it is, the stronger evidence of significance it has. Color gold represents proteins are not significantly different in abundance.
+
+* __Comparison plot__ : For multiple comparisons per protein. It illustrates log-fold change and its variation of multiple comparisons for single protein. X-axis is comparison of interest. Y-axis is the log fold change. The red points are the estimated log fold change from the model. The error bars are the confidence interval with 0.95 significant level for log fold change. This interval is only based on the standard error, which is estimated from the model.
+
+``` {r GroupComparisonPlots}
+groupComparisonPlots(
+  model$ComparisonResult,
+  type="Heatmap",
+  sig = 0.05,
+  FCcutoff = FALSE,
+  logBase.pvalue = 10,
+  ylimUp = FALSE,
+  ylimDown = FALSE,
+  xlimUp = FALSE,
+  x.axis.size = 10,
+  y.axis.size = 10,
+  dot.size = 3,
+  text.size = 4,
+  text.angle = 0,
+  legend.size = 13,
+  ProteinName = TRUE,
+  colorkey = TRUE,
+  numProtein = 100,
+  clustering = "both",
+  width = 800,
+  height = 600,
+  which.Comparison = "all",
+  which.Protein = "all",
+  address = FALSE,
+  isPlotly = FALSE
+)
+
+
+
+groupComparisonPlots(
+  model$ComparisonResult,
+  type="VolcanoPlot",
+  sig = 0.05,
+  FCcutoff = FALSE,
+  logBase.pvalue = 10,
+  ylimUp = FALSE,
+  ylimDown = FALSE,
+  xlimUp = FALSE,
+  x.axis.size = 10,
+  y.axis.size = 10,
+  dot.size = 3,
+  text.size = 4,
+  text.angle = 0,
+  legend.size = 13,
+  ProteinName = TRUE,
+  colorkey = TRUE,
+  numProtein = 100,
+  clustering = "both",
+  width = 800,
+  height = 600,
+  which.Comparison = "Condition2 vs Condition4",
+  which.Protein = "all",
+  address = FALSE,
+  isPlotly = FALSE
+)
+
+```
+
+### __1.6 Sample Size Calculation__
+
+Calculate sample size for future experiments of a Selected Reaction Monitoring (SRM), Data-Dependent Acquisition (DDA or shotgun), and Data-Independent Acquisition (DIA or SWATH-MS) experiment based on intensity-based linear model. The function fits the model and uses variance components to calculate sample size. The underlying model fitting with intensity-based linear model with technical MS run replication. Estimated sample size is rounded to 0 decimal. Two options of the calculation:
+
+* number of biological replicates per condition
+* power
+
+```{r Sample Size}
+
+sample_size_calc = designSampleSize(model$FittedModel,
+                                    desiredFC=c(1.75,2.5),
+                                    power = TRUE,
+                                    numSample=5)
+
+```
+
+
+
+### __1.6.1 Sample Size Calculation Plot__
+
+To illustrate the relationship of desired fold change and the calculated minimal number sample size which are
+
+Number of biological replicates per condition
+power.
+The input is the result from function designSampleSize.
+```{r Sample Size plot}
+
+designSampleSizePlots(sample_size_calc, isPlotly=FALSE)
+
+```
+
+
+### __1.7 Quantification from groupComparison Data__
+
+Model-based quantification for each condition or for each biological samples per protein in a targeted Selected Reaction Monitoring (SRM), Data-Dependent Acquisition (DDA or shotgun), and Data-Independent Acquisition (DIA or SWATH-MS) experiment. Quantification takes the processed data set by dataProcess as input and automatically generate the quantification results (data.frame) with long or matrix format. The quantification for endogenous samples is based on run summarization from subplot model, with TMP robust estimation.
+
+* Sample quantification : individual biological sample quantification for each protein. The label of each biological sample is a combination of the corresponding group and the sample ID. If there are no technical replicates or experimental replicates per sample, sample quantification is the same as run summarization from dataProcess (xxx$RunlevelData from dataProcess). If there are technical replicates or experimental replicates, sample quantification is median among run quantification corresponding MS runs.
+
+* Group quantification : quantification for individual group or individual condition per protein. It is median among sample quantification.
+
+```{r Quantification}
+sample_quant_long = quantification(summarized,
+                             type = "Sample",
+                             format = "long")
+sample_quant_long
+sample_quant_wide = quantification(summarized,
+                              type = "Sample",
+                              format = "matrix")
+sample_quant_wide
+group_quant_long = quantification(summarized,
+                                  type = "Group",
+                                  format = "long")
+group_quant_long
+group_quant_wide = quantification(summarized,
+                                  type = "Group",
+                                  format = "matrix")
+group_quant_wide
+```