---
title: "Mass Spectrometry interaction Prediction (MSiP)"
author: | 
 | Matineh Rahmatbakhsh
 | matinerb.94@gmail.com
email: "matinerb.94@gmail.com"
package: "MSiP"
output: html_vignette
vignette: >
    %\VignetteEngine{knitr::knitr}
    %\VignetteIndexEntry{MSiP tutorial}
    %\usepackage[UTF-8]{inputenc}
---




#Introduction
The MSiP is a computational approach to predict protein-protein
interactions (PPIs) from large-scale affinity purification mass spectrometry (AP-MS) data. This approach includes both spoke and matrix models for interpreting AP-MS data in a network context. The "spoke" model considers only bait-prey interactions, whereas the "matrix" model assumes that each of the identified proteins (baits and prey) in a given AP-MS experiment interacts with each of the others. The spoke model has a high false-negative rate, whereas the matrix model has a high false-positive rate. Thus, although both statistical models have merits, a combination of both models has shown to increase the performance of machine learning classifiers in terms of their capabilities in discrimination between true and false positive interactions. 

###Load the package
```{r}
library(MSiP)
```

###Sample Data Description:
A demo AP-MS proteomics dataset is provided in this package to guide the users about data structure.
```{r}
data("SampleDatInput")
head(SampleDatInput)
```

###Scoring based on "spoke-model":
Comparative Proteomic Analysis Software Suite (CompPASS) is a robust statistical scoring scheme for assigning scores to bait-prey interactions. The output from CompPASS scoring includes Z-score, S-score, D-score, WD-score and other features. This function was optimized from the.  

```{r}
datScoring <- 
    cPASS(SampleDatInput)
head(datScoring)
```

###Scoring based on "matrix-model":
The Dice coefficient was first applied by to score interaction between all identified proteins (baits and preys) in a given AP-MS expriment.
```{r}
datScoring <- 
    diceCoefficient(SampleDatInput)
head(datScoring)
```

Alternatively, Jaccard, Simpson, and Overlap scores can be used to score the interaction between all the identified proteins in a given AP-MS experiment. 
```{r}
#Jaccard coefficient
datScoring <- 
    jaccardCoefficient(SampleDatInput)
head(datScoring)

#Simpson coefficient
datScoring <- 
    simpsonCoefficient(SampleDatInput)
head(datScoring)

#Overlap score
datScoring <- 
    simpsonCoefficient(SampleDatInput)
head(datScoring)
```

Finally, a weighted matrix model can also be employed to score interactions between identified proteins in a given AP-MS experiment. The output of the weighted matrix model includes the number of experiments for which the pair of proteins is co-purified (i.e., k) and $-1$*log(P-value) of the hypergeometric test (i.e., logHG) given the experimental overlap value, each protein's total number of observed experiments, and the total number of experiments.

```{r}
datScoring <- 
Weighted.matrixModel(SampleDatInput)
head(datScoring)
```

###Assign a confidence score to each instances using classifiers:
The labeled feature matrix can be used as input for Support Vector Machine (SVM) or Random Forest (RF) classifiers. The classifier then assigns each bait-prey pair a confidence score, indicating the level of support for that pair of proteins to interact. Hyperparameter optimization can also be performed to select a set of parameters that maximizes the model's performance. The RF and the SVM functions provided in this package also computes the areas under the precision-recall (PR) and ROC curve to evalute the performance of the classifier. 

####Import the demo data:
```{r}
data("testdfClassifier")
head(testdfClassifier)

```

####Run the RF classifier:
```{r rfTrain output figure, echo=FALSE, fig.height=4, fig.width=5, message=FALSE, warning=FALSE, paged.print=FALSE}
#only generate the pr.curve
predidcted_RF <- 
    rfTrain(testdfClassifier,impute = FALSE, p = 0.3, parameterTuning = FALSE,
            mtry  = seq(from = 1, to = 5, by = 1),
            min_node_size = seq(from = 1, to = 5, by = 1),
            splitrule =c("gini"),metric = "Accuracy",
            resampling.method = "repeatedcv",iter = 5,repeats = 5,
            pr.plot = TRUE, roc.plot = FALSE
    )
```

####Output from RF classifier:
```{r}
#positive score corresponds to  the level of support for the pair of proteins to be true positive
#negative score corresponds to the level of support for the pair of proteins to be true negative
head(predidcted_RF)
```


####Run the SVM classifier:
```{r}
#only generate the ROC curve
predidcted_SVM <- 
    svmTrain(testdfClassifier,impute = FALSE,p = 0.3,parameterTuning = FALSE,
             cost = seq(from = 2, to = 10, by = 2),
             gamma = seq(from = 0.01, to = 0.10, by = 0.02),
             kernel = "radial",ncross = 10,
             pr.plot = FALSE, roc.plot = TRUE
    ) 
```

####Output from SVM classifier:
```{r}
#positive score corresponds to  the level of support for the pair of proteins to be true positive
#negative score corresponds to the level of support for the pair of proteins to be true negative
head(predidcted_SVM)
```