Mass Spectrometry interaction Prediction (MSiP)

#Introduction The MSiP is a computational approach to predict protein-protein interactions (PPIs) from large-scale affinity purification mass spectrometry (AP-MS) data. This approach includes both spoke and matrix models for interpreting AP-MS data in a network context. The “spoke” model considers only bait-prey interactions, whereas the “matrix” model assumes that each of the identified proteins (baits and prey) in a given AP-MS experiment interacts with each of the others. The spoke model has a high false-negative rate, whereas the matrix model has a high false-positive rate. Thus, although both statistical models have merits, a combination of both models has shown to increase the performance of machine learning classifiers in terms of their capabilities in discrimination between true and false positive interactions.

###Load the package

library(MSiP)

###Sample Data Description: A demo AP-MS proteomics dataset is provided in this package to guide the users about data structure.

data("SampleDatInput")
head(SampleDatInput)

##    Experiment.id Replicate   Bait   Prey counts
##            <int>    <char> <char> <char>  <int>
## 1:            14    BN2198   NSP5 Q07065     17
## 2:            23    BN2214  ORF7A O75947      3
## 3:            16    BN2173   NSP7 Q8TCU6      2
## 4:             7    BN2177  NSP11 P30153      2
## 5:            22    BN2186   ORF6 Q8NBM4      1
## 6:             4    BN2190      N P51571      1

###Scoring based on “spoke-model”: Comparative Proteomic Analysis Software Suite (CompPASS) is a robust statistical scoring scheme for assigning scores to bait-prey interactions. The output from CompPASS scoring includes Z-score, S-score, D-score, WD-score and other features. This function was optimized from the.

datScoring <- 
    cPASS(SampleDatInput)
head(datScoring)

##   Bait   Prey AvePSM TotalPSM Ratio_totalPSM      Ratio   S_score  Z_score
## 1    E A5YKK6      4        4      1.0000000 0.03846154 10.198039 4.899136
## 2    E O15144      5       17      0.2941176 0.11538462  6.582806 2.344863
## 3    E O95299      1        1      1.0000000 0.03846154  5.099020 4.899136
## 4    E P04899      7        7      1.0000000 0.03846154 13.490738 4.899136
## 5    E P07355      1        1      1.0000000 0.03846154  5.099020 4.899136
## 6    E P07954      4        4      1.0000000 0.03846154 10.198039 4.899136
##     D_score  WD_score Entropy
## 1 10.198039 0.6014181       0
## 2  6.582806 0.2893455       0
## 3  5.099020 0.3007091       0
## 4 13.490738 0.7956014       0
## 5  5.099020 0.3007091       0
## 6 10.198039 0.6014181       0

###Scoring based on “matrix-model”: The Dice coefficient was first applied by to score interaction between all identified proteins (baits and preys) in a given AP-MS expriment.

datScoring <- 
    diceCoefficient(SampleDatInput)
head(datScoring)

##             BPI      Dice
## 1 A4D1P6~A5YKK6 0.0000000
## 2      A4D1P6~E 0.0000000
## 3 A4D1P6~E9PAV3 0.0000000
## 4      A4D1P6~M 0.0000000
## 5      A4D1P6~N 0.0000000
## 6   A4D1P6~NSP1 0.6666667

Alternatively, Jaccard, Simpson, and Overlap scores can be used to score the interaction between all the identified proteins in a given AP-MS experiment.

#Jaccard coefficient
datScoring <- 
    jaccardCoefficient(SampleDatInput)
head(datScoring)

##             BPI Jaccard
## 1 A4D1P6~A5YKK6     0.0
## 2      A4D1P6~E     0.0
## 3 A4D1P6~E9PAV3     0.0
## 4      A4D1P6~M     0.0
## 5      A4D1P6~N     0.0
## 6   A4D1P6~NSP1     0.5

#Simpson coefficient
datScoring <- 
    simpsonCoefficient(SampleDatInput)
head(datScoring)

##             BPI Simpson
## 1 A4D1P6~A5YKK6       0
## 2      A4D1P6~E       0
## 3 A4D1P6~E9PAV3       0
## 4      A4D1P6~M       0
## 5      A4D1P6~N       0
## 6   A4D1P6~NSP1       1

#Overlap score
datScoring <- 
    simpsonCoefficient(SampleDatInput)
head(datScoring)

##             BPI Simpson
## 1 A4D1P6~A5YKK6       0
## 2      A4D1P6~E       0
## 3 A4D1P6~E9PAV3       0
## 4      A4D1P6~M       0
## 5      A4D1P6~N       0
## 6   A4D1P6~NSP1       1

Finally, a weighted matrix model can also be employed to score interactions between identified proteins in a given AP-MS experiment. The output of the weighted matrix model includes the number of experiments for which the pair of proteins is co-purified (i.e., k) and −1*log(P-value) of the hypergeometric test (i.e., logHG) given the experimental overlap value, each protein’s total number of observed experiments, and the total number of experiments.

datScoring <- 
Weighted.matrixModel(SampleDatInput)

## Joining with `by = join_by(UniprotID)`
## Joining with `by = join_by(UniprotID)`

head(datScoring)

##             BPI k    logHG
## 1 A4D1P6~Q12931 2 4.762174
## 2 O00160~P53396 2 3.690590
## 3 O00160~Q9UBT2 2 4.762174
## 4 O00231~P28331 2 3.558201
## 5 O15144~Q9P2D7 2 3.690590
## 6 O43776~P12004 2 4.762174

###Assign a confidence score to each instances using classifiers: The labeled feature matrix can be used as input for Support Vector Machine (SVM) or Random Forest (RF) classifiers. The classifier then assigns each bait-prey pair a confidence score, indicating the level of support for that pair of proteins to interact. Hyperparameter optimization can also be performed to select a set of parameters that maximizes the model’s performance. The RF and the SVM functions provided in this package also computes the areas under the precision-recall (PR) and ROC curve to evalute the performance of the classifier.

####Import the demo data:

data("testdfClassifier")
head(testdfClassifier)

##             BPI      Dice   Jaccard   Overlap   Simpson  k     logHG target
## 1 O00232~P35998 0.3478261 0.2105263 0.2105263 1.0000000  4 1.5102503      1
## 2 P18077~P42766 0.7272727 0.5714286 0.5538462 0.9230769 12 3.0401840      1
## 3 Q08722~Q96N66 0.6153846 0.4444444 0.4444444 1.0000000  8 3.9266218      0
## 4 O43684~P42224 0.6363636 0.4666667 0.4083333 0.7000000  7 3.0103311      0
## 5 Q13838~Q14498 0.5882353 0.4166667 0.3571429 0.7142857  5 3.1524199      1
## 6 P22314~P61224 0.3333333 0.2000000 0.1142857 0.4000000  2 0.9471408      0

####Run the RF classifier:

## named numeric(0)

####Output from RF classifier:

#positive score corresponds to  the level of support for the pair of proteins to be true positive
#negative score corresponds to the level of support for the pair of proteins to be true negative
head(predidcted_RF)

##    fulldat[, 1]  Positive  Negative
## 1 O00232~P35998 0.6493617 0.3506383
## 2 P18077~P42766 0.7876696 0.2123304
## 3 Q08722~Q96N66 0.1543956 0.8456044
## 4 O43684~P42224 0.3505563 0.6494437
## 5 Q13838~Q14498 0.6022014 0.3977986
## 6 P22314~P61224 0.0805381 0.9194619

####Run the SVM classifier:

#only generate the ROC curve
predidcted_SVM <- 
    svmTrain(testdfClassifier,impute = FALSE,p = 0.3,parameterTuning = FALSE,
             cost = seq(from = 2, to = 10, by = 2),
             gamma = seq(from = 0.01, to = 0.10, by = 0.02),
             kernel = "radial",ncross = 10,
             pr.plot = FALSE, roc.plot = TRUE
    )

## named numeric(0)

## Setting levels: control = Positive, case = Negative

## Setting direction: controls < cases

####Output from SVM classifier:

#positive score corresponds to  the level of support for the pair of proteins to be true positive
#negative score corresponds to the level of support for the pair of proteins to be true negative
head(predidcted_SVM)

##                   Positive            Negative           
## 1 "O00232~P35998" "0.433857088579018" "0.566142911420982"
## 2 "P18077~P42766" "0.521637430750344" "0.478362569249656"
## 3 "Q08722~Q96N66" "0.411314802177396" "0.588685197822604"
## 4 "O43684~P42224" "0.393538672144932" "0.606461327855068"
## 5 "Q13838~Q14498" "0.40621899565531"  "0.59378100434469" 
## 6 "P22314~P61224" "0.393537868461527" "0.606462131538473"