Assessment of the reliability of protein-protein interactions
and protein function prediction

Minghua Deng, Fengzhu Sun and Ting Chen
Program in Molecular and Computational Biology, Department of Biological Sciences
University of Southern California, 1042 West 36th Place, DRB 155, Los Angeles, CA 90089-1113.





Abstract

As more and more high-throughput protein-protein interaction data accumulate, estimating the reliability of the different data sets becomes increasingly important. In this paper, we develop a maximum likelihood method that estimates the reliability of a protein interaction data set from the distribution of the correlation coefficients of the gene expression profiles of its putative interacting protein pairs. We apply the method to the yeast two-hybrid protein interaction data of Uetz and of Ito, also considering the protein pairs in Ito's data that are observed to interact with at least a given number of IST hits. We find that the reliability of Uetz's interaction data is much higher than that of Ito's data, and is comparable to that of the pairs in Ito's data that are observed with more than one IST hit. The reliability of Ito's data generally increases with the number of IST hits, which is consistent with our intuition. We also apply the method to protein complex data based on mass spectrometry, obtained by tandem affinity purification (TAP) and by high-throughput mass-spectrometric protein complex identification (HMS-PCI), and find that the reliability of the TAP data is higher than that of the HMS-PCI data. Finally, we predict protein functions from the protein interaction data sets and estimate the accuracy of the predicted functions. The results show that the estimated reliability is consistent with the accuracy of the function prediction.


Original data

1. YPD protein name with cellular function.

Here we use the gene names in the Yeast Proteome Database (YPD) to identify proteins, and the YPD cellular roles as the function categories. We downloaded the data from YPD on April 8, 2002. They contain 6416 yeast proteins, of which 3894 are annotated with one or more of 43 cellular roles and 2522 are unknown. The data are given below.

Text file (825K byte)
Compressed file (244K byte, recommended)

2. Protein-protein interaction data.

3. Gene expression data.

We also use gene expression data to verify our predictions. The expression data we used, which cover 6080 genes at 77 time points, are listed below.

Text file (3.36M byte)
Compressed file (820K byte, recommended)


Distribution of gene expression correlation for different protein-protein interaction data.

We compute the gene expression correlation coefficient for each protein pair and plot its distribution for each data set as well as for randomly chosen pairs. The distributions are shown in Figure 1 and Figure 2. To test whether the mean correlation coefficient of an interaction data set is significantly higher than that of random pairs, we compute the T-score and the corresponding p-value. The following table lists the statistics of the distributions for the different data sets.
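The computation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the page does not say which correlation coefficient or which form of the T-score was used, so the sketch assumes the Pearson coefficient and a two-sample (Welch) t statistic, with a normal approximation for the one-sided p-value. The function names are hypothetical.

```python
import math
from statistics import mean, variance


def pearson(x, y):
    """Pearson correlation of two expression profiles of equal length."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def welch_t(sample, background):
    """Two-sample t statistic (assumed form of the T-score).

    `sample` holds the correlations of putative interacting pairs,
    `background` those of randomly chosen pairs.
    """
    se = math.sqrt(variance(sample) / len(sample)
                   + variance(background) / len(background))
    return (mean(sample) - mean(background)) / se


def one_sided_p_normal(t):
    """Upper-tail p-value under a normal approximation (assumption:
    reasonable only when both samples are large)."""
    return 0.5 * math.erfc(t / math.sqrt(2))
```

For each data set one would apply pearson to the two expression profiles of every putative pair, collect the resulting coefficients, and pass them to welch_t together with the coefficients of the random pairs.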



Reliability estimation.

We use a maximum likelihood method to estimate the reliability of putative protein-protein interaction data. For physical interaction data, such as Uetz's data, Ito's data, and Ito's data restricted to different numbers of IST hits, we use the MIPS physical interaction data as the reference data. For protein complex data, such as the TAP and HMS-PCI data, we use the MIPS complex data as the reference to estimate the reliabilities. The results are given in the following table.
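The exact likelihood is not spelled out on this page, so the following Python sketch is a guess at its general shape rather than the authors' implementation. It assumes that the correlation coefficients of a putative data set follow a two-component mixture: with probability r (the reliability) a pair behaves like a pair from the reference data, and otherwise like a random pair. The binning, the pseudocount, the grid search, and all function names are assumptions made for illustration.

```python
import math


def bin_index(value, n_bins, lo=-1.0, hi=1.0):
    """Index of the histogram bin that contains a correlation value."""
    width = (hi - lo) / n_bins
    return max(0, min(int((value - lo) / width), n_bins - 1))


def bin_probabilities(values, n_bins, pseudo=1.0):
    """Binned distribution; the pseudocount keeps every bin positive."""
    counts = [pseudo] * n_bins
    for v in values:
        counts[bin_index(v, n_bins)] += 1
    total = sum(counts)
    return [c / total for c in counts]


def estimate_reliability(putative, reference, random_pairs,
                         n_bins=20, steps=1000):
    """Grid-search maximum likelihood estimate of the mixing weight r.

    Assumed model: P(bin k | putative) = r * f_ref[k] + (1 - r) * f_rand[k].
    """
    f_ref = bin_probabilities(reference, n_bins)
    f_rand = bin_probabilities(random_pairs, n_bins)
    counts = [0] * n_bins
    for v in putative:
        counts[bin_index(v, n_bins)] += 1

    def log_likelihood(r):
        return sum(c * math.log(r * f_ref[k] + (1 - r) * f_rand[k])
                   for k, c in enumerate(counts) if c)

    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=log_likelihood)
```

Here reference would hold the correlation coefficients of the MIPS pairs and random_pairs those of randomly chosen pairs. A grid search is used only because it is easy to check; an EM iteration or a numerical optimizer would serve equally well under the same assumed model.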



Function prediction based on different protein-protein interaction data.

We predict protein function from the protein-protein interaction data using the neighborhood-counting method and the chi-square method, and use the leave-one-out method to measure the specificity and sensitivity of the predictions. The specificity and sensitivity of the predictions based on each protein-protein interaction data set are plotted so that the different data sets can be compared.
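As a rough illustration of the procedure described above, the Python sketch below implements a simple neighborhood-counting predictor, a chi-square score, and a leave-one-out evaluation. Several details are assumptions not stated on this page: the predictor returns the top_k most frequent functions among the interaction partners, the chi-square score compares observed counts with counts expected from the frequency of each function among the other annotated proteins, sensitivity is taken as the number of correct predictions divided by the number of known functions, and specificity as the number of correct predictions divided by the number of predicted functions. All names are hypothetical.

```python
from collections import Counter


def neighborhood_count_predict(protein, neighbors, annotations, top_k=3):
    """Predict the top_k functions most frequent among interaction partners."""
    counts = Counter()
    for partner in neighbors.get(protein, ()):
        if partner != protein:  # ignore self-interactions
            counts.update(annotations.get(partner, ()))
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [function for function, _ in ranked[:top_k]]


def chi_square_scores(protein, neighbors, annotations):
    """Score each function by (observed - expected)^2 / expected."""
    others = [p for p in annotations if p != protein]
    partners = [p for p in neighbors.get(protein, ())
                if p != protein and p in annotations]
    if not others or not partners:
        return {}
    background, observed = Counter(), Counter()
    for p in others:
        background.update(annotations[p])
    for p in partners:
        observed.update(annotations[p])
    scores = {}
    for function, total in background.items():
        expected = len(partners) * total / len(others)
        scores[function] = (observed[function] - expected) ** 2 / expected
    return scores


def leave_one_out(neighbors, annotations, top_k=3):
    """Hide each annotated protein in turn and predict it from its partners."""
    overlap = known = predicted = 0
    for protein, functions in annotations.items():
        guess = neighborhood_count_predict(protein, neighbors,
                                           annotations, top_k)
        overlap += len(set(guess) & set(functions))
        known += len(functions)
        predicted += len(guess)
    sensitivity = overlap / known if known else 0.0
    specificity = overlap / predicted if predicted else 0.0
    return sensitivity, specificity
```

In this sketch neighbors maps each protein to the set of its interaction partners and annotations maps each annotated protein to its set of cellular roles. Varying top_k traces out a curve of specificity against sensitivity for a given interaction data set.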





Ting Chen
Last modified: Thursday, July 18, 2002.