Method
As shown in the figure below, there are 557 sulfated tyrosines corresponding to 247 Swiss-Prot proteins, of which 144 sites corresponding to 100 proteins are experimentally validated. In this study, the non-sulfated tyrosine residues within the experimentally validated sulfated proteins were selected as the negative set for evaluating the predictive models. We defined position 0 as the sulfated residue and positions -4 to -1 and +1 to +4 as the residues surrounding the tyrosine sulfation site. After redundancy was removed from the positive set, the remaining 72 tyrosine sulfation sites were clustered by the Maximal Dependence Decomposition (MDD) method in order to increase the predictive sensitivity and specificity of the models. MDD is a methodology that divides a large group of aligned signal sequences into subgroups that capture the most significant dependencies between positions. Profile Hidden Markov Models (HMMs) were trained separately on each MDD-clustered subgroup of the 72 tyrosine sulfation site sequences. An HMM describes a probability distribution over a potentially infinite number of sequences and can be used to detect distant relationships between amino acid sequences. We used the software package HMMER (version 2.3.2) to build the models, calibrate them, and search the test sets for putative sulfation sites.
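The windowing convention described above (position 0 at the tyrosine, four flanking residues on each side) can be illustrated with a minimal Python sketch. The function names, the example peptide, and the use of "-" to pad windows that run past a protein terminus are assumptions made for illustration; the handling of terminal tyrosines is not specified here.

```python
def tyrosine_windows(sequence, flank=4, pad="-"):
    """Return (1-based position, window) for every tyrosine in `sequence`.

    Each window covers positions -flank..+flank around the tyrosine at
    position 0. Padding with `pad` near the termini is an assumption.
    """
    seq = sequence.upper()
    padded = pad * flank + seq + pad * flank
    windows = []
    for i, residue in enumerate(seq):
        if residue == "Y":
            # Residue i of `seq` sits at index i + flank in `padded`, so the
            # slice starting at i spans exactly -flank..+flank around it.
            windows.append((i + 1, padded[i:i + 2 * flank + 1]))
    return windows


def split_sites(sequence, sulfated_positions, flank=4):
    """Split tyrosine windows into positive (sulfated) and negative sets.

    `sulfated_positions` holds 1-based positions of annotated sulfation
    sites; every other tyrosine in the same protein is treated as negative.
    """
    positives, negatives = [], []
    for position, window in tyrosine_windows(sequence, flank):
        if position in sulfated_positions:
            positives.append(window)
        else:
            negatives.append(window)
    return positives, negatives


# Hypothetical peptide with tyrosines at positions 4 and 7; only 4 is annotated.
positives, negatives = split_sites("ADEYDDYGA", {4})
```

With a flank of 4, every window has nine characters with the tyrosine at the centre, so the positive windows are already aligned for clustering and model building.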
To evaluate the trained models, two cross-validation methods, k-fold cross-validation and leave-one-out cross-validation, were applied in this work. For each hit of a model, HMMER returned both a HMMER bit score and an expectation value (E-value). The score was the base-two logarithm of the ratio between the probability of the query sequence under the trained model (a significant match) and its probability under a random model. The E-value represented the number of sequences expected by chance to have a score greater than or equal to the returned HMMER bit score. We selected the HMMER bit score as the criterion for defining an HMM match. A search result with a HMMER score greater than the threshold t was defined as a positive prediction, i.e., the HMM recognized a sulfation site. The threshold t of each model was chosen by maximizing the accuracy measure across the cross-validations, with candidate bit score thresholds ranging from 0 to -10. The optimization of the bit score threshold for the tyrosine sulfation model is shown in the figure below. The threshold of the model was set to -4.5, which maximized the accuracy measure of the model.
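The threshold selection can be sketched as a simple scan over candidate bit score cut-offs. This is a minimal illustration, not the original implementation: it assumes accuracy is computed as (TP + TN) / (all sites), assumes a step of 0.5 between candidates, and uses made-up scores rather than data from this study. In practice the scores would come from the held-out folds of the cross-validation.

```python
def accuracy(pos_scores, neg_scores, threshold):
    """Fraction of sites classified correctly at a given bit score threshold.

    A site is predicted as sulfated when its score is strictly greater
    than `threshold`, matching the decision rule described in the text.
    """
    true_pos = sum(score > threshold for score in pos_scores)
    true_neg = sum(score <= threshold for score in neg_scores)
    return (true_pos + true_neg) / (len(pos_scores) + len(neg_scores))


def best_threshold(pos_scores, neg_scores, start=0.0, stop=-10.0, step=0.5):
    """Scan thresholds from `start` down to `stop` and return the best one.

    The step size is an assumption. On ties, max() keeps the first
    candidate seen, i.e. the threshold closest to `start`.
    """
    n_steps = int(round((start - stop) / step))
    candidates = [start - k * step for k in range(n_steps + 1)]
    return max(candidates, key=lambda t: accuracy(pos_scores, neg_scores, t))


# Illustrative (hypothetical) bit scores for held-out positive and negative sites.
pos_scores = [-1.0, -3.0, -4.0, 2.0]
neg_scores = [-5.0, -6.0, -4.6, -9.0]
threshold = best_threshold(pos_scores, neg_scores)
```

Scanning a fixed grid keeps the procedure transparent and makes the accuracy-versus-threshold curve easy to plot; a finer step or a different tie-breaking rule could be substituted without changing the overall approach.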