Latest news:

Jan. 20, 2008:
Structural information (protein disorder regions) will be annotated in dbPTM in Feb. 2008!

Abstract

dbPTM is a database that compiles information on protein post-translational modifications (PTMs), such as the modified sites, the solvent accessibility of surrounding amino acids, protein secondary and tertiary structures, protein domains, and protein variations. Version 2.0 of dbPTM integrates experimentally validated PTM sites, with supporting literature references, from Swiss-Prot, Phospho.ELM, O-GLYCBASE, and UbiProt. From the collected PTM information, about 25 types of PTM with enough experimentally validated sites were used to train profile hidden Markov models (HMMs), which detect potential PTM sites in Swiss-Prot proteins at 100% specificity. To help users investigate each type of PTM in more detail, the substrate peptide specificity is provided, including the positional amino acid frequency, solvent accessibility, and secondary structure surrounding the modified sites. Moreover, information on orthologous protein clusters is provided so that users can analyze whether PTM sites are located in evolutionarily conserved regions. All PTMs and related information are graphically visualized and freely accessible at http://dbPTM.mbc.nctu.edu.tw/.


System Architecture of dbPTM 2.0

The data generation flow comprises three major components: integration of external databases of known post-translational modifications (PTMs), learning and prediction of PTM sites, and structural and functional annotation. The experimentally validated PTM data were extracted from Swiss-Prot, Phospho.ELM, O-GLYCBASE, and UbiProt. The experimentally verified PTM sites were used to build computational models for identifying putative PTM sites in Swiss-Prot proteins. Additional structural properties and functional information, such as protein tertiary structures, protein secondary structures, the solvent accessibility of residues, protein functional domains, protein variations, and non-synonymous SNPs, are also annotated on the Swiss-Prot proteins. Furthermore, information on orthologous protein clusters is provided so that users can analyze whether PTM sites are located in evolutionarily conserved regions.

 

The Learning and Prediction of 20 Types of PTM

The experimentally validated PTM data were extracted from Swiss-Prot, Phospho.ELM, O-GLYCBASE, and UbiProt. Redundant PTM sites among the four databases were removed. About 20 types of PTM with at least 30 experimentally validated sites were then used to investigate the amino acids surrounding the modified sites and to train the profile HMMs. Given a window length n, fragments of 2n+1 residues centered on the PTM site (position 0) are extracted to form the positive training set. The value of n is set to 6. However, for types of PTM that occur at the N-terminus or C-terminus of the protein sequence, the window is set to 0 ~ +6 or -6 ~ 0, respectively. Because confirmed non-PTM sites are not available, residues that have not been annotated as PTM sites within PTM-annotated proteins were chosen to represent general non-PTM sites (the negative training set).

Maximal Dependence Decomposition (MDD), which was first applied to the prediction of RNA splicing sites, uses the χ² test to divide a large group of aligned signal sequences into subgroups that capture the most significant dependencies between positions. For each type of PTM, a profile hidden Markov model (HMM), which describes a probability distribution over a potentially infinite number of sequences, was trained on the positive set of PTM site sequences aligned without gaps.

We use the software package HMMER (version 2.3.2) to build the models, to calibrate them, and to search protein sequences for putative PTM sites. Two important HMMER parameters must be considered: the bit score and the expectation value (E-value). A match with a bit score greater than the threshold t and an E-value smaller than the threshold e is defined as a positive prediction. We select the HMMER bit score as the criterion that defines an HMM match.
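
The window extraction described above can be sketched as follows. This is a minimal illustration, not the dbPTM implementation: the function names, the `-` padding character used when a window runs past the end of a sequence, and the restriction of negative sites to the same residue type as the modified residue are all assumptions that the text does not specify.

```python
from typing import List, Set, Tuple


def extract_window(seq: str, pos: int, left: int = 6, right: int = 6,
                   pad: str = "-") -> str:
    """Return the fragment around 0-based index `pos`.

    `left`/`right` are the numbers of residues taken upstream/downstream
    (n = 6 on both sides by default). Padding with `pad` is an assumption.
    """
    upstream = seq[max(0, pos - left):pos].rjust(left, pad)
    downstream = seq[pos + 1:pos + 1 + right].ljust(right, pad)
    return upstream + seq[pos] + downstream


def build_training_sets(seq: str, ptm_positions: Set[int], residue: str,
                        n: int = 6) -> Tuple[List[str], List[str]]:
    """Split the sites of one PTM-annotated protein into positives and negatives.

    Positives: windows around annotated PTM sites.
    Negatives: windows around non-annotated occurrences of `residue`
    (restricting to the same residue type is an assumption).
    """
    positives = [extract_window(seq, p, n, n) for p in sorted(ptm_positions)]
    negatives = [extract_window(seq, i, n, n)
                 for i, aa in enumerate(seq)
                 if aa == residue and i not in ptm_positions]
    return positives, negatives


if __name__ == "__main__":
    protein = "MSAAAAAASAAAAAAS"          # toy sequence, not real data
    pos_set, neg_set = build_training_sets(protein, {8}, "S")
    print(pos_set)   # windows of 2n+1 = 13 residues
    print(neg_set)
    # One-sided window (0 ~ +6) for an N-terminal modification:
    print(extract_window(protein, 0, left=0, right=6))
```

Passing `left=0` or `right=0` gives the one-sided windows used for N-terminal or C-terminal modifications.
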
The threshold t of the HMM for each type of PTM is chosen by maximizing the accuracy measure over a series of cross-validations, with the bit score ranging from -10 to 0. Finally, we set the predictive parameters to the values at which the prediction specificity is 100%, and use them to detect the potential PTM sites in Swiss-Prot protein sequences.
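
One way to picture the threshold scan is the sketch below. It assumes that bit scores for held-out positive and negative fragments have already been obtained (for example from HMMER searches). The 0.5 step size, the function names, and the rule of returning the lowest threshold that reaches 100% specificity are illustrative assumptions rather than the documented dbPTM procedure.

```python
from typing import List, Optional, Tuple


def metrics(pos_scores: List[float], neg_scores: List[float],
            t: float) -> Tuple[float, float, float]:
    """Sensitivity, specificity and accuracy when 'score > t' means positive."""
    tp = sum(s > t for s in pos_scores)     # true positives
    tn = sum(s <= t for s in neg_scores)    # true negatives
    sens = tp / len(pos_scores)
    spec = tn / len(neg_scores)
    acc = (tp + tn) / (len(pos_scores) + len(neg_scores))
    return sens, spec, acc


def pick_threshold(pos_scores: List[float], neg_scores: List[float],
                   lo: float = -10.0, hi: float = 0.0,
                   step: float = 0.5) -> Optional[Tuple[float, float]]:
    """Scan bit-score thresholds from `lo` to `hi`.

    Returns (threshold, sensitivity) for the lowest threshold at which no
    negative fragment is predicted positive, or None if none exists.
    """
    n_steps = int(round((hi - lo) / step))
    for k in range(n_steps + 1):
        t = lo + k * step
        sens, spec, _acc = metrics(pos_scores, neg_scores, t)
        if spec == 1.0:
            return t, sens
    return None


if __name__ == "__main__":
    # Toy scores, not real HMMER output.
    positives = [-1.0, -3.5, 2.0, -8.0]
    negatives = [-9.0, -6.2, -12.0, -4.1]
    print(pick_threshold(positives, negatives))
```

Choosing the lowest qualifying threshold keeps sensitivity as high as possible under the 100% specificity constraint. In practice the scan would be repeated within each cross-validation fold.
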

 

The Benchmark of Nonredundant PTM Test Set

The nonredundant dataset for each type of PTM was generated as follows. Protein sequences containing the same type of PTM site were clustered at a threshold of 30% identity using BLASTCLUST. If two protein sequences shared more than 30% identity, the fragment sequences of window length 2n+1 residues centered on the modified sites were re-aligned using BL2SEQ. If two PTM fragment sequences were 100% identical and the PTM sites were at corresponding positions in the two whole proteins, only one site was kept and the other was discarded.
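
The final filtering step can be illustrated with the sketch below. It assumes that clustering and alignment have already been done externally (for example with BLASTCLUST and BL2SEQ) and only shows the bookkeeping. The `Site` record, its `aligned_pos` field, and the `cluster_of` mapping are hypothetical names introduced for illustration.

```python
from typing import Dict, List, NamedTuple, Set, Tuple


class Site(NamedTuple):
    protein: str       # protein identifier
    aligned_pos: int   # PTM position in the cluster alignment (hypothetical field)
    fragment: str      # 2n+1 residue window centered on the PTM site


def remove_redundant(sites: List[Site],
                     cluster_of: Dict[str, int]) -> List[Site]:
    """Keep one site per (cluster, aligned position, identical fragment)."""
    seen: Set[Tuple[int, int, str]] = set()
    kept: List[Site] = []
    for site in sites:
        key = (cluster_of[site.protein], site.aligned_pos, site.fragment)
        if key not in seen:     # first occurrence wins, later copies are dropped
            seen.add(key)
            kept.append(site)
    return kept


if __name__ == "__main__":
    # Toy data: P1 and P2 share a cluster (>30% identity), P3 does not.
    clusters = {"P1": 0, "P2": 0, "P3": 1}
    sites = [
        Site("P1", 15, "AAAAAASAAAAAA"),
        Site("P2", 15, "AAAAAASAAAAAA"),   # redundant with the P1 site
        Site("P3", 15, "AAAAAASAAAAAA"),   # different cluster, kept
        Site("P2", 40, "KKKKKKSKKKKKK"),   # different site, kept
    ]
    for s in remove_redundant(sites, clusters):
        print(s)
```

Exact string equality stands in for "100% identity" here, because the fragments all have the same length and are aligned without gaps.
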

 

Significant Improvements and Advances

The significant improvements of dbPTM 2.0 over dbPTM 1.0 are shown in the following table. The major improvements include the integration of UbiProt, an increased number of predicted PTM types, literature references for experimental PTM sites, substrate peptide specificity, conserved regions of orthologous protein clusters, the relationship between PTM and subcellular localization, and PTM associations.

 

References

SWISS-PROT: The curated protein sequence database on Internet. Watanabe K., Harayama S. Protein, Nucleic Acid and Enzyme 46:80-86(2001).

Swiss-Prot: Juggling between evolution and stability. Bairoch A., Boeckmann B., Ferro S., Gasteiger E. Brief. Bioinform. 5:39-55(2004).

Annotation of post-translational modifications in the Swiss-Prot knowledgebase. Farriol-Mathis N., Garavelli J.S., Boeckmann B., Duvaud S., Gasteiger E., Gateau A., Veuthey A.-L., Bairoch A. Proteomics 4:1537-1550(2004).

Phospho.ELM: a database of experimentally verified phosphorylation sites in eukaryotic proteins. Diella F, Cameron S, Gemund C, Linding R, Via A, Kuster B, Sicheritz-Ponten T, Blom N, Gibson TJ. BMC Bioinformatics. 2004 Jun 22;5(1):79.

The Protein Data Bank. H.M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T.N. Bhat, H. Weissig, I.N. Shindyalov, P.E. Bourne. Nucleic Acids Research, 28 (2000) 235-242.

InterPro: an integrated documentation resource for protein families, domains and functional sites. Apweiler R, Attwood TK, Bairoch A, Bateman A, Birney E, Biswas M, Bucher P, Cerutti L, et al. Bioinformatics. 2000 Dec;16(12):1145-50.

KinasePhos: a web tool for identifying protein kinase-specific phosphorylation sites. H.D. Huang*, T.Y. Lee, S.W. Tseng, and J.T. Horng. (2005) Nucleic Acids Research

Incorporating Hidden Markov Model for Identifying Protein Kinase-specific Phosphorylation Sites. H.D. Huang*, T.Y. Lee, S.W. Tseng, L.C. Wu, J.T. Horng, and A.P. Tsou. (2005) Journal of Computational Chemistry, Vol. 26, pp. 1032-1041.

RVP-net: Online prediction of real valued accessible surface area of proteins from single sequences. Shandar Ahmad, M. Michael Gromiha and Akinori Sarai. Bioinformatics 19 (2003) 1849-1851.

The PSIPRED protein structure prediction server. McGuffin LJ, Bryson K, Jones DT. Bioinformatics. 16 (2000), 404-405.

dbPTM: An Information Repository of Protein Post-Translational Modification. Tzong-Yi Lee, H.D. Huang* (joint first authorship), J.H. Hung, Y.S. Yang, and T.H. Wang*. Nucleic Acids Research (2006), Vol. 34, D622-D627.