Project - DSP for Biomolecular Structures
A protein molecule is a long chain of amino acids that folds into a complex three-dimensional structure. It is often the geometric shape that determines a protein's function. In molecular biology, researchers use sequence alignment and structure matching to compare proteins. Treating proteins as 3D structures, we have developed algorithms that identify geometry-based features and retrieve similar proteins without having to deal with complex chemical characteristics and biological properties. Biologists suggest that in the next decade a large number of protein structures will be derived without their functions being known. Our work is a promising way to help biologists identify protein functions, which is essential to drug design and disease prediction.
Global Matching Algorithm
There are around 18000 3D protein and nucleic acid molecular models stored in the Protein Data Bank, and the number is increasing rapidly every day. Our system collected 2500 protein 3D structures from them. First, we use PCA (Principal Component Analysis) to find the best alignment of two proteins before we extract the features for matching.
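As a rough illustration of the PCA step, the sketch below centers a set of 3D atom coordinates and rotates them onto their principal axes, so that two structures processed this way end up in a comparable frame. It is a minimal pure-Python sketch and not the system's actual code: the function names (`pca_align`, `jacobi_eigen`) and the sample coordinates are hypothetical, and the Jacobi eigenvalue routine is an assumption chosen only to keep the example free of external libraries.

```python
import math

def jacobi_eigen(a, sweeps=50):
    """Eigenvalues and eigenvectors (as columns of v) of a symmetric matrix."""
    n = len(a)
    a = [row[:] for row in a]
    v = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < 1e-20:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                # Rotation angle that zeroes the off-diagonal entry a[p][q].
                if a[p][p] == a[q][q]:
                    theta = math.pi / 4
                else:
                    theta = 0.5 * math.atan(2 * a[p][q] / (a[q][q] - a[p][p]))
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # columns p and q of a
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):  # rows p and q of a
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
                for k in range(n):  # accumulate eigenvectors
                    vkp, vkq = v[k][p], v[k][q]
                    v[k][p], v[k][q] = c * vkp - s * vkq, s * vkp + c * vkq
    return [a[i][i] for i in range(n)], v

def pca_align(points):
    """Center 3D points and express them in their principal-axis frame."""
    n = len(points)
    centroid = [sum(p[i] for p in points) / n for i in range(3)]
    centered = [[p[i] - centroid[i] for i in range(3)] for p in points]
    cov = [[sum(q[i] * q[j] for q in centered) / n for j in range(3)]
           for i in range(3)]
    values, vectors = jacobi_eigen(cov)
    order = sorted(range(3), key=lambda i: -values[i])  # largest variance first
    axes = [[vectors[r][i] for r in range(3)] for i in order]
    return [[sum(q[r] * axis[r] for r in range(3)) for axis in axes]
            for q in centered]

# Hypothetical coordinates, elongated roughly along the (1, 1, 0) direction.
points = [(0, 0, 0), (1, 1, 0.1), (2, 2, -0.1), (3, 3, 0.2),
          (4, 4, 0), (2, 3, 0), (2, 1, 0.1)]
aligned = pca_align(points)
```

Principal axes are only defined up to sign, so a practical alignment would also have to try or resolve the possible axis flips before comparing two structures.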
31 features are extracted, each with a different weight. We emphasize the geometric features, so most of the weight is put on the 3D and secondary structures.
In the current system, we simply normalize the features and use the Euclidean distance to measure similarity.
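The matching step described above can be sketched as a weighted Euclidean distance over normalized feature vectors. This is only an illustration under stated assumptions: the text does not say which normalization is used, so min-max scaling is assumed here, and the three-feature vectors, the weights, and the function names are hypothetical stand-ins for the 31 real features.

```python
import math

def min_max_normalize(vectors):
    """Scale every feature to [0, 1] across the whole collection (assumed scheme)."""
    dims = len(vectors[0])
    lo = [min(v[i] for v in vectors) for i in range(dims)]
    hi = [max(v[i] for v in vectors) for i in range(dims)]
    return [[(v[i] - lo[i]) / (hi[i] - lo[i]) if hi[i] > lo[i] else 0.0
             for i in range(dims)] for v in vectors]

def weighted_distance(a, b, weights):
    """Euclidean distance in which each feature difference is scaled by a weight."""
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))

def rank_by_similarity(query_index, vectors, weights):
    """Indices of all other proteins, most similar to the query first."""
    normed = min_max_normalize(vectors)
    query = normed[query_index]
    scored = [(weighted_distance(query, v, weights), i)
              for i, v in enumerate(normed) if i != query_index]
    return [i for _, i in sorted(scored)]

# Hypothetical features per protein: (residue count, helix fraction, radius).
features = [
    [120, 0.45, 10.0],
    [125, 0.50, 11.0],
    [300, 0.10, 30.0],
    [118, 0.44, 9.5],
]
# Hypothetical weights: the two geometric features count more than the size.
weights = [0.2, 0.4, 0.4]
ranking = rank_by_similarity(0, features, weights)
```

Normalizing first keeps a feature with a large numeric range, such as the residue count here, from dominating the distance regardless of its weight.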
Substructure Matching Algorithm
Protein molecules fold into complex 3D structures and carry out nearly all of the essential functions in living cells by binding to other molecules through chemical bonds that connect neighboring atoms. The locations of these atoms are called binding sites. To help biologists identify protein functions, which is essential to drug design and disease prediction, it is desirable to retrieve common binding sites among proteins. We use the geometric hashing algorithm to identify similar binding sites among protein structures.
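To make the idea of geometric hashing concrete, here is a minimal sketch of one common 3D variant: every ordered triple of non-collinear points defines a local coordinate frame, the remaining points are quantized in that frame and stored in a hash table, and a query votes for the stored (model, basis) pairs whose quantized coordinates it reproduces. The function names, the bin size, and the point sets are all hypothetical, and the actual system may choose bases, features, and voting rules differently.

```python
import math
from collections import defaultdict
from itertools import permutations

def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def unit(a):
    n = math.sqrt(dot(a, a))
    return None if n < 1e-9 else (a[0] / n, a[1] / n, a[2] / n)

def frame(p0, p1, p2):
    """Right-handed frame at p0 from three points, or None if they are collinear."""
    x = unit(sub(p1, p0))
    if x is None:
        return None
    z = unit(cross(x, sub(p2, p0)))
    if z is None:
        return None
    return p0, x, cross(z, x), z

def local_key(fr, q, bin_size):
    """Quantized coordinates of q in the frame; unchanged by rotation and translation."""
    origin, x, y, z = fr
    d = sub(q, origin)
    return tuple(round(dot(d, axis) / bin_size) for axis in (x, y, z))

def build_table(models, bin_size=1.0):
    """Hash every non-basis point of every model under every ordered basis."""
    table = defaultdict(list)
    for name, pts in models.items():
        for basis in permutations(range(len(pts)), 3):
            fr = frame(*(pts[i] for i in basis))
            if fr is None:
                continue
            for m, q in enumerate(pts):
                if m not in basis:
                    table[local_key(fr, q, bin_size)].append((name, basis))
    return table

def best_match(table, query, bin_size=1.0):
    """Return (model name, votes) for the best-supported (model, basis) pair."""
    best = (None, 0)
    for basis in permutations(range(len(query)), 3):
        fr = frame(*(query[i] for i in basis))
        if fr is None:
            continue
        votes = defaultdict(int)
        for m, q in enumerate(query):
            if m not in basis:
                for entry in table.get(local_key(fr, q, bin_size), ()):
                    votes[entry] += 1
        for (name, _), count in votes.items():
            if count > best[1]:
                best = (name, count)
    return best

# Hypothetical binding-site atoms; site_b is deliberately much more spread out.
site_a = [(0, 0, 0), (3, 0, 0), (0, 4, 0), (1, 1, 5), (4, 3, 2), (2, 5, 3)]
site_b = [(0, 0, 0), (25, 0, 0), (0, 30, 0), (0, 0, 35), (40, 40, 0), (30, 0, 45)]
table = build_table({"A": site_a, "B": site_b})
# Query: site_a rotated 90 degrees about the z axis, then translated.
query = [(-y + 10, x - 7, z + 2) for x, y, z in site_a]
match = best_match(table, query)
```

Enumerating all ordered triples costs on the order of n^4 table entries per model, which is why this approach is better suited to small substructures such as binding sites than to whole proteins.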
There are 8390 PDB enzyme entries in the Enzyme Structure Classification Database, covering 8059 separate PDB files. Proteins in the same entry share the same function. We chose 200 proteins from the Enzyme Database, 85 of which are in Set E.C. 5.2.1.8, which contains peptidylprolyl isomerase proteins with similar functions. The precision-recall graph shows that retrieval is much better when substructure matching is considered.
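For readers unfamiliar with the evaluation, the points of a precision-recall graph can be computed from a ranked retrieval list as sketched below. The protein identifiers and the function name are hypothetical, and the numbers are made up for illustration rather than taken from the experiment: "relevant" here would mean that a retrieved protein belongs to the same enzyme class as the query.

```python
def precision_recall_curve(ranked_ids, relevant_ids):
    """Return one (recall, precision) pair for each cutoff in the ranked list."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("at least one relevant item is required")
    hits = 0
    curve = []
    for k, pid in enumerate(ranked_ids, start=1):
        if pid in relevant:
            hits += 1
        # Recall: fraction of relevant items found so far.
        # Precision: fraction of the top-k results that are relevant.
        curve.append((hits / len(relevant), hits / k))
    return curve

# Hypothetical retrieval result for one query, best match first.
ranked = ["p1", "p7", "p3", "p9", "p4"]
relevant = {"p1", "p3", "p4"}
curve = precision_recall_curve(ranked, relevant)
```

A method whose curve stays closer to precision 1.0 as recall increases is ranking the relevant proteins ahead of the irrelevant ones more consistently.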
Any suggestions or comments are welcome. Please send them to ShannChing Chen.