Advanced Multimedia Processing Lab -- Projects -- DSP for Biomolecular Structures


Project - DSP for Biomolecular Structures

 

ShannChing Chen

shanncc@andrew.cmu.edu


A protein molecule is a long chain of amino acids that folds into a complex three-dimensional structure. It is often the geometric shape that determines a protein's function. In molecular biology, researchers use sequence alignment and structure matching to compare proteins. Treating proteins as 3D structures, we have developed algorithms that identify geometry-based features and retrieve similar proteins without having to deal with complex chemical characteristics and biological properties. Biologists suggest that in the next decade a large number of protein structures will be determined before their functions are known. Our work can help biologists identify protein functions, which is essential to drug design and disease prediction.

Global Matching Algorithm

The Protein Data Bank stores around 18000 3D molecular models of proteins and nucleic acids, and the number grows every day. Our system uses 2500 protein 3D structures collected from it. Before extracting features for matching, we use PCA (Principal Component Analysis) to find the best alignment of the two proteins.

Protein models before and after rotation
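
The PCA alignment step can be sketched as follows. This is a minimal illustration, not the system's actual code: the function name pca_align is hypothetical, and we assume each protein is given as an N x 3 array of atom coordinates. Each structure is centered at its centroid and rotated so that its directions of largest variance lie along the coordinate axes, which puts two proteins into a comparable pose.

```python
import numpy as np


def pca_align(coords):
    """Rotate an (N, 3) array of atom coordinates into its principal-axis frame.

    Returns the aligned coordinates and the variances along each axis,
    sorted from largest to smallest.
    """
    coords = np.asarray(coords, dtype=float)
    centered = coords - coords.mean(axis=0)        # move the centroid to the origin
    cov = centered.T @ centered / len(centered)    # 3x3 covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)         # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]              # largest variance first
    axes = eigvecs[:, order]
    if np.linalg.det(axes) < 0:                    # keep a proper rotation (no mirroring)
        axes[:, 2] *= -1
    return centered @ axes, eigvals[order]


# Toy example: the same point cloud in two different poses.
rng = np.random.default_rng(0)
points = rng.normal(size=(50, 3)) * np.array([5.0, 2.0, 0.5])
angle = 0.7
rotation = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                     [np.sin(angle), np.cos(angle), 0.0],
                     [0.0, 0.0, 1.0]])
moved = points @ rotation.T + np.array([1.0, -2.0, 3.0])

aligned_a, var_a = pca_align(points)
aligned_b, var_b = pca_align(moved)
```

The direction of each principal axis is only defined up to sign, so the two aligned copies agree up to a flip along individual axes. A complete alignment would need an extra rule to resolve those flips, which this sketch leaves out.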

 

Feature extraction

Thirty-one features are extracted, each with its own weight.

We emphasize geometric features, so most of the weight is placed on the 3D structure and secondary structure features.

3D Structure

Feature          Definition                       Weight
Atom number      ATOM number excluding HETATMs    0.08
Render scale                                      0.08
Aspect ratio1                                     0.08
Aspect ratio2                                     0.08
Moment           *                                0.32

Secondary Structure

Feature          Definition                       Weight
HELIX            Number of HELIXs in PDB file     0.053
SHEET            Number of SHEETs in PDB file     0.053
TURN             Number of TURNs in PDB file      0.053

Primary Structure

Feature                      Definition                                   Weight
Residue ratio                Different Residue Ratio in the protein       0.001
Hydrophobic Residue ratio    Hydrophobic Residues Ratio in the protein    0.001

* In the Moment definition, the weight term is the atomic weight and the coordinate triple gives the coordinates of the atom.
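
The count-based features in the tables above (atom number, HELIX, SHEET and TURN) could be gathered with a short script such as the one below. This is a sketch under assumptions: the function name count_pdb_records and the sample lines are hypothetical, and we assume each feature is simply the number of records of that type. It relies only on the PDB file convention that the record name occupies the first six columns of each line, so HETATM lines are excluded from the ATOM count automatically.

```python
def count_pdb_records(lines):
    """Count ATOM, HELIX, SHEET and TURN records in PDB-format text lines."""
    counts = {"ATOM": 0, "HELIX": 0, "SHEET": 0, "TURN": 0}
    for line in lines:
        record = line[:6].strip()      # record name occupies columns 1-6
        if record in counts:           # HETATM and all other records are skipped
            counts[record] += 1
    return counts


# Hypothetical, shortened PDB-style lines used only for illustration.
sample = [
    "HELIX    1   1 ALA A    2  GLY A   10  1",
    "SHEET    1   A 2 VAL A  12  LEU A  16  0",
    "SHEET    2   A 2 ILE A  20  THR A  24 -1",
    "ATOM      1  N   MET A   1      27.340  24.430   2.614  1.00  9.67           N",
    "ATOM      2  CA  MET A   1      26.266  25.413   2.842  1.00 10.38           C",
    "HETATM 1001  O   HOH A 201      10.000  10.000  10.000  1.00 20.00           O",
]
counts = count_pdb_records(sample)
```

To process a real file, the same function can be passed an open file object, since iterating over a file yields its lines.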

 

Similarity measurement

In the current system, we simply normalize the features and use the Euclidean distance to measure similarity.
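
A minimal sketch of this similarity measurement is given below. The page does not say how the features are normalized or how the weights enter the distance, so two assumptions are made here: each feature is rescaled to the range [0, 1] over the database, and each weight multiplies the squared difference of its feature. The function name rank_by_similarity and the toy feature values are hypothetical.

```python
import numpy as np


def rank_by_similarity(features, weights, query_index):
    """Rank database entries by weighted Euclidean distance to one entry.

    features: (n_proteins, n_features) array; weights: (n_features,) array.
    Returns the indices sorted from most to least similar, and the distances.
    """
    features = np.asarray(features, dtype=float)
    weights = np.asarray(weights, dtype=float)
    lowest = features.min(axis=0)
    span = features.max(axis=0) - lowest
    span[span == 0] = 1.0                        # constant features: avoid dividing by zero
    normalized = (features - lowest) / span      # assumption: min-max scaling to [0, 1]
    diff = normalized - normalized[query_index]
    distances = np.sqrt((weights * diff ** 2).sum(axis=1))
    return np.argsort(distances, kind="stable"), distances


# Toy database: 4 proteins, 3 features (values are made up for illustration).
toy_features = [[100, 5, 0.30],
                [110, 5, 0.30],
                [900, 20, 0.90],
                [105, 6, 0.35]]
toy_weights = [0.08, 0.053, 0.001]
order, distances = rank_by_similarity(toy_features, toy_weights, query_index=0)
```

The query itself always comes first in the ranking, at distance zero, and the remaining entries follow in order of increasing distance.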

 

Substructure Matching Algorithm

Folded into complex 3D structures, protein molecules carry out nearly all of the essential functions in living cells by binding to other molecules through chemical bonds that connect neighboring atoms. The locations of these atoms are called binding sites. Because identifying protein functions is essential to drug design and disease prediction, it is desirable to retrieve the binding sites that proteins have in common. We use the geometric hashing algorithm to identify similar binding sites among protein structures.

Flowchart of the substructure matching algorithm
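
To illustrate the idea behind geometric hashing, here is a simplified sketch on plain 3D point sets. It is not the system's implementation: the function names, the bin size and the toy coordinates are all hypothetical, and real binding sites would need noise handling and atom labels. Every ordered triple of non-collinear points defines a local coordinate frame. The remaining points are expressed in that frame, quantized, and stored in a hash table. A query is expressed in one of its own frames and votes for the stored (model, basis) pairs whose entries it hits. Because the local coordinates do not change under rotation and translation, a match is found regardless of pose.

```python
import itertools
from collections import defaultdict

import numpy as np


def local_frame(p0, p1, p2, eps=1e-9):
    """Orthonormal axes (as rows) defined by three non-collinear points."""
    u = p1 - p0
    n = np.cross(u, p2 - p0)
    if np.linalg.norm(n) < eps:              # collinear or coincident points
        return None
    e1 = u / np.linalg.norm(u)
    e3 = n / np.linalg.norm(n)
    return np.stack([e1, np.cross(e3, e1), e3])


def frame_keys(points, basis, bin_size):
    """Quantized coordinates of all non-basis points in the basis frame."""
    i, j, k = basis
    frame = local_frame(points[i], points[j], points[k])
    if frame is None:
        return []
    rest = [m for m in range(len(points)) if m not in basis]
    local = (points[rest] - points[i]) @ frame.T
    return [tuple(row) for row in np.round(local / bin_size).astype(int).tolist()]


def build_table(models, bin_size=1.0):
    """Hash every point of every model under every possible basis triple."""
    table = defaultdict(list)
    for name, pts in models.items():
        pts = np.asarray(pts, dtype=float)
        for basis in itertools.permutations(range(len(pts)), 3):
            for key in frame_keys(pts, basis, bin_size):
                table[key].append((name, basis))
    return table


def vote(table, query, basis, bin_size=1.0):
    """Count table hits per (model, basis) for one basis of the query."""
    query = np.asarray(query, dtype=float)
    votes = defaultdict(int)
    for key in frame_keys(query, basis, bin_size):
        for entry in table.get(key, ()):
            votes[entry] += 1
    return sorted(votes.items(), key=lambda item: -item[1])


# Toy models (made-up coordinates).
model_a = [(0, 0, 0), (4, 0, 0), (0, 3, 0), (1.2, 2.3, 3.1),
           (5.3, -1.2, 0.7), (-2.2, 4.3, -1.7), (2.8, 1.3, -3.2)]
model_b = [(0, 0, 0), (1, 0, 0), (0, 2, 0), (0.4, 0.3, 5.2), (7.1, 7.3, 7.4)]
table = build_table({"A": model_a, "B": model_b})

# Query: model A after an arbitrary rotation and translation.
cz, sz, cx, sx = np.cos(0.6), np.sin(0.6), np.cos(1.1), np.sin(1.1)
rot_z = np.array([[cz, -sz, 0.0], [sz, cz, 0.0], [0.0, 0.0, 1.0]])
rot_x = np.array([[1.0, 0.0, 0.0], [0.0, cx, -sx], [0.0, sx, cx]])
query = np.asarray(model_a) @ (rot_z @ rot_x).T + np.array([10.0, -5.0, 2.0])
results = vote(table, query, basis=(0, 1, 2))
```

In this toy run the moved copy of model A collects a vote from each of its four non-basis points for the matching basis of A. Storing every ordered triple makes the table grow with the cube of the number of points, so a practical system would restrict the bases, for example to atoms near a candidate binding site.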

 

The Enzyme Structure Classification Database contains 8390 PDB enzyme entries, covering 8059 separate PDB files. Proteins in the same entry share the same function. We chose 200 proteins from the Enzyme Database, 85 of which belong to set E.C. 5.2.1.8, which contains peptidylprolyl isomerase proteins with similar functions. The precision-recall graph shows that retrieval is much better when substructure matching is taken into account.

          Precision-recall graph in E.C. 5.2.1.8 for two queries: 1a7x and 1bck
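
For reference, the points of a precision-recall graph can be computed from a ranked retrieval list as sketched below. Precision is the fraction of the retrieved proteins that are relevant, and recall is the fraction of all relevant proteins that have been retrieved. The function name and the identifiers in the example are hypothetical. In the experiment above, the relevant set for a query would be the proteins in the same E.C. entry.

```python
def precision_recall(ranked_ids, relevant_ids):
    """Return (precision, recall) after each position of a ranked list."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("relevant_ids must not be empty")
    hits = 0
    curve = []
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant:
            hits += 1
        curve.append((hits / rank, hits / len(relevant)))
    return curve


# Hypothetical ranking of four proteins, two of which are relevant.
curve = precision_recall(["a", "b", "c", "d"], {"a", "c"})
```

Each pair in the returned list corresponds to one point on the graph, obtained by cutting the ranked list after that many results.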

 

  1. "Retrieval of 3D Protein Structure", S. C. Chen and T. Chen, to appear in ICIP 2002, Rochester, NY, USA, September 2002.
  2. "Protein Retrieval by Matching 3D Surfaces", S. C. Chen and T. Chen, to appear in GENSIPS 2002, Raleigh, NC, USA, October 2002.

 

Any suggestions or comments are welcome. Please send them to ShannChing Chen
