Advanced Multimedia Processing Lab -- Projects -- DSP for Biomolecular Structures

Project - DSP for Biomolecular Structures

Contents

Team member
Motivation
Contact

Team Member

ShannChing Chen

shanncc@andrew.cmu.edu

Motivation

A protein molecule is a long chain of amino acids that folds into a complex three-dimensional structure. It is often the geometric shape that determines a protein's function. In molecular biology, researchers use sequence alignment and structure matching to compare proteins. Treating proteins as 3D structures, we have developed algorithms that identify geometry-based features and retrieve similar proteins without having to deal with complex chemical characteristics and biological properties. Biologists suggest that in the next decade a large number of protein structures will be determined before their functions are known. Our work aims to help biologists identify protein functions, which is essential to drug design and disease prediction.

System Description

Global Matching Algorithm

The Protein Data Bank stores around 18,000 3D molecular models of proteins and nucleic acids, and the number grows every day. Our system uses 2,500 protein 3D structures collected from it. Before extracting features for matching, we use PCA (Principal Component Analysis) to find the best alignment of two proteins.

Protein models before and after rotation
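
The PCA alignment step can be sketched as follows. This is a minimal illustration, not the system's actual code: it assumes each protein is given as an N x 3 array of atom coordinates, and the function name pca_align and the optional per-atom weights are hypothetical.

```python
import numpy as np


def pca_align(coords, weights=None):
    """Rotate a point set so its principal axes line up with x, y, z.

    coords  : (N, 3) atom coordinates (assumed input format)
    weights : optional (N,) per-atom weights, e.g. atomic weights (assumption)
    """
    coords = np.asarray(coords, dtype=float)
    if weights is None:
        weights = np.ones(len(coords))
    weights = np.asarray(weights, dtype=float)

    # Move the (weighted) centroid to the origin.
    centroid = np.average(coords, axis=0, weights=weights)
    centered = coords - centroid

    # Weighted 3x3 covariance matrix of the coordinates.
    cov = (centered * weights[:, None]).T @ centered / weights.sum()

    # eigh returns eigenvalues in ascending order; reorder to descending
    # so the axis of largest spread becomes x.
    eigvals, eigvecs = np.linalg.eigh(cov)
    axes = eigvecs[:, np.argsort(eigvals)[::-1]]

    # Keep a proper rotation (determinant +1) rather than a reflection.
    if np.linalg.det(axes) < 0:
        axes[:, 2] = -axes[:, 2]

    return centered @ axes
```

Two proteins passed through the same function end up in a comparable orientation. The sign of each principal axis is still ambiguous, so a real system would need an extra convention to resolve flips.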

Feature extraction

We extract 31 features, each with its own weight.

We emphasize the geometric features, so most of the weight is placed on the 3D structure and secondary structure features.

3D Structure
Feature	Definition	Weight
Atom number	Number of ATOM records, excluding HETATM records	0.08
Render scale		0.08
Aspect ratio1		0.08
Aspect ratio2		0.08
Moment		0.32

Secondary Structure
Feature	Definition	Weight
HELIX	Number of HELIX records in the PDB file	0.053
SHEET	Number of SHEET records in the PDB file	0.053
TURN	Number of TURN records in the PDB file	0.053

Primary Structure
Feature	Definition	Weight
Residue ratio	Ratio of different residues in the protein	0.001
Hydrophobic residue ratio	Ratio of hydrophobic residues in the protein	0.001

(The formula for the moment feature was an image and is missing here. Its terms are the atomic weight of each atom and the coordinates (x, y, z) of the atom.)
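
The record-count features can be read directly from a PDB file, since the record name occupies the first six columns of each line. The sketch below assumes the file is available as a list of text lines; the function name count_pdb_records is hypothetical, and counting one SHEET per SHEET record line is an assumption about how the feature is defined.

```python
from collections import Counter


def count_pdb_records(lines):
    """Count ATOM, HELIX, SHEET and TURN records in PDB-format lines.

    HETATM lines are skipped automatically because their record name
    is "HETATM", not "ATOM".
    """
    wanted = ("ATOM", "HELIX", "SHEET", "TURN")
    counts = Counter()
    for line in lines:
        record = line[:6].strip()  # record name lives in columns 1-6
        if record in wanted:
            counts[record] += 1
    return {name: counts[name] for name in wanted}
```

For a file on disk, the lines can be supplied with `open(path).read().splitlines()`.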

Similarity measurement

In the current system, we simply normalize the features and use the Euclidean distance to measure similarity.
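
A minimal sketch of this measurement is shown below. The source does not say which normalization is used or how the feature weights enter the distance, so min-max scaling and a weighted sum of squared differences are both assumptions, as are the function names.

```python
import numpy as np


def normalize_features(features):
    """Min-max scale each feature column to [0, 1] (assumed normalization)."""
    features = np.asarray(features, dtype=float)
    lo = features.min(axis=0)
    hi = features.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid dividing by zero
    return (features - lo) / span


def rank_by_similarity(query_index, features, weights):
    """Rank all other proteins by weighted Euclidean distance to the query.

    features : (num_proteins, num_features) raw feature matrix
    weights  : (num_features,) per-feature weights
    Returns (ranked indices excluding the query, all distances).
    """
    normalized = normalize_features(features)
    diffs = normalized - normalized[query_index]
    distances = np.sqrt((np.asarray(weights, dtype=float) * diffs**2).sum(axis=1))
    order = np.argsort(distances, kind="stable")
    ranked = [int(i) for i in order if i != query_index]
    return ranked, distances
```

A smaller distance means a more similar protein, so the first index in the ranked list is the best match for the query.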

Substructure Matching Algorithm

Protein molecules fold into complex 3D structures and carry out nearly all of the essential functions in living cells by binding to other molecules, with chemical bonds connecting neighboring atoms. The locations of these atoms are called binding sites. To help biologists identify protein functions, which is essential to drug design and disease prediction, it is desirable to retrieve binding sites that proteins have in common. We use the geometric hashing algorithm to identify similar binding sites among protein structures.

Flowchart of the substructure matching algorithm
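
The idea behind geometric hashing can be illustrated with a small sketch. It is a generic 3D version of the technique, not the system described here: it assumes each binding site is a small array of atom coordinates, uses ordered atom triplets as reference frames, and all function names and the bin size are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import permutations

import numpy as np


def local_frame(p0, p1, p2):
    """Orthonormal frame from three points, or None if they are collinear."""
    x = p1 - p0
    z = np.cross(x, p2 - p0)
    nx, nz = np.linalg.norm(x), np.linalg.norm(z)
    if nx < 1e-9 or nz < 1e-9:
        return None
    x, z = x / nx, z / nz
    return np.stack([x, np.cross(z, x), z])  # rows are the x, y, z axes


def hash_keys(points, basis, bin_size):
    """Quantized coordinates of the non-basis points in the basis frame."""
    frame = local_frame(*(points[i] for i in basis))
    if frame is None:
        return []
    keys = []
    for idx, p in enumerate(points):
        if idx in basis:
            continue
        local = frame @ (p - points[basis[0]])
        keys.append(tuple(int(v) for v in np.floor(local / bin_size)))
    return keys


def build_table(models, bin_size=1.0):
    """Preprocessing: store every (model, basis) pair under its hash keys."""
    table = defaultdict(list)
    for name, pts in models.items():
        pts = np.asarray(pts, dtype=float)
        for basis in permutations(range(len(pts)), 3):
            for key in hash_keys(pts, basis, bin_size):
                table[key].append((name, basis))
    return table


def match_scores(table, query, bin_size=1.0):
    """Recognition: best vote count per model over all query bases."""
    query = np.asarray(query, dtype=float)
    best = {}
    for basis in permutations(range(len(query)), 3):
        votes = Counter()
        for key in hash_keys(query, basis, bin_size):
            for entry in table.get(key, ()):
                votes[entry] += 1
        for (name, _), count in votes.items():
            best[name] = max(best.get(name, 0), count)
    return best
```

Because the coordinates are expressed in a frame built from the atoms themselves, the keys do not change when a site is rotated or translated, so a site can be matched without aligning the whole protein first. Trying every ordered triplet is only practical for small point sets such as a single binding site.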

There are 8390 PDB enzyme entries in the Enzyme Structure Classification Database, including 8059 separate PDB files. Proteins in the same entry share the same function. We chose 200 proteins from the Enzyme Database, 85 of which are in set E.C. 5.2.1.8, which contains peptidylprolyl isomerase proteins with similar functions. The precision-recall graph shows that taking substructure matching into account improves retrieval considerably.

Precision-recall graph in E.C. 5.2.1.8 for two queries: 1a7x and 1bck
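
For reference, the points of a precision-recall graph can be computed from a ranked retrieval list as sketched below. The function name and the identifiers in the usage note are hypothetical and are not results from this project.

```python
def precision_recall(ranked_ids, relevant_ids):
    """Return one (precision, recall) pair for each cutoff in the ranking.

    ranked_ids   : retrieved protein IDs, best match first
    relevant_ids : IDs that share the query's function (must be non-empty)
    """
    relevant = set(relevant_ids)
    hits = 0
    points = []
    for rank, protein_id in enumerate(ranked_ids, start=1):
        if protein_id in relevant:
            hits += 1
        precision = hits / rank          # fraction of retrieved that are relevant
        recall = hits / len(relevant)    # fraction of relevant that are retrieved
        points.append((precision, recall))
    return points
```

For example, with the ranking ["a", "x", "b", "y"] and relevant set {"a", "b"}, precision is 1.0 after the first result and recall reaches 1.0 after the third.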

Publications

"Retrieval of 3D Protein Structure", S. C. Chen and T. Chen, to appear in ICIP 2002, Rochester, NY, USA, September 2002.
"Protein Retrieval by Matching 3D Surfaces", S. C. Chen and T. Chen, to appear in GENSIPS 2002, Raleigh, NC, USA, October 2002.

Contact

Any suggestions or comments are welcome. Please send them to ShannChing Chen.