|Structural comparison of the alpha and beta globin chains. Both were taken from human hemoglobin (PDBid 4HHB).|
So, after a little warm-up on the basics of protein structure, I would now like to discuss protein structure databases and file formats. A discussion of protein structure comparison and alignment will follow. In Part 1, I already talked briefly about the primary databases of biological macromolecule structures: wwPDB, which mainly focuses on protein structures, and NDB, which focuses on nucleic acid structures.
Now I would like to discuss one of the wwPDB member sites, RCSB PDB. More than 100,000 structures were available as of January 2016 (115,306 structures, to be precise). That sounds like a lot, but it is nothing compared to the sequence databases, which hold more than a hundred million sequences. Okay, back to the PDB. Just like its sequence database counterparts, the PDB lets us download structural data. Each protein structure is identified by a unique identifier called a PDBid. A PDBid consists of four alphanumeric characters and is permanent and immutable, meaning that as long as the structure exists, it keeps that specific PDBid. For example, human hemoglobin has the PDBid 4HHB. Type “4HHB” in the PDB search box and you will be directed to the page for the hemoglobin structure. The structure page shows several features of the protein, such as a structure summary, 3D view, annotations, sequence, sequence and structure similarity, and related literature. We can also download a structure file from this page.
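To make the PDBid rule concrete, here is a minimal Python sketch that checks whether a string looks like a PDBid (four alphanumeric characters, as described above) and builds a download link for it. The function names are hypothetical, and the URL pattern is an assumption based on the RCSB file download service at the time of writing, so check the RCSB site before relying on it.

```python
import re

# Assumption: a PDBid is exactly four alphanumeric characters, as described in the text.
PDB_ID_PATTERN = re.compile(r"[0-9A-Za-z]{4}")


def normalize_pdb_id(text):
    """Return the PDBid in upper case, or raise ValueError if it is malformed."""
    candidate = text.strip().upper()
    if not PDB_ID_PATTERN.fullmatch(candidate):
        raise ValueError(f"not a four-character PDBid: {text!r}")
    return candidate


def pdb_download_url(pdb_id, extension="pdb"):
    """Build a download link for a structure file (hypothetical helper).

    Assumption: files are served as https://files.rcsb.org/download/<ID>.<ext>.
    """
    return f"https://files.rcsb.org/download/{normalize_pdb_id(pdb_id)}.{extension}"


print(pdb_download_url("4hhb"))  # prints https://files.rcsb.org/download/4HHB.pdb
```

Passing `extension="cif"` would build the link for the mmCIF version of the same entry under the same assumption.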
Structure files come in three formats: PDB, mmCIF, and XML. All three contain information about the relative position of every atom of the protein in 3D space. The difference among them is how easily the information can be parsed for further computational analysis. The PDB format is the earliest format developed to accommodate structural data. It is written in a table-like layout that lists, for each atom, its number, the amino acid residue and protein subunit it belongs to, and its position as xyz coordinates. This format is relatively easy for a human to read and understand, but it is less convenient for a computer to parse, since it is a plain text file with a fixed-column layout. The other two formats, mmCIF and XML, are organized more like a relational database, so they are easier for a computer to parse and allow more analyses, such as grouping and sorting by residue.
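To get a feel for the table-like layout of a PDB file, here is a minimal Python sketch that reads a single ATOM record by column position. The column ranges follow the fixed-width ATOM record layout of the PDB format as I understand it. The sample line is only an illustrative record assembled field by field, and the function name is hypothetical.

```python
# An illustrative ATOM record, assembled field by field so the column widths are visible.
SAMPLE_LINE = (
    "ATOM  "      # columns 1-6: record name
    "    1"       # columns 7-11: atom serial number
    " "           # column 12: blank
    " N  "        # columns 13-16: atom name
    " "           # column 17: alternate location indicator
    "VAL"         # columns 18-20: residue name
    " "           # column 21: blank
    "A"           # column 22: chain (subunit) identifier
    "   1"        # columns 23-26: residue sequence number
    "    "        # columns 27-30: insertion code plus blanks
    "   6.204"    # columns 31-38: x coordinate
    "  16.869"    # columns 39-46: y coordinate
    "   4.854"    # columns 47-54: z coordinate
    "  1.00"      # columns 55-60: occupancy
    " 49.05"      # columns 61-66: temperature factor
    "          "  # columns 67-76: blanks
    " N"          # columns 77-78: element symbol
)


def parse_atom_record(line):
    """Slice one ATOM line into named fields (Python slices are 0-based)."""
    return {
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "element": line[76:78].strip(),
    }


atom = parse_atom_record(SAMPLE_LINE)
print(atom["res_name"], atom["chain"], atom["res_seq"], atom["name"])
print(atom["x"], atom["y"], atom["z"])
```

The parser works only because every field sits at a fixed column, which is exactly why the format is fragile: one shifted character breaks every value after it. For real work, an established parser (for example the one in Biopython) is a safer choice than hand-written slicing.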
Having seen the PDB file, does it give you an image of what the protein looks like? Of course, it’s quite hard to imagine the overall protein structure just by plotting all the atoms based on their coordinates. This is why in earlier times (around the 1970s), protein structures were built physically using molecular models (just like in chemistry class) in order to visualize them. But now there are various sophisticated molecular visualization programs that let us view the protein in various styles, from wireframe, ball and stick, and spacefill to the ribbon style. Some of these programs, such as Cn3D, RasMol, Jmol, YASARA View, and UCSF Chimera, are free to download. Most of these programs can read PDB files, and several also accept the mmCIF or XML formats.
One of the most common analyses of protein structure is to check how well one structure fits onto another. This kind of analysis lets us see the structural variation among similar or homologous structures. The difference between structures is measured as the distance between equipositional atoms in the compared structures once they have been superimposed. This measure is called the root-mean-square deviation (or root-mean-square distance), RMSD for short. Commonly, only the RMSD between equipositional alpha-carbons is calculated (see the figure above). Just like visualization programs, there are several free-to-download or web-based protein comparison programs, such as CE, UCSF Chimera, Expresso, FATCAT, MAMMOTH, and VAST+. Besides structural comparison, these programs can also perform structural alignment, in which the amino acid residues of two proteins are aligned to each other based on their positions in both structures. Structural alignment is considered more reliable than ordinary sequence alignment because structure is more conserved than sequence: even a tiny error might ruin the overall structure, so the structure tolerates less change than the sequence does.
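As a rough illustration of the RMSD measure, the value is the square root of the mean squared distance between paired atoms: RMSD = sqrt((1/N) * sum of d_i^2), where d_i is the distance between the i-th pair and N is the number of pairs. The Python sketch below assumes the two sets of alpha-carbon coordinates are already paired and superimposed. Finding the best superposition is a separate step that the comparison programs handle and that this sketch does not attempt. The coordinates are made up for the example.

```python
import math


def rmsd(coords_a, coords_b):
    """RMSD between two equally long lists of (x, y, z) tuples.

    Assumption: atom i in coords_a corresponds to atom i in coords_b,
    and the two structures have already been superimposed.
    """
    if len(coords_a) != len(coords_b):
        raise ValueError("both structures need the same number of atoms")
    if not coords_a:
        raise ValueError("at least one atom pair is required")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        # Squared distance between the two equipositional atoms.
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Made-up alpha-carbon positions: chain_b is chain_a shifted by 1 unit along z.
chain_a = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
chain_b = [(0.0, 0.0, 1.0), (3.8, 0.0, 1.0), (7.6, 0.0, 1.0)]

print(rmsd(chain_a, chain_b))  # prints 1.0
```

An RMSD of 0 means the paired atoms coincide exactly, and larger values mean the structures differ more.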