Wednesday, 20 January 2016

Structural Bioinformatics - Part 2: Structural File Formats and Structure Comparison

Structural comparison of the alpha and beta globin chains. Both were taken from human hemoglobin (PDB ID 4HHB).

So, after a little warm-up on the basics of protein structure, I would now like to discuss protein structure databases and file formats, followed by protein structure comparison and alignment. In Part 1, I briefly introduced the primary databases of biological macromolecule structures: wwPDB, whose entries are mostly protein structures, and NDB, which focuses on nucleic acid structures.
Now I would like to look at one of the wwPDB member sites, RCSB PDB. As of January 2016, more than 100,000 structures have been deposited (115,306, to be precise). That is a lot, but it is tiny compared to the sequence databases, which hold more than a hundred million sequences. Okay, back to the PDB. Just as with its sequence database counterparts, we can download structural data from the PDB. Each structure is identified by a unique identifier called a PDB ID, which consists of four alphanumeric characters and is permanent and immutable: as long as the structure exists, it keeps that specific PDB ID. For example, human hemoglobin has the PDB ID 4HHB. Type “4HHB” into the PDB search box and you will be directed to the page for the hemoglobin structure. The structure page presents several features of the protein, such as a structure summary, 3D view, annotations, sequence, sequence/structure similarity, and related literature. We can also download the structural file from this page.
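As a small illustration of the identifier rule described above, here is a minimal Python sketch that checks whether a string looks like a PDB ID and builds a download link for it. The function names are made up for this example, and the URL pattern (files.rcsb.org/download/...) is my assumption about how RCSB serves files, not something taken from the structure page itself, so please verify it before relying on it.

```python
import re

# Four alphanumeric characters, as described in the text above.
PDB_ID_PATTERN = re.compile(r"[0-9A-Za-z]{4}")

# Assumed URL layout for RCSB file downloads (verify before relying on it).
DOWNLOAD_TEMPLATE = "https://files.rcsb.org/download/{pdb_id}.{extension}"

# File extensions assumed for two of the structural file formats.
EXTENSIONS = {"pdb": "pdb", "mmcif": "cif"}


def normalize_pdb_id(text):
    """Return the PDB ID in upper case, or raise ValueError if malformed."""
    candidate = text.strip()
    if PDB_ID_PATTERN.fullmatch(candidate) is None:
        raise ValueError(f"not a four-character PDB ID: {text!r}")
    return candidate.upper()


def download_url(pdb_id, file_format="pdb"):
    """Build a download URL (hypothetical helper) for the given PDB ID."""
    extension = EXTENSIONS[file_format]
    return DOWNLOAD_TEMPLATE.format(
        pdb_id=normalize_pdb_id(pdb_id), extension=extension
    )


print(normalize_pdb_id(" 4hhb "))
print(download_url("4HHB", "mmcif"))
```

The check only covers the shape of the identifier. Whether an entry with that ID actually exists can only be answered by the database itself.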
Structural files come in three formats: PDB, mmCIF, and XML. All three contain the position of every atom of the protein in 3D space. They differ in how easily a program can parse the information for further computational analysis. The PDB format is the earliest format developed to hold structural data. It is a table-like, fixed-width text file that lists, for each atom, its serial number, atom name, amino acid residue, chain (protein subunit), and x, y, z coordinates. This format is relatively easy for a person to read and understand, but it is harder to parse robustly, because the meaning of each field depends on its column position rather than on an explicit label. The other two formats, mmCIF and XML, label every data item and organize the items into categories, much like tables in a relational database, so they are easier for a computer to parse and support analyses such as grouping and sorting by residue.
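To make the table-like layout of a PDB file concrete, here is a minimal Python sketch that slices a single ATOM line by column position. The column ranges follow the published PDB format specification as I understand it. The function name and the sample line are illustrative (the values are written in the style of a hemoglobin record and should not be treated as authoritative data), and real files contain many other record types that this sketch ignores.

```python
def parse_atom_line(line):
    """Split one fixed-width ATOM/HETATM line into a dictionary of fields."""
    record = line[0:6].strip()
    if record not in ("ATOM", "HETATM"):
        raise ValueError("not an ATOM or HETATM record")
    return {
        "serial": int(line[6:11]),           # columns 7-11
        "atom_name": line[12:16].strip(),    # columns 13-16
        "residue_name": line[17:20].strip(), # columns 18-20
        "chain_id": line[21],                # column 22
        "residue_number": int(line[22:26]),  # columns 23-26
        "x": float(line[30:38]),             # columns 31-38
        "y": float(line[38:46]),             # columns 39-46
        "z": float(line[46:54]),             # columns 47-54
        "element": line[76:78].strip(),      # columns 77-78
    }


# Illustrative sample line, assembled piece by piece so the columns are visible.
SAMPLE_LINE = (
    "ATOM      1  N   VAL A   1    "  # columns 1-30: record, serial, names, chain
    "   6.204  16.869   4.854"        # columns 31-54: x, y, z
    "  1.00 49.05"                    # columns 55-66: occupancy, B-factor
    + " " * 10                        # columns 67-76: unused here
    + " N"                            # columns 77-78: element symbol
)

atom = parse_atom_line(SAMPLE_LINE)
print(atom["residue_name"], atom["chain_id"], atom["x"], atom["y"], atom["z"])
```

Because every field is found by position alone, a single shifted character breaks the parse, which is exactly why the labeled formats are easier for programs to handle.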
Having seen a PDB file, does it give you an image of what the protein looks like? Of course, it is quite hard to imagine the overall protein structure just from a list of atomic coordinates. This is why, in earlier times (around the 1970s), protein structures were built as physical molecular models (just like in chemistry class) in order to visualize them. Today there are various sophisticated molecular visualization programs that display a protein in different styles, from wireframe and ball-and-stick to spacefill and ribbon. Some of these programs, such as Cn3D, RasMol, Jmol, YASARA View, and UCSF Chimera, are free to download. Which of the three file formats each program can read varies, so check the documentation of the program you choose.
One common analysis of protein structure is to measure how well one structure fits onto another. This kind of analysis lets us see the structural variation among similar or homologous structures. The comparison is measured as the distance between equivalent (equipositional) atoms in the superimposed structures. This measure is called the root-mean-square deviation (also written as root-mean-square distance), or RMSD for short. Commonly, the RMSD is calculated only between equivalent alpha-carbon atoms (see the figure above). As with visualization programs, there are several free-to-download or web-based programs for protein structure comparison, such as CE, UCSF Chimera, Expresso, FATCAT, MAMMOTH, and VAST+. Besides structural comparison, these programs can also perform structural alignment, in which the amino acid residues of two proteins are aligned to each other based on their positions in the two structures. Structural alignment is often considered more reliable than ordinary sequence alignment, because structure is more conserved than sequence: a protein can tolerate many sequence changes as long as its fold is preserved, whereas a change that disrupts the fold is likely to ruin the whole structure.
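The RMSD itself is simple arithmetic: square the distance between each pair of equivalent atoms, average the squares, and take the square root. Here is a minimal Python sketch of that calculation. It assumes the two coordinate lists are already superimposed and already paired atom by atom (for example, one alpha-carbon per residue); the programs listed above also search for the best superposition and pairing, which this sketch does not attempt. The function name is my own.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally long lists of (x, y, z)."""
    if len(coords_a) != len(coords_b):
        raise ValueError("both structures need the same number of atoms")
    if not coords_a:
        raise ValueError("at least one atom pair is required")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        # Squared distance between one pair of equivalent atoms.
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Two toy "structures": the second is the first shifted by 1 along z.
structure_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
structure_b = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(rmsd(structure_a, structure_b))  # every pair is 1 apart, so the RMSD is 1.0
```

An RMSD of 0 means the paired atoms coincide exactly, and larger values mean the structures deviate more.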


Victor

Thursday, 14 January 2016

Structural Bioinformatics - Part 1: The Basics, Protein Structure Determination, and Databases

Structure of Hemoglobin


This is still part of the “Sebuah Tulisan Bioinformatika” (“A Bioinformatics Article”) series, which was previously written in Bahasa Indonesia. Now I would like to push my luck a little further by writing in English, starting with this article. Okay, this time I’d like to write about the basics of Structural Bioinformatics. Hopefully you will enjoy my story :)
The story starts with a definition. Structural bioinformatics is the branch of bioinformatics that deals with the structures of biological macromolecules: DNA, RNA, and protein. Today, protein structures dominate the structural databases compared to DNA and RNA. One likely reason is historical: structure determination began with proteins solved by X-ray crystallography, and the Protein Data Bank was set up in the early 1970s to archive them. Therefore, I will focus mostly on protein structural bioinformatics in this story.
As you might already know, protein structure is described at four hierarchical levels, from primary to quaternary structure. The primary structure represents the protein as a string (sequence) of the amino acids that compose it. So if you see the sequence WHYGARTFED, for example, that is a primary structure composed of tryptophan, histidine, tyrosine, glycine, alanine, arginine, and so on. For the full list of amino acid one-letter symbols, a quick Google search will do. The secondary structure consists of local structures formed by interactions between nearby amino acids through hydrogen bonds. Commonly there are eight types of secondary structure, as defined in the Dictionary of Protein Secondary Structure (DSSP) by Kabsch and Sander in 1983. They are the 3-10 helix (G), alpha helix (H), pi helix (I), beta bridge (B), extended beta strand (E), turn (T), bend (S), and loop (C). To reduce the complexity, these secondary structures are often grouped into three larger classes: helices (G, H, and I), strands (B and E), and loops (T, S, and C). The tertiary structure is the overall fold of a single polypeptide chain. It is formed by interactions between residues that are far apart in the sequence, including van der Waals and hydrophobic interactions. Sometimes the fold is also strengthened by covalent bonds in the form of disulfide bridges between two cysteines. The protein can function properly only in its native tertiary structure. Several large protein complexes, however, need a higher-order structure in order to function, which we call the quaternary structure. This structure is an assembly of several tertiary structure subunits. Hemoglobin, a classic example, is a protein with a quaternary structure. It is composed of two alpha globin and two beta globin subunits.
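The two lookups in this paragraph, from one-letter symbols to amino acid names and from the eight secondary structure codes to three broader classes, are easy to express as tables. Here is a minimal Python sketch. The function names and the output letters H, E, and L for helix, strand, and loop are my own choices for illustration, and the grouping simply follows the one given in the text.

```python
# One-letter symbols for the 20 standard amino acids.
AMINO_ACIDS = {
    "A": "alanine", "R": "arginine", "N": "asparagine", "D": "aspartate",
    "C": "cysteine", "Q": "glutamine", "E": "glutamate", "G": "glycine",
    "H": "histidine", "I": "isoleucine", "L": "leucine", "K": "lysine",
    "M": "methionine", "F": "phenylalanine", "P": "proline", "S": "serine",
    "T": "threonine", "W": "tryptophan", "Y": "tyrosine", "V": "valine",
}

# Eight secondary structure codes reduced to three classes, as grouped in the text:
# helix (G, H, I) -> "H", strand (B, E) -> "E", loop (T, S, C) -> "L".
THREE_STATE = {
    "G": "H", "H": "H", "I": "H",
    "B": "E", "E": "E",
    "T": "L", "S": "L", "C": "L",
}


def spell_out(sequence):
    """Translate a primary structure string into a list of amino acid names."""
    return [AMINO_ACIDS[symbol] for symbol in sequence.upper()]


def reduce_to_three_states(codes):
    """Collapse a string of eight-state codes into helix/strand/loop letters."""
    return "".join(THREE_STATE[code] for code in codes.upper())


print(spell_out("WHYGARTFED"))
print(reduce_to_three_states("CCHHHHGGGTTEEEESCC"))
```

Both functions raise a KeyError on an unknown symbol, which is a deliberately strict choice: silently skipping a character would shift every residue after it.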
To date, three main methods are used to determine protein structure. Ordered from the earliest to the most recent, they are X-ray crystallography, nuclear magnetic resonance (NMR) spectroscopy, and electron microscopy (EM). Each of these methods has its advantages and drawbacks. X-ray crystallography directs an X-ray beam at a crystallized protein. The pattern produced when the X-rays are diffracted by the electrons in the crystal is converted into an electron-density map, which is then used to build a model structure of the protein. The use of protein crystals and the electron-density map allows a model structure to be built at high resolution. However, the crystallization step is a frequent obstacle, since not all proteins can be crystallized. NMR avoids the crystallization problem simply because it does not require that step. In NMR spectroscopy, a purified protein solution is placed in a very strong magnetic field and radio-frequency pulses are applied to the molecules. The resulting resonance signals are then analyzed to determine which atomic nuclei are close to each other. The model structure is then built from the positions of these atomic nuclei relative to one another. NMR spectroscopy gives structures of intermediate resolution compared to X-ray crystallography, but because it does not depend on crystallization, it is used to model proteins that are hard to crystallize, such as transmembrane proteins. EM is the most recently developed method for determining protein structure. In this method, electron beams are projected onto the protein complex, and images taken from many angles are combined into a 3D image, similar to how cell structures are visualized. EM is able to model large or even huge protein complexes that the other two methods cannot handle. In terms of resolution, however, EM-generated model structures generally have the lowest resolution of the three methods.
After generating a model, what comes next? Well, in this digital era, where uploading images is a prominent part of what it means to “exist”, something similar happens with protein structures. The Worldwide Protein Data Bank (wwPDB) is the primary database where researchers all over the world deposit their model structures. It is maintained through three member databases located in three different countries:
1. Research Collaboratory for Structural Bioinformatics Protein Data Bank (RCSB PDB) in the USA. URL: http://www.rcsb.org/pdb/home/home.do
2. Protein Data Bank in Europe (PDBe) in the UK. URL: http://www.ebi.ac.uk/pdbe/node/1
3. Protein Data Bank Japan (PDBj) in Japan. URL: http://pdbj.org/
These three databases also accept other biological macromolecular structures, such as DNA and RNA. But as the data grew, a new database was built specifically to accommodate DNA and RNA structures. This database is called the Nucleic Acid Database (NDB), and you can access it at http://ndbserver.rutgers.edu/. Together, these databases help researchers all around the world deposit and exchange structural data, so that they can take their research one step further. Well, I think that’s enough for the first part of the story. In the next part, I will tell more about the structural databases as well as the file formats.

Victor