Quality assessment for molecular models

From Proteopedia



Crystallographic Models

About 88% of the molecular models published in the Protein Data Bank come from X-ray crystallography experiments. These crystallographic models vary widely in quality, and in rare cases they are grossly incorrect[1][2] or fraudulent (see Retractions and Fraud). Generally, model quality is indicated by the resolution of the model, the R value, and especially the Free R. Useful information on model quality, including Ramachandran plots, can be obtained from PDBReports[3]. All-atom contact analysis[4] is a powerful newer method for finding and correcting errors in crystallographic models, made easy and convenient by the MolProbity Server[5].

Generally, crystallographic models are reliable in most details when they have resolutions of 2.0 Å or better (the lower the number, the better), R values of 0.20 or less, and Free R values of 0.25 or less. However, new and important structural insights are often provided by models with much poorer resolution. Interestingly, the quality of published molecular models is inversely related to the impact of the journals in which they are published[6].
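
The rule of thumb above can be written as a small check. The following is a minimal sketch, not part of any validation tool: the names ModelStats and failed_criteria are hypothetical, and the three cutoffs are simply the values quoted in the paragraph above, which are guidelines rather than hard limits.

```python
from dataclasses import dataclass

# Rule-of-thumb cutoffs quoted in the text above (guidelines, not hard limits).
MAX_RESOLUTION = 2.0  # in Å; lower numbers are better
MAX_R_VALUE = 0.20
MAX_FREE_R = 0.25


@dataclass
class ModelStats:
    """Global statistics reported for a crystallographic model (hypothetical container)."""
    resolution: float
    r_value: float
    free_r: float


def failed_criteria(stats: ModelStats) -> list[str]:
    """Return the names of the rule-of-thumb criteria that the model does not meet."""
    failed = []
    if stats.resolution > MAX_RESOLUTION:
        failed.append("resolution")
    if stats.r_value > MAX_R_VALUE:
        failed.append("r_value")
    if stats.free_r > MAX_FREE_R:
        failed.append("free_r")
    return failed
```

An empty list means that all three criteria are met. A non-empty list does not mean the model is useless, since lower-resolution models can still provide important insights; it only signals that details should be interpreted with more caution.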

Validation By The PDB

Detailed validation reports are available from the PDB for all entries, including those deposited before the validation process was implemented by the PDB. An example is the Full Validation Report for 6ZGG (SARS-CoV-2 spike protein in an open conformation). To access such validation reports at RCSB, go to the page about the structure (for example 6ZGG), and scroll down to the section Experimental Data & Validation.


In 2011, the Validation Task Force of the worldwide Protein Data Bank recommended that state-of-the-art crystallographic validation tools be used to generate succinct reports, understandable to non-experts, at the time a PDB code is assigned, and made available to the authors, reviewers, and users of the model[2]. Their report[2] discusses these validation tools in some detail, including:

  • Geometric and conformational validation criteria
    • Bond lengths, angles, and planes.
    • Protein backbone conformation (Ramachandran plot).
    • Protein sidechain conformations (rotamers).
The above analyses are available from MolProbity, or from the WHAT_IF server where you can upload PDB files, or, for published PDB files, from the PDBREPORT database.
  • All-atom contacts (clash score, Asn/Gln/His flips), analyses available from MolProbity.
  • Underpacking (holes in the core). Analysis available from RosettaHoles2[7] (not available as a server).
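
As an illustration of the backbone conformation criterion listed above, a Ramachandran plot is built from two dihedral angles per residue: phi, defined by the atoms C(i-1), N(i), CA(i), C(i), and psi, defined by N(i), CA(i), C(i), N(i+1). The following is a minimal sketch of how such angles can be computed from atomic coordinates. The function names are hypothetical, and this is not the code used by MolProbity or WHAT_IF; those servers additionally compare the angles with reference distributions, which is not shown here.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees defined by four points (x, y, z)."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = (b1[0] / norm, b1[1] / norm, b1[2] / norm)
    # Project b0 and b2 onto the plane perpendicular to the central bond b1.
    d0 = _dot(b0, b1)
    d2 = _dot(b2, b1)
    v = (b0[0] - d0 * b1[0], b0[1] - d0 * b1[1], b0[2] - d0 * b1[2])
    w = (b2[0] - d2 * b1[0], b2[1] - d2 * b1[1], b2[2] - d2 * b1[2])
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))


def phi(c_prev, n, ca, c):
    """Backbone phi: C(i-1), N(i), CA(i), C(i)."""
    return dihedral(c_prev, n, ca, c)


def psi(n, ca, c, n_next):
    """Backbone psi: N(i), CA(i), C(i), N(i+1)."""
    return dihedral(n, ca, c, n_next)
```

Plotting psi against phi for every residue gives the Ramachandran plot; residues falling in sparsely populated regions of that plot are the ones flagged by validation tools.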

Validation of SARS-CoV-2 Structures

In 2021, Grabowski et al. analyzed ~1,000 recently deposited models of SARS-CoV-2 proteins, comparing the models with the deposited X-ray diffraction data (structure factors)[8]. They emphasized the importance of checking models against the experimental data, rather than relying on the metadata in the PDB file header section:

"Metadata that are only contained in the PDB itself can be unreliable because they are supplied by the researcher who made the deposition. Inexperience or haste may lead to information being submitted to the wrong field, to inappropriate values being entered or to data items being skipped. First-time depositors make as many as 20% of all PDB depositions (assuming that the first author of a structure is responsible for the deposition); therefore, mistakes are not uncommon."

They found minor to moderate quality issues in about 100 structures, and serious issues in nine, two of which are presented in case studies. They provided a database of "validated SARS-CoV-2 related structural models of potential drug targets" at covid19.bioreproducibility.org, which includes diagnostic tools[9].

NMR Models

Models resulting from solution NMR experiments account for about 15% of those published in the Protein Data Bank. These are generally less reliable than crystallographic models because the method yields less detailed information. For NMR, there are no widely reported global error estimates equivalent to the crystallographic R value and Free R. Consequently, unlike with crystallographic results, it is not possible to distinguish reliable from unreliable NMR models using information included in the PDB files. NMR models are more likely to contain major errors[10] than are crystallographic models that have good resolution and Free R values. For example, in 2012 an X-ray crystallographic structure of integral membrane diacylglycerol kinase, 3ze4, revealed functionally important domain swapping[11][12] that was not present in an earlier NMR structure, 2kdc[13]. At least one rapid approach[14] has been introduced to avoid misassignments. In 2020, Fowler et al. proposed a "useful addition to existing measures of accuracy" in "A method for validating the accuracy of NMR protein structures"[15]. The software repository for that method has not been updated since 2021.

Global vs. Local Quality

The indicators discussed above, notably resolution, R value, and Free R, assess the average or global quality of the model. However, quality and uncertainty are not uniformly distributed throughout the model. Rather, there are regions of higher and lower uncertainty and quality. For crystallographic models, the easiest way to visualize local variations in uncertainty is to color the model by temperature value. As explained in the article on Temperature value, in a temperature-colored model, red atoms have the highest uncertainty in their positions in the model.
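
The same information can be read directly from a PDB-format file, where the temperature value of each atom occupies columns 61-66 of its ATOM or HETATM record. The following is a minimal sketch, with hypothetical function names, that averages the temperature values per residue and lists the residues with the highest averages. It assumes standard fixed-column PDB records and a crystallographic model; it ignores alternate locations and multiple models.

```python
from collections import defaultdict


def mean_b_by_residue(pdb_lines):
    """Return {(chain, residue number, residue name): mean temperature value}."""
    sums = defaultdict(lambda: [0.0, 0])
    for line in pdb_lines:
        if not line.startswith(("ATOM", "HETATM")):
            continue
        # Fixed columns of the PDB format (1-based): residue name 18-20,
        # chain ID 22, residue number 23-26, temperature value 61-66.
        key = (line[21], int(line[22:26]), line[17:20].strip())
        entry = sums[key]
        entry[0] += float(line[60:66])
        entry[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}


def least_certain(pdb_lines, n=5):
    """Return the n residues with the highest mean temperature value."""
    means = mean_b_by_residue(pdb_lines)
    return sorted(means, key=means.get, reverse=True)[:n]
```

Temperature values are best compared within a single model rather than between models, because their overall scale depends on resolution and on how the model was refined.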

For models determined by NMR, disagreement among the ensemble of models in a particular region may signal higher uncertainty, due to local inadequacy of the distance restraints. However, it could also signal thermal motion; please see the section "Meaning of the Variation Between Models" in NMR Ensembles of Models.
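
Disagreement among the models of an ensemble can be quantified per atom. The following is a minimal sketch, with a hypothetical function name, that computes for each atom the root-mean-square distance from its mean position across the models. It assumes that all models list the same atoms in the same order and that the models are already superimposed in a common frame; no superposition is performed here.

```python
import math


def per_atom_spread(models):
    """RMS distance of each atom from its mean position across an ensemble.

    models: list of models, each a list of (x, y, z) tuples in the same atom order.
    """
    n_models = len(models)
    n_atoms = len(models[0])
    if any(len(model) != n_atoms for model in models):
        raise ValueError("all models must contain the same number of atoms")
    spread = []
    for i in range(n_atoms):
        # Mean position of atom i over all models.
        mean = [sum(model[i][k] for model in models) / n_models for k in range(3)]
        # Mean squared distance of atom i from that mean position.
        mean_sq = sum(
            sum((model[i][k] - mean[k]) ** 2 for k in range(3))
            for model in models
        ) / n_models
        spread.append(math.sqrt(mean_sq))
    return spread
```

Large values mark regions where the models disagree. As noted above, a large spread alone cannot distinguish between inadequate restraints and genuine motion.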

The MolProbity server[5] offers 3D visualization of atomic clashes, with indication of the severity of each clash. The presence of severe clashes indicates greater uncertainty in that local region of the model. MolProbity's analysis, termed all-atom contact analysis, can be performed on NMR models (individual models in the ensemble, or the minimized average model) as well as on crystallographic models.

The orientation of the sidechains of Asn, Gln, and His cannot be determined from the electron density in a crystallographic experiment at typical resolution, because of the similarity in electron densities of carbon vs. nitrogen. It is usually straightforward to determine the correct orientation by examining the local environment and optimising hydrogen bonding. Unfortunately, it is common for these determinations not to be made in published crystallographic models. Fortunately, MolProbity makes these determinations automatically, and corrects the model by flipping the sidechains of Asn, Gln, and His when this is warranted.

Local Quality Scores of Protein Models in Cryo-EM Maps

The DAQ-Score Database provides pre-computed residue-wise local quality scores for structure models in the PDB that were derived from cryo-EM maps[16].

Improving Published Models

There are several free automated servers that can improve most published models. See Improving published models.

Further Reading

Wlodawer et al. (2008) explain how non-crystallographers can judge model quality[17]. Laskowski[18] has provided an outstandingly clear and succinct overview of how to assess model quality. See also the 2007 overview by Kleywegt[19]. For examples of published crystallographic errors, see Laskowski, Kleywegt (2000)[20], and Kleywegt and Brünger (1996)[21]. Kleywegt has also provided an excellent on-line tutorial on model validation[22].

See also the publications cited at Retractions and Fraud, where you will find links to sites where you can search for retractions or expressions of concern.

See Also

Content Donors

Portions of this page were adapted from the Glossary of ProteinExplorer.Org, with the permission of the principal author, Eric Martz.

References and Websites

  1. Miller G. Scientific publishing. A scientist's nightmare: software problem leads to five retractions. Science. 2006 Dec 22;314(5807):1856-7. PMID:17185570 doi:10.1126/science.314.5807.1856
  2. Read RJ, Adams PD, Arendall WB 3rd, Brunger AT, Emsley P, Joosten RP, Kleywegt GJ, Krissinel EB, Lutteke T, Otwinowski Z, Perrakis A, Richardson JS, Sheffler WH, Smith JL, Tickle IJ, Vriend G, Zwart PH. A new generation of crystallographic validation tools for the protein data bank. Structure. 2011 Oct 12;19(10):1395-412. PMID:22000512 doi:10.1016/j.str.2011.08.006
  3. PDBREPORT Database
  4. Richardson, Jane S. (2003). All-atom contacts: a new approach to structure validation. Precis. Chapter 15 in Structural Bioinformatics (2003) edited by Philip E. Bourne and Helge Weissig, Wiley-Liss, 649 pages. Complete contents at structuralbioinformaticsbook.com.
  5. MolProbity Server: All-atom contact analysis, flip corrections for Asn, Gln, His, clash analysis, Ramachandran analysis, and more.
  6. Brown EN, Ramaswamy S. 2007. Quality of protein crystal structures. Acta Crystallogr. D Biol. Crystallogr. 63:941-950.
  7. Sheffler W, Baker D. RosettaHoles2: a volumetric packing measure for protein structure refinement and validation. Protein Sci. 2010 Oct;19(10):1991-5. PMID:20665689 doi:10.1002/pro.458
  8. Grabowski M, Macnar JM, Cymborowski M, Cooper DR, Shabalin IG, Gilski M, Brzezinski D, Kowiel M, Dauter Z, Rupp B, Wlodawer A, Jaskolski M, Minor W. Rapid response to emerging biomedical challenges and threats. IUCrJ. 2021 Mar 26;8(Pt 3):395-407. PMID:33953926 doi:10.1107/S2052252521003018
  9. Brzezinski D, Kowiel M, Cooper DR, Cymborowski M, Grabowski M, Wlodawer A, Dauter Z, Shabalin IG, Gilski M, Rupp B, Jaskolski M, Minor W. Covid-19.bioreproducibility.org: A web resource for SARS-CoV-2-related structural models. Protein Sci. 2021 Jan;30(1):115-124. PMID:32981130 doi:10.1002/pro.3959
  10. Nabuurs SB, Spronk CAEM, Vuister GW, Vriend G. Traditional biomolecular structure determination by NMR spectroscopy allows for major errors. PLoS Computational Biology. 2006;2. doi:10.1371/journal.pcbi.0020009
  11. Zheng J, Jia Z. Structural biology: tiny enzyme uses context to succeed. Nature. 2013 May 23;497(7450):445-6. PMID:23676672 doi:10.1038/nature12245
  12. Li D, Lyons JA, Pye VE, Vogeley L, Aragao D, Kenyon CP, Shah ST, Doherty C, Aherne M, Caffrey M. Crystal structure of the integral membrane diacylglycerol kinase. Nature. 2013 May 23;497(7450):521-4. PMID:23676677 doi:10.1038/nature12179
  13. Van Horn WD, Kim HJ, Ellis CD, Hadziselimovic A, Sulistijo ES, Karra MD, Tian C, Sonnichsen FD, Sanders CR. Solution nuclear magnetic resonance structure of membrane-integral diacylglycerol kinase. Science. 2009 Jun 26;324(5935):1726-9. PMID:19556511 doi:324/5935/1726
  14. Sarotti AM. Successful combination of computationally inexpensive GIAO C NMR calculations and artificial neural network pattern recognition: a new strategy for simple and rapid detection of structural misassignments. Org Biomol Chem. 2013 Jun 19. PMID:23779148 doi:10.1039/c3ob40843d
  15. Fowler NJ, Sljoka A, Williamson MP. A method for validating the accuracy of NMR protein structures. Nat Commun. 2020 Dec 18;11(1):6321. PMID:33339822 doi:10.1038/s41467-020-20177-1
  16. Terashi G, Wang X, Maddhuri Venkata Subramaniya SR, Tesmer JJG, Kihara D. Residue-wise local quality estimation for protein models from cryo-EM maps. Nat Methods. 2022 Sep;19(9):1116-1125. PMID:35953671 doi:10.1038/s41592-022-01574-4
  17. Wlodawer A, Minor W, Dauter Z, Jaskolski M. Protein crystallography for non-crystallographers, or how to get the best (but not more) from published macromolecular structures. FEBS J. 2008 Jan;275(1):1-21. PMID:18034855 doi:10.1111/j.1742-4658.2007.06178.x
  18. Laskowski, Roman A. 2003. Structural quality assurance. Chapter 14 in Structural Bioinformatics (2003) edited by Philip E. Bourne and Helge Weissig, Wiley-Liss, 649 pages. Complete contents at structuralbioinformaticsbook.com.
  19. Kleywegt, GJ. 2007. Quality control and validation. Methods Mol. Biol. 364:255-72.
  20. Kleywegt, GJ. 2000. Validation of protein crystal structures. Acta Crystallogr. D Biol. Crystallogr. 56:249-265.
  21. Kleywegt, GJ, AT Brünger. 1996. Checking your imagination: applications of the free R value. Structure 4:897-904.
  22. Practical Model Validation by Gerard Kleywegt, University of Uppsala, Sweden

Proteopedia Page Contributors and Editors (what is this?)

Eric Martz, Wayne Decatur, Eran Hodis
