How to find the structure of a protein
From Proteopedia
Here is a general guide to finding a structure for a protein molecule of interest. These procedures are some of many possible. When you find a structure you want, below are also instructions for loading it into FirstGlance in Jmol, which is the easiest place to learn about and explore your structure.
Empirical Models
Empirical models are structures determined empirically (experimentally) by X-ray crystallography, cryo-electron microscopy, solution NMR, or rarely by other methods. Empirical models are usually the most accurate and reliable, especially when they have good resolution. All published, empirically-determined, atomic-resolution, macromolecular 3D structures are available in the Worldwide Protein Data Bank (the "PDB").
Each model in the PDB has a unique 4-character identification code (PDB ID) that begins with a numeral, and has letters or numerals for the last 3 characters. Examples are 1d66, 4mdh, 9ins.
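The ID format described above can be checked mechanically. The following is a minimal sketch in Python, assuming only the rule stated here (one numeral followed by three letters or numerals); the names PDB_ID_PATTERN and is_pdb_id are made up for illustration.

```python
import re

# One numeral followed by three letters or numerals, as described in the text.
PDB_ID_PATTERN = re.compile(r"[0-9][A-Za-z0-9]{3}")


def is_pdb_id(text: str) -> bool:
    """Return True if text looks like a 4-character PDB ID such as 1d66."""
    return PDB_ID_PATTERN.fullmatch(text.strip()) is not None
```

For example, is_pdb_id("1d66") is true, while is_pdb_id("d661") is false because the first character is not a numeral. The check says nothing about whether an entry with that ID actually exists in the PDB.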
Below are methods for finding out if your query amino acid sequence, or parts of it, have empirically-determined 3D structures in the PDB. Even if empirical models are available, you may wish to compare them with AlphaFold-predicted models because empirical models often have missing residues or atoms, while AlphaFold models are complete. See Missing residues and incomplete sidechains.
Easy UniProt search for empirical models
At UniProt.Org, find your protein of interest.
- Example: search for human acetylcholinesterase, then click on P22303.
- Click on Structure in the column at left, and wait for this section to load.
- If there is a table with PDB in the first column of each row, followed by a PDB ID in each row in the IDENTIFIER column, these are empirical structures for your protein.
- If there is no PDB list, go to Structure Predicted by AlphaFold below.
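The same table can also be read programmatically. The sketch below assumes the UniProt entry has already been obtained as JSON and parsed into a Python dict. It further assumes that cross-references sit in a list named uniProtKBCrossReferences, with each PDB item carrying Method, Resolution and Chains properties. That layout is an assumption to verify against the UniProt REST documentation, and the function name is hypothetical.

```python
def pdb_cross_references(entry: dict) -> list[dict]:
    """Collect PDB cross-references from a UniProt entry parsed from JSON.

    Assumed layout (check against the UniProt REST documentation):
    entry["uniProtKBCrossReferences"] is a list of dicts with keys
    "database", "id" and "properties" (a list of key/value dicts).
    """
    hits = []
    for xref in entry.get("uniProtKBCrossReferences", []):
        if xref.get("database") != "PDB":
            continue  # skip cross-references to other databases
        props = {p["key"]: p["value"] for p in xref.get("properties", [])}
        hits.append({
            "pdb_id": xref["id"].lower(),
            "method": props.get("Method"),
            "resolution": props.get("Resolution"),
            "positions": props.get("Chains"),  # chains and sequence range
        })
    return hits
```

An empty result corresponds to the "no PDB list" case above, in which an AlphaFold-predicted structure is the next thing to look for.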
Choosing a model
If there are multiple PDB structures, you need to pick one (or a few) that best meet your needs. Because of its medical and pharmaceutical interest, P22303 has more than 50 empirical PDB structures.
Resolution
Models with the best resolution (2.5 Å or less) will be the most accurate.
Coverage
You will likely prefer models that cover all or most of the complete sequence.
- Click Sequence (in the column at left) and note the length (in amino acids). For P22303, the length is 614.
- Click PTM/Processing (in the column at left) and note the length of the signal sequence, if given. The mature protein will start after the signal sequence. For P22303, the signal sequence is 1-31. Therefore, the mature protein will start at position 32.
- The POSITIONS column gives the sequence range for the protein used for structure determination. For P22303, 1vzj includes only a short C-terminal segment, 575-614, a tetramerization domain[1]. But most of the models span 33-574, omitting the tetramerization domain.
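The coverage arithmetic in the steps above can be written out as a short Python sketch. The function name and its parameters are illustrative. It considers only the sequence range reported in the POSITIONS column, so residues that are missing inside that range are not accounted for.

```python
def mature_coverage(seq_length: int, signal_end: int,
                    model_start: int, model_end: int) -> tuple[int, int, float]:
    """Return (mature start, mature length, fraction of mature protein covered).

    seq_length  -- full sequence length from the Sequence section
    signal_end  -- last residue of the signal sequence (0 if none is given)
    model_start, model_end -- range from the POSITIONS column
    """
    mature_start = signal_end + 1
    mature_length = seq_length - signal_end
    # Clip the model range to the mature protein before counting residues.
    first = max(model_start, mature_start)
    last = min(model_end, seq_length)
    covered = max(0, last - first + 1)
    return mature_start, mature_length, covered / mature_length


# P22303: length 614, signal sequence 1-31, typical model range 33-574.
start, length, fraction = mature_coverage(614, 31, 33, 574)
print(start, length, f"{fraction:.0%}")  # 32 583 93%
```

By the same calculation, the 575-614 segment in 1vzj covers 40 of the 583 mature residues, or about 7%.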
Missing Residues
Pay attention to the number and locations of missing residues and incomplete sidechains, as these limit the value of the model. FirstGlance clearly marks regions with missing residues in its initial view, whereas other structure viewers may leave you unaware of how many residues are missing and where. FirstGlance also lists the sequence positions of missing residues for each chain in the structure and summarizes the number of charged residues missing. When the best available empirical structure has substantial numbers of missing residues and/or incomplete sidechains, compare it with the AlphaFold model, which will be complete (see examples). When large loops are missing, be aware that these may be Intrinsically Disordered Protein. In addition, residues can be lost by proteolysis during purification, radiation damage during X-ray crystallography can sometimes remove atoms[2], and conformations can be affected by crystallization conditions[3].
Quality
When comparing PDB IDs (empirical structures), pay attention to Rfree, a measure of quality. When FirstGlance displays a PDB ID, it interprets Rfree objectively, characterizing it (at the stated resolution) as much better than average, better than average, average, worse than average, or unreliable. Avoid models with worse than average Rfree when possible, as they likely have structural errors.
Ligands
You may prefer a model that includes a specific ligand, such as an inhibitor. Here are two ways to evaluate ligands.
- Ligands via Proteopedia
  - Go to Proteopedia.Org.
  - Enter the 4-character PDB ID from the UniProt IDENTIFIER column into the Proteopedia search slot at the left.
  - At the Proteopedia page titled with your PDB ID:
    - The title of the model often mentions the key ligand. For human acetylcholinesterase P22303 4ey5, it is huperzine A.
    - Abbreviations for all ligands present are listed in a blue/green bar. Clicking on one highlights it in the 3D view, and shows its full name in red at the bottom.
- Ligands via FirstGlance in Jmol
  - Go to FirstGlance.
  - Enter the 4-character PDB ID from the UniProt IDENTIFIER column into the FirstGlance slot.
  - After the model displays, in the upper left panel scroll down to Ligands+ & Non-Standard Residues. There you will find a clickable list of all ligands with their full names.
Structure Predicted by AlphaFold
If there are no empirical models for your sequence, the Structure section in UniProt usually offers a structure predicted by AlphaFold. Empirical models are the most reliable, but if none are available, AlphaFold has an impressive track record of correctly predicting structures from sequence. The AlphaFold models offered by UniProt come from the AlphaFold Database, and are limited to single-chain proteins with no ligands.
If there is no AlphaFold model in UniProt for your sequence, or if your molecule has multiple chains (protein, nucleic acids) or important ligands, you can submit the sequences and get a prediction: see How to predict structures with AlphaFold.
- Download the predicted PDB file (a file ending .pdb). In UniProt, use the download button (a down arrow) in the AlphaFold line. Example: spider acetylcholinesterase.
- Go to FirstGlance.
- Upload the PDB file to FirstGlance.
FirstGlance automatically colors predicted models by confidence.
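The confidence values can also be inspected directly. In PDB files from the AlphaFold Database, the per-residue confidence score (pLDDT, 0-100) is stored in the B-factor (temperature factor) column. The sketch below reads it from the alpha carbon of each residue in a single-chain file. The band thresholds are the commonly quoted pLDDT cutoffs, and the labels are my own wording. FirstGlance's color scheme may use different cutoffs, and the function names are illustrative.

```python
def plddt_by_residue(pdb_text: str) -> dict[int, float]:
    """Map residue number -> pLDDT, read from alpha-carbon ATOM records.

    Uses fixed PDB-format columns: atom name 13-16, residue number 23-26,
    temperature factor 61-66 (1-based). Assumes a single-chain model.
    """
    scores = {}
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores[int(line[22:26])] = float(line[60:66])
    return scores


def confidence_band(plddt: float) -> str:
    """Classify a pLDDT value using the commonly quoted 90/70/50 cutoffs."""
    if plddt > 90:
        return "very high"
    if plddt > 70:
        return "high"
    if plddt > 50:
        return "low"
    return "very low"
```

Counting how many residues fall into each band gives a quick numerical summary of the model's overall confidence.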
Sequence-Related Empirical Models
This method finds empirical structures that have sequence similarity to the query. Their structures can be compared to AlphaFold models.
For example, if your query is calmodulin from the lancelet fish (Q9UB37, CALM2_BRALA), zero empirical structures are listed at UniProt. However, the query is 97% sequence identical to human calmodulin (P62158 CALM_HUMAN) and calmodulins from other taxa, for which there are numerous full-length empirical structures. When these superpose closely with the AlphaFold-predicted model, you can have high confidence in the AlphaFold model.
Another example: searching UniProt for trapdoor spider acetylcholinesterase finds W4VSJ0 which has no empirical models.
- At UniProt.Org, in the Sequence section, note the length of your sequence.
- At UniProt, in the Sequence section, click Download. This displays the sequence in FASTA format.
- Copy the FASTA-formatted sequence, excluding the identifier line at the top that begins '>'.
- At RCSB.org (the USA branch of the PDB), click on the Advanced Search link just below the slot at the top.
- Click on Sequence Similarity under 'Advanced Search Query Builder', which opens a slot for your sequence.
- Paste your query sequence into the slot.
- Change Return in the bottom line from Structures to Polymer Entities.
- Push the button at the lower right to run the search.
- Scroll down to see the list of hits.
- The best hits will be listed first. Notice that each hit starts with a large, bold PDB ID.
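The steps above can also be scripted. The sketch below strips the identifier line from a FASTA-formatted sequence and builds a sequence-similarity query that returns polymer entities. The JSON field names (service, sequence_type, identity_cutoff, evalue_cutoff, return_type) reflect my understanding of the RCSB Search API and should be checked against its documentation before use. The cutoff defaults are arbitrary illustrative values, and both function names are hypothetical.

```python
import json


def fasta_to_sequence(fasta_text: str) -> str:
    """Drop identifier lines (starting with '>') and join the rest."""
    lines = [ln.strip() for ln in fasta_text.splitlines()]
    return "".join(ln for ln in lines if ln and not ln.startswith(">"))


def sequence_search_query(sequence: str,
                          identity_cutoff: float = 0.3,
                          evalue_cutoff: float = 0.1) -> dict:
    """Build a sequence-similarity query (assumed RCSB Search API layout)."""
    return {
        "query": {
            "type": "terminal",
            "service": "sequence",
            "parameters": {
                "sequence_type": "protein",
                "value": sequence,
                "identity_cutoff": identity_cutoff,  # fraction, not percent
                "evalue_cutoff": evalue_cutoff,
            },
        },
        # Matches the "Polymer Entities" choice in the web form.
        "return_type": "polymer_entity",
        "request_options": {"scoring_strategy": "sequence"},
    }


def query_as_json(fasta_text: str) -> str:
    """Convenience wrapper: FASTA text in, JSON request body out."""
    return json.dumps(sequence_search_query(fasta_to_sequence(fasta_text)))
```

The resulting JSON would be sent to the search service with an HTTP client. That step is left out here so that the sketch runs without network access.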
For each hit, notice the Sequence Identity % above the sequence alignment box. The top sequence similarity hit for W4VSJ0 at RCSB.Org, 6emi, has 41% sequence identity. Human acetylcholinesterase 4ey5 is the 102nd hit with 36% sequence identity.
Also notice the Region range, which tells you the range of residues in the PDB ID that align with your query sequence. Compare this to the full length of your query sequence.
The length of the spider protein W4VSJ0 is 559, or 538 after subtracting the 21-residue signal sequence. The top hit sequence alignment region is 8-529 for 6emi. That range aligns with 31-557 (length 527) of the query sequence (with a few small gaps), so the coverage of the query sequence is nearly complete (527/538 = 98% minus gaps).[4]
For 4ey5, the spider protein query alignment region is 1-534 (length 534), so the coverage of the query sequence is also nearly complete (534/538 = 99% minus gaps).
- To explore the structure of a hit in FirstGlance, just enter the PDB ID.
- If you click the Download button in the list of hits at RCSB, you will get the CIF file. If you need PDB file format, click on the PDB ID code and open the Download menu on that single entry page to get all format options. A downloaded PDB file can be uploaded to FirstGlance.
Structure Superposition
Superposing ("aligning"[5]) two structures tells how similar they are, and highlights where they differ. Similarity between an AlphaFold-predicted structure and an empirical structure for a sequence similar to the query sequence supports confidence in the AlphaFold prediction. See an example superposition here.
Structure is more conserved than sequence[6][7][8]. This conclusion is supported by many examples of proteins that have similar structures, yet no discernible sequence identity. The customary interpretation of this frequent observation is that modern proteins with very similar structures have a common ancestor, and that their sequences diverged while maintaining the ancestral 3D fold structure.
An example is the ftsZ cell division protein in bacteria which shares structure with mammalian tubulin despite only 12-15% sequence identity[9]. This example is illustrated in interactive 3D at Visualizing Structure Superpositions. (Be sure to look at the morph.) The core folds are very similar (RMSD about 3 Å), with surface loops diverging more.
Superposition with FATCAT
An easy and powerful tool for superposing two structures is FATCAT (see Structure superposition tools). Use its "Pairwise Alignment" tool to superpose two structures that you specify. To submit an AlphaFold model to FATCAT, first download the .pdb file, then upload it to FATCAT. FATCAT gives you an RMSD, and its Interactive Viewer displays the superposed structures in many different ways. Perhaps the most useful is its Animation, a morph between the two superposed structures.
- Click on the green Interactive Viewer button.
- To display the morph, at the Interactive Viewer page, click the Animation radio button.
- If you wish, you can capture the morph animation as an .mp4 or .gif -- see Capturing Videos.
For the AlphaFold model of spider acetylcholinesterase W4VSJ0 vs. human acetylcholinesterase 4ey5, FATCAT superposes 517 alpha carbon atoms with RMSD 2.4 Å. 517 is 96% of the 538 of the spider protein, and 89% of the 583 of the human protein (lengths given after subtracting the signal sequences). The morph animation is very informative, showing very close superposition of the core domain, with deviations being largely in the surface loops. This close superposition increases confidence in the AlphaFold-predicted model.
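To make the reported numbers concrete, the sketch below shows how an RMSD is computed from paired alpha carbon coordinates that have already been superposed, along with the percentage calculation used above. It does not perform the superposition itself and is not FATCAT's method; FATCAT finds the residue pairing and the optimal fit. The function names are illustrative.

```python
import math

Point = tuple[float, float, float]


def rmsd(coords_a: list[Point], coords_b: list[Point]) -> float:
    """Root-mean-square deviation between paired, already superposed atoms.

    coords_a[i] and coords_b[i] must be the aligned pair of alpha carbons.
    """
    if not coords_a or len(coords_a) != len(coords_b):
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(total / len(coords_a))


def percent_aligned(n_aligned: int, chain_length: int) -> float:
    """Percentage of a chain's residues included in the superposition."""
    return 100.0 * n_aligned / chain_length


# The numbers quoted above: 517 aligned alpha carbons.
print(round(percent_aligned(517, 538)))  # 96 (spider protein)
print(round(percent_aligned(517, 583)))  # 89 (human protein)
```

Because RMSD averages over all aligned pairs, a few widely separated surface loops can raise it noticeably even when the core superposes closely, which is why the morph animation is a useful complement to the single number.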
Notes & References
- ↑ Dvir H, Harel M, Bon S, Liu WQ, Vidal M, Garbay C, Sussman JL, Massoulie J, Silman I. The synaptic acetylcholinesterase tetramer assembles around a polyproline II helix. EMBO J. 2004 Nov 10;23(22):4394-405. Epub 2004 Nov 4. PMID:15526038 doi:7600425
- ↑ Weik M, Ravelli RB, Kryger G, McSweeney S, Raves ML, Harel M, Gros P, Silman I, Kroon J, Sussman JL. Specific chemical and structural damage to proteins produced by synchrotron radiation. Proc Natl Acad Sci U S A. 2000 Jan 18;97(2):623-8. PMID:10639129
- ↑ Dym O, Song W, Felder C, Roth E, Shnyrov V, Ashani Y, Xu Y, Joosten RP, Weiner L, Sussman JL, Silman I. The Impact of Crystallization Conditions on Structure-Based Drug Design: A Case Study on the Methylene Blue/Acetylcholinesterase Complex. Protein Sci. 2016 Mar 14. PMID:26990888 doi:10.1002/pro.2923
- ↑ In the sequence alignment graphic at RCSB below 6emi, touch any part of the graphic and enlarge it with your mouse wheel. When sufficiently enlarged, you can see the first (or last) aligned residue of the query. Touching that residue reports its sequence number above the graphic.
- ↑ Structure superposition is often called "structure alignment", but "alignment" is easily confused with sequence alignment. Some structure superposition methods are guided by the sequence alignment, while others are independent of sequence. See Structure superposition tools.
- ↑ Chothia C, Lesk AM. The relation between the divergence of sequence and structure in proteins. EMBO J. 1986 Apr;5(4):823-6. PMID:3709526
- ↑ Holm L, Sander C. Mapping the protein universe. Science. 1996 Aug 2;273(5275):595-603. PMID:8662544 doi:10.1126/science.273.5275.595
- ↑ Holm L. Dali server: structural unification of protein families. Nucleic Acids Res. 2022 Jul 5;50(W1):W210-W215. PMID:35610055 doi:10.1093/nar/gkac387
- ↑ A 3D structure similarity search gives tubulin as one of the closest matches to ftsZ, with an RMSD (alpha carbons) of <2.6 Å.