How to predict structures with AlphaFold

From Proteopedia


In 2020, the AlphaFold project of Google's DeepMind team demonstrated a major breakthrough in predicting protein structure from sequence. Their success in the blind CASP competition astonished many experts. For an overview, see Theoretical models, bearing in mind "The Joys and Perils of AlphaFold"[1]. AlphaFold2 continued to have the highest success rate in the 2022 CASP 15 competition. In 2024, the AlphaFold team won half of the Nobel Prize in Chemistry.

In July 2021, DeepMind released AlphaFold as open-source code. Subsequently, several Colabs became available offering free structure prediction for user-submitted protein sequences. These Google Colabs (collaboratories)[2] let users submit sequences from a web browser. The code executes in the Google cloud, in space private to each user, and returns predicted structures. In 2024, DeepMind provided the AlphaFold3 server[3] (see below).

Below are instructions for beginners who wish to predict structures.


Is An Empirical Model Available?

Empirical models are the most accurate, so you should look for those first. See How To Find A Structure. If there is no empirical model for your amino acid sequence, it may be useful to explore empirical models for closely-related sequences, if available. Even if an empirical structure is available, most have missing residues or atoms, and it may be useful to compare it with the AlphaFold prediction: see Missing residues and incomplete sidechains.

Does AlphaFold Database Already Have Your Protein?

Structure predictions for over 200,000,000 proteins are available from the AlphaFold Database. If your protein is there, download the prediction from the Database. Then you can explore it in the viewer of your choice. For beginners, FirstGlance in Jmol is easiest if you want more than a momentary impression. Upload your PDB file. FirstGlance has numerous unique conveniences yet considerable depth and power.
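If you already know the UniProt accession of your protein, the download can be scripted. The sketch below is a minimal illustration, not an official client. It assumes the file-naming pattern the Database used at the time of writing (AF-<accession>-F1-model_v<version>.pdb under https://alphafold.ebi.ac.uk/files) and a model version of 4. Both are assumptions that may change, so check the download link on the Database entry page if the request fails. The function names are hypothetical.

```python
import urllib.request

# Assumed base URL and file-name pattern; verify against the Database's own download link.
ALPHAFOLD_FILES = "https://alphafold.ebi.ac.uk/files"


def alphafold_db_pdb_url(uniprot_accession, version=4):
    """Build the assumed URL of the PDB-format prediction for one UniProt accession."""
    accession = uniprot_accession.strip().upper()
    return f"{ALPHAFOLD_FILES}/AF-{accession}-F1-model_v{version}.pdb"


def download_prediction(uniprot_accession, destination, version=4):
    """Save the prediction to `destination`; raises an HTTP error if no such file exists."""
    url = alphafold_db_pdb_url(uniprot_accession, version)
    urllib.request.urlretrieve(url, destination)
    return destination
```

For example, download_prediction("P69905", "P69905.pdb") would attempt to fetch the prediction for human hemoglobin subunit alpha. The saved file can then be uploaded to the viewer of your choice.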

In 2024, AlphaFold Database predictions are always single protein chain structures without ligands. If your protein is an assembly of multiple chains, you will likely want to compare the Database structure with predictions from the latest servers capable of multiple-chain + ligand predictions (see below).

Prediction Servers

You can submit one sequence (or a set of sequences) to these servers, and they will return predicted structures along with estimates of confidence in their predictions. This is not a comprehensive list. Please add other servers of interest to a broad range of users, including beginners.

  • 2024[5]: CombFold predicts the structures of large protein complexes from subunit sequences using AlphaFold Multimer paired with a combinatorial method to assemble subunits. From Shor and Schneidman-Duhovny[5].
  • 2022[10]: AlphaFill “transplants” missing ligands, cofactors and (metal) ions into AlphaFold models. From the Perrakis team[10]. Ligand positioning is approximate. See CAUTION provided by the AlphaFill team:
"AlphaFill models are not meant or suitable for precise quantification of interactions between the transferred ligand(s) and the protein (e.g. hydrogen bonds, π-π or cation-π interactions, van der Waals interactions, hydrophobic interactions, halogen bonds)."
  • 2021[11]: RoseTTAFold at Robetta is an independent design from the Baker team[11], influenced by the design of AlphaFold2. Predicts monomers and multimers. Comparing results of RoseTTAFold with results of AlphaFold2/3 is worthwhile. At Robetta, open the Structure Prediction menu at the top, and choose Submit. Be sure to check RoseTTAFold under Optional!

Cost?

The above servers are free for limited, non-commercial use.

Colabs: After multiple free jobs in a Colab, a new job may be refused. You may be informed that a GPU could not be assigned. In 2024, a subscription to Colab Pro is US $10/month. Paying this will enable you to do many more jobs.

Visualizing Predicted Structures

FirstGlance in Jmol automatically colors its initial view of uploaded AlphaFold or RoseTTAFold models by estimated confidence pLDDT (blue for high confidence, red for low confidence). After you go to other views or tools, you can always get back to this color scheme by clicking Reliability Estimates in the Views tab.

  • iCn3D automatically colors AlphaFold2 Database models loaded from their UniProt IDs. For AlphaFold files opened from your computer, use pLDDT on the pull-down Color menu.
  • PyMOL and ChimeraX have no built-in confidence/pLDDT color scheme. Their rainbow/spectrum color schemes for temperature/B-factor do color by confidence/pLDDT, but with the colors inverted relative to the AlphaFold color scheme.

Upload your predicted PDB file to FirstGlance.Jmol.Org, which has many unique conveniences and capabilities.

You can easily visualize

  • Estimated confidence/pLDDT by touching an atom
  • Average confidence/pLDDT ("reliability") for the entire model, or for a specified sequence range.
  • Secondary structure (Views tab)
  • Distribution of hydrophobic vs. polar residues (Views tab: integral membrane proteins will have large hydrophobic surfaces while soluble proteins will have hydrophobic cores revealed by the Slab button)
  • Distribution of charges (Views tab: nucleic acid binding sites will have clusters of positive charges)
  • Disulfide bonds (Tools tab)
  • Domain structure and positions of the ends of the polypeptide chain (Views tab: N -> C Rainbow)
  • Locations of functional sites by evolutionary conservation (see instructions at How_to_see_conserved_regions)

Instructions for ColabFold 2022

This procedure was written in 2022[12]. In 2024, ColabFold is not necessarily the best or only place to submit your job: see Prediction Servers above.

Initially, AlphaFold and ColabFold performed best with single chains[8], which may include one or a few domains. The instructions below were written before ColabFold was adapted to prediction of multimers. If you are interested in complexes or alternate conformations, please see ColabFold instructions in the 2023 paper by Kim et al. [13]

Submitting A Sequence

First, if your query is a single chain molecule, check the AlphaFold Database for the protein of interest. If its structure has already been predicted there, download it, and skip to Interpreting Results below. Otherwise ...

Don't worry about any of the options not specifically mentioned below. Leave them at their default settings.
1. Obtain the sequence of the protein of interest, e.g. at UniProt. Click on the FASTA button above the sequence in UniProt. Copy only the sequence, excluding the FASTA header line that begins with ">".

2. Log in with a Google account at AlphaFold2_advanced. You can register for a free Gmail account to use for login. (Another free AlphaFold2 service is ColabFold. Using it may require a procedure different from the steps below.)

3. Paste in your sequence, making sure to completely replace the default sequence:

This input slot accepts sequences longer than 1,000 amino acids, even though it is only one line. Sequences of about 1,000 amino acids or more may cause the Colab to fail, but they can be predicted by submitting them as two halves[14]. See also Joining AlphaFold predictions for halves of a molecule.

4. Enter a jobname in the slot below the sequence slot. The results.zip filename will begin with this jobname (but none of its contents include the jobname).

5. Scroll down to the section titled run alphafold, subsection Sampling options:

  • num_models, the number of models to be predicted, is 5 by default. You could reduce this to 3 if you are in a hurry.
  • max_recycles: Set this to 48 (or at least 12). Recycling stops early once the model has converged to the specified tolerance, so the actual number of recycles performed may be smaller. The default of 3 recycles is often not enough for an optimal result.
  • tol (tolerance): Set this to 0.5 Å (or 1.0 to get a faster result). When a prediction differs from the previous "recycle" prediction by less than this value (RMSD in Å between alpha carbons), the recycles will stop.
  • num_samples (random seeds): Leave this at 1. Beware that if you increase this above 1, you will generate a number of models equal to the product of this value times num_models. This will proportionally increase the time to complete a result.


6. Open the Runtime menu at the very top of the page, and select Run all.
Don't worry about the "Warning". It is just Google's disclaimer that they did not write the code you are about to execute. Click Run anyway.
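Preparing the sequence for steps 1 and 3 can be scripted. The following is a minimal sketch and is not part of the Colab. The helper names sequence_from_fasta and split_in_overlapping_halves are hypothetical, and the overlap length is a value you choose yourself (note [14] reports using about 350 residues for a sequence of about 1,300).

```python
def sequence_from_fasta(text):
    """Return the bare amino acid sequence from pasted FASTA text."""
    pieces = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and the FASTA header line
        pieces.append(line)
    return "".join(pieces).upper()


def split_in_overlapping_halves(seq, overlap):
    """Split seq into two pieces that share `overlap` residues around the midpoint."""
    if overlap < 0 or overlap >= len(seq):
        raise ValueError("overlap must be between 0 and len(seq) - 1")
    mid = len(seq) // 2
    first_end = mid + (overlap + 1) // 2   # first piece runs past the midpoint
    second_start = mid - overlap // 2      # second piece starts before the midpoint
    return seq[:first_end], seq[second_start:]
```

The last `overlap` residues of the first piece are identical to the first `overlap` residues of the second piece. This shared stretch is what allows the two predicted halves to be superposed and joined afterwards.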

Downloading Results

Do NOT close your AlphaFold2_advanced browser tab until the job is completed. It appears that you will lose your job if you close the browser tab. You will be warned if you inadvertently try.

When the job is completed, a dialog to download a zip file will appear automatically. (Sometimes you will be asked for permission to enable download first.)

Interpreting Results

Static images of backbone renderings of predicted models will appear in your web browser at the bottom of the section run alphafold as each is completed.

Estimated Reliability

Each predicted model has an average estimated reliability (pLDDT, predicted local distance difference test). Values above 90 are likely accurate; values below 70 indicate low confidence. For more about interpreting these values, please see the AlphaFold Database FAQ.

Each residue has an estimated reliability of its position (0-100) in the PDB temperature column. BEWARE that high values mean high confidence, and low values mean low confidence. This is the INVERSE of crystallographic temperature values, where low values are good and high values are bad. Uploading your PDB file to FirstGlance in Jmol will automatically color each residue by its estimated reliability.
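If you prefer to read these values directly from the file, the sketch below shows one way to do it. It assumes a standard fixed-column PDB file in which the temperature (B-factor) field occupies columns 61-66 of each ATOM record, and it takes one value per residue from the alpha carbon. The function names are hypothetical. The thresholds of 90 and 70 are those given above, while the label "intermediate" for values in between is only a convenience.

```python
def residue_plddt(pdb_text):
    """Map (chain, residue number) to pLDDT, read from the B-factor column of CA atoms."""
    scores = {}
    for line in pdb_text.splitlines():
        # ATOM records only, so calcium ions (HETATM records named CA) are ignored.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            chain = line[21]
            resnum = int(line[22:26])
            scores[(chain, resnum)] = float(line[60:66])
    return scores


def confidence_band(plddt):
    """Classify one pLDDT value using the thresholds quoted in the text."""
    if plddt > 90:
        return "likely accurate"
    if plddt < 70:
        return "low confidence"
    return "intermediate"  # label chosen here for illustration


def mean_plddt(scores, first=None, last=None):
    """Average pLDDT over all residues, or over residue numbers first..last inclusive."""
    selected = [
        value
        for (_chain, resnum), value in scores.items()
        if (first is None or resnum >= first) and (last is None or resnum <= last)
    ]
    if not selected:
        raise ValueError("no residues in the requested range")
    return sum(selected) / len(selected)
```

Remember that a high number here means high confidence, the opposite of a crystallographic temperature value.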

Intrinsic Disorder

Some models have high confidence in a folded domain, and low confidence in a segment that is not part of a compact domain. Low-confidence segments may be intrinsically disordered. It is useful to compare predictions of disorder with AlphaFold reliability estimates.

Relative Positions of Domains

If the predicted model has more than one domain, each domain may have high confidence, yet the relative positions of the domains may not. The estimated reliability of relative domain positions is in graphs of predicted aligned error (PAE) which are included in the downloadable zip file of results. For an explanation, see How should I interpret the relative positions of domains? in the AlphaFold Database FAQ.
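A single number can summarize what the PAE graph shows for a pair of domains. The sketch below is an illustration under stated assumptions: the PAE values are already loaded as a square matrix (a list of rows, in Å, one row and one column per residue), and you supply the domain boundaries yourself as 1-based residue ranges. How the matrix is stored in the results depends on the server, so loading it is not shown. The function name is hypothetical.

```python
def mean_interdomain_pae(pae, domain_a, domain_b):
    """Average PAE (Å) over residue pairs with one residue in each domain.

    pae      -- square matrix as a list of rows, pae[i][j] for residues i+1 and j+1
    domain_a -- (first, last) residue numbers, 1-based and inclusive
    domain_b -- (first, last) residue numbers, 1-based and inclusive
    """
    rows_a = range(domain_a[0] - 1, domain_a[1])
    rows_b = range(domain_b[0] - 1, domain_b[1])
    # PAE is not symmetric, so both directions (a to b and b to a) are included.
    values = [pae[i][j] for i in rows_a for j in rows_b]
    values += [pae[j][i] for i in rows_a for j in rows_b]
    if not values:
        raise ValueError("empty domain range")
    return sum(values) / len(values)
```

A low average between two domains suggests their relative positions are predicted with confidence. A high average means each domain may be well predicted on its own while their arrangement is uncertain.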

Recycles For Convergence

You may be interested to note the number of recycles required for each model to converge to the specified tolerance. These numbers are not captured in the downloaded zip file.
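The idea of recycling until convergence can be sketched as follows. This is a simplified illustration, not the Colab's actual code: it assumes successive predictions are in the same coordinate frame (no superposition is done), and predict_once is a hypothetical stand-in for one pass of the network that returns alpha carbon coordinates. The real notebook may measure the change between recycles differently.

```python
import math


def ca_rmsd(coords_a, coords_b):
    """RMSD in Å between two equal-length lists of (x, y, z) alpha carbon positions."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(total / len(coords_a))


def recycles_until_converged(predict_once, max_recycles, tol):
    """Recycle until the change between successive predictions drops below tol.

    Returns (coordinates, recycles performed, last change in Å).
    """
    previous = predict_once(None)       # initial pass, before any recycling
    change = float("inf")
    for n in range(1, max_recycles + 1):
        current = predict_once(previous)
        change = ca_rmsd(previous, current)
        previous = current
        if change < tol:
            return current, n, change   # converged before reaching the maximum
    return previous, max_recycles, change
```

If the last change returned is still at or above tol, the maximum number of recycles was reached without convergence.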

The models will be ranked with number one having the highest estimated reliability (pLDDT). This is usually not in the order in which they were calculated. You might want to copy the ranking list, perhaps adding the number of recycles and final tolerance values:

model rank based on pLDDT              Recycles   Tolerance
rank_1_model_2_ptm_seed_0 pLDDT:62.46    10          0.33
rank_2_model_3_ptm_seed_0 pLDDT:59.59     9          0.47
rank_3_model_1_ptm_seed_0 pLDDT:55.63    12          0.52

Notice that the model predicted 2nd had the best estimated reliability (pLDDT), and that the model ranked 3rd did not quite achieve the specified tolerance of 0.5 Å RMSD after 12 recycles. (12 was specified as the maximum in this job.)

Also notice that, in this case, all 3 models have low confidence (pLDDT < 70), and are of questionable value.
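If you copy the ranking list as text, a few lines of code can pull out the rank, the model number, and the pLDDT, and flag low-confidence models. This is a minimal sketch that assumes the lines look like the example above (rank_N_model_M_... followed by pLDDT:value); other notebooks may print the ranking differently. The function name is hypothetical.

```python
import re

# Matches e.g. "rank_1_model_2_ptm_seed_0 pLDDT:62.46"
RANK_LINE = re.compile(r"rank_(\d+)_model_(\d+)_\S*\s+pLDDT:([\d.]+)")


def parse_ranking(text):
    """Return one dict per ranked model found in the pasted ranking text."""
    rows = []
    for match in RANK_LINE.finditer(text):
        rank, model, plddt = match.groups()
        rows.append({
            "rank": int(rank),
            "model": int(model),             # order in which the model was predicted
            "plddt": float(plddt),
            "low_confidence": float(plddt) < 70,
        })
    return rows
```

Applied to the three lines above, this reports models 2, 3 and 1 in rank order, all flagged as low confidence.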

See Also

References and Notes

  1. Perrakis A, Sixma TK. AI revolutions in biology: The joys and perils of AlphaFold. EMBO Rep. 2021 Oct 20:e54046. PMID:34668287 doi:10.15252/embr.202154046
  2. Collaboratory FAQ at Google.
  3. Abramson J, Adler J, Dunger J, Evans R, Green T, Pritzel A, Ronneberger O, Willmore L, Ballard AJ, Bambrick J, Bodenstein SW, Evans DA, Hung CC, O'Neill M, Reiman D, Tunyasuvunakool K, Wu Z, Žemgulytė A, Arvaniti E, Beattie C, Bertolli O, Bridgland A, Cherepanov A, Congreve M, Cowen-Rivers AI, Cowie A, Figurnov M, Fuchs FB, Gladman H, Jain R, Khan YA, Low CMR, Perlin K, Potapenko A, Savy P, Singh S, Stecula A, Thillaisundaram A, Tong C, Yakneen S, Zhong ED, Zielinski M, Žídek A, Bapst V, Kohli P, Jaderberg M, Hassabis D, Jumper JM. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature. 2024 Jun;630(8016):493-500. PMID:38718835 doi:10.1038/s41586-024-07487-w
  4. Krishna R, Wang J, Ahern W, Sturmfels P, Venkatesh P, Kalvet I, Lee GR, Morey-Burrows FS, Anishchenko I, Humphreys IR, McHugh R, Vafeados D, Li X, Sutherland GA, Hitchcock A, Hunter CN, Kang A, Brackenbrough E, Bera AK, Baek M, DiMaio F, Baker D. Generalized biomolecular modeling and design with RoseTTAFold All-Atom. Science. 2024 Apr 19;384(6693):eadl2528. PMID:38452047 doi:10.1126/science.adl2528
  5. Shor B, Schneidman-Duhovny D. CombFold: predicting structures of large protein assemblies using a combinatorial assembly algorithm and AlphaFold2. Nat Methods. 2024 Mar;21(3):477-487. PMID:38326495 doi:10.1038/s41592-024-02174-0
  6. Mirdita M, Schütze K, Moriwaki Y, Heo L, Ovchinnikov S, Steinegger M. ColabFold: making protein folding accessible to all. Nat Methods. 2022 Jun;19(6):679-682. PMID:35637307 doi:10.1038/s41592-022-01488-1
  7. Kim G, Lee S, Levy Karin E, Kim H, Moriwaki Y, Ovchinnikov S, Steinegger M, Mirdita M. Easy and accurate protein structure prediction using ColabFold. Nat Protoc. 2024 Oct 14. PMID:39402428 doi:10.1038/s41596-024-01060-5
  8. Protein complex prediction with AlphaFold-Multimer, Preprint, Evans et al. 2021.
  9. Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, Tunyasuvunakool K, Bates R, Zidek A, Potapenko A, Bridgland A, Meyer C, Kohl SAA, Ballard AJ, Cowie A, Romera-Paredes B, Nikolov S, Jain R, Adler J, Back T, Petersen S, Reiman D, Clancy E, Zielinski M, Steinegger M, Pacholska M, Berghammer T, Bodenstein S, Silver D, Vinyals O, Senior AW, Kavukcuoglu K, Kohli P, Hassabis D. Highly accurate protein structure prediction with AlphaFold. Nature. 2021 Jul 15. PMID:34265844 doi:10.1038/s41586-021-03819-2
  10. Hekkelman ML, de Vries I, Joosten RP, Perrakis A. AlphaFill: enriching AlphaFold models with ligands and cofactors. Nat Methods. 2023 Feb;20(2):205-213. PMID:36424442 doi:10.1038/s41592-022-01685-y
  11. Baek M, DiMaio F, Anishchenko I, Dauparas J, Ovchinnikov S, Lee GR, Wang J, Cong Q, Kinch LN, Schaeffer RD, Millan C, Park H, Adams C, Glassman CR, DeGiovanni A, Pereira JH, Rodrigues AV, van Dijk AA, Ebrecht AC, Opperman DJ, Sagmeister T, Buhlheller C, Pavkov-Keller T, Rathinaswamy MK, Dalwadi U, Yip CK, Burke JE, Garcia KC, Grishin NV, Adams PD, Read RJ, Baker D. Accurate prediction of protein structures and interactions using a three-track neural network. Science. 2021 Jul 15. PMID:34282049 doi:10.1126/science.abj8754
  12. Some of the recommended options were gleaned from the 1 hour 46 minute video of presentations by Sergey Ovchinnikov and Martin Steinegger (August, 2021) for the Boston Protein Design and Modeling Club hosted by Chris Bahl.
  13. Easy and accurate protein structure prediction using ColabFold, 2023, Kim et al. (ColabFold authors; see reference 7 for the journal version).
  14. I had one sequence of length ~1,300. After it failed, I submitted it as two halves with a substantial overlap (~350 residues). The middle ~200 residues of the overlap in the two predicted structures superposed very closely with DeepView. I trimmed off the ends that superposed poorly, and superposed the two halves via the mid-overlap. By inspection, I chose a pair of alpha carbons near the middle where the alpha carbon positions were nearly identical. I trimmed each half to this position, and "ligated" the two halves by combining the superposed half PDB files with a text editor. For further details, contact User:Eric_Martz.

Proteopedia Page Contributors and Editors

Eric Martz
