Practical Guide to Homology Modeling

From Proteopedia

Homology modeling has become largely obsolete since the 2020 success of structure prediction by AlphaFold and other AI prediction systems. Rather than starting here, we suggest starting at How To Find A Structure.

Many assertions in this article lack literature citations. Help in improving the documentation of this article will be appreciated. Wikipedia's article on Homology modeling is well documented, although more technical and less of a practical guide than the present article.

1 Terminology
2 What Is A Homology Model?
3 Rationale for homology modeling
4 Do you need a homology model?
- 4.1 Has AlphaFold predicted a model?
- 4.2 Is there an empirical model?
  - 4.2.1 Simple search for empirical models (via PIR)
  - 4.2.2 Advanced search for empirical models (RCSB PDB)
5 Are parts (or all) of the query protein intrinsically disordered?
- 5.1 Prediction of intrinsic disorder
  - 5.1.1 MobiDB
  - 5.1.2 FoldIndex
6 Is your query protein in the structural genomics pipeline?
7 Limitations of Homology Modeling
8 Strengths of Homology Models
9 How to obtain homology models
- 9.1 Pre-calculated Models
- 9.2 Generating New Models
10 How To Explore 3D Models
- 10.1 FirstGlance in Jmol
- 10.2 Evolutionary Conservation
11 See Also
12 Notes and References

Terminology

Query sequence: The amino acid sequence for which a 3D model is wanted. More commonly called the target sequence, but talking about target vs. template gets confusing.
Template: An empirically determined 3D protein structure with significant sequence similarity to the query.
"Structure" will be used in this article to mean three-dimensional protein molecular structure.

What Is A Homology Model?

Homology models, also called comparative models, are obtained by folding a query protein sequence (also called the target sequence) to fit an empirically-determined template model. The registration between residues in the query and template is determined by an amino acid sequence alignment between the query and template sequences.

Imagine that the template’s polypeptide backbone is a folded glass tube. Now imagine that the query sequence is a thin metal chain that can be pulled through the tube. The chain (query) will adopt the same fold as the tube (template). The sequence alignment specifies how far the chain should be pulled into the tube; that is, how the residues in the query sequence match up with the structure of the template.

Errors or uncertainties in the sequence alignment result in errors or uncertainties in the homology model. Portions of the query sequence cannot be modeled reliably when there are gaps in the sequence alignment due to insertions/deletions ("indels"), or portions of the template that lack coordinates due to crystallographic disorder. Provided there is sufficient sequence identity between the query and template (at least 30%), the main chain in homology models is usually mostly correct. However, the positions of sidechain rotamers in homology models are usually unreliable.
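As a rough illustration of what "sequence identity" means for an alignment, here is a minimal Python sketch. It assumes the two sequences are already aligned (equal length, with '-' marking gaps) and counts identity over the gap-free columns only. Alignment programs differ in which denominator they use, so the percentages reported by a given server may not match this calculation exactly. The function name and the toy sequences are made up for the example.

```python
def percent_identity(aligned_query: str, aligned_template: str) -> float:
    """Percent of gap-free alignment columns that hold identical residues."""
    if len(aligned_query) != len(aligned_template):
        raise ValueError("aligned sequences must have the same length")
    aligned = 0
    identical = 0
    for q, t in zip(aligned_query.upper(), aligned_template.upper()):
        if q == "-" or t == "-":
            continue  # a gap (indel) column is not counted as aligned
        aligned += 1
        if q == t:
            identical += 1
    return 100.0 * identical / aligned if aligned else 0.0


# Hypothetical toy alignment; '-' marks a gap.
query = "MKT-AYIAKQR"
template = "MKSLAYLAK-R"
print(f"{percent_identity(query, template):.1f}% identity over gap-free columns")
```

In this toy case 7 of the 9 gap-free columns are identical, well above the 30% guideline mentioned above.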

Nevertheless, homology models are useful for seeing low-resolution features, such as which residues are on the surface or buried, which are close to other features of interest (such as a putative active site), and the overall distribution of charges and evolutionary conservation.

Rationale for homology modeling

Predicting the structure of a protein from its sequence, using theory, has had very limited success, despite decades of work by some very bright people and despite real progress having been made (see Theoretical models).

Structure is more conserved than sequence. This conclusion is supported by many examples of proteins that have similar structures, yet no discernible sequence identity. An example is the ftsZ cell division protein in bacteria, which shares structure with mammalian tubulin despite only 12-15% sequence identity^[1]. The customary interpretation is that modern proteins with very similar structures have a common ancestor, and that their sequences diverged while maintaining the ancestral 3D structure.

Thus, if the query sequence has significant identity with an empirically determined protein structure (the template), there is a very high probability that they have similar structures. Folding the query sequence identically to the template, guiding the registration by the sequence alignment, produces a homology model.

Do you need a homology model?

You don’t need a homology model if the amino acid sequence of interest (the query sequence) already has an empirically determined 3D structure. Structures determined empirically, by X-ray crystallography or (much less often) by solution NMR or cryo-EM, will almost always be more accurate than a homology model.

If AlphaFold has predicted a model for your amino acid sequence of interest, it will often be more accurate than a homology model, and in most cases, a homology model won't be possible due to lack of a suitable template.

Has AlphaFold predicted a model?

Empirical models are the most reliable, but if none are available, AlphaFold has an impressive track record of correctly predicting structures from sequence. Check the AlphaFold Database for a model of your protein of interest. You can also submit a sequence and get a prediction: How to predict structures with AlphaFold. Another model prediction service with a good track record is RoseTTAFold. Submit your sequence there, making sure to check RoseTTAFold as the method. With any of these methods, download the predicted PDB file and then upload it to FirstGlance in Jmol for exploration and analysis. FirstGlance automatically colors predicted models by reliability.
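If you want a quick numerical summary of reliability before opening a viewer, the sketch below may help. It relies on the fact that AlphaFold PDB files store the per-residue confidence score (pLDDT, 0-100) in the B-factor column, and it uses the confidence bands shown in the AlphaFold Database. Whether FirstGlance uses exactly the same bands is not assumed here. The function name and the four demonstration lines are made up for the example.

```python
def plddt_summary(pdb_text: str) -> dict[str, int]:
    """Count alpha carbons per confidence band, reading pLDDT from the B-factor column."""
    counts = {"very high": 0, "confident": 0, "low": 0, "very low": 0}
    for line in pdb_text.splitlines():
        # ATOM records: atom name in columns 13-16, B-factor in columns 61-66.
        if line.startswith("ATOM") and len(line) >= 66 and line[12:16].strip() == "CA":
            score = float(line[60:66])
            if score > 90:
                counts["very high"] += 1
            elif score >= 70:
                counts["confident"] += 1
            elif score >= 50:
                counts["low"] += 1
            else:
                counts["very low"] += 1
    return counts


def demo_ca_line(serial: int, resseq: int, plddt: float) -> str:
    """Build a made-up alpha carbon ATOM line in fixed-column PDB format."""
    return (f"ATOM  {serial:5d}  CA  ALA A{resseq:4d}    "
            f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}{1.0:6.2f}{plddt:6.2f}")


demo = "\n".join(demo_ca_line(i, i, s) for i, s in enumerate([95.2, 82.0, 61.5, 33.1], start=1))
print(plddt_summary(demo))
```

For a real prediction, read the downloaded file with open(...).read() and pass the text to plddt_summary.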

Is there an empirical model?

Empirically-determined models are usually the most reliable. All published, empirically-determined, atomic-resolution, macromolecular 3D structures are available in the Worldwide Protein Data Bank.

Each model in the PDB has a unique 4-character identification code (PDB ID) that begins with a numeral, and has letters or numerals for the last 3 characters. Examples are 1d66, 4mdh, 9ins.
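The format rule just described can be written as a short check, which is handy when pulling PDB IDs out of notes or spreadsheets. This is only a sketch of the 4-character rule as stated above. It says nothing about whether an entry with that ID actually exists. The function name is made up for the example.

```python
import re

# One numeral followed by three letters or numerals.
PDB_ID_PATTERN = re.compile(r"[0-9][A-Za-z0-9]{3}")


def looks_like_pdb_id(text: str) -> bool:
    """True if text has the 4-character PDB ID format."""
    return PDB_ID_PATTERN.fullmatch(text.strip()) is not None


for candidate in ("1d66", "4mdh", "9ins", "P04386", "d166"):
    print(candidate, looks_like_pdb_id(candidate))
```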

Here are two methods for finding out if your query amino acid sequence, or parts of it, have empirically-determined 3D structures in the PDB.

Simple search for empirical models (via PIR)

At UniProt.Org, find your protein and click on Structure (blue button at the left).

If there is a section 3D Structure Databases with a column labeled PDB entry containing 4-character PDB IDs, these are empirical structures for your protein. Pay attention to the “Positions” column, which gives the sequence number range covered by each model.
- To explore one of these models, write down its 4-character PDB code. Then see #How To Explore 3D Models below.
If there is no “PDB entry” column, then there are no sequence-identical empirical structures for your protein. Then try the Advanced search method below.
Some proteins have no Structure section (e.g. K4QDG1_SACBA). Then try the Advanced search method below.

If empirical structures exist, see #How To Explore 3D Models below. If they are satisfactory, then you don't need a homology model.

Advanced search for empirical models (RCSB PDB)

This method takes more time but gives you more information. It will find empirical structures that have sequence similarity to the query. Such hits enable a high-quality homology model.

For example, if your query is calmodulin from the lancelet fish (Q9UB37, CALM2_BRALA), zero empirical structures are listed at UniProt. However, the query is 97% sequence identical to human calmodulin (P62158 CALM_HUMAN) and calmodulins from other taxa, for which there are numerous full-length empirical structures. A very high quality homology model can be constructed.

Advanced search procedure:

Copy the FASTA format sequence for your protein, for example, from UniProt.Org.
Note the length of your sequence.
At rcsb.org, go to Advanced Search.
Select Sequence under 'Advanced Search Query Builder'.
Paste your query sequence into the box.
Push the button to run the search.
Scroll down to see the list of hits.
At the top of the list, change Display Results as to Polymer Entities. Then push the search button again. This is crucial because it displays the identity percentages and alignments for the hits. It should be the default!
The best hits will be listed first. Notice that each hit starts with a large, bold PDB ID.
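For the step "Note the length of your sequence", a few lines of Python can read the FASTA text and report the length, which avoids counting by hand. This is a minimal sketch that reads only the first record. The function name and the example record are made up, and the sequence shown is not a real protein.

```python
def read_first_fasta_record(text: str) -> tuple[str, str]:
    """Return (header, sequence) for the first record in FASTA-format text."""
    header = ""
    seen_header = False
    residues = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seen_header:
                break  # stop at the start of a second record
            seen_header = True
            header = line[1:].strip()
        else:
            residues.append(line)
    return header, "".join(residues)


# Hypothetical FASTA text; paste your own record here.
example = """>example_protein made-up record for illustration
MKTAYIAKQR
QISFVKSHFS
"""
header, sequence = read_first_fasta_record(example)
print(header)
print("length:", len(sequence))
```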

For each hit, notice the Sequence Identity % above the sequence alignment box.

Also notice the Region range, which tells you how many of your query residues align with the hit. Compare this to the full length of your query sequence.
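The comparison between the Region range and the full query length amounts to a coverage percentage. A minimal sketch, assuming the range is given as first and last query residue numbers (1-based, inclusive). The function name and the example range are made up. Only the query length of 393 (human p53) comes from this article.

```python
def query_coverage(region_start: int, region_end: int, query_length: int) -> float:
    """Percent of the query sequence that falls inside the aligned region."""
    if not (1 <= region_start <= region_end <= query_length):
        raise ValueError("expected 1 <= region_start <= region_end <= query_length")
    return 100.0 * (region_end - region_start + 1) / query_length


# Hypothetical hit aligning to residues 94-312 of a 393-residue query.
print(f"{query_coverage(94, 312, 393):.1f}% of the query is covered by this hit")
```

A hit with high sequence identity but low coverage can still leave most of the query without a template.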

If you click the Download button in the list of hits, you will get the CIF file. If you need PDB file format, click on the PDB ID code and open the Download menu on that single entry page to get all format options.

Are parts (or all) of the query protein intrinsically disordered?

Attempts to determine structure for intrinsically disordered protein will be futile. Therefore, before considering homology modeling or crystallization experiments, it is important to predict whether portions of the query protein are likely to be intrinsically disordered.

Although a fold is required for the function of most proteins, some proteins are intrinsically disordered (natively unstructured) and do not fold, at least by themselves. Often, an intrinsically disordered protein transitions to an ordered state when it binds to a folded partner protein. However, some proteins remain disordered while performing their functions.

By some estimates, 10% of proteins are intrinsically disordered for their full lengths, and about 40% of eukaryotic proteins have at least one loop 50 residues or longer that is intrinsically disordered^[2]. These disordered loops are typically missing from X-ray crystallographic structures because the disorder blurs that portion of the electron density map.

Examples:

Folded: Pyruvate kinase (length 531; e.g. P11979, KPYM_FELCA) has no disordered regions. The crystal structure (1pkm) lacks only 11 residues at the C terminus.
Partially folded: The tumor suppressor protein p53 (length 393; e.g. P04637, P53_HUMAN) is intrinsically disordered at both the N and C termini. There are many crystallographic structures for the folded mid-region (~200 residues), which lack coordinates for 90-some residues at the N terminus, and 90-some at the C terminus. Some solution NMR structures of the N terminus illustrate the disorder (e.g. 2ly4).
Unfolded: Caldesmon from chicken gizzard (length 771; P12957, CALD1_CHICK) has no crystal structures, and is predicted to be disordered for essentially its full length.

Prediction of intrinsic disorder

MobiDB

MobiDB is a meta-server: it summarizes disorder predictions from various other servers that use different methods.

At UniProt.Org, find your protein, then copy its UniProt accession code, something like P04386.
Go to MobiDB.
Enter your UniProt accession code, such as P04386. Do NOT include for example (GAL4_YEAST) or it will say "not found".
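To avoid the "not found" problem, you can check that what you are about to paste is an accession code rather than an entry name. The sketch below uses the accession pattern published in the UniProt help pages. The function name is made up for the example.

```python
import re

# Accession pattern as documented by UniProt (6 or 10 characters).
UNIPROT_ACCESSION = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)


def is_uniprot_accession(text: str) -> bool:
    """True for an accession such as P04386, False for an entry name such as GAL4_YEAST."""
    return UNIPROT_ACCESSION.fullmatch(text.strip()) is not None


for candidate in ("P04386", "GAL4_YEAST", "P04386 (GAL4_YEAST)", "K4QDG1"):
    print(repr(candidate), is_uniprot_accession(candidate))
```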

In 2017, MobiDB changed its output format, and it is rather confusing. There is no color key and the results are poorly explained, if at all. If you know of a better meta-server, please mention it in the discussion page. You may find these instructions helpful^[3].

FoldIndex

The FoldIndex server is a useful adjunct to the MobiDB report, since it is not included in that report.

Is your query protein in the structural genomics pipeline?

Structural Genomics is a worldwide initiative that gained momentum in the early 2000s. Sequences may be chosen for structure determination because they represent a family of sequences for which no member has an empirical 3D structure. It is possible that your query (target) sequence has been selected for structure determination. Although funding enthusiasm for structural genomics has waned in recent years, some institutions do register their target sequences and progress. You can find out whether your sequence has been selected, and how much progress has been made, at the TargetTrack database. If your sequence has been selected, and progress has reached diffraction quality crystals, it may be worthwhile to contact the institution to see if they can expedite publication of the structure.

Limitations of Homology Modeling

Templates are often unavailable, or fragmentary

To create a 3D homology model (also called a comparative model) for a query sequence, the first step is to find a template: a reliable empirical structure with significant sequence identity. Depending on the stringency of your sequence identity criteria, templates will be available for no more than ~30% of query sequences.

Full-length templates are unlikely to be found for larger proteins (>~200 residues). 89% of structures in the Protein Data Bank were determined by X-ray crystallography. Most crystallographic structures represent fragments of full-length proteins, because fragments generally give higher crystallization success^[4]. 10% of structures in the Protein Data Bank were determined by solution NMR, but these tend to be small proteins or single domains. The median molecular mass of structures determined by NMR is 10 kDa^[5] (about 90 amino acids^[6]). NMR is generally not able to determine atomic resolution structures for proteins >30 kDa.
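The conversion from molecular mass to approximate chain length used above is simple arithmetic: divide the mass in Daltons by the average residue mass of 111.4 Daltons given in the notes. A small sketch follows. The function name is made up, and the result is only an estimate, since real proteins deviate from average composition.

```python
MEAN_RESIDUE_MASS_DA = 111.4  # average amino acid mass given in this article's notes


def approx_residue_count(mass_kilodaltons: float) -> int:
    """Estimate the number of residues from a molecular mass in kilodaltons."""
    return round(mass_kilodaltons * 1000 / MEAN_RESIDUE_MASS_DA)


for mass in (10, 30, 50):
    print(f"{mass} kilodaltons is roughly {approx_residue_count(mass)} residues")
```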

In contrast, the median molecular mass of asymmetric units determined by X-ray crystallography is 50 kDa^[5], and a few are very large, such as virus capsids (e.g. 4qyk, ~2 million Daltons; 4v99, 10 million Daltons) or ribosomes (e.g. 4w2i, 4.5 million Daltons).

Errors and uncertainties in the sequence alignment produce errors in the homology model

The quality of a homology model depends upon the quality of the alignment between the query and template sequences. When the sequence identity falls below about 35%, the chances increase for errors in the alignment. Errors in the sequence alignment result in errors in positioning the query residues on the template fold; that is, errors in the 3D model.

Gaps in the sequence alignment cause errors in the model. Gaps are opened in a sequence alignment in order to optimize the alignment. Such gaps may be regarded as insertions or deletions, but since it is usually unclear which, these are commonly called by the noncommittal term indels. The presence of large numbers of gapped residues in a sequence alignment guarantees that there will be errors in the homology model: missing residues, or residues in incorrect positions.

A gap in the template sequence means that the corresponding portion of the query is untemplated. Different homology modeling servers handle this differently. Swiss-Model includes the untemplated query residues, putting them in a loop (which may extend some distance away from the remainder of the domain when the loop is long).

A gap in the query sequence means that the two residues flanking the gap will usually be peptide-bonded in the 3D model, yet the aligned template residues may not be close to each other.
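Before building a model, it can be useful to list where the gaps fall in each aligned sequence, since the two cases described above have different consequences. The following is a minimal sketch, assuming a pairwise alignment given as two equal-length strings with '-' for gaps. The function name and the toy alignment are made up for the example.

```python
import re


def gap_runs(aligned_sequence: str) -> list[tuple[int, int]]:
    """Return (first_column, length) for each run of '-' using 1-based alignment columns."""
    return [(m.start() + 1, len(m.group())) for m in re.finditer(r"-+", aligned_sequence)]


# Hypothetical toy alignment.
query = "MKT--AYIAKQR"
template = "MKSLLAYLA--R"

# Gaps in the template: the query residues in these columns have no template.
print("untemplated query columns:", gap_runs(template))
# Gaps in the query: the query residues flanking each gap must be joined in the model.
print("gaps in query:", gap_runs(query))
```

Long runs in either list mark regions of the model that deserve the most suspicion.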

Templates determined by crystallography often have missing residues. FirstGlance in Jmol reports missing residues and marks their locations clearly. Missing residues have no coordinates in the crystallographic model due to disorder of those residues in the crystal. Thus, even though the sequences may align, some residues are frequently absent in the 3D template, and it is unclear where to position those residues. Some homology modeling servers omit such residues entirely, producing an incomplete homology model.
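A rough way to spot missing residues yourself is to look for jumps in the residue numbering of a chain's alpha carbons. The sketch below does that for PDB-format text. It is only a heuristic: numbering can jump for other reasons, insertion codes are ignored, and residues missing at the chain termini are not detected. In PDB files, REMARK 465 records give the authoritative list. The function names and the demonstration lines are made up for the example.

```python
def numbering_gaps(pdb_text: str, chain_id: str = "A") -> list[tuple[int, int]]:
    """Return (first_missing, last_missing) residue numbers for jumps in one chain's numbering."""
    numbers = []
    for line in pdb_text.splitlines():
        # ATOM records: atom name in columns 13-16, chain in column 22, residue number in 23-26.
        if (line.startswith("ATOM") and len(line) >= 26
                and line[12:16].strip() == "CA" and line[21] == chain_id):
            numbers.append(int(line[22:26]))
    gaps = []
    for previous, current in zip(numbers, numbers[1:]):
        if current - previous > 1:
            gaps.append((previous + 1, current - 1))
    return gaps


def demo_ca_line(serial: int, resseq: int) -> str:
    """Build a made-up alpha carbon ATOM line in fixed-column PDB format."""
    return f"ATOM  {serial:5d}  CA  GLY A{resseq:4d}    {0.0:8.3f}{0.0:8.3f}{0.0:8.3f}"


# Hypothetical chain with residues 10-12 and 20-21 present, 13-19 missing.
demo = "\n".join(demo_ca_line(i, n) for i, n in enumerate([10, 11, 12, 20, 21], start=1))
print(numbering_gaps(demo))
```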

Sidechain rotamer positions will be incorrect

Even when the sequence alignment and template result in a correct backbone fold for the homology model, the sidechain rotamer positions (orientations relative to the alpha carbon position) will be incorrect. Despite knowing where each alpha carbon atom is located, theory does not correctly predict how the sidechains will fit together. At best, the sidechain rotamer positions will avoid steric clashes and electrostatic repulsions of like charges, and may optimize some salt bridges and hydrogen bonds. However, when a high quality empirical model becomes available, the details of sidechain packing in the homology model will be shown to be incorrect.

Strengths of Homology Models

Given the limitations explained above, you might well wonder whether homology models have any uses. Provided that the sequence alignment is reliable (about 35% identity or more), and if the sequence alignment lacks numerous or large gaps (indels), the backbone fold is likely to be correct. This provides a great deal of information despite the inaccuracies in sidechain positions.

The model suggests which residues are on the surface and which are buried.

If mutagenesis studies have shown phenotypic changes, it will be useful to see where the crucial residues lie in the homology model.

The distribution of evolutionarily conserved residues may suggest functional sites. For example, coloring the homology model by evolutionary conservation (e.g. with the ConSurf Server) may show patches or pockets of highly conserved residues. Pay attention to which residues may be missing from the homology model for the reasons explained above. Some missing residues could be highly conserved.

The distribution of charges on the surface may be useful. For example, a large region or pocket with exclusively positive charges may be a binding site for nucleotides, DNA or RNA. A region devoid of charges suggests interaction with something hydrophobic^[7]. Remember that the fine details of charge distribution will be incorrect; however, the general arrangement may be informative. Also pay attention to whether some charged residues are missing in the model, as explained above, due to gaps in the sequence alignment or missing residues in the template. FirstGlance in Jmol quantitates missing charges.
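To get a feel for how many charged residues a model may be missing, you can count charges in the full query sequence and compare with the count for the residues actually present in the model. The following is a simplified sketch that treats Lys and Arg as positive and Asp and Glu as negative. It ignores histidine, the chain termini, and pH, so it is not a substitute for the FirstGlance report. The function name and the toy sequences are made up for the example.

```python
def charge_summary(sequence: str) -> dict[str, int]:
    """Count Lys/Arg as positive and Asp/Glu as negative in a one-letter sequence."""
    seq = sequence.upper()
    positive = seq.count("K") + seq.count("R")
    negative = seq.count("D") + seq.count("E")
    return {"positive": positive, "negative": negative, "net": positive - negative}


# Hypothetical query and the portion of it that ended up in the model.
full_query = "MKRDEEKLAG"
modeled_part = "MKRDE"
print("full query:  ", charge_summary(full_query))
print("modeled part:", charge_summary(modeled_part))
```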

Example: Structure of E. coli DnaC helicase loader is an analysis of a homology model.

How to obtain homology models

Pre-calculated Models

At UniProt.Org, find your protein and click on Structure.

Protein Model Portal

The ProteinModelPortal has been shut down. The webpage merely remains to serve as a relay to established resources pre-calculating protein structure models.

SMR: Swiss Model Repository

SMR displays arc bar graphics depicting, side by side, the structural coverage of pre-calculated homology models and experimental structures for a given UniProt entry. Clicking the bars and then hovering reports model details, e.g. the sequence range for each model. Links to download the models are offered in a separate paragraph.

ModBase

Notice, in the blue box Dataset Information at top right, the date of the latest calculation. You may wish to click start a new calculation to take advantage of more recent templates.

The initial page does not list all models. Open the pull-down menu Select Option, and pick Model Details. Now there is a table below with information about each pre-calculated model. Don't confuse the column PDB Segment with the coverage range, which is in the right-most column as graphics.

Sometimes a model is listed by ModBase that was not listed at ProteinModelPortal or SwissModelRepository (due to low sequence identity, higher unreliability).

To download a model, open the pull-down menu and pick Coordinates.

Generating New Models

This process ensures that you are using the latest templates, and may generate a model with better coverage (and likely lower sequence identity) than the pre-calculated models. It also enables you to select the template that you would prefer to use when several are available.

At UniProt.Org, find your sequence, and copy it in FASTA format.

Go to SwissModel.expasy.org^[8].

It is a good idea to create an account, and log in. This makes it easy to find your models later, although they are not kept on the server more than a week.

Open the menu Modelling at the top, and select Automated Mode.

Paste your sequence into the box, give the project a title, and click Build Model. Processing can take from a few minutes to a few hours.

The results will have a table listing percentages of sequence identity, and templates used. Below will be molecular images for the models. Click on a molecular image to open more information. In the box that opens, click the symbol at right that looks like "v" to open more details.

To download a model, right-click on the blue button Model 01 (or 02, 03, etc.) and pick Download Linked File.

On the Summary page (you may need to click a link Summary), it is worthwhile to click Show full template details. This table shows coverage for each model. You may want a model from this table that was not selected by Swiss Model. If you open out the row for a particular model (click on the "v" at the right), there is a blue button to Build Model.

How To Explore 3D Models

There are many superb molecular graphics programs. Most are quite challenging to use ("not user friendly").

FirstGlance in Jmol

FirstGlance in Jmol is perhaps easiest to use, has a great deal of help for interpreting what you see, and is nevertheless quite powerful. (See What Is FirstGlance in Jmol? and FirstGlance in Jmol.)

Empirical model (PDB code)
- Write down the PDB code of interest.
- Go to FirstGlance.Jmol.Org.
- Enter your PDB code in the slot.

Homology Model
- Download your homology model(s).
- Go to FirstGlance.Jmol.Org.
- Click on Upload your own PDB file and designate your homology model. Click View in FirstGlance. Your molecule should appear momentarily.

Hydrophobic/Polar

Most of the views under the Views tab will be informative. Particularly important is the Hydrophobic/Polar view. Soluble proteins should not have large areas (> ~ 15 Å across) of hydrophobic surface. Polar residues should be sprinkled over the entire surface. An exception is lipases, e.g. 1lpm, where the pocket at the catalytic site is hydrophobic. Other exceptions would, of course, be insoluble proteins, such as integral or trans-membrane proteins, e.g. 1bl8, 7ahl.

Figures (colored Hydrophobic, Polar):
- Hydrophilic surface of a homology model.
- Hydrophobic catalytic face of lipase (1lpm).
- Transmembrane protein (3waj). The transmembrane hydrophobic zone is indicated by the red bracket.

Hydrophobic Core

Soluble proteins should have a well-defined hydrophobic core. To see this in FirstGlance, under the Views tab, click Hydrophobic/Polar, and then turn on the Slab button. If the protein has multiple domains, each domain should have a hydrophobic core. If there is no hydrophobic core in a soluble protein model, the model most likely has very substantial errors.

Figure (colored Hydrophobic, Polar): Hydrophobic cores in domains (circled in red; 4cpa).

Charge Distribution

Evolutionary Conservation

Patches of highly conserved amino acids in a homology model can be very informative, as such patches often indicate functional sites.

Go to the ConSurf Server: ConSurf.tau.ac.il.
Click Amino Acids.
Click YES there is a known protein structure.
Enter your PDB code, or click Choose File to upload a homology model. Click Next.
Select the chain of interest. For a homology model, there will usually be only one chain, "A".
Select NO you have not prepared a Multiple Sequence Alignment (MSA) that you wish to upload. The server will generate the MSA for you.
Leave parameters at their defaults.
Check manually for "Select homologs ...".
Enter a job title and your email address, then click the Submit button. The first step, gathering similar sequences, typically takes less than 5 minutes.

When the sequences are gathered, you will see SELECT SEQUENCES.
Continue as explained here: ConSurfDB_vs._ConSurf#Limiting_ConSurf_Analysis_to_Proteins_of_a_Single_Function.

Notes and References

[1] A 3D structure similarity search gives tubulin as one of the closest matches to ftsZ, with an RMSD (alpha carbons) of <2.6 Å.
[2] Tompa P. Intrinsically unstructured proteins. Trends Biochem Sci. 2002 Oct;27(10):527-33. PMID:12368089
[3] The MobiDB instructions were designed to supplement Section 18 of this assignment.
[4] The overall success rate for solving the 3D structure of a given protein sequence is about 5%. Failures commonly occur because the expressed protein is not sufficiently soluble (about half of expressed sequences), because soluble proteins fail to crystallize, or because crystals are not well ordered.
[5] Median molecular masses in the PDB were determined in December, 2014.
[6] The average mass of an amino acid is 111.4 Daltons, weighted according to the frequencies of occurrences in proteins.
[7] Lipases commonly have a hydrophobic surface (devoid of charges) around their active sites. See Lipase lid morph.
[8] Studer G, Tauriello G, Bienert S, Biasini M, Johner N, Schwede T. ProMod3-A versatile homology modelling toolbox. PLoS Comput Biol. 2021 Jan 28;17(1):e1008667. doi:10.1371/journal.pcbi.1008667. PMID:33507980

Proteopedia Page Contributors and Editors

Eric Martz, Juergen Haas, Jaime Prilusky

Retrieved from "http://proteopedia.org/wiki/index.php/Practical_Guide_to_Homology_Modeling"