ConSurfDB vs. ConSurf


Evolutionary Conservation is introduced at Introduction to Evolutionary Conservation, and treated in somewhat greater depth in the article Conservation, Evolutionary. These describe how conservation patterns in 3D can help to identify functional sites in proteins. Proteopedia displays conservation patterns pre-calculated by ConSurfDB, when available. These are usually based on broad protein families that include sequences of proteins with multiple functions. Consequently, they usually obscure conservation present in a family of proteins with a single function (see Caveats). The present article describes the mechanisms utilized by the ConSurfDB and ConSurf servers, and how to use the latter to reveal conservation within a family of proteins with a single function.


The Two ConSurf Servers

There are two ConSurf servers:

  • ConSurfDB:
    • As of January 2018, ConSurfDB had not been updated with new Protein Data Bank entries since January 2013.
    • Has pre-calculated results for every chain in the PDB.
    • Proteopedia's Evolutionary Conservation resource displays results from ConSurfDB.
    • Results typically obscure some conservation related to a protein's function because the analysis typically included proteins of multiple functions (see ConSurfDB Often Obscures Some Functional Sites).
  • ConSurf:
    • You submit proteins of interest and wait for the analysis to be completed.
    • Enables you to pick the sequences used in the analysis from a list with checkboxes.
    • Highly flexible with many configurable parameters and several sequence database options.
    • You can upload your own multiple sequence alignment, or phylogenetic tree, for use in the analysis.

Both servers use state-of-the-art methods that are published in peer-reviewed journals. For comparisons with other methods, see Other Evolutionary Conservation Servers.

Both servers permit you to download results. This is a good idea since the continual growth of sequence databases and improvements in analysis algorithms will give at least slightly different results for the same jobs run several months or more apart. Also, results are periodically deleted from the ConSurf server to conserve disk space.

Examining Functions of Proteins in ConSurf-DB's MSA


As explained above, ConSurf-DB typically includes proteins with more than one function in its conservation analysis. Before deciding whether to do a ConSurf Server job that limits the analysis to proteins of a single function, you may want to see what proteins ConSurf-DB included in its analysis. Here is how to see the names (which hopefully reveal the functions) of the proteins included in ConSurf-DB's analysis of a protein chain. (The following steps are needed in May, 2009. A request to make this easier has been sent to the ConSurf-DB development team.)

  1. Go to the ConSurf-DB website (the DB, distinct from the ConSurf Server).
  2. Enter the PDB code (PDB ID) for the protein of interest, and click the Show chains button.
  3. Click the button for complete, step by step computational results for the chain of interest.
  4. Under Alignment, note the number of sequences used.
  5. Under Output Files, click on PSI-BLAST output. This will download a compressed file containing seq.blast.
  6. Windows XP or Vista:
    1. Double click on the downloaded file to unzip it. Right click on seq.blast and Copy. Right click on your Desktop (or elsewhere of your choosing) and Paste. Now you have the unzipped file seq.blast.
    2. Open seq.blast in a program that can number lines. (Notepad and Wordpad cannot number lines.) Start MS Word or the free Open Office Writer program, then use the File menu to Open seq.blast.
    3. Delete everything above the first sequence, so the first sequence will be line number 1. The first sequence follows the header Sequences producing significant alignments:.
    4. Number the sequences by numbering the lines.
      1. MS Word: search for "add line numbers" to get instructions.
      2. Open Office Writer: Save the file as seq_blast.txt. (This enables line numbering.) Open the Tools menu, and select Line Numbering....
  7. Mac OS X:
    1. In the Finder, right-click (ctrl-click) on the file seq.blast, then Open With an application that can number lines of text. An excellent free one is Textwrangler from BareBones.Com.
    2. Delete everything above the first sequence, so the first sequence will be line number 1. The first sequence follows the header Sequences producing significant alignments:.
    3. Number the sequences by numbering the lines.
      1. MS Word: Set the Open dialog to enable All Files. Search for "add line numbers" to get instructions. You may need to select all and change the font (e.g. to Arial) to get the description of each sequence to fit on one line.
      2. TextWrangler (or BBEdit): Open the View menu, and under Text Display click Show Line Numbers.
      3. iWork Pages appears to lack a line numbering capability.
  8. Now you have the sequences numbered. Find the number equal to the number of sequences used reported under Alignment by ConSurf-DB.

If the functions of the proteins for this sequence number (and lower numbers) differ from that of the protein of interest, then ConSurf-DB included proteins of multiple functions in its analysis. This tends to obscure patches of conservation that exist among proteins with the same function as the query protein of interest.
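As an alternative to numbering lines in a word processor, the hit list can be numbered with a short script. The following is a minimal Python sketch, assuming that seq.blast is plain-text PSI-BLAST output in which the hit lines follow the header Sequences producing significant alignments: and end at the next blank line. The function name number_hits and the sample text are made up for illustration and are not part of ConSurf-DB. PSI-BLAST output from several iterations may repeat the header; this sketch reads only the first list.

```python
HEADER = "Sequences producing significant alignments:"


def number_hits(blast_text):
    """Return (number, line) pairs for the hit lines after the first header."""
    lines = blast_text.splitlines()
    for i, line in enumerate(lines):
        if line.startswith(HEADER):
            start = i + 1
            break
    else:
        raise ValueError("header not found: " + HEADER)

    hits = []
    for line in lines[start:]:
        if not line.strip():
            if hits:        # a blank line after the list ends it
                break
            continue        # skip blank lines between header and first hit
        hits.append(line.rstrip())
    return list(enumerate(hits, start=1))


# Made-up sample in the general layout of BLAST text output.
sample = """\
Some preamble text

Sequences producing significant alignments:          (bits) Value

sp|P00001|EXAMPLE_A  Hypothetical kinase A            250   1e-66
sp|P00002|EXAMPLE_B  Hypothetical kinase B            240   2e-63
sp|P00003|EXAMPLE_C  Hypothetical phosphatase C       100   3e-21

>sp|P00001|EXAMPLE_A Hypothetical kinase A
"""

for number, line in number_hits(sample):
    print(number, line)
```

For a real file, read its contents first, for example with open("seq.blast").read(), and pass the text to number_hits. The protein names then appear next to their sequence numbers, so the number reported under Alignment can be looked up directly.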

Limiting ConSurf Analysis to Proteins of a Single Function

This section was updated in June, 2017 to correspond to changes in the ConSurf Server.

As explained above, the ConSurf-DB Evolutionary Conservation scene available in Proteopedia often includes proteins with multiple functions. However, the best way to find all functional sites by conservation analysis is to limit the analysis to proteins with a single function. A procedure for doing this follows.


  1. Go to the ConSurf Server (distinct from ConSurf-DB).
  2. Fill out the form. For your first run, all options can be left at their default settings. When you get to the section Select homologs for ConSurf analysis, be sure to check manually.
  3. Enter your email address and click the Submit button.
  4. After a few minutes, a green message, SELECT SEQUENCES, will appear. The job cannot continue until you select the sequences.
  5. Look at the names of the proteins in the list that has checkboxes, under the header "Sequences producing significant alignments:". Find the first case where the function of the protein is not the same as the protein of interest. Usually you will want to exclude sequences for proteins of different functions.
  6. Just below the large red line Please choose which sequences you want to use for ConSurf calculation is a form. Put the number of the last sequence having the same function as the protein of interest in the box "Select the first [ .... ] sequences". Then click on the "Update selection" button.
    1. ConSurf will not accept >500 sequences. 200-250 sequences are plenty. Using more sequences simply loads the server unnecessarily and delays returning your result. If the number of the last sequence having the same function as the query protein is higher than 250, use the radio buttons labeled "only every 2nd, 3rd, ..." to reduce the total number of sequences selected while sampling the full diversity of the desired sequences.
    2. The form is confusingly labeled. If you check an "only every" number, then you will need to divide the number in the slot labeled "Select the first" by the "only every" number you selected. For example, if the first 472 sequences have the same function as the query protein, check "only every 2nd" and enter 236 (namely, 472/2) in the "Select the first" slot.
  7. Examine the list of sequences to make sure that only the desired sequences are checked. (Of course you may check or uncheck individual sequences if you wish.)
  8. When you are satisfied, scroll to the very bottom of the page and click the Submit button.
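The arithmetic in step 6 can be sketched in Python. This is only an illustration under stated assumptions: the function name selection_settings is hypothetical, the target of 250 sequences comes from the guidance above, and the form may not offer every "only every" value that the function can return.

```python
import math


def selection_settings(last_same_function, target=250):
    """Return (every_nth, select_first) for the sequence-selection form.

    last_same_function: list number of the last sequence that has the
        same function as the query protein.
    target: desired upper bound on the number of selected sequences.
    """
    if last_same_function < 1:
        raise ValueError("need at least one sequence")
    # Smallest sampling step that keeps the selection at or below target.
    every_nth = max(1, math.ceil(last_same_function / target))
    # The form expects the count after sampling, not the list position.
    select_first = last_same_function // every_nth
    return every_nth, select_first


print(selection_settings(472))  # (2, 236), matching the example above
print(selection_settings(180))  # (1, 180): no sampling needed
```

The first value is the "only every" choice and the second is the number to enter in the "Select the first" slot.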

Too Many Sequences

ConSurf will list up to 2,000 sequences from which to select. In some cases, these sequences are all too similar. Some proteins will retrieve >5,000 sequences with an expectation value (E value) < 1.0e-4 (1.0 times ten to the -4), the default threshold. Then the 2,000th sequence listed may still be very similar to the first sequence listed. This would be true if the 2,000th sequence has a very small E value, such as 1.0e-100. In such a case, you may wish to try searching the Swiss-Prot database, which is much smaller than the default Uniref-90 database. Start a new job, with the only difference being the database searched.

Too Few Sequences

If your results have more than a few amino acids with insufficient data (yellow color), you need more sequences. Try repeating the procedure above with one change. Under "Choose parameters to homolog search algorithm", change the Proteins Database to UniProt or NR (larger databases than the default Uniref90).

Using Your Results

The results of this "one protein function" job will usually enable you to identify more functional sites than did the ConSurf-DB result built into Proteopedia.

See below for instructions on how to make a green-link scene in Proteopedia that shows your single-function ConSurf result.

The ConSurf-DB Mechanism


Because results from the ConSurf DataBase server, ConSurf-DB[1], are displayed within Proteopedia as Evolutionary Conservation, an overview of its methods is provided here. ConSurf-DB pre-calculates conservation levels for each amino acid in every protein chain in the Protein Data Bank. It went into service in 2008. It uses state-of-the-art methods, all published in peer-reviewed journals[1].

ConSurf-DB Process

  1. A list of unique protein chains is extracted from the Protein Data Bank. Chains shorter than 30 amino acids are not processed because they do not contain enough information for reliable phylogenetic tree construction. Non-standard residues are converted to the closest standard amino acids. Chains with more than 15% non-standard residues are not processed. Chains that could not be processed are colored gray in Proteopedia -- see the color key at the top of this page.
  2. The amino acid sequence of each protein chain is submitted to PSI-BLAST[2] for collection of related sequences from UniprotKB/Swiss-Prot[3]. Three iterations are performed using an expectation value[4] cutoff of 10^-3 (1.0e-3).
  3. The sequences gathered with PSI-BLAST are then filtered (see below) using a scheme that attempts a balance between limiting the sequences to close homologues, and including distant sequences that do not share structure or function.
  4. The filtered sequence set is multiply aligned with MUSCLE (a multiple sequence alignment algorithm that out-performs CLUSTALW).
  5. A phylogenetic tree is constructed from the multiple sequence alignment (MSA) using the Rate4Site program developed by the ConSurf team.
  6. Rate4Site then calculates an evolutionary rate for each position in the MSA using a Bayesian approach shown by the ConSurf team to be superior[5]. "The amino acid evolution is traced using the JTT[6] substitution model. High evolutionary rate represents a variable position while low rate represents an evolutionarily conserved position."[1]
  7. "The conservation scores are normalized so that the average over all residues is zero, and the standard deviation is one."[1] Thus, conservation scores are relative, not absolute and comparing them between different protein families might be misleading (see Caveat above).
  8. The normalized conservation scores are then divided into nine levels from 1 (highly variable) to 9 (highly conserved).
  9. Colors mapped to the nine conservation levels, from turquoise (1) to burgundy (9), are applied to the 3D protein structure visualized in FirstGlance in Jmol. A coloring script for RasMol is also provided.
  10. A confidence interval for the conservation level is calculated for each amino acid position in the MSA. When this indicates low reliability, the position is colored yellow, signifying that the data were insufficient to assign a meaningful conservation level.
  11. An Average Pairwise Distance (APD) is calculated to describe the diversity of sequences in the MSA (see below).
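The normalization in step 7 can be illustrated with a short Python sketch. The rates below are made up, the function name normalize_scores is hypothetical, and the use of the population standard deviation is an assumption, since the quoted description does not say which estimator Rate4Site uses.

```python
import statistics


def normalize_scores(rates):
    """Shift and scale per-position rates to mean 0 and standard deviation 1."""
    mean = statistics.fmean(rates)
    sd = statistics.pstdev(rates)  # assumption: population standard deviation
    if sd == 0:
        raise ValueError("all rates are identical; cannot normalize")
    return [(rate - mean) / sd for rate in rates]


# Made-up evolutionary rates for six alignment positions.
rates = [0.2, 0.5, 1.0, 1.5, 3.0, 0.4]
scores = normalize_scores(rates)
for position, score in enumerate(scores, start=1):
    # A negative score marks a slower (more conserved) position than average.
    print(position, round(score, 2))
```

Because every score is measured against the mean and spread of the same protein's positions, a given score does not correspond to the same absolute rate in a different protein family.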

The results of each stage of the above process may be viewed for each chain at ConSurf-DB. In the initial run (February 2008), roughly 100 computer CPUs were utilized concurrently via a distributed computing system. Processing of the 30,918 unique protein chains in the PDB took about five days, or an average of roughly 30 minutes per chain.


Filtering of the sequences gathered for each protein chain is crucial to making the ConSurfDB results maximally informative. Filtering consists of the following steps.

  1. Sequences with more than 95% sequence identity to the query sequence are discarded.
  2. Sequences shorter than 60% of the query sequence are discarded.
  3. Locally aligned sequence fragments that overlap by over 10% are discarded.
  4. Redundant sequences (>95% identical) are removed using CD-HIT[7].
  5. A maximum of 300 sequences meeting the above criteria is used (the 300 with the lowest expectation values[4], that is, most closely related to the query sequence).
  6. If the above process yields fewer than 50 sequences, the entire process is repeated using the Clean_UniProt database, which is about ten times larger than UniProtKB/Swiss-Prot. Clean_UniProt is a version of the UniProt database that attempts to exclude mutant or dubious sequences.
  7. If the above process yields fewer than 5 sequence homologs, no calculation is performed due to insufficient data. In February, 2008, this occurred for 1,348 chains out of 30,918 (4%).
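Part of this filtering scheme can be sketched in Python. This is an illustration only, not ConSurf-DB's code: the Hit record and both function names are hypothetical, and the overlap check (step 3) and CD-HIT clustering (step 4) are left out because they need alignment data that the sketch does not model.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    name: str
    identity: float  # percent identity to the query, 0-100
    length: int      # number of residues in the hit
    evalue: float    # PSI-BLAST expectation value


def filter_hits(hits, query_length, max_identity=95.0,
                min_length_fraction=0.60, max_sequences=300):
    """Apply the identity, length, and count limits (steps 1, 2, and 5)."""
    kept = [
        h for h in hits
        if h.identity <= max_identity
        and h.length >= min_length_fraction * query_length
    ]
    kept.sort(key=lambda h: h.evalue)  # lowest E value first
    return kept[:max_sequences]


def next_action(n_kept, used_larger_database):
    """Mirror the rules for small result sets (steps 6 and 7)."""
    if n_kept < 50 and not used_larger_database:
        return "repeat with the larger database"
    if n_kept < 5:
        return "no calculation (insufficient data)"
    return "proceed to alignment"


# Made-up hits for a query of 200 residues.
hits = [
    Hit("near_identical", 99.0, 200, 1e-120),
    Hit("good_a", 60.0, 190, 1e-80),
    Hit("fragment", 70.0, 50, 1e-20),
    Hit("good_b", 45.0, 210, 1e-90),
]
kept = filter_hits(hits, query_length=200)
print([h.name for h in kept])
print(next_action(len(kept), used_larger_database=False))
```

In this made-up case the near-identical sequence and the short fragment are discarded, and the two remaining hits fall below 50, so the search would be repeated with the larger database.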

Average Pairwise Distance

An Average Pairwise Distance (APD) is calculated to describe the diversity of sequences in the MSA generated during the processing of each chain. A value of 0.01 means that on average, there is one amino acid replacement for every 100 positions. Optimally informative results are obtained when the APD is between roughly 0.5 and 1.5.
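The idea of an average over all sequence pairs can be illustrated with a simplified Python sketch. It uses the uncorrected p-distance (the fraction of aligned positions that differ), which is an assumption made for illustration; this article does not state how ConSurf-DB computes the distances. A p-distance cannot exceed 1, so values from this sketch are not comparable with the 0.5 to 1.5 range quoted above. The function names and the tiny alignment are made up.

```python
from itertools import combinations


def p_distance(seq_a, seq_b, gap="-"):
    """Fraction of differing positions, ignoring columns with a gap in either."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != gap and b != gap]
    if not pairs:
        raise ValueError("no ungapped columns shared by the two sequences")
    return sum(a != b for a, b in pairs) / len(pairs)


def average_pairwise_distance(aligned):
    """Mean p-distance over all pairs of aligned sequences."""
    if len(aligned) < 2:
        raise ValueError("need at least two sequences")
    distances = [p_distance(a, b) for a, b in combinations(aligned, 2)]
    return sum(distances) / len(distances)


# Three made-up aligned sequences of equal length.
msa = ["MKTAYIAKQR", "MKTAYIAKQK", "MRTAYLAKQR"]
print(round(average_pairwise_distance(msa), 3))
```

Here the three pairs differ at 1, 2, and 3 of 10 positions, so the average is 2 replacements per 10 positions.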

The ConSurf Server

The ConSurf Server, first available in 2001[8][9][10] with many subsequent enhancements, can calculate and display the conservation pattern for 3D structures completely automatically. It should be used whenever the pre-calculated result at the ConSurf-DB needs improvement (for example, see above), or if you have your own multiple sequence alignment (MSA) that you wish to use. The default settings of ConSurf need to be adjusted in order to get an optimally informative result. For an example with default settings, see the cytochrome c comparison at ConSurf-DB. The main adjustment needed is to gather an adequate number of sequences for proteins of the same function as your protein of interest (see above).

The ConSurf Server uses the same state-of-the-art methods as ConSurf-DB, all of which are published in peer-reviewed journal articles. Unlike ConSurf-DB's pre-calculated results, the ConSurf Server permits considerable customization. For example, the user may specify the number of sequences to use, choose the database from which sequences are obtained (Swiss-Prot or UniProt), set the Expectation cutoff[4], set the number of PSI-BLAST iterations, or submit their own multiple sequence alignment or phylogenetic tree. You can also upload your own PDB file, which enables you to process unpublished data, theoretical models, or "trimmed" chains, e.g. a domain of interest from a long chain.

In brief, the ConSurf Server uses the following process by default:

  1. Obtains the protein sequence for the specified PDB code (or uploaded PDB file) and chain.
  2. Gathers closely related sequences from UNIREF90 (or another database that you specify) with a PSI-BLAST search. E value cutoff[4], number of iterations, and number of sequences to use are configurable.
  3. Filters the sequences, by default eliminating those redundant at 95% or higher identity with each other, and those with less than 35% sequence identity to the query sequence. These percentages are adjustable.
  4. Optionally enables the user to manually select which sequences will be used, from a list with checkboxes. In particular, this enables users to limit the analysis to proteins having the same function as the protein of interest (see above).
  5. Does a multiple sequence alignment with MAFFT. (Or you can choose a different algorithm or upload your own MSA.)
  6. Constructs a phylogenetic tree using neighbor joining with ML distance. (Or you can choose a different algorithm or upload your own tree.)
  7. Calculates a conservation score with confidence interval for each amino acid. Classifies the conservation scores into nine levels, and maps them to standard conservation level colors (see color key at the top of this page). Marks residues for which the conservation score confidence interval is too large, hence the conservation score is unreliable ("insufficient data").
  8. Displays the protein, colored by conservation, in interactive 3D, using FirstGlance in Jmol, Chimera, PyMOL, or Protein Explorer.


This example needs to be updated. It is on my list to do. Eric Martz 09:13, 25 April 2010 (IDT)

Evolutionary conservation reported by ConSurf-DB for Major Histocompatibility Class I alpha chain in 2vaa.

At right is the pattern of evolutionary conservation and variability reported by ConSurf-DB for the alpha chain of Major Histocompatibility Complex Class I (chain A of 2vaa).

Because the scene at the right contains no amino acids marked insufficient data, and no chains with no data, the yellow and gray colors need not be included in the color key.

For all the available variations of the ConSurf color key, see Help:Color_Keys#ConSurf.

2vaa contains three chains. Here, ConSurf colors are applied only to the alpha chain (chain A), while the beta chain (chain B) and the peptide (chain P) are shown as gray backbone traces. See also How to Insert a ConSurf Result Into a Proteopedia Green Link.

Examples of conserved patches on other proteins, revealed by ConSurf, will be found in the articles on


  1. Goldenberg O, Erez E, Nimrod G, Ben-Tal N. The ConSurf-DB: pre-calculated evolutionary conservation profiles of protein structures. Nucleic Acids Res. 2009 Jan;37(Database issue):D323-7. Epub 2008 Oct 29. PMID:18971256
  2. PSI-BLAST (Position Specific Iteration-BLAST) is an extension of the Basic Local Alignment Search Tool (BLAST) that is more sensitive at finding distantly related sequences. See PSI-BLAST at Wikipedia and PSI-BLAST at NCBI.
  3. From UniProtKB help: "UniProtKB/Swiss-Prot (reviewed) is a high quality manually annotated and non-redundant protein sequence database, which brings together experimental results, computed features and scientific conclusions."
  4. Expectation Value (E value): When searching a sequence database with a query sequence, e.g. using BLAST or PSI-BLAST, each found sequence can be characterized by an E value. It is the number of hits expected by chance with the sequence matching level observed, taking into account the size of the sequence database and length of the query sequence. Low values of E (much less than one) mean increasing significance of the match.
  5. Mayrose I, Graur D, Ben-Tal N, Pupko T. Comparison of site-specific rate-inference methods for protein sequences: empirical Bayesian methods are superior. Mol Biol Evol. 2004 Sep;21(9):1781-91. Epub 2004 Jun 16. PMID:15201400
  6. Jones DT, Taylor WR, Thornton JM. The rapid generation of mutation data matrices from protein sequences. Comput Appl Biosci. 1992 Jun;8(3):275-82. PMID:1633570
  7. Li W, Godzik A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006 Jul 1;22(13):1658-9. Epub 2006 May 26. PMID:16731699
  8. Armon A, Graur D, Ben-Tal N. ConSurf: an algorithmic tool for the identification of functional regions in proteins by surface mapping of phylogenetic information. J Mol Biol. 2001 Mar 16;307(1):447-63. PMID:11243830
  9. Glaser F, Pupko T, Paz I, Bell RE, Bechor-Shental D, Martz E, Ben-Tal N. ConSurf: identification of functional regions in proteins by surface-mapping of phylogenetic information. Bioinformatics. 2003 Jan;19(1):163-4. PMID:12499312
  10. Landau M, Mayrose I, Rosenberg Y, Glaser F, Martz E, Pupko T, Ben-Tal N. ConSurf 2005: the projection of evolutionary conservation scores of residues on protein structures. Nucleic Acids Res. 2005 Jul 1;33(Web Server issue):W299-302. PMID:15980475

Proteopedia Page Contributors and Editors

Eric Martz, Timothy Gregory, Joel L. Sussman
