Online Sources of Sequence Alignments
This is by no means an exhaustive list, but it includes several sources of
protein multiple sequence alignments for use in Chimera.
Why you might want such alignments:
- you have a protein structure and want to show which parts are
more or less conserved in its family or superfamily
- you have a protein structure and want to find other structures
in the same family or superfamily and compare them
- you are interested in a particular family or superfamily
and want to look up a sample of structures and compare them
I would then work with the sequences and structure(s) using Chimera, for
example to evaluate conservation, superimpose structures,
evaluate conformational variability, or morph between different structures.
Related Chimera tutorials:
Sequences and Structures,
Superpositions and Alignments
Note: Chimera is not meant to handle large multiple sequence alignments
(those containing several hundred to thousands of sequences).
A good tool for working with large alignments is Jalview.
It allows redundancy-filtering and saving a smaller alignment in
various formats (aligned FASTA, etc.) that can be read into Chimera.
Jalview also calculates per-column conservation directly and via
web service; these annotation values can be exported as CSV,
but it can be difficult to reformat them properly into a
sequence alignment header file
(with the same number of positions as the filtered alignment) or
residue attribute file for use in Chimera.
A direct connection between Jalview and Chimera is under development.
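As a rough illustration of the reformatting step described above, here is a minimal Python sketch that turns per-column scores into the text of a Chimera residue attribute file. It assumes you have already read the exported scores into a list with one value per column of the filtered alignment, that the structure's residues are numbered consecutively starting at first_resnum, and that the attribute file layout is the usual tab-indented one; the function name, the attribute name jalviewCons, and all parameters are hypothetical choices for this example, not part of Jalview or Chimera.

```python
def write_residue_attributes(aligned_seq, column_scores, first_resnum,
                             chain_id, model_id=0, attr_name="jalviewCons"):
    """Return the text of a Chimera residue attribute file.

    aligned_seq   -- the structure's sequence as it appears in the alignment,
                     gaps included ('-' or '.')
    column_scores -- one score per alignment column (assumed already parsed)
    first_resnum  -- residue number of the first non-gap character
                     (assumes consecutive numbering with no breaks)
    """
    if len(aligned_seq) != len(column_scores):
        raise ValueError("need exactly one score per alignment column")

    lines = [
        "attribute: %s" % attr_name,
        "match mode: 1-to-1",
        "recipient: residues",
    ]
    resnum = first_resnum
    for char, score in zip(aligned_seq, column_scores):
        if char in "-.":
            continue  # gap: this column has no residue in the structure
        # Data lines are tab-indented: <tab>atom-spec<tab>value
        lines.append("\t#%d:%d.%s\t%s" % (model_id, resnum, chain_id, score))
        resnum += 1
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Toy example: four columns, one of them a gap in the structure's sequence.
    print(write_residue_attributes("A-CD", [1.0, 2.0, 3.0, 4.0], 10, "A"))
```

Structures with numbering gaps or insertion codes would need a real column-to-residue mapping instead of the simple counter used here.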
The list is organized as follows:
- (A) precalculated sequence alignments of known structures
- (B) precalculated sequence alignments,
structures not necessarily known
- (C) server-generated multiple alignment from a single input
- (D) DIY: find individual sequences yourself,
enter them into an alignment server
Descriptions here are minimal.
I recommend consulting literature references and/or other documentation
provided on the sites to better understand their contents and methodology.
Points to consider:
- How diverse a set of sequences is appropriate for your purposes?
Must the proteins have identical functions, or are similar structures and
related functions sufficient?
Be aware that different sources define
family and superfamily differently,
but in all of them a family is the more closely related set
and a superfamily the more diverse one.
- How many related structures are known? It is generally more convenient
for sequence-structure work to use sequence alignments of known structures
(as in (A))
but that is only possible when several structures have been solved. If
only a single structure (perhaps even a model) is available,
you will have to use sequences without known structures to map conservation.
- One advantage of using precalculated alignments from a database,
or an alignment calculated by a server,
is that in your own published work, you can cite the source rather than
having to describe in detail and justify how you generated your own alignment.
The databases in group (A) contain sequence alignments of proteins with experimentally
determined 3D structures. Typically the names in the alignment are
structure identifiers, which makes it easy to autofetch all the structures in
a single step in Chimera (from the sequence alignment window, choose
Structure... Load Structures). Alternatively, you can fetch just
a subset of the structures individually with the open command or
File... Fetch by ID.
- alignments are for families, generally narrow sets of proteins
with high similarities
- download *.pir or *.ali sequence alignment (either variant of
aligned NBRF/PIR format is accepted by Chimera)
- can also download a coordinate file in which each structure is a
different chain; this is not as convenient as separate models in Chimera,
so I prefer to fetch structures as described above, then superimpose
them using the sequence alignment
- can autofetch structures from the PDB since the sequences
are named with PDB IDs
Tip: getting rid of extra chains
Often structures include additional chains that are not associated with
the sequence alignment and not needed for the intended analyses. These
chains may be additional copies of the same protein or different
macromolecules. Here is a trick for removing such unassociated chains:
- With the mouse in the sequence alignment, draw a box that includes
at least one associated residue from each structure.
That will select the associated residues.
- Click into the main graphics window and press the keyboard up arrow
key to promote the selection from residues to chains.
- Invert the selection to contain unwanted atoms:
Command: sel invert
Menu: Select... Invert (all models)
- Delete the selection:
Command: del sel
Menu: Actions... Atoms/Bonds... delete
- structure-based multiple alignments of close and distant structural
clusters are available for some T and H groups in aligned FASTA format,
with sequences named by domain ID, which includes PDB ID
- choose Alignments tab, save FASTA format as plain text
- “core” alignment has “domains in cluster”
number of sequences, “expanded” has many more
- structure-based alignments of
superfamilies filtered at 40% sequence identity
(SCOP release 1.75 from Jun 2009)
- *.ali files (aligned NBRF/PIR format)
- superimposed structures can also be downloaded
- PASS2 sequence names can be converted to SCOP IDs
(e.g. for Chimera fetch by ID) by changing any “-”
(hyphen) to “_” (underscore)
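The renaming rule above is simple enough to script. The following is a minimal sketch, assuming the sequence names sit on the header lines of the aligned NBRF/PIR file in the usual ">P1;name" form; the function names and the example identifiers are hypothetical and made up for illustration.

```python
def pass2_to_scop_id(name):
    """Convert a PASS2 sequence name to a SCOP ID by replacing '-' with '_'."""
    return name.replace("-", "_")


def scop_ids_from_pir(text):
    """Collect converted names from the header lines of aligned PIR text.

    Assumes headers look like '>P1;name'; anything else is ignored.
    """
    ids = []
    for line in text.splitlines():
        if line.startswith(">") and ";" in line:
            name = line.split(";", 1)[1].strip()
            ids.append(pass2_to_scop_id(name))
    return ids


if __name__ == "__main__":
    # Made-up two-sequence example in aligned PIR layout.
    example = (
        ">P1;d1abca-\n"
        "structure\n"
        "ACDE*\n"
        ">P1;d2xyzb1\n"
        "structure\n"
        "AC-E*\n"
    )
    for scop_id in scop_ids_from_pir(example):
        print(scop_id)
```

The resulting IDs could then be used one at a time with Chimera's fetch by ID.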
- manually curated structure-based alignments for proteins
with non-trivial relationships such as permutations
- you can browse listings within the
alignment categories: fragment, homologous, fold
- aligned FASTA format
- sequence names start with PDB IDs
If the corresponding tree in New Hampshire (aka Newick) format is available,
it can be loaded after the sequence alignment has been opened.
- PFAM (also tree files)
- for Chimera purposes, I use “seed alignments”
(many fewer sequences than the full PFAM alignments)
- don't get their MSF format (last time I checked, it was wrong),
but Selex, Stockholm, FASTA, and the tree format are all fine
- for some families, corresponding structures are listed;
sometimes structures are not listed even though they exist
- can link out to PFAM from structure pages at the RCSB PDB
Tip: manual sequence-structure association
- Often your structure will not be similar enough to any sequence in the
alignment to associate automatically. Some things to try:
- Use Structure... Associations in Multalign Viewer to
compare all the sequences to the structure and associate it with the best
match even though it does not meet the automatic association criteria.
The resulting association must be examined, since it may not be good
enough to be usable. For example, the forced association of structure
2gbp with PF00532_seed.slx
only associates the first few residues of the structure with the
last few residues of a sequence and is obviously wrong.
- Use Edit... Add Sequence in Multalign Viewer and add
the sequence from the structure to the alignment. This may require several
cycles of unsuccessfully adding the sequence, using Edit... Delete
Sequences/Gaps to remove it, and adding it back again with different
alignment parameters. For example, adding the sequence of
2gbp to PF00532_seed.slx does not work with the default
parameters, but is reasonably successful using the BLOSUM-30 matrix.
- SFLD (Structure-Function Linkage Database)
(family, subgroup, and superfamily alignments,
some Chimera “active site” sessions)
- deep and detailed coverage but of relatively few enzyme superfamilies
- lists some but not all structures
- can set up browser (except Safari) to open structures, sequence
alignments, sessions directly in Chimera
- can get an alignment that includes your query sequence if it matches HMMs
in the database
- ConSurf (also tree files)
- you can get precalculated results for PDB entries or submit a structure
to the server (server calculations may take a while)
- the server has many options; if you just use a single structure as input,
it will find homologous sequences and create a sequence alignment and tree
- alternatively, you can upload an existing multiple sequence alignment
(several formats accepted) for conservation calculations
- results available for download include the sequence alignment
(Clustal *.aln format), per-column ConSurf conservation scores,
and TheTree.txt (tree in Newick format)
- can set up browser (except Safari) to
show results directly in Chimera
- the Blast Protein tool in Chimera
(under Tools... Sequence in the main menu,
under Info in the Multalign Viewer menu)
runs the search through a web service.
As with any BLAST search (see (D)),
the results depend on the search criteria and the database, and
may be unbalanced and/or contain significant redundancy.
- the input can be:
- a chain from a structure open in Chimera
- a sequence pasted in as text
- a sequence from an alignment open in Chimera's Multalign Viewer
- several parameters can be adjusted, including number of iterations
(multiple iterations = PSI-BLAST) and database (pdb or nr)
- the output is a list of hits, from which all or a user-chosen subset
can be retrieved as:
- a pseudo-multiple sequence alignment,
automatically shown in Chimera's Multalign Viewer.
A pseudo-multiple alignment from BLAST is not a true multiple alignment,
but a consolidation of the pairwise alignments of individual hits to the
query, as specified by the BLAST
alignment view option
“flat query-anchored with letters for identities.”
Further, BLAST may omit parts of the hit sequences. See tips below.
- structures for hits from pdb, automatically superimposed
according to the pseudo-multiple sequence alignment
Tips: improving a Blast pseudo-multiple alignment
- It may be helpful to realign the sequences with true multiple-alignment
software (e.g., using
Edit... Realign Sequences
in the Multalign Viewer menu,
or one of the alignment servers listed below).
- To fix the problem of omitted residues,
if structures for the hit sequences have been loaded in Chimera,
a new multiple alignment of their full sequences can be generated with
Align Chain Sequences
(under Tools... Sequence in the main Chimera menu).
For the DIY approach (D), issues to consider are how diverse the set of
sequences should be, alignment quality, and balance: an alignment could oversample
some areas of the intended “sequence space” and undersample others.
Imbalance can be reduced by
filtering out sequences at some level of sequence identity, and in Chimera,
using sequence-weighting options to calculate conservation.
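To make the idea of filtering at a level of sequence identity concrete, here is a minimal greedy sketch in Python. It is not how any particular tool does it; it assumes the sequences are already aligned (equal lengths), measures identity only over columns where both sequences have a residue, and keeps a sequence only if it falls below the cutoff against everything kept so far. The function names and the 90% default are hypothetical choices for this example.

```python
GAP_CHARS = "-."


def percent_identity(a, b):
    """Percent identity over columns where both aligned sequences have a residue."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    paired = [(x, y) for x, y in zip(a, b)
              if x not in GAP_CHARS and y not in GAP_CHARS]
    if not paired:
        return 0.0
    same = sum(1 for x, y in paired if x.upper() == y.upper())
    return 100.0 * same / len(paired)


def filter_redundant(alignment, max_identity=90.0):
    """Greedy filter: keep sequences, in input order, that are below
    max_identity (percent) to every sequence already kept.

    alignment -- list of (name, aligned_sequence) tuples
    """
    kept = []
    for name, seq in alignment:
        if all(percent_identity(seq, other) < max_identity
               for _, other in kept):
            kept.append((name, seq))
    return kept


if __name__ == "__main__":
    toy = [
        ("a", "ACDEFGHIKL"),
        ("b", "ACDEFGHIKL"),  # identical to a
        ("c", "ACDEFGHIKV"),  # one difference from a
        ("d", "MNPQRSTVWY"),  # unrelated
    ]
    for name, seq in filter_redundant(toy, max_identity=95.0):
        print(name, seq)
```

Because the filter is greedy, the result depends on input order, so it can help to put the sequences you most want to keep (for example, those with structures) first.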
I used the DIY approach to make the alignments in the Chimera
demo (under Tools... Demos in the menu)
because I wanted to include sequences for the hormone and receptor
from the same six species.
The sequences were similar enough to align easily, so I didn't have
to worry about tweaking parameters to improve the results.
Look up sequences (I usually save or text-edit the sequences into
a single FASTA file):
Use a server to align them (order is merely alphabetical):
July 2014 / meng[at]cgl.ucsf.edu /