Online Sources of Sequence Alignments
This is by no means an exhaustive list, but includes several sites I have
found useful as sources of protein family or superfamily sequence alignments
for use with Chimera.
Why you might want such alignments:
- you have a protein structure and want to show which parts are
more or less conserved in its family or superfamily
- you have a protein structure and want to find other structures
in the same family or superfamily and compare them
- you are interested in a particular family or superfamily
and want to look up a sample of structures and compare them
I would then work with the sequences and structure(s) in Chimera, for
example to evaluate conservation, superimpose structures,
evaluate conformational variability, or morph between different structures.
Be aware that different sources define family and superfamily differently,
but all agree that a family is the more closely related set and
a superfamily the more diverse one.
The list is organized as follows:
- (A) precalculated sequence alignments of known structures
- (B) precalculated sequence alignments,
structures not necessarily known
- (C) multiple alignment generated by a server from a
single input structure
- (D) DIY: find individual sequences yourself,
enter them into an alignment server
Descriptions here are minimal.
I recommend consulting literature references and/or other documentation
provided on the sites to better understand their contents and methodology.
Points to consider:
- How diverse a set of sequences is appropriate for your purposes?
Must the proteins have identical functions, or are similar structures and
related functions sufficient?
- How many related structures are known? It is generally more convenient
for sequence-structure work to use sequence alignments of known structures
(as in (A)), but that is possible only when several structures have been
solved. If only a single structure (perhaps even a model) is available,
you will have to use sequences without known structures to map conservation.
- An important advantage of using precalculated alignments or an alignment
calculated by a server is that in your own published work,
you can cite the source rather than having to describe and justify in
detail how you generated your own alignment (as in (D)).
(A) Alignments containing proteins of known structure
These databases contain sequence alignments of proteins with experimentally
determined 3D structures. Typically the names in the alignment are
structure identifiers, which makes it easy to autofetch all the structures in
a single step in Chimera (from the sequence alignment window, choose
Structure... Load Structures). Alternatively, you can fetch just
a subset of the structures individually with the open command or
File... Fetch by ID.
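If you want only a subset, one option is to generate the open commands from the alignment file itself. The following is a minimal sketch, not part of Chimera: it assumes an aligned FASTA file whose sequence names begin with a four-character PDB ID (one digit followed by three letters or digits), and the function names are hypothetical.

```python
import re

# Assumption: a PDB ID is one digit followed by three alphanumeric characters
# and sits at the very start of the sequence name.
PDB_ID = re.compile(r"^([0-9][A-Za-z0-9]{3})")


def pdb_ids_from_fasta(text):
    """Return unique lowercase PDB IDs found at the start of FASTA names, in file order."""
    ids = []
    for line in text.splitlines():
        if line.startswith(">"):
            match = PDB_ID.match(line[1:].strip())
            if match:
                pdb_id = match.group(1).lower()
                if pdb_id not in ids:
                    ids.append(pdb_id)
    return ids


def chimera_open_commands(ids):
    """Build one Chimera 'open' command line per PDB ID."""
    return ["open " + pdb_id for pdb_id in ids]
```

The resulting lines can be trimmed by hand to the structures of interest and then entered at the Chimera command line or saved to a text file of commands.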
- HOMSTRAD
http://tardis.nibio.go.jp/homstrad/
- alignments are for families, generally narrow sets of proteins
with high similarities
- download *.pir or *.ali sequence alignment (either variant of
aligned NBRF/PIR format is accepted by Chimera)
- can also download a coordinate file in which each structure is a
different chain; this is not as convenient as separate models in Chimera,
so I prefer to fetch structures as described above, then superimpose
them using the sequence alignment
- can autofetch structures from the PDB since the sequences
are named with PDB IDs
Examples:
Tip: getting rid of extra chains
- Often PDB entries have multiple copies of the protein, but you may
only want to view one copy. See the Superpositions and Alignments
Chimera tutorial for one way to remove such extra copies.
- PASS2
http://caps.ncbs.res.in/campass/pass2.html
- structure-based alignments of SCOP superfamilies
filtered at 40% sequence identity
(SCOP release 1.63, not the newest)
- *.ali files (aligned NBRF/PIR format)
- sequence names lack the starting "d" of domain identifiers, so
autofetching gets entire PDB entries (see the tip on getting
rid of extra chains, above)
- superimposed structures can also be downloaded,
but the PDB file is malformed: all the superimposed structures
are in one PDB file without distinct chain or model IDs.
To use the PDB file in Chimera, text-edit it to add MODEL and ENDMDL records
around each structure.
Examples:
- S4 (Structure-based Sequence alignments
of SCOP Superfamilies)
http://compbio.mds.qmw.ac.uk/~james/S4.shtml
(unfortunately, browsing/searching is broken,
but you can still download S4_1_63.tar.gz, which contains all the alignments)
- structure-based alignments of SCOP superfamilies
filtered at 40% sequence identity
(SCOP release 1.63, not the newest)
- Clustal *.aln format
- can autofetch domain structures from ASTRAL since the sequences
are named with SCOP domain IDs
(but the IDs are from SCOP release 1.63, May 2003,
so some may no longer match available structures;
see the tip on dealing with old domain identifiers, below)
Examples:
Tip: dealing with old domain identifiers
- The chain IDs of many structures in the PDB changed during the
"remediation" process in 2007. Because of this, some of the domain IDs
in alignment files based on older SCOP releases now fail to automatically
load a structure. In most cases the change was from no chain ID to chain A,
so you can usually manually fetch the appropriate file by simply changing
the underscore in the chain position of the ID to "a". For example,
open scop:d2dri__ will fail, but open scop:d2dria_ will
retrieve the structure.
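For an alignment with many old identifiers, the same substitution can be scripted. This is a minimal sketch with a hypothetical function name; it assumes the seven-character domain ID layout seen in the example above ("d", four-character PDB ID, chain character, domain character) and handles only the common no-chain-ID to chain-A case.

```python
def update_scop_domain_id(sid):
    """Replace a blank ('_') chain character with 'a'; return other IDs unchanged."""
    # Assumed layout: 'd' + 4-character PDB ID + chain character + domain character
    if len(sid) == 7 and sid.startswith("d") and sid[5] == "_":
        return sid[:5] + "a" + sid[6]
    return sid


if __name__ == "__main__":
    # Print the Chimera command for the example ID from the text
    print("open scop:" + update_scop_domain_id("d2dri__"))
```

IDs that already have a chain character pass through unchanged, so the function can be applied to every name in an alignment. Other kinds of remediation changes would still need to be fixed by hand.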
- DMAPS (Database of Multiple Alignments
for Protein Structures)
http://bioinformatics.albany.edu/~dmaps/
- multiple alignments for SCOP, CATH, ENZYME, and CE-based families
(most are narrow sets with high %ID and redundancy)
generated by CE-MC superposition
- searching may require pre-PDB-remediation chain IDs, for example
2mnr:_ instead of the current 2mnr:A
- links to some data are broken (Jun 2009) but many sequence alignments
are available
- aligned FASTA format
- sequence names start with PDB IDs, but autofetching is not recommended
for such redundant alignments
- SISYPHUS
http://sisyphus.mrc-cpe.cam.ac.uk/sisyphus/
- structural alignments for proteins with non-trivial relationships
such as permutations
- alignment categories: fragment, homologous, fold
- aligned FASTA format
- sequence names start with PDB IDs
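For multi-structure PDB files that lack model records, such as the PASS2 superposition files mentioned above, the MODEL/ENDMDL edit can be scripted rather than done by hand. The sketch below rests on an assumption not verified against PASS2 files: that atom serial numbers restart (or at least fail to increase) where a new structure begins. If the files instead separate structures with TER or END records, the splitting test would need to change. The function name is hypothetical.

```python
def add_model_records(pdb_text):
    """Wrap each structure in MODEL/ENDMDL records.

    Assumption: a new structure starts wherever the atom serial number
    (columns 7-11) does not increase. Only ATOM and HETATM lines are kept;
    all other records are dropped.
    """
    blocks, current, last_serial = [], [], None
    for line in pdb_text.splitlines():
        if line.startswith(("ATOM", "HETATM")):
            serial = int(line[6:11])
            if last_serial is not None and serial <= last_serial:
                blocks.append(current)  # serial restarted: close the previous structure
                current = []
            last_serial = serial
            current.append(line)
    if current:
        blocks.append(current)

    out = []
    for number, block in enumerate(blocks, start=1):
        out.append("MODEL     %4d" % number)  # model serial number in columns 11-14
        out.extend(block)
        out.append("ENDMDL")
    out.append("END")
    return "\n".join(out) + "\n"
```

It is worth checking that the number of models in the output matches the number of sequences expected from the alignment before opening the file in Chimera.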
(B) Alignments that do not necessarily contain proteins of known structure
If the corresponding tree in New Hampshire (aka Newick) format is available,
it can be loaded after the sequence alignment has been opened.
- PFAM (also tree files)
http://pfam.sanger.ac.uk/
http://pfam.janelia.org/
- for Chimera purposes, I use "seed alignments" (many fewer sequences than
the full PFAM alignments)
- don't get their MSF format (it's apparently wrong),
but Selex, Stockholm, FASTA, and the tree format are all fine
- for some families, corresponding structures are listed;
sometimes structures are not listed even though they exist
- can link out to PFAM from structure pages at the RCSB PDB
Example:
Tip: manual sequence-structure association
- Often your structure will not be similar enough to the sequences in the
alignment to associate automatically. Some things to try to associate
the structure:
- Use Structure... Associations in Multalign Viewer to
compare all the sequences to the structure and associate it with the best
match even though it does not meet the automatic association criteria.
The resulting association must be examined, since it may not be good
enough to be usable. For example, the forced association of structure
2gbp with PF00532_seed.slx
only associates the first few residues of the structure with the
last few residues of a sequence and is obviously wrong.
- Use Edit... Add Sequence in Multalign Viewer and add
the sequence from the structure to the alignment. This may require several
cycles of unsuccessfully adding the sequence, using Edit... Delete
Sequences/Gaps to remove it, and adding it back again with different
alignment parameters. For example, adding the sequence of
2gbp to PF00532_seed.slx does not work with the default
parameters, but is reasonably successful using the BLOSUM-30 matrix.
- PANDIT
(again seed alignments from PFAM, but with more carefully built trees)
http://www.ebi.ac.uk/goldman-srv/pandit/
(Nov 2008: updates stopped, but database is still useful)
- see PFAM (above) for family identifier codes
- no convenient file download; I had to cut-n-paste text from
the screen into a file to save the alignment (apparently shown in Selex
format) and the tree (saving from the link includes a bunch of HTML tags)
- this may be a version issue, but I couldn't mix-n-match the tree
and alignment from PFAM and PANDIT; they were consistent only when both
came from PFAM or both came from PANDIT
- SFLD
(Structure-Function Linkage Database)
(family, subgroup, and superfamily alignments,
some Chimera "active site" sessions)
http://sfld.rbvi.ucsf.edu/
- deep and detailed coverage but of relatively few enzyme superfamilies
- lists some but not all structures
- can set up browser (except Safari) to open structures, sequence
alignments, and sessions directly in Chimera
- can get an alignment that includes your query sequence if it matches HMMs
in the database
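Because a tree and an alignment taken from different sources may not be consistent, it can help to compare their names before loading both. The following is a rough sketch, not a full Newick parser: it assumes unquoted leaf labels, an alignment in aligned FASTA format, and that the sequence name is the first whitespace-delimited word after ">". All function names are hypothetical.

```python
import re


def newick_leaf_names(newick):
    """Collect leaf labels: a leaf label directly follows '(' or ','.

    Labels that follow ')' are internal-node labels and are ignored.
    Quoted labels are not handled.
    """
    found = re.findall(r"[(,]\s*([^(),:;\s][^(),:;]*)", newick)
    return {name.strip() for name in found}


def fasta_names(text):
    """Collect sequence names (first word of each '>' header line)."""
    return {
        line[1:].split()[0]
        for line in text.splitlines()
        if line.startswith(">") and line[1:].strip()
    }


def compare_names(newick, fasta_text):
    """Return (names only in the tree, names only in the alignment), each sorted."""
    leaves, seqs = newick_leaf_names(newick), fasta_names(fasta_text)
    return sorted(leaves - seqs), sorted(seqs - leaves)
```

If both returned lists are empty, the tree and alignment at least refer to the same set of names.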
(C) Alignments created from single structure inputs
- ConSurf (also tree files)
http://consurf.tau.ac.il/
- you can get precalculated results for PDB entries or submit a structure
to the server (note server calculations may take a while)
- the server has many options; if you just use a single structure as input,
it will find homologous sequences and create a sequence alignment and tree
- results files available for download include the sequence
alignment (Clustal *.aln format) and TheTree.txt (tree in Newick format)
- can set up browser (except Safari) to show results directly in Chimera
(D) DIY: Find sequences individually, use alignment server
Issues to consider are how diverse the set of sequences should be,
alignment quality, and balance, i.e. an alignment could oversample
some areas of the intended "sequence space" and undersample others.
Imbalance can be reduced by filtering out sequences above some threshold of
pairwise sequence identity and, in Chimera, by using sequence-weighting
options when calculating conservation.
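As an illustration of identity-based filtering, here is a minimal greedy sketch. It is not the method used by any of the sites listed, and it makes several assumptions: the sequences are already aligned and of equal length, gaps are "-" or ".", percent identity is computed over columns where both sequences have a residue (one of several definitions in use), and the 90% default threshold is arbitrary. The function names are hypothetical.

```python
def percent_identity(a, b):
    """Percent identity over columns where both aligned sequences have a residue."""
    pairs = [(x, y) for x, y in zip(a, b) if x not in "-." and y not in "-."]
    if not pairs:
        return 0.0
    matches = sum(x.upper() == y.upper() for x, y in pairs)
    return 100.0 * matches / len(pairs)


def filter_redundant(alignment, max_identity=90.0):
    """Greedy filter over (name, aligned_sequence) pairs.

    A sequence is kept only if its identity to every sequence already kept
    is at most max_identity percent.
    """
    kept = []
    for name, seq in alignment:
        if all(percent_identity(seq, kept_seq) <= max_identity for _, kept_seq in kept):
            kept.append((name, seq))
    return kept
```

The result depends on input order, since earlier sequences are always preferred, so a sequence that must be retained (for example, the one for your structure) should be listed first.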
I used the DIY approach to make the alignments in the Chimera
"hormone-receptor complex" demo (under Tools... Demos in the menu)
because I wanted to include sequences for the hormone and receptor
from the same six species.
The sequences were similar enough to align easily, so I didn't have
to worry about tweaking parameters to improve the results.
Look up sequences (I usually save or text-edit the sequences into
a single FASTA file):
Use a server to align them (order is merely alphabetical):
Jun 2009 / meng[at]cgl.ucsf.edu