Online Sources of Sequence Alignments

This list is by no means exhaustive, but it includes several sources of protein multiple sequence alignments for use in Chimera and/or ChimeraX.* Why you might want such alignments:

- you are interested in a particular set of proteins and want to map sequence conservation onto structure(s)
- you have a protein structure and want to find other structures in the same family or superfamily and compare them

Chimera(X) can be used to display the sequence alignment, calculate and display sequence conservation, superimpose structures, morph between structures, etc.
Related tutorials (Chimera): Sequences and Structures, Superpositions and Alignments, Mapping Sequence Conservation onto Structures
Related tutorials (ChimeraX): Coloring by Sequence Conservation

*Chimera(X) is not meant to handle very large multiple sequence alignments (those containing several hundred to thousands of sequences). A good tool for working with large alignments is Jalview. It allows redundancy-filtering and saving a smaller alignment in various formats (aligned FASTA, etc.) that can be read into Chimera. Jalview also calculates per-column conservation, both directly and via the AACon web service. These annotation values can be exported as CSV, but it can be difficult to reformat them properly into a sequence alignment header file (with the same number of positions as the filtered alignment) or a residue attribute file for use in Chimera. A direct connection between Jalview and Chimera is under development.
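
For the reformatting step mentioned above, here is a minimal Python sketch, not a tested recipe. It assumes the per-column values have already been pulled out of the Jalview CSV into a list with exactly one entry per column of the filtered alignment (None where no value was reported). It also assumes the header file is plain text with a "name:" line, a "style: numeric" line, and then one "position value" line per column, with positions numbered from 1; check the Multalign Viewer documentation for the current format before relying on this. The function name and the example values are hypothetical.

```python
def header_file_text(name, values):
    """Build header-file text from per-column values.

    ASSUMED layout (verify against the Chimera documentation):
    a 'name:' line, a 'style: numeric' line, then one
    'position<TAB>value' line per column, positions starting at 1.
    Columns whose value is None are skipped.
    """
    lines = [f"name: {name}", "style: numeric"]
    for position, value in enumerate(values, start=1):
        if value is None:
            continue
        lines.append(f"{position}\t{value:.3f}")
    return "\n".join(lines) + "\n"


# Hypothetical scores for a four-column alignment (column 2 has no value).
scores = [0.91, None, 0.25, 1.0]
text = header_file_text("Jalview conservation", scores)
print(text)
```

The list length should equal the number of columns in the alignment that will be opened in Chimera; otherwise the values will be attached to the wrong positions.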

Alignment sources are grouped as follows:

(A) Precalculated sequence alignments of known structures
(B) Precalculated sequence alignments, structures not necessarily known
(C) Server-generated multiple alignment from a single input
(D) DIY: find individual sequences yourself, enter them into an alignment server

Descriptions here are minimal; see literature references and/or other documentation at the individual websites for more details.

Points to consider:

- How diverse a set of sequences is appropriate for your purposes? Must the proteins have identical functions, or are similar structures and related functions sufficient? Be aware that different sources have different definitions of family and superfamily.
- How many related structures are known? It may be more convenient for sequence-structure work to use sequence alignments of known structures, but that is only possible when several structures have been solved. If only a single structure (perhaps even a homology model) is available, you will have to use at least some sequences without known structures to calculate conservation.
- One advantage of using precalculated alignments from a database, or an alignment calculated by a server such as ConSurf, is that in your own published work, you can cite the source rather than having to describe in detail and justify how you generated your own alignment.

(A) Alignments containing proteins of known structure

These databases contain sequence alignments of proteins with experimentally determined 3D structures. Typically the names in the alignment are structure identifiers, which makes it easy to autofetch all the structures in a single step in Chimera (from the sequence alignment window, choose Structure... Load Structures). Alternatively, you can fetch just a subset of the structures individually with the open command or File... Fetch by ID.

HOMSTRAD
https://mizuguchilab.org/homstrad/
- alignments are for families, generally narrow sets of proteins with high similarities
- download *.pir or *.ali sequence alignment (either variant of aligned NBRF/PIR format is accepted by Chimera(X))
- can also download a coordinate file in which each structure is a different chain; this is not as convenient as separate models in Chimera(X), so I prefer to fetch structures as described above, then superimpose them using the sequence alignment
- can autofetch structures from the PDB since the sequences are named with PDB IDs
Examples:
- rubredoxin family and alignment file rub.pir
- periplasmic binding protein -- sugar family and alignment file sugbp.pir
Tip: getting rid of extra chains
- Often structures include additional chains that are not associated with the sequence alignment and not needed for the intended analyses. These chains may be additional copies of the same protein or different macromolecules. Here is a trick for removing such unassociated chains in Chimera:
  1. With the mouse in the sequence alignment, draw a box that includes at least one associated residue from each structure. That will select the associated residues.
  2. Click into the main graphics window and press the keyboard up arrow key to promote the selection from residues to chains.
  3. Invert the selection to contain unwanted atoms (command: sel invert, or menu: Select... Invert (all models)).
  4. Delete the selection (command: del sel, or menu: Actions... Atoms/Bonds... delete).
PASS2.6
http://caps.ncbs.res.in/pass2/
- structure-based alignments of SCOP superfamilies filtered at 40% sequence identity (SCOPe release 2.06)
- *.ali files (aligned NBRF/PIR format)
- superimposed structures can also be downloaded
- PASS2 sequence names can be converted to SCOP IDs (e.g. for Chimera fetch by ID) by changing any “-” (hyphen) to “_” (underscore)
Examples:
- WD40 repeat-like superfamily and alignment file 50978.ali
- periplasmic binding protein-like I superfamily and alignment file 53822.ali
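
The PASS2 naming rule above (change any hyphen to an underscore) is easy to apply to a whole list of names. This is only an illustrative sketch: the function name is made up, and the example name is hypothetical rather than taken from an actual PASS2 file.

```python
def pass2_name_to_scop_id(name: str) -> str:
    """Convert a PASS2 sequence name to a SCOP ID by replacing '-' with '_'."""
    return name.strip().replace("-", "_")


# Hypothetical PASS2-style names, for illustration only.
names = ["d1abca-", "d2xyzb1"]
scop_ids = [pass2_name_to_scop_id(n) for n in names]
print(scop_ids)
```

Names without a hyphen pass through unchanged, so the function can be applied to every name in an alignment.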

(B) Alignments that do not necessarily contain proteins of known structure

If the corresponding tree in New Hampshire (aka Newick) format is available, it can be loaded after the sequence alignment has been opened.

PFAM (also tree files)
http://pfam.xfam.org/
- for Chimera purposes, I use “seed alignments” (many fewer sequences than the full PFAM alignments)
- avoid their MSF format (it was incorrect the last time I checked); Selex, Stockholm, FASTA, and the tree format are all fine
- for some families, corresponding structures are listed; sometimes structures are not listed even though they exist
- can link out to PFAM from structure pages at the RCSB PDB
Example:
- Peripla_BP_1 family seed alignment file PF00532_seed.slx and tree file PF00532_seed.nhx
Tip: manual sequence-structure association
- Often your structure will not be similar enough to any sequence in the alignment to associate automatically. Some things to try to associate the structure:
  - Use Structure... Associations in Multalign Viewer to compare all the sequences to the structure and associate it with the best match even though it does not meet the automatic association criteria. The resulting association must be examined, since it may not be good enough to be usable. For example, the forced association of structure 2gbp with PF00532_seed.slx only associates the first few residues of the structure with the last few residues of a sequence and is obviously wrong.
  - Use Edit... Add Sequence in Multalign Viewer and add the sequence from the structure to the alignment. This may require several cycles of unsuccessfully adding the sequence, using Edit... Delete Sequences/Gaps to remove it, and adding it back again with different alignment parameters. For example, adding the sequence of 2gbp to PF00532_seed.slx does not work with the default parameters, but is reasonably successful using the BLOSUM-30 matrix.
SMART (Simple Modular Architecture Research Tool) alignments for >1000 domain families
http://smart.embl-heidelberg.de/smart/set_mode.cgi?NORMAL=1
- domain pages allow viewing alignments in various formats
- for Chimera purposes, I displayed the alignment in MSF format and then used copy-and-paste to save the browser contents to a plain text file named *.msf (had problems with their FASTA; didn't try other formats)
- domain page Structure section lists corresponding PDB entries
Example:
- RAB domain alignment saved to SMART-Rab.msf
NCBI CDD (Conserved Domain Database) domain-based local multiple sequence alignments from multiple sources including PFAM, SMART, COGs, TIGRFAM, PRK, and NCBI-curated
http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml
- choose the number of sequences (Row Display); choices range from 10 to 100
- specify how they are chosen (Type Selection), default “the most diverse members”
- choose format mFasta, click Reformat, then copy-and-paste to create a plain text file named *.fasta (or one of the other suffixes recognized by Chimera as aligned FASTA)
Example:
- Carbonic anhydrase alignment with 20 rows saved to cdd-carbonic-anhydrase.fasta
SFLD (Structure-Function Linkage Database) family, subgroup, and superfamily sequence alignments, Chimera “active site” sessions for some families
http://sfld.rbvi.ucsf.edu/
- detailed information but for relatively few enzyme superfamilies
- lists some but not all structures
- can set up browser (except Safari) to open structures, sequence alignments, sessions directly in Chimera
- can get an alignment that includes your query sequence if it matches HMMs in the database

(C) Server-generated multiple alignment from a single input

ConSurf (also creates tree files)
http://consurf.tau.ac.il/
- the input is one structure
- you can get precalculated results for PDB entries or submit a structure to the server (server calculations may take a while)
- the server has many options; if you just use a single structure as input, it will find homologous sequences and create a sequence alignment and tree
- alternatively, you can upload an existing multiple sequence alignment (several formats accepted) for conservation calculations
- results available for download include the sequence alignment (Clustal *.aln format), per-column ConSurf conservation scores, and TheTree.txt (tree in Newick format)
- can set up browser (except Safari) to show results directly in Chimera*
  *As of 2023, the server may provide a ChimeraX session, according to this paper
The Blast Protein tool and blastprotein command in ChimeraX and the Blast Protein tool in Chimera (under Tools... Sequence in the main menu) use a web service hosted by the UCSF RBVI. As with any BLAST search, the results depend on the search criteria and the database, and may be unbalanced and/or contain significant redundancy. Be aware that BLAST is meant to find similar sequences quickly rather than to generate accurate alignments (more on this below).
- the input is one sequence, any of the following:
  - a chain from a structure open in Chimera or ChimeraX
  - a sequence pasted in as text
  - a sequence from an alignment open in ChimeraX's Sequence Viewer or Chimera's Multalign Viewer
- several parameters can be adjusted, including database (pdb or nr)
- the output is a list of hits, from which all or a user-chosen subset can be retrieved as:
  - a pseudo-multiple sequence alignment, automatically shown in ChimeraX's Sequence Viewer or Chimera's Multalign Viewer. A pseudo-multiple alignment from BLAST is not a true multiple alignment, but a consolidation of the pairwise alignments of individual hits to the query, as specified by the BLAST alignment view option “flat query-anchored with letters for identities.” Further, Blast may omit parts of the hit sequences. See tips below.
  - structures for hits from pdb, automatically superimposed according to the pseudo-multiple sequence alignment
  Tips: improving a Blast pseudo-multiple alignment
  - It may be helpful to realign the sequences with true multiple-alignment software (e.g., using Edit... Realign Sequences in the ChimeraX Sequence Viewer context menu or the Chimera Multalign Viewer menu)
  - To fix the problem of omitted residues, if structures for the hit sequences have been opened, a new multiple alignment of their full sequences can be generated with the ChimeraX sequence align command or the Chimera Align Chain Sequences tool (under Tools... Sequence in the main menu).

(D) DIY: Find sequences individually, use alignment server

Issues to consider include how diverse the set of sequences should be, alignment quality, and balance: an alignment could oversample some areas of the intended “sequence space” and undersample others. Imbalance can be reduced by filtering out sequences above some level of sequence identity and, in Chimera, by using sequence-weighting options when calculating conservation.
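
To illustrate what filtering at some level of sequence identity means, here is a minimal Python sketch. It is not the method used by Jalview, Chimera, or any of the servers listed on this page; it is a simple greedy filter written for illustration. It assumes the sequences are already aligned (equal length, with "-" or "." for gaps), computes percent identity only over columns where both sequences have a residue, and uses a 90% cutoff chosen arbitrarily. The function names and example sequences are hypothetical.

```python
def percent_identity(a, b):
    """Percent identity over columns where both aligned sequences have a residue."""
    pairs = [(x, y) for x, y in zip(a, b) if x not in "-." and y not in "-."]
    if not pairs:
        return 0.0
    matches = sum(x.upper() == y.upper() for x, y in pairs)
    return 100.0 * matches / len(pairs)


def filter_redundant(aligned, cutoff=90.0):
    """Greedy filter: keep a sequence only if its identity to every
    sequence kept so far is below the cutoff. Order of input matters."""
    kept = {}
    for name, seq in aligned.items():
        if all(percent_identity(seq, other) < cutoff for other in kept.values()):
            kept[name] = seq
    return kept


# Hypothetical aligned sequences, for illustration only.
example = {
    "seqA": "MKT-AYIAKQR",
    "seqB": "MKT-AYIAKQR",  # identical to seqA, so it is dropped
    "seqC": "MRSGAFLGKHR",
}
kept = filter_redundant(example, cutoff=90.0)
print(list(kept))
```

Because the filter is greedy, the result depends on the input order: the first member of each redundant group is the one retained. Placing sequences with known structures first would keep those preferentially.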

I used the DIY approach to make the alignments in the Chimera “hormone-receptor complex” demo (under Tools... Demos in the menu) because I wanted to include sequences for the hormone and receptor from the same six species. The sequences were similar enough to align easily, so I didn't have to worry about tweaking parameters to improve the results.

Look up sequences (I usually save or text-edit the sequences into a single FASTA file):

UniProt (text search)
http://www.uniprot.org/
Entrez Protein (text search)
http://www.ncbi.nlm.nih.gov/sites/entrez?db=protein
NCBI BLAST, PSI-BLAST etc. (input one sequence to find other potentially related sequences)
http://www.ncbi.nlm.nih.gov/blast/Blast.cgi?PAGE=Proteins&PROGRAM=blastp
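
Combining the downloaded sequences into a single FASTA file can be done in a text editor, or with a few lines of Python as sketched here. The file names are hypothetical, and the sketch assumes each input file is already valid FASTA; it only concatenates the files and makes sure each one ends with a newline so that records do not run together.

```python
from pathlib import Path


def merge_fasta(paths, out_path):
    """Concatenate FASTA files into one file and return how many
    non-empty input files were merged."""
    records = []
    for path in paths:
        text = Path(path).read_text().strip()
        if text:
            records.append(text)
    Path(out_path).write_text("\n".join(records) + "\n")
    return len(records)


# Example usage with hypothetical file names:
# merge_fasta(["hormone_human.fasta", "hormone_mouse.fasta"], "hormones.fasta")
```

The merged file contains unaligned sequences and is what would be submitted to one of the alignment servers.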

Use a server to align them (order is merely alphabetical):

Clustal Omega
http://www.ebi.ac.uk/Tools/msa/clustalo/
Cobalt
http://www.ncbi.nlm.nih.gov/tools/cobalt/
Kalign
http://www.ebi.ac.uk/Tools/msa/kalign/
MAFFT
http://www.ebi.ac.uk/Tools/msa/mafft/
http://mafft.cbrc.jp/alignment/server/
MUSCLE
http://www.ebi.ac.uk/Tools/msa/muscle/
PRALINE
http://www.ibi.vu.nl/programs/pralinewww/
PROMALS3D (also takes structures)
http://prodata.swmed.edu/promals3d
T-Coffee
http://www.tcoffee.org/

December 2023 / meng[at]cgl.ucsf.edu / home page

If you find broken links or outdated information in this page, please let me know – thanks!