"Domains" in PFAM and SCOP

presentation 11/17/08

PFAM (current release 23.0, July 2008, 10340 families):
http://pfam.sanger.ac.uk/
http://pfam.janelia.org/

The home page says the Pfam database is a large collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs).

The help page says Pfam entries are classified in one of four ways: family, domain, repeat, or motif. The glossary merely repeats the help page's definitions of these terms without saying how they pertain to Pfam.

Although this makes it sound like the terms are mutually exclusive, in practice, they are used almost interchangeably. Usually Pfam calls all of its entries "families" or "domain families" (even on that same help page).

I searched for the keyword WD40, which I know is a repeat smaller than a structural domain: [results]. The page for entry WD40 says it is a "family," while its name includes both "domain" and "repeat"!

Next tack: I read many papers about Pfam (the relevant ones are listed below). None describes de novo determination of domain boundaries; apparently the boundaries are taken from other databases.


SCOP (Structural Classification of Proteins)
(current release 1.73, Nov 2007, 34494 PDB entries, 97178 domains):
http://scop.mrc-lmb.cam.ac.uk/scop/

SCOP is a hierarchical classification of protein domains.
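
Each SCOP domain carries a concise classification string (sccs) such as a.1.1.1, whose four fields give its class, fold, superfamily, and family. As a minimal sketch of how that hierarchy can be read programmatically, the Python below splits an sccs string into its levels and reports the deepest level two domains share. The names parse_sccs and deepest_shared_level are hypothetical helpers written for this note, not part of any SCOP or ASTRAL tool, and the strings in the usage lines are only examples of the format.

```python
from typing import NamedTuple, Optional

# Level names in order, from the top of the hierarchy downward.
LEVELS = ("class", "fold", "superfamily", "family")


class Sccs(NamedTuple):
    cls: str          # class letter, e.g. "a"
    fold: int
    superfamily: int
    family: int


def parse_sccs(sccs: str) -> Sccs:
    """Split a string like 'a.1.1.1' into its four hierarchy levels."""
    parts = sccs.strip().split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated fields, got {sccs!r}")
    cls, fold, superfamily, family = parts
    return Sccs(cls, int(fold), int(superfamily), int(family))


def deepest_shared_level(a: str, b: str) -> Optional[str]:
    """Name the lowest level at which two sccs strings still agree."""
    shared = None
    for name, x, y in zip(LEVELS, parse_sccs(a), parse_sccs(b)):
        if x != y:
            break
        shared = name
    return shared


print(parse_sccs("a.1.1.1"))
print(deepest_shared_level("a.1.1.1", "a.1.1.2"))
```

Two domains that agree down to the third field are in the same superfamily but different families; two that differ in the first field share nothing, and the function returns None.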

Take-home: human curators define domains by their observed occurrences in known structures according to the definition above (albeit with the help of automated sequence searches). Only the appearance of a potential domain structure in different overall protein structures is used, not its compactness or other descriptors that could be calculated from its coordinates. These domain assignments can change as new structures are solved.
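
The recurrence criterion in the take-home can be made concrete with a toy sketch. This is not SCOP's procedure (which is manual curation); it only illustrates the idea that a unit counts as a domain when it turns up in more than one overall architecture. The function recurrent_units, the protein names, and the unit labels X, Y, Z are all hypothetical, and each protein is assumed to have already been described as an ordered list of candidate units.

```python
from collections import defaultdict


def recurrent_units(architectures):
    """Return the units observed in at least two different architectures.

    `architectures` maps a protein name to its ordered list of unit labels.
    Two proteins with the same ordered list count as one architecture.
    """
    contexts = defaultdict(set)
    for units in architectures.values():
        arch = tuple(units)
        for unit in set(units):
            contexts[unit].add(arch)
    return {unit for unit, archs in contexts.items() if len(archs) >= 2}


# Hypothetical data: Z is only ever seen together with Y, in one architecture.
example = {
    "protein_1": ["X", "Y"],
    "protein_2": ["X"],
    "protein_3": ["Y", "Z"],
    "protein_4": ["Y", "Z"],
}

print(sorted(recurrent_units(example)))
```

Here X and Y qualify because each appears in two different architectures, while Z does not, since it has only been seen in one. Adding a new protein containing Z in a different arrangement would change the answer, which mirrors the point that assignments can change as new structures are solved. Nothing about compactness or coordinates enters the calculation.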

The sequences and structures of SCOP domains are available from the ASTRAL compendium. Chimera can fetch SCOP domain structures.

References and salient points:


See also: