"Domains" in PFAM and SCOP

presentation 11/17/08

PFAM (current release 23.0, July 2008, 10340 families):
http://pfam.sanger.ac.uk/
http://pfam.janelia.org/

The home page says the Pfam database is a large collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs).

Pfam-A entries are high quality, manually curated families. Sets of related Pfam-A entries are grouped into clans.
Pfam-B entries are generated by automated clustering (they are of lower quality but serve to extend coverage of known proteins).

The help page says Pfam entries are classified in one of four ways:

Family: A collection of related proteins
Domain: A structural unit which can be found in multiple protein contexts
Repeat: A short unit which is unstable in isolation but forms a stable structure when multiple copies are present
Motif: A short unit found outside globular domains

The glossary merely repeats these definitions without saying how they pertain to Pfam.

Although this makes it sound like the terms are mutually exclusive, in practice, they are used almost interchangeably. Usually Pfam calls all of its entries "families" or "domain families" (even on that same help page).

I searched for the keyword WD40, which I know is a repeat smaller than a structural domain [results]. The page for entry WD40 says it is a "family," while its name includes both "domain" and "repeat"!

Next tack: I read many papers about Pfam; the relevant ones are listed below. None describes de novo determination of domain boundaries; apparently domain boundaries are taken from other databases.

Sonnhammer et al. NAR 1998, Pfam release 2.0 (527 families):
The definition of domain boundaries, family members and alignment is done semi-automatically based on expert knowledge, sequence similarity, other protein family databases and the ability of HMM-profiles to correctly identify and align the members.
No further details in paper, but they say most families are based on Prosite or Prints entries.
Bateman et al. NAR 2002, Pfam release 6.6 (3071 families):
...where data are available, structural information has been used... The domain boundaries used are currently those defined by the SCOP database... approximately 300 Pfam families have been split into two or more domains, with the domain boundaries of many more refined to better match the available structural data.
New annotation field TP specifies whether an entry is a family, domain, repeat, or motif. [I don't see any mention of TP on the Web site]
Family type is the default class which simply states that the members are related. A domain is defined as an autonomous structural unit, or a reusable sequence unit that may be found in multiple protein contexts. In contrast, a repeat is not usually stable in isolation; rather, multiple tandem repeats are usually required to form a globular domain or extended structure. Motifs generally describe shorter sequence units found outside globular domains. Pfam release 6.6 contains 2032 families, 980 domains, 54 repeats and 5 motifs.
Coin et al. Enhanced protein domain discovery by using language modeling techniques from speech recognition. PNAS USA 2003.
Despite the promising-sounding title, this does not describe development of new domain HMMs, but rather improved detection of domains with the existing set of HMMs by factoring in "context," namely how likely a segment is to be a particular domain given the other domains identified in the same sequence. Previously, only the part of the sequence that matched the HMM was used to score the likelihood of a domain identification.
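
Although I did not find TP on the Web site, the TP field from Bateman et al. 2002 should be recoverable from the Pfam flatfiles. As far as I know, these are in Stockholm format and carry per-family annotation on "#=GF" lines, so the type would appear as a "#=GF TP" line. The sketch below is a minimal tally under that assumption. The function name and the two-entry sample text are made up for illustration and are not real Pfam data.

```python
from collections import Counter


def count_entry_types(stockholm_text):
    """Count '#=GF TP' values across the entries in a Stockholm-format text.

    Assumes each entry carries one line of the form '#=GF TP   <type>'.
    """
    counts = Counter()
    for line in stockholm_text.splitlines():
        parts = line.split(None, 2)  # e.g. ['#=GF', 'TP', 'Repeat']
        if len(parts) == 3 and parts[0] == "#=GF" and parts[1] == "TP":
            counts[parts[2].strip()] += 1
    return counts


# Hypothetical, heavily abbreviated flatfile with two entries (not real data).
SAMPLE = """\
# STOCKHOLM 1.0
#=GF ID   ExampleRepeat
#=GF TP   Repeat
//
# STOCKHOLM 1.0
#=GF ID   ExampleDomain
#=GF TP   Domain
//
"""

print(count_entry_types(SAMPLE))

# Sanity check on the release 6.6 figures quoted above:
# the four types should add up to the 3071 families in that release.
release_6_6 = {"Family": 2032, "Domain": 980, "Repeat": 54, "Motif": 5}
print(sum(release_6_6.values()))  # 3071
```

Run against a full flatfile, a tally like this would show how the four types are actually distributed in the current release, which the Web pages do not make obvious.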

SCOP (Structural Classification of Proteins)
(current release 1.73, Nov 2007, 34494 PDB entries, 97178 domains):
http://scop.mrc-lmb.cam.ac.uk/scop/

SCOP is a hierarchical classification of protein domains.

protein domain - an evolutionary unit observed in nature either in isolation or in more than one context in multidomain proteins
family - set of domains with clear evolutionary relationships (high similarity in sequence and/or structure)
superfamily - set of families with probable evolutionary relationships (similarity in sequence, structure, function, other experimental observables)
fold - same major secondary structures in the same arrangement with the same topological connections; peripheral parts could differ
class - same general secondary structure content (all-alpha, all-beta, alpha + beta, alpha/beta, ...)
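
To make the hierarchy concrete: SCOP labels each domain with a concise classification string (sccs) such as "b.69.4.1", which I understand to read as class.fold.superfamily.family. The sketch below assumes that four-field layout and finds the most specific level two domains share. The function names are my own, and the sccs strings are only illustrative; I have not checked what they classify.

```python
LEVELS = ("class", "fold", "superfamily", "family")


def parse_sccs(sccs):
    """Split an sccs string such as 'b.69.4.1' into its four levels."""
    parts = sccs.split(".")
    if len(parts) != len(LEVELS):
        raise ValueError(f"expected 4 dot-separated fields, got {sccs!r}")
    return dict(zip(LEVELS, parts))


def deepest_shared_level(sccs_a, sccs_b):
    """Return the most specific level two domains share, or None.

    The comparison stops at the first mismatch because the hierarchy is
    nested: a family number only has meaning within its superfamily.
    """
    shared = None
    for level, a, b in zip(LEVELS, sccs_a.split("."), sccs_b.split(".")):
        if a != b:
            break
        shared = level
    return shared


print(parse_sccs("b.69.4.1"))
print(deepest_shared_level("b.69.4.1", "b.69.4.2"))  # superfamily
print(deepest_shared_level("b.69.4.1", "b.1.1.1"))   # class
print(deepest_shared_level("a.1.1.1", "b.1.1.1"))    # None
```

Two domains that share only the class level have similar secondary structure content but no claimed evolutionary relationship, while a shared superfamily implies a probable one.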

Take-home: human curators define domains by their observed occurrences in known structures according to the definition above (albeit with the help of automated sequence searches). Only the appearance of a potential domain structure in different overall protein structures is used, not its compactness or other descriptors that could be calculated from its coordinates. These domain assignments can change as new structures are solved.
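
As a toy illustration of the occurrence criterion (not of how the SCOP curators actually work, which is by expert judgment), each known structure can be reduced to a tuple of unit labels. A unit then qualifies if it is seen alone or in more than one distinct multi-unit context. All names and data below are hypothetical.

```python
def qualifies_as_domain(unit, architectures):
    """Toy test of the occurrence criterion.

    architectures: list of tuples of unit labels, one tuple per known structure.
    """
    contexts = set()
    for arch in architectures:
        if unit not in arch:
            continue
        if len(arch) == 1:
            return True  # observed in isolation
        # Context = the surrounding units, with this unit's position masked.
        contexts.add(tuple("*" if u == unit else u for u in arch))
    return len(contexts) > 1  # observed in more than one multi-unit context


# Hypothetical set of solved structures.
known = [("A", "B"), ("A", "C"), ("D", "E"), ("F",)]

print(qualifies_as_domain("A", known))  # True: seen with B and with C
print(qualifies_as_domain("D", known))  # False: only ever seen with E
print(qualifies_as_domain("F", known))  # True: seen in isolation

# A newly solved structure of D on its own changes the answer.
print(qualifies_as_domain("D", known + [("D",)]))  # True
```

The last call shows why assignments can change over time: nothing about D itself changed, only the set of structures in which it has been observed.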

The sequences and structures of SCOP domains are available from the ASTRAL compendium. Chimera can fetch SCOP domain structures.
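
For working with the ASTRAL sequence files, a header parser is a natural first step. The sketch below assumes each FASTA header has the layout ">sid sccs (region) description", which is my recollection of the ASTRAL files rather than something stated in the papers below, so check it against a real file first. The example header and the function name are made up.

```python
import re

# Assumed header layout: >sid sccs (region) description
HEADER = re.compile(
    r"^>(?P<sid>\S+)\s+(?P<sccs>\S+)\s+\((?P<region>[^)]*)\)\s*(?P<description>.*)$"
)


def parse_header(line):
    """Parse one FASTA header line into sid, sccs, region and description."""
    match = HEADER.match(line.strip())
    if match is None:
        raise ValueError(f"header does not fit the assumed layout: {line!r}")
    return match.groupdict()


# Hypothetical header, not copied from a real ASTRAL file.
EXAMPLE = ">d1abca_ a.1.1.1 (A:) Hypothetical protein {Example organism}"

record = parse_header(EXAMPLE)
print(record["sid"], record["sccs"], record["region"])
```

Raising an error on a non-matching header, rather than skipping it, makes a wrong layout assumption show up immediately.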

References and salient points:

Murzin et al. JMB 1995, initial SCOP release (3179 domains):
The method used to construct the protein classification in scop is essentially the visual inspection and comparison of structures though various automatic tools are used to make the task manageable...
Lo Conte et al. NAR 2002, SCOP 1.55 (>30,000 domains):
...a domain definition or its classification can change. A typical example is when a domain in a multidomain protein already classified in SCOP is observed for the first time either by itself, or in a different context, and therefore qualifies as a separate domain.
Andreeva et al. NAR 2004, SCOP 1.63 (~50,000 domains):
An advantage of the SCOP database is that it embeds a theory of evolution as defined by human experts rather than by empirical rules implemented in a variety of bioinformatics algorithms and tools.
Andreeva et al. NAR 2007, SCOP 1.73 (>90,000 domains):
The relationships in SCOP are established by expert analysis of sequence, structural and functional similarities amongst proteins with known structure... While the new pre-classification protocol is entirely based on sequence comparisons, the analysis of structural similarities and final classification of the protein structures in the database will continue to rely on the SCOP authors' knowledge and expertise.
BLAST is used to identify possible family relationships, PSI-BLAST and RPS-BLAST to identify possible superfamily relationships.
They also mention plans to change the meaning and name of the fold level since many evolutionary relationships have been discovered between proteins currently classified as different folds.