PFAM (current release 23.0, July 2008, 10340 families):
http://pfam.sanger.ac.uk/
http://pfam.janelia.org/
The home page says the Pfam database is a large collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs).
Although this makes it sound like the terms are mutually exclusive, in practice, they are used almost interchangeably. Usually Pfam calls all of its entries "families" or "domain families" (even on that same help page).
I searched for the keyword WD40, which I know is a repeat smaller than a structural domain: [results]. The page for entry WD40 says it is a "family," while its name includes both "domain" and "repeat"!
Next tack: read many papers about Pfam, relevant ones listed below. None describe de novo determination of domain boundaries; apparently domain boundaries are taken from other databases.
The definition of domain boundaries, family members and alignment is done semi-automatically based on expert knowledge, sequence similarity, other protein family databases and the ability of HMM-profiles to correctly identify and align the members.
No further details in the paper, but they say most families are based on Prosite or Prints entries.
...where data are available, structural information has been used... The domain boundaries used are currently those defined by the SCOP database... approximately 300 Pfam families have been split into two or more domains, with the domain boundaries of many more refined to better match the available structural data.
New annotation field TP specifies whether an entry is a family, domain, repeat, or motif. [I don't see any mention of TP on the Web site]
Family type is the default class which simply states that the members are related. A domain is defined as an autonomous structural unit, or a reusable sequence unit that may be found in multiple protein contexts. In contrast, a repeat is not usually stable in isolation; rather, multiple tandem repeats are usually required to form a globular domain or extended structure. Motifs generally describe shorter sequence units found outside globular domains. Pfam release 6.6 contains 2032 families, 980 domains, 54 repeats and 5 motifs.
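A minimal sketch of how per-type counts like those quoted above could be tallied, assuming the TP field appears in Pfam's Stockholm-format flat files as a `#=GF TP` line (one per entry). The function name `count_entry_types`, the entry IDs, and the three-entry sample are made up for illustration and are not real Pfam data.

```python
from collections import Counter


def count_entry_types(stockholm_text):
    """Tally '#=GF TP' values across entries in Stockholm-formatted text."""
    counts = Counter()
    for line in stockholm_text.splitlines():
        if line.startswith("#=GF TP"):
            # The type is whatever follows the tag, e.g. "Family" or "Repeat".
            counts[line[len("#=GF TP"):].strip()] += 1
    return counts


# Hypothetical three-entry excerpt; IDs and types are invented.
SAMPLE = """\
# STOCKHOLM 1.0
#=GF ID   ExampleR
#=GF TP   Repeat
//
# STOCKHOLM 1.0
#=GF ID   ExampleD
#=GF TP   Domain
//
# STOCKHOLM 1.0
#=GF ID   ExampleF
#=GF TP   Family
//
"""

if __name__ == "__main__":
    for entry_type, n in sorted(count_entry_types(SAMPLE).items()):
        print(entry_type, n)
```

Run over a full flat file, a tally of this kind would be one way to check whether the TP annotation is actually populated, even if the Web site does not display it.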
Despite the promising-sounding title, this does not describe development of new domain HMMs, but the improved detection of domains with the existing set of HMMs by factoring in "context," namely how likely a segment is to be a particular domain given other domains identified in the same sequence. Previously, only the part of the sequence that matched was used to score the likelihood of a domain identification.
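To make the idea of "context" concrete, here is a toy sketch. It is not the paper's actual scoring scheme: the additive co-occurrence bonus, the threshold, the function name `rescore_with_context`, and all numbers are assumptions chosen only to show how a weak hit can be rescued by the other domains found in the same sequence.

```python
def rescore_with_context(hits, pair_bonus, threshold):
    """Accept hits whose own score plus a context bonus reaches the threshold.

    hits:       dict of domain name -> score of the matching segment alone
    pair_bonus: dict of frozenset({a, b}) -> bonus when a and b co-occur
                (hypothetical table; a real method would estimate it from data)
    """
    accepted = {}
    for name, score in hits.items():
        # Sum the bonuses contributed by every other candidate domain.
        bonus = sum(
            pair_bonus.get(frozenset((name, other)), 0.0)
            for other in hits
            if other != name
        )
        total = score + bonus
        if total >= threshold:
            accepted[name] = total
    return accepted


if __name__ == "__main__":
    hits = {"A": 30.0, "B": 18.0}            # B alone falls below the threshold
    bonus = {frozenset(("A", "B")): 5.0}     # invented co-occurrence bonus
    print(rescore_with_context(hits, {}, 20.0))     # segment score only
    print(rescore_with_context(hits, bonus, 20.0))  # with context
```

One simplification to note: the bonus here counts every other candidate hit, including ones that are themselves rejected in the end.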
SCOP (Structural Classification of Proteins)
(current release 1.73, Nov 2007, 34494 PDB entries, 97178 domains):
http://scop.mrc-lmb.cam.ac.uk/scop/
SCOP is a hierarchical classification of protein domains.
The sequences and structures of SCOP domains are available from the ASTRAL compendium. Chimera can fetch SCOP domain structures.
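The hierarchy can be illustrated with SCOP's dotted classification strings. The sketch below assumes the usual four-part form `class.fold.superfamily.family` (for example `a.1.1.2`). The helper names `parse_sccs` and `deepest_shared_level` are hypothetical, and the strings in the usage example are used purely as examples of the format.

```python
from typing import NamedTuple, Optional


class Sccs(NamedTuple):
    cls: str          # class letter, e.g. "a"
    fold: int
    superfamily: int
    family: int


def parse_sccs(text: str) -> Sccs:
    """Split a string like 'a.1.1.2' into its four hierarchy levels."""
    parts = text.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected class.fold.superfamily.family, got {text!r}")
    cls, fold, superfamily, family = parts
    return Sccs(cls, int(fold), int(superfamily), int(family))


def deepest_shared_level(a: Sccs, b: Sccs) -> Optional[str]:
    """Return the most specific level at which two domains are classified together."""
    shared = None
    for level, x, y in zip(("class", "fold", "superfamily", "family"), a, b):
        if x != y:
            break
        shared = level
    return shared


if __name__ == "__main__":
    d1 = parse_sccs("a.1.1.2")
    d2 = parse_sccs("a.1.1.3")
    print(deepest_shared_level(d1, d2))
```

Because each level is nested inside the one before it, the comparison stops at the first mismatch: two domains cannot share a family without also sharing superfamily, fold, and class.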
References and salient points:
The method used to construct the protein classification in scop is essentially the visual inspection and comparison of structures though various automatic tools are used to make the task manageable...
...a domain definition or its classification can change. A typical example is when a domain in a multidomain protein already classified in SCOP is observed for the first time either by itself, or in a different context, and therefore qualifies as a separate domain.
An advantage of the SCOP database is that it embeds a theory of evolution as defined by human experts rather than by empirical rules implemented in a variety of bioinformatics algorithms and tools.
The relationships in SCOP are established by expert analysis of sequence, structural and functional similarities amongst proteins with known structure... While the new pre-classification protocol is entirely based on sequence comparisons, the analysis of structural similarities and final classification of the protein structures in the database will continue to rely on the SCOP authors' knowledge and expertise.
BLAST is used to identify possible family relationships, PSI-BLAST and RPS-BLAST to identify possible superfamily relationships.
They also mention plans to change the meaning and name of the fold level since many evolutionary relationships have been discovered between proteins currently classified as different folds.