Journal Club 03/24/08

Pei J, Kim BH, Grishin NV. PROMALS3D: a tool for multiple protein sequence and structure alignments. Nucleic Acids Res. 2008 Apr;36(7):2295-300.

Summary:

PROMALS3D generates accurate multiple sequence alignments using both sequence and structure information. PROMALS3D stands for "PROfile Multiple Alignment with predicted Local Structures and 3D constraints" and is available via web server: http://prodata.swmed.edu/promals3d

Background:

The earlier program PROMALS incorporates:

- database searching for homologs
- secondary structure prediction
- profile-profile alignment

PROMALS3D is PROMALS enhanced to use 3D structures. Other multiple sequence alignment programs have also been enhanced to use structures, although the details differ:

- MAFFT server - can add in one or more pre-existing structural alignments in sequence form (aligned FASTA) (max sequences >10,000!)
- T-Coffee server (Expresso/3DCoffee option) - the server will automatically find structures and align them for you (max sequences 50)

There are other web servers for these programs (at EBI, for example), but usually with fewer options exposed.

Method:

Most of the steps were already in PROMALS. Shaded boxes indicate the 3D part.

- initial clustering uses BLOSUM62 (limitation?)
- UniRef90 is UniProt filtered at 90% ID
- PSIPRED uses neural networks on profiles from PSI-BLAST
- PSI-BLAST search of structures is a single iteration
- if structures mostly overlap, only the one with the better E-value is kept
- pairwise sequence profile-sequence profile alignments (incorporating the secondary structure predictions somehow) provide sequence constraints
- pairwise structure-structure or structure-sequence profile alignments provide structure constraints
- constraints are integrated into the consistency score, with structure constraints weighted 1.5x relative to sequence constraints
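
To make the 1.5x weighting concrete, here is a minimal sketch of combining two constraint sources into one per-residue-pair score. The function name `combine_constraints`, the dictionary layout (residue-index pair mapped to a match score), and the plain weighted sum are my assumptions for illustration only; the paper folds the constraints into its consistency score, and that machinery is not reproduced here.

```python
from collections import defaultdict

# From the notes: structure constraints count 1.5x relative to sequence constraints.
STRUCTURE_WEIGHT = 1.5
SEQUENCE_WEIGHT = 1.0


def combine_constraints(sequence_constraints, structure_constraints):
    """Return a weighted sum of match scores per residue pair (illustrative only)."""
    combined = defaultdict(float)
    for pair, score in sequence_constraints.items():
        combined[pair] += SEQUENCE_WEIGHT * score
    for pair, score in structure_constraints.items():
        combined[pair] += STRUCTURE_WEIGHT * score
    return dict(combined)


# Made-up scores for residue pairs (i, j) between two sequences.
seq = {(0, 0): 0.5, (1, 1): 0.25}
struct = {(1, 1): 0.5, (2, 3): 1.0}
combined = combine_constraints(seq, struct)
for pair in sorted(combined):
    print(pair, combined[pair])
```

A pair supported by both sources, such as (1, 1) here, accumulates both contributions, so agreement between sequence and structure evidence is rewarded.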

Evaluation:

benchmarks:
- SABmark - pairwise reference alignments (consensus of SOFI and CE) within each group
  - twilight zone set - 209 groups, SCOP fold level
  - superfamilies set - 425 groups, SCOP superfamily level
- PREFAB - 1682 multiple alignments, each containing 2 sequences with known structures (alignment consensus of FSSP and CE) and up to 24 homologous sequences; generally "easier" than SABmark
scoring:
- reference-dependent:
  - Q-score: number of correctly aligned residue pairs in the test alignment divided by the number of aligned residue pairs in the reference alignment
- reference-independent:
  - GDT-TS (global distance test, total score): fraction of residues that pair within some cutoff CA-CA distance; TS is the average over cutoffs of 1, 2, 4, and 8 Å
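
Both scores can be sketched in a few lines. This is a simplified illustration, not the benchmark code: alignments are assumed to be given as sets of (residue in sequence 1, residue in sequence 2) pairs, the CA-CA distances are assumed to come from a single superposition that has already been done, and the normalising length `n_residues` is an assumption (I use the number of residues in the toy example).

```python
def q_score(test_pairs, reference_pairs):
    """Fraction of reference residue pairs that the test alignment reproduces."""
    reference_pairs = set(reference_pairs)
    if not reference_pairs:
        raise ValueError("reference alignment has no aligned pairs")
    return len(set(test_pairs) & reference_pairs) / len(reference_pairs)


def gdt_ts(distances, n_residues, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Average, over the cutoffs, of the fraction of residues paired within each cutoff.

    distances: CA-CA distances in angstroms for the aligned residue pairs,
    measured after superposition (assumed to be precomputed).
    """
    fractions = [sum(d <= c for d in distances) / n_residues for c in cutoffs]
    return sum(fractions) / len(cutoffs)


# Toy data: the test alignment gets 3 of the 4 reference pairs right.
reference = {(1, 1), (2, 2), (3, 3), (4, 5)}
test = {(1, 1), (2, 2), (3, 4), (4, 5)}
q = q_score(test, reference)

# Toy distances for four aligned pairs in a four-residue structure.
g = gdt_ts([0.5, 1.5, 3.0, 9.0], n_residues=4)
print(q, g)
```

Note that the Q-score needs a reference alignment, while GDT-TS needs only the structures, which is why the latter counts as reference-independent.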

Table 1. Tests on SABmark database

Numbers in parentheses after each set name: # multiple alignments / # pairwise reference alignments.

                              SABmark-twi (209/10667)     SABmark-sup (425/19092)
Method                        Q-score       GDT-TS        Q-score       GDT-TS
                              (max ≈ 0.71)                (max ≈ 0.87)
PROMALS3D (D + S)             0.602         0.264         0.805         0.417
PROMALS3D (F + S)             0.555         0.220         0.779         0.390
PROMALS3D (T + S)             0.540         0.249         0.766         0.412
PROMALS3D (D + F + S)         0.611         0.256         0.812         0.414
PROMALS3D (D + T + S)         0.603         0.264         0.805         0.421
PROMALS3D (F + T + S)         0.595         0.251         0.800         0.413
PROMALS3D (D + F + T + S)     0.616         0.260         0.812         0.420
3DCoffee (D + S)              0.574         0.252         0.802         0.421
3DCoffee (SAP + S)            0.553         0.222         0.786         0.390
Expresso webserver            0.508         0.206         –             –
PROMALS3D (D/2 + S)           0.475         0.198         0.716         0.364
3DCoffee (D/2 + S)            0.261         0.100         0.573         0.294
3DCoffee (D/2 + SAP)          0.255         0.095         0.572         0.289
--------- sequence-only methods ---------
PROMALS                       0.393         0.154         0.665         0.336
SPEM                          0.326         0.124         0.628         0.318
MUMMALS                       0.196         0.081         0.522         0.278
ProbCons                      0.166         0.058         0.485         0.246
MAFFT-linsi                   0.184         0.070         0.510         0.264
MUSCLE                        0.136         0.056         0.433         0.233
T-Coffee                      0.134         0.048         0.429         0.223
ClustalW                      0.127         0.057         0.390         0.221
--------- structure-only methods ---------
MUSTANG                       0.550         0.230         0.779         0.404
PROMALS3D (D)                 0.594         0.252         0.802         0.415

PROMALS and SPEM also use predicted secondary structure
The best scores in a column are bold
D: using DaliLite structural constraints
F: using FAST structural constraints
T: using TM-align structural constraints
S: using sequence information
SAP: using SAP structural alignments
D/2: using DaliLite alignments for half of the sequences
SAP/2: using SAP alignments for half of the sequences

- Q-score max < 1 because the benchmark consists of pairwise alignments that are not completely consistent within groups, whereas the output is a single multiple alignment for the whole group
- unclear (to me) how multiple superposition methods are combined
- DaliLite generally gave the best structure constraints even though it was not used in constructing the reference alignments
- the 3DCoffee and Expresso default is SAP + S; unclear why the Expresso webserver results are worse
- the half-structure tests (D/2) are meant to better represent real-life data, in which not all sequences have relatives of known structure; there, PROMALS3D outperforms 3DCoffee. Similarly, PROMALS outperforms T-Coffee. The authors reason that profile-profile alignments including secondary structure predictions (PROMALS) yield more useful constraints than pairwise sequence alignments (T-Coffee).
- structure-only is much more successful than sequence-only (but of course, no good for sequences without structures)
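
The point about the Q-score maximum being below 1 can be illustrated with a toy check for one kind of inconsistency among pairwise references: if A1 pairs with B1, B1 with C1, but A1 with C2, then any multiple alignment must get at least one of those pairs wrong. The sketch below uses a small union-find to detect that case. The function names and the data layout (residues as `(sequence, position)` tuples) are my own assumptions, and the check ignores other kinds of inconsistency such as residue-order conflicts.

```python
def find(parent, x):
    """Union-find root lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def is_consistent(pairs):
    """Check whether pairwise matches can share columns of one multiple alignment.

    pairs: iterable of ((seq_a, pos_a), (seq_b, pos_b)) reference matches.
    Returns False if transitivity forces two different residues of the same
    sequence into one column.
    """
    parent = {}
    for a, b in pairs:
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        parent[find(parent, a)] = find(parent, b)
    seen = {}
    for residue in parent:
        key = (find(parent, residue), residue[0])
        if seen.setdefault(key, residue) != residue:
            return False
    return True


consistent = [(("A", 1), ("B", 1)), (("B", 1), ("C", 1)), (("A", 1), ("C", 1))]
inconsistent = [(("A", 1), ("B", 1)), (("B", 1), ("C", 1)), (("A", 1), ("C", 2))]
print(is_consistent(consistent), is_consistent(inconsistent))
```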

Do distantly related structures help? (are their alignments to the representatives' profiles sufficiently correct?)

- yes, for > 20% ID
- also, creating a sequence profile for the 3D structure and then applying profile-profile alignment (darker shading) gives a better result

Table 2. Tests on PREFAB database (Q-score results)

Method                        Set 1         Set 2         Set 3         Set 4         All
                              (0.121/420)   (0.185/421)   (0.248/420)   (0.527/421)   (0.270/1682)
PROMALS3D (D + S)             0.817         0.879         0.921         0.954         0.893
PROMALS3D (F + S)             0.745         0.850         0.896         0.947         0.859
PROMALS3D (T + S)             0.766         0.856         0.902         0.950         0.869
PROMALS3D (D + F + S)         0.818         0.886         0.919         0.952         0.894
PROMALS3D (D + T + S)         0.834         0.884         0.922         0.953         0.898
PROMALS3D (F + T + S)         0.794         0.875         0.909         0.952         0.883
PROMALS3D (D + F + T + S)     0.836         0.894         0.917         0.956         0.900
--------- sequence-only methods ---------
PROMALS                       0.570         0.771         0.875         0.946         0.790
SPEM                          0.536         0.756         0.865         0.940         0.774
MUMMALS                       0.457         0.693         0.834         0.939         0.731
ProbCons                      0.428         0.672         0.826         0.936         0.716
MAFFT-linsi                   0.443         0.681         0.826         0.938         0.722
MUSCLE                        0.372         0.631         0.787         0.930         0.680
ClustalW                      0.299         0.536         0.726         0.906         0.617

The 1682 PREFAB alignments are divided into four roughly equal-sized sets according to sequence identity in the reference multiple alignment. The average sequence identity and number of alignments are in parentheses beneath the set names.

- conclusions much the same although the PREFAB tests are "easier"
- as %ID increases, adding structure information has less effect
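
The shrinking benefit of structure information can be read straight off Table 2 by subtracting the PROMALS row from the PROMALS3D (D + S) row for each identity set. The numbers below are copied from the table; the variable names are just for illustration.

```python
# Q-scores from Table 2, Sets 1-4 (ordered by increasing average sequence identity).
avg_identity = [0.121, 0.185, 0.248, 0.527]
promals3d_ds = [0.817, 0.879, 0.921, 0.954]  # PROMALS3D (D + S)
promals = [0.570, 0.771, 0.875, 0.946]       # PROMALS (sequence only)

# Gain in Q-score from adding DaliLite structure constraints, per set.
gains = [round(with_3d - seq_only, 3) for with_3d, seq_only in zip(promals3d_ds, promals)]

for identity, gain in zip(avg_identity, gains):
    print(f"avg identity {identity:.3f}: gain {gain:.3f}")
```

The gain drops from about 0.25 in the lowest-identity set to under 0.01 in the highest, which is the trend noted above.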

More about the server:

- well documented
- many parameters are [optionally] adjustable
- I don't see anything about a maximum number of input sequences
- output includes intermediate results and the final alignment in three formats

03/11/08: I entered PDB IDs 2mnr, 4enl, and 1nu5. The run finished about 10 minutes later and I downloaded an aligned FASTA file; the full results at the PROMALS3D site will be kept for a month. This is not really a test of functionality or speed, but of usability and whether the server actually works (you might be surprised how many don't...). Pairwise sequence identities in the resulting alignment are 13.6%, 14.3%, and 19.6%, and the alignment is correct based on the resulting superposition in Chimera.
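
For anyone who wants to compute pairwise identities from a downloaded aligned FASTA file, here is a minimal sketch. The sequences in the example are made up (they are not the 2mnr/4enl/1nu5 alignment), the function names are my own, and the choice of denominator (columns where both sequences have a residue) is an assumption; other tools may normalise differently and give somewhat different percentages.

```python
from itertools import combinations


def read_aligned_fasta(text):
    """Parse aligned FASTA text into {name: aligned_sequence}."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def percent_identity(a, b, gap="-"):
    """Percent identical residues over columns where neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    aligned = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not aligned:
        return 0.0
    matches = sum(x.upper() == y.upper() for x, y in aligned)
    return 100.0 * matches / len(aligned)


def pairwise_identities(alignment):
    """Percent identity for every pair of sequences in the alignment."""
    return {
        (m, n): percent_identity(alignment[m], alignment[n])
        for m, n in combinations(alignment, 2)
    }


# Made-up toy alignment, for illustration only.
toy = """>seqA
ACD-FG
>seqB
ACE-FG
>seqC
A-DHFG
"""
identities = pairwise_identities(read_aligned_fasta(toy))
for pair, pid in identities.items():
    print(pair, f"{pid:.1f}%")
```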