emma calculates the multiple alignment of
nucleic acid or protein sequences according to the method of J.D.
Thompson, D.G. Higgins, and T.J. Gibson. It is an interface to the
ClustalW distribution.
Here is an example session with emma:
% emma
Input sequence: globins.fasta
Output sequence [hbahum.aln]:
Output file [hbahum.dnd]:
..clustalw17 -infile=5345A -outfile=5345B -align -type=protein ...
CLUSTAL W (1.74) Multiple Sequence Alignments
Sequence type explicitly set to Protein
Sequence format is Pearson
Sequence 1: hbahum 141 aa
Sequence 2: hbbhum 146 aa
Sequence 3: hbghum 146 aa
Sequence 4: hbhagf 148 aa
Sequence 5: hbrlam 149 aa
Sequence 6: mycrhi 151 aa
Sequence 7: myohum 153 aa
Start of Pairwise alignments
Aligning...
Sequences (1:2) Aligned. Score: 41
Sequences (1:3) Aligned. Score: 39
Sequences (1:4) Aligned. Score: 21
Sequences (1:5) Aligned. Score: 27
Sequences (1:6) Aligned. Score: 13
Sequences (1:7) Aligned. Score: 26
Sequences (2:3) Aligned. Score: 73
Sequences (2:4) Aligned. Score: 19
Sequences (2:5) Aligned. Score: 19
Sequences (2:6) Aligned. Score: 15
Sequences (2:7) Aligned. Score: 24
Sequences (3:4) Aligned. Score: 21
Sequences (3:5) Aligned. Score: 21
Sequences (3:6) Aligned. Score: 15
Sequences (3:7) Aligned. Score: 23
Sequences (4:5) Aligned. Score: 41
Sequences (4:6) Aligned. Score: 12
Sequences (4:7) Aligned. Score: 16
Sequences (5:6) Aligned. Score: 17
Sequences (5:7) Aligned. Score: 18
Sequences (6:7) Aligned. Score: 11
Guide tree file created: [5345C]
Start of Multiple Alignment
There are 6 groups
Aligning...
Group 1: Sequences: 2 Score:883
Group 2: Sequences: 2 Score:2344
Group 3: Sequences: 3 Score:934
Group 4: Delayed
Group 5: Sequences: 5 Score:950
Group 6: Delayed
Sequence:7 Score:1046
Sequence:6 Score:986
Alignment Score 1746
GCG-Alignment file created [5345B]
Mandatory qualifiers:
- [-inseqs] (seqall)
-
Sequence database USA.
- [-outseq] (seqoutset)
-
The sequence alignment output filename.
- [-dendoutfile] (outfile)
-
The dendrogram output filename.
Optional qualifiers (bold if not always prompted):
- -onlydend (boolean)
-
Produce only a dendrogram file.
- -dend (boolean)
-
Select if you want to perform alignment using an old dendrogram.
- -dendfile (string)
-
Name of the old dendrogram file.
- -insist (boolean)
-
Insist that the sequence type be changed to protein.
- -slowfast (menu)
-
A distance is calculated between every pair of sequences, then these
distances are used to construct a dendrogram that guides the final
multiple alignment. The scores are calculated from separate pairwise
alignments. These can be calculated using two methods: dynamic
programming (slow but accurate), or the method of Wilbur and
Lipman (extremely fast but approximate). The slow but accurate method
is fine for short sequences, but will be extremely slow for many
(e.g., more than 100) long (e.g., more than 1000 residues)
sequences.
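To see why the slow method scales badly, here is a minimal Python sketch that counts the pairwise alignments and estimates the work involved. It assumes that a full dynamic programming alignment of two sequences costs roughly the product of their lengths; the function name pairwise_workload is invented for this illustration and is not part of emma or ClustalW.

```python
from itertools import combinations


def pairwise_workload(lengths):
    """Return (number of pairs, total dynamic-programming cells).

    Assumption: aligning sequences of length m and n fills about m * n cells.
    """
    pairs = list(combinations(lengths, 2))
    cells = sum(a * b for a, b in pairs)
    return len(pairs), cells


# Lengths of the seven globin sequences from the example session.
globin_lengths = [141, 146, 146, 148, 149, 151, 153]
n_pairs, n_cells = pairwise_workload(globin_lengths)
print(n_pairs)  # 21, one per "Sequences (i:j) Aligned" line in the session

# 100 sequences of 1000 residues each: the case the text warns about.
print(pairwise_workload([1000] * 100))
```

The number of pairs grows with the square of the number of sequences, and the cost of each pair grows with the square of the sequence length, which is why the fast approximate method becomes attractive for large inputs.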
- -pwgapc (float)
-
The penalty for opening a gap in the pairwise alignments.
- -pwgapv (float)
-
The penalty for extending a gap by 1 residue in the pairwise
alignments.
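The interaction between the gap opening and gap extension penalties can be sketched in Python as follows. This assumes one common affine convention in which a gap of n residues costs the opening penalty plus n times the extension penalty; the exact formula used internally by ClustalW may differ, and the penalty values below are illustrative only.

```python
def affine_gap_cost(gap_length, open_penalty, extend_penalty):
    """Cost of one gap under an assumed affine model: open + length * extend."""
    if gap_length <= 0:
        return 0.0
    return open_penalty + extend_penalty * gap_length


# Illustrative values, not emma defaults.
OPEN, EXTEND = 10.0, 0.5

one_long_gap = affine_gap_cost(6, OPEN, EXTEND)
two_short_gaps = 2 * affine_gap_cost(3, OPEN, EXTEND)
print(one_long_gap, two_short_gaps)  # 13.0 23.0
```

With a high opening penalty and a low extension penalty, one long gap is cheaper than several short gaps covering the same number of residues.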
- -pwmatrix (menu)
-
A scoring table that describes the similarity of each amino acid to
every other. There are three built-in series of weight matrices
offered. Each consists of several matrices that work differently at
different evolutionary distances. For details, read the
documentation. Crudely, we store several matrices in memory, spanning
the full range of amino acid distance (from almost identical
sequences to highly divergent ones). For very similar sequences, it
is best to use a strict weight matrix which gives a high score only
to identities and the most favoured conservative substitutions. For
more divergent sequences, it is appropriate to use "softer" matrices
that give a high score to many other frequent substitutions.
BLOSUM (Henikoff). These matrices appear to be the best available for carrying out database similarity (homology) searches. The matrices used are: BLOSUM 80, 62, 45 and 30.
PAM (Dayhoff). These have been extremely widely used since the late 1970s. We use the PAM 120, 160, 250 and 350 matrices.
GONNET. These matrices were derived using almost the same procedure as the Dayhoff one (above) but are much more up to date and are based on a far larger data set. They appear to be more sensitive than the Dayhoff series. We use the GONNET 40, 80, 120, 160, 250 and 350 matrices.
We also supply an identity matrix which gives a score of 1.0 to two identical amino acids and a score of zero otherwise. This matrix is not very useful.
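The identity matrix is simple enough to write out directly. The Python sketch below scores two already-aligned sequences with it; the function names and the choice to skip gap columns are assumptions made for the illustration, not a description of ClustalW internals.

```python
def identity_score(a, b):
    """Identity matrix: 1.0 for two identical residues, 0.0 otherwise."""
    return 1.0 if a == b else 0.0


def score_aligned_pair(seq_a, seq_b, gap="-"):
    """Sum identity scores over aligned columns, skipping columns with a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    return sum(
        identity_score(x, y)
        for x, y in zip(seq_a, seq_b)
        if x != gap and y != gap
    )


print(score_aligned_pair("ACDE", "ACDF"))  # 3.0
```

Because every substitution scores zero, a conservative replacement is treated exactly like a radical one, which is why this matrix carries little information for divergent proteins.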
- -pwdnamatrix (menu)
-
A scoring table that describes the scores assigned to matches and
mismatches (including IUB ambiguity codes).
- -pairwisedata (string)
-
Filename of user pairwise matrix.
- -ktup (integer)
-
This is the size of the exact matching fragment. Increase for speed
(maximum is 2 for proteins, 4 for DNA); decrease for sensitivity. For
longer sequences (e.g., more than 1000 residues), you may need to
increase the default.
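The effect of the fragment size can be illustrated with a small Python sketch that lists the exact k-tuple matches between two sequences. It is a simplified illustration of word matching in general, not the ClustalW code, and the name ktuple_matches is invented here.

```python
from collections import defaultdict


def ktuple_matches(seq_a, seq_b, k):
    """Return (i, j) positions where seq_a[i:i+k] equals seq_b[j:j+k]."""
    # Index every k-tuple of seq_b by its start positions.
    index = defaultdict(list)
    for j in range(len(seq_b) - k + 1):
        index[seq_b[j:j + k]].append(j)

    matches = []
    for i in range(len(seq_a) - k + 1):
        for j in index.get(seq_a[i:i + k], ()):
            matches.append((i, j))
    return matches


print(len(ktuple_matches("ABCAB", "CABX", 1)))  # 5
print(ktuple_matches("ABCAB", "CABX", 2))       # [(0, 1), (2, 0), (3, 1)]
```

A larger k yields fewer matches to process, which is faster, but short regions of similarity can be missed.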
- -gapw (integer)
-
A penalty for each gap in the fast alignments. It has little effect
on the speed or sensitivity except in the case of extreme values.
- -topdiags (integer)
-
The number of k-tuple matches on each diagonal (in an imaginary dot
matrix plot) is calculated. Only the best ones (those with the most
matches) are used in the alignment. Decrease for speed; increase for
sensitivity.
- -window (integer)
-
This is the number of diagonals around each of the best diagonals
that will be used. Decrease for speed; increase for sensitivity.
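The roles of the best diagonals and the window around them can be sketched in Python as below. A match at position i in one sequence and j in the other lies on diagonal i - j of the imaginary dot matrix. This is a simplified illustration under that definition; the function name best_diagonals and its parameters are hypothetical and do not reproduce the ClustalW implementation.

```python
from collections import Counter


def best_diagonals(seq_a, seq_b, k, top_diags, window):
    """Return (best diagonals, all diagonals kept once the window is added)."""
    # Index the k-tuples of seq_b by start position.
    positions = {}
    for j in range(len(seq_b) - k + 1):
        positions.setdefault(seq_b[j:j + k], []).append(j)

    # Count k-tuple matches on each diagonal (i - j).
    counts = Counter()
    for i in range(len(seq_a) - k + 1):
        for j in positions.get(seq_a[i:i + k], ()):
            counts[i - j] += 1

    # Keep only the diagonals with the most matches.
    best = [diag for diag, _ in counts.most_common(top_diags)]

    # Widen each best diagonal by `window` diagonals on either side.
    band = set()
    for diag in best:
        band.update(range(diag - window, diag + window + 1))
    return best, sorted(band)


print(best_diagonals("XXACDEFG", "ACDEFG", k=2, top_diags=1, window=1))
# ([2], [1, 2, 3])
```

Fewer best diagonals and a narrower window mean less of the dot matrix is examined, which is faster but can miss the true alignment path.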
- -nopercent (boolean)
-
In fast pairwise alignment, suppresses the percentage similarity
score.
- -matrix (menu)
-
This gives a menu where you are offered a choice of weight matrices.
The default for proteins is the PAM series derived by Gonnet and
colleagues. Note that a series is used! The matrix used depends on
the similarity of the sequences to be aligned at this alignment
step. There are three built-in series of weight matrices offered.
Each consists of several matrices that work differently at different
evolutionary distances. For details, read the documentation. Crudely,
we store several matrices in memory, spanning the full range of amino
acid distance (from almost identical sequences to highly divergent
ones). For very similar sequences, it is best to use a strict weight
matrix which gives a high score only to identities and the most
favoured conservative substitutions. For more divergent sequences, it
is appropriate to use "softer" matrices that give a high score to
many other frequent substitutions.
BLOSUM (Henikoff). These matrices appear to be the best available for carrying out database similarity (homology) searches. The matrices used are: BLOSUM 80, 62, 45 and 30.
PAM (Dayhoff). These have been widely used since the late 1970s. We use the PAM 120, 160, 250 and 350 matrices.
GONNET. These matrices were derived using almost the same procedure as Dayhoff (above), but are much more up to date and are based on a much larger data set. They appear to be more sensitive than the Dayhoff series. We use the GONNET 40, 80, 120, 160, 250 and 350 matrices.
We also supply an identity matrix which gives a score of 1.0 to two identical amino acids and a score of zero otherwise. This matrix is not very useful. Alternatively, you can read in your own (just one matrix, not a series).
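The idea that a series, rather than a single matrix, is used can be sketched in Python as a lookup from similarity to matrix. The cutoff values below are hypothetical, chosen only to illustrate the mechanism; the real switching points are defined inside ClustalW and are not stated on this page.

```python
# Hypothetical cutoffs: (minimum percent identity, matrix name).
# Only the matrix names come from the text above; the numbers are made up.
BLOSUM_SERIES = [
    (80.0, "BLOSUM80"),
    (60.0, "BLOSUM62"),
    (30.0, "BLOSUM45"),
    (0.0, "BLOSUM30"),
]


def pick_matrix(percent_identity, series=BLOSUM_SERIES):
    """Return the strictest matrix whose cutoff the similarity reaches."""
    for cutoff, name in series:
        if percent_identity >= cutoff:
            return name
    return series[-1][1]


for similarity in (95, 65, 20):
    print(similarity, pick_matrix(similarity))
```

Closely related sequences get a strict matrix and divergent sequences a softer one, so the choice can change from one alignment step to the next.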
- -dnamatrix (menu)
-
Provides a menu containing a submenu in which a single matrix (not a
series) can be selected.
- -mamatrix (string)
-
Filename of user multiple alignment matrix.
- -gapc (float)
-
Penalty for opening a gap in the alignment. Increasing the gap
opening penalty will make gaps less frequent.
- -gapv (float)
-
Penalty for extending a gap by 1 residue. Increasing the gap
extension penalty makes gaps shorter. Terminal gaps are not
penalized.
- -[no]endgaps (boolean)
-
"End gap separation" treats end gaps as internal gaps for the
purposes of avoiding gaps that are too close (set by "gap separation
distance"). If you turn this off, end gaps will be ignored. This is
useful when you want to align fragments where the end gaps are not
biologically meaningful.
- -gapdist (integer)
-
"Gap separation distance" tries to decrease the chances of gaps being
too close. Gaps that are less than this distance apart are penalized
more than other gaps. This does not prevent close gaps; it only makes
them less frequent, resulting in alignments that have a blocklike
appearance.
- -norgap (boolean)
-
"Residue specific penalties" are amino acid-specific gap penalties
that reduce or increase the gap opening penalties at each position in
the alignment or sequence. As an example, positions that are rich in
glycine are more likely to have an adjacent gap than positions that
are rich in valine.
- -hgapres (string)
-
The set of residues considered hydrophilic. It is used when
applying hydrophilic gap penalties.
- -nohgap (boolean)
-
"Hydrophilic gap penalties" are used to increase the chances of a gap
within a run (5 or more residues) of hydrophilic amino acids; these
are likely to be loop or random coil regions where gaps are more
common. The residues that are considered hydrophilic are set by
-hgapres.
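Finding the runs in which gaps would be favoured can be sketched in Python as follows. The residue set GPSNDQEKR is used here as an example value and is an assumption; check the -hgapres default of your installation. The function name hydrophilic_runs is invented for this illustration.

```python
def hydrophilic_runs(sequence, hydrophilic="GPSNDQEKR", min_run=5):
    """Return (start, end) half-open spans of hydrophilic runs of min_run or more."""
    runs = []
    start = None
    for i, residue in enumerate(sequence):
        if residue in hydrophilic:
            if start is None:
                start = i  # a new run begins here
        else:
            if start is not None and i - start >= min_run:
                runs.append((start, i))
            start = None
    # Handle a run that reaches the end of the sequence.
    if start is not None and len(sequence) - start >= min_run:
        runs.append((start, len(sequence)))
    return runs


print(hydrophilic_runs("AVLGSNDKEWLLGPS"))  # [(3, 9)]
```

In the example, the six-residue stretch GSNDKE qualifies, while the three-residue stretch GPS at the end is too short to count as a run.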
- -maxdiv (integer)
-
This switch delays the alignment of the most distantly related
sequences until after the most closely related sequences are aligned.
The setting shows the percent identity level required to delay the
addition of a sequence.
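The delay rule can be sketched in Python using the pairwise scores from the example session. Three things are assumptions here: the scores are treated as percent identities, a sequence is delayed when its best score against every other sequence falls below the threshold, and the threshold of 30 is an illustrative value, not one stated on this page.

```python
def delayed_sequences(scores, n_sequences, threshold):
    """Return sequences whose best pairwise score is below the threshold."""
    best = {seq: 0 for seq in range(1, n_sequences + 1)}
    for (a, b), score in scores.items():
        best[a] = max(best[a], score)
        best[b] = max(best[b], score)
    return sorted(seq for seq, top in best.items() if top < threshold)


# Pairwise scores copied from the example session.
session_scores = {
    (1, 2): 41, (1, 3): 39, (1, 4): 21, (1, 5): 27, (1, 6): 13, (1, 7): 26,
    (2, 3): 73, (2, 4): 19, (2, 5): 19, (2, 6): 15, (2, 7): 24,
    (3, 4): 21, (3, 5): 21, (3, 6): 15, (3, 7): 23,
    (4, 5): 41, (4, 6): 12, (4, 7): 16,
    (5, 6): 17, (5, 7): 18,
    (6, 7): 11,
}

print(delayed_sequences(session_scores, 7, threshold=30))  # [6, 7]
```

Under these assumptions the sketch picks out sequences 6 and 7, which are also the two sequences added last in the example session.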
Advanced qualifiers:
- -prot (boolean)
-
Do not change this value.