12.1 Common Themes
Many EMBOSS programs have functionality in common. They all
understand the same sorts of sequence addresses, sequence formats,
output formats, and feature formats. The following sections describe
some common themes in EMBOSS.
12.1.1 Uniform Sequence Address
The Uniform Sequence Address (USA) is a standard sequence naming scheme used by all EMBOSS applications.
The USA syntax takes one of several forms, including "file", "format::file", "dbname:entry", and "@listfile".
The "::" and ":" syntax allows, for example, "embl" and "pir" to be both database names and sequence formats. In addition, EMBOSS allows the command line to define the format and the entry name separately, so that only the filename is required.
The "file" and "dbname" forms of USA may have "format::" in front of them, but because a database is aware of the format, prefixing a database name in this way is redundant and not recommended.
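The way "::" and ":" divide a USA can be sketched in a few lines of Python. This is only an illustration of the syntax described above, not how EMBOSS itself parses addresses: the names UsaParts and split_usa are hypothetical, and the sketch assumes any subsequence specifier, "@" prefix, or trailing pipe has already been removed.

```python
from typing import NamedTuple, Optional


class UsaParts(NamedTuple):
    fmt: Optional[str]    # text before "::", if present
    name: str             # file name or database name
    entry: Optional[str]  # text after a single ":", if present


def split_usa(usa: str) -> UsaParts:
    """Split a USA into format, name, and entry parts (illustrative only)."""
    fmt: Optional[str] = None
    rest = usa
    # "::" separates an explicit format name from the rest of the address.
    if "::" in rest:
        fmt, rest = rest.split("::", 1)
    # A single ":" separates a file or database name from an entry name.
    name, colon, entry = rest.partition(":")
    return UsaParts(fmt, name, entry if colon else None)
```

With this sketch, split_usa("embl::embl:paamir") yields the format "embl", the name "embl", and the entry "paamir", which shows how the two separators let one word serve as both a format and a database name.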
Any USA may optionally take a subsequence specifier after the main body of the USA, in the form "[start:end]" or "[start:end:r]", where start and end are the required start and end positions and r requests the reversed sequence. Negative positions count from the end of the sequence. Using this USA subsequence specifier is equivalent to using the -sbegin, -send, and -sreverse command-line qualifiers.
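The effect of a subsequence specifier can be illustrated with a short Python sketch. It rests on several assumptions that the text does not spell out: positions are taken to be 1-based and inclusive, -1 is taken to mean the last residue, and "r" is taken to mean the reverse complement of a nucleotide sequence. The function name apply_subsequence is hypothetical.

```python
import re

# Complement table for unambiguous nucleotide codes only (an assumption of
# this sketch; ambiguity codes are left untouched).
_COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")

# Matches a trailing "[start:end]" or "[start:end:r]" specifier.
_SPEC = re.compile(r"\[\s*(-?\d+)\s*:\s*(-?\d+)\s*(?::\s*(r)\s*)?\]$")


def apply_subsequence(sequence: str, usa: str) -> str:
    """Return the part of sequence selected by the specifier in usa."""
    match = _SPEC.search(usa)
    if match is None:
        return sequence  # no specifier: the whole sequence is used
    start, end = int(match.group(1)), int(match.group(2))
    length = len(sequence)
    # Assumed convention: negative positions count back from the end,
    # so -1 is the last residue.
    if start < 0:
        start = length + start + 1
    if end < 0:
        end = length + end + 1
    piece = sequence[start - 1:end]  # 1-based, inclusive
    if match.group(3):
        # Assumed meaning of "r": reverse complement.
        piece = piece.translate(_COMPLEMENT)[::-1]
    return piece
```

Under these assumptions, applying "xxx.seq[2:5]" to the sequence acgtacgtac selects cgta, and "xxx.seq[-3:-1]" selects its last three bases.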
Table 12-1 contains some USA examples.
Table 12-1. EMBOSS Uniform Sequence Address (USA) examples
filename | xxx.seq | A sequence file xxx.seq in any format.
format::filename | fasta::xxx.seq | A sequence file xxx.seq in FASTA format.
db:IDname | embl:paamir | EMBL entry PAAMIR, using whatever access method is defined locally for the EMBL database.
db:AccessionNumber | embl:X13776 | EMBL entry X13776, using whatever access method is defined locally for the EMBL database. Search by accession number and entry name. X13776 is the accession number in this case.
db-acc:AccessionNumber | embl-acc:X13776 | EMBL entry X13776, using whatever access method is defined locally for the EMBL database. Search by accession number only.
db-id:IDname | embl-id:paamir | EMBL entry PAAMIR, using whatever access method is defined locally for the EMBL database. Search by ID only.
db-searchfield:word | embl-des:lectin | EMBL entries containing the word "lectin" in the Description line.
db-searchfield:wcardword | embl-org:*human* | EMBL entries containing the wildcarded word "human" in the Organism fields.
db:wildcard-ID | embl:paami* | EMBL entries PAAMIB, PAAMIE and so on, usually in alphabetical order, using whatever access method is defined locally for the EMBL database.
db or db:* | embl or EMBL:* | All sequences in the EMBL database.
@listfile | @mylist | Reads file mylist and uses each line as a separate USA. List files can contain references to other list files or any other standard USA.
list:listfile | list:mylist | Same as @mylist.
"program parameters |" | "getz -e [embl-id:paamir] |" | The trailing pipe character "|" causes EMBOSS to fire up getz (the SRS sequence retrieval program) to extract entry PAAMIR from EMBL in EMBL format. Any application or script that writes one or more sequences to stdout can be used in this way.
asis::sequence | asis::atacgcagttatctgaccat | So far, the shortest USA we could invent. In "asis" format the name is the sequence, so no file needs to be opened. This is a special case. It was intended as a joke, but could be quite useful for generating command lines.
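Because a list file may name other list files, reading one is naturally recursive. The following Python sketch shows one way to expand a list file into a flat list of USAs. It is not taken from EMBOSS: the function name expand_list_file is hypothetical, blank lines are assumed to be skipped, and nested list files are assumed to be located relative to the file that names them.

```python
from pathlib import Path
from typing import List, Optional, Set


def expand_list_file(path: str, _active: Optional[Set[Path]] = None) -> List[str]:
    """Return the USAs named by a list file, following nested list files."""
    active = set() if _active is None else _active
    list_path = Path(path).resolve()
    if list_path in active:
        raise ValueError(f"list file {list_path} refers back to itself")
    active.add(list_path)
    usas: List[str] = []
    try:
        for line in list_path.read_text().splitlines():
            usa = line.strip()
            if not usa:
                continue  # assumption: blank lines are skipped
            nested = None
            if usa.startswith("@"):
                nested = usa[1:]
            elif usa.startswith("list:"):
                nested = usa[len("list:"):].lstrip(":")
            if nested is None:
                usas.append(usa)  # an ordinary USA
            else:
                # Assumption: nested names are relative to the including file.
                usas.extend(expand_list_file(str(list_path.parent / nested), active))
    finally:
        active.discard(list_path)
    return usas
```

The guard against a list file that includes itself is a design choice of this sketch; without it, a circular reference would recurse until Python's recursion limit is reached.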
12.1.2 Sequence Formats
You can specify the format to use on input by giving the format name, followed by two colons, before the name of the file holding your sequences. For example:
embl::myfile.seq
The format is not required. When reading in a sequence, EMBOSS will
guess the sequence format by trying all known formats until one
succeeds.
When writing out a sequence, EMBOSS will use FASTA format by default.
You can specify another format to use, for example:
gcg::myresults.seq
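The "try each format until one succeeds" strategy can be sketched in Python as follows. The three parsers here are deliberately crude stand-ins, and their order is an assumption of this example rather than the order EMBOSS uses. All function names are hypothetical. The GCG 8.x rule is the one given in the format tables in this section: everything up to the first line containing ".." is heading.

```python
from typing import Callable, List, Optional, Tuple

Parser = Callable[[str], Optional[str]]


def parse_fasta_like(text: str) -> Optional[str]:
    """Accept input whose first line starts with '>'."""
    lines = text.splitlines()
    if not lines or not lines[0].startswith(">"):
        return None
    return "".join(line.strip() for line in lines[1:])


def parse_gcg8_like(text: str) -> Optional[str]:
    """Treat everything up to the first line containing '..' as heading."""
    lines = text.splitlines()
    for index, line in enumerate(lines):
        if ".." in line:
            body = "".join(lines[index + 1:])
            # Assumption: position numbers and spacing are not sequence data.
            return "".join(ch for ch in body if not (ch.isspace() or ch.isdigit()))
    return None


def parse_plain(text: str) -> Optional[str]:
    """Fallback: the whole input is the sequence."""
    return text


PARSERS: List[Tuple[str, Parser]] = [
    ("fasta", parse_fasta_like),
    ("gcg8", parse_gcg8_like),
    ("text", parse_plain),
]


def guess_format(text: str) -> Tuple[str, str]:
    """Try each parser in turn and return the first format that succeeds."""
    for name, parser in PARSERS:
        sequence = parser(text)
        if sequence is not None:
            return name, sequence
    raise ValueError("no known format matched")
```

Placing the most permissive parser last matters: a plain-text reader accepts anything, so it would mask every other format if it were tried first.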
12.1.2.1 Input sequence formats
To date, the sequence formats in Table 12-2 are accepted as input. By default (i.e., when no format is explicitly specified), EMBOSS tries each format in turn until one succeeds.
Table 12-2. EMBOSS input sequence formats
abi | ABI trace file format. This is the format of the files produced by ABI sequencing machines. It contains the trace data, i.e., the probabilities of the four bases along the sequencing run, together with the sequence as deduced from that data. The sequence information is what is normally read in and used by EMBOSS programs, although the trace data is available and may be utilized by some specialized EMBOSS programs. The code for this is heavily based on David Mathog's Fortran library with a description of ABI trace file format (abi.txt): ftp://saf.bio.caltech.edu/pub/software/molbio/abitools.zip.
acedb | ACeDB format.
clustal, aln | ClustalW ALN (multiple alignment) format.
codata | CODATA format.
dbid | Odd FASTA format with the database name first, followed by the ID name and an optional accession number, e.g., ">database name description" or ">database name accession description".
embl, em | EMBL entry format, or at least a minimal subset of the fields. The Staden package and others use EMBL or similar formats for sequence data.
fasta, pearson | FASTA format with an optional accession number after the sequence identifier, e.g., ">name description" or ">name accession description", and with an optional database name in GCG-style FASTA format included as part of the sequence identifier, e.g., ">database:name accession description".
gcg, gcg8 | GCG 9.x and 10.x format with the format and sequence type identified on the first line of the file. GCG 8.x format, where anything up to the first line containing ".." is considered as heading, and the remainder is sequence data.
genbank, gb, ddbj | GENBANK entry format, or at least a minimal subset of the fields.
gff | GFF format.
hennig86 | Hennig86 format.
ig | IntelliGenetics format.
jackknifer | Jackknifer format.
jackknifernon | Jackknifernon format.
nbrf, pir | NBRF (PIR) format, as used in the PIR database sequence files.
nexus, paup | Nexus/PAUP format.
nexusnon, paupnon | Nexusnon/PAUPnon format.
treecon | Treecon format.
mega | Mega format.
meganon | Meganon format.
msf | Wisconsin Package GCG's MSF multiple sequence format.
ncbi | FASTA format with optional accession number and database name in NCBI style included as part of the sequence identifier, e.g., ">database|accession|id description" (and other variants on this theme!).
pfam, stockholm | Pfam format.
phylip | PHYLIP interleaved multiple alignment format.
selex | SELEX format is used by Sean Eddy's HMMER package. It can store RNA secondary structure as part of the sequence annotation.
staden, experiment | The experiment file format used by the gap program in the Staden package, where the sequence identifier is optional and the remainder is plain text. Some alternative nucleotide ambiguity codes are used and must be converted.
strider | DNA Strider format.
swissprot, swiss, sw | SWISS-PROT entry format, or at least a minimal subset of the fields.
text, plain | Plain text. This is the format with no format. The whole of the file is read in as a sequence. No attempt is made to parse the file contents in any way. Anything is acceptable in this format. This means that any character will be included in the sequence, even digits and punctuation. Use this format only when you are sure that the input sequence file is correct and contains only what you want to be considered as your sequence.
raw | Similar to text or plain format. However, raw removes any whitespace or digits, accepts only alphabetic characters, and rejects anything else. This format is safer than plain format. Digits, spaces, and TAB characters are removed and ignored. If a sequence contains other non-alphabetic characters (e.g., punctuation characters), it is rejected as erroneous.
asis | Not a sequence format, but a quick way of entering a sequence on the command line. It is included here for completeness. In "asis" format, the actual sequence appears where a filename would normally be given, e.g., asis::atacgcagttatctgacc. In "asis" format the name is the sequence, so no file needs to be opened. This is a special case. It was intended as a joke, but could be quite useful for generating command lines.
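The difference between the text (plain) and raw formats described in Table 12-2 is easy to show in code. The sketch below follows the table's wording: text keeps every character, while raw drops digits and whitespace and rejects anything else that is not a letter. The function names are hypothetical, and restricting "alphabetic" to ASCII letters is an assumption of this example.

```python
def read_text_format(data: str) -> str:
    """Plain text: every character, including digits and punctuation, is kept."""
    return data


def read_raw_format(data: str) -> str:
    """Raw: drop digits and whitespace, reject any other non-letter."""
    kept = []
    for ch in data:
        if ch.isdigit() or ch.isspace():
            continue  # removed and ignored
        if not (ch.isascii() and ch.isalpha()):
            raise ValueError(f"unexpected character {ch!r} in raw sequence")
        kept.append(ch)
    return "".join(kept)
```

Given the input "1 acgt acgt 8", the text reader returns the string unchanged, position numbers included, whereas the raw reader returns "acgtacgt". An input such as "acgt-acgt" passes through the text reader but is rejected by the raw reader, which is why raw is the safer choice.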
12.1.2.2 Output sequence formats
To date, the sequence formats in Table 12-3 are available as output. Some sequence formats can hold multiple sequences in one file; these are marked as multiple in the table. Formats such as GCG, plain, and staden can hold only one sequence per file and are marked as single.
Table 12-3. EMBOSS output sequence formats
gcg, gcg8 | single | Wisconsin Package GCG 9.x and 10.x format with the sequence type on the first line of the file. GCG 8.x format, where anything up to the first line containing ".." is considered as heading, and the remainder is sequence data.
embl, em | multiple | EMBL entry format with available fields filled in and others with no information omitted. The EMBOSS command line allows missing data such as accession numbers to be provided if they are not obtainable from the input sequence.
swiss, sw | multiple | SWISS-PROT entry format with available fields filled in and others with no information omitted. The EMBOSS command line allows missing data such as accession numbers to be provided if they are not obtainable from the input sequence.
fasta, pearson | multiple | Standard Pearson FASTA format, but with the accession number included after the identifier if available.
ncbi | multiple | NCBI-style FASTA format with the database name, entry name, and accession number separated by pipe ("|") characters.
nbrf, pir | multiple | NBRF (PIR) format, as used in the PIR database sequence files.
genbank, gb | multiple | GENBANK entry format with available fields filled in and others with no information omitted. The EMBOSS command line allows missing data such as accession numbers to be provided if they are not obtainable from the input sequence.
gff | multiple | GFF format.
ig | multiple | IntelliGenetics format, as used by the IntelliGenetics package.
codata | multiple | CODATA format.
strider | multiple | DNA Strider format.
acedb | multiple | ACeDB format.
staden, experiment | single | The experiment file format used by the gap program in the Staden package. Some alternative nucleotide ambiguity codes are used and are converted.
text, plain, raw | single | Plain sequence, no annotation or heading.
fitch | multiple | Fitch format.
msf | multiple | Wisconsin Package GCG's MSF multiple sequence format.
clustal, aln | multiple | Clustal multiple sequence format.
selex | multiple | SELEX format.
phylip | multiple | PHYLIP interleaved format.
phylip3 | multiple | PHYLIP non-interleaved format that was used in Phylip version 3.2.
asn1 | multiple | A subset of ASN.1 containing entry name, accession number, description, and sequence, similar to the current ASN.1 output of Readseq.
hennig86 | multiple | Hennig86 format.
mega | multiple | Mega format.
meganon | multiple | Meganon format.
nexus, paup | multiple | Nexus/PAUP format.
nexusnon, paupnon | multiple | Nexusnon/PAUPnon format.
jackknifer | multiple | Jackknifer format.
jackknifernon | multiple | Jackknifernon format.
treecon | multiple | Treecon format.
debug | multiple | EMBOSS sequence object report for debugging, showing all available fields. Not all fields will contain data; this depends very much on the input format used.
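As an illustration of the default output style, the Python sketch below writes a FASTA record with the accession number placed after the identifier when one is available, as the fasta entry in Table 12-3 describes. The function name format_fasta and the 60-character line width are assumptions of this example, not documented EMBOSS behavior.

```python
from typing import Optional


def format_fasta(name: str, sequence: str, accession: Optional[str] = None,
                 description: str = "", width: int = 60) -> str:
    """Build one FASTA record: '>name [accession] [description]' plus sequence."""
    header = ">" + name
    if accession:
        header += " " + accession  # accession follows the identifier
    if description:
        header += " " + description
    # Wrap the sequence into fixed-width lines (width is an assumed default).
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([header] + lines) + "\n"
```

Because FASTA is marked as multiple, records built this way can simply be concatenated to hold several sequences in one file.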
12.1.3 Alignment Formats
When writing out an alignment between
two or more sequences, EMBOSS now uses a standard set of formats.
12.1.3.1 Multiple sequence alignment formats
Table 12-4 contains details about the current set
of multiple sequence alignment formats available in
EMBOSS.
Table 12-4. EMBOSS multiple sequence alignment formats
unknown, multiple, simple | These are synonyms for simple format. This format displays the sequence names, positions, and sequences, then puts the markup line underneath the sequences. When only two sequences are being aligned, the format is changed to that produced by pair.
fasta | This is the standard FASTA sequence format with gaps, where many sequences are concatenated one after the other.
msf | This is the standard MSF sequence format.
trace | This is a special verbose format for use in debugging. It is not intended for normal users.
srs | This shows the sequence ID name, the sequence position, the sequence, and the sequence position for each line.
12.1.3.2 Pairwise sequence alignment formats
Table 12-5 contains details about the current set
of pairwise sequence alignment formats available in
EMBOSS.
Table 12-5. EMBOSS pairwise sequence alignment formats
pair | This is the default format used when there are only two sequences. When simple format is selected but there are only two sequences, this format is used. The sequences have the markup line between them.
markx0 | This is the standard default output format used by Bill Pearson's suite of FASTA programs.
markx1 | This is an alternative output format used by Bill Pearson's suite of FASTA programs in which identities are not marked. Instead, conservative replacements are denoted by "x" and non-conservative substitutions by "X".
markx2 | This is an alternative output format used by Bill Pearson's suite of FASTA programs in which the residues in the second sequence are only shown if they are different from the first.
markx3 | This is an alternative output format used by Bill Pearson's suite of FASTA programs in which the aligned sequences are displayed in FASTA sequence format. These can be used to build a primitive multiple alignment.
markx10 | This is an alternative output format used by Bill Pearson's suite of FASTA programs in which the aligned sequences are displayed in FASTA sequence format and the sequence length, alignment start, and stop information is given in lines starting with a ";" character just after the title line for each sequence. It is intended to be easily parsed by other programs.
srspair | This is very similar in style to pair format.
score | This does not display the sequence alignment. It shows only the names of the sequences, the length of the alignment, and the score.
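The idea of a markup line placed between two aligned sequences, as in pair format, can be sketched as follows. This is a simplified illustration rather than the actual EMBOSS layout: it marks only exact identities with "|", treats "-" as the gap character, and omits positions and line wrapping. The function names are hypothetical.

```python
def pair_markup(first: str, second: str, gap: str = "-") -> str:
    """Return a markup line with '|' under each identical, non-gap column."""
    if len(first) != len(second):
        raise ValueError("aligned sequences must have the same length")
    return "".join(
        "|" if a == b and a != gap else " "
        for a, b in zip(first, second)
    )


def format_pair(name1: str, seq1: str, name2: str, seq2: str) -> str:
    """Lay out two aligned sequences with the markup line between them."""
    width = max(len(name1), len(name2))
    return "\n".join([
        f"{name1:<{width}} {seq1}",
        f"{'':<{width}} {pair_markup(seq1, seq2)}",
        f"{name2:<{width}} {seq2}",
    ])
```

A fuller version would also distinguish conservative from non-conservative substitutions, which requires a scoring matrix and is left out here.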
12.1.4 Feature Formats
When reading or writing features associated with a sequence, a standard set of formats is used. Features can be held either in a standard sequence format that includes a feature table, or in a separate file without the associated sequence.
Table 12-6 contains details about the current set
of feature formats available in EMBOSS.
Table 12-6. EMBOSS feature formats
embl, em | The format used by the EMBL nucleotide sequence database.
gff | The General Feature Format defined by the Sanger Centre.
swissprot, swiss, sw | The format used by the SWISS-PROT protein database. The feature table keys are also defined.
pir | The format used by the PIR protein database.
nbrf | Only available for input; the same as PIR format.
12.1.5 Report Formats
There are many ways in which the results of an analysis can be reported. Many EMBOSS programs are now able to output their results in a standard report format. You can change the report format used by putting -rformat name on the command line, where name is the name of one of the standard report formats.
Table 12-7 contains examples of
garnier analyzing sw:100K_rat output in various
report formats.
Table 12-7. EMBOSS report formats
embl | Writes a report in EMBL feature table format.
genbank | Writes a report in GenBank feature table format.
gff | Writes a report in GFF feature table format.
pir | Writes a report in PIR feature table format.
swiss | Writes a report in SWISS-PROT feature table format.
trace | Of use only for debugging.
listfile | Writes out a list file with the start and end points of the motifs given by "[start:end]" after the sequence's full USA. This is useful because it is a true list file that can be read in by other EMBOSS programs using "@" or "list::" before the filename.
dbmotif | Writes a report in DbMotif format. Data reported: Length, Start, End, Sequence (5 bases around feature).
    Format:
        Length = [length]
        Start = position [start] of sequence
        End = position [end] of sequence
        ... other tags ...
        [sequence]
        [start and end numbered below sequence with '|' marks]
        Blank line
diffseq | This format is most useful when reporting the results of two aligned sequences, as in the program diffseq. The report describes matches, usually short, between two sequences and the features that overlap them.
    Format:
        [Sequence 1 Name] [start]-[end] Length: [length]
        Feature: first sequence feature(s)
        Sequence: motif in sequence 1
        Sequence: motif in sequence 2
        Feature: second sequence feature(s)
        [Sequence 2 Name] [start]-[end] Length: [length]
        Blank line
excel | A TAB-delimited table format suitable for reading into spreadsheet programs such as Excel. Name, start, end, and score are always reported. Other tags in the report definition are added as extra columns. All values are (for now) unquoted. Missing values are reported as ".".
feattable | Writes a report in FeatTable format. The report is an EMBL feature table using only the tags in the report definition. There is no requirement for tag names to match standards for the EMBL feature table. The original EMBOSS application for this format was cpgreport. Data reported: Type, Start, End.
    Format:
        FT [type] [start]..[end]
        FT /[tagname]=[tagvalue]
        Blank line
motif | Writes a report in Motif format. Based on the original output format of antigenic, helixturnhelix, and sigcleave. Data reported: Name, Start, End, Length, Score, Sequence.
    Format:
        (1) Score [score] length [length] at [name] [start]->[end]
        * (marked at position pos)
        [sequence]
        | |
        [start] [end]
        [tagname]: tagvalue
regions | Writes a report in Regions format. The report (unusually for the current report formats) includes the feature type. Data reported: Type, Start, End, Length, Name.
    Format:
        [type] from [start] to [end] ([length] [name]) ([tagname]: [tagvalue], [tagname]: [tagvalue] ...)
seqtable | Writes a report in SeqTable format. This is a simple table format that includes the feature sequence. See the "table" entry for a version without the sequence. Missing tag values are reported as ".". The column width is 6, or longer if the name is longer.
    Format:
        Start End [tagnames] Sequence
        [start] [end] [tagvalues] [sequence]
simple | Writes a report in SRS simple format. This is a simple parsable format that does not include the feature sequence (see also SRS format), for applications where features can be large. Missing tag values are reported as ".".
    Format:
        Feature [number]
        Name: [ID name]
        Start: [start]
        End: [end]
        Length: [length]
        [tagnames:] [tag values]
        Blank line
srs | Writes a report in SRS format. This is a simple parsable format that includes the feature sequence. Missing tag values are reported as ".".
    Format:
        Feature [number]
        Name: [ID name]
        Start: [start]
        End: [end]
        Length: [length]
        Sequence: [sequence]
        Score: [score]
        [tagnames:] [tag values]
        Blank line
table | Writes a report in Table format. See the "seqtable" entry for a version with the sequence. Missing tag values are reported as ".". The column width is 6, or longer if the name is longer.
    Format:
        USA Start End Score [tagnames]
        [name] [start] [end] [score] [tagvalues]
tagseq | Writes a report in Tagseq format. Features are marked up below the sequence. Originally developed for the garnier application, this format also has general uses. If the tag value is a 1-letter code, it is used in place of "+".
    Format:
        Sequence position written every 10 bases/residues
        Sequence (50 residues)
        tagname ++++++++++++ +++++++++
        Blank line
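The excel report format is simple enough to sketch directly from its description in Table 12-7: TAB-delimited columns, with name, start, end, and score always present, extra tags appended as further columns, and "." standing in for missing values. The Python function below is a hypothetical illustration. The header labels and the dictionary layout of a feature are assumptions of this example, not the exact output of any EMBOSS program.

```python
from typing import Dict, Iterable, List, Sequence


def excel_report(features: Iterable[Dict[str, object]],
                 extra_tags: Sequence[str] = ()) -> str:
    """Render features as a TAB-delimited table with '.' for missing values."""
    fixed = ["name", "start", "end", "score"]  # always reported
    header = ["Name", "Start", "End", "Score"] + list(extra_tags)
    rows: List[str] = ["\t".join(header)]
    for feature in features:
        cells = [feature.get(key) for key in fixed + list(extra_tags)]
        # Values are written unquoted; a missing value becomes ".".
        rows.append("\t".join("." if cell is None else str(cell) for cell in cells))
    return "\n".join(rows) + "\n"
```

A feature that lacks a score or an extra tag still produces a full row, so every line has the same number of columns and loads cleanly into a spreadsheet.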
12.1.6 EMBOSS Application Groups
To aid users in finding programs of
interest, the EMBOSS developers have clustered the programs into
application groups. These groups are presented below.
12.1.6.1 Alignment consensus
12.1.6.2 Alignment differences
12.1.6.3 Alignment dot plots
- dotmatcher
- dotpath
- dottup
- polydot
12.1.6.4 Alignment global
- alignwrap
- est2genome
- needle
- stretcher
12.1.6.5 Alignment local
- matcher
- seqmatchall
- supermatcher
- water
- wordmatch
12.1.6.6 Alignment multiple
- emma
- plotcon
- showalign
- infoalign
- prettyplot
- tranalign
12.1.6.7 Display
- abiview
- pepnet
- prettyseq
- showalign
- showseq
- cirdna
- pepwheel
- remap
- showdb
- textsearch
- lindna
- prettyplot
- seealso
- showfeat
12.1.6.8 Edit
- cutseq
- listor
- nthseq
- splitter
- yank
- biosed
- extractseq
- notseq
- skipseq
- vectorstrip
- degapseq
- maskfeat
- pasteseq
- swissparse
- descseq
- maskseq
- revseq
- trimest
- entret
- newseq
- seqret
- trimseq
- extractfeat
- noreturn
- seqretsplit
- union
12.1.6.9 Enzyme kinetics
12.1.6.10 Feature tables
- coderet
- extractfeat
- maskfeat
- showfeat
- swissparse
12.1.6.11 Information
- infoalign
- seealso
- textsearch
- whichdb
- wossname
- infoseq
- showdb
- tfm
12.1.6.12 Menus
12.1.6.13 Nucleic 2d structure
12.1.6.14 Nucleic codon usage
- cai
- chips
- codcmp
- cusp
- syco
12.1.6.15 Nucleic composition
- banana
- chaos
- dan
- isochore
- btwisted
- compseq
- freak
- wordcount
12.1.6.16 Nucleic cpg islands
- cpgplot
- cpgreport
- geecee
- newcpgreport
- newcpgseek
12.1.6.17 Nucleic gene finding
- getorf
- marscan
- plotorf
- showorf
- wobble
12.1.6.18 Nucleic motifs
- dreg
- fuzznuc
- fuzztran
- marscan
12.1.6.19 Nucleic mutation
12.1.6.20 Nucleic primers
- eprimer3
- primersearch
- stssearch
12.1.6.21 Nucleic profiles
12.1.6.22 Nucleic repeats
- einverted
- equicktandem
- etandem
- palindrome
12.1.6.23 Nucleic restriction
- recoder
- remap
- restrict
- silent
- redata
- restover
- showseq
12.1.6.24 Nucleic transcription
12.1.6.25 Nucleic translation
- backtranseq
- plotorf
- remap
- showseq
- coderet
- prettyseq
- showorf
- transeq
12.1.6.26 Phylogeny
12.1.6.27 Protein 2d structure
- garnier
- hmoment
- pepnet
- tmap
- helixturnhelix
- pepcoil
- pepwheel
12.1.6.28 Protein 3d structure
- contacts
- interface
- scopalign
- seqalign
- seqwords
- dichet
- profgen
- scoprep
- seqsearch
- siggen
- hmmgen
- psiblasts
- scopreso
- seqsort
- sigscan
12.1.6.29 Protein composition
- backtranseq
- compseq
- iep
- octanol
- pepwindow
- charge
- emowse
- mwcontam
- pepinfo
- pepwindowall
- checktrans
- freak
- mwfilter
- pepstats
12.1.6.30 Protein motifs
- antigenic
- fuzztran
- patmatdb
- preg
- digest
- helixturnhelix
- patmatmotifs
- pscan
- fuzzpro
- oddcomp
- pepcoil
- sigcleave
12.1.6.31 Protein mutation
12.1.6.32 Protein profiles
12.1.6.33 Protein structure
12.1.6.34 Test
12.1.6.35 Utilities: database creation
- aaindexextract
- groups
- pdbtosp
- scope
- tfextract
- cutgextract
- hetparse
- printsextract
- scopnr
- domainer
- nrscope
- prosextract
- scopparse
- funky
- pdbparse
- rebaseextract
- scopseqs
12.1.6.36 Utilities: database indexing
- dbiblast
- dbifasta
- dbiflat
- dbigcg
12.1.6.37 Utilities: miscellaneous