12.1 Common Themes

Many EMBOSS programs have functionality in common. They all understand the same sorts of sequence addresses, sequence formats, output formats, and feature formats. The following sections describe some common themes in EMBOSS.

12.1.1 Uniform Sequence Address

The Uniform Sequence Address (USA) is a standard sequence-naming scheme used by all EMBOSS applications.

The USA syntax is one of:

"format::file"
"format::file:entry"
"dbname:entry"
"@listfile" (a file of filenames)

The distinct "::" and ":" separators allow names such as "embl" and "pir" to serve as both database names and sequence formats. In addition, EMBOSS allows the command line to separately define the format and the entry name so that only the filename is required.

The "file" and "dbname" forms of USA may have "format::" in front of them, but because a database is aware of the format, this structure is redundant and not recommended.

Any USA may optionally take a subsequence specifier after the main body of the USA, either in the form "[start : end]" or "[start : end : r]", where start and end are the required start and end positions and "r" requests the reverse strand. Negative positions count from the end of the sequence. Using this subsequence specifier is equivalent to using the -sbegin, -send, and -sreverse command-line qualifiers.
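The syntax above can be illustrated with a small parser. This is a minimal sketch, not the parser EMBOSS itself uses: the names USA, KNOWN_DBS, parse_usa, and resolve_position are hypothetical, the set of database names stands in for whatever is defined locally, and the sketch assumes that position -1 refers to the last residue.

```python
import re
from typing import NamedTuple, Optional


class USA(NamedTuple):
    kind: str             # "list", "asis", "db", or "file"
    fmt: Optional[str]    # explicit "format::" prefix, if present
    target: str           # list file, literal sequence, database, or file name
    entry: Optional[str]  # entry name, accession, or search word, if present
    start: Optional[int]
    end: Optional[int]
    reverse: bool


# Trailing "[start:end]" or "[start:end:r]" subsequence specifier.
_RANGE = re.compile(r"\[\s*(-?\d+)\s*:\s*(-?\d+)\s*(?::\s*(r)\s*)?\]$")

# Hypothetical set of locally defined databases; a real site configures its own.
KNOWN_DBS = frozenset({"embl", "sw"})


def parse_usa(usa: str) -> USA:
    """Split a USA string into its parts (illustrative, not the EMBOSS parser)."""
    start = end = None
    reverse = False
    match = _RANGE.search(usa)
    if match:
        start, end = int(match.group(1)), int(match.group(2))
        reverse = match.group(3) is not None
        usa = usa[:match.start()]
    if usa.startswith("@"):
        return USA("list", None, usa[1:], None, start, end, reverse)
    fmt = None
    if "::" in usa:
        fmt, usa = usa.split("::", 1)
        if fmt == "asis":
            return USA("asis", fmt, usa, None, start, end, reverse)
    target, _, entry = usa.partition(":")
    # "embl-id:paamir" style search fields are classified by the part before "-".
    is_db = fmt is None and target.split("-", 1)[0].lower() in KNOWN_DBS
    kind = "db" if is_db else "file"
    return USA(kind, fmt, target, entry or None, start, end, reverse)


def resolve_position(position: int, length: int) -> int:
    """Turn a negative position into a 1-based position counted from the end."""
    return length + position + 1 if position < 0 else position
```

For example, parse_usa("embl:X13776[10:-5:r]") separates the database, the entry, and the reversed range, and resolve_position(-5, 100) maps the end position to 96 under the stated assumption.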

Table 12-1 contains some USA examples.

Table 12-1. EMBOSS Uniform Sequence Address (USA) examples

| Type | Example | Comments |
| --- | --- | --- |
| filename | xxx.seq | A sequence file xxx.seq in any format. |
| format::filename | fasta::xxx.seq | A sequence file xxx.seq in FASTA format. |
| db:IDname | embl:paamir | EMBL entry PAAMIR, using whatever access method is defined locally for the EMBL database. |
| db:AccessionNumber | embl:X13776 | EMBL entry X13776, using whatever access method is defined locally for the EMBL database. Search by accession number and entry name. X13776 is the accession number in this case. |
| db-acc:AccessionNumber | embl-acc:X13776 | EMBL entry X13776, using whatever access method is defined locally for the EMBL database. Search by accession number only. |
| db-id:IDname | embl-id:paamir | EMBL entry PAAMIR, using whatever access method is defined locally for the EMBL database. Search by ID only. |
| db-searchfield:word | embl-des:lectin | EMBL entries containing the word "lectin" in the Description line. |
| db-searchfield:wcardword | embl-org:*human* | EMBL entries containing the wildcarded word "human" in the Organism fields. |
| db:wildcard-ID | embl:paami* | EMBL entries PAAMIB, PAAMIE and so on, usually in alphabetical order, using whatever access method is defined locally for the EMBL database. |
| db or db:* | embl or EMBL:* | All sequences in the EMBL database. |
| @listfile | @mylist | Reads file mylist and uses each line as a separate USA. List files can contain references to other list files or any other standard USA. |
| list:listfile | list:mylist | Same as @mylist. |
| programparameters \| | getz -e [embl-id:paamir] \| | The pipe character "\|" causes EMBOSS to fire up getz (the SRS sequence retrieval program) to extract entry PAAMIR from EMBL in EMBL format. Any application or script which writes one or more sequences to stdout can be used in this way. |
| asis::sequence | asis::atacgcagttatctgaccat | So far, the shortest USA we could invent. In "asis" format the name is the sequence, so no file needs to be opened. This is a special case. It was intended as a joke, but could be quite useful for generating command lines. |
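Because list files may refer to other list files, expanding an "@listfile" USA is naturally recursive. The following is a minimal sketch of that idea, not EMBOSS code: the function name expand_usa is hypothetical, and skipping blank lines and rejecting a list file that includes itself are assumptions of this sketch.

```python
from pathlib import Path
from typing import Iterator, Tuple


def expand_usa(usa: str, _stack: Tuple[Path, ...] = ()) -> Iterator[str]:
    """Yield plain USAs, expanding "@listfile" and "list:listfile" references."""
    if usa.startswith("@"):
        name = usa[1:]
    elif usa.startswith("list:"):
        # Also tolerate the "list::" spelling by dropping a second colon.
        name = usa[len("list:"):].lstrip(":")
    else:
        yield usa  # any other USA is passed through unchanged
        return
    path = Path(name).resolve()
    if path in _stack:
        # Assumption: a list file that includes itself is treated as an error.
        raise ValueError(f"list file {path} includes itself")
    for line in path.read_text().splitlines():
        line = line.strip()
        if line:  # assumption: blank lines are skipped
            yield from expand_usa(line, _stack + (path,))
```

Calling list(expand_usa("@mylist")) would return every USA reachable from mylist, in file order, with nested list files expanded in place.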

12.1.2 Sequence Formats

You can specify the format to use on input by giving the format name with two colons before the file holding your sequences. For example:

embl::myfile.seq

The format is not required. When reading in a sequence, EMBOSS will guess the sequence format by trying all known formats until one succeeds.

When writing out a sequence, EMBOSS will use FASTA format by default. You can specify another format to use, for example:

gcg::myresults.seq
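The "try each format until one succeeds" behavior on input can be pictured as a loop over parsers. The sketch below is only an illustration under simplifying assumptions: it knows two toy parsers (a FASTA reader and a single-entry EMBL-like reader), their order is arbitrary, and all function names are hypothetical rather than part of EMBOSS.

```python
from typing import Callable, List, Optional, Tuple

Record = Tuple[str, str]  # (name, sequence)
Parser = Callable[[str], Optional[List[Record]]]


def parse_fasta(text: str) -> Optional[List[Record]]:
    """Toy FASTA reader; returns None when the text is not FASTA."""
    lines = text.strip().splitlines()
    if not lines or not lines[0].startswith(">"):
        return None
    records: List[Record] = []
    name, chunks = "", []
    for index, line in enumerate(lines):
        if line.startswith(">"):
            if index > 0:
                records.append((name, "".join(chunks)))
            fields = line[1:].split()
            name, chunks = (fields[0] if fields else ""), []
        else:
            chunks.append(line.strip())
    records.append((name, "".join(chunks)))
    return records


def parse_embl(text: str) -> Optional[List[Record]]:
    """Toy single-entry EMBL reader: ID line, SQ line, sequence lines, "//"."""
    lines = text.splitlines()
    fields = lines[0].split() if lines else []
    if len(fields) < 2 or fields[0] != "ID":
        return None
    chunks, in_sequence = [], False
    for line in lines[1:]:
        if line.startswith("//"):
            break
        if in_sequence:
            chunks.append("".join(ch for ch in line if ch.isalpha()))
        elif line.startswith("SQ"):
            in_sequence = True
    return [(fields[1].rstrip(";"), "".join(chunks))]


PARSERS: List[Tuple[str, Parser]] = [("fasta", parse_fasta), ("embl", parse_embl)]


def read_sequences(text: str, fmt: Optional[str] = None) -> Tuple[str, List[Record]]:
    """Use the named format if given; otherwise try each parser in turn."""
    for name, parser in PARSERS:
        if fmt is not None and name != fmt:
            continue
        records = parser(text)
        if records is not None:
            return name, records
    raise ValueError("input did not match any known format")
```

Giving an explicit format, as in "embl::myfile.seq", corresponds to passing fmt="embl" here, which skips the guessing step entirely.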

12.1.2.1 Input sequence formats

To date, the sequence formats in Table 12-2 are accepted as input. By default (i.e., when no format is explicitly specified), EMBOSS tries each format in turn until one succeeds.

Table 12-2. EMBOSS input sequence formats

| Input format | Comments |
| --- | --- |
| abi | ABI trace file format. This is the file format produced by ABI sequencing machines. It contains the trace data, i.e., the probabilities of the four bases along the sequencing run, together with the sequence, as deduced from that data. The sequence information is what is normally read in and used by EMBOSS programs, although the trace data is available and may be utilized by some specialized EMBOSS programs. The code for this is heavily based on David Mathog's Fortran library with a description of ABI trace file format (abi.txt): ftp://saf.bio.caltech.edu/pub/software/molbio/abitools.zip. |
| acedb | ACeDB format. |
| clustal, aln | ClustalW ALN (multiple alignment) format. |
| codata | CODATA format. |
| dbid | Odd FASTA format with database name first, followed by ID name and an optional accession number, e.g. ">database name description" or ">database name accession description". |
| embl, em | EMBL entry format, or at least a minimal subset of the fields. The Staden package and others use EMBL or similar formats for sequence data. |
| fasta, pearson | FASTA format with an optional accession number after the sequence identifier, e.g. ">name description" or ">name accession description", and with an optional database name in GCG style FASTA format included as part of the sequence identifier, e.g. ">database:name accession description". |
| gcg, gcg8 | GCG 9.x and 10.x format with the format and sequence type identified on the first line of the file. GCG 8.x format where anything up to the first line containing ".." is considered as heading, and the remainder is sequence data. |
| genbank, gb, ddbj | GENBANK entry format, or at least a minimal subset of the fields. |
| gff | GFF format. |
| hennig86 | Hennig86 format. |
| ig | IntelliGenetics format. |
| jackknifer | Jackknifer format. |
| jackknifernon | Jackknifernon format. |
| nbrf, pir | NBRF (PIR) format, as used in the PIR database sequence files. |
| nexus, paup | Nexus/PAUP format. |
| nexusnon, paupnon | Nexusnon/PAUPnon format. |
| treecon | Treecon format. |
| mega | Mega format. |
| meganon | Meganon format. |
| msf | Wisconsin Package GCG's MSF multiple sequence format. |
| ncbi | FASTA format with optional accession number and database name in NCBI style included as part of the sequence identifier, e.g. ">database\|accession\|id description" (and other variants on this theme!). |
| pfam, stockholm | Pfam format. |
| phylip | PHYLIP interleaved multiple alignment format. |
| selex | SELEX format is used by Sean Eddy's HMMER package. It can store RNA secondary structure as part of the sequence annotation. |
| staden, experiment | The experiment file format used by the gap program in the Staden package, where the sequence identifier is optional and the remainder is plain text. Some alternative nucleotide ambiguity codes are used and must be converted. |
| strider | DNA Strider format. |
| swissprot, swiss, sw | SWISS-PROT entry format, or at least a minimal subset of the fields. |
| text, plain | Plain text. This is the format with no format. The whole of the file is read in as a sequence. No attempt is made to parse the file contents in any way. Anything is acceptable in this format. This means that any character will be included in the sequence, even digits and punctuation. Use this format only when you are sure that the input sequence file is correct and contains only what you want to be considered as your sequence. |
| raw | Similar to text or plain format. However, raw removes any whitespace or digits, accepts only alphabetic characters, and rejects anything else. This format is safer than plain format. Digits, spaces, and TAB characters are removed and ignored. If a sequence contains other non-alphabetic characters (e.g., punctuation characters), it is rejected as erroneous. |
| asis | Not a sequence format, but a quick way of entering a sequence on the command line. It is included here for completeness. In "asis" format, the actual sequence appears where a filename would normally be given, e.g. asis::atacgcagttatctgacc. In "asis" format the name is the sequence, so no file needs to be opened. This is a special case. It was intended as a joke, but could be quite useful for generating command lines. |
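The difference between the plain (text) and raw formats described in Table 12-2 can be made concrete with two small functions. This is a sketch of the stated rules only, with hypothetical function names; treating newlines like other whitespace and restricting "alphabetic" to ASCII letters are assumptions.

```python
def read_plain(text: str) -> str:
    """Plain/text: the whole input is taken as the sequence, unchecked."""
    return text


def read_raw(text: str) -> str:
    """Raw: drop whitespace and digits, then reject anything non-alphabetic."""
    cleaned = "".join(ch for ch in text if not (ch.isspace() or ch.isdigit()))
    bad = sorted({ch for ch in cleaned if not (ch.isascii() and ch.isalpha())})
    if bad:
        raise ValueError(f"raw format rejects characters: {''.join(bad)}")
    return cleaned
```

With the input "1 acgt acgt 8", read_plain keeps the digits and spaces as part of the sequence, whereas read_raw returns "acgtacgt"; an input such as "acgt-acgt" is rejected by read_raw.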

12.1.2.2 Output sequence formats

To date, the sequence formats in Table 12-3 are available as output. Some sequence formats can hold multiple sequences in one file; these are marked as multiple in the table. Formats such as GCG, plain, and staden can hold only one sequence per file and are marked as single.

Table 12-3. EMBOSS output sequence formats

| Output format | Single/multiple | Comments |
| --- | --- | --- |
| gcg, gcg8 | single | Wisconsin Package GCG 9.x and 10.x format with the sequence type on the first line of the file. GCG 8.x format where anything up to the first line containing ".." is considered as heading, and the remainder is sequence data. |
| embl, em | multiple | EMBL entry format with available fields filled in and others with no information omitted. The EMBOSS command line allows missing data such as accession numbers to be provided if they are not obtainable from the input sequence. |
| swiss, sw | multiple | SWISS-PROT entry format with available fields filled in and others with no information omitted. The EMBOSS command line allows missing data such as accession numbers to be provided if they are not obtainable from the input sequence. |
| fasta, pearson | multiple | Standard Pearson FASTA format, but with the accession number included after the identifier if available. |
| ncbi | multiple | NCBI style FASTA format with the database name, entry name and accession number separated by pipe ("\|") characters. |
| nbrf, pir | multiple | NBRF (PIR) format, as used in the PIR database sequence files. |
| genbank, gb | multiple | GENBANK entry format with available fields filled in and others with no information omitted. The EMBOSS command line allows missing data such as accession numbers to be provided if they are not obtainable from the input sequence. |
| gff | multiple | GFF format. |
| ig | multiple | IntelliGenetics format, as used by the IntelliGenetics package. |
| codata | multiple | CODATA format. |
| strider | multiple | DNA Strider format. |
| acedb | multiple | ACeDB format. |
| staden, experiment | single | The experiment file format used by the gap program in the Staden package. Some alternative nucleotide ambiguity codes are used and are converted. |
| text, plain, raw | single | Plain sequence, no annotation or heading. |
| fitch | multiple | Fitch format. |
| msf | multiple | Wisconsin Package GCG's MSF multiple sequence format. |
| clustal, aln | multiple | Clustal multiple sequence format. |
| selex | multiple | SELEX format. |
| phylip | multiple | PHYLIP interleaved format. |
| phylip3 | multiple | PHYLIP non-interleaved format that was used in Phylip version 3.2. |
| asn1 | multiple | A subset of ASN.1 containing entry name, accession number, description and sequence, similar to the current ASN.1 output of Readseq. |
| hennig86 | multiple | Hennig86 format. |
| mega | multiple | Mega format. |
| meganon | multiple | Meganon format. |
| nexus, paup | multiple | Nexus/PAUP format. |
| nexusnon, paupnon | multiple | Nexusnon/PAUPnon format. |
| jackknifer | multiple | Jackknifer format. |
| jackknifernon | multiple | Jackknifernon format. |
| treecon | multiple | Treecon format. |
| debug | multiple | EMBOSS sequence object report for debugging showing all available fields. Not all fields will contain data; this depends very much on the input format used. |
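As an illustration of the fasta and ncbi rows in Table 12-3, the sketch below builds the two kinds of header line. It is not EMBOSS code: the function names and the 60-character line width are hypothetical, and the field order in the NCBI-style header follows the ">database|accession|id description" example given for input, which is an assumption for output.

```python
from typing import Optional


def fasta_record(name: str, sequence: str, accession: Optional[str] = None,
                 description: str = "", width: int = 60) -> str:
    """Pearson FASTA with the accession placed after the identifier, if known."""
    header = ">" + " ".join(part for part in (name, accession, description) if part)
    # Break the sequence into fixed-width lines.
    body = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([header] + body) + "\n"


def ncbi_header(database: str, accession: str, name: str, description: str = "") -> str:
    """NCBI-style identifier with pipe-separated fields (field order assumed)."""
    header = ">" + "|".join((database, accession, name))
    return header + (" " + description if description else "")
```

Because both formats are marked "multiple", several records produced by fasta_record can simply be concatenated into one output file.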

12.1.3 Alignment Formats

When writing out an alignment between two or more sequences, EMBOSS now uses a standard set of formats.

12.1.3.1 Multiple sequence alignment formats

Table 12-4 contains details about the current set of multiple sequence alignment formats available in EMBOSS.

Table 12-4. EMBOSS multiple sequence alignment formats

| Name | Comments |
| --- | --- |
| unknown, multiple, simple | These are synonyms for simple format. This format displays the sequence names, positions and sequences, then puts the markup line underneath the sequences. When only two sequences are being aligned, the format is changed to that produced by pair. |
| fasta | This is the standard FASTA sequence format with gaps, where many sequences are concatenated one after the other. |
| msf | This is the standard MSF sequence format. |
| trace | This is a special verbose format for use in debugging. It is not intended for normal users. |
| srs | This shows the sequence ID name, the sequence position, the sequence and the sequence position for each line. |

12.1.3.2 Pairwise sequence alignment formats

Table 12-5 contains details about the current set of pairwise sequence alignment formats available in EMBOSS.

Table 12-5. EMBOSS pairwise sequence alignment formats

| Name | Comments |
| --- | --- |
| pair | This is the default format used when there are only 2 sequences. When simple format is selected but there are only 2 sequences, this format is used. The sequences have the markup line between them. |
| markx0 | This is the standard default output format used by Bill Pearson's suite of FASTA programs. |
| markx1 | This is an alternative output format used by Bill Pearson's suite of FASTA programs in which identities are not marked. Instead, conservative replacements are denoted by "x" and non-conservative substitutions by "X". |
| markx2 | This is an alternative output format used by Bill Pearson's suite of FASTA programs in which the residues in the second sequence are only shown if they are different from the first. |
| markx3 | This is an alternative output format used by Bill Pearson's suite of FASTA programs in which the aligned sequences are displayed in FASTA sequence format. These can be used to build a primitive multiple alignment. |
| markx10 | This is an alternative output format used by Bill Pearson's suite of FASTA programs in which the aligned sequences are displayed in FASTA sequence format and the sequence length, alignment start and stop information is given in lines starting with a ";" character just after the title line for each sequence. It is intended to be easily parsed by other programs. |
| srspair | This is very similar in style to pair format. |
| score | This does not display the sequence alignment. It shows only the names of the sequences, the length of the alignment, and the score. |
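The idea behind markx2, showing residues of the second sequence only where they differ from the first, can be sketched in a few lines. The function name is hypothetical, and the use of "." as the placeholder for identical positions is an assumption of this sketch; the actual layout of the format has more to it than this single line.

```python
def markx2_line(first: str, second: str, same: str = ".") -> str:
    """Show the second aligned sequence only where it differs from the first."""
    if len(first) != len(second):
        raise ValueError("aligned sequences must have the same length")
    # Identical positions are replaced by the placeholder character.
    return "".join(same if a == b else b for a, b in zip(first, second))
```

For the aligned pair "ACGT-A" and "ACCTTA", markx2_line returns "..C.T.", so only the two differing positions of the second sequence remain visible.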

12.1.4 Feature Formats

When reading or writing features associated with a sequence, a standard set of formats is used. The feature files can either be a standard sequence format with a feature table as part of the sequence format, or the features can be held in a file without the associated sequence.

Table 12-6 contains details about the current set of feature formats available in EMBOSS.

Table 12-6. EMBOSS feature formats

| Name | Comments |
| --- | --- |
| embl, em | The format used by the EMBL nucleic database. |
| gff | The General Feature Format defined by the Sanger Centre. |
| swissprot, swiss, sw | The format used by the SWISS-PROT protein database. The feature table keys are also defined. |
| pir | The format used by the PIR protein database. |
| nbrf | Only available for input; the same as PIR format. |

12.1.5 Report Formats

There are many ways in which the results of an analysis can be reported. Many EMBOSS programs are now able to output their results in a standard report format. You can change the report format used by putting -rformat name on the command line, where name is the name of one of the standard report formats.

Table 12-7 contains examples of garnier analyzing sw:100K_rat output in various report formats.

Table 12-7. EMBOSS report formats

Name

Comments

embl

Writes a report in EMBL feature table format.

genbank

Writes a report in Genbank feature table format.

gff

Writes a report in GFF feature table format.

pir

Writes a report in PIR feature table format.

swiss

Writes a report in SWISS-PROT feature table format.

trace

Of use only for debugging.

listfile

Writes out a list file with the start and end points of the motifs given by "[start:end]" after the sequence's full USA. This is useful as it is a true List File that can be read in by other EMBOSS programs using "@" or "list::" before the filename.

dbmotif

Writes a report in DbMotif format.

Format:

Length = [length]
Start = position [start] of sequence
End = position [end] of sequence
... other tags ...
[sequence]
[start and end numbered below sequence with '|' marks]
Blank line

Data reported: Length, Start, End, Sequence (5 bases around feature)

diffseq

This format is most useful when reporting the results of two aligned sequences, as in the program diffseq. The report describes matches, usually short, between two sequences and features which overlap them.

Format:

[Sequence 1 Name] [start]-[end] Length: [length]
Feature: first sequence feature(s)
Sequence: motif in sequence 1
Sequence: motif in sequence 2
Feature: second sequence feature(s)
[Sequence 2 Name] [start]-[end] Length: [length]
Blank line

excel

A TAB-delimited table format suitable for reading into spreadsheet programs such as Excel. Name, start, end, and score are always reported. Other tags in the report definition are added as extra columns. All values are (for now) unquoted. Missing values are reported as ".".

feattable

Writes a report in FeatTable format. The report is an EMBL feature table using only the tags in the report definition. There is no requirement for tag names to match standards for the EMBL feature table. The original EMBOSS application for this format was cpgreport.

Format:

FT [type] [start]..[end]
FT /[tagname]=[tagvalue]
Blank line

Data reported: Type, Start, End

motif

Writes a report in Motif format. Based on the original output format of antigenic, helixturnhelix and sigcleave.

Format:

(1) Score [score] length [length] at [name] [start]->[end]
* (marked at position pos)
[sequence]
| |
[start] [end]
[tagname]: tagvalue

Data reported: Name, Start, End, Length, Score, Sequence

regions

Writes a report in Regions format. The report (unusually for the current report formats) includes the feature type.

Format:

[type] from [start] to [end] ([length] [name]) ([tagname]: [tagvalue], [tagname]: [tagvalue] ...)

Data reported: Type, Start, End, Length, Name

seqtable

Writes a report in SeqTable format. This is a simple table format that includes the feature sequence. See the following "table" entry for a version without the sequence. Missing tag values are reported as "." The column width is 6, or longer if the name is longer.

Format:

Start End [tagnames] Sequence
[start] [end] [tagvalues] [sequence]

simple

Writes a report in SRS simple format. This is a simple parsable format that does not include the feature sequence (see also SRS format) for applications where features can be large. Missing tag values are reported as ".".

Format:

Feature [number]
Name: [ID name]
Start: [start]
End: [end]
Length: [length]
[tagnames:] [tag values]
Blank line

srs

Writes a report in SRS format. This is a simple parsable format that includes the feature sequence. Missing tag values are reported as ".".

Format:

Feature [number]
Name: [ID name]
Start: [start]
End: [end]
Length: [length]
Sequence: [sequence]
Score: [score]
[tagnames:] [tag values]
Blank line

table

Writes a report in Table format. See previous "seqtable" entry for a version with the sequence. Missing tag values are reported as ".". The column width is 6, or longer if the name is longer.

Format:

USA Start End Score [tagnames]
[name] [start] [end] [score] [tagvalues]

tagseq

Writes a report in Tagseq format. Features are marked up below the sequence. Originally developed for the garnier application, this format also has general uses.

Format:

Sequence position written every 10 bases/residues
Sequence (50 residues)
tagname ++++++++++++ +++++++++
Blank line

If the tag value is a 1-letter code, use it in place of "+".
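The rules given for the excel report format in Table 12-7 (TAB-delimited, four fixed columns, extra columns for other tags, unquoted values, "." for missing values) can be sketched as follows. This is an illustration only: the function name, the dictionary layout of a feature, and the exact column header names are assumptions and not taken from EMBOSS.

```python
from typing import Any, Dict, List, Sequence


def excel_report(features: List[Dict[str, Any]], tags: Sequence[str]) -> str:
    """Build a TAB-delimited table; missing values are written as "."."""

    def cell(value: Any) -> str:
        return "." if value is None else str(value)

    # Hypothetical header names for the four columns that are always present.
    lines = ["\t".join(["Name", "Start", "End", "Score", *tags])]
    for feature in features:
        fixed = [cell(feature.get(key)) for key in ("name", "start", "end", "score")]
        extra = [cell(feature.get("tags", {}).get(tag)) for tag in tags]
        lines.append("\t".join(fixed + extra))
    return "\n".join(lines) + "\n"
```

A feature that lacks a value for one of the requested tags still gets a full row, with "." in that column, so every line has the same number of fields when loaded into a spreadsheet.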

12.1.6 EMBOSS Application Groups

To aid users in finding programs of interest, the EMBOSS developers have clustered the programs into application groups. These groups are presented below.

12.1.6.1 Alignment consensus

cons

megamerger

merger

12.1.6.2 Alignment differences

diffseq

12.1.6.3 Alignment dot plots

dotmatcher

dotpath

dottup

polydot

12.1.6.4 Alignment global

alignwrap

est2genome

needle

stretcher

12.1.6.5 Alignment local

matcher

seqmatchall

supermatcher

water

wordmatch

12.1.6.6 Alignment multiple

emma

plotcon

showalign

infoalign

prettyplot

tranalign

12.1.6.7 Display

abiview

pepnet

prettyseq

showalign

showseq

cirdna

pepwheel

remap

showdb

textsearch

lindna

prettyplot

seealso

showfeat

12.1.6.8 Edit

cutseq

listor

nthseq

splitter

yank

biosed

extractseq

notseq

skipseq

vectorstrip

degapseq

maskfeat

pasteseq

swissparse

descseq

maskseq

revseq

trimest

entret

newseq

seqret

trimseq

extractfeat

noreturn

seqretsplit

union

12.1.6.9 Enzyme kinetics

findkm

12.1.6.10 Feature tables

coderet

extractfeat

maskfeat

showfeat

swissparse

12.1.6.11 Information

infoalign

seealso

textsearch

whichdb

wossname

infoseq

showdb

tfm

12.1.6.12 Menus

emnu

12.1.6.13 Nucleic 2d structure

einverted

12.1.6.14 Nucleic codon usage

cai

chips

codcmp

cusp

syco

12.1.6.15 Nucleic composition

banana

chaos

dan

isochore

btwisted

compseq

freak

wordcount

12.1.6.16 Nucleic cpg islands

cpgplot

cpgreport

geecee

newcpgreport

newcpgseek

12.1.6.17 Nucleic gene finding

getorf

marscan

plotorf

showorf

wobble

12.1.6.18 Nucleic motifs

dreg

fuzznuc

fuzztran

marscan

12.1.6.19 Nucleic mutation

msbar

shuffleseq

12.1.6.20 Nucleic primers

eprimer3

primersearch

stssearch

12.1.6.21 Nucleic profiles

profit

prophecy

prophet

12.1.6.22 Nucleic repeats

einverted

equicktandem

etandem

palindrome

12.1.6.23 Nucleic restriction

recoder

remap

restrict

silent

redata

restover

showseq

12.1.6.24 Nucleic transcription

tfscan

12.1.6.25 Nucleic translation

backtranseq

plotorf

remap

showseq

coderet

prettyseq

showorf

transeq

12.1.6.26 Phylogeny

distmat

12.1.6.27 Protein 2d structure

garnier

hmoment

pepnet

tmap

helixturnhelix

pepcoil

pepwheel

12.1.6.28 Protein 3d structure

contacts

interface

scopalign

seqalign

seqwords

dichet

profgen

scoprep

seqsearch

siggen

hmmgen

psiblasts

scopreso

seqsort

sigscan

12.1.6.29 Protein composition

backtranseq

compseq

iep

octanol

pepwindow

charge

emowse

mwcontam

pepinfo

pepwindowall

checktrans

freak

mwfilter

pepstats

12.1.6.30 Protein motifs

antigenic

fuzztran

patmatdb

preg

digest

helixturnhelix

patmatmotifs

pscan

fuzzpro

oddcomp

pepcoil

sigcleave

12.1.6.31 Protein mutation

msbar

shuffleseq

12.1.6.32 Protein profiles

profit

prophecy

prophet

12.1.6.33 Protein structure

seqsort

12.1.6.34 Test

histogramtest

12.1.6.35 Utilities: database creation

aaindexextract

groups

pdbtosp

scope

tfextract

cutgextract

hetparse

printsextract

scopnr

domainer

nrscope

prosextract

scopparse

funky

pdbparse

rebaseextract

scopseqs

12.1.6.36 Utilities: database indexing

dbiblast

dbifasta

dbiflat

dbigcg

12.1.6.37 Utilities: miscellaneous

embossdata

embossversion
