How This Book Is Organized
The book is divided into three fundamental areas: data formats,
tools, and biological sequence components.
The data format section contains examples of flat files from key
databases, the definitions of the codes or fields used in each
database, and the sequence feature types/terms and qualifiers for the
nucleotide and protein databases.
While there are many useful publicly and commercially available
programs, we limited the tools section to popular public domain
programs (e.g., BLAST and ClustalW). We also decided to include the
EMBOSS programs. These packages are all excellent examples of
sequence tools that allow bioinformaticians to easily use the command
line to customize their own analyses and workflows. Each program is
described briefly, with one or more examples showing how the program
may be invoked. We also include the definitions, descriptions, and/or
default parameters for each program's command-line
options.
The last section of the book concentrates on information essential to
understanding the individual components that make up a biological
sequence. The tables in this section include nucleotide and protein
codes, genetic codes, and other relevant information.
The book is organized as follows:
Part I
Chapter 1 describes the most common sequence data
format.
Chapter 2 describes the flat file format, field
definitions, and feature tables used in the three most popular
sequence databases.
Chapter 3 describes the flat file format, field
definitions, and feature tables used with the SWISS-PROT protein
database.
Chapter 4 describes the flat file format and field
definitions used with Pfam, a database for predicting the function
of newly discovered proteins.
Chapter 5 describes the flat file format and field
definitions used with Prosite, one of the many popular databases for
sequence profiles, patterns, and motifs.
Part II
Chapter 6 describes the supported formats and
command-line options of Readseq, a program that reads and writes
nucleotide and protein sequences in many useful formats.
Chapter 7 includes a list of the command-line
options used in the various BLAST (Basic Local Alignment Search Tool)
programs. We've also included a brief description of
each command and option.
Chapter 8 includes the command-line options for
BLAT, the BLAST-Like Alignment Tool.
Chapter 9 includes the command-line options for
ClustalW, a multiple sequence alignment program for nucleotide
sequences or proteins.
Chapter 10 describes the respective options for
the HMMER (Hidden Markov Model) suite of programs.
Chapter 11 shows examples for using MEME (Multiple
EM for Motif Elicitation), a tool for discovering motifs in a group
of related DNA or protein sequences, and MAST (Motif Alignment and
Search Tool), a tool for searching biological sequence databases for
sequences that contain one or more of a group of known motifs.
We've also included command-line options for each
program.
Chapter 12 includes sequence, alignment, feature,
and report formats for the EMBOSS (European Molecular Biology Open
Software Suite) tools. The chapter also includes a description,
example, and summary of the command-line arguments of each tool in
the suite.
Part III
Appendix A includes tables of the single-letter
nucleotide and amino acid codes, as well as amino acid side chain
data.
Appendix B includes the genetic codes for the most
common organisms.
Appendix C includes useful URLs, further reading,
and references to important journal articles.
Appendix D contains the authors'
proposed contribution to the EMBOSS suite.