How This Book Is Organized

The book is divided into three fundamental areas: data formats, tools, and biological sequence components.

The data format section contains examples of flat files from key databases, the definitions of the codes or fields used in each database, and the sequence feature types/terms and qualifiers for the nucleotide and protein databases.

While there are many useful publicly and commercially available programs, we limited the tools section to popular public domain programs (e.g., BLAST and ClustalW). We also decided to include the EMBOSS programs. These packages are all excellent examples of sequence tools that allow bioinformaticians to easily use the command line to customize their own analyses and workflows. Each program is described briefly, with one or more examples showing how the program may be invoked. We also include the definitions, descriptions, and/or default parameters for each program's command-line options.

The last section of the book concentrates on information essential to understanding the individual components that make up a biological sequence. The tables in this section include nucleotide and protein codes, genetic codes, and other relevant information. The book is organized as follows:

Part I

Chapter 1 describes the most common sequence data format.
Chapter 2 describes the flat file format, field definitions, and feature tables used in the three most popular sequence databases.
Chapter 3 describes the flat file format, field definitions, and feature tables used with the SWISS-PROT protein database.
Chapter 4 describes the flat file format and field definitions used with Pfam, a database for predicting the function of newly discovered proteins.
Chapter 5 describes the flat file format and field definitions used with Prosite, one of the many popular databases for sequence profiles, patterns, and motifs.

Part II

Chapter 6 describes the supported formats and command-line options of Readseq, a program that reads and writes nucleotide and protein sequences in many useful formats.
Chapter 7 lists the command-line options used in the various BLAST (Basic Local Alignment Search Tool) programs, with a brief description of each command and option.
Chapter 8 includes the command-line options for BLAT, the BLAST-Like Alignment Tool.
Chapter 9 includes the command-line options for ClustalW, a multiple sequence alignment program for nucleotide or protein sequences.
Chapter 10 describes the options for each program in the HMMER suite of hidden Markov model tools.
Chapter 11 shows examples for using MEME (Multiple EM for Motif Elicitation), a tool for discovering motifs in a group of related DNA or protein sequences, and MAST (Motif Alignment and Search Tool), a tool for searching biological sequence databases for sequences that contain one or more of a group of known motifs. We've also included command-line options for each program.
Chapter 12 includes sequence, alignment, feature, and report formats for the EMBOSS (European Molecular Biology Open Software Suite) tools. The chapter also includes a description, example, and summary of the command-line arguments of each tool in the suite.

Part III

Appendix A includes tables of the single-letter nucleotide and amino acid codes, as well as amino acid side chain data.
Appendix B includes the genetic codes for the most common organisms.
Appendix C includes useful URLs, further reading, and references to important journal articles.
Appendix D contains the authors' proposed contribution to the EMBOSS suite.
