1.2 NCBI's Non-Redundant Database Syntax
You should be aware of one additional syntax, which NCBI uses for its
non-redundant database. Because the whole point of the database is to
list each sequence only once, the description line syntax allows more
than one identifier-and-description set on a single line. The sets
are delimited by Ctrl-A characters (ASCII code 1).
Here's what NCBI has to say about this:
These files are all non-redundant; identical sequences are merged
into one entry. To be merged two sequences must have identical
lengths and every residue (or basepair) at every position must be the
same. The FASTA deflines for the different entries that belong to one
sequence are separated by control-A's (^A). In the
following example, both entries gi|1469284 and gi|1477453 have the
same sequence, in every respect.
>gi|1469284 (U05042) afuC gene product [Actinobacillus
pleuropneumoniae]^Agi|1477453 (U04954) afuC gene product [Actinobacillus
pleuropneumoniae]
MNNDFLVLKNITKSFGKATVIDNLDLVIKRGTMVTLLGPSGCGKTTVLRLVAGLENPTSGQIFIDGEDVT
KSSIQNRDICIVFQSYALFPHMSIGDNVGYGLRMQGVSNEERKQRVKEALELVDLAGFADRFVDQISGGQ
QQRVALARALVLKPKVLILDEPLSNLDANLRRSMREKIRELQQRLGITSLYVTHDQTEAFAVSDEVIVMN
KGTIMQKARQKIFIYDRILYSLRNFMGESTICDGNLNQGTVSIGDYRFPLHNAADFSVADGACLVGVRPE
AIRLTATGETSQRCQIKSAVYMGNHWEIVANWNGKDVLINANPDQFDPDATKAFIHFTEQGIFLLNKE
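
To make the delimiter concrete, here is a minimal Python sketch of how a merged description line might be split back into its individual entries. The function name split_nr_defline is hypothetical, not part of any NCBI tool or library. The sketch assumes the description line has already been read as a single string (in the file it is one line, although it is wrapped in the example above) and that the separator is the literal Ctrl-A byte, written "\x01" in Python, rather than the two printable characters "^A".

```python
def split_nr_defline(defline, sep="\x01"):
    """Split a merged description line into (identifier, description) pairs.

    Hypothetical helper: assumes `sep` is the Ctrl-A character (ASCII 1)
    and that each entry has the form "<identifier> <description>".
    """
    # Drop the leading ">" that marks a FASTA description line.
    if defline.startswith(">"):
        defline = defline[1:]

    entries = []
    for part in defline.split(sep):
        part = part.strip()
        if not part:
            continue  # ignore empty pieces, e.g. from a trailing separator
        # The identifier runs up to the first space; the rest is description.
        identifier, _, description = part.partition(" ")
        entries.append((identifier, description.strip()))
    return entries


# The description line from the NCBI example, joined into one string.
example = (
    ">gi|1469284 (U05042) afuC gene product [Actinobacillus pleuropneumoniae]"
    "\x01"
    "gi|1477453 (U04954) afuC gene product [Actinobacillus pleuropneumoniae]"
)

for identifier, description in split_nr_defline(example):
    print(identifier, "->", description)
```

For the example above, this yields two entries, one for gi|1469284 and one for gi|1477453, each with its own description but sharing the single sequence that follows.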