
1.2 NCBI's Non-Redundant Database Syntax

You should be aware of one additional syntax, which NCBI uses for its non-redundant database. Because the whole point of the database is to list each sequence only once, the description line syntax allows more than one identifier-and-description set on a single description line. The sets are delimited by Ctrl-A characters. Here's what NCBI has to say about this:

These files are all non-redundant; identical sequences are merged into one entry. To be merged two sequences must have identical lengths and every residue (or basepair) at every position must be the same. The FASTA deflines for the different entries that belong to one sequence are separated by control-A's (^A). In the following example, both entries gi|1469284 and gi|1477453 have the same sequence, in every respect.

>gi|1469284 (U05042) afuC gene product [Actinobacillus 
pleuropneumoniae]^Agi|1477453 (U04954) afuC gene product [Actinobacillus 
pleuropneumoniae]
MNNDFLVLKNITKSFGKATVIDNLDLVIKRGTMVTLLGPSGCGKTTVLRLVAGLENPTSGQIFIDGEDVT
KSSIQNRDICIVFQSYALFPHMSIGDNVGYGLRMQGVSNEERKQRVKEALELVDLAGFADRFVDQISGGQ
QQRVALARALVLKPKVLILDEPLSNLDANLRRSMREKIRELQQRLGITSLYVTHDQTEAFAVSDEVIVMN
KGTIMQKARQKIFIYDRILYSLRNFMGESTICDGNLNQGTVSIGDYRFPLHNAADFSVADGACLVGVRPE
AIRLTATGETSQRCQIKSAVYMGNHWEIVANWNGKDVLINANPDQFDPDATKAFIHFTEQGIFLLNKE
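
To make the Ctrl-A convention concrete, here is a minimal Python sketch of how a merged description line might be split back into its individual entries. It assumes that the `^A` shown in the example stands for the single control character with byte value 1 (`"\x01"`), that the description line occupies one physical line in the file (the wrapping above being for display only), and that the identifier is everything up to the first space. The function name `split_nr_defline` is hypothetical and is not part of any NCBI tool.

```python
def split_nr_defline(defline):
    """Split a merged nr description line into (identifier, description) pairs.

    Assumes entries are separated by the Ctrl-A character ("\x01") and that
    each entry's identifier runs up to the first space.
    """
    # Drop the leading '>' that marks a FASTA description line, if present.
    if defline.startswith(">"):
        defline = defline[1:]

    entries = []
    for part in defline.split("\x01"):
        part = part.strip()
        if not part:
            continue  # ignore empty fragments, e.g. from a trailing Ctrl-A
        identifier, _, description = part.partition(" ")
        entries.append((identifier, description.strip()))
    return entries


# The description line from the example above, with ^A written as "\x01".
defline = (
    ">gi|1469284 (U05042) afuC gene product [Actinobacillus pleuropneumoniae]"
    "\x01gi|1477453 (U04954) afuC gene product [Actinobacillus pleuropneumoniae]"
)

for identifier, description in split_nr_defline(defline):
    print(identifier, "->", description)
```

Run on the example, the sketch yields two entries, one for gi|1469284 and one for gi|1477453, both attached to the same sequence. Code that reads only the text up to the first Ctrl-A would see the first entry and silently miss the second.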