11.1 MEME

MEME (Multiple EM for Motif Elicitation) is a tool for discovering motifs in a group of related DNA or protein sequences. A motif is a sequence pattern that occurs repeatedly in such a group. MEME represents motifs as position-dependent letter-probability matrices, which describe the probability of each possible letter at each position in the pattern. Individual MEME motifs do not contain gaps; MEME splits patterns with variable-length gaps into two or more separate motifs.

MEME takes as input a group of DNA or protein sequences (the training set) and outputs as many motifs as requested. It uses statistical modeling techniques to automatically choose the best width, number of occurrences, and description for each motif. For details, see Section 11.3 at the end of this chapter.

11.1.1 Examples

The following examples use data files provided in this release of MEME. MEME writes its output to standard output, so you will want to redirect it to a file in order to use it with MAST.

A simple DNA example:

    meme crp0.s -dna -mod oops -pal > ex1.html

MEME looks for a single motif in the file crp0.s, which contains DNA sequences in FASTA format. The OOPS model is used, so MEME assumes that every sequence contains exactly one occurrence of the motif. The palindrome switch is given, so the motif model (PSPM) is converted into a palindrome by combining corresponding frequency columns. MEME automatically chooses the best width for the motif in this example since no width was specified.

Searching for motifs on both DNA strands:

    meme crp0.s -dna -mod oops -revcomp > ex2.html

This is like the previous example except that the -revcomp switch tells MEME to consider both DNA strands, and the -pal switch is absent so the palindrome conversion is omitted. When MEME uses both DNA strands, motif occurrences on the two strands may not overlap.
That is, any position in the sequence given in the training set may be contained in an occurrence of a motif on the positive strand or the negative strand, but not both.

A fast DNA example:

    meme crp0.s -dna -mod oops -revcomp -w 20 > ex3.html

This example differs from the first example in that MEME is told to consider only motifs of width 20. This causes MEME to execute about 10 times faster. The -w switch can also be used with protein datasets if the width of the motifs is known in advance.

Using a higher-order background model:

    meme INO_up800.s -dna -mod tcm -revcomp -bfile yeast.nc.6.freq > ex4.html

In this example we use -mod tcm and -bfile yeast.nc.6.freq. This specifies that the motif may have any number of occurrences in each sequence (the TCM model) and that the background model is read from the file yeast.nc.6.freq.
Using a higher-order background model can often result in more sensitive detection of motifs. This is because the background model more accurately models non-motif sequence, allowing MEME to discriminate against it and find the true motifs.

A simple protein example:

    meme lipocalin.s -mod oops -maxw 20 -nmotifs 2 > ex5.html

The -dna switch is absent, so MEME assumes the file lipocalin.s contains protein sequences. MEME searches for two motifs, each of width less than or equal to 20. (Specifying -maxw 20 makes MEME run faster, since it does not have to consider motifs longer than 20.) Each motif is assumed to occur exactly once in each of the sequences because the OOPS model is specified.

Another simple protein example:

    meme farntrans5.s -mod tcm -maxw 40 -maxsites 50 > ex6.html

MEME searches for a motif of width up to 40, with up to 50 occurrences in the entire training set. The TCM sequence model is specified, which allows each motif to have any number of occurrences in each sequence. This dataset contains motifs that repeat multiple times in each sequence. This example is fairly time-consuming because the time required to initialize the motif probability tables is proportional to maxw multiplied by maxsites. By default, MEME only looks for motifs up to 29 letters wide, with a maximum total number of occurrences equal to twice the number of sequences or 30, whichever is less.

A much faster protein example:

    meme farntrans5.s -mod tcm -w 10 -maxsites 30 -nmotifs 3 > ex7.html

This time MEME is constrained to search for three motifs of width exactly ten. The effect is to break up the long motif found in the previous example. The -w switch forces motifs to be exactly ten letters wide. This example is much faster because only one width is considered, so the time to build the motif probability tables is proportional only to maxsites.
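The default limits and the cost rule described above can be checked with a little shell arithmetic before launching a long run. The following is a minimal sketch, not part of MEME: the function names count_fasta_seqs, default_maxsites, and relative_init_cost are hypothetical helpers, the 12-sequence training set is an invented figure, and the "cost" is only the unitless product that the text says initialization time is proportional to.

```shell
#!/bin/sh
# Sketch only: these helpers restate the rules quoted in the text. The names
# are hypothetical and are not part of the MEME distribution.

# count_fasta_seqs FILE: number of sequences in a FASTA file, counted as
# header lines beginning with ">".
count_fasta_seqs() {
    grep -c '^>' "$1"
}

# default_maxsites NSEQS: default cap on the total number of occurrences,
# i.e. twice the number of sequences or 30, whichever is less.
default_maxsites() {
    twice=$(( $1 * 2 ))
    if [ "$twice" -lt 30 ]; then
        echo "$twice"
    else
        echo 30
    fi
}

# relative_init_cost MAXW MAXSITES: unitless figure proportional to the time
# needed to initialize the motif probability tables (maxw times maxsites).
relative_init_cost() {
    echo $(( $1 * $2 ))
}

# Example: a hypothetical training set of 12 sequences with the default
# limits, compared with the explicit -maxw 40 -maxsites 50 settings.
nseqs=12
sites=$(default_maxsites "$nseqs")
echo "default maxsites for $nseqs sequences: $sites"
echo "relative cost, defaults (maxw 29): $(relative_init_cost 29 "$sites")"
echo "relative cost, -maxw 40 -maxsites 50: $(relative_init_cost 40 50)"
```

For a real dataset, the sequence count could be taken from the file itself, for example with default_maxsites "$(count_fasta_seqs farntrans5.s)". When -w fixes a single width, the text indicates that the cost scales with maxsites alone, so the product above overstates it in that case.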
Splitting the sites into three:

    meme farntrans5.s -mod tcm -maxw 12 -nsites 24 -nmotifs 3 > ex8.html

This forces each motif to have exactly 24 occurrences, and be up to 12 letters wide.

A larger protein example with E-value cutoff:

    meme adh.s -mod zoops -nmotifs 20 -evt 0.01 > ex9.html

In this example, MEME looks for up to 20 motifs, but stops when a motif is found with E-value greater than 0.01. Motifs with large E-values are likely to be statistical artifacts rather than biologically significant.

11.1.2 Command-Line Options

    meme dataset optional arguments

where dataset is a file containing sequences in FASTA format. Table 11-1 summarizes the command-line options for MEME.