Sunday 29 April 2012

SLiMMaker: regular expressions from aligned peptide sequences

SLiMMaker has a simple function: it reads in a set of sequences and generates a regular expression motif from them. It is designed with protein sequences in mind but should work for DNA sequences too. Input sequences can be in FASTA format or plain text (with no sequence headers) and should already be aligned. Gapped positions are ignored (treated as Xs) and variable-length wildcards are not returned.

SLiMMaker considers each column of the input in turn and compresses it into a regular expression element according to some simple rules, screening out rare amino acids and converting particularly degenerate positions into wildcards. Each amino acid in the column that occurs at least X times (as defined by minseq=X) is considered for the regular expression definition for that position. The full set of amino acids meeting this criterion is then assessed to decide whether the position is kept as a defined position or converted into a wildcard.

First, if the number of different amino acids meeting this criterion is zero or above a second threshold (maxaa=X), the position is defined as a wildcard. Second, the proportion of input sequences matching the amino acid set is compared to a minimum frequency criterion (minfreq=X). Failing to meet this minimum frequency will again result in a wildcard. Otherwise, the amino acid set is added to the SLiM definition as either a fixed position (if only one amino acid met the minseq criterion) or as a degenerate position. Finally, leading and trailing wildcards are removed.

By default, each defined position in a motif will contain amino acids that (a) each occur in at least three sequences, (b) have a combined frequency of >=75%, and (c) number five or fewer (counting only those that occur in 3+ sequences).
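
The column rules above can be sketched in a few lines of Python. This is an illustration only, not the SLiMMaker code itself: the function name make_motif is hypothetical, the parameter names simply mirror the minseq, minfreq and maxaa options, wildcards are written as "." (an assumption about the output style), and gaps and Xs are left out of the counts while still counting towards the total number of sequences.

```python
from collections import Counter

def make_motif(seqs, minseq=3, minfreq=0.75, maxaa=5):
    """Sketch: build a regex motif from aligned sequences (hypothetical helper).

    Assumes all sequences have the same length; zip() would silently
    truncate to the shortest one otherwise.
    """
    nseq = len(seqs)
    elements = []
    for column in zip(*seqs):
        # Count residues in this column; gaps and Xs are ignored (assumption).
        counts = Counter(aa.upper() for aa in column if aa not in "-Xx")
        # Keep only amino acids seen in at least minseq sequences.
        kept = sorted(aa for aa, n in counts.items() if n >= minseq)
        if not kept or len(kept) > maxaa:
            elements.append(".")      # nothing kept, or too degenerate
            continue
        # Proportion of input sequences matching the kept amino acid set.
        freq = sum(counts[aa] for aa in kept) / nseq
        if freq < minfreq:
            elements.append(".")      # kept set covers too few sequences
        elif len(kept) == 1:
            elements.append(kept[0])  # fixed position
        else:
            elements.append("[" + "".join(kept) + "]")  # degenerate position
    # Remove leading and trailing wildcards.
    return "".join(elements).strip(".")

if __name__ == "__main__":
    aligned = ["AKRLS", "GKKLT", "PKRLS", "QKKLD"]
    print(make_motif(aligned))            # default thresholds
    print(make_motif(aligned, minseq=2))  # allow amino acids seen twice
```

With these four toy sequences, the default settings keep only the two invariant columns, while lowering minseq to 2 lets the K/R column through as a degenerate position.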

Note. The final motif only contains defined positions that match a given frequency of the input (75% by default). Because positions are considered independently, however, the final motif might occur in fewer than 75% of the input sequences. Results will indicate the coverage of the input data but SLiMSearch can be used to check the occurrence stats more thoroughly.
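
To see why the whole motif can fall below the per-position threshold, here is a small Python check of how many input sequences a motif actually matches. Again this is only a sketch: the helper name motif_coverage is hypothetical, it is not how SLiMMaker or SLiMSearch report coverage, and it assumes gaps should be stripped before searching.

```python
import re

def motif_coverage(motif, seqs):
    """Sketch: fraction of input sequences containing the motif (hypothetical helper)."""
    if not seqs:
        return 0.0
    pattern = re.compile(motif)
    # Search ungapped, upper-case sequences (assumption about gap handling).
    ungapped = [s.replace("-", "").upper() for s in seqs]
    hits = sum(1 for s in ungapped if pattern.search(s))
    return hits / len(seqs)

if __name__ == "__main__":
    # K is in 3/4 sequences at position 1 and L is in 3/4 at position 2,
    # so each column alone meets a 75% threshold, giving the motif "KL".
    seqs = ["KL", "KL", "KA", "AL"]
    print(motif_coverage("KL", seqs))
```

In this toy case each position passes the 75% rule on its own, yet only two of the four sequences contain the full motif.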

Citation: SLiMMaker is part of the ongoing benchmarking of QSLiMFinder, which should be submitted for publication soon. In the meantime, please cite the SLiMMaker URL: http://bioware.soton.ac.uk/slimmaker.html.

Availability: SLiMMaker is available on request and will shortly be part of the SLiMSuite package.

