We propose to use the lexicon as a probabilistic language model to describe
the context features of characteristic regions of genes. We base our identification
system on lexicons derived from training sequences by the Viterbi training algorithm.
First, we explain the identification of promoter regions. The prediction of a promoter region is based on several classifiers. A classifier is defined by two lexicons, one derived from promoter training sequences and the other derived from specific non-promoter training sequences. Given a fixed-length input sequence, the classifier computes the generation probabilities of the sequence under those two lexicons. The input sequence is assigned to the promoter class if the generation probability based on the promoter lexicon is higher than some threshold. Currently, we construct two classifiers. One decides whether the input sequence belongs to a promoter region or an intron region. The other decides whether the input sequence belongs to a promoter region or a CDs region.
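As an illustration only, a two-lexicon classifier of this kind might be sketched as follows. The sketch rests on several assumptions that are not taken from the text: a lexicon is represented as a dictionary mapping words to probabilities, the generation probability is approximated by the single best (Viterbi) segmentation of the sequence into lexicon words, and the threshold is applied to the difference of the two log-probabilities, which is only one possible reading of the decision rule. The names `viterbi_log_prob` and `classify` are hypothetical.

```python
import math


def viterbi_log_prob(seq, lexicon):
    """Log-probability of the best segmentation of seq into lexicon words.

    lexicon: dict mapping word -> probability (an assumed representation).
    Returns -inf if seq cannot be segmented with the given words.
    """
    max_len = max(len(word) for word in lexicon)
    n = len(seq)
    best = [-math.inf] * (n + 1)  # best[i] scores the prefix seq[:i]
    best[0] = 0.0
    for end in range(1, n + 1):
        for length in range(1, min(max_len, end) + 1):
            prev = best[end - length]
            prob = lexicon.get(seq[end - length:end])
            if prob is None or prev == -math.inf:
                continue
            best[end] = max(best[end], prev + math.log(prob))
    return best[n]


def classify(seq, promoter_lexicon, background_lexicon, threshold=0.0):
    """Return (is_promoter, score); score is a log-probability difference.

    Assumption: the threshold is applied to log P(seq | promoter lexicon)
    minus log P(seq | background lexicon).
    """
    lp_promoter = viterbi_log_prob(seq, promoter_lexicon)
    lp_background = viterbi_log_prob(seq, background_lexicon)
    if lp_promoter == -math.inf:
        return False, -math.inf  # not generable by the promoter lexicon
    score = lp_promoter - lp_background
    return score > threshold, score
```

With a toy promoter lexicon that contains a word such as `TATA` and a background lexicon of single nucleotides, a window containing that word receives a positive score, while a window without it does not. A second classifier (promoter versus CDs) would reuse the same functions with a different background lexicon.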
The framework of our promoter region identification algorithm, named "PromoterIden", can be described as follows:
To test whether our algorithm is effective, we use a set of empirical parameters.
We downloaded the promoter, intron and CDs training data
from the public Representative Benchmark Data Sets of Human DNA Sequences provided on
the Berkeley Drosophila Genome Project web site. The promoters were extracted from
the Eukaryotic Promoter Database rel. 50; the negative set contains coding and non-coding
sequences from the 1998 GENIE data set. After training the corresponding promoter, intron, and
CDs lexicons, we apply the promoter region identification algorithm described above to the review
dataset used by Fickett and Hatzigeorgiou [25], which consists of 24 promoters
covering a total of 33,120 bp; none of these sequences matches a sequence in EPD.
We use the non-strand-specific comparison standard in [70] and
compare our results with Audic [1], Autogene [45],
GeneID [44], NNPP 2.1, PromFind [39],
TATA [6], TSSG [76] and TSSW [76].
The results are shown in Table 1. Our results are comparable
under this coarse framework, because no other method achieved both more
true positives (TP) and fewer false positives (FP) than our method. It is notable that our algorithm, which has not
been "tuned" at all to search for promoter regions, performs so well against
published algorithms.
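The comparability criterion used here (no competing method has both higher TP and lower FP) can be sketched as a simple check. The function names `strictly_better` and `is_undominated` are hypothetical, and the counts in the usage example are made-up placeholders, not values from Table 1.

```python
def strictly_better(a, b):
    """True if result a has more TP and fewer FP than result b.

    Each result is a (tp, fp) pair of counts.
    """
    return a[0] > b[0] and a[1] < b[1]


def is_undominated(ours, others):
    """True if no method in others beats ours on both TP and FP."""
    return not any(strictly_better(result, ours) for result in others.values())


# Placeholder counts for illustration only (not taken from Table 1).
ours = (10, 30)
others = {"method_a": (12, 50), "method_b": (6, 20)}
print(is_undominated(ours, others))
```

Under this criterion a method with more TP but also more FP, or with fewer FP but also fewer TP, does not count as better, which is why the comparison is described as coarse.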
Similarly, we can train prediction models to identify intron, CDs, 3' UTR and intergenic regions.
Jing Liu 2005-11-17