Identification of eukaryotic promoter region

We propose to utilize the lexicon as a probabilistic language model to describe the context features of characteristic regions of genes. We base our identification system on those lexicons derived from training sequences by Viterbi training algorithm in section [*].

First, we explain the identification of the promoter regions. The prediction of promoter region is based on several classifiers. A classifier is defined by two lexicons, one derived from promoter training sequences and the other deriving from specific non-promoter training sequences. Given a fixed-length input sequence, the classifier will compute the generation probabilities based on those two lexicons. The input sequence will be assigned to the promoter class if the generation probability based on promoter lexicon is higher than some threshold. Currently, we construct two classifiers. One decides if the input sequence belongs to promoter region or a intron region. The other decides if the input sequence belongs to promoter region or a CDs region.

The framework of our promoter region identification algorithm named ``PromoterIden" can be described as follows:

  1. Collect proper promoter sequence dataset and use Viterbi training algorithm to construct deriving lexicon termed $L_{promoter}$. Similarly, construct intron promoter $L_{intron}$ and $L_{CD}$;
  2. Let $W$ be the window size, $O$ be the overlap size. Given an input genomic DNA sequence $S$, use a window of size $W$ and overlap $O$ to slid along $S$ from the start. For each sequence $s_i$( $0\le i \le { \frac{\vert s\vert-w}{O}}$) within the window, let $pr_i$ be the empirical probability of $s_i$,

    1. Compute the log generation probability of $s_i$ termed $pr_i^{promoter}$ based on $L_{promoter}$. Similarly, compute $pr_i^{intron}$ and $pr_i^{CD}$.
    2. Compute the score of $s_i$ defined as $score_i=\hbox{max}(pr_i^{intron}, pr_i^{CD})-pr_i^{promoter}$.
    3. If $score_i<0$ and $pr_i^{promoter}>pr_i$, we assign $s_i$ to candidate promoter region set $G$.

  3. For each candidate sequence $g_i$ in candidate promoter region set $G$, where $g_i=s_j\in S$, if $s_{j-1}\notin G$ and $s_{j+1}\notin G$, then delete $g_i$ from $G$. In this way, we can rule out the single false positive.
  4. For sequences in $G$, compute average promoter probability $averageprob=\sum_{s_i\in G}{pr_i^{promoter}}/\vert G\vert$ and compute max score $maxscore=\hbox{max}_{s_i\in G }{score_i}$.

  5. For each sequence $s_i$ in $G$, if $pr_i^{promoter} >averageprob$ and $score_i > \theta_1 * maxscore$, output $s_i$ as identified promoter region.

To predict if our algorithm is effective, we use a set of empirical parameters as $w=250, o=10,\theta_1=0.8$. We downloaded the promoter, intron and CDs training data from a public Representative Benchmark Data Sets of Human DNA Sequences provided by Berkeley Drosophila Genome Project web site. Those promoters were extracted from the Eukaryotic Promoter Database rel. 50; the negative set contains coding and non-coding sequences from the 1998 GENIE data set. After training the corresponding promoter, intron, and CDs lexicon, we apply above identification of promoter region algorithm to the review dataset used by Fickett and Hatzigeiogiou [25] consisting 24 promoters covering a total of 33,120 bp and none of the sequences hits a sequence in EPD. We use the non-strand-specific comparison standard in [70] and compare our result with Audic [1], Autogene [45], GeneID [44], NNPP 2.1, PromFind [39], TATA [6], TSSG [76] and TSSW [76]. The result is shown in table 1. Our result is comparable under this coarse framework, because no other method got higher TP and fewer FP at same time than our method. It is remarkable that our algorithm, which has not been "tuned" at all to search for promoter regions, does so well against published algorithms.

Similarly, we can train the prediction model for intron, CDs, 3' UTR and intergenic regions identification.


Table 1: Results for the Fickett data set. TP denotes True Positive. FP denotes False Positive. Coverage denotes TP over total of promoters in Fickett data set.
Method TP TP/(TP+FP)(%) FP TP/FP coverage=TP/24(%)
Audic 9 24 29 0.31 37
Autogene 8 15 46 0.17 33
GeneID 11 19 48 0.23 46
NNPP 2.1 14 19 59 0.24 58
PromFind 11 31 24 0.46 46
PromoterScan 3 33 6 0.50 12
TATA 7 23 24 0.30 29
TSSG 10 37 17 0.60 42
TSSW 14 30 33 0.42 58
PromoterIden 11 36 20 0.55 46


Jing Liu 2005-11-17