Li [53] maintains very useful bibbliographies in electronic form.
Important progress in automatic gene finding has come from the combination of statistical technique with signal detection methods. i.e SorFind [40], HEXON [75,77], Xpound [79], GRAIL II [83] and MZEF [87] improves the prediction of individual internal exons significant by associating the measures of the coding potential with the flanking sites.
In 1990, Fieds and Solderlund [27] and Gelfand [32] poineered the field of whole gene structure prediction.
Many different programs of gene finding resulted from many years of research. They differs in prediction techniques, quality estimation and the standard of optimal gene mode.
GeneID [35] first identify all the exons and use a heuristic,rule-based system to assemble those exons into models of one likely gene in each sequence.
GenLang [20] interpret the content measure and signal detections as "leaf rules" in a linguitic context.
HEXON [77] and FGENEH [78] apply linear discriminant analysis to combine several measures in order to descriminate functional regions of sequences.
MORGAN [69] use decision tree to identify content signal and splicing signals and assemble all the components into an optimal gene models.
GeneParser [73,74], GENEVIEW [60] and GRAIL II [84] compuate scores for different identified intervals of sequences and first use dynamical programming techniques to find out the optimal combination of those intervals.
The VEIL [36]uses a HMM system for segmenting anonymous vetebrate sequences into exons, introns and intergenic regions. GENIE [50] introduces GHMM, which models the biological sequences in a better way. In GHMM, each state is also a sub model. GENSCAN [7] further develops the GHMM model by optimizing the sub model recognizing content signals and splicing signals and adding the influence of G+C content.
Program PROCRUSTES [33]use protein (and cDNA) similarity information to indentify genes and predict gene structure.
Jon et. al. [58] introduces a generalized hidd markov phylogeny(GHMP) to discover the conserved regions in sequences from multiple closely related organisms.
A good deal of literature on gene prediction and regulatory elements identification has accumulated in recent years. However, despite the
considerable effort lavished on this problem, Burst and Guig
concluded that most programs
identified less than 50% of exons on average when tested on a standard set of 570 vertebrate
multi-exon genes [8]. And most gene finding methods assumed that the input sequence
contains exactly one complete gene. GENSCAN is a better work [7] in predict primary transcript, however, it
has a over simplified model for some complicated regions like promoter region.
To identify genes in the genomic DNA sequences, we developed algorithms to predict splicing sites, intron, exon etc. regions. Based on those algorithms for identifying genomic regions, the ``Multi Lexicons HMM" model is proposed to identify genes in genomic DNA sequences. We use the PATRICIA [61] data structure to organize our various lexicons for efficient search.
Our training data is a total of 793 unrelated human genes, which was used to train the GENIE gene finding system [50] developed at LBNL and UC Santa Cruz. For reliable prediction of gene, we need to first consider basic signal sensors and content sensors. To identify the Intron/Exon Regions, we used the Position Weight Matrix Model(PWMM) to discriminate true splicing signals from pseudo splicing signals. We also used a trained Intron/Exon lexicon and length distribution of exon, intron regions [59]to differentiate Intron/Exon regions.
After identifying characteristic gene regions, we propose a ``Multi Lexicons HMM" algorithm to identify genes in genomic DNA sequences. We expect that it could be capable of predicting the number of genes in a sequence as well as the location of introns, exons, promoter, non-coding RNA etc.
To accomplish ``Multi Lexicons HMM", we first trained a set of lexicons for introns, exons, promoters, 3' UTR, intergenic and non-coding RNA regions, etc. and a set of signal models for donors, acceptors, start sites, stop sites etc. . When sequentially scanning an input genomic DNA sequence, we use a HMM model to dynamically parse it with most suitable combination of lexicons. The identification of the genes of the input sequence can be generated based on the final combination of those lexicons.
Currently, we use a simplified human gene structure model .
denotes the initial Exon in phase 3.
denotes terminal Exon in phase 3.
denotes the preceding Exon ending in reading
frame phase
.
denotes the Introns whose preceding Exon is in phase
. This model maintains
a consistent reading frame of Exons.
denotes for single exon.
The start codon is ``ATG". The stop codons
are ``TGA,TAA,TAG". When predicted exons are assembled to a translatable mRNA,
there should be no in-frame stop codons in those exons. Because each coding
regions
are surrounded by two non coding
regions
. Given the preceding and following
non coding regions, we can conclude what is the coding region between them. So,
we reduce the the final state to
, where
denotes intergenic
region,
denotes
.
Given an input unannotated genomic sequence
, we need to
annotate it in state
. Let
denote
.
denotes the probability of word
in
Lexicon for state
.
denotes the list of potential Exons ending at
.
denotes one of the exons, which starts at
and ends at
.
The preceding state of
is
, the following state is
.
denotes the generation probability of
based on
exon lexicons
.
The probability of best annotation for
is defined in definition
1.
is the maximum of two parts:
We partitioned our human gene training data into six parts and used five of them as training data. We built corresponding weight matrixes for acceptor and donor sites. We trained lexicons for intron and exon regions respectively. Given an input unannotated human genomic sequence, we first found out all the potential signals based on those signal sensor models. Then we computed all the potential exons and their generation probabilities based on the exon lexicon. Finally, we used the annotation strategy in definition 1 to calculate the best annotation of the input sequence.
Jing Liu 2005-11-17