Introduction:
In their article "A Decision Tree System for Finding Genes in DNA," Steven Salzberg,
Arthur Delcher, Kenneth Fasman, and John Henderson describe MORGAN, an integrated
system that uses decision trees to find genes in vertebrate DNA sequences, and report
its performance on a benchmark database of vertebrate DNA.
Summary:
The authors extend earlier research on gene finding by combining decision tree
classifiers, signal recognition algorithms, and dynamic programming. MORGAN
(Multi-frame Optimal Rule-Based Gene Analyzer) is highly modular, so improvements in
any one aspect of the gene-finding task can be incorporated into the system
relatively easily. The framework of the system is a dynamic programming algorithm
that can efficiently consider the large number of alternative parses that are
possible for any sequence of DNA. According to the authors, the combined system is
the first complete gene-finding system based on decision trees, and the experiments
reported in the article indicate that MORGAN is very accurate at finding genes in
vertebrate sequence data.
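To make the idea of a dynamic programming search over alternative parses more concrete, here is a minimal sketch in Python. It is not MORGAN's actual algorithm: it assumes each candidate exon is just a hypothetical `(start, end, score)` tuple and ignores reading frames, splice signals, and start and stop sites, all of which a real gene finder has to handle. It shows only how the best-scoring chain of non-overlapping candidates can be found without enumerating every combination.

```python
from bisect import bisect_right

def best_parse(candidates):
    """Return (total_score, chain) for the best set of non-overlapping exons.

    candidates: list of (start, end, score) with inclusive coordinates.
    This tuple format and the additive score are assumptions for illustration.
    """
    exons = sorted(candidates, key=lambda e: e[1])  # order by end position
    ends = [e[1] for e in exons]
    # best[i] holds the best (score, chain) using only the first i exons
    best = [(0.0, [])]
    for i, (start, end, score) in enumerate(exons):
        # j = how many of the first i exons end strictly before this one starts
        j = bisect_right(ends, start - 1, 0, i)
        take_score = best[j][0] + score
        if take_score > best[i][0]:
            best.append((take_score, best[j][1] + [(start, end)]))
        else:
            best.append(best[i])  # skipping this exon is at least as good
    return best[-1]

# Toy candidates: the second overlaps both the first and the third.
toy_candidates = [(10, 50, 0.9), (40, 80, 0.6), (60, 120, 0.8), (130, 200, 0.7)]
total, chain = best_parse(toy_candidates)
print(total, chain)
```

Because each candidate is examined once and the best result for every prefix is reused, the number of parses considered implicitly can be very large while the work done stays small.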
[Figure: Sample decision tree for classifying human DNA]
The internal nodes of the tree represent feature values that are tested for each
subsequence as it is passed to the tree. Subsequences are passed down the tree
beginning at the top, and a "yes" result on any test means that the example is
passed down to the left, while a "no" result sends it to the right. The features
tested in this tree include the donor site score (donor), the sum of the donor and
acceptor site scores (d + a), the in-frame hexamer frequency (hex), and Fickett's
position asymmetry statistic (asym). Each successive node represents a decision
based on those values, until a final classification is reached at the bottom of the
tree. The leaf nodes contain class labels indicating whether the subsequence is an
"exon" or a "pseudo-exon," together with the distribution of training-set examples
from each class, which MORGAN uses to produce probability estimates.
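The traversal described above can be sketched in a few lines of Python. This is only an illustration: the `Node` class, the form of the test (`value < threshold`), the thresholds, and the leaf counts are all made up for the example and are not taken from the article. What it shows is how a subsequence's feature values route it to a leaf, and how the class counts stored at that leaf can be turned into a probability estimate.

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class Node:
    # Internal nodes set feature/threshold/yes/no; leaves set only the counts.
    feature: Optional[str] = None
    threshold: float = 0.0
    yes: Optional["Node"] = None   # left branch, taken when the test is true
    no: Optional["Node"] = None    # right branch, taken when the test is false
    exon_count: int = 0            # training exons that reached this leaf
    pseudo_count: int = 0          # training pseudo-exons that reached this leaf

def exon_probability(node: Node, features: Dict[str, float]) -> float:
    """Walk from the root to a leaf and return the fraction of exons there."""
    while node.feature is not None:
        # Hypothetical test form: "is the feature value below the threshold?"
        node = node.yes if features[node.feature] < node.threshold else node.no
    total = node.exon_count + node.pseudo_count
    return node.exon_count / total if total else 0.0

# A toy two-level tree with invented thresholds and counts.
toy_tree = Node(
    feature="d+a", threshold=3.4,
    yes=Node(exon_count=5, pseudo_count=95),
    no=Node(
        feature="hex", threshold=10.0,
        yes=Node(exon_count=20, pseudo_count=30),
        no=Node(exon_count=90, pseudo_count=10),
    ),
)

p_strong = exon_probability(toy_tree, {"d+a": 5.0, "hex": 12.0})
p_weak = exon_probability(toy_tree, {"d+a": 1.0, "hex": 12.0})
print(p_strong, p_weak)
```

Keeping the class counts at each leaf, rather than only a label, is what lets the tree return a graded estimate instead of a hard yes or no.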
Conclusion:
An important advantage of using decision
trees is that they allow the experimenter to analyze the errors made by the system.
The modular nature of MORGAN makes it possible in some cases to determine which
components of the system are responsible for certain errors, and this helps to guide
future development.
Source:
Salzberg, S., Delcher, A., Fasman, K., & Henderson, J. (1998). A decision tree
system for finding genes in DNA. Journal of Computational Biology, 5(4), 667-680.
Retrieved from http://online.liebertpub.com/doi/pdf/10.1089/cmb.1998.5.667
I can understand that the decision tree can help in recognizing genes in a genome, but does it actually aid in defining the type of gene or the different components of the gene (such as sequences for specific amino acids)? It sounds like this decision tree model is only a way to recognize exons and the final sequence of the DNA as a method of cross-checking the results of other tests.