Monday, March 19, 2012

A Decision Tree System For Finding Genes In DNA

In "A Decision Tree System for Finding Genes in DNA," Steven Salzberg, Arthur Delcher, Kenneth Fasman, and John Henderson describe MORGAN, an integrated system built around decision trees for finding genes in vertebrate DNA sequences, and report its performance on a benchmark database of vertebrate DNA.

The authors expand on earlier gene-finding research by combining decision tree classifiers, signal recognition algorithms, and dynamic programming. MORGAN (Multi-frame Optimal Rule-based Gene Analyzer) is highly modular, so an improvement in any one aspect of the gene-finding task can be incorporated into the system relatively easily. The framework of the system is a dynamic programming algorithm that can efficiently consider the large number of alternative parses that are possible for any sequence of DNA. According to the authors, the resulting combined system is the first complete gene-finding system based on decision trees, and the experiments reported in the article show that MORGAN is very accurate at finding genes in vertebrate sequence data.
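To make the idea of a dynamic programming search over alternative parses more concrete, here is a minimal sketch. It is not MORGAN's actual algorithm: it ignores reading frames and splice signals and simply picks the highest-scoring set of non-overlapping candidate exons (the classic weighted interval scheduling recurrence). The function name `best_parse`, the tuple layout, and all scores are made up for illustration.

```python
from bisect import bisect_left
from typing import List, Tuple

# Hypothetical candidate exon: (start, end inclusive, score).
Exon = Tuple[int, int, float]


def best_parse(candidates: List[Exon]) -> Tuple[float, List[Exon]]:
    """Pick the non-overlapping candidates with the highest total score."""
    exons = sorted(candidates, key=lambda e: e[1])  # sort by end position
    ends = [e[1] for e in exons]
    n = len(exons)
    best = [0.0] * (n + 1)   # best[i] = best total using the first i exons
    taken = [False] * n      # was exon i used in the optimum for best[i + 1]?
    prev = [0] * n           # how many earlier exons end before exon i starts

    for i, (start, _end, score) in enumerate(exons):
        prev[i] = bisect_left(ends, start)  # exons ending strictly before start
        with_exon = best[prev[i]] + score
        if with_exon > best[i]:
            best[i + 1] = with_exon
            taken[i] = True
        else:
            best[i + 1] = best[i]

    # Walk backwards to recover which exons were chosen.
    chosen: List[Exon] = []
    i = n
    while i > 0:
        if taken[i - 1]:
            chosen.append(exons[i - 1])
            i = prev[i - 1]
        else:
            i -= 1
    chosen.reverse()
    return best[n], chosen


# Invented candidates; in a real system the scores would come from the classifiers.
candidates = [(0, 10, 0.9), (5, 20, 0.8), (15, 30, 0.7), (25, 40, 0.95)]
total, parse = best_parse(candidates)
print(total, parse)
```

The point of the recurrence is that each candidate is examined once, so the number of possible combinations never has to be enumerated explicitly.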

Sample decision tree for classifying human DNA

The internal nodes of the tree test feature values for each subsequence that is passed to the tree. A subsequence enters at the top and moves down one node at a time: a "yes" result on any test sends it down to the left, and each successive node makes a further decision until the subsequence reaches a leaf. The features tested in this tree include the donor site score (donor), the sum of the donor and acceptor site scores (d + a), the in-frame hexamer frequency (hex), and Fickett's position asymmetry statistic (asym). The leaf nodes at the bottom of the tree carry a class label, "exon" or "pseudo-exon," indicating whether the subsequence is an exon or not. They also hold the distribution of training-set examples from each class, which MORGAN uses to produce probability estimates.
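A small sketch may help show how such a tree turns feature values into a probability. The tree shape, thresholds, and leaf counts below are invented for illustration and are not taken from the article; only the feature names (donor, acceptor, hex, asym) and the "yes goes left" convention come from the description above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Union

Features = Dict[str, float]


@dataclass
class Leaf:
    exon: int         # training examples of class "exon" at this leaf
    pseudo_exon: int  # training examples of class "pseudo-exon" at this leaf

    def p_exon(self) -> float:
        """Probability estimate from the leaf's class distribution."""
        return self.exon / (self.exon + self.pseudo_exon)


@dataclass
class Node:
    test: Callable[[Features], bool]
    yes: Union["Node", Leaf]  # left branch, taken when the test is true
    no: Union["Node", Leaf]   # right branch, taken when the test is false


def classify(tree: Union[Node, Leaf], features: Features) -> Leaf:
    """Pass a subsequence's features down the tree until a leaf is reached."""
    node = tree
    while isinstance(node, Node):
        node = node.yes if node.test(features) else node.no
    return node


# Hypothetical tree: every threshold and count here is made up.
TREE = Node(
    test=lambda f: f["donor"] + f["acceptor"] < 3.4,   # d + a
    yes=Leaf(exon=5, pseudo_exon=95),
    no=Node(
        test=lambda f: f["hex"] < 0.0,                 # in-frame hexamer
        yes=Node(
            test=lambda f: f["asym"] < 0.5,            # position asymmetry
            yes=Leaf(exon=10, pseudo_exon=90),
            no=Leaf(exon=40, pseudo_exon=60),
        ),
        no=Leaf(exon=80, pseudo_exon=20),
    ),
)

leaf = classify(TREE, {"donor": 2.0, "acceptor": 2.5, "hex": 1.2, "asym": 0.3})
print(leaf.p_exon())
```

Keeping the raw counts at each leaf, rather than only a label, is what lets a tree report how confident it is instead of giving a bare yes or no.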

An important advantage of using decision trees is that they allow the experimenter to analyze the errors made by the system. The modular nature of MORGAN makes it possible in some cases to determine which components of the system are responsible for certain errors, and this helps to guide future development.

Salzberg, S., Delcher, A., Fasman, K., & Henderson, J. (1998). A decision tree system for finding genes in DNA. Journal of Computational Biology, 5(4), 667-680. Retrieved from

1 comment:

  1. I can understand that the decision tree can help in recognizing genes in a genome, but does it actually aid in defining the type of gene or the different components of the gene (such as sequences for specific amino acids)? It sounds like this decision tree model is only a way to recognize exons and the final sequence of the DNA as a method of cross-checking the results of other tests.