
Thursday, March 22, 2012

Summary of Findings (Green Team): Decision Tree Analysis (3 out of 5 Stars)

Note: This post represents the synthesis of the thoughts, procedures and experiences of others as represented in the 12 articles read in advance (see previous posts) and the discussion among the students and instructor during the Advanced Analytic Techniques class at Mercyhurst College in March 2012 regarding Decision Tree Analysis specifically. This technique was evaluated based on its overall validity, simplicity, flexibility and its ability to effectively use unstructured data.

Description:
Decision Tree Analysis is an analytic method for quantitatively mapping possible outcomes and their probabilities in order to predict courses of action. The method applies a finite set of variables to nodes, with each branch weighted by a probability. While the results are relatively easy to communicate, using the method as a predictive tool can be problematic because each node forces a choice among a finite number of options.

Strengths:
  • The process of creating a decision tree gives the user insight on the value of information.
  • The technique allows for visually presenting the process and different outcomes of the problem.
  • Can have various applications.
  • Can be relatively simple to create.
  • Can be internally or externally focused.
  • Relatively easy to communicate findings.
  • Can be descriptive or estimative.

Weaknesses:
  • The technique is limited to using a finite set of variables.
  • Can potentially lead to multiple outcomes.
  • Multiple risks are hard to assess at the same time as it is a linear process.
  • The use of a decision tree relies on the decision maker adhering to it.
  • Event nodes require an evaluation of outcome estimates, which may force quantification of qualitative estimates.
  • Tree may expand rapidly due to the possibilities from just a small number of variables.

How-to:
  • Establish the actors.
  • Establish the possible events, and their sequence of importance (based on an evaluation of the actors’ goals and motivations).
  • Order the events into the nodes and branches of the decision tree. Every node should have branches leading to it and possible choices following it.
  • Assign probabilities to possible courses of actions.
  • Evaluate the probability of each possible event occurring based on assigned values and position on the decision tree.
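The steps above can be sketched in code: represent the tree as nested nodes whose branches carry assigned probabilities, then evaluate the probability of each possible outcome by multiplying the probabilities along its path. The actors, events, and numbers below are invented purely for illustration.

```python
# A minimal sketch of the how-to steps: a decision tree as nested dicts,
# where each branch maps to (probability, subtree-or-outcome), and the
# probability of each terminal outcome is the product of the branch
# probabilities along its path. The scenario and numbers are hypothetical.

def leaf_probabilities(node, path_prob=1.0, path=()):
    """Return {outcome_path: probability} for every leaf of the tree."""
    if not isinstance(node, dict):          # terminal outcome reached
        return {path + (node,): path_prob}
    results = {}
    for branch, (prob, subtree) in node.items():
        results.update(leaf_probabilities(subtree, path_prob * prob,
                                          path + (branch,)))
    return results

# Hypothetical events and assigned probabilities for a single actor.
tree = {
    "rival attacks":   (0.3, {"we retaliate": (0.6, "escalation"),
                              "we negotiate": (0.4, "ceasefire")}),
    "rival holds off": (0.7, "status quo"),
}

for path, p in leaf_probabilities(tree).items():
    print(" -> ".join(path), f"{p:.2f}")
```

Because every chance node's branch probabilities sum to 1, the leaf probabilities also sum to 1, which is a quick sanity check on any tree built this way.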

Personal Application of Technique:
The class was broken into three groups to create a decision tree based on the given situation:
“Imagine you only ever do four things on Saturday: go shopping, watch a movie, play tennis or just stay in. What you do depends on three things: how much money you have (rich or poor), the weather (windy, rainy or sunny), and whether or not your parents are visiting.” Our group created a decision tree that treated each criterion as an independent variable, allowing a probability for each choice, whereas the other groups built each decision on the answers to the previous ones. The general outcome of our analysis was to stay in, because there was an 80% chance of bad weather and a 70% chance that the parents would not visit.
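The group's reasoning can be sketched numerically. The 80% and 70% figures come from the exercise above; combining them by simple multiplication, which assumes the two events are independent, is an illustrative assumption rather than the group's exact method.

```python
# A sketch of the class exercise, treating the criteria as independent
# variables. The 0.8 and 0.7 figures are from the exercise; multiplying
# them (i.e., assuming independence) is an illustrative assumption.

p_bad_weather = 0.8   # chance the weather is windy or rainy
p_no_parents = 0.7    # chance the parents do not visit

# Probability that both conditions favoring "stay in" hold at once.
p_stay_in = p_bad_weather * p_no_parents
print(f"P(stay in) = {p_stay_in:.2f}")   # 0.56
```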

Rating: 3 out of 5

Tuesday, March 20, 2012

Constructing a decision tree from data with hierarchical class labels

Introduction:
Decision Tree construction algorithms have usually assumed that the class labels were categorical or Boolean variables, meaning that the algorithms operate under the assumption that the class labels are flat. In real-world applications, however, there are more complex classification scenarios, where the class labels to be predicted are hierarchically related. For example, in a digital library application, a document can be assigned to topics organized into a topic hierarchy; in web site management, a web page can be placed into categories organized as a hierarchical catalog. In both of these cases, the class labels are naturally organized as a hierarchical structure of class labels which defines an abstraction over class labels.

Summary:
The HLC (Hierarchical class Label Classifier) algorithm is designed to construct a DT from data with hierarchical class labels. It follows the standard framework of classical DT induction methods, such as ID3 and C4.5. The HLC process is:

1. Attribute Selection Measure
An attribute selection measure is a heuristic to measure which attribute is the most discriminatory to split a current node. In view of the weakness of the traditional entropy-based method, a new measure is proposed, called hierarchical-entropy value, by modifying the traditional entropy measure. It can help measure the appropriateness of a node with respect to the given class hierarchical tree.

2. Hierarchical-entropy value
The hierarchical-entropy value of a node vb, denoted Hentropy(vb), is computed with an equation given in the paper.

By using this new hierarchical-entropy measure, the appropriateness of a node in a CHT (class hierarchical tree) can be properly measured.

3. Hierarchical Information Gain
Next, decide whether an attribute is a good splitting attribute according to how much hierarchical information is gained by splitting the node on that attribute. The hierarchical information gain, used to select a test attribute ar to split the current node vb, is denoted H-info-Gain(ar, vb). Let vx denote a child node obtained by splitting vb on the test attribute ar; the gain is then computed with an equation given in the paper.

The goal is to select the attribute with the highest H-infoGain value, which is chosen as the next test attribute for the current node.
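Since HLC follows the framework of ID3 and C4.5, its hierarchical-entropy and H-info-Gain measures are modifications of the classical flat-label entropy and information gain. The paper's hierarchical versions are not reproduced in this post; for reference, a sketch of the classical measures they build on:

```python
# Classical (flat-label) entropy and information gain as used in ID3;
# HLC's hierarchical-entropy and H-info-Gain modify these to account
# for the class hierarchical tree. The toy data below is illustrative.
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr_index):
    """Reduction in entropy from splitting on the attribute at attr_index."""
    n = len(labels)
    split = {}
    for row, label in zip(rows, labels):
        split.setdefault(row[attr_index], []).append(label)
    remainder = sum(len(part) / n * entropy(part) for part in split.values())
    return entropy(labels) - remainder

# Pick the attribute with the highest gain as the next test attribute.
rows   = [("sunny", "hot"), ("sunny", "cool"), ("rain", "hot"), ("rain", "cool")]
labels = ["out", "out", "in", "in"]
best = max(range(2), key=lambda i: info_gain(rows, labels, i))
print("best attribute index:", best)   # 0: the first attribute splits the labels perfectly
```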

4. Stop criteria
For the labels at the bottom level, let majority be the label with the most data in vb, and let percent(vb, majority) be the percentage of data in vb whose label is majority. If the stopping conditions given in the paper are met (for example, percent(vb, majority) exceeding a threshold), stop growing node vb; otherwise, the node must be expanded further.

Once stopped, assign a concept label to a stop node vb to cover most of the data without losing too much precision. This is accomplished by using the function getLabel, detailed in the next section.
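A purity-based stop test in the spirit of this step can be sketched as follows. The paper's exact conditions are not reproduced in the post, so the single-threshold rule and the 0.9 value below are illustrative assumptions.

```python
# A sketch of a majority-percentage stop criterion. The threshold and the
# exact rule are assumptions for illustration; the paper gives the actual
# stopping conditions.
from collections import Counter

def should_stop(node_labels, threshold=0.9):
    """Stop growing a node once the majority bottom-level label covers
    at least `threshold` of its data (or the node is empty)."""
    if not node_labels:
        return True
    majority, count = Counter(node_labels).most_common(1)[0]
    return count / len(node_labels) >= threshold

print(should_stop(["a"] * 9 + ["b"]))       # True: majority covers 90%
print(should_stop(["a", "a", "b", "b"]))    # False: node must be expanded
```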

5. Label assignment
A heuristic for assigning a class label when a node matches the stop criteria. The goal is to use the concept label in the class hierarchical tree with the highest accuracy and precision to label the node. Each concept label of a leaf node vb has three vital indices: accuracy, precision, and score.

Applying the function getLabel, one can obtain the accuracy, precision, and score values for each concept label of a node (nodes v1 and v2 in the paper's example).

6. Evaluation
An empirical study was done to demonstrate that the proposed method can achieve a good result in both predictive accuracy and predictive precision. Two real-world data sets were used in the experiments. Two criteria were used to compare the performances of different algorithms: (1) accuracy, which is the percentage of future data that can be correctly classified into the right concept label, and (2) precision, which measures the specificity of the label to which data can be assigned.

To highlight the tradeoffs between predictive accuracy and predictive precision of the resulting hierarchical concept labels, and to explore the performance of the proposed method, the authors chose a well-known decision tree system, C4.5, to compare with their method, HLC.

Conclusion:
The HLC algorithm can achieve higher accuracy and better precision simultaneously. By carefully examining the C4.5 performance data level-by-level, it was found that when accuracy is high, precision becomes low, and vice-versa. This is because the traditional decision tree classifier does not consider data with hierarchical class labels, and thus it cannot handle the tradeoffs between accuracy and precision. On the other hand, since the aim of the project was to construct a DT from data with a class hierarchical tree, the ability to optimize the tradeoff between these two criteria is embedded into the design of the algorithm. The results in Fig. 14 indicate that the design achieved this goal.



The HLC algorithm can be used when evaluating hierarchical data as it will likely allow for more accurate and precise analysis.

Source:
Chen, Yen-Liang; Hu, Hsiao-Wei; Tang, Kwei (2009). Constructing a decision tree from data with hierarchical class labels. Expert Systems with Applications, Volume 36, Issue 3, Part 1, Pages 4838–4847.
http://140.123.102.14:8080/reportSys/file/paper/devil/devil_1_paper.pdf

Monday, March 19, 2012

A Decision Tree System For Finding Genes In DNA


Introduction:
In A Decision Tree System for Finding Genes in DNA, authors Steven Salzberg, Arthur Delcher, Kenneth Fasman and John Henderson discuss MORGAN, an integrated decision-tree-based system for finding genes in vertebrate DNA sequences, as well as its performance on a benchmark database of vertebrate DNA.

Summary:
The authors look to expand on gene-finding research by combining decision tree classifiers, signal recognition algorithms and dynamic programming. MORGAN (Multi-frame Optimal Rule-Based Gene Analyzer) is highly modular, allowing improvements in any one aspect of the gene-finding task to be incorporated relatively easily into the system. The framework of the system is a dynamic programming algorithm that can efficiently consider the large number of alternative parses possible for any sequence of DNA. The resulting combined system is the first complete gene-finding system based on decision trees, and the experiments described in the article demonstrate that MORGAN is very accurate at finding genes in vertebrate sequence data.

Sample decision tree for classifying human DNA

The internal nodes of the tree represent feature values that are tested for each subsequence as it is passed down the tree, beginning at the top; a "yes" result on any test means the example is passed down to the left. The features tested in this tree include the donor site score (donor), the sum of the donor and acceptor site scores (d + a), the in-frame hexamer frequency (hex), and Fickett's position asymmetry statistic (asym). Each successive node represents a decision based on those values, until a final classification is reached at a leaf node. The leaf nodes carry class labels indicating whether the subsequence is an "exon" or a "pseudo-exon," along with the distributions of training examples from each class, which MORGAN uses to produce probability estimates.
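The traversal just described can be sketched as follows. Only the feature names (donor, d + a, hex, asym) come from the article; the tree structure, thresholds, and class distributions below are invented for illustration.

```python
# A sketch of passing a subsequence's feature values down such a tree:
# each internal node tests one feature ("yes" goes left), and each leaf
# holds a class distribution used for probability estimates. Thresholds
# and distributions are hypothetical; only the feature names are real.

def classify(node, features):
    """Walk from the root to a leaf; return the leaf's class distribution."""
    while "test" in node:                         # internal node
        feature, threshold = node["test"]
        # "yes" (feature value meets the threshold) goes left.
        node = node["left"] if features[feature] >= threshold else node["right"]
    return node["dist"]                           # leaf: class distribution

tree = {
    "test": ("donor", 0.5),
    "left": {
        "test": ("hex", 0.2),
        "left":  {"dist": {"exon": 0.9, "pseudo-exon": 0.1}},
        "right": {"dist": {"exon": 0.4, "pseudo-exon": 0.6}},
    },
    "right": {"dist": {"exon": 0.05, "pseudo-exon": 0.95}},
}

print(classify(tree, {"donor": 0.8, "hex": 0.3}))
```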

Conclusion:
An important advantage of using decision trees is that they allow the experimenter to analyze the errors made by the system. The modular nature of MORGAN makes it possible in some cases to determine which components of the system are responsible for certain errors, and this helps to guide future development.

Source:
Salzberg, S., Delcher, A., Fasman, K., & Henderson, J. (1998). A decision tree system for finding genes in dna. Journal of Computational Biology, 5(4), 667-680. Retrieved from http://online.liebertpub.com/doi/pdf/10.1089/cmb.1998.5.667