A RetroSearch Logo

Home - News ( United States | United Kingdom | Italy | Germany ) - Football scores

Search Query:

Showing content from https://en.wikipedia.org/wiki/C4.5_algorithm below:

C4.5 algorithm - Wikipedia

From Wikipedia, the free encyclopedia

Algorithm for making decision trees

C4.5 is an algorithm used to generate a decision tree developed by Ross Quinlan.[1] C4.5 is an extension of Quinlan's earlier ID3 algorithm. The decision trees generated by C4.5 can be used for classification, and for this reason, C4.5 is often referred to as a statistical classifier. In 2011, authors of the Weka machine learning software described the C4.5 algorithm as "a landmark decision tree program that is probably the machine learning workhorse most widely used in practice to date".[2]

It became quite popular after ranking #1 in the Top 10 Algorithms in Data Mining pre-eminent paper published by Springer LNCS in 2008.[3]

C4.5 builds decision trees from a set of training data in the same way as ID3, using the concept of information entropy. The training data is a set S = s 1 , s 2 , . . . {\displaystyle S={s_{1},s_{2},...}} of already classified samples. Each sample s i {\displaystyle s_{i}} consists of a p-dimensional vector ( x 1 , i , x 2 , i , . . . , x p , i ) {\displaystyle (x_{1,i},x_{2,i},...,x_{p,i})} , where the x j {\displaystyle x_{j}} represent attribute values or features of the sample, as well as the class in which s i {\displaystyle s_{i}} falls.

At each node of the tree, C4.5 chooses the attribute of the data that most effectively splits its set of samples into subsets enriched in one class or the other. The splitting criterion is the normalized information gain (difference in entropy). The attribute with the highest normalized information gain is chosen to make the decision. The C4.5 algorithm then recurses on the partitioned sublists.

This algorithm has a few base cases.

In pseudocode, the general algorithm for building decision trees is:[4]

  1. Check for the above base cases.
  2. For each attribute a, find the normalized information gain ratio from splitting on a.
  3. Let a_best be the attribute with the highest normalized information gain.
  4. Create a decision node that splits on a_best.
  5. Recurse on the sublists obtained by splitting on a_best, and add those nodes as children of node.

J48 is an open source Java implementation of the C4.5 algorithm in the Weka data mining tool.

Improvements from ID3 algorithm[edit]

C4.5 made a number of improvements to ID3. Some of these are:

Improvements in C5.0/See5 algorithm[edit]

Quinlan went on to create C5.0 and See5 (C5.0 for Unix/Linux, See5 for Windows) which he markets commercially. C5.0 offers a number of improvements on C4.5. Some of these are:[6][7]

Source for a single-threaded Linux version of C5.0 is available under the GNU General Public License (GPL).

  1. ^ Quinlan, J. R. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, 1993.
  2. ^ Ian H. Witten; Eibe Frank; Mark A. Hall (2011). "Data Mining: Practical machine learning tools and techniques, 3rd Edition". Morgan Kaufmann, San Francisco. p. 191.
  3. ^ Umd.edu - Top 10 Algorithms in Data Mining
  4. ^ S.B. Kotsiantis, "Supervised Machine Learning: A Review of Classification Techniques", Informatica 31(2007) 249-268, 2007
  5. ^ J. R. Quinlan. Improved use of continuous attributes in c4.5. Journal of Artificial Intelligence Research, 4:77-90, 1996.
  6. ^ Is See5/C5.0 Better Than C4.5?
  7. ^ M. Kuhn and K. Johnson, Applied Predictive Modeling, Springer 2013

RetroSearch is an open source project built by @garambo | Open a GitHub Issue

Search and Browse the WWW like it's 1997 | Search results from DuckDuckGo

HTML: 3.2 | Encoding: UTF-8 | Version: 0.7.3