Skip to main content
Log in

Data discretization unification

  • Regular Paper
  • Published:
Knowledge and Information Systems Aims and scope Submit manuscript

Abstract

Data discretization is defined as a process of converting continuous data attribute values into a finite set of intervals with minimal loss of information. In this paper, we prove that discretization methods based on informational theoretical complexity and the methods based on statistical measures of data dependency are asymptotically equivalent. Furthermore, we define a notion of generalized entropy and prove that discretization methods based on Minimal description length principle, Gini index, AIC, BIC, and Pearson’s X 2 and G 2 statistics are all derivable from the generalized entropy function. We design a dynamic programming algorithm that guarantees the best discretization based on the generalized entropy notion. Furthermore, we conducted an extensive performance evaluation of our method for several publicly available data sets. Our results show that our method delivers on the average 31% less classification errors than many previously known discretization methods.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Similar content being viewed by others

References

  1. Agresti A (1990) Categorical data analysis. Wiley, New York

    MATH  Google Scholar 

  2. Auer P, Holte R, Maass W (1995) Theory and applications of agnostic pac-learning with small decision trees. In: Machine learning: proceedings of the twelth international conference. Morgan Kaufmann

  3. Bay SD (2001) Multivariate discretization for set mining. Knowl Inf Syst 3(4): 491–512

    Article  MATH  Google Scholar 

  4. Breiman L, Friedman J, Olshen R, Stone C (1998) Classification and regression trees. CRC Press

  5. Boulle M (2004) Khiops: a statistical discretization method of continuous attributes. Mach Learn 55: 53–69

    Article  MATH  Google Scholar 

  6. Boulle M (2006) MODL: a Bayes optimal discretization method for continuous attributes. Mach Learn 65(1): 131–165

    Article  Google Scholar 

  7. Casella G, Berger RL (2001) Statistical inference, 2nd edn. Duxbury Press

  8. Catlett J (1991) On changing continuous attributes into ordered discrete attributes. In: Proceedings of European working session on learning, pp 164–178

  9. Ching JY, Wong AKC, Chan KCC (1995) Class-dependent discretization for inductive learning from continuous and mixed-mode data. IEEE Trans Pattern Anal Mach Intell 17(7): 641–651

    Article  Google Scholar 

  10. Chmielewski MR, Grzymala-Busse JW (1996) Global discretization of continuous attributes as preprocessing for machine learning. Int J Approx Reason 15

  11. Cover TM, Thomas JA (2006) Elements of information thoery, 2nd edn. Wiley, New York

    Google Scholar 

  12. Dougherty J, Kohavi R, Sahavi M (1995) Supervised and unsupervised discretization of continuous attributes. In: Proceedings of the 12th international conference on machine learning, pp 194–202

  13. Elomaa T, Rousu J (2003) Necessary and sufficient pre-processing in numerical range discretization. Knowl Inf Syst 5(2): 162–182

    Article  Google Scholar 

  14. Elomaa T, Rousu J (2004) Efficient multisplitting revisited: optima-preserving elimination of partition candidates. Data Mining Knowl Discovery 8: 97–126

    Article  MathSciNet  Google Scholar 

  15. Fayyad UM, Irani KB (1993) Multi-interval discretization of continuous-valued attributes for classification learning. In: Proceedings of the 13th joint conference on artificial intelligence, pp 1022–1029

  16. Girosi F, Jones M, Poggio T (1995) Regularization theory and neural networks architectures. Neural Comput 7(2): 219–269

    Article  Google Scholar 

  17. Hand D, Mannila H, Smyth P (2001) Principles of data mining. MIT Press

  18. Hansen MH, Yu B (2001) Model selection and the principle of minimum description length. J Am Statist Assci 96: 454

    MathSciNet  Google Scholar 

  19. Hastie T, Tibshirani R, Friedman J (2001) The elements of statistical learning. Springer, Heidelberg

    MATH  Google Scholar 

  20. Holte RC (1993) Very simple calssification rules perform well on most commonly used datasets. Mach Learn 11: 63–90

    Article  MATH  Google Scholar 

  21. Johnson N, Kotz S, Balakrishnan N (1994) Continuous univariate distributions, 2nd edn. Wiley, New York

    MATH  Google Scholar 

  22. Jin R, Breitbart Y (2007) Data discretization unification. Technical Report, Department of Computer Science, Kent State University. http://www.cs.kent.edu/research/techrpts.html

  23. Kerber R (1992) ChiMerge: discretization of numeric attributes. In: National conference on artificial intelligence

  24. Kurgan LA, Cios KJ (2004) CAIM discretization algorithm. IEEE Trans Knowl Data Eng 16(2): 145–153

    Article  Google Scholar 

  25. Kohavi R, Sahami M (1996) Error-based and entropy-based discretization of continuous features. In: Proceedings of the second international conference on knowledge discovery and data mining. Menlo Park. AAAI Press, pp 114–119

  26. Liu H, Hussain F, Tan CL, Dash M (2002) Discretization: an enabling technique. Data Mining Knowl Discovery 6: 393–423

    Article  MathSciNet  Google Scholar 

  27. Liu H, Setiono R (1995) Chi2: feature selection and discretization of numeric attributes. In: Proceedings of 7th IEEE int’l conference on tools with artificial intelligence

  28. Liu X, Wang H (2005) A discretization algorithm based on a heterogeneity criterion. IEEE Trans Knowl Data Eng 17(9): 1166–1173

    Article  Google Scholar 

  29. Mussard S, Seyte F, Terraza M (2003) Decomposition of Gini and the generalized entropy inequality measures. Econ Bull 4(7): 1–6

    Google Scholar 

  30. Pfahringer B (1995) Supervised and unsupervised discretization of continuous features. In: Proceedings of 12th international conference on machine learning, pp 456–463

  31. Rissanen J (1978) Modeling by shortest data description. Automatica 14: 465–471

    Article  MATH  Google Scholar 

  32. Simovici DA, Jaroszewicz S (2002) An axiomatization of partition entropy. IEEE Trans Inf Theory 48(7): 2138–2142

    Article  MATH  MathSciNet  Google Scholar 

  33. Wallace DL (1959) Bounds on normal approximations to Student’s and the Chi-square distributions. Ann Mathe Stat 30(4): 1121–1130

    Article  MATH  MathSciNet  Google Scholar 

  34. Wallace DL (1960) Correction to “Bounds on Normal Approximations to Student’s and the Chi-Square Distributions”. Ann Math Statist 31(3): 810

    Article  Google Scholar 

  35. Wong AKC, Chiu DKY (1987) Synthesizing statistical knowledge from incomplete mixed-mode data. IEEE Trans Pattern Anal Mach Intell 9(6): 796–805

    Article  Google Scholar 

  36. Yang Y, Webb GI (2003) Weighted proportional k-interval discretization for naive–Bayes classifiers. In: Advances in knowledge discovery and data mining: 7th Pacific-Asia Conference, PAKDD, pp 501–512

  37. UCI Machine Learning Repository (2007) http://www.ics.uci.edu/mlearn/ML.Repository.html

  38. Weka 3 (2007) Data mining software in Java. http://www.cs.waikato.ac.nz/ml/weka

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Ruoming Jin.

Additional information

This research in part is supported by Lady Davis Fellowship, Haifa, Israel.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Jin, R., Breitbart, Y. & Muoh, C. Data discretization unification. Knowl Inf Syst 19, 1–29 (2009). https://doi.org/10.1007/s10115-008-0142-6

Download citation

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10115-008-0142-6

Keywords

Navigation