Abstract
Data discretization is the process of converting continuous attribute values into a finite set of intervals with minimal loss of information. In this paper, we prove that discretization methods based on information-theoretic complexity and methods based on statistical measures of data dependency are asymptotically equivalent. Furthermore, we define a notion of generalized entropy and prove that discretization methods based on the minimum description length principle, the Gini index, AIC, BIC, and Pearson's X² and G² statistics are all derivable from the generalized entropy function. We design a dynamic programming algorithm that guarantees the best discretization with respect to the generalized entropy notion. We also conduct an extensive performance evaluation of our method on several publicly available data sets. Our results show that our method yields, on average, 31% fewer classification errors than many previously known discretization methods.
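To make the dynamic programming idea concrete, here is a minimal sketch of optimal supervised discretization of a single attribute. It is not the paper's algorithm: the abstract does not state the generalized entropy goodness function, so the cost below (interval size times empirical class entropy, plus a fixed per-interval `penalty`) is an assumed stand-in, and the names `interval_cost`, `best_discretization`, and `penalty` are hypothetical.

```python
import math
from collections import Counter


def interval_cost(labels):
    """Assumed cost: N * H(labels), with H the empirical class entropy in bits."""
    n = len(labels)
    counts = Counter(labels)
    return -sum(c * math.log2(c / n) for c in counts.values())


def best_discretization(values, labels, penalty=1.0):
    """Return (total_cost, cut_points) minimizing the assumed cost.

    total_cost = sum of interval_cost over intervals + penalty * (#intervals).
    The penalty is a hypothetical stand-in for a model-complexity term.
    """
    pairs = sorted(zip(values, labels), key=lambda p: p[0])
    xs = [v for v, _ in pairs]
    ys = [y for _, y in pairs]
    n = len(xs)

    # best[j]: minimal cost of discretizing the first j sorted points.
    best = [0.0] + [math.inf] * n
    back = [0] * (n + 1)
    for j in range(1, n + 1):
        # An interval may not end between two equal attribute values.
        if j < n and xs[j] == xs[j - 1]:
            continue
        for i in range(j):
            if best[i] == math.inf:
                continue  # i is not a valid boundary
            cost = best[i] + interval_cost(ys[i:j]) + penalty
            if cost < best[j]:
                best[j] = cost
                back[j] = i

    # Walk the back-pointers to recover cut points (midpoints between neighbors).
    cuts = []
    j = n
    while j > 0:
        i = back[j]
        if i > 0:
            cuts.append((xs[i - 1] + xs[i]) / 2)
        j = i
    return best[n], sorted(cuts)


values = [1, 2, 3, 10, 11, 12]
labels = ["a", "a", "a", "b", "b", "b"]
cost, cuts = best_discretization(values, labels, penalty=1.0)
print(cost, cuts)
```

Because every valid boundary is considered for every prefix, the result is optimal for the assumed cost. Recomputing `interval_cost` on each slice keeps the sketch short but makes it cubic in the number of points; prefix class counts would bring it down to quadratic.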
Additional information
This research is supported in part by the Lady Davis Fellowship, Haifa, Israel.
Cite this article
Jin, R., Breitbart, Y. & Muoh, C. Data discretization unification. Knowl Inf Syst 19, 1–29 (2009). https://doi.org/10.1007/s10115-008-0142-6