Data discretization unification

Jin, Ruoming; Breitbart, Yuri; Muoh, Chibuike

doi:10.1007/s10115-008-0142-6

Data discretization unification

Regular Paper
Published: 17 May 2008

Volume 19, pages 1–29, (2009)
Cite this article

Knowledge and Information Systems Aims and scope Submit manuscript

Ruoming Jin¹,
Yuri Breitbart¹ &
Chibuike Muoh¹

358 Accesses
46 Citations
Explore all metrics

Abstract

Data discretization is defined as a process of converting continuous data attribute values into a finite set of intervals with minimal loss of information. In this paper, we prove that discretization methods based on informational theoretical complexity and the methods based on statistical measures of data dependency are asymptotically equivalent. Furthermore, we define a notion of generalized entropy and prove that discretization methods based on Minimal description length principle, Gini index, AIC, BIC, and Pearson’s X ² and G ² statistics are all derivable from the generalized entropy function. We design a dynamic programming algorithm that guarantees the best discretization based on the generalized entropy notion. Furthermore, we conducted an extensive performance evaluation of our method for several publicly available data sets. Our results show that our method delivers on the average 31% less classification errors than many previously known discretization methods.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

References

Agresti A (1990) Categorical data analysis. Wiley, New York
MATH Google Scholar
Auer P, Holte R, Maass W (1995) Theory and applications of agnostic pac-learning with small decision trees. In: Machine learning: proceedings of the twelth international conference. Morgan Kaufmann
Bay SD (2001) Multivariate discretization for set mining. Knowl Inf Syst 3(4): 491–512
Article MATH Google Scholar
Breiman L, Friedman J, Olshen R, Stone C (1998) Classification and regression trees. CRC Press
Boulle M (2004) Khiops: a statistical discretization method of continuous attributes. Mach Learn 55: 53–69
Article MATH Google Scholar
Boulle M (2006) MODL: a Bayes optimal discretization method for continuous attributes. Mach Learn 65(1): 131–165
Article Google Scholar
Casella G, Berger RL (2001) Statistical inference, 2nd edn. Duxbury Press
Catlett J (1991) On changing continuous attributes into ordered discrete attributes. In: Proceedings of European working session on learning, pp 164–178
Ching JY, Wong AKC, Chan KCC (1995) Class-dependent discretization for inductive learning from continuous and mixed-mode data. IEEE Trans Pattern Anal Mach Intell 17(7): 641–651
Article Google Scholar
Chmielewski MR, Grzymala-Busse JW (1996) Global discretization of continuous attributes as preprocessing for machine learning. Int J Approx Reason 15
Cover TM, Thomas JA (2006) Elements of information thoery, 2nd edn. Wiley, New York
Google Scholar
Dougherty J, Kohavi R, Sahavi M (1995) Supervised and unsupervised discretization of continuous attributes. In: Proceedings of the 12th international conference on machine learning, pp 194–202
Elomaa T, Rousu J (2003) Necessary and sufficient pre-processing in numerical range discretization. Knowl Inf Syst 5(2): 162–182
Article Google Scholar
Elomaa T, Rousu J (2004) Efficient multisplitting revisited: optima-preserving elimination of partition candidates. Data Mining Knowl Discovery 8: 97–126
Article MathSciNet Google Scholar
Fayyad UM, Irani KB (1993) Multi-interval discretization of continuous-valued attributes for classification learning. In: Proceedings of the 13th joint conference on artificial intelligence, pp 1022–1029
Girosi F, Jones M, Poggio T (1995) Regularization theory and neural networks architectures. Neural Comput 7(2): 219–269
Article Google Scholar
Hand D, Mannila H, Smyth P (2001) Principles of data mining. MIT Press
Hansen MH, Yu B (2001) Model selection and the principle of minimum description length. J Am Statist Assci 96: 454
MathSciNet Google Scholar
Hastie T, Tibshirani R, Friedman J (2001) The elements of statistical learning. Springer, Heidelberg
MATH Google Scholar
Holte RC (1993) Very simple calssification rules perform well on most commonly used datasets. Mach Learn 11: 63–90
Article MATH Google Scholar
Johnson N, Kotz S, Balakrishnan N (1994) Continuous univariate distributions, 2nd edn. Wiley, New York
MATH Google Scholar
Jin R, Breitbart Y (2007) Data discretization unification. Technical Report, Department of Computer Science, Kent State University. http://www.cs.kent.edu/research/techrpts.html
Kerber R (1992) ChiMerge: discretization of numeric attributes. In: National conference on artificial intelligence
Kurgan LA, Cios KJ (2004) CAIM discretization algorithm. IEEE Trans Knowl Data Eng 16(2): 145–153
Article Google Scholar
Kohavi R, Sahami M (1996) Error-based and entropy-based discretization of continuous features. In: Proceedings of the second international conference on knowledge discovery and data mining. Menlo Park. AAAI Press, pp 114–119
Liu H, Hussain F, Tan CL, Dash M (2002) Discretization: an enabling technique. Data Mining Knowl Discovery 6: 393–423
Article MathSciNet Google Scholar
Liu H, Setiono R (1995) Chi2: feature selection and discretization of numeric attributes. In: Proceedings of 7th IEEE int’l conference on tools with artificial intelligence
Liu X, Wang H (2005) A discretization algorithm based on a heterogeneity criterion. IEEE Trans Knowl Data Eng 17(9): 1166–1173
Article Google Scholar
Mussard S, Seyte F, Terraza M (2003) Decomposition of Gini and the generalized entropy inequality measures. Econ Bull 4(7): 1–6
Google Scholar
Pfahringer B (1995) Supervised and unsupervised discretization of continuous features. In: Proceedings of 12th international conference on machine learning, pp 456–463
Rissanen J (1978) Modeling by shortest data description. Automatica 14: 465–471
Article MATH Google Scholar
Simovici DA, Jaroszewicz S (2002) An axiomatization of partition entropy. IEEE Trans Inf Theory 48(7): 2138–2142
Article MATH MathSciNet Google Scholar
Wallace DL (1959) Bounds on normal approximations to Student’s and the Chi-square distributions. Ann Mathe Stat 30(4): 1121–1130
Article MATH MathSciNet Google Scholar
Wallace DL (1960) Correction to “Bounds on Normal Approximations to Student’s and the Chi-Square Distributions”. Ann Math Statist 31(3): 810
Article Google Scholar
Wong AKC, Chiu DKY (1987) Synthesizing statistical knowledge from incomplete mixed-mode data. IEEE Trans Pattern Anal Mach Intell 9(6): 796–805
Article Google Scholar
Yang Y, Webb GI (2003) Weighted proportional k-interval discretization for naive–Bayes classifiers. In: Advances in knowledge discovery and data mining: 7th Pacific-Asia Conference, PAKDD, pp 501–512
UCI Machine Learning Repository (2007) http://www.ics.uci.edu/mlearn/ML.Repository.html
Weka 3 (2007) Data mining software in Java. http://www.cs.waikato.ac.nz/ml/weka

Download references

Author information

Authors and Affiliations

Department of Computer Science, Kent State University, Kent, OH, 44241, USA
Ruoming Jin, Yuri Breitbart & Chibuike Muoh

Authors

Ruoming Jin
View author publications
You can also search for this author in PubMed Google Scholar
Yuri Breitbart
View author publications
You can also search for this author in PubMed Google Scholar
Chibuike Muoh
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Ruoming Jin.

Additional information

This research in part is supported by Lady Davis Fellowship, Haifa, Israel.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Jin, R., Breitbart, Y. & Muoh, C. Data discretization unification. Knowl Inf Syst 19, 1–29 (2009). https://doi.org/10.1007/s10115-008-0142-6

Download citation

Received: 29 October 2007
Revised: 18 December 2007
Accepted: 29 January 2008
Published: 17 May 2008
Issue Date: April 2009
DOI: https://doi.org/10.1007/s10115-008-0142-6

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Data discretization unification

Abstract

Access this article

Similar content being viewed by others

Using discretization for extending the set of predictive features

Graph clustering-based discretization of splitting and merging methods (GraphS and GraphM)

A Comparison of Two Approaches to Discretization: Multiple Scanning and C4.5

References

Author information

Authors and Affiliations

Corresponding author

Additional information

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Data discretization unification

Abstract

Access this article

Similar content being viewed by others

Using discretization for extending the set of predictive features

Graph clustering-based discretization of splitting and merging methods (GraphS and GraphM)

A Comparison of Two Approaches to Discretization: Multiple Scanning and C4.5

References

Author information

Authors and Affiliations

Corresponding author

Additional information

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation