
Chebyshev approaches for imbalanced data streams regression models

Published in: Data Mining and Knowledge Discovery

Abstract

In recent years, data stream mining and learning from imbalanced data have been active research areas. Although solutions exist for each of these two problems, most are not designed to handle the challenges that arise when both occur together. To the best of our knowledge, the few approaches to learning from imbalanced data streams address classification, and no work in the regression domain has been reported yet. This paper proposes a technique that uses sampling strategies to cope with imbalanced data streams in a regression setting, where the most important cases have rare and extreme target values. Specifically, we employ under-sampling and over-sampling strategies that use Chebyshev’s inequality as a heuristic to disclose the type of each incoming case (i.e. frequent or rare). We evaluated our proposal by applying it to the training of models by four well-known regression algorithms over fourteen benchmark data sets, conducting a series of experiments with different setups on both synthetic and real-world data. The experimental results confirm our approach’s effectiveness: the models trained with each of the sampling strategies outperform their baseline pairs.
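The heuristic described above can be sketched in code. Chebyshev's inequality states that for a random variable with mean \(\mu\) and variance \(\sigma^2\), \(P(|Y-\mu| \ge t) \le \sigma^2/t^2\), so the bound itself gives a rough, distribution-free estimate of how frequent a target value is: values near 1 suggest a frequent (central) case, small values suggest a rare (extreme) one. A minimal Python sketch, assuming Welford's online update for the stream statistics; the class and method names are hypothetical, and the paper's actual Algorithms 1 and 2 may differ in detail:

```python
class ChebyshevRarityEstimator:
    """Incrementally tracks mean/variance of the target in a stream and
    scores each incoming case with the Chebyshev frequency bound.
    Hypothetical sketch; not the paper's exact algorithm."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations (Welford)

    def update(self, y):
        # Welford's online update of mean and (unnormalized) variance
        self.n += 1
        delta = y - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (y - self.mean)

    def frequency_bound(self, y):
        """Chebyshev bound P(|Y - mean| >= t) <= var / t^2, capped at 1.

        Small values indicate rare (extreme) targets; values near 1
        indicate frequent (central) targets."""
        if self.n < 2:
            return 1.0
        var = self.m2 / (self.n - 1)
        t = abs(y - self.mean)
        if t == 0 or var == 0:
            return 1.0
        return min(1.0, var / (t * t))

est = ChebyshevRarityEstimator()
for y in [10, 11, 9, 10, 12, 10, 11, 50]:  # 50 is an extreme target
    est.update(y)

print(est.frequency_bound(10) > est.frequency_bound(50))  # True
```

Under such a scheme, an under-sampler could keep a frequent case only with small probability, while an over-sampler could replicate a rare case several times, both driven by this bound.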



Notes

  1. https://www.dcc.fc.up.pt/~ltorgo/Regression/DataSets.html.

  2. https://sci2s.ugr.es/keel/category.php?cat=reg.

  3. https://www.kaggle.com/tsaustin/us-used-car-sales-data.

  4. https://www.dcc.fc.up.pt/~ltorgo/Regression/DataSets.html.

  5. https://github.com/ehaminian/imbalancedDataStream.



Acknowledgements

This research was funded by national funds through FCT - Science and Technology Foundation, in the context of the project FailStopper (DSAIPA/DS/0086/2018).

Author information

Correspondence to Ehsan Aminian.

Additional information

Responsible editor: Johannes Fürnkranz.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Experimental evaluation tables

1.1 More about the data sets

Since we used short names to refer to the data sets in the main text, and to make it easier for readers to find them, we report their complete names, sources, and number of attributes in Table 8.

Table 9 Data sets characteristics: number of cases (#cases), number of cases with rare extreme target values (#rare cases) with \(thr_\phi =0.9\) that are low extremes (i.e. below the median), high extremes (i.e. above the median) or both
Table 10 Data sets characteristics: number of cases (#cases), number of cases with rare extreme target values (#rare cases) with \(thr_\phi =1.0\) that are low extremes (i.e. below the median), high extremes (i.e. above the median) or both

1.2 Results of the experiments: sensitivity analysis varying \(\phi \)

Tables 11, 12, 13, 14, 15 and 16 show the results of the paired comparisons for the four learner models trained with our under- and over-sampling strategies, ChebyUS and ChebyOS, against their baseline versions. In each experiment, we report the value obtained by the error function (i.e. \(RMSE_{\phi }\) or RMSE) for the predictions of the learner named in the column header over the data set specified in the corresponding row. The numbers inside the cells are the average and corresponding standard deviation over ten rounds of experiments. We report the statistical significance (level 95%) of the difference between each pair using the symbols \(\triangleright \) and \(\triangleleft \), which point to the significantly better method: \(\triangleright \) indicates that the ChebyUS or ChebyOS method is significantly better than the baseline, and \(\triangleleft \) that it is significantly worse. As we report results for different levels of \(thr_\phi \), information about the number/percentage of rare cases observed in each data set is presented in Tables 9 and 10.
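A common formulation of \(RMSE_{\phi }\) in the utility-based regression literature is a relevance-weighted RMSE restricted to cases whose relevance meets the threshold \(thr_\phi\); the sketch below assumes that formulation (the paper's exact definition may differ), with a toy relevance function for illustration:

```python
import math

def rmse_phi(y_true, y_pred, phi, thr=0.8):
    """Relevance-weighted RMSE over cases with phi(y) >= thr (sketch):
        RMSE_phi = sqrt( sum_i phi(y_i)*(yhat_i - y_i)^2 / sum_i phi(y_i) )
    With thr = 0 and phi constant, this reduces to the plain RMSE."""
    num, den = 0.0, 0.0
    for y, yhat in zip(y_true, y_pred):
        w = phi(y)
        if w >= thr:
            num += w * (yhat - y) ** 2
            den += w
    return math.sqrt(num / den) if den > 0 else float("nan")

# toy relevance: targets far from 10 are "relevant" (hypothetical choice)
phi = lambda y: min(1.0, abs(y - 10) / 5)
y_true = [10, 11, 30, 9, -15]
y_pred = [10, 12, 25, 9, -10]
print(rmse_phi(y_true, y_pred, phi))  # only the two extreme cases count
```

Only the two extreme targets (30 and -15) pass the threshold here, so the metric reflects the error on rare cases alone.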

Table 11 RMSE and \(RMSE_{\phi }\) estimates considering all the observations: \(thr_\phi = 0\)
Table 12 RMSE and \(RMSE_{\phi }\) estimates considering all (double-side) rare cases: \(thr_\phi = 0.8\)
Table 13 RMSE and \(RMSE_{\phi }\) estimates considering all (double-side) rare cases: \(thr_\phi = 0.9\)
Table 14 RMSE and \(RMSE_{\phi }\) estimates considering all (double-side) rare cases: \(thr_\phi = 1.0\)

Chebyshev’s probability, K value and relevance function \(\phi ()\) graphs for each data set

The following figures show, for each data set, the value of the relevance function \(\phi ()\) for each example, the probability assigned by Algorithm 1, the K value assigned by Algorithm 2, and the box plot of the target variable. The red points in the graphs indicate the rare cases of each data set.
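The K value plotted in these figures is the standardized Chebyshev distance of each target from the stream mean. A plausible use of it for over-sampling is as a replication count: the rarer the case, the more copies enter training. The sketch below assumes that reading; the function names are hypothetical and the paper's Algorithm 2 may compute K differently:

```python
def chebyshev_k(y, mean, std):
    """Standardized distance t = |y - mean| / std; by Chebyshev's
    inequality at most a 1/t^2 fraction of the mass lies that far out."""
    return abs(y - mean) / std if std > 0 else 0.0

def oversample_copies(y, mean, std, k_max=10):
    """Hypothetical ChebyOS-style replication count: rarer cases
    (larger t) are replicated more often; frequent ones are kept once."""
    t = chebyshev_k(y, mean, std)
    return max(1, min(k_max, round(t)))

mean, std = 0.0, 1.0
print(oversample_copies(0.2, mean, std))  # central case -> 1 copy
print(oversample_copies(4.7, mean, std))  # extreme case -> 5 copies
```

The cap `k_max` prevents a single far outlier from flooding the training stream.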

Table 15 \(RMSE_{\phi }\) estimates considering left-hand side rare cases (\(thr_\phi = 0.8\))
Table 16 \(RMSE_{\phi }\) estimates considering right-hand side rare cases (\(thr_\phi = 0.8\))
Fig. 15 Box plot, K-value, Chebyshev’s probability and relevance function \(\phi ()\) for the target variable of the puma32H data set

Fig. 16 Box plot, K-value, Chebyshev’s probability and relevance function \(\phi ()\) for the target variable of the cpusum data set

Fig. 17 Box plot, K-value, Chebyshev’s probability and relevance function \(\phi ()\) for the target variable of the elevator data set

Fig. 18 Box plot, K-value, Chebyshev’s probability and relevance function \(\phi ()\) for the target variable of the bike data set

Fig. 19 Box plot, K-value, Chebyshev’s probability and relevance function \(\phi ()\) for the target variable of the energy data set

Fig. 20 Box plot, K-value, Chebyshev’s probability and relevance function \(\phi ()\) for the target variable of the calhousing data set

Fig. 21 Box plot, K-value, Chebyshev’s probability and relevance function \(\phi ()\) for the target variable of the gasemission data set

Fig. 22 Box plot, K-value, Chebyshev’s probability and relevance function \(\phi ()\) for the target variable of the mv data set

Fig. 23 Box plot, K-value, Chebyshev’s probability and relevance function \(\phi ()\) for the target variable of the fried data set

Fig. 24 Box plot, K-value, Chebyshev’s probability and relevance function \(\phi ()\) for the target variable of the pollution data set

Fig. 25 Box plot, K-value, Chebyshev’s probability and relevance function \(\phi ()\) for the target variable of the car price data set

Fig. 26 Box plot, K-value, Chebyshev’s probability and relevance function \(\phi ()\) for the target variable of the query data set

Fig. 27 Box plot, K-value, Chebyshev’s probability and relevance function \(\phi ()\) for the target variable of the GPU data set

Fig. 28 Box plot, K-value, Chebyshev’s probability and relevance function \(\phi ()\) for the target variable of the 3d spatial network data set


About this article


Cite this article

Aminian, E., Ribeiro, R.P. & Gama, J. Chebyshev approaches for imbalanced data streams regression models. Data Min Knowl Disc 35, 2389–2466 (2021). https://doi.org/10.1007/s10618-021-00793-1

