
Chebyshev approaches for imbalanced data streams regression models

Published in: Data Mining and Knowledge Discovery

Abstract

In recent years, data stream mining and learning from imbalanced data have been active research areas. Although solutions exist for each of these two problems, most are not designed to handle the challenges that arise when both occur together. To the best of our knowledge, the few approaches to learning from imbalanced data streams address classification, and no work in the regression domain has been reported yet. This paper proposes a technique that uses sampling strategies to cope with imbalanced data streams in a regression setting, where the most important cases have rare and extreme target values. Specifically, we employ under-sampling and over-sampling strategies that use Chebyshev’s inequality as a heuristic to disclose the type of each incoming case (i.e. frequent or rare). We evaluated our proposal by applying it to the training of models by four well-known regression algorithms over fourteen benchmark data sets, conducting a series of experiments with different setups on both synthetic and real-world data. The experimental results confirm our approach’s effectiveness: the models trained with each of the sampling strategies outperform their baseline pairs.
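The heuristic described above can be sketched in code. Chebyshev's inequality states that for a random variable with mean \(\mu\) and variance \(\sigma^2\), \(P(|Y-\mu| \ge t) \le \sigma^2/t^2\), so the bound itself gives a rough, distribution-free estimate of how frequent a target value is: values near 1 suggest a frequent (central) case, small values suggest a rare (extreme) one. A minimal Python sketch, assuming Welford's online update for the stream statistics; the class and method names are hypothetical, and the paper's actual Algorithms 1 and 2 may differ in detail:

```python
class ChebyshevRarityEstimator:
    """Incrementally tracks mean/variance of the target in a stream and
    scores each incoming case with the Chebyshev frequency bound.
    Hypothetical sketch; not the paper's exact algorithm."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations (Welford)

    def update(self, y):
        # Welford's online update of mean and (unnormalized) variance
        self.n += 1
        delta = y - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (y - self.mean)

    def frequency_bound(self, y):
        """Chebyshev bound P(|Y - mean| >= t) <= var / t^2, capped at 1.

        Small values indicate rare (extreme) targets; values near 1
        indicate frequent (central) targets."""
        if self.n < 2:
            return 1.0
        var = self.m2 / (self.n - 1)
        t = abs(y - self.mean)
        if t == 0 or var == 0:
            return 1.0
        return min(1.0, var / (t * t))

est = ChebyshevRarityEstimator()
for y in [10, 11, 9, 10, 12, 10, 11, 50]:  # 50 is an extreme target
    est.update(y)

print(est.frequency_bound(10) > est.frequency_bound(50))  # True
```

Under such a scheme, an under-sampler could keep a frequent case only with small probability, while an over-sampler could replicate a rare case several times, both driven by this bound.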



Notes

  1. https://www.dcc.fc.up.pt/~ltorgo/Regression/DataSets.html.

  2. https://sci2s.ugr.es/keel/category.php?cat=reg.

  3. https://www.kaggle.com/tsaustin/us-used-car-sales-data.

  4. https://www.dcc.fc.up.pt/~ltorgo/Regression/DataSets.html.

  5. https://github.com/ehaminian/imbalancedDataStream.



Acknowledgements

This research was funded by national funds through FCT - Science and Technology Foundation, in the context of the project FailStopper (DSAIPA/DS/0086/2018).

Author information

Correspondence to Ehsan Aminian.

Additional information

Responsible editor: Johannes Fürnkranz.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Experimental evaluation tables

1.1 More about the data sets

Since we used short names to refer to the data sets in the main text, and to make it easier for readers to find them, we report their complete names, sources, and number of attributes in Table 8.

Table 9 Data sets characteristics: number of cases (#cases), number of cases with rare extreme target values (#rare cases) with \(thr_\phi =0.9\) that are low extremes (i.e. below the median), high extremes (i.e. above the median) or both
Table 10 Data sets characteristics: number of cases (#cases), number of cases with rare extreme target values (#rare cases) with \(thr_\phi =1.0\) that are low extremes (i.e. below the median), high extremes (i.e. above the median) or both

1.2 Results of the experiments: sensitivity analysis varying \(\phi \)

Tables 11, 12, 13, 14, 15 and 16 show the results of the paired comparisons for the four learner models trained with our under- and over-sampling strategies, ChebyUS and ChebyOS, against their baseline versions. In each experiment, we report the value obtained by the error function (i.e. \(RMSE_{\phi }\) or RMSE) for the predictions of the learner named in the column header over the data set specified in the corresponding row. The numbers inside the cells are the average and corresponding standard deviation over ten rounds of experiments. We report the statistical significance (level 95%) of the difference between each pair using the symbols \(\triangleright \) and \(\triangleleft \), which point to the significantly better method: \(\triangleright \) indicates that the ChebyUS or ChebyOS method is significantly better than the baseline, and \(\triangleleft \) that it is significantly worse. As we report results for different levels of \(thr_\phi \), information about the number/percentage of rare cases observed in each data set is presented in Tables 9 and 10.
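A common formulation of \(RMSE_{\phi }\) in the utility-based regression literature is a relevance-weighted RMSE restricted to cases whose relevance meets the threshold \(thr_\phi\); the sketch below assumes that formulation (the paper's exact definition may differ), with a toy relevance function for illustration:

```python
import math

def rmse_phi(y_true, y_pred, phi, thr=0.8):
    """Relevance-weighted RMSE over cases with phi(y) >= thr (sketch):
        RMSE_phi = sqrt( sum_i phi(y_i)*(yhat_i - y_i)^2 / sum_i phi(y_i) )
    With thr = 0 and phi constant, this reduces to the plain RMSE."""
    num, den = 0.0, 0.0
    for y, yhat in zip(y_true, y_pred):
        w = phi(y)
        if w >= thr:
            num += w * (yhat - y) ** 2
            den += w
    return math.sqrt(num / den) if den > 0 else float("nan")

# toy relevance: targets far from 10 are "relevant" (hypothetical choice)
phi = lambda y: min(1.0, abs(y - 10) / 5)
y_true = [10, 11, 30, 9, -15]
y_pred = [10, 12, 25, 9, -10]
print(rmse_phi(y_true, y_pred, phi))  # only the two extreme cases count
```

Only the two extreme targets (30 and -15) pass the threshold here, so the metric reflects the error on rare cases alone.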

Table 11 RMSE and \(RMSE_{\phi }\) estimates considering all the observations: \(thr_\phi = 0\)
Table 12 RMSE and \(RMSE_{\phi }\) estimates considering all (double-side) rare cases: \(thr_\phi = 0.8\)
Table 13 RMSE and \(RMSE_{\phi }\) estimates considering all (double-side) rare cases: \(thr_\phi = 0.9\)
Table 14 RMSE and \(RMSE_{\phi }\) estimates considering all (double-side) rare cases: \(thr_\phi = 1.0\)

Chebyshev’s probability, K value and relevance function \(\phi ()\) graphs for each data set

The following figures show, for each data set, the value of the relevance function \(\phi ()\) for each example, the probability assigned by Algorithm 1, the K value assigned by Algorithm 2, and the box plot of the target variable. The red points in the graphs indicate the rare cases of each data set.
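The K value plotted in these figures is the standardized Chebyshev distance of each target from the stream mean. A plausible use of it for over-sampling is as a replication count: the rarer the case, the more copies enter training. The sketch below assumes that reading; the function names are hypothetical and the paper's Algorithm 2 may compute K differently:

```python
def chebyshev_k(y, mean, std):
    """Standardized distance t = |y - mean| / std; by Chebyshev's
    inequality at most a 1/t^2 fraction of the mass lies that far out."""
    return abs(y - mean) / std if std > 0 else 0.0

def oversample_copies(y, mean, std, k_max=10):
    """Hypothetical ChebyOS-style replication count: rarer cases
    (larger t) are replicated more often; frequent ones are kept once."""
    t = chebyshev_k(y, mean, std)
    return max(1, min(k_max, round(t)))

mean, std = 0.0, 1.0
print(oversample_copies(0.2, mean, std))  # central case -> 1 copy
print(oversample_copies(4.7, mean, std))  # extreme case -> 5 copies
```

The cap `k_max` prevents a single far outlier from flooding the training stream.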

Table 15 \(RMSE_{\phi }\) estimates considering left-hand side rare cases (\(thr_\phi = 0.8\))
Table 16 \(RMSE_{\phi }\) estimates considering right-hand side rare cases (\(thr_\phi = 0.8\))
Fig. 15 Box plot, K-value, Chebyshev’s probability and relevance function \(\phi ()\) for the target variable of the puma32H data set

Fig. 16 Box plot, K-value, Chebyshev’s probability and relevance function \(\phi ()\) for the target variable of the cpusum data set

Fig. 17 Box plot, K-value, Chebyshev’s probability and relevance function \(\phi ()\) for the target variable of the elevator data set

Fig. 18 Box plot, K-value, Chebyshev’s probability and relevance function \(\phi ()\) for the target variable of the bike data set

Fig. 19 Box plot, K-value, Chebyshev’s probability and relevance function \(\phi ()\) for the target variable of the energy data set

Fig. 20 Box plot, K-value, Chebyshev’s probability and relevance function \(\phi ()\) for the target variable of the calhousing data set

Fig. 21 Box plot, K-value, Chebyshev’s probability and relevance function \(\phi ()\) for the target variable of the gasemission data set

Fig. 22 Box plot, K-value, Chebyshev’s probability and relevance function \(\phi ()\) for the target variable of the mv data set

Fig. 23 Box plot, K-value, Chebyshev’s probability and relevance function \(\phi ()\) for the target variable of the fried data set

Fig. 24 Box plot, K-value, Chebyshev’s probability and relevance function \(\phi ()\) for the target variable of the pollution data set

Fig. 25 Box plot, K-value, Chebyshev’s probability and relevance function \(\phi ()\) for the target variable of the car price data set

Fig. 26 Box plot, K-value, Chebyshev’s probability and relevance function \(\phi ()\) for the target variable of the query data set

Fig. 27 Box plot, K-value, Chebyshev’s probability and relevance function \(\phi ()\) for the target variable of the GPU data set

Fig. 28 Box plot, K-value, Chebyshev’s probability and relevance function \(\phi ()\) for the target variable of the 3d spatial network data set


About this article


Cite this article

Aminian, E., Ribeiro, R.P. & Gama, J. Chebyshev approaches for imbalanced data streams regression models. Data Min Knowl Disc 35, 2389–2466 (2021). https://doi.org/10.1007/s10618-021-00793-1

