ALBERT — All Library Books, journals and Electronic Records Telegrafenberg

1

Unknown

On classifier behavior in the presence of mislabeling noise (2016)

Springer

In: Data Mining and Knowledge Discovery

Publication Date: 2016-12-06

Description: Machine learning algorithms perform differently in settings with varying levels of training set mislabeling noise. Therefore, the choice of the right algorithm for a particular learning problem is crucial. The contribution of this paper is towards two, dual problems: first, comparing algorithm behavior; and second, choosing learning algorithms for noisy settings. We present the “sigmoid rule” framework, which can be used to choose the most appropriate learning algorithm depending on the properties of noise in a classification problem. The framework uses an existing model of the expected performance of learning algorithms as a sigmoid function of the signal-to-noise ratio in the training instances. We study the characteristics of the sigmoid function using five representative non-sequential classifiers, namely, Naïve Bayes, kNN, SVM, a decision tree classifier, and a rule-based classifier, and three widely used sequential classifiers based on hidden Markov models, conditional random fields and recursive neural networks. Based on the sigmoid parameters we define a set of intuitive criteria that are useful for comparing the behavior of learning algorithms in the presence of noise. Furthermore, we show that there is a connection between these parameters and the characteristics of the underlying dataset, showing that we can estimate an expected performance over a dataset regardless of the underlying algorithm. The framework is applicable to concept drift scenarios, including modeling user behavior over time, and mining of noisy time series of evolving nature.
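Illustration (not part of the catalog record): the description above models expected performance as a sigmoid function of the signal-to-noise ratio. The sketch below uses a generic four-parameter logistic curve to show what such a model looks like. The parameter names (lower, upper, slope, midpoint) and the example values are assumptions chosen for illustration; the paper's own parametrization may differ.

```python
import math


def sigmoid_performance(snr, lower, upper, slope, midpoint):
    """Expected performance as a logistic function of the signal-to-noise ratio.

    Assumed generic form (hypothetical parameter names):
        lower + (upper - lower) / (1 + exp(-slope * (snr - midpoint)))
    lower/upper are the performance plateaus at very low/high signal-to-noise
    ratio, slope controls how quickly performance changes, and midpoint is
    the ratio at which performance is halfway between the two plateaus.
    """
    return lower + (upper - lower) / (1.0 + math.exp(-slope * (snr - midpoint)))


if __name__ == "__main__":
    # Hypothetical parameters for one classifier; not taken from the paper.
    params = dict(lower=0.5, upper=0.9, slope=1.5, midpoint=2.0)
    for snr in (0.0, 1.0, 2.0, 3.0, 4.0):
        print(snr, round(sigmoid_performance(snr, **params), 3))
```

Comparing two algorithms under this kind of model amounts to comparing their fitted curve parameters, for example which curve has the higher plateau or the steeper transition.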

Print ISSN: 1384-5810

Electronic ISSN: 1573-756X

Topics: Computer Science

Published by Springer

2

Unknown

Visualizing the behavior and some symmetry properties of Bayesian confirmation measures (2016)

Springer

In: Data Mining and Knowledge Discovery

Publication Date: 2016-12-06

Description: Bayesian confirmation measures, a special class of interestingness measures, are functions usually adopted in ranking inductive rules generated by data mining methods such as association rule mining, decision trees, and rough sets. To date, a plethora of measures have been defined in many different ways, and identifying and effectively distinguishing among them is a difficult task. In this paper we propose a unified visual approach aimed at comparing and classifying a large subset of Bayesian confirmation measures (those satisfying the initial and final probability dependence condition). We first reduce the set of variables in their analytical expressions to only two, which allows their contour lines to be drawn on the plane. We observe that two-dimensional contour-line plots act as a kind of fingerprint of the confirmation measures and, therefore, this geometric visualization can be used as an effective tool to investigate the properties and behavior of the measures. We highlight the potential of this approach not only for studying known measures but also for inventing new measures that satisfy given required characteristics. Finally, following the geometry of the plots, we define a new set of symmetry properties of confirmation measures and describe four classical symmetries geometrically.
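Illustration (not part of the catalog record): a confirmation measure that depends only on the final probability P(H|E) and the initial probability P(H) can be evaluated on a two-variable grid, which is what makes contour plots possible. The sketch below uses the classical difference measure d(H, E) = P(H|E) - P(H) as one example of such a measure. The function names and the contingency counts are assumptions made for illustration, and the paper may use other measures and notation.

```python
def probabilities_from_counts(a, b, c, d):
    """Return (P(H|E), P(H)) for a rule E -> H from a 2x2 contingency table.

    a: cases with H and E      b: cases with H but not E
    c: cases with E but not H  d: cases with neither
    """
    n = a + b + c + d
    if n == 0 or a + c == 0:
        raise ValueError("need at least one case, and at least one with E")
    return a / (a + c), (a + b) / n


def difference_measure(p_h_given_e, p_h):
    """Classical difference measure: positive when E raises the probability of H."""
    return p_h_given_e - p_h


def measure_grid(measure, steps=5):
    """Evaluate a two-variable measure on a regular grid over [0, 1] x [0, 1].

    Row i fixes P(H|E) = i / (steps - 1); column j fixes P(H) = j / (steps - 1).
    A contour plot of such a grid gives the kind of picture the abstract describes.
    """
    ticks = [i / (steps - 1) for i in range(steps)]
    return [[measure(x, y) for y in ticks] for x in ticks]


if __name__ == "__main__":
    x, y = probabilities_from_counts(30, 10, 20, 40)  # hypothetical counts
    print(x, y, difference_measure(x, y))
    for row in measure_grid(difference_measure):
        print([round(v, 2) for v in row])
```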

Print ISSN: 1384-5810

Electronic ISSN: 1573-756X

Topics: Computer Science

Published by Springer

3

Unknown

Tiers for peers: a practical algorithm for discovering hierarchy in weighted networks (2016)

Springer

In: Data Mining and Knowledge Discovery

Publication Date: 2016-12-06

Description: Interactions in many real-world phenomena can be explained by a strong hierarchical structure. Typically, this structure or ranking is not known; instead we only have observed outcomes of the interactions, and the goal is to infer the hierarchy from these observations. Discovering a hierarchy in the context of directed networks can be formulated as follows: given a graph, partition vertices into levels such that, ideally, there are only edges from upper levels to lower levels. The ideal case can only happen if the graph is acyclic. Consequently, in practice we have to introduce a penalty function that penalizes edges violating the hierarchy. A practical variant for such penalty is agony, where each violating edge is penalized based on the severity of the violation. Hierarchy minimizing agony can be discovered in time, and much faster in practice. In this paper we introduce several extensions to agony. We extend the definition for weighted graphs and allow a cardinality constraint that limits the number of levels. While these are conceptually trivial extensions, current algorithms cannot handle them, nor can they be easily extended. We solve the problem by showing the connection to the capacitated circulation problem, and we demonstrate that we can compute the exact solution fast in practice for large datasets. We also introduce a provably fast heuristic algorithm that produces rankings with competitive scores. In addition, we show that we can compute agony in polynomial time for any convex penalty, and, to complete the picture, we show that minimizing hierarchy with any concave penalty is an NP-hard problem.
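Illustration (not part of the catalog record): the sketch below scores a given level assignment with a weighted agony penalty. It only evaluates a ranking and does not implement the paper's minimization algorithm. The per-edge penalty w * max(r(u) - r(v) + 1, 0), the convention that a non-violating edge (u, v) goes from a smaller rank to a strictly larger one, and the optional max_levels check are assumptions based on the usual definition of agony rather than details stated in this record.

```python
def weighted_agony(edges, rank, max_levels=None):
    """Total weighted agony of a ranking.

    edges: iterable of (u, v, weight) for directed edges u -> v
    rank:  dict mapping each vertex to an integer level
    max_levels: optional cardinality constraint; if given, every rank
                must lie in range(max_levels)

    Assumed convention: an edge (u, v) is consistent with the hierarchy
    when rank[u] < rank[v]; otherwise it costs weight * (rank[u] - rank[v] + 1).
    """
    if max_levels is not None:
        for vertex, level in rank.items():
            if not 0 <= level < max_levels:
                raise ValueError(f"rank of {vertex!r} violates the level limit")
    total = 0
    for u, v, weight in edges:
        total += weight * max(rank[u] - rank[v] + 1, 0)
    return total


if __name__ == "__main__":
    # Hypothetical weighted 3-cycle: no ranking can satisfy all three edges.
    edges = [("a", "b", 1), ("b", "c", 2), ("c", "a", 3)]
    rank = {"a": 0, "b": 1, "c": 2}
    print(weighted_agony(edges, rank, max_levels=3))  # prints 9
```

In this example only the edge c -> a violates the ranking: it points backwards across two levels, so it costs 3 * (2 - 0 + 1) = 9, while the other two edges cost nothing.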

Print ISSN: 1384-5810

Electronic ISSN: 1573-756X

Topics: Computer Science

Published by Springer

4

Unknown

Explaining clusterings of process instances (2016)

Springer

In: Data Mining and Knowledge Discovery

Publication Date: 2016-12-06

Description: This paper presents a technique that aims to increase human understanding of trace clustering solutions. The clustering techniques under scrutiny stem from the process mining domain, where the clustering of process instances is deemed a useful technique to analyse process data with a large variety of behaviour. Until now, the most often used method to inspect clustering solutions in this domain is visual inspection of the clustering results. This paper proposes a more thorough approach based on the post hoc application of supervised learning with support vector machines on cluster results. Our approach learns concise rules to describe why a specific instance is included in a certain cluster based on specific control-flow based feature variables. An extensive experimental evaluation is presented showing that our technique outperforms alternatives. Likewise, we are able to identify features that lead to shorter and more accurate explanations.

Print ISSN: 1384-5810

Electronic ISSN: 1573-756X

Topics: Computer Science

Published by Springer

5

Unknown

On classifier behavior in the presence of mislabeling noise (2016)

Mirylenka, Katsiaryna ; Giannakopoulos, George ; Do, Le Minh ; [et al.]

Springer

In: Data Mining and Knowledge Discovery. 2016; 31(3): 661-701. Published 2016 Dec 05. doi: 10.1007/s10618-016-0484-8.

Publication Date: 2016-12-05

Print ISSN: 1384-5810

Electronic ISSN: 1573-756X

Topics: Computer Science

Published by Springer

6

Unknown

Visualizing the behavior and some symmetry properties of Bayesian confirmation measures (2016)

Celotto, Emilio

Springer

In: Data Mining and Knowledge Discovery. 2016; 31(3): 739-773. Published 2016 Dec 05. doi: 10.1007/s10618-016-0487-5.

Publication Date: 2016-12-05

Print ISSN: 1384-5810

Electronic ISSN: 1573-756X

Topics: Computer Science

Published by Springer

7

Unknown

Tiers for peers: a practical algorithm for discovering hierarchy in weighted networks (2016)

Tatti, Nikolaj

Springer

In: Data Mining and Knowledge Discovery. 2016; 31(3): 702-738. Published 2016 Dec 05. doi: 10.1007/s10618-016-0485-7.

Publication Date: 2016-12-05

Print ISSN: 1384-5810

Electronic ISSN: 1573-756X

Topics: Computer Science

Published by Springer

8

Unknown

Explaining clusterings of process instances (2016)

De Koninck, Pieter ; De Weerdt, Jochen ; vanden Broucke, Seppe K. L. M.

Springer

In: Data Mining and Knowledge Discovery. 2016; 31(3): 774-808. Published 2016 Dec 05. doi: 10.1007/s10618-016-0488-4.

Publication Date: 2016-12-05

Print ISSN: 1384-5810

Electronic ISSN: 1573-756X

Topics: Computer Science

Published by Springer

9

Unknown

Detecting cooperative and organized spammer groups in micro-blogging community (2016)

Springer

In: Data Mining and Knowledge Discovery

Publication Date: 2016-11-28

Description: In recent years, social spammers have become rampant and have evolved into a number of variations in most social networks. In the micro-blogging community, there is a typical type of anomalous group consisting of cooperative and organized spammers who are hired by public relations companies and paid to post tweets with certain content. They intentionally evolve their content and behavior patterns to avoid detection, and cooperatively hijack trending topics with a deliberate point of view, which can seriously affect people’s judgments and decisions. Due to the evolving nature and hidden behavior of this type of spammer, two important issues must be addressed to detect such spammer groups. One is to detect the anomalous topics hijacked by spammer groups among numerous trending topics. The other is to detect the members of a spammer group among the users joining anomalous topics. In this paper, we propose a two-stage topology-based method to detect spammer groups partially distributed in multiple trending topics. In the first stage, we detect the anomalous topics among the many trending topics according to a new similarity measure based on subgraph ranking. A topic is identified as anomalous if the topology characteristics of its retweeting networks change dramatically between adjacent periods. In the second stage, we obtain several anomalous topic sequences from a few initially labeled spammers by employing the basic idea of label propagation, and cluster the users who join each topic sequence into group spammers and normal users by their total authorities. The total authority of a user is his/her weighted cumulative authority in the anomalous topics of each topic sequence, and the authority in each topic is defined based on the out-degree of the user in the retweeting network. The experimental results, based on real-world data collected from the Sina micro-blogging site, demonstrate that our similarity measure maintains leading performance in all evaluation metrics, and that our method detects group spammers effectively compared with other methods.

Print ISSN: 1384-5810

Electronic ISSN: 1573-756X

Topics: Computer Science

Published by Springer

10

Unknown

The great time series classification bake off: a review and experimental evaluation of recent algorithmic advances (2016)

Springer

In: Data Mining and Knowledge Discovery

Publication Date: 2016-11-27

Description: In the last 5 years there have been a large number of new time series classification algorithms proposed in the literature. These algorithms have been evaluated on subsets of the 47 data sets in the University of California, Riverside time series classification archive. The archive has recently been expanded to 85 data sets, over half of which have been donated by researchers at the University of East Anglia. Aspects of previous evaluations have made comparisons between algorithms difficult. For example, several different programming languages have been used, experiments involved a single train/test split and some used normalised data whilst others did not. The relaunch of the archive provides a timely opportunity to thoroughly evaluate algorithms on a larger number of datasets. We have implemented 18 recently proposed algorithms in a common Java framework and compared them against two standard benchmark classifiers (and each other) by performing 100 resampling experiments on each of the 85 datasets. We use these results to test several hypotheses relating to whether the algorithms are significantly more accurate than the benchmarks and each other. Our results indicate that only nine of these algorithms are significantly more accurate than both benchmarks and that one classifier, the collective of transformation ensembles, is significantly more accurate than all of the others. All of our experiments and results are reproducible: we release all of our code, results and experimental details and we hope these experiments form the basis for more robust testing of new algorithms in the future.

Print ISSN: 1384-5810

Electronic ISSN: 1573-756X

Topics: Computer Science

Published by Springer
