ALBERT

All Library Books, journals and Electronic Records Telegrafenberg

Filter
Collection
  • Articles  (1,179)
Publisher
  • Springer  (1,179)
  • American Association for the Advancement of Science
  • MDPI
  • Oxford University Press
Topic
  • Computer Science  (1,179)
  • 1
    Publication Date: 2020-07-06
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
  • 2
    Publication Date: 2020-07-07
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
  • 3
    Publication Date: 2020-07-02
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
  • 4
    Publication Date: 2020-07-02
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
  • 5
    Publication Date: 2020-07-03
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
  • 6
    Publication Date: 2007-01-26
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
  • 7
    Publication Date: 2015-06-07
Description: Community search is the problem of finding a good community for a given set of query vertices. One of the most studied formulations of community search asks for a connected subgraph that contains all query vertices and maximizes the minimum degree. All existing approaches to min-degree-based community search suffer from limitations concerning efficiency, as they need to visit (a large part of) the whole input graph, as well as accuracy, as they output communities that are quite large and not very cohesive. Moreover, some existing methods lack generality: they handle only single-vertex queries, find communities that are not optimal in terms of minimum degree, and/or require input parameters. In this work we advance the state of the art on community search by proposing a novel method that overcomes all these limitations: it is in general more efficient and effective (by one to two orders of magnitude on average), it can handle multiple query vertices, it yields optimal communities, and it is parameter-free. These properties are confirmed by an extensive experimental analysis performed on various real-world graphs.
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
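To make the min-degree objective above concrete, here is a hedged sketch of the classic greedy peeling baseline that this line of work builds on, not the paper's own algorithm; networkx and the karate-club toy graph are illustrative choices.

```python
# Hedged sketch: greedy peeling for min-degree community search (the classic
# baseline, NOT the paper's method). Repeatedly delete a minimum-degree
# vertex and keep the best intermediate subgraph that still connects all
# query vertices.
import networkx as nx

def greedy_min_degree_community(G, query):
    H = G.copy()
    best, best_k = None, -1
    while True:
        comp = nx.node_connected_component(H, next(iter(query)))
        if not set(query) <= comp:          # queries got disconnected: stop
            break
        sub = nx.Graph(H.subgraph(comp))
        k = min(dict(sub.degree()).values())
        if k > best_k:                      # best min-degree snapshot so far
            best, best_k = sub, k
        v = min(sub.nodes, key=sub.degree)  # peel a minimum-degree vertex
        if v in query:                      # next peel would hit a query
            break
        H = nx.Graph(sub)
        H.remove_node(v)
    return best, best_k

community, k = greedy_min_degree_community(nx.karate_club_graph(), {0, 33})
print(sorted(community.nodes), "min degree:", k)
```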
  • 8
    Publication Date: 2015-09-17
Description: This paper studies the problem of identifying a single contagion source when partial timestamps of a contagion process are available. We formulate the source localization problem as a ranking problem on graphs, where infected nodes are ranked according to their likelihood of being the source. Two ranking algorithms, cost-based ranking and tree-based ranking, are proposed in this paper. Experimental evaluations with synthetic and real-world data show that our algorithms significantly improve the ranking accuracy compared with four existing algorithms.
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
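The abstract does not spell out its cost function, so as a stand-in the sketch below ranks every candidate source by how well its hop distances to the observed nodes agree with the partial timestamps after centering both (assuming roughly unit propagation delay). The residual cost is an illustrative assumption, not the paper's cost-based ranking.

```python
# Hedged sketch: rank candidate sources against partial infection timestamps.
import networkx as nx

def rank_sources(G, observed):
    """observed: {node: infection timestamp} for a subset of nodes."""
    scores = {}
    for s in G.nodes:
        dist = nx.single_source_shortest_path_length(G, s)
        nodes = [v for v in observed if v in dist]
        if len(nodes) < 2:
            continue
        t = [observed[v] for v in nodes]
        d = [dist[v] for v in nodes]
        t_mean, d_mean = sum(t) / len(t), sum(d) / len(d)
        # a good source explains times as (roughly) distance plus a shift
        scores[s] = sum(((ti - t_mean) - (di - d_mean)) ** 2
                        for ti, di in zip(t, d))
    return sorted(scores, key=scores.get)     # lowest cost ranked first

G = nx.erdos_renyi_graph(60, 0.08, seed=1)
times = nx.single_source_shortest_path_length(G, 0)   # true source: node 0
partial = {v: times[v] for v in list(times)[:10]}     # partial observation
print("top-5 candidates:", rank_sources(G, partial)[:5])
```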
  • 9
    Publication Date: 2015-06-13
    Description: Event detection has been one of the most important research topics in social media analysis. Most of the traditional approaches detect events based on fixed temporal and spatial resolutions, while in reality events of different scales usually occur simultaneously, namely, they span different intervals in time and space. In this paper, we propose a novel approach towards multiscale event detection using social media data, which takes into account different temporal and spatial scales of events in the data. Specifically, we explore the properties of the wavelet transform, which is a well-developed multiscale transform in signal processing, to enable automatic handling of the interaction between temporal and spatial scales. We then propose a novel algorithm to compute a data similarity graph at appropriate scales and detect events of different scales simultaneously by a single graph-based clustering process. Furthermore, we present spatiotemporal statistical analysis of the noisy information present in the data stream, which allows us to define a novel term-filtering procedure for the proposed event detection algorithm and helps us study its behavior using simulated noisy data. Experimental results on both synthetically generated data and real world data collected from Twitter demonstrate the meaningfulness and effectiveness of the proposed approach. Our framework further extends to numerous application domains that involve multiscale and multiresolution data analysis.
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
  • 10
    Publication Date: 2016-07-14
Description: Label noise can be a major problem in classification tasks, since most machine learning algorithms rely on data labels in their inductive process. Thereupon, various techniques for label noise identification have been investigated in the literature. The bias of each technique defines how suitable it is for each dataset. Besides, while some techniques identify a large number of examples as noisy and have a high false positive rate, others are very restrictive and therefore not able to identify all noisy examples. This paper investigates how label noise detection can be improved by using an ensemble of noise filtering techniques. These filters, individual and ensemble, are experimentally compared. Another concern in this paper is the computational cost of ensembles, since, for a particular dataset, an individual technique can have the same predictive performance as an ensemble. In this case the individual technique should be preferred. To deal with this situation, this study also proposes the use of meta-learning to recommend, for a new dataset, the best filter. An extensive experimental evaluation of the use of individual filters, ensemble filters and meta-learning was performed using public datasets with imputed label noise. The results show that ensembles of noise filters can improve noise filtering performance and that a recommendation system based on meta-learning can successfully recommend the best filtering technique for new datasets. A case study using a real dataset from the ecological niche modeling domain is also presented and evaluated, with the results validated by an expert.
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
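The ensemble-filter idea lends itself to a compact illustration: several classifiers each flag training examples whose out-of-fold prediction disagrees with the given label, and a majority vote decides. The classifier choices, vote threshold, and iris toy data below are illustrative assumptions, not the paper's experimental setup.

```python
# Hedged sketch: majority-vote ensemble of classification-based noise filters.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
noisy = rng.choice(len(y), size=15, replace=False)   # inject 10% label noise
y_noisy = y.copy()
y_noisy[noisy] = (y_noisy[noisy] + 1) % 3

filters = [DecisionTreeClassifier(random_state=0),
           KNeighborsClassifier(3), GaussianNB()]
# each filter flags an example if its out-of-fold prediction disagrees
# with the (possibly corrupted) training label
votes = sum((cross_val_predict(f, X, y_noisy, cv=5) != y_noisy).astype(int)
            for f in filters)
flagged = np.where(votes >= 2)[0]                    # majority vote
print(f"flagged {len(flagged)}, true noise recovered: "
      f"{len(set(flagged) & set(noisy))}/{len(noisy)}")
```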
  • 11
    Publication Date: 2016-08-05
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
  • 12
    Publication Date: 2016-08-05
Description: User tastes are constantly drifting over time as users are exposed to different types of products. The ability to model the tendency of both user preferences and product attractiveness is vital to the success of recommender systems (RSs). We propose a Bayesian Wishart matrix factorization method to model the temporal dynamics of variations among user preferences and item attractiveness from a novel algorithmic perspective. The proposed method is able to model and properly control diverse rating behaviors across time frames and related temporal effects within time frames in the tendency of user preferences and item attractiveness. We evaluate the proposed method on two synthetic and three real-world benchmark datasets for RSs. Experimental results demonstrate that our proposed method significantly outperforms a variety of state-of-the-art methods in RSs.
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
  • 13
    Publication Date: 2016-06-30
Description: Outlier detection techniques in spatial data should make it possible to identify two types of outliers: global and local ones. Local outliers typically have non-spatial attributes that strongly differ from those observed on their neighbors. Detecting local outliers requires the ability to work locally, on neighborhoods, in order to take into account the spatial dependence between the statistical units under consideration, even though the outlyingness is usually measured on the non-spatial variables. Many procedures have been outlined in the literature, but their number shrinks when one wants to deal with multivariate non-spatial attributes. In this paper, the focus is on the multivariate context. A review of existing procedures is given. A new approach, based on a two-step improvement of an existing one, is also designed and compared with the benchmarked methods by means of examples and simulations.
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
  • 14
    Publication Date: 2016-05-27
Description: In many data analysis tasks it is important to understand the relationships between different datasets. Several methods exist for this task but many of them are limited to two datasets and linear relationships. In this paper, we propose a new efficient algorithm, termed cocoreg, for the extraction of variation common to all datasets in a given collection of arbitrary size. cocoreg extends redundancy analysis to more than two datasets, utilizing chains of regression functions to extract the shared variation in the original data space. The algorithm can be used with any linear or non-linear regression function, which makes it robust, straightforward, fast, and easy to implement and use. We empirically demonstrate the efficacy of shared variation extraction using the cocoreg algorithm on five artificial and three real datasets.
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
  • 15
    Publication Date: 2016-05-12
Description: Similarity measures are central to many machine learning algorithms. There are many different similarity measures, each catering for different applications and data requirements. Most similarity measures used with numerical data assume that the attributes are interval scale. In the interval scale, it is assumed that a unit difference has the same meaning irrespective of the magnitudes of the values separated. When this assumption is violated, accuracy may be reduced. Our experiments show that removing the interval scale assumption by transforming data to ranks can improve the accuracy of distance-based similarity measures on some tasks. However, the rank transform has high time and storage overheads. In this paper, we introduce an efficient similarity measure which does not consider the magnitudes of inter-instance distances. We compare the new similarity measure with popular similarity measures in two applications: DBSCAN clustering and content-based multimedia information retrieval with real-world datasets and different transform functions. The results show that the proposed similarity measure provides good performance on a range of tasks and is invariant to violations of the interval scale assumption.
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
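The rank-transform baseline that the abstract contrasts against can be sketched in a few lines; this illustrates the interval-scale point only, not the paper's proposed measure.

```python
# Hedged sketch: Euclidean distance on rank-transformed attributes, which
# removes the interval-scale assumption at the cost of the rank transform.
import numpy as np
from scipy.stats import rankdata

def rank_euclidean(X):
    R = np.column_stack([rankdata(X[:, j]) for j in range(X.shape[1])])
    diff = R[:, None, :] - R[None, :, :]
    return np.sqrt((diff ** 2).sum(-1))   # pairwise distances on ranks

X = np.array([[1.0, 10.0], [2.0, 1000.0], [3.0, 1100.0]])
print(rank_euclidean(X))   # no longer dominated by the large-scale column
```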
  • 16
    Publication Date: 2016-04-30
Description: The goal of early classification of time series is to predict the class value of a sequence early in time, when its full length is not yet available. This problem arises naturally in many contexts where the data is collected over time and the label predictions have to be made as soon as possible. In this work, a method based on probabilistic classifiers is proposed for the problem of early classification of time series. An important feature of this method is that, in its learning stage, it discovers the timestamps in which the prediction accuracy for each class begins to surpass a pre-defined threshold. This threshold is defined as a percentage of the accuracy that would be obtained if the full series were available, and it is set by the user. The class predictions for new time series will only be made in these timestamps or later. Furthermore, when applying the model to a new time series, a class label will only be provided if the difference between the two largest predicted class probabilities is higher than or equal to a certain threshold, which is calculated in the training step. The proposal is validated on 45 benchmark time series databases and compared with several state-of-the-art methods, and obtains superior results in both earliness and accuracy. In addition, we show the practical applicability of our method for a real-world problem: the detection and identification of bird calls in a biodiversity survey scenario.
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
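A hedged sketch of the thresholding idea: find the earliest prefix length whose cross-validated accuracy reaches a user-chosen fraction of the full-length accuracy. The paper discovers such timestamps per class and adds a probability-gap rule at prediction time; both refinements are simplified away here, and the synthetic data is an illustrative placeholder.

```python
# Hedged sketch: earliest prefix length reaching theta * full-length accuracy.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n, L = 200, 50
y = rng.integers(0, 2, n)
X = rng.normal(size=(n, L))
X[y == 1, 30:] += 1.0            # classes only become separable late

theta = 0.9                      # user-defined accuracy percentage
clf = LogisticRegression(max_iter=1000)
full_acc = cross_val_score(clf, X, y, cv=5).mean()
earliest = next(t for t in range(5, L + 1)
                if cross_val_score(clf, X[:, :t], y, cv=5).mean()
                >= theta * full_acc)
print(f"full accuracy {full_acc:.2f}; prefixes of length {earliest} suffice")
```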
  • 17
    Publication Date: 2016-07-13
Description: Shapelets are discriminative subsequences of time series, usually embedded in shapelet-based decision trees. The enumeration of time series shapelets is, however, computationally costly, which, in addition to the inherent difficulty decision tree learning has in effectively handling high-dimensional data, severely limits the applicability of shapelet-based decision tree learning from large (multivariate) time series databases. This paper introduces a novel tree-based ensemble method for univariate and multivariate time series classification using shapelets, called the generalized random shapelet forest algorithm. The algorithm generates a set of shapelet-based decision trees, where both the choice of instances used for building a tree and the choice of shapelets are randomized. For univariate time series, it is demonstrated through an extensive empirical investigation that the proposed algorithm yields predictive performance comparable to the current state-of-the-art and significantly outperforms several alternative algorithms, while being at least an order of magnitude faster. Similarly, for multivariate time series, it is shown that the algorithm is significantly less computationally costly and more accurate than the current state-of-the-art.
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
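A hedged sketch of the randomized-shapelet idea: sample random subsequences, represent each series by its minimum sliding-window distance to each shapelet, and train an off-the-shelf tree ensemble on those features. Data and hyperparameters are synthetic placeholders; the paper's algorithm randomizes inside its own tree construction rather than using a separate feature step.

```python
# Hedged sketch: random shapelets as min-distance features for a forest.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def min_dist(series, shapelet):
    L = len(shapelet)
    return min(np.linalg.norm(series[i:i + L] - shapelet)
               for i in range(len(series) - L + 1))

rng = np.random.default_rng(0)
n, length = 100, 60
y = rng.integers(0, 2, n)
X = rng.normal(size=(n, length))
X[y == 1, 20:30] += 2.0                   # class-1 series carry a bump

# randomized shapelet sampling: random series, position, and length
shapelets = []
for _ in range(30):
    i, L = rng.integers(n), rng.integers(5, 20)
    s = rng.integers(0, length - L)
    shapelets.append(X[i, s:s + L])

F = np.array([[min_dist(x, sh) for sh in shapelets] for x in X])
clf = RandomForestClassifier(random_state=0).fit(F, y)
print("train accuracy:", clf.score(F, y))
```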
  • 18
    Publication Date: 2013-09-14
Description: We are concerned with the issue of tracking changes of variable dependencies from multivariate time series. Conventionally, this issue has been addressed in the batch scenario, where the whole data set is given at once and the change detection must be done in a retrospective way. This paper addresses the issue in a sequential scenario, where multivariate data are sequentially input and the detection must be done in a sequential fashion. We propose a new method for sequential tracking of variable dependencies, in which we employ a Bayesian network as a representation of variable dependencies. The key ideas of our method are: (1) we extend the theory of dynamic model selection, which has been developed in the batch-learning scenario, to the sequential setting and apply it to our issue; (2) we conduct the change detection sequentially using dynamic programming over a window, where we employ Hoeffding's bound to automatically determine the window size. We empirically demonstrate that our proposed method is able to perform change detection more efficiently than a conventional batch method. Further, we give a new framework for an application of variable dependency change detection, which we call Ad Impact Relation analysis (AIR): we detect the time point when a commercial advertisement has had an impact on the market and effectively visualize the impact through network changes. We employ real data sets to demonstrate the validity of AIR.
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
  • 19
    Publication Date: 2013-04-10
Description: We introduce the dependence distance, a new notion of the intrinsic distance between points, derived as a pointwise extension of statistical dependence measures between variables. We then introduce a dimension reduction procedure for preserving this distance, which we call the dependence map. We explore its theoretical justification, connection to other methods, and empirical behavior on real data sets.
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
  • 20
    Publication Date: 2013-04-10
Description: Non-negative matrix factorization (NMF) is a method to obtain a representation of data using non-negativity constraints. A popular approach is alternating non-negative least squares (ANLS). As is well known, if the sequence generated by ANLS has at least one limit point, then the limit point is a stationary point of NMF. However, no evidence has shown that the sequence generated by ANLS has at least one limit point. In order to overcome this shortcoming, we propose a modified strategy for ANLS in this paper. The modified strategy ensures that the sequence generated by ANLS has at least one limit point, and this limit point is a stationary point of NMF. The results of numerical experiments are reported to show the effectiveness of the proposed algorithm.
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
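Plain ANLS, the starting point of this abstract, is easy to state: alternate non-negative least-squares solves for the two factors. The sketch below uses scipy's nnls and does not include the paper's modification that guarantees a limit point; dimensions and iteration counts are illustrative.

```python
# Hedged sketch: plain alternating non-negative least squares (ANLS) for NMF.
import numpy as np
from scipy.optimize import nnls

def anls_nmf(V, r, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, r))
    H = rng.random((r, n))
    for _ in range(iters):
        # fix W, solve one non-negative least-squares problem per column of V
        H = np.column_stack([nnls(W, V[:, j])[0] for j in range(n)])
        # fix H, solve per row of V (transpose trick)
        W = np.column_stack([nnls(H.T, V[i, :])[0] for i in range(m)]).T
    return W, H

V = np.abs(np.random.default_rng(1).random((20, 15)))
W, H = anls_nmf(V, r=4)
print("reconstruction error:", np.linalg.norm(V - W @ H))
```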
  • 21
    Publication Date: 2013-04-10
    Description: We introduce a framework for the optimal extraction of flat clusterings from local cuts through cluster hierarchies. The extraction of a flat clustering from a cluster tree is formulated as an optimization problem and a linear complexity algorithm is presented that provides the globally optimal solution to this problem in semi-supervised as well as in unsupervised scenarios. A collection of experiments is presented involving clustering hierarchies of different natures, a variety of real data sets, and comparisons with specialized methods from the literature.
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
  • 22
    Publication Date: 2013-04-10
Description: A considerable amount of work has been done in data clustering research during the last four decades, and a myriad of methods has been proposed focusing on different data types, proximity functions, cluster representation models, and cluster presentation. However, clustering remains a challenging problem due to its ill-posed nature: it is well known that off-the-shelf clustering methods may discover different patterns in a given set of data, mainly because every clustering algorithm has its own bias resulting from the optimization of different criteria. This bias becomes even more important as, in almost all real-world applications, data is inherently high-dimensional and multiple clustering solutions might be available for the same data collection. In this respect, the problems of projective clustering and clustering ensembles have been recently defined to deal with the high dimensionality and multiple clusterings issues, respectively. Nevertheless, although these two issues can often be encountered together, existing approaches to the two problems have been developed independently of each other. In our earlier work (Gullo et al., Proceedings of the International Conference on Data Mining (ICDM), 2009a) we introduced a novel clustering problem, called projective clustering ensembles (PCE): given a set (ensemble) of projective clustering solutions, the goal is to derive a projective consensus clustering, i.e., a projective clustering that complies with the information on object-to-cluster and feature-to-cluster assignments given in the ensemble. In this paper, we enhance our previous study and provide theoretical and experimental insights into the PCE problem. PCE is formalized as an optimization problem and is designed to satisfy desirable requirements on independence from the specific clustering ensemble algorithm, ability to handle hard as well as soft data clustering, and different feature weightings. Two PCE formulations are defined: a two-objective optimization problem, in which the two objective functions respectively account for the object- and feature-based representations of the solutions in the ensemble, and a single-objective optimization problem, in which the object- and feature-based representations are embedded into a single function to measure the distance error between the projective consensus clustering and the projective ensemble. The significance of the proposed methods for solving the PCE problem has been shown through an extensive experimental evaluation based on several datasets and comparatively with projective clustering and clustering ensemble baselines.
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
  • 23
    Publication Date: 2013-04-10
    Description: The problem of estimating the class distribution (or prevalence) for a new unlabelled dataset (from a possibly different distribution) is a very common problem which has been addressed in one way or another in the past decades. This problem has been recently reconsidered as a new task in data mining, renamed quantification when the estimation is performed as an aggregation (and possible adjustment) of a single-instance supervised model (e.g., a classifier). However, the study of quantification has been limited to classification, while it is clear that this problem also appears, perhaps even more frequently, with other predictive problems, such as regression. In this case, the goal is to determine a distribution or an aggregated indicator of the output variable for a new unlabelled dataset. In this paper, we introduce a comprehensive new taxonomy of quantification tasks, distinguishing between the estimation of the whole distribution and the estimation of some indicators (summary statistics), for both classification and regression. This distinction is especially useful for regression, since predictions are numerical values that can be aggregated in many different ways, as in multi-dimensional hierarchical data warehouses. We focus on aggregative quantification for regression and see that the approaches borrowed from classification do not work. We present several techniques based on segmentation which are able to produce accurate estimations of the expected value and the distribution of the output variable. We show experimentally that these methods especially excel for the relevant scenarios where training and test distributions dramatically differ.
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
  • 24
    Publication Date: 2015-05-07
Description: Context-aware recommendation algorithms focus on refining recommendations by considering additional information available to the system. This topic has gained a lot of attention recently. Among others, several factorization methods were proposed to solve the problem, although most of them assume explicit feedback, which strongly limits their real-world applicability. While these algorithms apply various loss functions and optimization strategies, preference modeling under context is less explored due to the lack of tools allowing for easy experimentation with various models. As context dimensions are introduced beyond users and items, the space of possible preference models and the importance of proper modeling largely increase. In this paper we propose a general factorization framework (GFF), a single flexible algorithm that takes the preference model as an input and computes latent feature matrices for the input dimensions. GFF allows us to easily experiment with various linear models on any context-aware recommendation task, be it explicit or implicit feedback based. Its scaling properties make it usable under real-life circumstances as well. We demonstrate the framework's potential by exploring various preference models on a 4-dimensional context-aware problem, with contexts that are available for almost any real-life dataset. We show in our experiments—performed on five real-life, implicit-feedback datasets—that proper preference modeling significantly increases recommendation accuracy, and previously unused models outperform the traditional ones. Novel models in GFF also outperform state-of-the-art factorization algorithms. We also extend the method to be fully compliant with the Multidimensional Dataspace Model, one of the most extensive data models of context-enriched data. Extended GFF allows the seamless incorporation of information into the factorization framework beyond context, like item metadata, social networks, session information, etc. Preliminary experiments show great potential of this capability.
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
  • 25
    Publication Date: 2015-05-07
    Description: Dynamic time warping (DTW) has proven itself to be an exceptionally strong distance measure for time series. DTW in combination with one-nearest neighbor, one of the simplest machine learning methods, has been difficult to convincingly outperform on the time series classification task. In this paper, we present a simple technique for time series classification that exploits DTW’s strength on this task. But instead of directly using DTW as a distance measure to find nearest neighbors, the technique uses DTW to create new features which are then given to a standard machine learning method. We experimentally show that our technique improves over one-nearest neighbor DTW on 31 out of 47 UCR time series benchmark datasets. In addition, this method can be easily extended to be used in combination with other methods. In particular, we show that when combined with the symbolic aggregate approximation (SAX) method, it improves over it on 37 out of 47 UCR datasets. Thus the proposed method also provides a mechanism to combine distance-based methods like DTW with feature-based methods like SAX. We also show that combining the proposed classifiers through ensembles further improves the performance on time series classification.
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
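A hedged sketch of the DTW-features idea from the abstract above: instead of running 1-NN on DTW distances directly, use the DTW distances to a set of reference series as a feature vector for a standard classifier. Using the whole training set as references and logistic regression as the learner are illustrative choices, not the paper's exact setup.

```python
# Hedged sketch: DTW distances to reference series as classifier features.
import numpy as np
from sklearn.linear_model import LogisticRegression

def dtw(a, b):
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return np.sqrt(D[n, m])

rng = np.random.default_rng(0)
t = np.linspace(0, 2 * np.pi, 50)
y = rng.integers(0, 2, 40)
X = np.array([np.sin(t + rng.uniform(-0.5, 0.5)) + rng.normal(0, 0.2, 50)
              if c == 1 else
              np.cos(t + rng.uniform(-0.5, 0.5)) + rng.normal(0, 0.2, 50)
              for c in y])
F = np.array([[dtw(x, ref) for ref in X] for x in X])   # DTW feature matrix
clf = LogisticRegression(max_iter=1000).fit(F, y)
print("train accuracy:", clf.score(F, y))
```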
  • 26
    Publication Date: 2015-05-07
    Description: In multi-instance learning, instances are organized into bags, and a bag is labeled positive if it contains at least one positive instance, and negative otherwise; the labels of the individual instances are not given. The task is to learn a classifier from this limited information. While the original task description involved learning an instance classifier, in the literature the task is often interpreted as learning a bag classifier. Depending on which of these two interpretations is used, it is more natural to evaluate classifiers according to how well they predict, respectively, instance labels or bag labels. In the literature, however, the two interpretations are often mixed, or the intended interpretation is left implicit. In this paper, we investigate the difference between bag-level and instance-level accuracy, both analytically and empirically. We show that there is a substantial difference between these two, and better performance on one does not necessarily imply better performance on the other. It is therefore useful to clearly distinguish the two settings, and always use the evaluation criterion most relevant for the task at hand. We show experimentally that the same conclusions hold for area under the ROC curve.
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
  • 27
    Publication Date: 2015-05-07
Description: Over the past decade, time series clustering has become an increasingly important research topic in the data mining community. Most existing methods for time series clustering rely on distances calculated from the entire raw data using the Euclidean distance or Dynamic Time Warping distance as the distance measure. However, the presence of significant noise, dropouts, or extraneous data can greatly limit the accuracy of clustering in this domain. Moreover, for most real-world problems, we cannot expect objects from the same class to be equal in length. As a consequence, most work on time series clustering only considers the clustering of individual time series “behaviors,” e.g., individual heart beats or individual gait cycles, and contrives the time series in some way to make them all equal in length. However, automatically formatting the data in such a way is often a harder problem than the clustering itself. In this work, we show that by using only some local patterns and deliberately ignoring the rest of the data, we can mitigate the above problems and cluster time series of different lengths, e.g., cluster one heartbeat with multiple heartbeats. To achieve this, we exploit and extend a recently introduced concept in time series data mining called shapelets. Unlike existing work, our work demonstrates the unintuitive fact that shapelets can be learned from unlabeled time series. We show, with extensive empirical evaluation in diverse domains, that our method is more accurate than existing methods. Moreover, in addition to accurate clustering results, we show that our work also has the potential to give insight into the domains to which it is applied. While a brute-force algorithm to discover shapelets in an unsupervised way could be untenably slow, we introduce two novel optimization procedures to significantly speed up the unsupervised-shapelet discovery process and allow it to be cast as an anytime algorithm.
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
  • 28
    Publication Date: 2015-05-15
Description: One of the biggest setbacks in traditional frequent pattern mining is that overwhelmingly many of the discovered patterns are redundant. A prototypical example of such redundancy is a freerider pattern, where the pattern contains a true pattern and some additional noise events. A technique for filtering freerider patterns that has proved to be efficient in ranking itemsets is to use a partition model, where a pattern is divided into two subpatterns and the observed support is compared to the expected support under the assumption that these two subpatterns occur independently. In this paper we develop a partition model for episodes, patterns discovered from sequential data. An episode is essentially a set of events, with possible restrictions on the order of events. Unlike with itemset mining, computing the expected support of an episode requires surprisingly sophisticated methods. In order to construct the model, we partition the episode into two subepisodes. We then model how likely the events in each subepisode occur close to each other. If this probability is high—which is often the case if the subepisode has a high support—then we can expect that when one event from a subepisode occurs, the remaining events also occur close by. This approach increases the expected support of the episode, and if this increase explains the observed support, then we can deem the episode uninteresting. We demonstrate in our experiments that using the partition model can effectively and efficiently reduce the redundancy in episodes.
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
  • 29
    Publication Date: 2014-11-05
Description: A variety of applications, such as information extraction, intrusion detection and protein fold recognition, can be expressed as sequences of discrete events or elements (rather than unordered sets of features); that is, there is an order dependence among the elements composing each data instance. These applications may be modeled as classification problems, and in this case the classifier should exploit sequential interactions among the elements, so that the ordering relationship among them is properly captured. Dominant approaches to this problem include: (i) learning Hidden Markov Models, (ii) exploiting frequent sequences extracted from the data and (iii) computing string kernels. Such approaches, however, are computationally hard and vulnerable to noise, especially if the data shows long-range dependencies (i.e., long subsequences are necessary in order to model the data). In this paper we provide simple algorithms that build highly effective sequential classifiers. Our algorithms are based on enumerating approximately contiguous subsequences from the training set on a demand-driven basis, exploiting a lightweight and flexible subsequence matching function and an innovative subsequence enumeration strategy called pattern silhouettes, making our learning algorithms fast and the corresponding classifiers robust to noisy data. Our empirical results on a variety of datasets indicate that the best trade-off between accuracy and learning time is usually obtained by limiting the length of the subsequences by a factor of \(\log n\), which leads to an \(O(n \log n)\) learning cost (where \(n\) is the length of the sequence being classified). Finally, we show that, in most of the cases, our classifiers are faster than existing solutions (sometimes by orders of magnitude), while also providing significant accuracy improvements in most of the evaluated cases.
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
  • 30
    Publication Date: 2016-03-23
    Description: In adversarial classification, the interaction between classifiers and adversaries can be modeled as a game between two players. It is natural to model this interaction as a dynamic game of incomplete information, since the classifier does not know the exact intentions of the different types of adversaries (senders). For these games, equilibrium strategies can be approximated and used as input for classification models. In this paper we show how to model such interactions between players, as well as give directions on how to approximate their mixed strategies. We propose perceptron-like machine learning approximations as well as novel Adversary-Aware Online Support Vector Machines. Results in a real-world adversarial environment show that our approach is competitive with benchmark online learning algorithms, and provides important insights into the complex relations among players.
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
  • 31
    Publication Date: 2015-12-28
Description: We propose ClusPath, a novel algorithm for detecting general evolution tendencies in a population of entities. We show how abstract notions, such as the Swedish socio-economic model (in a political dataset) or companies' fiscal optimization (in an economic dataset), can be inferred from low-level descriptive features. Such high-level regularities in the evolution of entities are detected by combining spatial and temporal features into a spatio-temporal dissimilarity measure and using semi-supervised clustering techniques. The relations between the evolution phases are modeled using a graph structure, inferred simultaneously with the partition, by using a “slow changing world” assumption. The idea is to ensure a smooth passage for entities along their evolution paths, which captures the long-term trends in the dataset. Additionally, we also provide a method, based on an evolutionary algorithm, to tune the parameters of ClusPath to new, unseen datasets. This method assesses the fitness of a solution using four opposing quality measures and proposes a balanced compromise.
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
  • 32
    Publication Date: 2016-01-13
Description: In association rule mining, the trade-off between avoiding harmful spurious rules and preserving authentic ones is an ever critical barrier to obtaining reliable and useful results. The statistically sound technique for evaluating the statistical significance of association rules is superior in preventing spurious rules, yet it can also cause a severe loss of true rules in the presence of data error. This study presents a new and improved method for statistical testing of association rules with uncertain erroneous data. An original mathematical model was established to describe data error propagation through the computational procedures of the statistical test. Based on the error model, a scheme combining analytic and simulative processes was designed to correct the statistical test for distortions caused by data error. Experiments on both synthetic and real-world data show that the method significantly recovers the loss in true rules (reduces type-2 error) that the original statistically sound method incurs due to data error. Meanwhile, the new method maintains effective control over the familywise error rate, which is the distinctive advantage of the original statistically sound technique. Furthermore, the method is robust against inaccurate data error probability information and situations not fulfilling the commonly accepted assumption of independent error probabilities of different data items. The method is particularly effective for rules that are most practically meaningful yet sensitive to data error. The method proves promising in enhancing the value of association rule mining results and helping users make correct decisions.
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
  • 33
    Publication Date: 2016-03-30
Description: The outlying property detection problem (OPDP) is the problem of discovering the properties distinguishing a given object, known in advance to be an outlier in a database, from the other database objects. This problem has been recently analyzed focusing on categorical attributes only. However, numerical attributes are very relevant and widely used in databases. Therefore, in this paper, we analyze the OPDP in a context where numerical attributes are also taken into account, which represents a relevant case left open in the literature. As major contributions, we present an efficient parameter-free algorithm to compute the measure of object exceptionality we introduce, and propose a unified framework for mining exceptional properties in the presence of both categorical and numerical attributes.
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
  • 34
    Publication Date: 2019
Description: Knowledge discovery and information extraction of large and complex datasets have attracted great attention in wide-ranging areas from statistics and biology to medicine. Tools from machine learning, data mining, and neurocomputing have been extensively explored and utilized to accomplish such compelling data analytics tasks. However, for time-series data presenting active dynamic characteristics, many of the state-of-the-art techniques may not perform well in capturing the inherited temporal structures in these data. In this paper, integrating the Koopman operator and linear dynamical systems theory with support vector machines, we develop a novel dynamic data mining framework to construct low-dimensional linear models that approximate the nonlinear flow of high-dimensional time-series data generated by unknown nonlinear dynamical systems. This framework then immediately enables pattern recognition, e.g., classification, of complex time-series data to distinguish their dynamic behaviors by using the trajectories generated by the reduced linear systems. Moreover, we demonstrate the applicability and efficiency of this framework through the problems of time-series classification in bioinformatics and healthcare, including cognitive classification and seizure detection with fMRI and EEG data, respectively. The developed Koopman dynamic learning framework then lays a solid foundation for effective dynamic data mining and promises a mathematically justified method for extracting the dynamics and significant temporal structures of nonlinear dynamical systems.
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
  • 35
    Publication Date: 2019
Description: Density estimation is a versatile technique underlying many data mining tasks and techniques, ranging from exploration and presentation of static data, to probabilistic classification, or identifying changes or irregularities in streaming data. With the pervasiveness of embedded systems and digitisation, this latter type of streaming and evolving data becomes more important. Nevertheless, research in density estimation has so far focused on stationary data, leaving the task of extrapolating and predicting density at time points outside a training window an open problem. For this task, temporal density extrapolation (TDX) is proposed. This novel method models and predicts gradual monotonous changes in a distribution. It is based on the expansion of basis functions, whose weights are modelled as functions of compositional data over time by using an isometric log-ratio transformation. Extrapolated density estimates are then obtained by extrapolating the weights to the requested time point, and querying the density from the basis functions with back-transformed weights. Our approach aims for broad applicability by neither being restricted to a specific parametric distribution, nor relying on cluster structure in the data. It requires only two additional extrapolation-specific parameters, for which reasonable defaults exist. Experimental evaluation on various data streams, synthetic as well as from the real-world domains of credit scoring and environmental health, shows that the model manages to capture monotonous drift patterns accurately and better than existing methods. Thereby, it requires no more than 1.5 times the run time of a corresponding static density estimation approach.
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
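The compositional mechanics behind TDX can be illustrated on a toy stream: treat histogram bin weights as a composition, map them to unconstrained coordinates with an isometric log-ratio (ilr) transform, fit a simple trend, extrapolate, and back-transform. The Helmert-style basis, histogram "basis functions", and per-coordinate linear trend below are illustrative simplifications, not the paper's model.

```python
# Hedged sketch: extrapolating compositional weights via the ilr transform.
import numpy as np

def ilr_basis(D):
    # Helmert-style orthonormal basis of the clr hyperplane (D x (D-1))
    V = np.zeros((D, D - 1))
    for j in range(1, D):
        V[:j, j - 1] = 1.0 / j
        V[j, j - 1] = -1.0
        V[:, j - 1] *= np.sqrt(j / (j + 1.0))
    return V

def ilr(w, V):        # composition -> unconstrained coordinates
    return np.log(w) @ V

def ilr_inv(z, V):    # back-transform, with closure to sum 1
    y = np.exp(V @ z)
    return y / y.sum()

rng = np.random.default_rng(0)
T, bins = 10, 4
# bin weights drifting over time (each row is a composition)
W = np.array([np.histogram(rng.normal(loc=0.15 * t, size=500),
                           bins=bins, range=(-3, 5))[0] + 1
              for t in range(T)], dtype=float)
W /= W.sum(1, keepdims=True)

V = ilr_basis(bins)
Z = np.array([ilr(w, V) for w in W])     # to unconstrained ilr space
t = np.arange(T)
coef = np.polyfit(t, Z, deg=1)           # linear trend per coordinate
z_future = coef[0] * (T + 2) + coef[1]   # extrapolate 3 steps ahead
print("predicted weights:", ilr_inv(z_future, V))
```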
  • 36
    Publication Date: 2019
Description: Increasingly, discrimination by algorithms is perceived as a societal and legal problem. As a response, a number of criteria for implementing algorithmic fairness in machine learning have been developed in the literature. This paper proposes the continuous fairness algorithm \((\mathrm{CFA}\theta)\), which enables a continuous interpolation between different fairness definitions. More specifically, we make three main contributions to the existing literature. First, our approach allows the decision maker to continuously vary between specific concepts of individual and group fairness. As a consequence, the algorithm enables the decision maker to adopt intermediate “worldviews” on the degree of discrimination encoded in algorithmic processes, adding nuance to the extreme cases of “we're all equal” and “what you see is what you get” proposed so far in the literature. Second, we use optimal transport theory, and specifically the concept of the barycenter, to maximize decision maker utility under the chosen fairness constraints. Third, the algorithm is able to handle cases of intersectionality, i.e., of multi-dimensional discrimination of certain groups on grounds of several criteria. We discuss three main examples (credit applications; college admissions; insurance contracts) and map out the legal and policy implications of our approach. The explicit formalization of the trade-off between individual and group fairness allows this post-processing approach to be tailored to different situational contexts in which one or the other fairness criterion may take precedence. Finally, we evaluate our model experimentally.
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
  • 37
    Publication Date: 2019
Description: We consider the class of linear predictors over all logical conjunctions of binary attributes, which we refer to as the class of combinatorial binary models (CBMs) in this paper. CBMs are of high knowledge interpretability, but naïve learning of them from labeled data requires exponentially high computational cost with respect to the length of the conjunctions. On the other hand, in the case of large-scale datasets, long conjunctions are effective for learning predictors. To overcome this computational difficulty, we propose an algorithm, GRAfting for Binary datasets (GRAB), which efficiently learns CBMs within the \(L_1\)-regularized loss minimization framework. The key idea of GRAB is to adopt weighted frequent itemset mining for the most time-consuming step in the grafting algorithm, which is designed to solve large-scale \(L_1\)-RERM problems by an iterative approach. Furthermore, we experimentally showed that linear predictors of CBMs are effective in terms of prediction accuracy and knowledge discovery.
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
  • 38
    Publication Date: 2019
Description: Conventional general-purpose distance-based similarity measures, such as the Minkowski distance (also known as the \(\ell_p\)-norm with \(p>0\)), are data-independent and sensitive to units or scales of measurement. There are existing general-purpose data-dependent measures, such as rank difference, Lin's probabilistic measure and \(m_p\)-dissimilarity (\(p>0\)), which are not sensitive to units or scales of measurement. Although they have been shown to be more effective than the traditional distance measures, their characteristics and relative performances have not been investigated. In this paper, we study the characteristics and relationships of different general-purpose data-dependent measures. We generalise \(m_p\)-dissimilarity to \(p\ge 0\) by introducing \(m_0\)-dissimilarity and show that it is a generic data-dependent measure with data-dependent self-similarity, of which rank difference and Lin's measure are special cases with data-independent self-similarity. We evaluate the effectiveness of a wide range of general-purpose data-dependent and data-independent measures in content-based information retrieval and kNN classification tasks. Our findings show that the fully data-dependent measure of \(m_p\)-dissimilarity is a more effective alternative to other data-dependent and commonly-used distance-based similarity measures, as its task-specific performance is more consistent across a wide range of datasets.
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
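A hedged sketch of the mass-based intuition behind \(m_p\)-dissimilarity: in each dimension, two values are less similar the more data falls between them, and the per-dimension masses are aggregated with a p-power mean. This follows the general definition in the literature as I understand it; implementation details (e.g., binning) differ in the paper.

```python
# Hedged sketch: data-mass (m_p-style) dissimilarity between two points.
import numpy as np

def mp_dissimilarity(x, y, data, p=1.0):
    lo, hi = np.minimum(x, y), np.maximum(x, y)
    # per-dimension fraction of training points lying between x_i and y_i
    mass = ((data >= lo) & (data <= hi)).mean(axis=0)
    return (np.mean(mass ** p)) ** (1.0 / p)

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 2))
a, b = np.array([0.0, 0.0]), np.array([0.1, 0.1])   # pair in a dense region
c = np.array([2.5, 2.5])                             # pair in a sparse region
# unlike Euclidean distance, the result depends on where the pair falls in
# the data distribution: the sparse-region pair is judged more similar
print(mp_dissimilarity(a, b, data), mp_dissimilarity(c, c + 0.1, data))
```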
  • 39
    Publication Date: 2019
Description: Networked data involve complex information from multifaceted channels, including topology structures, node content, and/or node labels, etc., where structure and content are often correlated but are not always consistent. A typical scenario is the citation relationships in scholarly publications, where a paper is cited by others not because they have the same content, but because they share one or multiple subject matters. To date, while many network embedding methods exist to take the node content into consideration, they all consider node content as a simple flat word/attribute set, and nodes sharing connections are assumed to have dependency with respect to all words or attributes. In this paper, we argue that considering topic-level semantic interactions between nodes is crucial to learning discriminative node embedding vectors. In order to model pairwise topic relevance between linked text nodes, we propose topical network embedding, where interactions between nodes are built on the shared latent topics. Accordingly, we propose a unified optimization framework to simultaneously learn topic and node representations from the network text contents and structures, respectively. Meanwhile, the structure modeling takes the learned topic representations as conditional context under the principle that two nodes can infer each other contingent on the shared latent topics. Experiments on three real-world datasets demonstrate that our approach can learn significantly better network representations, e.g., a 4.1% improvement over the state-of-the-art methods in terms of Micro-F1 on the Cora dataset. (The source code of the proposed method is available through the GitHub link: https://github.com/codeshareabc/TopicalNE.)
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
  • 40
    Publication Date: 2019
Description: Machine learning algorithms can be applied to several practical problems, such as spam, fraud and intrusion detection, and customer preferences, among others. In most of these problems, data come in streams, which means that the data distribution may change over time, leading to concept drift. The literature is abundant in supervised methods based on error monitoring for explicit drift detection. However, these methods may become infeasible in some real-world applications, where no fully labeled data are available, and they may depend on a significant decrease in accuracy to be able to detect drifts. There are also methods based on blind approaches, where the decision model is updated constantly. However, this may lead to unnecessary system updates. In order to overcome these drawbacks, we propose in this paper a semi-supervised drift detector that uses an ensemble of classifiers based on self-training online learning and dynamic classifier selection. For each unknown sample, a dynamic selection strategy is used to choose, among the ensemble's component members, the classifier most likely to be the correct one for classifying it. The prediction assigned by the chosen classifier is used to compute an estimate of the error produced by the ensemble members. The proposed method monitors such a pseudo-error in order to detect drifts and to update the decision model only after drift detection. The achievement of this method is relevant in that it allows drift detection and reaction and is applicable in several practical problems. The experiments conducted indicate that the proposed method attains high performance and detection rates, while reducing the amount of labeled data used to detect drift.
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
  • 41
    Publication Date: 2019
Description: Visual exploration of high-dimensional real-valued datasets is a fundamental task in exploratory data analysis (EDA). Existing projection methods for data visualization use predefined criteria to choose the representation of data. There is a lack of methods that (i) use information on what the user has learned from the data and (ii) show patterns that she does not know yet. We construct a theoretical model where identified patterns can be input as knowledge to the system. The knowledge syntax here is intuitive, such as “this set of points forms a cluster”, and requires no knowledge of maths. This background knowledge is used to find a maximum entropy distribution of the data, after which the user is provided with data projections for which the data and the maximum entropy distribution differ the most, hence showing the user aspects of data that are maximally informative given the background knowledge. We study the computational performance of our model and present use cases on synthetic and real data. We find that the model allows the user to learn information efficiently from various data sources and works sufficiently fast in practice. In addition, we provide an open source EDA demonstrator system implementing our model with tailored interactive visualizations. We conclude that the information theoretic approach to EDA, where patterns observed by a user are formalized as constraints, provides a principled, intuitive, and efficient basis for constructing an EDA system.
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 42
    Publication Date: 2019
Description: Season length estimation is the task of identifying the number of observations in the dominant repeating pattern of seasonal time series data. As such, it is a common pre-processing task crucial for various downstream applications. Inferring season length from a real-world time series is often challenging due to phenomena such as slightly varying period lengths and noise. These issues may, in turn, lead practitioners to dedicate considerable effort to preprocessing time series data, since existing approaches either require dedicated parameter tuning or their performance is heavily domain-dependent. To address these challenges, we propose SAZED: spectral and average autocorrelation zero distance density. SAZED is a versatile ensemble of multiple, specialized time series season length estimation approaches. The combination of various base methods selected with respect to domain-agnostic criteria and a novel seasonality isolation technique allows broad applicability to real-world time series of varied properties. Further, SAZED is theoretically grounded and parameter-free, with a computational complexity of \(\mathcal{O}(n \log n)\), which makes it applicable in practice. In our experiments, SAZED was statistically significantly better than every other method on at least one dataset. The datasets used for the evaluation consist of time series from various real-world domains, sterile synthetic test cases, and synthetic data designed to be seasonal and yet have no finite statistical moments of any order. (A sketch of two SAZED-style base estimators follows this record.)
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
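    The two main ingredients named in the SAZED acronym can be sketched as follows, assuming a detrended, evenly sampled series; this is not the published ensemble, only its spectral and autocorrelation base ideas.

      import numpy as np

      def season_by_spectrum(x):
          """Dominant period from the peak of the periodogram."""
          x = np.asarray(x, float) - np.mean(x)
          spec = np.abs(np.fft.rfft(x)) ** 2
          freqs = np.fft.rfftfreq(len(x))
          peak = np.argmax(spec[1:]) + 1       # skip the zero-frequency bin
          return int(round(1.0 / freqs[peak]))

      def season_by_autocorrelation(x):
          """Lag of the highest autocorrelation peak after the initial decay."""
          x = np.asarray(x, float) - np.mean(x)
          acf = np.correlate(x, x, mode="full")[len(x) - 1:]
          acf /= acf[0]
          start = np.argmax(acf < 0)           # first zero crossing
          lag = start + np.argmax(acf[start:])
          return int(lag) if lag > 0 else None

      t = np.arange(1000)
      x = np.sin(2 * np.pi * t / 37) + 0.3 * np.random.default_rng(1).normal(size=1000)
      print(season_by_spectrum(x), season_by_autocorrelation(x))  # both near 37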
  • 43
    Publication Date: 2019
Description: The development of novel platforms and techniques for emerging “Big Data” applications requires the availability of real-life datasets for data-driven experiments, which are however not accessible in most cases for various reasons, e.g., confidentiality, privacy or simply insufficient availability. An interesting solution to ensure high-quality experimental findings is to synthesize datasets that reflect patterns of real ones using a two-step approach: first a real dataset X is analyzed to derive relevant patterns Z (latent variables) and, then, such patterns are used to reconstruct a new dataset \(X'\) that is like X but not exactly the same. The approach can be implemented using inverse mining techniques such as inverse frequent itemset mining (IFM), which consists of generating a transactional dataset satisfying given support constraints on the itemsets of an input set (typically the frequent ones). This paper introduces various extensions of IFM within a uniform framework, with the aim of generating artificial datasets that reflect more elaborate patterns (in particular, infrequency and duplicate constraints) of real ones. Furthermore, in order to further enlarge the application domain of IFM, an additional extension is introduced that considers more structured schemes for the datasets to be generated, as required in emerging big data applications, e.g., social network analytics.
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 44
    Publication Date: 2019
Description: In renewable energy forecasting, data are typically collected by geographically distributed sensor networks, which poses several issues. (i) Data represent physical properties that are subject to concept drift, i.e., their characteristics could change over time. To address the concept drift phenomenon, adaptive online learning methods should be considered. (ii) The error distribution is typically non-Gaussian, so traditional quality criteria used during training, like the mean-squared error, are less suitable. In the literature, entropy-based criteria have been proposed to deal with this problem. (iii) Spatially located sensors introduce some form of autocorrelation, that is, values collected by sensors show a correlation strictly due to their relative spatial proximity. Although all these issues have already been investigated in the literature, they have not been investigated in combination. In this paper, we propose a new method that learns artificial neural networks while addressing all of these issues. The method performs online adaptive training and enriches the entropy measures with spatial information of the data, in order to take spatial autocorrelation into account. Experimental results on two photovoltaic power production datasets are clearly favorable for entropy-based measures that take spatial autocorrelation into account, also when compared with state-of-the-art methods.
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 45
    Publication Date: 2019
Description: Defining appropriate distance functions is a crucial aspect of effective and efficient similarity-based prediction and retrieval. Relational data are especially challenging in this regard. By viewing relational data as multi-relational graphs, one can easily see that a distance between a pair of nodes can be defined in terms of a virtually unlimited class of features, including node attributes, attributes of node neighbors, structural aspects of the node neighborhood, and arbitrary combinations of these properties. In this paper we propose a rich and flexible class of metrics on graph entities based on the earth mover’s distance applied to a hierarchy of complex counts-of-counts statistics. We further propose an approximate version of the distance using sums of marginal earth mover’s distances. We show that the approximation is correct for many cases of practical interest and allows efficient nearest-neighbor retrieval when combined with a simple metric tree data structure. An experimental evaluation on two real-world scenarios highlights the flexibility of our framework for designing metrics representing different notions of similarity. Substantial improvements in similarity-based prediction are reported when compared to solutions based on state-of-the-art graph kernels. (A sketch of the earth mover’s distance ingredient follows this record.)
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
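    The base ingredient of the proposed metrics can be illustrated with a one-dimensional earth mover’s distance between two nodes’ counts-of-counts statistics; here, multisets of neighbor degrees on a standard example graph. The hierarchical metric in the paper is much richer than this sketch.

      import networkx as nx
      from scipy.stats import wasserstein_distance

      def neighbor_degree_profile(G, v):
          """Multiset of degrees seen in v's neighborhood (a counts-of-counts view)."""
          return [G.degree(u) for u in G.neighbors(v)]

      G = nx.karate_club_graph()
      d = wasserstein_distance(neighbor_degree_profile(G, 0),
                               neighbor_degree_profile(G, 33))
      print(f"EMD between neighborhood profiles of nodes 0 and 33: {d:.3f}")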
  • 46
    Publication Date: 2019
Description: Time Series Classification (TSC) is an important and challenging problem in data mining. With the increase in time series data availability, hundreds of TSC algorithms have been proposed. Among these methods, only a few have considered Deep Neural Networks (DNNs) for this task. This is surprising, as deep learning has seen very successful applications in recent years. DNNs have indeed revolutionized the field of computer vision, especially with the advent of novel deeper architectures such as residual and convolutional neural networks. Apart from images, sequential data such as text and audio can also be processed with DNNs to reach state-of-the-art performance for document classification and speech recognition. In this article, we study the current state-of-the-art performance of deep learning algorithms for TSC by presenting an empirical study of the most recent DNN architectures for TSC. We give an overview of the most successful deep learning applications in various time series domains under a unified taxonomy of DNNs for TSC. We also provide an open source deep learning framework to the TSC community, in which we implemented each of the compared approaches and evaluated them on a univariate TSC benchmark (the UCR/UEA archive) and 12 multivariate time series datasets. By training 8730 deep learning models on 97 time series datasets, we propose the most exhaustive study of DNNs for TSC to date. (A toy convolutional classifier follows this record.)
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
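    For flavor, here is a tiny 1-D convolutional classifier in PyTorch, in the spirit of the fully convolutional baselines that such benchmarks include; the architecture and the two-sample toy data are illustrative assumptions, not the study’s exact models.

      import torch
      import torch.nn as nn

      class TinyFCN(nn.Module):
          def __init__(self, n_classes):
              super().__init__()
              self.features = nn.Sequential(
                  nn.Conv1d(1, 64, 8, padding="same"), nn.BatchNorm1d(64), nn.ReLU(),
                  nn.Conv1d(64, 64, 5, padding="same"), nn.BatchNorm1d(64), nn.ReLU(),
                  nn.AdaptiveAvgPool1d(1),            # global average pooling
              )
              self.head = nn.Linear(64, n_classes)

          def forward(self, x):                       # x: (batch, 1, length)
              return self.head(self.features(x).squeeze(-1))

      # Toy usage: separate a sine wave from a square wave.
      t = torch.linspace(0, 6.28, 128)
      X = torch.stack([torch.sin(3 * t), torch.sign(torch.sin(3 * t))]).unsqueeze(1)
      y = torch.tensor([0, 1])
      model = TinyFCN(2)
      opt = torch.optim.Adam(model.parameters(), lr=1e-2)
      for _ in range(50):
          opt.zero_grad()
          loss = nn.functional.cross_entropy(model(X), y)
          loss.backward()
          opt.step()
      print(model(X).argmax(dim=1))                   # expected: tensor([0, 1])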
  • 47
    Publication Date: 2019
Description: Co-clustering is known to be a very powerful and efficient approach in unsupervised learning because of its ability to partition data based on both the observations and the variables of a given dataset. However, in a high-dimensional context, co-clustering methods may fail to provide a meaningful result due to the presence of noisy and/or irrelevant features. In this paper, we tackle this issue by proposing a novel co-clustering model that assumes the existence of a noise cluster containing all irrelevant features. A variational expectation-maximization-based algorithm is derived for this task, in which automatic variable selection and the joint clustering of objects and variables are achieved via a Bayesian framework. Experimental results on synthetic datasets show the efficiency of our model in the context of high-dimensional noisy data. Finally, we highlight the interest of the approach on two real datasets used to study genetic diversity across the world.
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 48
    Publication Date: 2019
Description: Many problem settings in machine learning are concerned with the simultaneous prediction of multiple target variables of diverse type. Amongst others, such problem settings arise in multivariate regression, multi-label classification, multi-task learning, dyadic prediction, zero-shot learning, network inference, and matrix completion. These subfields of machine learning are typically studied in isolation, without highlighting or exploring important relationships. In this paper, we present a unifying view on what we call multi-target prediction (MTP) problems and methods. First, we formally discuss commonalities and differences between existing MTP problems. To this end, we introduce a general framework that covers the above subfields as special cases. As a second contribution, we provide a structured overview of MTP methods. This is accomplished by identifying a number of key properties, which distinguish such methods and determine their suitability for different types of problems. Finally, we also discuss a few challenges for future research.
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 49
    Publication Date: 2019
Description: This paper proposes an unsupervised node-ranking model that considers not only the attributes of nodes in a graph but also the incompleteness of the graph structure. We formulate the unsupervised ranking task as an optimization task and propose a deep neural network (DNN) structure to solve it. The rich representation capability of the DNN structure, together with a novel design of the objectives, allows the proposed model to significantly outperform the state-of-the-art ranking solutions.
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 50
    Publication Date: 2019
Description: We revisit in this paper the problem of inferring a diffusion network from information cascades. In our study, we make no assumptions on the underlying diffusion model, in this way obtaining a generic method with broader practical applicability. Our approach exploits the pairwise adoption-time intervals from cascades. Starting from the observation that different kinds of information spread differently, these time intervals are interpreted as samples drawn from unknown (conditional) distributions. In order to statistically distinguish them, we propose a novel method using Reproducing Kernel Hilbert Space embeddings. Experiments on both synthetic and real-world data from Twitter and Flixster show that our method significantly outperforms the state-of-the-art methods. We argue that our algorithm can be implemented by parallel batch processing, in this way meeting the efficiency and scalability needs of real-world applications. (A kernel two-sample sketch follows this record.)
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
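    The statistical ingredient (comparing samples of adoption-time intervals via kernel embeddings) can be sketched with a plain maximum mean discrepancy statistic; the bandwidth and data below are illustrative, and the paper’s method builds considerably more on top of such embeddings.

      import numpy as np

      def mmd2(x, y, gamma=1.0):
          """Biased squared Maximum Mean Discrepancy with an RBF kernel."""
          def k(a, b):
              return np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)
          return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

      rng = np.random.default_rng(2)
      fast1 = rng.exponential(1.0, 500)   # intervals of quickly spreading content
      fast2 = rng.exponential(1.0, 500)   # a second sample of the same kind
      slow = rng.exponential(5.0, 500)    # intervals of slowly spreading content
      print(round(mmd2(fast1, fast2), 4), round(mmd2(fast1, slow), 4))  # small vs. larger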
  • 51
    Publication Date: 2015-07-19
Description: A knowledge base of triples like (subject entity, predicate relation, object entity) is a very important resource for knowledge management. It is very useful for human-like reasoning, query expansion, question answering (e.g., Siri) and other related AI tasks. However, such a knowledge base often suffers from incompleteness, owing to the ever-growing volume of knowledge in the real world and a lack of reasoning capability. In this paper, we propose a Pairwise-interaction Differentiated Embeddings model that embeds entities and relations in the knowledge base into low-dimensional vector representations and then predicts the likely truth of additional facts to extend the knowledge base. In addition, we present a probability-based objective function to improve the model optimization. Finally, we evaluate the model by computing how likely an additional triple is to be true, for the task of knowledge base completion. Experiments on WordNet and Freebase show the excellent performance of our model and algorithm. (A generic translation-style scoring sketch follows this record.)
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
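    To make the embedding idea concrete, here is a generic translation-style triple scorer; it is TransE-like and purely illustrative, not the paper’s Pairwise-interaction Differentiated Embeddings model, and the embeddings are random rather than learned.

      import numpy as np

      rng = np.random.default_rng(3)
      n_entities, n_relations, dim = 100, 10, 32
      E = rng.normal(size=(n_entities, dim))    # entity embeddings
      R = rng.normal(size=(n_relations, dim))   # relation embeddings

      def score(h, r, t):
          """Lower is better: how well tail t matches head h translated by r."""
          return np.linalg.norm(E[h] + R[r] - E[t])

      def rank_tail(h, r, true_t):
          """Rank of the true tail among all entities (a standard completion metric)."""
          scores = np.linalg.norm(E[h] + R[r] - E, axis=1)
          return int(np.argsort(scores).tolist().index(true_t)) + 1

      print(round(score(0, 1, 2), 3), rank_tail(0, 1, 2))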
  • 52
    Publication Date: 2015-10-20
Description: Mining frequent tree patterns has many applications in different areas such as XML data, bioinformatics and the World Wide Web. The crucial step in frequent pattern mining is frequency counting, which involves a matching operator to find occurrences (instances) of a tree pattern in a given collection of trees. A widely used matching operator for tree-structured data is subtree homeomorphism, where an edge in the tree pattern is mapped onto an ancestor-descendant relationship in the given tree. Tree patterns that are frequent under subtree homeomorphism are usually called embedded patterns. In this paper, we present an efficient algorithm for subtree homeomorphism with application to frequent pattern mining. We propose a compact data structure, called occ, which stores only information about the rightmost paths of occurrences and hence can encode and represent several occurrences of a tree pattern. We then define efficient join operations on the occ data structure, which help us count occurrences of tree patterns according to occurrences of their proper subtrees. Based on the proposed subtree homeomorphism method, we develop an effective pattern mining algorithm, called TPMiner. We evaluate the efficiency of TPMiner on several real-world and synthetic datasets. Our extensive experiments confirm that TPMiner always outperforms well-known existing algorithms, and in several cases the improvement with respect to existing algorithms is significant.
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 53
    Publication Date: 2015-10-25
Description: We consider online mining of correlated heavy-hitters (CHH) from a data stream. Given a stream of two-dimensional data, a correlated aggregate query first extracts a substream by applying a predicate along a primary dimension, and then computes an aggregate along a secondary dimension. Prior work on identifying heavy-hitters in streams has almost exclusively focused on a single-dimensional stream, and yields little insight into the properties of heavy-hitters along other dimensions. In typical applications, however, an analyst is interested not only in identifying heavy-hitters, but also in understanding further properties, such as: what other items appear frequently along with a heavy-hitter, or what is the frequency distribution of items that appear along with the heavy-hitters. We consider queries of the following form: “In a stream S of (x, y) tuples, on the substream H of all x values that are heavy-hitters, maintain those y values that occur frequently with the x values in H”. We call this problem CHH. We give an approximate formulation of CHH identification and present an algorithm for tracking CHHs on a data stream. The algorithm is easy to implement and uses workspace much smaller than the stream itself. We present provable guarantees on the maximum error, as well as detailed experimental results that demonstrate the space-accuracy trade-off. (A nested counter sketch follows this record.)
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
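    The flavor of CHH tracking can be sketched with a Misra-Gries summary over the primary dimension and a small nested summary of y values per tracked x; the sizes k1 and k2 are illustrative, and the sketch does not reproduce the paper’s error guarantees.

      def misra_gries_update(counters, item, k):
          """Standard Misra-Gries update keeping at most k-1 counters."""
          if item in counters:
              counters[item] += 1
          elif len(counters) < k - 1:
              counters[item] = 1
          else:
              for key in list(counters):
                  counters[key] -= 1
                  if counters[key] == 0:
                      del counters[key]

      def track_chh(stream, k1=10, k2=5):
          outer, inner = {}, {}
          for x, y in stream:
              misra_gries_update(outer, x, k1)
              if x in outer:
                  misra_gries_update(inner.setdefault(x, {}), y, k2)
              inner = {key: c for key, c in inner.items() if key in outer}
          return {x: inner.get(x, {}) for x in outer}

      stream = [(1, "a")] * 60 + [(1, "b")] * 25 + [(2, "c")] * 40 + [(3, "d")] * 5
      print(track_chh(stream))   # x=1 dominated by y="a", x=2 by y="c"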
  • 54
    Publication Date: 2015-06-30
Description: Betweenness centrality is a fundamental measure in social network analysis, expressing the importance or influence of individual vertices (or edges) in a network in terms of the fraction of shortest paths that pass through them. Since exact computation in large networks is prohibitively expensive, we present two efficient randomized algorithms for betweenness estimation. The algorithms are based on random sampling of shortest paths and offer probabilistic guarantees on the quality of the approximation. The first algorithm estimates the betweenness of all vertices (or edges): all approximate values are within an additive factor \(\varepsilon \in (0,1)\) of the real values, with probability at least \(1-\delta\). The second algorithm focuses on the top-K vertices (or edges) with highest betweenness and estimates their betweenness to within a multiplicative factor \(\varepsilon\), with probability at least \(1-\delta\). This is the first algorithm that can compute such an approximation for the top-K vertices (or edges). By proving upper and lower bounds on the VC-dimension of a range set associated with the problem at hand, we can bound the sample size needed to achieve the desired approximations. We obtain sample sizes that are independent of the number of vertices in the network and depend only on a characteristic quantity that we call the vertex-diameter, that is, the maximum number of vertices in a shortest path. In some cases, the sample size is completely independent of any quantitative property of the graph. An extensive experimental evaluation on real and artificial networks shows that, as the number of vertices grows, our algorithms are significantly faster and much more scalable than other algorithms with similar approximation guarantees. (A path-sampling sketch follows this record.)
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
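    The sampling idea behind the first algorithm can be sketched as follows: repeatedly pick a random vertex pair, pick one of their shortest paths uniformly at random, and credit its internal vertices. The fixed sample size r below replaces the paper’s principled, VC-dimension-based bound.

      import random
      import networkx as nx

      def approx_betweenness(G, r=2000, seed=0):
          rng = random.Random(seed)
          est = dict.fromkeys(G, 0.0)
          nodes = list(G)
          for _ in range(r):
              s, t = rng.sample(nodes, 2)
              try:
                  paths = list(nx.all_shortest_paths(G, s, t))
              except nx.NetworkXNoPath:
                  continue
              path = rng.choice(paths)       # uniform among shortest s-t paths
              for v in path[1:-1]:           # internal vertices get credit
                  est[v] += 1.0 / r
          return est

      G = nx.karate_club_graph()
      est = approx_betweenness(G)
      print(sorted(est, key=est.get, reverse=True)[:3])  # hubs such as 0 and 33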
  • 55
    Publication Date: 2015-06-03
Description: Much of the vast literature on time series classification makes several assumptions about data and the algorithm’s eventual deployment that are almost certainly unwarranted. For example, many research efforts assume that the beginning and ending points of the pattern of interest can be correctly identified, during both the training phase and later deployment. Another example is the common assumption that queries will be made at a constant rate that is known ahead of time, so that computational resources can be exactly budgeted. In this work, we argue that these assumptions are unjustified, and this has in many cases led to unwarranted optimism about the performance of the proposed algorithms. As we shall show, the task of correctly extracting individual gait cycles, heartbeats, gestures, behaviors, etc., is generally much more difficult than the task of actually classifying those patterns. Likewise, gesture classification systems deployed on a device such as Google Glass may issue queries at frequencies that range over an order of magnitude, making it difficult to plan computational resources. We propose to mitigate these problems by introducing an alignment-free time series classification framework. The framework requires only very weakly annotated data, such as “in this ten minutes of data, we see mostly normal heartbeats \(\ldots\)”, and by generalizing the classic machine learning idea of data editing to streaming/continuous data, allows us to build robust, fast and accurate anytime classifiers. We demonstrate on several diverse real-world problems that beyond removing unwarranted assumptions and requiring essentially no human intervention, our framework is both extremely fast and significantly more accurate than current state-of-the-art approaches.
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 56
    Publication Date: 2016-06-28
Description: We propose a method for unsupervised group matching, which is the task of finding correspondences between groups across different domains without cross-domain similarity measurements or paired data. For example, the proposed method can find matching topic categories in different languages without alignment information. The proposed method interprets a group as a probability distribution, which enables us to handle uncertainty in a limited amount of data and to incorporate high-order information on groups. Groups are matched by maximizing the dependence between distributions, for which we use the Hilbert-Schmidt independence criterion. By using kernel embedding, which maps distributions into a reproducing kernel Hilbert space, we can calculate the dependence between distributions without density estimation. In the experiments, we demonstrate the effectiveness of the proposed method on synthetic and real data sets, including an application to cross-lingual topic matching. (A compact HSIC sketch follows this record.)
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
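    The dependence measure being maximized can be written compactly; below is a biased empirical HSIC with Gaussian kernels on toy data, with dimensions and bandwidth chosen purely for illustration.

      import numpy as np

      def hsic(X, Y, gamma=1.0):
          """Biased empirical Hilbert-Schmidt Independence Criterion."""
          n = len(X)
          def gram(A):
              sq = ((A[:, None, :] - A[None, :, :]) ** 2).sum(-1)
              return np.exp(-gamma * sq)
          H = np.eye(n) - np.ones((n, n)) / n    # centering matrix
          return np.trace(gram(X) @ H @ gram(Y) @ H) / (n - 1) ** 2

      rng = np.random.default_rng(4)
      X = rng.normal(size=(200, 3))
      print(round(hsic(X, X + 0.1 * rng.normal(size=X.shape)), 4),  # dependent: large
            round(hsic(X, rng.normal(size=(200, 3))), 4))           # independent: near 0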
  • 57
    Publication Date: 2016-07-01
Description: The problem of sampling from data streams has attracted significant interest in the last decade. Whichever sampling criterion is considered (uniform sample, maximally diverse sample, etc.), the challenges stem from the relatively small amount of memory available in the face of unbounded streams. In this work we consider an interesting extension of this problem, the framework of which is stimulated by recent improvements in sensing technologies and robotics. In some situations it is not only possible to digitally sense some aspects of the world, but to physically capture a tangible aspect of that world. Currently deployed examples include devices that can capture water/air samples, and devices that capture individual insects or fish. Such devices create an interesting twist on the stream sampling problem, because in most cases the decision to take a physical sample is irrevocable. In this work we show how to generalize diversification sampling strategies to the irrevocable-choice setting, demonstrating our ideas on several real-world domains.
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 58
    Publication Date: 2015-09-30
    Description: There has been a growing recognition that issues of data quality, which are routine in practice, can materially affect the assessment of learned model performance. In this paper, we develop some analytic results that are useful in sizing the biases associated with tests of discriminatory model power when these are performed using corrupt (“noisy”) data. As it is sometimes unavoidable to test models with data that are known to be corrupt, we also provide some guidance on interpreting results of such tests. In some cases, with appropriate knowledge of the corruption mechanism, the true values of the performance statistics such as the area under the ROC curve may be recovered (in expectation), even when the underlying data have been corrupted. We also provide estimators of the standard errors of such recovered performance statistics. An analysis of the estimators reveals interesting behavior including the observation that “noisy” data does not “cancel out” across models even when the same corrupt data set is used to test multiple candidate models. Because our results are analytic, they may be applied in a broad range of settings and this can be done without the need for simulation.
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 59
    Publication Date: 2016-06-10
Description: We propose a novel distributed algorithm for mining frequent subgraphs from a single, very large, labeled network. Our approach is the first distributed method to mine a massive input graph that is too large to fit in the memory of any individual compute node. The input graph thus has to be partitioned among the nodes, which can lead to potential false negatives. Furthermore, for scalable performance it is crucial to minimize the communication among the compute nodes. Our algorithm, DistGraph, ensures that there are no false negatives, and uses a set of optimizations and efficient collective communication operations to minimize information exchange. To our knowledge DistGraph is the first approach demonstrated to scale to graphs with over a billion vertices and edges. Scalability results on up to 2048 IBM Blue Gene/Q compute nodes, with 16 cores each, show very good speedup.
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 60
    Publication Date: 2016-06-16
Description: This paper presents a framework for exact discovery of the top-k sequential patterns under leverage. It combines (1) a novel definition of the expected support for a sequential pattern (a concept on which most interestingness measures directly rely) with (2) Skopus, a new branch-and-bound algorithm for the exact discovery of top-k sequential patterns under a given measure of interest. Our interestingness measure employs the partition approach: a pattern is interesting to the extent that it is more frequent than can be explained by assuming independence between any of the pairs of patterns from which it can be composed. The larger the support compared to the expectation under independence, the more interesting the pattern. We build on these two elements to exactly extract the k sequential patterns with highest leverage, consistent with our definition of expected support. We conduct experiments on both synthetic data with known patterns and real-world datasets; both confirm the consistency and relevance of our approach with regard to the state of the art. (A simplified leverage computation follows this record.)
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
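    A deliberately simplified leverage computation for two-item sequential patterns, assuming support means the fraction of sequences containing the pattern as a subsequence: observed support minus the support expected if the two items occurred independently. The paper’s partition-based expected support is more careful than this sketch.

      from itertools import combinations

      def support(db, pattern):
          """Fraction of sequences containing pattern as a subsequence."""
          def contains(seq, pat):
              it = iter(seq)
              return all(p in it for p in pat)   # consumes the iterator in order
          return sum(contains(s, pattern) for s in db) / len(db)

      def pair_leverage(db, a, b):
          return support(db, (a, b)) - support(db, (a,)) * support(db, (b,))

      db = [list("abcab"), list("aab"), list("cba"), list("abab"), list("cc")]
      scores = {(a, b): round(pair_leverage(db, a, b), 3)
                for a, b in combinations("abc", 2)}
      print(sorted(scores.items(), key=lambda kv: -kv[1]))  # pairs ranked by leverage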
  • 61
    Publication Date: 2016-06-07
Description: We study the problem of graph summarization. Given a large graph, we aim at producing a concise lossy representation (a summary) that can be stored in main memory and used to approximately answer queries about the original graph much faster than by using the exact representation. In this work we study a very natural type of summary: the original set of vertices is partitioned into a small number of supernodes connected by superedges to form a complete weighted graph. The superedge weights are the edge densities between vertices in the corresponding supernodes. To quantify the dissimilarity between the original graph and a summary, we adopt the reconstruction error and the cut-norm error. By exposing a connection between graph summarization and geometric clustering problems (i.e., k-means and k-median), we develop the first polynomial-time approximation algorithms to compute the best possible summary of a certain size under both measures. We discuss how to use our summaries to store a (lossy or lossless) compressed graph representation and to approximately answer a large class of queries about the original graph, including adjacency, degree, eigenvector centrality, and triangle and subgraph counting. Using the summary to answer queries is very efficient, as the running time depends on the number of supernodes in the summary rather than the number of nodes in the original graph. (A clustering-based summary sketch follows this record.)
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
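    The connection to geometric clustering suggests a very small sketch: cluster the adjacency rows with k-means and store only the between-supernode edge densities. The reconstruction-error machinery and guarantees of the paper are omitted.

      import numpy as np
      import networkx as nx
      from sklearn.cluster import KMeans

      G = nx.karate_club_graph()
      A = nx.to_numpy_array(G)
      k = 4
      labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(A)

      # Superedge weight = edge density between (or within) supernodes.
      density = np.zeros((k, k))
      for i in range(k):
          for j in range(k):
              density[i, j] = A[np.ix_(labels == i, labels == j)].mean()

      print(np.round(density, 2))   # a k-by-k lossy summary of the graph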
  • 62
    Publication Date: 2016-06-07
    Description: Most of the empirical evaluations of active learning approaches in the literature have focused on a single classifier and a single performance measure. We present an extensive empirical evaluation of common active learning baselines using two probabilistic classifiers and several performance measures on a number of large datasets. In addition to providing important practical advice, our findings highlight the importance of overlooked choices in active learning experiments in the literature. For example, one of our findings shows that model selection is as important as devising an active learning approach, and choosing one classifier and one performance measure can often lead to unexpected and unwarranted conclusions. Active learning should generally improve the model’s capability to distinguish between instances of different classes, but our findings show that the improvements provided by active learning for one performance measure often came at the expense of another measure. We present several such results, raise questions, guide users and researchers to better alternatives, caution against unforeseen side effects of active learning, and suggest future research directions.
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 63
    Publication Date: 2016-05-27
Description: Finding dense subgraphs is an important problem in graph mining and has many practical applications. At the same time, while large real-world networks are known to have many communities that are not well-separated, the majority of the existing work focuses on the problem of finding a single densest subgraph. Hence, it is natural to consider the question of finding the top-k densest subgraphs. One major challenge in addressing this question is how to handle overlaps: eliminating overlaps completely is one option, but this may lead to extracting subgraphs not as dense as would be possible if a limited amount of overlap were allowed. Furthermore, overlaps are desirable, as in most real-world graphs there are vertices that belong to more than one community, and thus to more than one densest subgraph. In this paper we study the problem of finding top-k overlapping densest subgraphs, and we present a new approach that improves over the existing techniques, both in theory and practice. First, we reformulate the problem definition in a way that lets us obtain an algorithm with a constant-factor approximation guarantee. Our approach relies on techniques for solving the max-sum diversification problem, which we extend in order to make them applicable to our setting. Second, we evaluate our algorithm on a collection of benchmark datasets and show that it convincingly outperforms the previous methods, both in terms of quality and efficiency. (The classic single-subgraph peeling routine is sketched after this record.)
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
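    As background, the classic greedy peeling routine for a single densest subgraph (Charikar’s 2-approximation) is easy to state; the paper’s contribution layers top-k overlap handling and approximation guarantees on top of building blocks like this.

      import networkx as nx

      def densest_subgraph_peel(G):
          """Repeatedly remove a min-degree vertex; keep the densest prefix."""
          H = G.copy()
          best_nodes = list(H)
          best_density = H.number_of_edges() / len(H)
          while len(H) > 1:
              v = min(H.degree, key=lambda nd: nd[1])[0]
              H.remove_node(v)
              d = H.number_of_edges() / len(H)
              if d > best_density:
                  best_nodes, best_density = list(H), d
          return best_nodes, best_density

      nodes, density = densest_subgraph_peel(nx.karate_club_graph())
      print(len(nodes), round(density, 3))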
  • 64
    Publication Date: 2015-05-21
Description: In this paper we present a novel method for clustering words in micro-blogs, based on the similarity of the related temporal series. Our technique, named SAX*, uses the Symbolic Aggregate ApproXimation algorithm to discretize the temporal series of terms into a small set of levels, leading to a string for each. We then define a subset of “interesting” strings, i.e., those representing patterns of collective attention. Sliding temporal windows are used to detect co-occurring clusters of tokens with the same or similar string. To assess the performance of the method we first tune the model parameters on a 2-month 1% Twitter stream, during which a number of world-wide events of differing type and duration (sports, politics, disasters, health, and celebrities) occurred. Then, we evaluate the quality of all discovered events in a 1-year stream, “googling” the most frequent cluster n-grams and manually assessing how many clusters correspond to published news in the same temporal slot. Finally, we perform a complexity evaluation and compare SAX* with three alternative methods for event discovery. Our evaluation shows that SAX* is at least one order of magnitude less complex than other temporal and non-temporal approaches to micro-blog clustering. (A sketch of the underlying SAX discretization follows this record.)
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
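    A sketch of the SAX building block that SAX* extends, assuming illustrative segment and alphabet sizes: z-normalize, average over segments, and map the averages to letters using Gaussian breakpoints.

      import numpy as np
      from scipy.stats import norm

      def sax(series, n_segments=8, alphabet=4):
          x = np.asarray(series, float)
          x = (x - x.mean()) / (x.std() + 1e-12)          # z-normalize
          x = x[: len(x) // n_segments * n_segments]
          paa = x.reshape(n_segments, -1).mean(axis=1)    # piecewise aggregate
          cuts = norm.ppf(np.linspace(0, 1, alphabet + 1)[1:-1])
          return "".join("abcdefgh"[np.searchsorted(cuts, v)] for v in paa)

      t = np.linspace(0, 4 * np.pi, 160)
      print(sax(np.sin(t)))   # a repeating letter pattern tracking the two periods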
  • 65
    Publication Date: 2015-05-29
Description: Diffusion magnetic resonance imaging data allow reconstructing the neural pathways of the white matter of the brain as a set of 3D polylines. This kind of data set provides a means of studying the anatomical structures within the white matter, in order to detect neurologic diseases and understand the anatomical connectivity of the brain. To the best of our knowledge, there is still no effective or satisfactory method for the automatic processing of these data. Therefore, manually guided visual exploration by experts is crucial for the purpose. However, because of the large size of these data sets, visual exploration and analysis have become intractable. In order to combine the advantages of manual and automatic analysis, we have developed a new visual data mining tool for the analysis of human brain anatomical connectivity. With this tool, human and algorithmic capabilities are integrated in an interactive data exploration and analysis process. A very important aspect taken into account when designing this tool was to provide the user with comfortable interaction. For this purpose, we tackle the scalability issue in the different stages of the system, including the automatic algorithm and the visualization and interaction techniques that are used.
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 66
    Publication Date: 2015-04-09
Description: Many networks can be modeled as signed graphs. These include social networks and relationship/interaction networks. Detecting sub-structures in such networks helps us understand user behavior, predict links, and recommend products. In this paper, we detect dense sub-structures from a signed graph, called quasi antagonistic communities (QACs). An antagonistic community consists of two groups of users expressing positive relationships within each group but negative relationships across groups. Instead of requiring a complete set of negative links across its groups, a QAC allows a small number of inter-group negative links to be missing. We propose an algorithm, Mascot, to find all maximal quasi antagonistic communities (MQACs). Mascot consists of two stages: pruning and enumeration. Based on the properties of QACs, we propose four pruning rules to reduce the size of candidate graphs in the pruning stage. We use an enumeration tree to enumerate all strongly connected subgraphs in a top-down fashion in the second stage before they are used to construct MQACs. We have conducted extensive experiments using synthetic signed graphs and two real networks to demonstrate the efficiency and accuracy of the Mascot algorithm. We have also found that detecting MQACs helps us to predict the signs of links.
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 67
    Publication Date: 2015-04-24
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 68
    Publication Date: 2015-06-16
Description: Graphs that evolve over time, such as friendship networks, are an example of data naturally represented as binary tensors. Similarly to analysing the adjacency matrix of a graph using a matrix factorization, we can analyse the tensor by factorizing it. Unfortunately, tensor factorizations are computationally hard problems and, in particular, are often significantly harder than their matrix counterparts. In the case of Boolean tensor factorizations, where the input tensor and all the factors are required to be binary and we use Boolean algebra, much of that hardness comes from the possibility of overlapping components. Yet in many applications we are perfectly happy to partition at least one of the modes. For instance, in the aforementioned time-evolving friendship networks, groups of friends might be overlapping, but the time points at which the network was captured are always distinct. In this paper we investigate what consequences this partitioning has on the computational complexity of Boolean tensor factorizations and present a new algorithm for the resulting clustering problem. This algorithm can alternatively be seen as a particularly regularized clustering algorithm that can handle extremely high-dimensional observations. We analyse our algorithm with the goal of maximizing the similarity and argue that this is more meaningful than minimizing the dissimilarity. As a by-product we obtain a PTAS and an efficient 0.828-approximation algorithm for rank-1 binary factorizations. Our algorithm for Boolean tensor clustering achieves high scalability, high similarity, and good generalization to unseen data on both synthetic and real-world data sets.
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 69
    Publication Date: 2015-01-23
Description: When we are investigating an object in a data set, which itself may or may not be an outlier, can we identify unusual (i.e., outlying) aspects of the object? In this paper, we identify the novel problem of mining outlying aspects on numeric data. Given a query object \(o\) in a multidimensional numeric data set \(O\), in which subspace is \(o\) most outlying? Technically, we use the rank of the probability density of an object in a subspace to measure the outlyingness of the object in that subspace. A minimal subspace where the query object is ranked best is an outlying aspect. Computing the outlying aspects of a query object is far from trivial: a naïve method has to calculate the probability densities of all objects and rank them in every subspace, which is very costly when the dimensionality is high. We systematically develop a heuristic method that is capable of searching data sets with tens of dimensions efficiently. Our empirical study using both real and synthetic data demonstrates that our method is effective and efficient. (A brute-force sketch of the density-rank measure follows this record.)
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
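    The density-rank measure can be brute-forced on small data, enumerating 1-D and 2-D subspaces with a kernel density estimate; the paper’s heuristic search is precisely about avoiding this enumeration in higher dimensions. All sizes below are illustrative.

      import numpy as np
      from itertools import combinations
      from scipy.stats import gaussian_kde

      def outlying_aspects(O, q_idx, max_dim=2):
          n, d = O.shape
          best = None
          for k in range(1, max_dim + 1):
              for dims in combinations(range(d), k):
                  data = O[:, dims].T
                  dens = gaussian_kde(data)(data)             # density of every object
                  rank = int((dens < dens[q_idx]).sum()) + 1  # 1 = most outlying
                  if best is None or rank < best[0]:
                      best = (rank, dims)
          return best

      rng = np.random.default_rng(5)
      O = rng.normal(size=(300, 4))
      O[0, 2] = 6.0                    # plant an outlying value in dimension 2
      print(outlying_aspects(O, 0))    # expected: (1, (2,))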
  • 70
    Publication Date: 2014-12-09
    Description: In order to find patterns in data, it is often necessary to aggregate or summarise data at a higher level of granularity. Selecting the appropriate granularity is a challenging task and often no principled solutions exist. This problem is particularly relevant in analysis of data with sequential structure. We consider this problem for a specific type of data, namely event sequences. We introduce the problem of finding the best set of window lengths for analysis of event sequences for algorithms with real-valued output. We present suitable criteria for choosing one or multiple window lengths and show that these naturally translate into a computational optimisation problem. We show that the problem is NP-hard in general, but that it can be approximated efficiently and even analytically in certain cases. We give examples of tasks that demonstrate the applicability of the problem and present extensive experiments on both synthetic data and real data from several domains. We find that the method works well in practice, and that the optimal sets of window lengths themselves can provide new insight into the data.
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 71
    Publication Date: 2015-02-19
    Description: We study the problem of finding the longest common sub-pattern (LCSP) shared by two sequences of temporal intervals. In particular we are interested in finding the LCSP of the corresponding arrangements. Arrangements of temporal intervals are a powerful way to encode multiple concurrent labeled events that have a time duration. Discovering commonalities among such arrangements is useful for a wide range of scientific fields and applications, as it can be seen by the number and diversity of the datasets we use in our experiments. In this paper, we define the problem of LCSP and prove that it is NP-complete by demonstrating a connection between graphs and arrangements of temporal intervals. This connection leads to a series of interesting open problems. In addition, we provide an exact algorithm to solve the LCSP problem, and also propose and experiment with three polynomial time and space under-approximation techniques. Finally, we introduce two upper bounds for LCSP and study their suitability for speeding up 1-NN search. Experiments are performed on seven datasets taken from a wide range of real application domains, plus two synthetic datasets. Lastly, we describe several application cases that demonstrate the need and suitability of LCSP.
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 72
    Publication Date: 2015-02-05
Description: Huge volumes of biomedical text data discussing different biomedical entities are generated every day. Hidden in those unstructured data are strong relevance relationships between those entities, which are critical for many interesting applications, including building knowledge bases for the biomedical domain and semantic search among biomedical entities. In this paper, we study the problem of discovering strong relevance between heterogeneous typed biomedical entities from massive biomedical text data. We first build an entity correlation graph from the data, in which the collection of paths linking two heterogeneous entities offers rich semantic context for their relationship, especially those paths following the patterns of the top-\(k\) selected meta paths inferred from data. Guided by such meta paths, we design a novel relevance measure, named \({\mathsf{EntityRel}}\), to compute the strong relevance between two heterogeneous entities. Our intuition is that two entities of heterogeneous types are strongly relevant if they have strong direct links or are linked closely to other strongly relevant heterogeneous entities along paths following the selected patterns. We provide experimental results on mining strong relevance between drugs and diseases. More than 20 million MEDLINE abstracts and 5 types of biological entities (Drug, Disease, Compound, Target, MeSH) are used to construct the entity correlation graph. A prototype drug search engine for disease queries is implemented. Extensive comparisons are made against multiple state-of-the-art methods on examples of Drug-Disease relevance discovery.
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 73
    Publication Date: 2015-01-28
Description: For many multi-dimensional data applications, tensor operations as well as relational operations both need to be supported throughout the data lifecycle. Tensor-based representations (including two widely used tensor decompositions, CP and Tucker) are proven to be effective in multi-aspect data analysis, and tensor decomposition is an important tool for capturing high-order structures in multi-dimensional data. Although tensor decomposition is effective for multi-dimensional data analysis, its cost is often very high. Since the number of modes of the tensor data is one of the main factors contributing to the cost of tensor operations, in this paper we focus on reducing the modality of the input tensors to tackle the computational cost of the decomposition process. We propose a novel decomposition-by-normalization scheme that first normalizes the given relation into smaller tensors based on the functional dependencies of the relation, decomposes these smaller tensors, and then recombines the sub-results to obtain the overall decomposition. The decomposition and recombination steps of the scheme fit naturally in settings with multiple cores. This leads to a highly efficient, effective, and parallelized decomposition-by-normalization algorithm for both dense and sparse tensors, for both CP and Tucker decompositions. Experimental results confirm the efficiency and effectiveness of the proposed scheme compared to conventional nonnegative CP and Tucker decomposition approaches.
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 74
    Publication Date: 2015-02-04
Description: Finding subsets of a dataset that somehow deviate from the norm, i.e., where something interesting is going on, is a classical data mining task. In traditional local pattern mining methods, such deviations are measured in terms of a relatively high occurrence (frequent itemset mining) or an unusual distribution for one designated target attribute (the common use of subgroup discovery). These, however, do not encompass all forms of “interesting”. To capture a more general notion of interestingness in subsets of a dataset, we develop Exceptional Model Mining (EMM). This is a supervised local pattern mining framework, where several target attributes are selected and a model over these targets is chosen to be the target concept. Then, we strive to find subgroups: subsets of the dataset that can be described by a few conditions on single attributes. Such subgroups are deemed interesting when the model over the targets on the subgroup is substantially different from the model on the whole dataset. For instance, we can find subgroups where two target attributes have an unusual correlation, a classifier has deviating predictive performance, or a Bayesian network fitted on several target attributes has an exceptional structure. We give an algorithmic solution for the EMM framework and analyze its computational complexity. We also discuss some illustrative applications of EMM instances, including using the Bayesian network model to identify meteorological conditions under which food chains are displaced, and using a regression model to find the subset of households in the Chinese province of Hunan that do not follow the general economic law of demand. (A toy instance of the correlation model follows this record.)
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
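    A toy instance of the EMM correlation model, with made-up column names and planted structure: scan simple attribute conditions and score each subgroup by how far the correlation between two targets deviates from the global correlation.

      import numpy as np
      import pandas as pd

      rng = np.random.default_rng(6)
      n = 1000
      df = pd.DataFrame({
          "region": rng.choice(["north", "south"], n),
          "income": rng.normal(50, 10, n),
      })
      # price tracks income everywhere except in the planted "south" subgroup
      df["price"] = (df["income"] * np.where(df["region"] == "south", -1, 1)
                     + rng.normal(0, 5, n))

      global_corr = df["income"].corr(df["price"])
      for value in df["region"].unique():
          sub = df[df["region"] == value]
          quality = abs(sub["income"].corr(sub["price"]) - global_corr)
          print(f"region={value}: n={len(sub)}, quality={quality:.2f}")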
  • 75
    Publication Date: 2015-02-05
Description: Nodes in complex networks inherently represent different kinds of functional or organizational roles. In the dynamic process of an information cascade, users play different roles in spreading the information: some act as seeds to initiate the process, some limit the propagation, and others are in between. Understanding the roles of users is crucial in modeling cascades. Previous research mainly focuses on modeling user behavior based on the dynamic exchange of information with neighbors. We argue, however, that the structural patterns in the neighborhood of nodes may already contain enough information to infer users’ roles, independently of the information flow itself. To explore this possibility, we examine how the network characteristics of users affect their actions in the cascade. We also advocate that temporal information is very important. With this in mind, we propose an unsupervised methodology based on ensemble clustering to classify users into their social roles in a network, using not only their current topological positions but also their history over time. Our experiments on two social networks, Flickr and Digg, show that topological metrics indeed possess discriminatory power and that different structural patterns correspond to different parts of the process. We observe that user commitment in the neighborhood considerably affects the influence score of users. In addition, we discover that the cohesion of the neighborhood is important in the blocking behavior of users. With this we can construct topological fingerprints that help us identify social roles, based solely on structural social ties, independently of node activity and how information flows.
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 76
    Publication Date: 2014-11-29
Description: The data storage paradigm has changed in the last decade, from operational databases to data repositories that make it easier to analyze data and mine information. Among these, the primary multidimensional model represents data through star schemas, where each relation denotes an event involving a set of dimensions or business perspectives. Mining data modeled as a star schema presents two major challenges, namely mining extremely large amounts of data and dealing with several data tables at the same time. In this paper, we describe an algorithm, Star FP Stream, in detail. This algorithm aims to find the set of frequent patterns in a large star schema, mining the data directly, in their original structure, and exploring the most efficient techniques for mining data streams. Experiments were conducted over two star schemas, in the healthcare and sales domains.
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 77
    Publication Date: 2015-07-07
    Description: Time series data mining has received much greater interest along with the increase in temporal data sets from different domains such as medicine, finance, multimedia, etc. Representations are important to reduce dimensionality and generate useful similarity measures. High-level representations such as Fourier transforms, wavelets, piecewise polynomial models, etc., were considered previously. Recently, autoregressive kernels were introduced to reflect the similarity of the time series. We introduce a novel approach to model the dependency structure in time series that generalizes the concept of autoregression to local autopatterns. Our approach generates a pattern-based representation along with a similarity measure called learned pattern similarity (LPS). A tree-based ensemble-learning strategy that is fast and insensitive to parameter settings is the basis for the approach. Then, a robust similarity measure based on the learned patterns is presented. This unsupervised approach to represent and measure the similarity between time series generally applies to a number of data mining tasks (e.g., clustering, anomaly detection, classification). Furthermore, an embedded learning of the representation avoids pre-defined features and an extraction step which is common in some feature-based approaches. The method generalizes in a straightforward manner to multivariate time series. The effectiveness of LPS is evaluated on time series classification problems from various domains. We compare LPS to eleven well-known similarity measures. Our experimental results show that LPS provides fast and competitive results on benchmark datasets from several domains. Furthermore, LPS provides a research direction and template approach that breaks from the linear dependency models to potentially foster other promising nonlinear approaches.
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 78
    Publication Date: 2015-07-07
Description: A measure of distance between two clusterings has important applications, including clustering validation and ensemble clustering. Generally, such a distance measure provides navigation through the space of possible clusterings. Mostly used in cluster validation, a normalized clustering distance, a.k.a. an agreement measure, compares a given clustering result against the ground-truth clustering. The two most widely used clustering agreement measures are the adjusted Rand index and normalized mutual information. In this paper, we present a generalized clustering distance from which these two measures can be derived. We then use this generalization to construct new measures specific to comparing the (dis)agreement of clusterings in networks, a.k.a. communities. Further, we discuss the difficulty of extending the current, contingency-based formulations to overlapping cases, and present an alternative algebraic formulation for these (dis)agreement measures. Unlike the original measures, the new co-membership based formulation is easily extendable to different cases, including overlapping clusters and clusters of inter-related data. These two extensions are particularly important in the context of finding communities in complex networks. (The two standard measures are computed in the example after this record.)
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
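    The two standard agreement measures discussed above can be computed directly with scikit-learn; the label vectors below are toy data for illustration only.

        from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

        truth = [0, 0, 0, 1, 1, 1, 2, 2]   # ground-truth clustering
        found = [0, 0, 1, 1, 1, 1, 2, 2]   # clustering under evaluation

        print("ARI:", adjusted_rand_score(truth, found))
        print("NMI:", normalized_mutual_info_score(truth, found))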
  • 79
    Publication Date: 2015-07-07
    Description: As the World Wide Web develops at an unprecedented pace, identifying web page genre has attracted increasing attention because of its importance in web search. A common approach for identifying genre is to use textual features that can be extracted directly from a web page, that is, On-Page features. The extracted features are then fed into a machine learning algorithm that performs the classification. However, these approaches may be ineffective when the web page contains limited textual information (e.g., a page full of images). In this study, we address genre identification of web pages in that situation. We propose a framework that uses On-Page features while simultaneously considering information in neighboring pages, that is, pages connected to the original page by backward and forward links. We first introduce a graph-based model called GenreSim, which selects an appropriate set of neighboring pages. We then construct a multiple classifier combination module that exploits information from the selected neighboring pages together with On-Page features to improve genre identification. Experiments are conducted on well-known corpora, and favorable results indicate that our proposed framework is effective, particularly in identifying web pages with limited textual information. (A sketch of the general neighbor-combination idea follows this entry.)
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
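    The general neighbor-combination idea can be sketched as blending a page's own predicted genre distribution with those of its linked neighbors. This is a hedged illustration of the scheme, not the GenreSim algorithm itself; `clf` is any fitted probabilistic classifier, and `features`, `links`, and `alpha` are hypothetical names and parameters.

        import numpy as np

        def neighbour_aware_proba(clf, features, links, page, alpha=0.6):
            """Blend the on-page prediction with the mean prediction of
            neighbouring pages; alpha weights the page's own evidence."""
            own = clf.predict_proba(features[page:page + 1])[0]
            nbrs = links.get(page, [])
            if not nbrs:
                return own
            nbr_mean = np.mean(
                [clf.predict_proba(features[n:n + 1])[0] for n in nbrs], axis=0)
            return alpha * own + (1 - alpha) * nbr_mean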
  • 80
    Publication Date: 2012-11-10
    Description: The generalized Dirichlet distribution has been shown to be a more appropriate prior than the Dirichlet distribution for naïve Bayesian classifiers. When the dimension of a generalized Dirichlet random vector is large, the computational effort for calculating the expected value of a random variable can be high. In document classification, the number of distinct words, which is the dimension of a prior for naïve Bayesian classifiers, generally exceeds ten thousand. Generalized Dirichlet priors can therefore be impractical for document classification from the viewpoint of computational efficiency. In this paper, some properties of the generalized Dirichlet distribution are established to accelerate the calculation of the expected values of random variables. These properties are then used to construct noninformative generalized Dirichlet priors for naïve Bayesian classifiers with multinomial models. Our experimental results on two document sets show that generalized Dirichlet priors can achieve significantly higher prediction accuracy while the computational efficiency of naïve Bayesian classifiers is preserved. (A sketch of the Dirichlet-smoothed multinomial machinery appears after this entry.) DOI: 10.1007/s10618-012-0296-4. Author: Tzu-Tsung Wong (Institute of Information Management, National Cheng Kung University, Tainan, Taiwan).
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
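    A minimal sketch of the standard Dirichlet/multinomial machinery the paper builds on: under a Dirichlet(alpha) prior, the posterior mean of the word probability theta_w given counts n_w is E[theta_w | n] = (alpha_w + n_w) / (sum_v alpha_v + N), which is exactly the smoothed estimate used by naïve Bayes. The generalized Dirichlet case in the paper replaces this closed form with properties that make its expectations cheap to compute.

        import numpy as np

        def posterior_mean(word_counts, alpha):
            """Posterior mean of multinomial parameters under Dirichlet(alpha)."""
            word_counts = np.asarray(word_counts, dtype=float)
            alpha = np.asarray(alpha, dtype=float)
            return (alpha + word_counts) / (alpha.sum() + word_counts.sum())

        # Example: uniform (Laplace) prior over a 5-word vocabulary.
        print(posterior_mean([3, 0, 1, 0, 6], alpha=np.ones(5)))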
  • 81
    Publication Date: 2012-10-16
    Description: A matrix M is said to be k-anonymous if for each row r in M there are at least k − 1 other rows in M which are identical to r. The NP-hard k-Anonymity problem asks, given an n × m matrix M over a fixed alphabet and an integer s > 0, whether M can be made k-anonymous by suppressing (blanking out) at most s entries. Complementing previous work, we introduce two new "data-driven" parameterizations for k-Anonymity, the number t_in of different input rows and the number t_out of different output rows, both modeling aspects of data homogeneity. We show that k-Anonymity is fixed-parameter tractable for the parameter t_in, and that it is NP-hard even for t_out = 2 and alphabet size four. Notably, our fixed-parameter tractability result implies that k-Anonymity can be solved in linear time when t_in is a constant. Our computational hardness results also extend to the related privacy problems p-Sensitivity and ℓ-Diversity, while our fixed-parameter tractability results extend to p-Sensitivity and to the usage of domain generalization hierarchies, where entries are replaced by more general data instead of being completely suppressed. (A k-anonymity check is sketched after this entry.) DOI: 10.1007/s10618-012-0293-7. Authors: Robert Bredereck, André Nichterlein, Rolf Niedermeier (Institut für Softwaretechnik und Theoretische Informatik, TU Berlin), Geevarghese Philip (Max-Planck-Institut für Informatik, Saarbrücken).
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
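    A tiny illustration of the k-anonymity property itself: every row must occur at least k times, i.e., have at least k − 1 identical companions. Here a '*' entry stands for a suppressed value; the data are toy examples.

        from collections import Counter

        def is_k_anonymous(matrix, k):
            """True iff each distinct row of the matrix appears at least k times."""
            counts = Counter(tuple(row) for row in matrix)
            return all(c >= k for c in counts.values())

        rows = [("a", "*"), ("a", "*"), ("b", "c"), ("b", "c")]
        print(is_k_anonymous(rows, 2))   # True: both distinct rows occur twice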
  • 82
    Publication Date: 2016-01-26
    Description: We investigate algorithms for efficiently detecting anomalies in real-valued one-dimensional time series. Past work has shown that a simple brute-force algorithm, which scores each subsequence of a testing time series by the Euclidean distance to its nearest neighbor among the subsequences of a training time series, is one of the most effective anomaly detectors (a sketch of this baseline follows this entry). We investigate a very efficient implementation of this method and show that it is still too slow for most real-world applications. Next, we present a new method based on summarizing the training time series with a small set of exemplars. The exemplars we use are feature vectors that capture both the high-frequency and low-frequency information in sets of similar subsequences of the time series. We show that this exemplar-based method is much faster than both the efficient brute-force method and a prediction-based method, while also handling a wider range of anomalies. We compare our algorithm across a large variety of publicly available time series and encourage others to do the same. Our exemplar-based algorithm is able to process in minutes time series that would take other methods days.
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
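    A hedged sketch of the brute-force baseline described above: the anomaly score of each test subsequence is its Euclidean distance to the nearest subsequence of the training series. The window length and data are illustrative.

        import numpy as np

        def subsequences(series, w):
            """All sliding windows of length w, stacked as rows."""
            return np.stack([series[i:i + w] for i in range(len(series) - w + 1)])

        def nn_anomaly_scores(train, test, w=16):
            """Distance from each test window to its nearest training window."""
            A, B = subsequences(train, w), subsequences(test, w)
            # Pairwise squared distances via |a-b|^2 = |a|^2 - 2ab + |b|^2.
            d2 = (B**2).sum(1)[:, None] - 2 * B @ A.T + (A**2).sum(1)[None, :]
            return np.sqrt(np.maximum(d2.min(axis=1), 0.0))

        rng = np.random.default_rng(0)
        train = rng.normal(size=500)
        test = rng.normal(size=200)
        test[100:116] += 4.0                             # inject an anomaly
        print(nn_anomaly_scores(train, test).argmax())   # index near the anomaly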
  • 83
    Publication Date: 2016-01-26
    Description: An asymmetric correlation measure commonly used in social economics, called the Gini correlation, is defined between a numerical response and a rank. We generalize the definition of this correlation so that it can be applied to data mining. The new definition, called the generalized Gini correlation, is found to include special cases that are equivalent to common evaluation measures used in data mining, for example, the LIFT measure for a binary response and the expected profit measure for a monetary response. We consider estimation and inference for this generalized Gini correlation. The asymptotic distribution of the estimated correlation is derived with the help of empirical process theory. We consider several ways of constructing confidence intervals and demonstrate their performance numerically. Our paper is interdisciplinary and contributes both to the Gini literature and to the literature on statistical inference for performance measures in data mining. (The classical estimator is sketched after this entry.)
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 84
    Publication Date: 2016-01-28
    Description: This paper presents a new efficient exact algorithm for listing triangles in a large graph. While the problem of listing triangles in a graph has been considered before, dealing with large graphs remains a challenge. Although previous research has attempted to tackle the challenge, this is the first contribution that addresses the problem on a compressed copy of the input graph. In fact, the proposed solution lists the triangles without decompressing the graph. This yields improvements in both the storage requirements of the graphs and their processing time. (The standard uncompressed enumeration is sketched after this entry for reference.)
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
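    For reference, a hedged sketch of plain (uncompressed) triangle listing: the standard edge-iterator approach, which reports each triangle exactly once by ordering vertices. The paper's contribution is to perform this kind of enumeration directly on a compressed graph; the code below does not attempt that.

        from collections import defaultdict

        def list_triangles(edges):
            """Yield each triangle (u, v, w) with u < v < w exactly once."""
            nbrs = defaultdict(set)
            for u, v in edges:
                nbrs[u].add(v)
                nbrs[v].add(u)
            for u, v in edges:
                a, b = min(u, v), max(u, v)
                for w in nbrs[a] & nbrs[b]:
                    if w > b:           # report only for the lowest edge
                        yield (a, b, w)

        print(list(list_triangles([(0, 1), (1, 2), (0, 2), (2, 3)])))  # [(0, 1, 2)]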
  • 85
    Publication Date: 2016-02-05
    Description: Personal route prediction has emerged as an important topic within the mobility mining domain. In this context, many proposals apply an off-line learning process before being able to run the on-line prediction algorithm. The present work introduces a novel framework that integrates route learning and the prediction algorithm in an on-line manner. By means of a thin-client and server architecture, it also puts forward a new concept for route abstraction based on the detection of spatial regions where certain velocity features of routes frequently change. The proposal is evaluated on real-world and synthetic datasets and compared with a well-established mechanism, exhibiting quite promising results.
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 86
    Publication Date: 2016-03-02
    Description: Many complex multi-target prediction problems that concern large target spaces are characterised by a need for efficient prediction strategies that avoid computing predictions for all targets explicitly. Examples of such problems emerge in several subfields of machine learning, such as collaborative filtering, multi-label classification, dyadic prediction and biological network inference. In this article we analyse efficient and exact algorithms for computing the top-K predictions in the above problem settings, using a general class of models that we refer to as separable linear relational models. We show how to use these inference algorithms, which are modifications of well-known information retrieval methods, in a variety of machine learning settings. Furthermore, we study the possibility of scoring items incompletely, while still retaining an exact top-K retrieval. Experimental results in several application domains reveal that the so-called threshold algorithm is very scalable, often performing many orders of magnitude more efficiently than the naive approach. (A sketch of the threshold algorithm follows this entry.)
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
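    A hedged sketch of Fagin's threshold algorithm (TA) for exact top-K retrieval under a separable, monotone linear score f(i) = sum_j w_j * x_ij, assuming nonnegative weights and features. Per-dimension sorted lists allow the scan to stop before all items are scored; the function and variable names are illustrative, not from the paper.

        import heapq
        import numpy as np

        def top_k_threshold(X, w, K):
            """Exact top-K item indices under score X @ w, via TA."""
            n, d = X.shape
            order = [np.argsort(-X[:, j]) for j in range(d)]  # sorted access lists
            seen, heap = set(), []                            # min-heap of (score, item)
            for depth in range(n):
                for j in range(d):
                    i = order[j][depth]
                    if i not in seen:                         # random access: full score
                        seen.add(i)
                        s = float(X[i] @ w)
                        if len(heap) < K:
                            heapq.heappush(heap, (s, i))
                        elif s > heap[0][0]:
                            heapq.heapreplace(heap, (s, i))
                # Threshold: best possible score of any still-unseen item.
                t = sum(w[j] * X[order[j][depth], j] for j in range(d))
                if len(heap) == K and heap[0][0] >= t:
                    break
            return sorted(heap, reverse=True)                 # (score, item), descending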
  • 87
    Publication Date: 2015-12-02
    Description: Some supervised tasks present a numerical output, but decisions have to be made in a discrete, binarised way, according to a particular cutoff. This binarised regression task is a very common situation that requires its own analysis, different from regression and classification (and from ordinal regression). We first investigate the application cases in terms of the information available about the distribution and range of the cutoffs, and distinguish six possible scenarios, some of which are more common than others. Next, we study two basic approaches: the retraining approach, which discretises the training set whenever the cutoff is available and learns a new classifier from it, and the reframing approach, which learns a regression model and sets the cutoff when it becomes available during deployment (both are sketched after this entry). In order to assess the binarised regression task, we introduce context plots featuring error against cutoff. Two special cases are of interest, the UCE and OCE curves: the area under the former is the mean absolute error, while the latter yields a new metric that lies between a ranking measure and a residual-based measure. A comprehensive evaluation of the retraining and reframing approaches is performed using a repository of binarised regression problems created for this purpose, concluding that neither method is clearly better than the other, except when the size of the training data is small.
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
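    A small sketch of the two approaches contrasted above, under assumed names and with scikit-learn models standing in for the paper's learners: `retrain` discretises the labels at a known cutoff and fits a classifier, while `reframe` fits one regression model and applies the cutoff only at deployment time.

        from sklearn.linear_model import LinearRegression, LogisticRegression

        def retrain(X, y, cutoff):
            """Discretise at the cutoff, then learn a classifier."""
            clf = LogisticRegression().fit(X, (y > cutoff).astype(int))
            return lambda X_new: clf.predict(X_new)

        def reframe(X, y):
            """Learn a regressor once; binarise at deployment time."""
            reg = LinearRegression().fit(X, y)
            return lambda X_new, cutoff: (reg.predict(X_new) > cutoff).astype(int)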
  • 88
    Publication Date: 2016-01-16
    Description: A community within a graph can be broadly defined as a set of vertices that exhibit high cohesiveness (a relatively high number of edges within the set) and low conductance (a relatively low number of edges leaving the set). Community detection is a fundamental graph processing analytic that can be applied to several application domains, including social networks. In this context, communities are often overlapping, as a person can be involved in more than one community (e.g., friends and family), and evolving, since the structure of the network changes. We address the problem of streaming overlapping community detection, where the goal is to maintain communities in the presence of streaming updates, so that they can be updated more efficiently. To this end, we introduce SONIC, a find-and-merge type of community detection algorithm that can efficiently handle streaming updates. SONIC first detects when graph updates yield significant community changes; upon detection, it updates the communities via an incremental merge procedure. The SONIC algorithm incorporates two additional techniques to speed up the incremental merge: min-hashing (sketched after this entry) and inverted indexes. Results show that SONIC can provide high-quality overlapping communities, while handling streaming updates several orders of magnitude faster than alternatives performing from-scratch computation.
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
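    A hedged sketch of the min-hashing ingredient mentioned above: a MinHash signature lets an algorithm of this kind estimate the Jaccard overlap of two vertex sets cheaply before attempting a merge. The hash construction and sizes are illustrative, not SONIC's implementation.

        import random

        def minhash_signature(vertices, n_hashes=64, seed=7):
            """One min-hash value per salted hash function."""
            rnd = random.Random(seed)
            salts = [rnd.getrandbits(32) for _ in range(n_hashes)]
            return [min(hash((salt, v)) for v in vertices) for salt in salts]

        def estimated_jaccard(sig_a, sig_b):
            """Fraction of agreeing slots estimates |A ∩ B| / |A ∪ B|."""
            return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

        a = minhash_signature({1, 2, 3, 4, 5})
        b = minhash_signature({3, 4, 5, 6})
        print(estimated_jaccard(a, b))   # around 3/6 = 0.5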
  • 89
    Publication Date: 2012-09-03
    Description: Many real-world networks, including social and information networks, are dynamic structures that evolve over time. Such dynamic networks are typically visualized using a sequence of static graph layouts. In addition to providing a visual representation of the network structure at each time step, the sequence should preserve the mental map between layouts of consecutive time steps to allow a human to interpret the temporal evolution of the network. In this paper, we propose a framework for dynamic network visualization in the on-line setting, where only present and past graph snapshots are available to create the present layout. The proposed framework creates regularized graph layouts by augmenting the cost function of a static graph layout algorithm with a grouping penalty, which discourages nodes from deviating too far from other nodes belonging to the same group, and a temporal penalty, which discourages large node movements between consecutive time steps (a sketch of the temporal term follows this entry). The penalties increase the stability of the layout sequence, thus preserving the mental map. We introduce two dynamic layout algorithms within the proposed framework, namely dynamic multidimensional scaling and dynamic graph Laplacian layout. We apply these algorithms to several data sets to illustrate the importance of both grouping and temporal regularization for producing interpretable visualizations of dynamic networks. DOI: 10.1007/s10618-012-0286-6. Authors: Kevin S. Xu (University of Michigan), Mark Kliger (Omek Interactive, Ltd., Beit Shemesh, Israel), Alfred O. Hero III (University of Michigan).
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
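    A hedged sketch of the temporal-regularization idea, assuming NumPy: augment the static stress cost, stress(X; D) = sum over pairs of (||x_i − x_j|| − D_ij)^2, with a penalty pulling each node toward its position in the previous snapshot, and take gradient steps per snapshot. The penalty weight lam_t and step size are illustrative; the paper's algorithms (and its grouping penalty) are more elaborate.

        import numpy as np

        def layout_step(X, X_prev, D, lam_t=1.0, lr=0.01):
            """One gradient step on stress(X; D) + lam_t * ||X - X_prev||^2."""
            grad = 2 * lam_t * (X - X_prev)          # temporal penalty term
            for i in range(len(X)):
                diff = X[i] - X                      # vectors to all other nodes
                dist = np.linalg.norm(diff, axis=1)
                dist[i] = 1.0                        # avoid division by zero
                coeff = (dist - D[i]) / dist
                coeff[i] = 0.0
                grad[i] += 2 * (coeff[:, None] * diff).sum(axis=0)
            return X - lr * grad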
  • 90
    Publication Date: 2012-09-03
    Description: The latent behavior of a physical system that can exhibit extreme events, such as hurricanes or heavy rainfall, is complex. Recently, a very promising means for studying complex systems has emerged through the concept of complex networks. Networks representing relationships between individual objects usually exhibit community dynamics. Conventional community detection methods mainly focus on either mining frequent subgraphs in a network or detecting stable communities in time-varying networks. In this paper, we formulate a novel problem, the detection of predictive and phase-biased communities in contrasting groups of networks, and propose an efficient and effective machine learning solution for finding such anomalous communities. We build different groups of networks corresponding to different phases of the system, such as high or low hurricane activity, discover phase-related system components as seeds to help bound the search space of community generation in each network, and use the proposed contrast-based technique to identify the changing communities across different groups. The detected anomalous communities are hypothesized (1) to play an important role in defining the target system's state(s) and (2) to improve the predictive skill of the system's states when used collectively in an ensemble of predictive models. When tested on two important extreme event problems, the identification of tropical cyclone-related and of African Sahel rainfall-related climate indices, our algorithm demonstrated superior performance in terms of various skill and robustness metrics, including an 8–16 % accuracy increase, as well as physical interpretability of the detected communities. The experimental results also show the efficiency of our algorithm on synthetic datasets. DOI: 10.1007/s10618-012-0289-3. Authors: Zhengzhang Chen, William Hendrix, Isaac K. Tetteh, Fredrick Semazzi, Nagiza F. Samatova (North Carolina State University), Hang Guan (Zhejiang University), Alok Choudhary (Northwestern University).
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 91
    Publication Date: 2012-09-03
    Description: Given a clustering algorithm, how can we adapt it to find multiple, nonredundant, high-quality clusterings? We focus on algorithms based on vector quantization and describe a framework for automatic ‘alternatization’ of such algorithms. Our framework works in both simultaneous and sequential learning formulations and can mine an arbitrary number of alternative clusterings. We demonstrate its applicability to various clustering algorithms (k-means, spectral clustering, constrained clustering, and co-clustering) and its effectiveness in mining a variety of datasets. DOI: 10.1007/s10618-012-0288-4. Authors: M. Shahriar Hossain (Virginia State University), Naren Ramakrishnan, Layne T. Watson (Virginia Polytechnic Institute and State University), Ian Davidson (University of California, Davis).
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 92
    Publication Date: 2012-08-27
    Description: To support analysis and modelling of large amounts of spatio-temporal data having the form of spatially referenced time series (TS) of numeric values, we combine interactive visual techniques with computational methods from machine learning and statistics. Clustering methods and interactive techniques are used to group TS by similarity. Statistical methods for TS modelling are then applied to representative TS derived from the groups of similar TS. The framework includes interactive visual interfaces to a library of modelling methods supporting the selection of a suitable method, adjustment of model parameters, and evaluation of the models obtained. The models can be externally stored, communicated, and used for prediction and in further computational analyses. From the visual analytics perspective, the framework suggests a way to externalize spatio-temporal patterns emerging in the mind of the analyst as a result of interactive visual analysis: the patterns are represented in the form of computer-processable and reusable models. From the statistical analysis perspective, the framework demonstrates how TS analysis and modelling can be supported by interactive visual interfaces, particularly in the case of numerous TS that are hard to analyse individually. From the application perspective, the framework suggests a way to analyse large numbers of spatial TS with the use of well-established statistical methods for TS analysis. DOI: 10.1007/s10618-012-0285-7. Authors: Natalia Andrienko, Gennady Andrienko (Fraunhofer Institute IAIS, Sankt Augustin, Germany).
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 93
    Publication Date: 2012-07-09
    Description: In this paper, a new visual and interactive user interface for OLAP is presented, and its strengths and weaknesses examined. A survey of 3D interfaces for OLAP is detailed, which shows that only one interface using Virtual Reality has been proposed. We then present our approach: a 3D representation of OLAP cubes in which many OLAP operators have been integrated and several measures can be visualized. A 3D stereoscopic screen can be used in conjunction with a 3D mouse. Finally, a user study is reported that compares standard dynamic cross-tables with our interface on different tasks. We conclude that 3D with stereoscopy is not as promising as expected, even with recent 3D devices. DOI: 10.1007/s10618-012-0279-5. Authors: Sébastien Lafon, Gilles Venturini (Computer Science Laboratory, University François-Rabelais of Tours), Fatma Bouali (IUT, University of Lille 2), Christiane Guinot (CE.R.I.E.S.).
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 94
    Publication Date: 2012-07-09
    Description: Joint sparsity offers powerful structural cues for feature selection, especially for variables that are expected to demonstrate a “grouped” behavior. Such behavior is commonly modeled via group-lasso, multitask lasso, and related methods where feature selection is effected via mixed-norms. Several mixed-norm based sparse models have received substantial attention, and for some cases efficient algorithms are also available. Surprisingly, several constrained sparse models seem to be lacking scalable algorithms. We address this deficiency by presenting batch and online (stochastic-gradient) optimization methods, both of which rely on efficient projections onto mixed-norm balls (the basic block operation is sketched after this entry). We illustrate our methods by applying them to the multitask lasso. We conclude by mentioning some open problems. DOI: 10.1007/s10618-012-0277-7. Author: Suvrit Sra (Max Planck Institute for Intelligent Systems, Tübingen, Germany).
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
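    A hedged sketch of the group-wise block operation underlying mixed-norm (l1/l2, group-lasso-type) models: each group of coefficients is shrunk toward zero as a block via the proximal operator prox_{lam·||·||_2}(v) = max(0, 1 − lam/||v||_2) · v. Projections onto mixed-norm balls, as in the paper, combine this kind of block operation with a search over the scaling; the code shows only the block step.

        import numpy as np

        def block_soft_threshold(v, lam):
            """Proximal operator of lam * ||v||_2 for one coefficient group."""
            norm = np.linalg.norm(v)
            if norm <= lam:
                return np.zeros_like(v)      # whole group is zeroed out
            return (1 - lam / norm) * v      # group shrinks as a block

        groups = [np.array([3.0, 4.0]), np.array([0.1, -0.2])]
        print([block_soft_threshold(g, 1.0) for g in groups])
        # first group shrinks (norm 5 -> 4), second group is zeroed out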
  • 95
    Publication Date: 2012-07-09
    Description: We study the extent to which social ties between people can be inferred in large social networks, in particular via active user interactions. In most online social networks, relationships lack meaningful labels (e.g., “colleague” and “intimate friend”) for various reasons. Understanding the formation of different types of social relationships can provide insights into the micro-level dynamics of the social network. In this work, we precisely define the problem of inferring social ties and propose a Partially-Labeled Pairwise Factor Graph Model (PLP-FGM) for learning to infer the type of social relationships. The model formalizes the problem of inferring social ties in a flexible semi-supervised framework. We test the model on three different genres of data sets and demonstrate its effectiveness. We further study how to leverage user interactions to help improve the inference accuracy. Two active learning algorithms are proposed to actively select relationships to query users for their labels. Experimental results show that with only a few user corrections, the accuracy of inferring social ties can be significantly improved. Finally, to scale the model to real, large networks, a distributed learning algorithm has been developed. DOI: 10.1007/s10618-012-0274-x. Authors: Honglei Zhuang, Jie Tang, Wenbin Tang (Department of Computer Science and Technology, Tsinghua University), Tiancheng Lou (Institute for Interdisciplinary Information Sciences, Tsinghua University), Alvin Chin, Xia Wang (Nokia Research Center, Beijing).
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 96
    Publication Date: 2012-07-09
    Description: Previous studies on network mining have focused primarily on learning a single task (such as classification or community detection) on a given network. This paper considers the problem of multi-task learning on heterogeneous network data. Specifically, we present a novel framework that enables one to perform classification on one network and community detection in another related network. Multi-task learning is accomplished by introducing a joint objective function that must be optimized to ensure the classes in one network are consistent with the link structure, nodal attributes, as well as the communities detected in another network. We provide both theoretical and empirical analysis of the framework. We also show that the framework can be extended to incorporate prior information about the correspondences between the clusters and classes in different networks. Experiments performed on both real-world and synthetic data sets demonstrate the effectiveness of the joint framework compared to applying classification and community detection algorithms on each network separately. DOI: 10.1007/s10618-012-0260-3. Authors: Prakash Mandayam Comar, Pang-Ning Tan, Anil K. Jain (Department of Computer Science & Engineering, Michigan State University).
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 97
    Publication Date: 2012-07-09
    Description: Influence is a complex and subtle force that governs social dynamics and user behaviors. Understanding how users influence each other can benefit various applications, e.g., viral marketing, recommendation, and information retrieval. While prior work has mainly focused on qualitative aspects, in this article we present our research on quantitatively learning influence between users in heterogeneous networks. We propose a generative graphical model which leverages both heterogeneous link information and the textual content associated with each user in the network to mine topic-level influence strength. Based on the learned direct influence, we further study influence propagation and aggregation mechanisms: conservative and non-conservative propagation to derive indirect influence. We apply the discovered influence to user behavior prediction in four different genres of social networks: Twitter, Digg, Renren, and Citation. Qualitatively, our approach can discover interesting influence patterns in these heterogeneous networks. Quantitatively, the learned influence strength greatly improves the accuracy of user behavior prediction. DOI: 10.1007/s10618-012-0252-3. Authors: Lu Liu (Capital Medical University), Jie Tang, Shiqiang Yang (Tsinghua University), Jiawei Han (University of Illinois at Urbana-Champaign).
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 98
    Publication Date: 2012-07-09
    Description: Data mining and statistical learning techniques are powerful analysis tools that have yet to be incorporated into the domain of urban studies and transportation research. In this work, we analyze an activity-based travel survey conducted in the Chicago metropolitan area over a demographically representative sample of its population. Detailed data on activities by time of day were collected from more than 30,000 individuals (and 10,552 households) who participated in a 1-day or 2-day survey implemented from January 2007 to February 2008. We examine this large-scale data set in order to explore three critical issues: (1) the inherent daily activity structure of individuals in a metropolitan area, (2) the variation of individual daily activities, how they grow and fade over time, and (3) clusters of individual behaviors and the revelation of their related socio-demographic information. We find that the population can be clustered into 8 and 7 representative groups according to their activities during weekdays and weekends, respectively. Our results enrich the traditional division into only three groups (workers, students and non-workers) and provide clusters based on activities at different times of day. The generated clusters, combined with socio-demographic information, provide a new perspective for urban and transportation planning as well as for emergency response and spreading dynamics, by addressing when, where, and how individuals interact with places in metropolitan areas. DOI: 10.1007/s10618-012-0264-z. Authors: Shan Jiang, Joseph Ferreira (Department of Urban Studies and Planning, Massachusetts Institute of Technology), Marta C. González (Department of Civil and Environmental Engineering and Engineering Systems Division, Massachusetts Institute of Technology).
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 99
    Publication Date: 2012-07-09
    Description: Data sources representing attribute information in combination with network information are widely available in today's applications. To realize the full potential for knowledge extraction, mining techniques like clustering should consider both information types simultaneously. Recent clustering approaches combine subspace clustering with dense subgraph mining to identify groups of objects that are similar in subsets of their attributes as well as densely connected within the network. While those approaches successfully circumvent the problem of full-space clustering, their limited cluster definitions are restricted to clusters of certain shapes. In this work, we introduce a density-based cluster definition that takes into account attribute similarity in subspaces as well as local graph density, enabling us to detect clusters of arbitrary shape and size. Furthermore, we avoid redundancy in the result by selecting only the most interesting non-redundant clusters. Based on this model, we introduce the clustering algorithm DB-CSC, which uses a fixed-point iteration method to efficiently determine the clustering solution, and we prove the correctness and complexity of this fixed-point iteration analytically. In thorough experiments we demonstrate the strength of DB-CSC in comparison to related approaches. DOI: 10.1007/s10618-012-0272-z. Authors: Stephan Günnemann, Brigitte Boden, Thomas Seidl (Data Management and Data Exploration Group, RWTH Aachen University).
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 100
    Publication Date: 2012-07-09
    Description: We introduce the dependence distance, a new notion of the intrinsic distance between points, derived as a pointwise extension of statistical dependence measures between variables. We then introduce a dimension reduction procedure for preserving this distance, which we call the dependence map. We explore its theoretical justification, its connection to other methods, and its empirical behavior on real data sets. DOI: 10.1007/s10618-012-0267-9. Authors: Kichun Lee (Hanyang University), Alexander Gray (Georgia Institute of Technology), Heeyoung Kim (Georgia Institute of Technology).
    Print ISSN: 1384-5810
    Electronic ISSN: 1573-756X
    Topics: Computer Science
    Published by Springer
    Location Call Number Expected Availability
    BibTip Others were also interested in ...