ALBERT

All Library Books, journals and Electronic Records Telegrafenberg

Your email was sent successfully. Check your inbox.

An error occurred while sending the email. Please try again.

Proceed reservation?

Export
Filter
  • Articles  (572)
  • 2010-2014  (572)
  • 1990-1994
  • 1985-1989
  • 1950-1954
  • IEEE Transactions on Knowledge and Data Engineering  (572)
  • 1274
  • Computer Science  (572)
  • Technology
  • 1
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-11-08
    Description: In a top- $k$ Geometric Intersection Query (top- $k$ GIQ) problem, a set of $n$ weighted, geometric objects in ${bb R}^d$ is to be pre-processed into a compact data structure so that for any query geometric object, $q$ , and integer $k>0$ , the $k$ largest-weight objects intersected by $q$ can be reported efficiently. While the top- $k$ problem has been studied extensively for non-geometric problems (e.g., recommender systems), the geometric version has received little attention. This paper gives a general technique to solve any top-
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 2
    Publication Date: 2014-11-08
    Description: Rule induction method based on rough set theory (RST) has received much attention recently since it may generate a minimal set of rules from the decision system for real-life applications by using of attribute reduction and approximations. The decision system may vary with time, e.g., the variation of objects, attributes and attribute values. The reduction and approximations of the decision system may alter on Attribute Values’ Coarsening and Refining (AVCR), a kind of variation of attribute values, which results in the alteration of decision rules simultaneously. This paper aims for dynamic maintenance of decision rules $w.r.t.$ AVCR. The definition of minimal discernibility attribute set is proposed firstly, which aims to improve the efficiency of attribute reduction in RST. Then, principles of updating decision rules in case of AVCR are discussed. Furthermore, the rough set-based methods for updating decision rules in the inconsistent decision system are proposed. The complexity analysis and extensive experiments on UCI data sets have verified the effectiveness and efficiency of the proposed methods.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 3
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-11-08
    Description: A major mining task for binary matrixes is the extraction of approximate top- (k) patterns that are able to concisely describe the input data. The top- (k) pattern discovery problem is commonly stated as an optimization one, where the goal is to minimize a given cost function, see the accuracy of the data description. In this work, we review several greedy algorithms, and discuss PaNDa + , an algorithmic framework able to optimize different cost functions generalized into a unifying formulation. We evaluated the goodness of the algorithm by measuring the quality of the extracted patterns. We adapted standard quality measures to assess the capability of the algorithm to discover both the items and transactions of the patterns embedded in the data. The evaluation was conducted on synthetic data, where patterns were artificially embedded, and on real-world text collection, where each document is labeled with a topic. Finally, in order to qualitatively evaluate the usefulness of the discovered patterns, we exploited PaNDa + to detect overlapping communities in a bipartite network. The results show that PaNDa + is able to discover high-quality patterns in both synthetic and real-world datasets.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 4
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-11-08
    Description: Mining network evolution has emerged as an intriguing research topic in many domains such as data mining, social networks, and machine learning. While a bulk of research has focused on mining evolutionary patterns of homogeneous networks (e.g., networks of friends), however, most real-world networks are heterogeneous, containing objects of different types, such as authors, papers, venues, and terms in a bibliographic network. Modeling co-evolution of multityped objects can capture richer information than that on single-typed objects alone. For example, studying co-evolution of authors, venues, and terms in a bibliographic network can tell better the evolution of research areas than just examining co-author network or term network alone. In this paper, we study mining co-evolution of multityped objects in a special type of heterogeneous networks, called star networks, and examine how the multityped objects influence each other in the network evolution. A hierarchical Dirichlet process mixture model-based evolution model is proposed, which detects the co-evolution of multityped objects in the form of multityped cluster evolution in dynamic star networks. An efficient inference algorithm is provided to learn the proposed model. Experiments on several real networks (DBLP, Twitter, and Delicious) validate the effectiveness of the model and the scalability of the algorithm.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 5
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-11-08
    Description: In this work, we define cost-free learning (CFL) formally in comparison with cost-sensitive learning (CSL). The main difference between them is that a CFL approach seeks optimal classification results without requiring any cost information, even in the class imbalance problem. In fact, several CFL approaches exist in the related studies, such as sampling and some criteria-based approaches. However, to our best knowledge, none of the existing CFL and CSL approaches are able to process the abstaining classifications properly when no information is given about errors and rejects. Based on information theory, we propose a novel CFL which seeks to maximize normalized mutual information of the targets and the decision outputs of classifiers. Using the strategy, we can handle binary/multi-class classifications with/without abstaining. Significant features are observed from the new strategy. While the degree of class imbalance is changing, the proposed strategy is able to balance the errors and rejects accordingly and automatically. Another advantage of the strategy is its ability of deriving optimal rejection thresholds for abstaining classifications and the “equivalent” costs in binary classifications. The connection between rejection thresholds and ROC curve is explored. Empirical investigation is made on several benchmark data sets in comparison with other existing approaches. The classification results demonstrate a promising perspective of the strategy in machine learning.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 6
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-11-08
    Description: Multivariate time series are common in many application domains, particularly in industrial processes with a large number of sensors installed for process monitoring and control. Often, such data encapsulate complex relations among individual series. This paper presents a new type of patterns in multivariate time series, referred to as temporal associations, to capture a wide range of local relations along and across individual series. A scalable algorithm is developed to discover frequent associations by incorporating (1) redundancy pruning of patterns in single time series and (2) two conditions to avoid over-counting the occurrences of associations, thus greatly reducing the space and runtime complexity of the discovery process. A statistical significance measure is also introduced for ranking and post-pruning discovered associations. To evaluate the proposed method, synthetic data sets and a real world data set taken from the time series mining repository as well as a large data set obtained from a delayed coking plant are used. The experiments demonstrated that the discovered associations capture the local relations in multiple time series and that the proposed method is scalable to large data sets.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 7
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-11-08
    Description: Short texts are popular on today’s web, especially with the emergence of social media. Inferring topics from large scale short texts becomes a critical but challenging task for many content analysis tasks. Conventional topic models such as latent Dirichlet allocation (LDA) and probabilistic latent semantic analysis (PLSA) learn topics from document-level word co-occurrences by modeling each document as a mixture of topics, whose inference suffers from the sparsity of word co-occurrence patterns in short texts. In this paper, we propose a novel way for short text topic modeling, referred as biterm topic model (BTM) . BTM learns topics by directly modeling the generation of word co-occurrence patterns (i.e., biterms) in the corpus, making the inference effective with the rich corpus-level information. To cope with large scale short text data, we further introduce two online algorithms for BTM for efficient topic learning. Experiments on real-word short text collections show that BTM can discover more prominent and coherent topics, and significantly outperform the state-of-the-art baselines. We also demonstrate the appealing performance of the two online BTM algorithms on both time efficiency and topic learning.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 8
    Publication Date: 2014-11-08
    Description: In the literature about association analysis, many interestingness measures have been proposed to assess the quality of obtained association rules in order to select a small set of the most interesting among them. In the particular case of hierarchically organized items and generalized association rules connecting them, a measure that dealt appropriately with the hierarchy would be advantageous. Here we present the further developments of a new class of such hierarchical interestingness measures and compare them with a large set of conventional measures and with three hierarchical pruning methods from the literature. The aim is to find interesting pairwise generalized association rules connecting the concepts of multiple ontologies. Interested in the broad empirical evaluation of interestingness measures, we compared the rules obtained by 37 methods on four real world data sets against predefined ground truth sets of associations. To this end, we adopted a framework of instance-based ontology matching and extended the set of performance measures by two novel measures: relation learning recall and precision which take into account hierarchical relationships.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 9
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-11-08
    Description: A highly comparative, feature-based approach to time series classification is introduced that uses an extensive database of algorithms to extract thousands of interpretable features from time series. These features are derived from across the scientific time-series analysis literature, and include summaries of time series in terms of their correlation structure, distribution, entropy, stationarity, scaling properties, and fits to a range of time-series models. After computing thousands of features for each time series in a training set, those that are most informative of the class structure are selected using greedy forward feature selection with a linear classifier. The resulting feature-based classifiers automatically learn the differences between classes using a reduced number of time-series properties, and circumvent the need to calculate distances between time series. Representing time series in this way results in orders of magnitude of dimensionality reduction, allowing the method to perform well on very large data sets containing long time series or time series of different lengths. For many of the data sets studied, classification performance exceeded that of conventional instance-based classifiers, including one nearest neighbor classifiers using euclidean distances and dynamic time warping and, most importantly, the features selected provide an understanding of the properties of the data set, insight that can guide further scientific investigation.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 10
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-11-08
    Description: The explosive usage of social media produces massive amount of unlabeled and high-dimensional data. Feature selection has been proven to be effective in dealing with high-dimensional data for efficient learning and data mining. Unsupervised feature selection remains a challenging task due to the absence of label information based on which feature relevance is often assessed. The unique characteristics of social media data further complicate the already challenging problem of unsupervised feature selection, e.g., social media data is inherently linked, which makes invalid the independent and identically distributed assumption, bringing about new challenges to unsupervised feature selection algorithms. In this paper, we investigate a novel problem of feature selection for social media data in an unsupervised scenario. In particular, we analyze the differences between social media data and traditional attribute-value data, investigate how the relations extracted from linked data can be exploited to help select relevant features, and propose a novel unsupervised feature selection framework, LUFS, for linked social media data. We systematically design and conduct systemic experiments to evaluate the proposed framework on data sets from real-world social media websites. The empirical study demonstrates the effectiveness and potential of our proposed framework.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 11
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-11-08
    Description: The discovery of process models from event logs has emerged as one of the crucial problems for enabling the continuous support in the life-cycle of an information system. However, in a decade of process discovery research, the algorithms and tools that have appeared are known to have strong limitations in several dimensions. The size of the logs and the formal properties of the model discovered are the two main challenges nowadays. In this paper we propose the use of numerical abstract domains for tackling these two problems, for the particular case of the discovery of Petri nets. First, numerical abstract domains enable the discovery of general process models, requiring no knowledge (e.g., the bound of the Petri net to derive) for the discovery algorithm. Second, by using divide and conquer techniques we are able to control the size of the process discovery problems. The methods proposed in this paper have been implemented in a prototype tool and experiments are reported illustrating the significance of this fresh view of the process discovery problem.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 12
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-11-08
    Description: Given a real world graph, how should we lay-out its edges? How can we compress it? These questions are closely related, and the typical approach so far is to find clique-like communities, like the ‘cavemen graph’, and compress them. We show that the block-diagonal mental image of the ‘cavemen graph’ is the wrong paradigm, in full agreement with earlier results that real world graphs have no good cuts. Instead, we propose to envision graphs as a collection of hubs connecting spokes, with super-hubs connecting the hubs, and so on, recursively. Based on the idea, we propose the SlashBurn method to recursively split a graph into hubs and spokes connected only by the hubs. We also propose techniques to select the hubs and give an ordering to the spokes, in addition to the basic SlashBurn. We give theoretical analysis of the proposed hub selection methods. Our view point has several advantages: (a) it avoids the ‘no good cuts’ problem, (b) it gives better compression, and (c) it leads to faster execution times for matrix-vector operations, which are the back-bone of most graph processing tools. Through experiments, we show that SlashBurn consistently outperforms other methods for all data sets, resulting in better compression and faster running time. Moreover, we show that SlashBurn with the appropriate spokes ordering can further improve compression while hardly sacrificing the running time.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 13
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-11-08
    Description: In this paper, we introduce “task trail” to understand user search behaviors. We define a task to be an atomic user information need, whereas a task trail represents all user activities within that particular task, such as query reformulations, URL clicks. Previously, web search logs have been studied mainly at session or query level where users may submit several queries within one task and handle several tasks within one session. Although previous studies have addressed the problem of task identification, little is known about the advantage of using task over session or query for search applications. In this paper, we conduct extensive analyses and comparisons to evaluate the effectiveness of task trails in several search applications: determining user satisfaction, predicting user search interests, and suggesting related queries. Experiments on large scale data sets of a commercial search engine show that: (1) Task trail performs better than session and query trails in determining user satisfaction; (2) Task trail increases webpage utilities of end users comparing to session and query trails; (3) Task trails are comparable to query trails but more sensitive than session trails in measuring different ranking functions; (4) Query terms from the same task are more topically consistent to each other than query terms from different tasks; (5) Query suggestion based on task trail is a good complement of query suggestions based on session trail and click-through bipartite. The findings in this paper verify the need of extracting task trails from web search logs and enhance applications in search and recommendation systems.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 14
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-11-08
    Description: This paper studies the problem of mining named entity translations by aligning comparable corpora. Current state-of-the-art approaches mine a translation pair by aligning an entity graph in one language to another based on node similarity or propagated similarity of related entities. However, they, building on the assumption of “symmetry”, quickly deteriorate on “weakly” comparable corpora with some asymmetry. In this paper, we pursue two directions for overcoming relation and entity asymmetry respectively. The first approach starts from weakly comparable corpora (for high recall) then ensures precision by selective propagation only to entities of symmetric relations. The second approach starts from parallel corpora (for high precision) then enhances recall by extending the translation matrix based on node similarity and contextual similarity. Our experimental results on English-Chinese corpora show that both approaches are effective and complementary. Our combined approach outperforms the best-performing baseline in terms of F1-score by up to 0.28.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 15
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-11-08
    Description: The knowledge remembered by the human body and reflected by the dexterity of body motion is called embodied knowledge. In this paper, we propose a new method using singular value decomposition for extracting embodied knowledge from the time-series data of the motion. We compose a matrix from the time-series data and use the left singular vectors of the matrix as the patterns of the motion and the singular values as a scalar, by which each corresponding left singular vector affects the matrix. Two experiments were conducted to validate the method. One is a gesture recognition experiment in which we categorize gesture motions by two kinds of models with indexes of similarity and estimation that use left singular vectors. The proposed method obtained a higher correct categorization ratio than principal component analysis (PCA) and correlation efficiency (CE). The other is an ambulation evaluation experiment in which we distinguished the levels of walking disability. The first singular values derived from the walking acceleration were suggested to be a reliable criterion to evaluate walking disability. Finally we discuss the characteristic and significance of the embodied knowledge extraction using the singular value decomposition proposed in this paper.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 16
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-11-08
    Description: Edit distance is widely used for measuring the similarity between two strings. As a primitive operation, edit distance based string similarity search is to find strings in a collection that are similar to a given query string using edit distance. Existing approaches for answering such string similarity queries follow the filter-and-verify framework by using various indexes. Typically, most approaches assume that indexes and data sets are maintained in main memory. To overcome this limitation, in this paper, we propose B $^+$ -tree based approaches to answer edit distance based string similarity queries, and hence, our approaches can be easily integrated into existing RDBMSs. In general, we answer string similarity search using pruning techniques employed in the metric space in that edit distance is a metric. First, we split the string collection into partitions according to a set of reference strings. Then, we index strings in all partitions using a single B $^+$ -tree based on the distances of these strings to their corresponding reference strings. Finally, we propose two approaches to efficiently answer range and KNN queries, respectively, based on the B $^+$ -tree. We prove that the optimal partitioning of the data set is an NP-hard problem, and therefore propose a heuristic approach for selecting the reference strings greedily and present an optimal partition assignment strategy to minimize the expected number of strings that need to be verified during the query evaluation. Through extensive experiments over a variety of real data sets, we demonstrate that our B $^+$ -tree based approaches provide superior performance over state-of-the-art techniques on both range and KNN queries in most cases.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 17
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-11-08
    Description: A top- k query retrieves the best (k) tuples by assigning scores for each tuple in a target relation with respect to a user-specific scoring function. This paper studies the problem of constructing an indexing structure for supporting top- k queries over varying scoring functions and retrieval sizes. The existing research efforts can be categorized into three approaches: list- , layer- , and view-based approaches. In this paper, we mainly focus on the layer-based approach that pre-materializes tuples into consecutive multiple layers. We first propose a dual-resolution layer that consists of coarse-level and fine-level layers. Specifically, we build coarse-level layers using skylines , and divide each coarse-level layer into fine-level sublayers using convex skylines . To make our proposed dual-resolution layer scalable , we then address the following optimization directions: 1) index construction; 2) disk-based storage scheme; 3) the design of the virtual layer; and 4) index maintenance for tuple updates. Our evaluation results show that our proposed method is more scalable than the state-of-the-art methods.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 18
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-12-06
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 19
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-12-06
    Description: The past decade has seen a dramatic increase in the amount of data captured and made available to scientists for research. This increase amplifies the difficulty scientists face in finding the data most relevant to their information needs. In prior work, we hypothesized that Information Retrieval-style ranked search can be applied to data sets to help a scientist discover the most relevant data amongst the thousands of data sets in many formats, much like text-based ranked search helps users make sense of the vast number of Internet documents. To test this hypothesis, we explored the use of ranked search for scientific data using an existing multi-terabyte observational archive as our test-bed. In this paper, we investigate whether the concept of varying relevance, and therefore ranked search, applies to numeric data—that is, are data sets are enough like documents for Information Retrieval techniques and evaluation measures to apply? We present a user study that demonstrates that data set similarity resonates with users as a basis for relevance and, therefore, for ranked search. We evaluate a prototype implementation of ranked search over data sets with a second user study and demonstrate that ranked search improves a scientist's ability to find needed data.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 20
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-12-06
    Description: In recent years, probabilistic data management has received a lot of attention due to several applications that deal with uncertain data: RFID systems, sensor networks, data cleaning, scientific and biomedical data management, and approximate schema mappings. Query evaluation is a challenging problem in probabilistic databases, proved to be #P-hard. A general method for query evaluation is based on the lineage of the query and reduces the query evaluation problem to computing the probability of a propositional formula. The main approaches proposed in the literature to approximate probabilistic queries confidence computation are based on Monte Carlo simulation, or formula compilation into decision diagrams (e.g., d-trees). The former executes a polynomial, but with too many, iterations, while the latter is polynomial for easy queries, but may be exponential in the worst case. We designed a new optimized Monte Carlo algorithm that drastically reduces the number of iterations and proposed an efficient parallel version that we implemented on GPU. Thanks to the elevated degree of parallelism provided by the GPU, combined with the linear speedup of our algorithm, we managed to reduce significantly the long running time required by a sequential Monte Carlo algorithm. Experimental results show that our algorithm is so efficient as to be comparable with the formula compilation approach, but with the significant advantage of avoiding exponential behavior.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 21
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-12-06
    Description: Graph-based ranking models have been widely applied in information retrieval area. In this paper, we focus on a well known graph-based model - the Ranking on Data Manifold model, or Manifold Ranking (MR). Particularly, it has been successfully applied to content-based image retrieval, because of its outstanding ability to discover underlying geometrical structure of the given image database. However, manifold ranking is computationally very expensive, which significantly limits its applicability to large databases especially for the cases that the queries are out of the database (new samples). We propose a novel scalable graph-based ranking model called Efficient Manifold Ranking (EMR), trying to address the shortcomings of MR from two main perspectives: scalable graph construction and efficient ranking computation. Specifically, we build an anchor graph on the database instead of a traditional $k$ -nearest neighbor graph, and design a new form of adjacency matrix utilized to speed up the ranking. An approximate method is adopted for efficient out-of-sample retrieval. Experimental results on some large scale image databases demonstrate that EMR is a promising method for real world retrieval applications.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 22
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-12-06
    Description: Product quantization-based approaches are effective to encode high-dimensional data points for approximate nearest neighbor search. The space is decomposed into a Cartesian product of low-dimensional subspaces, each of which generates a sub codebook. Data points are encoded as compact binary codes using these sub codebooks, and the distance between two data points can be approximated efficiently from their codes by the precomputed lookup tables. Traditionally, to encode a subvector of a data point in a subspace, only one sub codeword in the corresponding sub codebook is selected, which may impose strict restrictions on the search accuracy. In this paper, we propose a novel approach, named optimized cartesian K-means (ock-means), to better encode the data points for more accurate approximate nearest neighbor search. In ock-means, multiple sub codewords are used to encode the subvector of a data point in a subspace. Each sub codeword stems from different sub codebooks in each subspace, which are optimally generated with regards to the minimization of the distortion errors. The high-dimensional data point is then encoded as the concatenation of the indices of multiple sub codewords from all the subspaces. This can provide more flexibility and lower distortion errors than traditional methods. Experimental results on the standard real-life data sets demonstrate the superiority over state-of-the-art approaches for approximate nearest neighbor search.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 23
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-12-06
    Description: This paper considers the problem of determinizing probabilistic data to enable such data to be stored in legacy systems that accept only deterministic input. Probabilistic data may be generated by automated data analysis/enrichment techniques such as entity resolution, information extraction, and speech processing. The legacy system may correspond to pre-existing web applications such as Flickr, Picasa, etc. The goal is to generate a deterministic representation of probabilistic data that optimizes the quality of the end-application built on deterministic data. We explore such a determinization problem in the context of two different data processing tasks—triggers and selection queries. We show that approaches such as thresholding or top-1 selection traditionally used for determinization lead to suboptimal performance for such applications. Instead, we develop a query-aware strategy and show its advantages over existing solutions through a comprehensive empirical evaluation over real and synthetic datasets.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 24
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-12-06
    Description: Identifying which text corpus leads in the context of a topic presents a great challenge of considerable interest to researchers. Recent research into lead-lag analysis has mainly focused on estimating the overall leads and lags between two corpora. However, real-world applications have a dire need to understand lead-lag patterns both globally and locally. In this paper, we introduce TextPioneer , an interactive visual analytics tool for investigating lead-lag across corpora from the global level to the local level. In particular, we extend an existing lead-lag analysis approach to derive two-level results. To convey multiple perspectives of the results, we have designed two visualizations, a novel hybrid tree visualization that couples a radial space-filling tree with a node-link diagram and a twisted-ladder-like visualization. We have applied our method to several corpora and the evaluation shows promise, especially in support of text comparison at different levels of detail.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 25
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-12-06
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 26
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-02-08
    Description: The extended space forest is a new method for decision tree construction in which training is done with input vectors including all the original features and their random combinations. The combinations are generated with a difference operator applied to random pairs of original features. The experimental results show that extended space versions of ensemble algorithms have better performance than the original ensemble algorithms. To investigate the success dynamics of the extended space forest, the individual accuracy and diversity creation powers of ensemble algorithms are compared. The Extended Space Forest creates more diversity when it uses all the input features than Bagging and Rotation Forest. It also results in more individual accuracy when it uses random selection of the features than Random Subspace and Random Forest methods. It needs more training time because of using more features than the original algorithms. But its testing time is lower than the others because it generates less complex base learners.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 27
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-02-08
    Description: The visualization of information contained in reports is an important aspect of human-computer interaction, for both the accuracy and the complexity of relationships between data must be preserved. A greater attention has been paid to individual report visualization through different types of standard graphs (Histograms, Pies, etc.). However, this kind of representation provides separate information items and gives no support to visualize their relationships which are extremely important for most decision processes. This paper presents a design methodology exploiting the visual language CoDe based on a logic paradigm. CoDe allows to organize the visualization through the CoDe model which graphically represents relationships between information items and can be considered a conceptual map of the view. The proposed design methodology is composed of four phases: the CoDe Modeling and OLAP Operation pattern definition phases define the CoDe model and underlying metadata information, the OLAP Operation phase physically extracts data from a data warehouse and the Report Visualization phase generates the final visualization. Moreover, a case study on real data is provided.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 28
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-02-08
    Description: This paper studies the problem of finding objects with durable quality over time in historical time series databases. For example, a sociologist may be interested in the top 10 web search terms during the period of some historical events; the police may seek for vehicles that move close to a suspect 70 percent of the time during a certain time period and so on. Durable top-$(k)$ (DTop-$(k)$) and nearest neighbor ($({rm D}k{rm NN})$) queries can be viewed as natural extensions of the standard snapshot top-$(k)$ and NN queries to timestamped sequences of values or locations. Although their snapshot counterparts have been studied extensively, to our knowledge, there is little prior work that addresses this new class of durable queries. Existing methods for DTop-$(k)$ processing either apply trivial solutions, or rely on domain-specific properties. Motivated by this, we propose efficient and scalable algorithms for the DTop-$(k)$ and $({rm D}k{rm NN})$ queries, based on novel indexing and query evaluation techniques. Our experiments show that the proposed algorithms outperform previous and baseline solutions by a wide margin.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 29
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-02-08
    Description: As the uncertainty is inherent in a wide spectrum of applications such as radio frequency identification (RFID) networks and location-based services (LBS), it is highly demanded to address the uncertainty of the objects. In this paper, we propose a novel indexing structure, named $(U)$-Quadtree, to organize the uncertain objects in the multidimensional space such that the queries can be processed efficiently by taking advantage of $(U)$-Quadtree. Particularly, we focus on the range search on multidimensional uncertain objects since it is a fundamental query in a spatial database. We propose a cost model which carefully considers various factors that may impact the performance. Then, an effective and efficient index construction algorithm is proposed to build the optimal $(U)$-Quadtree regarding the cost model. We show that $(U)$-Quadtree can also efficiently support other types of queries such as uncertain range query and nearest neighbor query. Comprehensive experiments demonstrate that our techniques outperform the existing works on multidimensional uncertain objects.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 30
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-02-08
    Description: The vast majority of existing approaches to opinion feature extraction rely on mining patterns only from a single review corpus, ignoring the nontrivial disparities in word distributional characteristics of opinion features across different corpora. In this paper, we propose a novel method to identify opinion features from online reviews by exploiting the difference in opinion feature statistics across two corpora, one domain-specific corpus (i.e., the given review corpus) and one domain-independent corpus (i.e., the contrasting corpus). We capture this disparity via a measure called domain relevance (DR), which characterizes the relevance of a term to a text collection. We first extract a list of candidate opinion features from the domain review corpus by defining a set of syntactic dependence rules. For each extracted candidate feature, we then estimate its intrinsic-domain relevance (IDR) and extrinsic-domain relevance (EDR) scores on the domain-dependent and domain-independent corpora, respectively. Candidate features that are less generic (EDR score less than a threshold) and more domain-specific (IDR score greater than another threshold) are then confirmed as opinion features. We call this interval thresholding approach the intrinsic and extrinsic domain relevance (IEDR) criterion. Experimental results on two real-world review domains show the proposed IEDR approach to outperform several other well-established methods in identifying opinion features.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 31
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-02-08
    Description: Social networks model the social activities between individuals, which change as time goes by. In light of useful information from such dynamic networks, there is a continuous demand for privacy-preserving data sharing with analyzers, collaborators or customers. In this paper, we address the privacy risks of identity disclosures in sequential releases of a dynamic network. To prevent privacy breaches, we proposed novel $(k^w)$-structural diversity anonymity, where $(k)$ is an appreciated privacy level and $(w)$ is a time period that an adversary can monitor a victim to collect the attack knowledge. We also present a heuristic algorithm for generating releases satisfying $(k^w)$-structural diversity anonymity so that the adversary cannot utilize his knowledge to reidentify the victim and take advantages. The evaluations on both real and synthetic data sets show that the proposed algorithm can retain much of the characteristics of the networks while confirming the privacy protection.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 32
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-02-08
    Description: If knowledge such as classification rules are extracted from sample data in a distributed way, it may be necessary to combine or fuse these rules. In a conventional approach this would typically be done either by combining the classifiers' outputs (e.g., in form of a classifier ensemble) or by combining the sets of classification rules (e.g., by weighting them individually). In this paper, we introduce a new way of fusing classifiers at the level of parameters of classification rules. This technique is based on the use of probabilistic generative classifiers using multinomial distributions for categorical input dimensions and multivariate normal distributions for the continuous ones. That means, we have distributions such as Dirichlet or normal-Wishart distributions over parameters of the classifier. We refer to these distributions as hyperdistributions or second-order distributions. We show that fusing two (or more) classifiers can be done by multiplying the hyperdistributions of the parameters and derive simple formulas for that task. Properties of this new approach are demonstrated with a few experiments. The main advantage of this fusion approach is that the hyperdistributions are retained throughout the fusion process. Thus, the fused components may, for example, be used in subsequent training steps (online training).
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 33
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-02-08
    Description: Advanced microarray technologies have enabled to simultaneously monitor the expression levels of all genes. An important problem in microarray data analysis is to discover phenotype structures. The goal is to 1) find groups of samples corresponding to different phenotypes (such as disease or normal), and 2) for each group of samples, find the representative expression pattern or signature that distinguishes this group from others. Some methods have been proposed for this issue, however, a common drawback is that the identified signatures often include a large number of genes but with low discriminative power. In this paper, we propose a $(g^ast)$-sequence model to address this limitation, where the ordered expression values among genes are profitably utilized. Compared with the existing methods, the proposed sequence model is more robust to noise and allows to discover the signatures with more discriminative power using fewer genes. This is important for the subsequent analysis by the biologists. We prove that the problem of phenotype structure discovery is NP-complete. An efficient algorithm, FINDER, is developed, which includes three steps: 1) trivial $(g^ast)$-sequences identifying, 2) phenotype structure discovery, and 3) refinement. Effective pruning strategies are developed to further improve the efficiency. We evaluate the performance of FINDER and the existing methods using both synthetic and real gene expression data sets. Extensive experimental results show that FINDER dramatically improves the accuracy of the phenotype structures discovered (in terms of both statistical and biological significance) and detects signatures with high discriminative power. Moreover, it is orders of magnitude faster than other alternatives.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 34
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-02-08
    Description: One-to-many data linkage is an essential task in many domains, yet only a handful of prior publications have addressed this issue. Furthermore, while traditionally data linkage is performed among entities of the same type, it is extremely necessary to develop linkage techniques that link between matching entities of different types as well. In this paper, we propose a new one-to-many data linkage method that links between entities of different natures. The proposed method is based on a one-class clustering tree (OCCT) that characterizes the entities that should be linked together. The tree is built such that it is easy to understand and transform into association rules, i.e., the inner nodes consist only of features describing the first set of entities, while the leaves of the tree represent features of their matching entities from the second data set. We propose four splitting criteria and two different pruning methods which can be used for inducing the OCCT. The method was evaluated using data sets from three different domains. The results affirm the effectiveness of the proposed method and show that the OCCT yields better performance in terms of precision and recall (in most cases it is statistically significant) when compared to a C4.5 decision tree-based linkage method.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 35
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-02-08
    Description: This paper designs an efficient image hashing with a ring partition and a nonnegative matrix factorization (NMF), which has both the rotation robustness and good discriminative capability. The key contribution is a novel construction of rotation-invariant secondary image, which is used for the first time in image hashing and helps to make image hash resistant to rotation. In addition, NMF coefficients are approximately linearly changed by content-preserving manipulations, so as to measure hash similarity with correlation coefficient. We conduct experiments for illustrating the efficiency with 346 images. Our experiments show that the proposed hashing is robust against content-preserving operations, such as image rotation, JPEG compression, watermark embedding, Gaussian low-pass filtering, gamma correction, brightness adjustment, contrast adjustment, and image scaling. Receiver operating characteristics (ROC) curve comparisons are also conducted with the state-of-the-art algorithms, and demonstrate that the proposed hashing is much better than all these algorithms in classification performances with respect to robustness and discrimination.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 36
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-02-08
    Description: Similarity query is a fundamental problem in database, data mining and information retrieval research. Recently, querying incomplete data has attracted extensive attention as it poses new challenges to traditional querying techniques. The existing work on querying incomplete data addresses the problem where the data values on certain dimensions are unknown. However, in many real-life applications, such as data collected by a sensor network in a noisy environment, not only the data values but also the dimension information may be missing. In this work, we propose to investigate the problem of similarity search on dimension incomplete data. A probabilistic framework is developed to model this problem so that the users can find objects in the database that are similar to the query with probability guarantee. Missing dimension information poses great computational challenge, since all possible combinations of missing dimensions need to be examined when evaluating the similarity between the query and the data objects. We develop the lower and upper bounds of the probability that a data object is similar to the query. These bounds enable efficient filtering of irrelevant data objects without explicitly examining all missing dimension combinations. A probability triangle inequality is also employed to further prune the search space and speed up the query process. The proposed probabilistic framework and techniques can be applied to both whole and subsequence queries. Extensive experimental results on real-life data sets demonstrate the effectiveness and efficiency of our approach.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 37
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-03-25
    Description: Knowledge discovery from scientific articles has received increasing attention recently since huge repositories are made available by the development of the Internet and digital databases. In a corpus of scientific articles such as a digital library, documents are connected by citations and one document plays two different roles in the corpus: document itself and a citation of other documents. In the existing topic models, little effort is made to differentiate these two roles. We believe that the topic distributions of these two roles are different and related in a certain way. In this paper, we propose a Bernoulli process topic (BPT) model which considers the corpus at two levels: document level and citation level. In the BPT model, each document has two different representations in the latent topic space associated with its roles. Moreover, the multi-level hierarchical structure of citation network is captured by a generative process involving a Bernoulli process. The distribution parameters of the BPT model are estimated by a variational approximation approach. An efficient computation algorithm is proposed to overcome the difficulty of matrix inverse operation. In addition to conducting the experimental evaluations on the document modeling and document clustering tasks, we also apply the BPT model to well known corpora to discover the latent topics, recommend important citations, detect the trends of various research areas in computer science between 1991 and 1998, and to investigate the interactions among the research areas. The comparisons against state-of-the-art methods demonstrate a very promising performance. The implementations and the data sets are available online .
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 38
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-03-25
    Description: Authority flow techniques like PageRank and ObjectRank can provide personalized ranking of typed entity-relationship graphs. There are two main ways to personalize authority flow ranking: Node-based personalization, where authority originates from a set of user-specific nodes; edge-based personalization, where the importance of different edge types is user-specific. We propose the first approach to achieve efficient edge-based personalization using a combination of precomputation and runtime algorithms. In particular, we apply our method to ObjectRank, where a personalized weight assignment vector (WAV) assigns different weights to each edge type or relationship type. Our approach includes a repository of rankings for various WAVs. We consider the following two classes of approximation: (a) SchemaApprox is formulated as a distance minimization problem at the schema level; (b) DataApprox is a distance minimization problem at the data graph level. SchemaApprox is not robust since it does not distinguish between important and trivial edge types based on the edge distribution in the data graph. In contrast, DataApprox has a provable error bound. Both SchemaApprox and DataApprox are expensive so we develop efficient heuristic implementations, ScaleRank and PickOne respectively. Extensive experiments on the DBLP data graph show that ScaleRank provides a fast and accurate personalized authority flow ranking.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 39
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-03-25
    Description: Although several distance or similarity functions for trees have been introduced, their performance is not always satisfactory in many applications, ranging from document clustering to natural language processing. This research proposes a new similarity function for trees, namely Extended Subtree (EST), where a new subtree mapping is proposed. EST generalizes the edit base distances by providing new rules for subtree mapping. Further, the new approach seeks to resolve the problems and limitations of previous approaches. Extensive evaluation frameworks are developed to evaluate the performance of the new approach against previous proposals. Clustering and classification case studies utilizing three real-world and one synthetic labeled data sets are performed to provide an unbiased evaluation where different distance functions are investigated. The experimental results demonstrate the superior performance of the proposed distance function. In addition, an empirical runtime analysis demonstrates that the new approach is one of the best tree distance functions in terms of runtime efficiency.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 40
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-03-25
    Description: Conventional spatial queries, such as range search and nearest neighbor retrieval, involve only conditions on objects’ geometric properties. Today, many modern applications call for novel forms of queries that aim to find objects satisfying both a spatial predicate, and a predicate on their associated texts. For example, instead of considering all the restaurants, a nearest neighbor query would instead ask for the restaurant that is the closest among those whose menus contain “steak, spaghetti, brandy” all at the same time. Currently, the best solution to such queries is based on the IR $^2$ -tree, which, as shown in this paper, has a few deficiencies that seriously impact its efficiency. Motivated by this, we develop a new access method called the spatial inverted index that extends the conventional inverted index to cope with multidimensional data, and comes with algorithms that can answer nearest neighbor queries with keywords in real time. As verified by experiments, the proposed techniques outperform the IR $^2$ -tree in query response time significantly, often by a factor of orders of magnitude.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 41
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-03-25
    Description: Local thresholding algorithms were first presented more than a decade ago and have since been applied to a variety of data mining tasks in peer-to-peer systems, wireless sensor networks, and in grid systems. One critical assumption made by those algorithms has always been cycle-free routing. The existence of even one cycle may lead all peers to the wrong outcome. Outside the lab, unfortunately, cycle freedom is not easy to achieve. This work is the first to lift the requirement of cycle freedom by presenting a local thresholding algorithm suitable for general network graphs. The algorithm relies on a new repositioning of the problem in weighted vector arithmetics, on a new stopping rule, whose proof does not require that the network be cycle free, and on new methods for balance correction when the stopping rule fails. The new stopping and update rules permit calculation of the very same functions that were calculable using previous algorithms, which do assume cycle freedom. The algorithm is implemented on a standard peer-to-peer simulator and is validated for networks of up to 80,000 peers, organized in three different topologies representative of major current distributed systems: the Internet, structured peer-to-peer systems, and wireless sensor networks.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 42
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-03-25
    Description: Frequent weighted itemsets represent correlations frequently holding in data in which items may weight differently. However, in some contexts, e.g., when the need is to minimize a certain cost function, discovering rare data correlations is more interesting than mining frequent ones. This paper tackles the issue of discovering rare and weighted itemsets, i.e., the infrequent weighted itemset (IWI) mining problem. Two novel quality measures are proposed to drive the IWI mining process. Furthermore, two algorithms that perform IWI and Minimal IWI mining efficiently, driven by the proposed measures, are presented. Experimental results show efficiency and effectiveness of the proposed approach.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 43
    Publication Date: 2014-03-25
    Description: Traditional active learning methods require the labeler to provide a class label for each queried instance. The labelers are normally highly skilled domain experts to ensure the correctness of the provided labels, which in turn results in expensive labeling cost. To reduce labeling cost, an alternative solution is to allow nonexpert labelers to carry out the labeling task without explicitly telling the class label of each queried instance. In this paper, we propose a new active learning paradigm, in which a nonexpert labeler is only asked “whether a pair of instances belong to the same class”, namely, a pairwise label homogeneity. Under such circumstances, our active learning goal is twofold: (1) decide which pair of instances should be selected for query, and (2) how to make use of the pairwise homogeneity information to improve the active learner. To achieve the goal, we propose a “Pairwise Query on Max-flow Paths” strategy to query pairwise label homogeneity from a nonexpert labeler, whose query results are further used to dynamically update a Min-cut model (to differentiate instances in different classes). In addition, a “Confidence-based Data Selection” measure is used to evaluate data utility based on the Min-cut model’s prediction results. The selected instances, with inferred class labels, are included into the labeled set to form a closed-loop active learning process. Experimental results and comparisons with state-of-the-art methods demonstrate that our new active learning paradigm can result in good performance with nonexpert labelers.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 44
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-03-25
    Description: In this paper we present a framework for automatic exploitation of news in stock trading strategies. Events are extracted from news messages presented in free text without annotations. We test the introduced framework by deriving trading strategies based on technical indicators and impacts of the extracted events. The strategies take the form of rules that combine technical trading indicators with a news variable, and are revealed through the use of genetic programming. We find that the news variable is often included in the optimal trading rules, indicating the added value of news for predictive purposes and validating our proposed framework for automatically incorporating news in stock trading strategies.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 45
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-03-25
    Description: We formulate and investigate the novel problem of finding the skyline $k$ -tuple groups from an $n$ -tuple data set—i.e., groups of $k$ tuples which are not dominated by any other group of equal size, based on aggregate-based group dominance relationship. The major technical challenge is to identify effective anti-monotonic properties for pruning the search space of skyline groups. To this end, we first show that the anti-monotonic property in the well-known Apriori algorithm does not hold for skyline group pruning. Then, we identify two anti-monotonic properties with varying degrees of applicability: order-specific property which applies to SUM, MIN, and MAX as well as weak candidate-generation property which applies to MIN and MAX only. Experimental results on both real and synthetic data sets verify that the proposed algorithms achieve orders of magnitude performance gain over the baseline method.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 46
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-03-25
    Description: Access control mechanisms protect sensitive information from unauthorized users. However, when sensitive information is shared and a Privacy Protection Mechanism (PPM) is not in place, an authorized user can still compromise the privacy of a person leading to identity disclosure. A PPM can use suppression and generalization of relational data to anonymize and satisfy privacy requirements, e.g., $k$ -anonymity and $l$ -diversity, against identity and attribute disclosure. However, privacy is achieved at the cost of precision of authorized information. In this paper, we propose an accuracy-constrained privacy-preserving access control framework. The access control policies define selection predicates available to roles while the privacy requirement is to satisfy the $k$ -anonymity or $l$ -diversity. An additional constraint that needs to be satisfied by the PPM is the imprecision bound for each selection predicate. The techniques for workload-aware anonymization for selection predicates have been discussed in the literature. However, to the best of our knowledge, the problem of satisfying the accuracy constraints for multiple roles has not been studied before. In our formulation of the aforementioned problem, we propose heuristics for anonymization algorithms and show empirically that the proposed approach satisfies imprecision bounds for more permissions and has lower total imprecision than the current state of the art.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 47
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-03-25
    Description: We identify relation completion (RC) as one recurring problem that is central to the success of novel big data applications such as Entity Reconstruction and Data Enrichment. Given a semantic relation ${cal R}$ , RC attempts at linking entity pairs between two entity lists under the relation ${cal R}$ . To accomplish the RC goals, we propose to formulate search queries for each query entity $alpha$ based on some auxiliary information, so that to detect its target entity $beta$ from the set of retrieved documents. For instance, a pattern-based method (PaRE) uses extracted patterns as the auxiliary information in formulating search queries. However, high-quality patterns may decrease the probability of finding suitable target entities. As an alternative, we propose CoRE method that uses context terms learned surrounding the expression of a relation as the auxiliary information in formulating queries. The experimental results based on several real-world web data collections demonstrate that CoRE reaches a much higher accuracy than PaRE for the purpose of RC.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 48
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-03-25
    Description: Activity recognition is a key task for the development of advanced and effective ubiquitous applications in fields like ambient assisted living. A major problem in designing effective recognition algorithms is the difficulty of incorporating long-range dependencies between distant time instants without incurring substantial increase in computational complexity of inference. In this paper we present a novel approach for introducing long-range interactions based on sequential pattern mining. The algorithm searches for patterns characterizing time segments during which the same activity is performed. A probabilistic model is learned to represent the distribution of pattern matches along sequences, trying to maximize the coverage of an activity segment by a pattern match. The model is integrated in a segmental labeling algorithm and applied to novel sequences, tagged according to matches of the extracted patterns. The rationale of the approach is that restricting dependencies to span the same activity segment (i.e., sharing the same label), allows keeping inference tractable. An experimental evaluation shows that enriching sensor-based representations with the mined patterns allows improving results over sequential and segmental labeling algorithms in most of the cases. An analysis of the discovered patterns highlights non-trivial interactions spanning over a significant time horizon.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 49
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-03-25
    Description: This paper takes the shortest path discovery to study efficient relational approaches to graph search queries. We first abstract three enhanced relational operators, based on which we introduce an FEM framework to bridge the gap between relational operations and graph operations. We show new features introduced by recent SQL standards, such as window function and merge statement, can improve the performance of the FEM framework. Second, we propose an edge weight aware graph partitioning schema and design a bi-directional restrictive BFS (breadth-first-search)over partitioned tables, which improves the scalability and performance without extra indexing overheads. The final extensive experimental results illustrate our relational approach with optimization strategies can achieve high scalability and performance.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 50
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-03-25
    Description: The main aim of this paper is to develop a community discovery scheme in a multi-dimensional network for data mining applications. In online social media, networked data consists of multiple dimensions/entities such as users, tags, photos, comments, and stories. We are interested in finding a group of users who interact significantly on these media entities. In a co-citation network, we are interested in finding a group of authors who relate to other authors significantly on publication information in titles, abstracts, and keywords as multiple dimensions/entities in the network. The main contribution of this paper is to propose a framework (MultiComm)to identify a seed-based community in a multi-dimensional network by evaluating the affinity between two items in the same type of entity (same dimension)or different types of entities (different dimensions)from the network. Our idea is to calculate the probabilities of visiting each item in each dimension, and compare their values to generate communities from a set of seed items. In order to evaluate a high quality of generated communities by the proposed algorithm, we develop and study a local modularity measure of a community in a multi-dimensional network. Experiments based on synthetic and real-world data sets suggest that the proposed framework is able to find a community effectively. Experimental results have also shown that the performance of the proposed algorithm is better in accuracy than the other testing algorithms in finding communities in multi-dimensional networks.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 51
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-03-25
    Description: Pattern classification systems are commonly used in adversarial applications, like biometric authentication, network intrusion detection, and spam filtering, in which data can be purposely manipulated by humans to undermine their operation. As this adversarial scenario is not taken into account by classical design methods, pattern classification systems may exhibit vulnerabilities, whose exploitation may severely affect their performance, and consequently limit their practical utility. Extending pattern classification theory and design methods to adversarial settings is thus a novel and very relevant research direction, which has not yet been pursued in a systematic way. In this paper, we address one of the main open issues: evaluating at design phase the security of pattern classifiers, namely, the performance degradation under potential attacks they may incur during operation. We propose a framework for empirical evaluation of classifier security that formalizes and generalizes the main ideas proposed in the literature, and give examples of its use in three real applications. Reported results show that security evaluation can provide a more complete understanding of the classifier’s behavior in adversarial environments, and lead to better design choices.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 52
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-10-08
    Description: We develop a methodology to predict box office performance of a movie at the point of green-lighting, when only its script and estimated production budget are available. We extract three levels of textual features (genre and content, semantics, and bag-of-words) from scripts using screenwriting domain knowledge, human input, and natural language processing techniques. These textual variables define a distance metric across scripts, which is then used as an input for a kernel-based approach to assess box office performance. We show that our proposed methodology predicts box office revenues more accurately (29 percent lower mean squared error (MSE)) compared to benchmark methods.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 53
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-10-08
    Description: Accurately recognizing the object’s behavior from the uncertain sensor data is a key issue of Internet of Things application. For example, in urban management monitoring system, it is necessary to have an autonomous analyzing module that can online monitor object’s behavior based on environmental monitoring information in order to prevent an emergent situation in advance. In this work, we present an a pproximate o bject’s b ehavior a nalysis method, called AOBA, which can recognize behavioral patterns of the hybrid objects which include patrolman, watering cart, street lamp etc. In intelligent urban management. AOBA consists of two phases: filtering phase and recognizing phase . In the filtering phase , a -approximate pre-matching algorithm based on (q) -grams distance is introduced to select possible pattern rapidly, which can discard huge amount insignificant or dirty data; in the recognizing phase , aiming to the temporal and the spatial characteristics of sensor data, an improved bit-parallel string matching algorithm is proposed to recognize the (k) -approximate multiple patterns over event sequences selected by the filtering phase . Experiments on real urban monitoring data and synthetic data show that the proposed method can efficiently discriminate object’s behavior. Compared with the existing method, the proposed method provides a fault-tolerant approximate pattern recognition solution.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 54
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-10-08
    Description: The multiple longest common subsequence (MLCS) problem, related to the identification of sequence similarity, is an important problem in many fields. As an NP-hard problem, its exact algorithms have difficulty in handling large-scale data and time- and space-efficient algorithms are required in real-world applications. To deal with time constraints, anytime algorithms have been proposed to generate good solutions with a reasonable time. However, there exists little work on space-efficient MLCS algorithms. In this paper, we formulate the MLCS problem into a graph search problem and present two space-efficient anytime MLCS algorithms, SA-MLCS and SLA-MLCS. SA-MLCS uses an iterative beam widening search strategy to reduce space usage during the iterative process of finding better solutions. Based on SA-MLCS, SLA-MLCS, a space-bounded algorithm, is developed to avoid space usage from exceeding available memory. SLA-MLCS uses a replacing strategy when SA-MLCS reaches a given space bound. Experimental results show SA-MLCS and SLA-MLCS use an order of magnitude less space and time than the state-of-the-art approximate algorithm MLCS-APP while finding better solutions. Compared to the state-of-the-art anytime algorithm Pro-MLCS, SA-MLCS and SLA-MLCS can solve an order of magnitude larger size instances. Furthermore, SLA-MLCS can find much better solutions than SA-MLCS on large size instances.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 55
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-10-08
    Description: Users are often faced with the problem of finding complementary items that together achieve a single common goal (e.g., a starter kit for a novice astronomer, a collection of question/answers related to low-carb nutrition, a set of places to visit on holidays). In this paper, we argue that for some application scenarios returning item bundles is more appropriate than ranked lists. Thus we define composite retrieval as the problem of finding $k$ bundles of complementary items. Beyond complementarity of items, the bundles must be valid w.r.t. a given budget, and the answer set of $k$ bundles must exhibit diversity. We formally define the problem and show that in its general form is ${bf NP}$ -hard  and that also the special cases in which each bundle is formed by only one item, or only one bundle is sought, are hard. Our characterization however suggests how to adopt a two-phase approach (Produce-and-Choose, or PAC) in which we first produce many valid bundles, and then we choose $k$ among them. For the first phase we devise two ad-hoc clustering algorithms, while for the second phase we adapt heuristics with approximation guarantees for a related problem. We also devise another approach which is based on first finding a $k$ -clustering and then selecting a valid bundle from each of the produced cl- sters (Cluster-and-Pick, or CAP). We compare experimentally the proposed methods on two real-world data sets: the first data set is given by a sample of touristic attractions in 10 large European cities, while the second is a large database of user-generated restaurant reviews from Yahoo! Local. Our experiments show that when diversity is highly important, CAP is the best option, while when diversity is less important, a PAC approach constructing bundles around randomly chosen pivots, is better.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 56
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-10-08
    Description: In this paper, we tackle the problem of discovering movement-based communities of users, where users in the same community have similar movement behaviors. Note that the identification of movement-based communities is beneficial to location-based services and trajectory recommendation services. Specifically, we propose a framework to mine movement-based communities which consists of three phases: 1) constructing trajectory profiles of users, 2) deriving similarity between trajectory profiles, and 3) discovering movement-based communities. In the first phase, we design a data structure, called the Sequential Probability tree (SP-tree), as a user trajectory profile. SP-trees not only derive sequential patterns, but also indicate transition probabilities of movements. Moreover, we propose two algorithms: BF (standing for breadth-first) and DF (standing for depth-first) to construct SP-tree structures as user profiles. To measure the similarity values among users’ trajectory profiles, we further develop a similarity function that takes SP-tree information into account. In light of the similarity values derived, we formulate an objective function to evaluate the quality of communities. According to the objective function derived, we propose a greedy algorithm Geo-Cluster to effectively derive communities. To evaluate our proposed algorithms, we have conducted comprehensive experiments on two real data sets. The experimental results show that our proposed framework can effectively discover movement-based user communities.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 57
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-10-08
    Description: Privacy is becoming a key requirement for ICT applications that handle personal data. However, Database Management Systems (DBMSs), which are devoted to data collection and processing by definition, still do not provide the proper support for privacy policies. Policies are enforced by ad-hoc programmed software modules that complement DBMS access control services. This practice is time consuming, error prone, and neither general nor scalable. This work does a first step to overcome these limits. We propose a systematic approach to the automatic development of a monitor that regulates the execution of SQL queries based on purpose based privacy policies. The proposed solution does not require programming, it is general, platform independent and usable with most of the existing relational DBMSs.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 58
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-10-08
    Description: A growing number of enterprises use complex event processing for monitoring and controlling their operations, while business process models are used to document working procedures. In this work, we propose a comprehensive method for complex event processing optimization using business process models. Our proposed method is based on the extraction of behaviorial constraints that are used, in turn, to rewrite patterns for event detection, and select and transform execution plans. We offer a set of rewriting rules that is shown to be complete with respect to the $all$ , $seq$ , and $any$ patterns. The effectiveness of our method is demonstrated in an experimental evaluation with a large number of processes from an insurance company. We illustrate that the proposed optimization leads to significant savings in query processing. By integrating the optimization in state-of-the-art systems for event pattern matching, we demonstrate that these savings materialize in different technical infrastructures and can be combined with existing optimization techniques.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 59
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-10-08
    Description: In some real world applications, like information retrieval and data classification, we often are confronted with the situation that the same semantic concept can be expressed using different views with similar information. Thus, how to obtain a certain Semantically Consistent Patterns (SCP) for cross-view data, which embeds the complementary information from different views, is of great importance for those applications. However, the heterogeneity among cross-view representations brings a significant challenge on mining the SCP. In this paper, we propose a general framework to discover the SCP for cross-view data. Specifically, aiming at building a feature-isomorphic space among different views, a novel Isomorphic Relevant Redundant Transformation (IRRT) isfirst proposed. The IRRT linearly maps multiple heterogeneous low-level feature spaces to a high-dimensional redundantfeature-isomorphic one, which we name as mid-level space. Thus, much more complementary information from different views can be captured. Furthermore, to mine the semantic consistency among the isomorphic representations in the mid-level space, we propose a new Correlation-based Joint Feature Learning (CJFL) model to extract a unique high-level semantic subspace shared across the feature-isomorphic data. Consequently, the SCP for cross-view data can be obtained. Comprehensive experiments on three data sets demonstrate the advantages of our framework in classification and retrieval.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 60
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-10-08
    Description: Existing methods have addressed the issue of handling each individual skyline query performed on data sets with partially ordered domains. However, it is still very challenging to process such queries for on-line applications with low response time. In this paper, we introduce a cache-based framework, called CSS, for further reducing the query processing time to support high-responsive applications. Skyline queries that were previously processed with user preferences similar to those of the current query contribute useful candidate result points. Hence, the answered queries are cached with both their results and user preferences such that the query processor can rapidly retrieve the result for a new query only from the result sets of selected queries with compatible user preferences. We introduce a similarity measure that establishes the level of similarity between the user preferences of a new query and a cached query; hence the system can start with the most similar candidates. Furthermore, if a new query is only partially answerable from the cache, then the query processor utilizes the partial result sets and performs less expensive constraint skyline queries guided by violated preferences. Furthermore, we introduce two access methods for cached queries indexed by their user preferences to only access a set of relevant cached queries for similarity measures. Extensive experiments are presented to demonstrate the performance and utility of our novel approach.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 61
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-10-08
    Description: Keyword search is a useful tool for exploring large ${sf RDF}$ data sets. Existing techniques either rely on constructing a distance matrix for pruning the search space or building summaries from the ${sf RDF}$ graphs for query processing. In this work, we show that existing techniques have serious limitations in dealing with realistic, large ${sf RDF}$ data with tens of millions of triples. Furthermore, the existing summarization techniques may lead to incorrect/incomplete results. To address these issues, we propose an effective summarization algorithm to summarize the ${sf RDF}$ data. Given a keyword query, the summaries lend significant pruning powers to exploratory keyword search and result in much better efficiency compared to previous works. Unlike existing techniques, our search algorithms always return correct results. Besides, the summaries we built can be updated incrementally and efficiently. Experiments on both benchmark and large real ${sf RDF}$ data sets show that our techniques are scalable and efficient.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 62
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-10-08
    Description: The backup taken of a file system must be consistent, preserving data integrity across files in the file system. With file system sizes getting very large, and with demand for continuous access to data, backup has to be taken when the file system is active (is online). Arbitrarily taken online backup may result in an inconsistent backup copy. We propose a scheme referred to as mutual serializability to take a consistent backup of an active file system assuming that the file system supports transactions. The scheme extends the set of conflicting operations to include read-read conflicts, and it is shown that if the backup transaction is mutually serializable with every other transaction individually, a consistent backup copy is obtained. The user transactions continue to serialize within themselves using some standard concurrency control protocol such as Strict 2PL. We put our scheme into a formal framework to prove its correctness, and the formalization as well as the correctness proof are independent of the concurrency control protocol used to serialize user transactions. The scheme has been implemented and experiments show that consistent online backup is possible with reasonable overhead.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 63
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-10-08
    Description: Exponential growth of information generated by online social networks demands effective and scalable recommender systems to give useful results. Traditional techniques become unqualified because they ignore social relation data; existing social recommendation approaches consider social network structure, but social contextual information has not been fully considered. It is significant and challenging to fuse social contextual factors which are derived from users’ motivation of social behaviors into social recommendation. In this paper, we investigate the social recommendation problem on the basis of psychology and sociology studies, which exhibit two important factors: individual preference and interpersonal influence. We first present the particular importance of these two factors in online behavior prediction. Then we propose a novel probabilistic matrix factorization method to fuse them in latent space. We further provide a scalable algorithm which can incrementally process the large scale data. We conduct experiments on both Facebook style bidirectional and Twitter style unidirectional social network data sets. The empirical results and analysis on these two large data sets demonstrate that our method significantly outperforms the existing approaches.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 64
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-10-08
    Description: Affinity Propagation (AP) clustering has been successfully used in a lot of clustering problems. However, most of the applications deal with static data. This paper considers how to apply AP in incremental clustering problems. First, we point out the difficulties in Incremental Affinity Propagation (IAP) clustering, and then propose two strategies to solve them. Correspondingly, two IAP clustering algorithms are proposed. They are IAP clustering based on K-Medoids (IAPKM) and IAP clustering based on Nearest Neighbor Assignment (IAPNA). Five popular labeled data sets, real world time series and a video are used to test the performance of IAPKM and IAPNA. Traditional AP clustering is also implemented to provide benchmark performance. Experimental results show that IAPKM and IAPNA can achieve comparable clustering performance with traditional AP clustering on all the data sets. Meanwhile, the time cost is dramatically reduced in IAPKM and IAPNA. Both the effectiveness and the efficiency make IAPKM and IAPNA able to be well used in incremental clustering tasks.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 65
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-10-08
    Description: Most frequent and expensive queries in social networks involve multi-user operations such as requesting the latest tweets or news-feeds of friends. The performance of such queries are heavily dependent on the data partitioning and replication methodologies adopted by the underlying systems. Existing solutions for data distribution in these systems involve hash- or graph-based approaches that ignore the multi-way relations among data. In this work, we propose a novel data partitioning and selective replication method that utilizes the temporal information in prior workloads to predict future query patterns. Our method utilizes the social network structure and the temporality of the interactions among its users to construct a hypergraph that correctly models multi-user operations. It then performs simultaneous partitioning and replication of this hypergraph to reduce the query span while respecting load balance and I/O load constraints under replication. To test our model, we enhance the Cassandra NoSQL system to support selective replication and we implement a social network application (a Twitter clone) utilizing our enhanced Cassandra. We conduct experiments on a cloud computing environment (Amazon EC2) to test the developed systems. Comparison of the proposed method with hash- and enhanced graph-based schemes indicate that it significantly improves latency and throughput.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 66
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-10-08
    Description: Search Engine Marketing (SEM) agencies manage thousands of search keywords for their clients. The campaign management dashboards provided by advertisement brokers have interfaces to change search campaign attributes. Using these dashboards, advertisers create test variants for various bid choices, keyword ideas, and advertisement text options. Later on, they conduct controlled experiments for selecting the best performing variants. Given a large keyword portfolio and many variants to consider, campaign management can easily become a burden on even experienced advertisers. In order to target users in need of a particular service, advertisers have to determine the purchase intents or information needs of target users. Once the target intents are determined, advertisers can target those users with relevant search keywords. In order to formulate information needs and to scale campaign management with increasing number of keywords, we propose a framework called TopicMachine , where we learn the latent topics hidden in the available search terms reports. Our hypothesis is that these topics correspond to the set of information needs that best match-make a given client with users. In our experiments, TopicMachine outperformed its closest competitor by $41$ percent on predicting total user subscriptions.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 67
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-10-08
    Description: Mobile devices with geo-positioning capabilities (e.g., GPS) enable users to access information that is relevant to their present location. Users are interested in querying about points of interest (POI) in their physical proximity, such as restaurants, cafes, ongoing events, etc. Entities specialized in various areas of interest (e.g., certain niche directions in arts, entertainment, travel) gather large amounts of geo-tagged data that appeal to subscribed users. Such data may be sensitive due to their contents. Furthermore, keeping such information up-to-date and relevant to the users is not an easy task, so the owners of such data sets will make the data accessible only to paying customers. Users send their current location as the query parameter, and wish to receive as result the nearest POIs, i.e., nearest-neighbors (NNs). But typical data owners do not have the technical means to support processing queries on a large scale, so they outsource data storage and querying to a cloud service provider. Many such cloud providers exist who offer powerful storage and computational infrastructures at low cost. However, cloud providers are not fully trusted, and typically behave in an honest-but-curious fashion. Specifically, they follow the protocol to answer queries correctly, but they also collect the locations of the POIs and the subscribers for other purposes. Leakage of POI locations can lead to privacy breaches as well as financial losses to the data owners, for whom the POI data set is an important source of revenue. Disclosure of user locations leads to privacy violations and may deter subscribers from using the service altogether. In this paper, we propose a family of techniques that allow processing of NN queries in an untrusted outsourced environment, while at the same time protecting both the POI and querying users’ positions. Our techniques rely on mutable order preserving encoding (mOPE), the only secure order-preserving encryption method known to-- ate. We also provide performance optimizations to decrease the computational cost inherent to processing on encrypted data, and we consider the case of incrementally updating data sets. We present an extensive performance evaluation of our techniques to illustrate their viability in practice.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 68
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-10-08
    Description: In our increasingly connected and instrumented world, live data recording the interactions between people, systems, and the environment is available in various domains, such as telecommunciations and social media. This data often takes the form of a temporally evolving graph, where entities are the vertices and the interactions between them are the edges. An important feature of this graph is that the number of edges it has grows continuously, as new interactions take place. We call such graphs interaction graphs. In this paper we study the problem of storing interaction graphs such that temporal queries on them can be answered efficiently. Since interaction graphs are append-only and edges are added continuously, traditional graph layout and storage algorithms that are batch based cannot be applied directly. We present the design and implementation of a system that caches recent interactions in memory, while quickly placing the expired interactions to disk blocks such that those edges that are likely to be accessed together are placed together. We develop live block formation algorithms that are fast, yet can take advantage of temporal and spatial locality among the edges to optimize the storage layout with the goal of improving query performance. We evaluate the system on synthetic as well as real-world interaction graphs, and show that our block formation algorithms are effective for answering temporal neighborhood queries on the graph. Such queries form a foundation for building more complex online and offline temporal analytics on interaction graphs.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 69
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-10-08
    Description: Graphs are becoming increasingly dominant in modeling real-life networked data including social and biological networks, the WWW and the Semantic Web, etc. Graph pattern queries are useful for gathering information with expressive semantics from these graph-structured data. Current methods for graph pattern query processing have performance deficiency caused by inefficiencies of the underlying reachability index and costly merge-join operations on huge amounts of tuple-formatted intermediate results. To overcome the above problems, this paper contributes in the following aspects to boost graph pattern query evaluation. First, we propose an improved hop-based reachability indexing scheme 3-Hop* which gains faster reachability query evaluation, less indexing costs and better scalabilities than state-of-the-art hop-based methods. Second, we propose a two-stage node filtering algorithm based on 3-Hop* to answer tree pattern queries more efficiently. Tree pattern queries serve as the underlying facility for graph pattern query evaluation. Furthermore, we use a graph representation of the intermediate results during node filtering and final results enumeration. Experiments on real-life and synthetic datasets demonstrate the effectiveness of the proposed methods.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 70
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-11-08
    Description: The process of record linkage seeks to integrate instances that correspond to the same entity. Record linkage has traditionally been performed through the comparison of identifying field values ( e.g. , Surname ), however, when databases are maintained by disparate organizations, the disclosure of such information can breach the privacy of the corresponding individuals. Various private record linkage (PRL) methods have been developed to obscure such identifiers, but they vary widely in their ability to balance competing goals of accuracy, efficiency and security. The tokenization and hashing of field values into Bloom filters (BF) enables greater linkage accuracy and efficiency than other PRL methods, but the encodings may be compromised through frequency-based cryptanalysis. Our objective is to adapt a BF encoding technique to mitigate such attacks with minimal sacrifices in accuracy and efficiency. To accomplish these goals, we introduce a statistically-informed method to generate BF encodings that integrate bits from multiple fields, the frequencies of which are provably associated with a minimum number of fields. Our method enables a user-specified tradeoff between security and accuracy. We compare our encoding method with other techniques using a public dataset of voter registration records and demonstrate that the increases in security come with only minor losses to accuracy.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 71
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-11-08
    Description: It is nowadays well-established that the construction of quality domain ontologies benefits from the involvement in the modelling process of more actors, possibly having different roles and skills. To be effective, the collaboration between these actors has to be fostered, enabling each of them to actively and readily participate to the development of the ontology, favoring as much as possible the direct involvement of the domain experts in the authoring activities. Recent works have shown that ontology modelling tools based on wikis’ paradigm and technology could contribute in meeting these collaborative requirements. This paper investigates, both at the theoretical and empirical level, the effectiveness of wiki features for collaborative ontology authoring in supporting teamworks composed of domain experts and knowledge engineers, as well as their impact on the entire process of collaborative ontology modelling and entity lifecycle.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 72
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-12-06
    Description: As machine learning techniques mature and are used to tackle complex scientific problems, challenges arise such as the imbalanced class distribution problem, where one of the target class labels is under-represented in comparison with other classes. Existing oversampling approaches for addressing this problem typically do not consider the probability distribution of the minority class while synthetically generating new samples. As a result, the minority class is not represented well which leads to high misclassification error. We introduce two probabilistic oversampling approaches, namely RACOG and wRACOG, to synthetically generating and strategically selecting new minority class samples. The proposed approaches use the joint probability distribution of data attributes and Gibbs sampling to generate new minority class samples. While RACOG selects samples produced by the Gibbs sampler based on a predefined lag , wRACOG selects those samples that have the highest probability of being misclassified by the existing learning model. We validate our approach using nine UCI data sets that were carefully modified to exhibit class imbalance and one new application domain data set with inherent extreme class imbalance. In addition, we compare the classification performance of the proposed methods with three other existing resampling techniques.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 73
    Publication Date: 2014-12-06
    Description: Although mostly used for pattern classification, linear discriminant analysis (LDA) can also be used in feature selection as an effective measure to evaluate the separative ability of a feature subset. When applied to feature selection on high-dimensional small-sized (HDSS) data (generally) with class-imbalance, LDA encounters four problems, including singularity of scatter matrix, overfitting, overwhelming and prohibitively computational complexity. In this study, we propose the LDA-based feature selection method minority class emphasized linear discriminant analysis (MCE-LDA) with a new regularization technique to address the first three problems. Different to giving equal or more emphasis to majority class in conventional forms of regularization, the proposed regularization emphasizes more on minority class, with the expectation of improving overall performance by alleviating overwhelming of majority class to minority class as well as overfitting in minority class. In order to reduce computational overhead, an incremental implementation of LDA-based feature selection has been introduced. Comparative studies with other forms of regularization to LDA as well as with other popular feature selection methods on five HDSS problems show that MCE-LDA can produce feature subsets with excellent performance in both classification and robustness. Further experimental results of true positive rate (TPR) and true negative rate (TNR) have also verified the effectiveness of the proposed technique in alleviating overwhelming and overfitting problems.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 74
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-12-06
    Description: It is common that the objects in a spatial database (e.g., restaurants/hotels) are associated with keyword(s) to indicate their businesses/services/features. An interesting problem known as Closest Keywords search is to query objects, called keyword cover , which together cover a set of query keywords and have the minimum inter-objects distance. In recent years, we observe the increasing availability and importance of keyword rating in object evaluation for the better decision making. This motivates us to investigate a generic version of Closest Keywords search called Best Keyword Cover which considers inter-objects distance as well as the keyword rating of objects. The baseline algorithm is inspired by the methods of Closest Keywords search which is based on exhaustively combining objects from different query keywords to generate candidate keyword covers. When the number of query keywords increases, the performance of the baseline algorithm drops dramatically as a result of massive candidate keyword covers generated. To attack this drawback, this work proposes a much more scalable algorithm called keyword nearest neighbor expansion (keyword-NNE). Compared to the baseline algorithm, keyword-NNE algorithm significantly reduces the number of candidate keyword covers generated. The in-depth analysis and extensive experiments on real data sets have justified the superiority of our keyword-NNE algorithm.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 75
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-09-06
    Description: Phase change memory (PCM) has been considered an attractive alternative to flash memory and DRAM. It has promising features, including non-volatile storage, byte addressability, fast read and write operations, and supports random accesses. However, there are challenges in designing algorithms for PCM-based memory systems, such as longer write latency and higher energy consumption compared to DRAM. In this paper, we propose a new predictive B (^{+}) - tree index, called the B (^{p}) - tree , which is tailored for database systems that make use of PCM. Our B (^{p}) - tree reduces data movements caused by tree node splits and merges that arise from insertions and deletions. This is achieved by pre-allocating space on PCM for near future data. To ensure the space are allocated where they are needed, we propose a novel predictive model to ascertain future data distribution based on the current data. In addition, as in [4] , when keys are inserted into a leaf node, they are packed but need not be in sorted order. We have implemented the B (^{p}) - tree in PostgreSQL and evaluated it in an emulated environment. Our experimental results show that the B (^{p}) - tree significantly reduces the number of writes, therefore making it write and energy efficient and suitable for a PCM-like hardware environment.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 76
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-09-06
    Description: This paper formulates a multi-graph learning task. In our problem setting, a bag contains a number of graphs and a class label. A bag is labeled positive if at least one graph in the bag is positive, and negative otherwise. In addition, the genuine label of each graph in a positive bag is unknown, and all graphs in a negative bag are negative. The aim of multi-graph learning is to build a learning model from a number of labeled training bags to predict previously unseen test bags with maximum accuracy. This problem setting is essentially different from existing multi-instance learning (MIL), where instances in MIL share well-defined feature values, but no features are available to represent graphs in a multi-graph bag. To solve the problem, we propose a Multi-Graph Feature based Learning ( gMGFL) algorithm that explores and selects a set of discriminative subgraphs as features to transfer each bag into a single instance, with the bag label being propagated to the transferred instance. As a result, the multi-graph bags form a labeled training instance set, so generic learning algorithms, such as decision trees, can be used to derive learning models for multi-graph classification. Experiments and comparisons on real-world multi-graph tasks demonstrate the algorithm performance.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 77
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-09-06
    Description: Both cost-sensitive classification and online learning have been extensively studied in data mining and machine learning communities, respectively. However, very limited study addresses an important intersecting problem, that is, “Cost-Sensitive Online Classification”. In this paper, we formally study this problem, and propose a new framework for Cost-Sensitive Online Classification by directly optimizing cost-sensitive measures using online gradient descent techniques. Specifically, we propose two novel cost-sensitive online classification algorithms, which are designed to directly optimize two well-known cost-sensitive measures: (i) maximization of weighted sum of sensitivity and specificity , and (ii) minimization of weighted misclassification cost . We analyze the theoretical bounds of the cost-sensitive measures made by the proposed algorithms, and extensively examine their empirical performance on a variety of cost-sensitive online classification tasks. Finally, we demonstrate the application of the proposed technique for solving several online anomaly detection tasks, showing that the proposed technique could be a highly efficient and effective tool to tackle cost-sensitive online classification tasks in various application domains.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 78
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-09-06
    Description: In graph theory, $k$ -core is a key metric used to identify subgraphs of high cohesion, also known as the ‘dense’ regions of a graph. As the real world graphs such as social network graphs grow in size, the contents get richer and the topologies change dynamically, we are challenged not only to materialize $k$ -core subgraphs for one time but also to maintain them in order to keep up with continuous updates. Adding to the challenge is that real world data sets are outgrowing the capacity of a single server and its main memory. These challenges inspired us to propose a new set of distributed algorithms for $k$ -core view construction and maintenance on a horizontally scaling storage and computing platform. Our algorithms execute against the partitioned graph data in parallel and take advantage of $k$ -core properties to aggressively prune unnecessary computation. Experimental evaluation results demonstrated orders of magnitude speedup and advantages of maintaining $k$ -core incrementally and in batch windows over complete reconstruction. Our algorithms thus enable practitioners to create and maintain many $k$ -core views on different topics in rich social network content si- ultaneously.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 79
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-09-06
    Description: Location-based services have attracted significant attention due to modern mobile phones equipped with GPS devices. These services generate large amounts of spatio-textual data which contain both spatial location and textual descriptions. Since a spatio-textual object may have different representations, possibly because of deviations of GPS or different user descriptions, it calls for efficient methods to integrate spatio-textual data from different sources. In this paper we study a new research problem called spatio-textual similarity join: given two sets of spatio-textual objects, find the similar object pairs. We make the following contributions: (1) We develop a filter-and-refine framework and devise several efficient algorithms. We extend the prefix filter technique to generate spatial and textual signatures for the objects and build inverted index on top of these signatures. Then we generate candidate pairs using the inverted lists of signatures. Finally we refine the candidates and generate the final result. (2) We study how to generate high-quality signatures for spatial information. We develop an MBR-prefix based signature to prune large numbers of dissimilar object pairs. (3) We propose a hybrid signature scheme to support both textual pruning and spatial pruning simultaneously. (4) Experimental results on real and synthetic datasets show that our algorithms achieve high performance and scale well.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 80
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-09-06
    Description: The (k) -core decomposition in a graph is a fundamental problem for social network analysis. The problem of (k) -core decomposition is to calculate the core number for every node in a graph. Previous studies mainly focus on (k) -core decomposition in a static graph. There exists a linear time algorithm for (k) -core decomposition in a static graph. However, in many real-world applications such as online social networks and the Internet, the graph typically evolves over time. In such applications, a key issue is to maintain the core numbers of nodes when the graph changes over time. A simple implementation is to perform the linear time algorithm to recompute the core number for every node after the graph is updated. Such simple implementation is expensive when the graph is very large. In this paper, we propose a new efficient algorithm to maintain the core number for every node in a dynamic graph. Our main result is that only certain nodes need to update their core numbers when the graph is changed by inserting/deleting an edge. We devise an efficient algorithm to identify and recompute the core numbers of such nodes. The complexity of our algorithm is independent of the graph size. In addition, to further accelerate the algorithm, we develop two pruning strategies by exploiting the lower and upper bounds of the core number. Finally, we conduct extensive experiments over both real-world and synthetic datasets, and the results demonstrate the efficiency of the proposed algorithm.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 81
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-09-06
    Description: Clustering categorical sequences is an important and difficult data mining task. Despite recent efforts, the challenge remains, due to the lack of an inherently meaningful measure of pairwise similarity. In this paper, we propose a novel variable-order Markov framework, named weighted conditional probability distribution (WCPD), to model clusters of categorical sequences. We propose an efficient and effective approach to solve the challenging problem of model initialization. To initialize the WCPD model, we propose to use a first-order Markov model built on a weighted fuzzy indicator vector representation of categorical sequences, which we call the WFI Markov model. Based on a cascade optimization framework that combines the WCPD and WFI models, we design a new divisive hierarchical clustering algorithm for clustering categorical sequences. Experimental results on data sets from three different domains demonstrate the promising performance of our models and clustering algorithm.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 82
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-09-06
    Description: This paper focuses on an important query in scientific simulation data analysis: the Spatial Distance Histogram (SDH). The computation time of an SDH query using brute force method is quadratic. Often, such queries are executed continuously over certain time periods, increasing the computation time. We propose highly efficient approximate algorithm to compute SDH over consecutive time periods with provable error bounds. The key idea of our algorithm is to derive statistical distribution of distances from the spatial and temporal characteristics of particles. Upon organizing the data into a Quad-tree based structure, the spatiotemporal characteristics of particles in each node of the tree are acquired to determine the particles’ spatial distribution as well as their temporal locality in consecutive time periods. We report our efforts in implementing and optimizing the above algorithm in graphics processing units (GPUs) as means to further improve the efficiency. The accuracy and efficiency of the proposed algorithm is backed by mathematical analysis and results of extensive experiments using data generated from real simulation studies.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 83
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-09-06
    Description: Physical activity consists of complex behavior, typically structured in bouts which can consist of one continuous movement (e.g., exercise) or many sporadic movements (e.g., household chores). Each bout can be represented as a block of feature vectors corresponding to the same activity type. This paper introduces a general distance metric technique to use this block representation to first predict activity type, and then uses the predicted activity to estimate energy expenditure within a novel framework. This distance metric, dubbed Bipart, learns block-level information from both training and test sets, combining both to form a projection space which materializes block-level constraints. Thus, Bipart provides a space which can improve the bout classification performance of all classifiers. We also propose an energy expenditure estimation framework which leverages activity classification in order to improve estimates. Comprehensive experiments on waist-mounted accelerometer data, comparing Bipart against many similar methods as well as other classifiers, demonstrate the superior activity recognition of Bipart, especially in low-information experimental settings.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 84
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-09-06
    Description: Topic modeling has become a widely used tool for document management. However, there are few topic models distinguishing the importance of documents on different topics. In this paper, we propose a framework $LIMTopic$ to incorporate link based importance into topic modeling. To instantiate the framework, RankTopic and HITSTopic are proposed by incorporating topical pagerank and topical HITS into topic modeling respectively. Specifically, ranking methods are first used to compute the topical importance of documents. Then, a generalized relation is built between link importance and topic modeling. We empirically show that LIMTopic converges after a small number of iterations in most experimental settings. The necessity of incorporating link importance into topic modeling is justified based on KL-Divergences between topic distributions converted from topical link importance and those computed by basic topic models. To investigate the document network summarization performance of topic models, we propose a novel measure called log-likelihood of ranking-integrated document-word matrix. Extensive experimental results show that LIMTopic performs better than baseline models in generalization performance, document clustering and classification, topic interpretability and document network summarization performance. Moreover, RankTopic has comparable performance with relational topic model (RTM) and HITSTopic performs much better than baseline models in document clustering and classification.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 85
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-09-06
    Description: In this paper, we consider the collaborative data publishing problem for anonymizing horizontally partitioned data at multiple data providers. We consider a new type of “insider attack” by colluding data providers who may use their own data records (a subset of the overall data) to infer the data records contributed by other data providers. The paper addresses this new threat, and makes several contributions. First, we introduce the notion of (m) -privacy, which guarantees that the anonymized data satisfies a given privacy constraint against any group of up to (m) colluding data providers. Second, we present heuristic algorithms exploiting the monotonicity of privacy constraints for efficiently checking (m) -privacy given a group of records. Third, we present a data provider-aware anonymization algorithm with adaptive (m) -privacy checking strategies to ensure high utility and (m) -privacy of anonymized data with efficiency. Finally, we propose secure multi-party computation protocols for collaborative data publishing with (m) -privacy. All protocols are extensively analyzed and their security and efficiency are formally proved. Experiments on real-life datasets suggest that our approach achieves better or comparable utility and efficiency than existing and baseline algorithms while satisfying (m) -privacy.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 86
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-09-06
    Description: Social tags inherent in online music services such as Last.fm provide a rich source of information on musical moods. The abundance of social tags makes this data highly beneficial for developing techniques to manage and retrieve mood information, and enables study of the relationships between music content and mood representations with data substantially larger than that available for conventional emotion research. However, no systematic assessment has been done on the accuracy of social tags and derived semantic models at capturing mood information in music. We propose a novel technique called Affective Circumplex Transformation (ACT) for representing the moods of music tracks in an interpretable and robust fashion based on semantic computing of social tags and research in emotion modeling. We validate the technique by predicting listener ratings of moods in music tracks, and compare the results to prediction with the Vector Space Model (VSM), Singular Value Decomposition (SVD), Nonnegative Matrix Factorization (NMF), and Probabilistic Latent Semantic Analysis (PLSA). The results show that ACT consistently outperforms the baseline techniques, and its performance is robust against a low number of track-level mood tags. The results give validity and analytical insights for harnessing millions of music tracks and associated mood data available through social tags in application development.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 87
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-09-06
    Description: Discovering hyponym relations among domain-specific terms is a fundamental task in taxonomy learning and knowledge acquisition. However, the great diversity of various domain corpora and the lack of labeled training sets make this task very challenging for conventional methods that are based on text content. The hyperlink structure of Wikipedia article pages was found to contain recurring network motifs in this study, indicating the probability of a hyperlink being a hyponym hyperlink. Hence, a novel hyponym relation extraction approach based on the network motifs of Wikipedia hyperlinks was proposed. This approach automatically constructs motif-based features from the hyperlink structure of a domain; every hyperlink is mapped to a 13-dimensional feature vector based on the 13 types of three-node motifs. The approach extracts structural information from Wikipedia and heuristically creates a labeled training set. Classification models were determined from the training sets for hyponym relation extraction. Two experiments were conducted to validate our approach based on seven domain-specific datasets obtained from Wikipedia. The first experiment, which utilized manually labeled data, verified the effectiveness of the motif-based features. The second experiment, which utilized an automatically labeled training set of different domains, showed that the proposed approach performs better than the approach based on lexico-syntactic patterns and achieves comparable result to the approach based on textual features. Experimental results show the practicability and fairly good domain scalability of the proposed approach.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 88
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-09-06
    Description: Majority of the existing works on network analysis study properties that are related to the global topology of a network. Examples of such properties include diameter, power-law exponent, and spectra of graph Laplacian. Such works enhance our understanding of real-life networks, or enable us to generate synthetic graphs with real-life graph properties. However, many of the existing problems on networks require the study of local topological structures of a network, which did not get the deserved attention in the existing works. In this work, we use graphlet frequency distribution (GFD) as an analysis tool for understanding the variance of local topological structure in a network; we also show that it can help in comparing, and characterizing real-life networks. The main bottleneck to obtain GFD is the excessive computation cost for obtaining the frequency of each of the graphlets in a large network. To overcome this, we propose a simple, yet powerful algorithm, called Graft , that obtains the approximate graphlet frequency for all graphlets that have up-to five vertices. Comparing to an exact counting algorithm, our algorithm achieves a speedup factor between 10 and 100 for a negligible counting error, which is, on average, less than 5 percent.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 89
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-09-06
    Description: Spectral clustering is a powerful tool for unsupervised data analysis. In this paper, we propose a context-aware hypergraph similarity measure (CAHSM), which leads to robust spectral clustering in the case of noisy data. We construct three types of hypergraphs—the pairwise hypergraph, the (k) -nearest-neighbor ( (k) NN) hypergraph, and the high-order over-clustering hypergraph. The pairwise hypergraph captures the pairwise similarity of data points; the (k) NNhypergraph captures the neighborhood of each point; and the clustering hypergraph encodes high-order contexts within the dataset. By combining the affinity information from these three hypergraphs, the CAHSM algorithm is able to explore the intrinsic topological information of the dataset. Therefore, data clustering using CAHSM tends to be more robust. Considering the intra-cluster compactness and the inter-cluster separability of vertices, we further design a discriminative hypergraph partitioning criterion (DHPC). Using both CAHSM and DHPC, a robust spectral clustering algorithm is developed. Theoretical analysis and experimental evaluation demonstrate the effectiveness and robustness of the proposed algorithm.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 90
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-09-06
    Description: Similarity search is an important function in many applications, which usually focuses on measuring the similarity between objects with the same type. However, in many scenarios, we need to measure the relatedness between objects with different types. With the surge of study on heterogeneous networks, the relevance measure on objects with different types becomes increasingly important. In this paper, we study the relevance search problem in heterogeneous networks, where the task is to measure the relatedness of heterogeneous objects (including objects with the same type or different types). A novel measure HeteSim is proposed, which has the following attributes: (1) a uniform measure: it can measure the relatedness of objects with the same or different types in a uniform framework; (2) a path-constrained measure: the relatedness of object pairs are defined based on the search path that connects two objects through following a sequence of node types; (3) a semi-metric measure: HeteSim has some good properties (e.g., self-maximum and symmetric), which are crucial to many data mining tasks. Moreover, we analyze the computation characteristics of HeteSim and propose the corresponding quick computation strategies. Empirical studies show that HeteSim can effectively and efficiently evaluate the relatedness of heterogeneous objects.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 91
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-09-06
    Description: Web-page recommendation plays an important role in intelligent Web systems. Useful knowledge discovery from Web usage data and satisfactory knowledge representation for effective Web-page recommendations are crucial and challenging. This paper proposes a novel method to efficiently provide better Web-page recommendation through semantic-enhancement by integrating the domain and Web usage knowledge of a website. Two new models are proposed to represent the domain knowledge. The first model uses an ontology to represent the domain knowledge. The second model uses one automatically generated semantic network to represent domain terms, Web-pages, and the relations between them. Another new model, the conceptual prediction model, is proposed to automatically generate a semantic network of the semantic Web usage knowledge, which is the integration of domain knowledge and Web usage knowledge. A number of effective queries have been developed to query about these knowledge bases. Based on these queries, a set of recommendation strategies have been proposed to generate Web-page candidates. The recommendation results have been compared with the results obtained from an advanced existing Web Usage Mining (WUM) method. The experimental results demonstrate that the proposed method produces significantly higher performance than the WUM method.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 92
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-09-06
    Description: Data heterogeneity in XML retrieval activities can be tackled by giving users the possibility to obtain approximate answers to their queries. XPath query relaxation has been proposed as a mechanism to provide approximate answers in the case of positive XPath queries. Under this mechanism, the “satisfaction score” of an answer is defined by looking at how the query must be relaxed to produce that answer, and the user is provided with the best (k) answers according to their satisfaction score (top- (k) query answering). In this paper we investigate the problem of top- (k) query answering for XPath queries with negation. We tackle the challenging issues that need to be carefully considered when dealing with the approximation of negated conditions and propose an incremental top- (k) query answering technique based on query relaxation. Specifically, after defining a weighted query language and its semantics, we develop a general incremental query evaluation framework, which is flexible enough to support different evaluation strategies. The experimental assessment confirms the effectiveness of the whole framework.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 93
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-09-06
    Description: This paper studies the problem of querying Bounded Spatial Datasets (BSDs). A BSD contains i) objects with known locations, and ii) unknown regions , each of which bounds an unknown number of objects, within a coverage area. We consider applications where each BSD is hosted on a server or site connected to a communication network and the BSDs overlap in their coverage areas. The challenge is to query the distributed BSDs to retrieve all objects and to minimize the unknown regions which may contain objects satisfying the query, while minimizing the data transmission volume and number of interactions between the query client and the sites. We develop query processing algorithms for two important types of spatial queries, namely, range and (k) -nearest-neighbor ( (k) NN) queries. We develop the site-based approach and the area-based approach for efficiently processing range and (k) NN queries on distributed BSDs. They aim to process only a subset of the sites to obtain the full answer for a query. Thus, optimal site selection and the corresponding site querying methods are important problems studied in this paper. In the area-based approach, we prove an optimal division and derive a practical heuristic to partition a query and select the best processing site for each partition, hence achieving even better efficiency than the site-based approach. Simulation results based on three real spatial datasets show that our proposed approaches significantly outperform the baseline that uses a centralized approach in terms of data transmission volume and the number of interactions between the query client and the distributed sites.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 94
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-09-06
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 95
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-12-06
    Description: Kernel principal component analysis and the reconstruction error is an effective anomaly detection technique for non-linear data sets. In an environment where a phenomenon is generating data that is non-stationary, anomaly detection requires a recomputation of the kernel eigenspace in order to represent the current data distribution. Recomputation is a computationally complex operation and reducing computational complexity is therefore a key challenge. In this paper, we propose an algorithm that is able to accurately remove data from a kernel eigenspace without performing a batch recomputation. Coupled with a kernel eigenspace update, we demonstrate that our technique is able to remove and add data to a kernel eigenspace more accurately than existing techniques. An adaptive version determines an appropriately sized sliding window of data and when a model update is necessary. Experimental evaluations on both synthetic and real-world data sets demonstrate the superior performance of the proposed approach in comparison to alternative incremental KPCA approaches and alternative anomaly detection techniques.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 96
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-12-06
    Description: Early detection of patients with elevated risk of developing diabetes mellitus is critical to the improved prevention and overall clinical management of these patients. We aim to apply association rule mining to electronic medical records (EMR) to discover sets of risk factors and their corresponding subpopulations that represent patients at particularly high risk of developing diabetes. Given the high dimensionality of EMRs, association rule mining generates a very large set of rules which we need to summarize for easy clinical use. We reviewed four association rule set summarization techniques and conducted a comparative evaluation to provide guidance regarding their applicability, strengths and weaknesses. We proposed extensions to incorporate risk of diabetes into the process of finding an optimal summary. We evaluated these modified techniques on a real-world prediabetic patient cohort. We found that all four methods produced summaries that described subpopulations at high risk of diabetes with each method having its clear strength. For our purpose, our extension to the Buttom-Up Summarization (BUS) algorithm produced the most suitable summary. The subpopulations identified by this summary covered most high-risk patients, had low overlap and were at very high risk of diabetes.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 97
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-12-06
    Description: The problem of mobile sequential recommendation is to suggest a route connecting a set of pick-up points for a taxi driver so that he/she is more likely to get passengers with less travel cost. Essentially, a key challenge of this problem is its high computational complexity. In this paper, we propose a novel dynamic programming based method to solve the mobile sequential recommendation problem consisting of two separate stages: an offline pre-processing stage and an online search stage. The offline stage pre-computes potential candidate sequences from a set of pick-up points. A backward incremental sequence generation algorithm is proposed based on the identified iterative property of the cost function. Simultaneously, an incremental pruning policy is adopted in the process of sequence generation to reduce the search space of the potential sequences effectively. In addition, a batch pruning algorithm is further applied to the generated potential sequences to remove some non-optimal sequences of a given length. Since the pruning effectiveness keeps growing with the increase of the sequence length, at the online stage, our method can efficiently find the optimal driving route for an unloaded taxi in the remaining candidate sequences. Moreover, our method can handle the problem of optimal route search with a maximum cruising distance or a destination constraint. Experimental results on real and synthetic data sets show that both the pruning ability and the efficiency of our method surpass the state-of-the-art methods. Our techniques can therefore be effectively employed to address the problem of mobile sequential recommendation with many pick-up points in real-world applications.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 98
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-12-06
    Description: The key task in developing graph-based learning algorithms is constructing an informative graph to express the contextual information of a data manifold. Since traditional graph construction methods are sensitive to noise and less datum-adaptive to changes in density, a new method called $ell^1$ -graph was proposed recently. A graph construction needs to have two important properties: sparsity and locality. The $ell^1$ -graph has a strong sparsity property, but a weak locality property. Thus, we propose a new method of constructing an informative graph using auto-grouped sparse regularization based on the $ell^1$ -graph, which is called as Group Sparse graph (GS-graph). We also show how to efficiently construct a GS-graph in reproducing kernel Hilbert space with the kernel trick. The new methods, the GS-graph and its kernelized version (KGS-graph), have the same noise-insensitive property as that of $ell^1$ -graph and also can successively preserve the properties of sparsity and locality simultaneously. Furthermore, we integrate the proposed graph with several graph-based learning algorithms to demonstrate the effectiveness of our method. The empirical studies on benchmarks show that the proposed methods outperform the $ell^1$ -graph and other traditional graph construction methods in various learning tasks.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 99
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-12-06
    Description: With the increasing volume of images users share through social sites, maintaining privacy has become a major problem, as demonstrated by a recent wave of publicized incidents where users inadvertently shared personal information. In light of these incidents, the need of tools to help users control access to their shared content is apparent. Toward addressing this need, we propose an Adaptive Privacy Policy Prediction (A3P) system to help users compose privacy settings for their images. We examine the role of social context, image content, and metadata as possible indicators of users’ privacy preferences. We propose a two-level framework which according to the user’s available history on the site, determines the best available privacy policy for the user’s images being uploaded. Our solution relies on an image classification framework for image categories which may be associated with similar policies, and on a policy prediction algorithm to automatically generate a policy for each newly uploaded image, also according to users’ social features. Over time, the generated policies will follow the evolution of users’ privacy attitude. We provide the results of our extensive evaluation over 5,000 policies, which demonstrate the effectiveness of our system, with prediction accuracies over 90 percent.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
  • 100
    facet.materialart.
    Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2014-12-06
    Description: Software companies spend over 45 percent of cost in dealing with software bugs. An inevitable step of fixing bugs is bug triage, which aims to correctly assign a developer to a new bug. To decrease the time cost in manual work, text classification techniques are applied to conduct automatic bug triage. In this paper, we address the problem of data reduction for bug triage, i.e., how to reduce the scale and improve the quality of bug data. We combine instance selection with feature selection to simultaneously reduce data scale on the bug dimension and the word dimension. To determine the order of applying instance selection and feature selection, we extract attributes from historical bug data sets and build a predictive model for a new bug data set. We empirically investigate the performance of data reduction on totally 600,000 bug reports of two large open source projects, namely Eclipse and Mozilla. The results show that our data reduction can effectively reduce the data scale and improve the accuracy of bug triage. Our work provides an approach to leveraging techniques on data processing to form reduced and high-quality bug data in software development and maintenance.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
    Location Call Number Expected Availability
    BibTip Others were also interested in ...
Close ⊗
This website uses cookies and the analysis tool Matomo. More information can be found here...