ALBERT

All Library Books, journals and Electronic Records Telegrafenberg

Filter
Collection
  • Articles  (242)
Publisher
  • Institute of Electrical and Electronics Engineers (IEEE)  (242)
Years
  • 2015-2019  (242)
  • 1990-1994
  • 1945-1949
Year
  • 2016  (242)
Journal
  • IEEE Transactions on Knowledge and Data Engineering  (242)
Topic
  • Computer Science  (242)
  • Geography
  • Mechanical Engineering, Materials Science, Production Engineering, Mining and Metallurgy, Traffic Engineering, Precision Mechanics
  • 1
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-08-05
    Description: Since Jeff Howe introduced the term crowdsourcing in 2006, this human-powered problem-solving paradigm has gained a lot of attention and has been a hot research topic in the field of computer science. Even though a lot of work has been conducted on this topic, so far there is no comprehensive survey of the most relevant work done in the crowdsourcing field. In this paper, we aim to offer an overall picture of the current state-of-the-art techniques in general-purpose crowdsourcing. According to their focus, we divide this work into three parts: incentive design, task assignment, and quality control. For each part, we start with the different problems faced in that area, followed by a brief description of existing work and a discussion of its pros and cons. In addition, we present a real scenario showing how the different techniques are used in implementing a location-based crowdsourcing platform, gMission. Finally, we highlight the limitations of current general-purpose crowdsourcing techniques and present some open problems in this area.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 2
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-08-05
    Description: Many important data management and analytics tasks cannot be completely addressed by automated processes. These tasks, such as entity resolution, sentiment analysis, and image recognition, can be enhanced through the use of human cognitive ability. Crowdsourcing platforms are an effective way to harness the capabilities of people (i.e., the crowd) to apply human computation to such tasks. Thus, crowdsourced data management has become an area of increasing interest in research and industry. We identify three important problems in crowdsourced data management. (1) Quality Control: Workers may return noisy or incorrect results, so effective techniques are required to achieve high quality; (2) Cost Control: The crowd is not free, and cost control aims to reduce the monetary cost; (3) Latency Control: Human workers can be slow, particularly compared to automated computing time scales, so latency-control techniques are required. There has been significant work addressing these three factors for designing crowdsourced tasks, developing crowdsourced data manipulation operators, and optimizing plans consisting of multiple operators. In this paper, we survey and synthesize a wide spectrum of existing studies on crowdsourced data management. Based on this analysis, we then outline key factors that need to be considered to improve crowdsourced data management.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 3
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-08-05
    Description: Principal component analysis combined with the residual error is an effective anomaly detection technique. In an environment where anomalies are present in the training set, the derived principal components can be skewed by the anomalies. A further aspect of anomaly detection is that data might be distributed across different nodes in a network, and communicating them to a centralized processing unit is prohibited by the communication cost. Current solutions to distributed anomaly detection rely on a hierarchical network infrastructure to aggregate data or models; however, in this environment, links close to the root of the tree become critical and congested. In this paper, an algorithm is proposed that is more robust in its derivation of the principal components of a training set containing anomalies. A distributed form of the algorithm is then derived where each node in a network can iterate towards the centralized solution by exchanging small matrices with neighboring nodes. Experimental evaluations on both synthetic and real-world data sets demonstrate the superior performance of the proposed approach in comparison to principal component analysis and alternative anomaly detection techniques. In addition, it is shown that in a variety of network infrastructures, the distributed form of the anomaly detection model is able to derive a close approximation of the centralized model.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
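    The abstract above builds on the PCA residual error for anomaly detection. As a point of reference only, here is a minimal, generic sketch (not the paper's robust or distributed algorithm) of scoring points by their reconstruction error outside the top-k principal subspace; the function name and toy data are illustrative assumptions.
    ```python
    import numpy as np

    def pca_residual_scores(X, k):
        """Score each row of X by its reconstruction error outside the
        top-k principal subspace (larger score = more anomalous)."""
        Xc = X - X.mean(axis=0)                       # center the data
        _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
        P = Vt[:k].T                                  # top-k principal directions
        residual = Xc - Xc @ P @ P.T                  # part not explained by the subspace
        return np.linalg.norm(residual, axis=1)

    # toy usage: 200 correlated "normal" points plus 5 injected anomalies
    rng = np.random.default_rng(0)
    normal = rng.normal(size=(200, 10)) @ rng.normal(size=(10, 10))
    anomalies = rng.normal(loc=8.0, size=(5, 10))
    scores = pca_residual_scores(np.vstack([normal, anomalies]), k=3)
    print(scores.argsort()[-5:])                      # indices of the five highest-scoring points
    ```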
  • 4
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-08-05
    Description: Many applications deal with moving object datasets, e.g., mobile phone social networking, scientific simulations, and ride-sharing services. These applications need to handle a tremendous number of spatial objects that continuously move and execute spatial queries to explore their surroundings. To manage such update-heavy workloads, several throwaway index structures have recently been proposed, where a static index is rebuilt periodically from scratch rather than updated incrementally. It has been shown that throwaway indices outperform specialized moving-object indices that maintain location updates incrementally. However, throwaway indices suffer from limited scalability due to their single-server design, and the only distributed throwaway index (D-MOVIES), an extension of a centralized approach, does not scale out as the number of servers increases, especially during the query processing phase.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 5
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-08-05
    Description: Collaborative filtering (CF) is without question the most widely adopted and successful recommendation approach. A typical CF-based recommender system associates a user with a group of like-minded users based on their individual preferences over all the items, either explicit or implicit, and then recommends to the user some unobserved items enjoyed by the group. However, we find that two users with similar tastes on one item subset may have totally different tastes on another set. In other words, there exist many user-item subgroups, each consisting of a subset of items and a group of like-minded users on these items. It is more reasonable to predict preferences through a user's correlated subgroups rather than the entire user-item matrix. In this paper, to find meaningful subgroups, we formulate a new Multiclass Co-Clustering (MCoC) model, which captures relations of user-to-item, user-to-user, and item-to-item simultaneously. Then, we combine traditional CF algorithms with subgroups for improving their top-$N$ recommendation performance. Our approach can be seen as a new extension of traditional clustering CF models. Systematic experiments on several real data sets have demonstrated the effectiveness of our proposed approach.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 6
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-08-05
    Description: With the rapid development of Web 2.0 and the Online To Offline (O2O) marketing model, various online event-based social networks (EBSNs) are getting popular. An important task of EBSNs is to facilitate the most satisfactory event-participant arrangement for both sides, i.e., events enroll more participants and participants are arranged with personally interesting events. Existing approaches usually focus on the arrangement of each single event to a set of potential users, or ignore the conflicts between different events, which leads to infeasible or redundant arrangements. In this paper, to address the shortcomings of existing approaches, we first identify a more general and useful event-participant arrangement problem, called the Global Event-participant Arrangement with Conflict and Capacity (GEACC) problem, focusing on the conflicts of different events and making event-participant arrangements in a global view. We find that the GEACC problem is NP-hard due to the conflicts among events. Thus, we design two approximation algorithms with provable approximation ratios and an exact algorithm with a pruning technique to address this problem. In addition, we propose an online setting of GEACC, called OnlineGEACC, which is also practical in real-world scenarios. We further design an online algorithm with a provable performance guarantee. Finally, we verify the effectiveness and efficiency of the proposed methods through extensive experiments on real and synthetic datasets.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 7
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-08-05
    Description: Given a point $p$ and a set of points $S$, the kNN operation finds the $k$ closest points to $p$ in $S$. It is a computationally intensive task with a large range of applications such as knowledge discovery or data mining. However, as the volume and the dimension of data increase, only distributed approaches can perform such a costly operation in a reasonable time. Recent works have focused on implementing efficient solutions using the MapReduce programming model because it is suitable for distributed large scale data processing. Although these works provide different solutions to the same problem, each one has particular constraints and properties. In this paper, we compare the different existing approaches for computing kNN on MapReduce, first theoretically, and then by performing an extensive experimental evaluation. To be able to compare solutions, we identify three generic steps for kNN computation on MapReduce: data pre-processing, data partitioning, and computation. We then analyze each step from the load balancing, accuracy, and complexity aspects. Experiments in this paper use a variety of datasets, and analyze the impact of data volume, data dimension, and the value of $k$ from many perspectives like time and space complexity, and accuracy. The experimental part highlights the advantages and shortcomings of each algorithm, which are discussed in detail. To the best of our knowledge, this is the first paper that compares kNN computing methods on MapReduce both theoretically and experimentally in the same setting. Overall, this paper can be used as a guide to tackle kNN-based practical problems in the context of big data.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
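    The abstract above identifies three generic steps for kNN on MapReduce (pre-processing, partitioning, computation). The following toy, single-process sketch imitates that map/reduce pattern in Python; it is an illustration of the generic steps, not any of the surveyed algorithms, and all names are hypothetical.
    ```python
    import heapq
    import numpy as np

    def knn_mapreduce_style(query, points, k, num_partitions=4):
        """Toy kNN in the map/reduce pattern: partition the data, let each 'map'
        task compute a local top-k, then 'reduce' by merging the candidates."""
        partitions = np.array_split(points, num_partitions)        # data partitioning

        def map_task(part):
            d = np.linalg.norm(part - query, axis=1)                # local distances
            idx = np.argsort(d)[:k]                                 # local top-k
            return [(d[i], tuple(part[i])) for i in idx]

        candidates = [pair for part in partitions for pair in map_task(part)]
        return [p for _, p in heapq.nsmallest(k, candidates)]       # global merge step

    rng = np.random.default_rng(1)
    pts = rng.random((10_000, 3))
    print(knn_mapreduce_style(np.array([0.5, 0.5, 0.5]), pts, k=5))
    ```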
  • 8
    Publication Date: 2016-08-05
    Description: Mining communities or clusters in networks is valuable in analyzing, designing, and optimizing many natural and engineering complex systems, e.g., protein networks, power grids, and transportation systems. Most of the existing techniques view the community mining problem as an optimization problem based on a given quality function (e.g., modularity); however, none of them is grounded in a systematic theory for identifying the central nodes in the network. Moreover, how to reconcile mining efficiency and community quality remains an open problem. In this paper, we attempt to address the above challenges by introducing a novel algorithm. First, a kernel function with a tunable influence factor is proposed to measure the leadership of each node; the nodes with the highest local leadership can be viewed as candidate central nodes. Then, we use a discrete-time dynamical system to describe the dynamical assignment of community membership, and formulate several conditions to guarantee the convergence of each node's dynamic trajectory, by which the hierarchical community structure of the network can be revealed. The proposed dynamical system is independent of the quality function used, so it could also be applied to other community mining models. Our algorithm is highly efficient: the computational complexity analysis shows that the execution time is nearly linearly dependent on the number of nodes in sparse networks. We finally give demonstrative applications of the algorithm to a set of synthetic benchmark networks and real-world networks to verify the algorithmic performance.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 9
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-08-05
    Description: When a microblogging user adopts some content propagated to her, we can attribute that to three behavioral factors, namely topic virality, user virality, and user susceptibility. Topic virality measures the degree to which a topic attracts propagations by users. User virality and susceptibility refer to the ability of a user to propagate content to other users, and the propensity of a user adopting content propagated to her, respectively. In this paper, we study the problem of mining these behavioral factors specific to topics from microblogging content propagation data. We first construct a three-dimensional tensor for representing the propagation instances. We then propose a tensor factorization framework to simultaneously derive the three sets of behavioral factors. Based on this framework, we develop a numerical factorization model and another probabilistic factorization variant. We also develop an efficient algorithm for learning the models' parameters. Our experiments on a large Twitter dataset and synthetic datasets show that the proposed models can effectively mine the topic-specific behavioral factors of users and tweet topics. We further demonstrate that the proposed models consistently outperform other state-of-the-art content-based models in retweet prediction over time.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 10
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-08-05
    Description: We propose an algorithm for detecting patterns exhibited by anomalous clusters in high dimensional discrete data. Unlike most anomaly detection (AD) methods, which detect individual anomalies, our proposed method detects groups (clusters) of anomalies, i.e., sets of points which collectively exhibit abnormal patterns. In many applications, this can lead to a better understanding of the nature of the atypical behavior and to identifying the sources of the anomalies. Moreover, we consider the case where the atypical patterns are exhibited on only a small (salient) subset of the very high dimensional feature space. Individual AD techniques and techniques that detect anomalies using all the features typically fail to detect such anomalies, but our method can detect such instances collectively, discover the shared anomalous patterns exhibited by them, and identify the subsets of salient features. In this paper, we focus on detecting anomalous topics in a batch of text documents, developing our algorithm based on topic models. Results of our experiments show that our method can accurately detect anomalous topics and salient features (words) under each such topic in a synthetic data set and two real-world text corpora, and achieves better performance compared to both standard group AD and individual AD techniques. All required code to reproduce our experiments is available from https://github.com/hsoleimani/ATD.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 11
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-08-05
    Description: Multilabel classification is prevalent in many real-world applications where data instances may be associated with multiple labels simultaneously. In multilabel classification, exploiting label correlations is an essential but nontrivial task. Most of the existing multilabel learning algorithms are either ineffective or computationally demanding and less scalable in exploiting label correlations. In this paper, we propose a co-evolutionary multilabel hypernetwork (Co-MLHN) as an attempt to exploit label correlations in an effective and efficient way. To this end, we firstly convert the traditional hypernetwork into a multilabel hypernetwork (MLHN) where label correlations are explicitly represented. We then propose a co-evolutionary learning algorithm to learn an integrated classification model for all labels. The proposed Co-MLHN exploits arbitrary order label correlations and has linear computational complexity with respect to the number of labels. Empirical studies on a broad range of multilabel data sets demonstrate that Co-MLHN achieves competitive results against state-of-the-art multilabel learning algorithms, in terms of both classification performance and scalability with respect to the number of labels.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 12
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-08-05
    Description: Query auto completion (QAC) methods recommend queries to search engine users when they start entering a query. Current QAC methods mostly rank query completions based on their past popularity, i.e., on the number of times they have previously been submitted as a query. However, query popularity changes over time and may vary drastically across users. Accordingly, the ranking of query completions should be adjusted. Previous time-sensitive and user-specific QAC methods have been developed separately, yielding significant improvements over methods that are neither time-sensitive nor personalized. We propose a hybrid QAC method that is both time-sensitive and personalized. We extend it to handle long-tail prefixes, which we achieve by assigning optimal weights to the contribution from time-sensitivity and personalization. Using real-world search log datasets, we return top $N$ query suggestions ranked by predicted popularity as estimated from popularity trends and cyclic popularity behavior; we rerank them by integrating similarities to a user's previous queries (both in the current session and in previous sessions). Our method outperforms state-of-the-art time-sensitive QAC baselines, achieving total improvements of between 3 and 7 percent in terms of mean reciprocal rank (MRR). After optimizing the weights, our extended model achieves MRR improvements of between 4 and 8 percent.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
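    The abstract above reports improvements in mean reciprocal rank (MRR) for query auto completion. As an illustration only, here is a minimal sketch of popularity-based completion ranking and the MRR metric; it is not the paper's hybrid time-sensitive, personalized method, and the toy query log is invented.
    ```python
    from collections import Counter

    def popularity_qac(log, prefix, n=3):
        """Rank query completions for a prefix by their past popularity in the log."""
        counts = Counter(q for q in log if q.startswith(prefix))
        return [q for q, _ in counts.most_common(n)]

    def mean_reciprocal_rank(ranked_lists, targets):
        """MRR: average of 1/rank of the first relevant suggestion (0 if absent)."""
        rr = [1.0 / (r.index(t) + 1) if t in r else 0.0
              for r, t in zip(ranked_lists, targets)]
        return sum(rr) / len(rr)

    log = ["weather berlin", "weather paris", "weather berlin", "web mail", "weather berlin"]
    suggestions = popularity_qac(log, "we")
    print(suggestions)                                  # most popular completions first
    print(mean_reciprocal_rank([suggestions], ["weather paris"]))
    ```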
  • 13
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-08-05
    Description: Wrappers are pieces of software used to extract data from websites and structure them for further application processing. Unfortunately, websites are continuously evolving and structural changes happen with no forewarning, which usually results in wrappers working incorrectly. Thus, wrapper maintenance is necessary to detect whether a wrapper is extracting erroneous data. The solution consists of using verification models to detect whether the wrapper output is statistically similar to the output produced by the wrapper itself when it was successfully invoked in the past. Current proposals present some weaknesses, as the data used to build these models are supposed to be homogeneous, independent, or representative enough, or to follow a single predefined mathematical model. In this paper, we present MAVE, a novel multilevel wrapper verification system that is based on one-class classification techniques to overcome the previous weaknesses. The experimental results show that our proposal outperforms current solutions in accuracy.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 14
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-08-05
    Description: Online portfolio selection has attracted increasing attention from data mining and machine learning communities in recent years. An important theory in financial markets is mean reversion, which plays a critical role in some state-of-the-art portfolio selection strategies. Although existing mean reversion strategies have been shown to achieve good empirical performance on certain datasets, they seldom carefully deal with noise and outliers in the data, leading to suboptimal portfolios, and consequently yielding poor performance in practice. In this paper, we propose to exploit the reversion phenomenon by using robust $L_1$ -median estimators, and design a novel online portfolio selection strategy named “Robust Median Reversion” (RMR), which constructs optimal portfolios based on the improved reversion estimator. We examine the performance of the proposed algorithms on various real markets with extensive experiments. Empirical results show that RMR can overcome the drawbacks of existing mean reversion algorithms and achieve significantly better results. Finally, RMR runs in linear time, and thus is suitable for large-scale real-time algorithmic trading applications.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
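    The abstract above builds portfolios on a robust $L_1$-median estimator. For orientation only, a minimal Weiszfeld-style iteration for the $L_1$ (geometric) median is sketched below; it is a generic estimator, not the RMR strategy itself, and the toy price data are invented.
    ```python
    import numpy as np

    def l1_median(X, iters=100, eps=1e-9):
        """Weiszfeld-style iteration for the geometric (L1) median of the rows of X,
        a robust alternative to the mean in the presence of outliers."""
        mu = X.mean(axis=0)                            # start from the ordinary mean
        for _ in range(iters):
            d = np.maximum(np.linalg.norm(X - mu, axis=1), eps)   # avoid division by zero
            w = 1.0 / d
            mu_new = (w[:, None] * X).sum(axis=0) / w.sum()
            if np.linalg.norm(mu_new - mu) < eps:
                break
            mu = mu_new
        return mu

    prices = np.array([[10.0, 20.0], [10.2, 19.8], [9.9, 20.1], [30.0, 5.0]])  # one outlier
    print(prices.mean(axis=0))    # the mean is pulled toward the outlier
    print(l1_median(prices))      # the L1-median stays near the bulk of the data
    ```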
  • 15
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-08-05
    Description: The development of a topic in a set of topic documents is constituted by a series of person interactions at a specific time and place. Knowing the interactions of the persons mentioned in these documents is helpful for readers to better comprehend the documents. In this paper, we propose a topic person interaction detection method called SPIRIT, which classifies the text segments in a set of topic documents that convey person interactions. We design the rich interactive tree structure to represent syntactic, context, and semantic information of text, and this structure is incorporated into a tree-based convolution kernel to identify interactive segments. Experiment results based on real world topics demonstrate that the proposed rich interactive tree structure effectively detects the topic person interactions and that our method outperforms many well-known relation extraction and protein-protein interaction methods.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 16
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-08-05
    Description: Social recommendation systems have attracted a lot of attention recently in the research communities of information retrieval, machine learning, and data mining. Traditional social recommendation algorithms are often based on batch machine learning methods, which suffer from several critical limitations, e.g., the extremely expensive cost of retraining the model whenever new user ratings arrive and the inability to capture changes in user preferences over time. Therefore, it is important to make social recommendation systems suitable for real-world online applications where data often arrives sequentially and user preferences may change dynamically and rapidly. In this paper, we present a new framework of online social recommendation from the viewpoint of online graph regularized user preference learning (OGRPL), which incorporates both the collaborative user-item relationship and item content features into a unified preference learning process. We further develop an efficient iterative procedure, OGRPL-FW, which utilizes the Frank-Wolfe algorithm to solve the proposed online optimization problem. We conduct extensive experiments on several large-scale datasets, in which the encouraging results demonstrate that the proposed algorithms obtain significantly lower errors (in terms of both RMSE and MAE) than the state-of-the-art online recommendation methods when receiving the same amount of training data in the online learning process.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 17
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-08-05
    Description: In this paper, we revisit the private over-threshold data aggregation problem. We formally define the problem's security requirements as both data and user privacy goals. To achieve both goals, and to strike a balance between efficiency and functionality, we devise an efficient cryptographic construction and its proxy-based variant. Both schemes are provably secure in the semi-honest model. Our key idea for the constructions and their malicious variants is to compose two encryption functions tightly coupled in a way that the two functions are commutative and one public-key encryption has an additive homomorphism. We call that double encryption. We analyze the computational and communication complexities of our construction, and show that it is much more efficient than the existing protocols in the literature. Specifically, our protocol has linear complexity in computation and communication with respect to the number of users. Its round complexity is also linear in the number of users. Finally, we show that our basic protocol is efficiently transformed into a stronger protocol secure in the presence of malicious adversaries, and provide the resulting protocol's performance and security analysis.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 18
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-08-05
    Description: General health examination is an integral part of healthcare in many countries. Identifying the participants at risk is important for early warning and preventive intervention. The fundamental challenge of learning a classification model for risk prediction lies in the unlabeled data that constitutes the majority of the collected dataset. Particularly, the unlabeled data describes the participants in health examinations whose health conditions can vary greatly from healthy to very-ill. There is no ground truth for differentiating their states of health. In this paper, we propose a graph-based, semi-supervised learning algorithm called SHG-Health (Semi-supervised Heterogeneous Graph on Health) for risk predictions to classify a progressively developing situation with the majority of the data unlabeled. An efficient iterative algorithm is designed and the proof of convergence is given. Extensive experiments based on both real health examination datasets and synthetic datasets are performed to show the effectiveness and efficiency of our method.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 19
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-08-05
    Description: Automated feature selection is important for text categorization to reduce feature size and to speed up the learning process of classifiers. In this paper, we present a novel and efficient feature selection framework based on information theory, which aims to rank the features by their discriminative capacity for classification. We first revisit two information measures, Kullback-Leibler divergence and Jeffreys divergence, for binary hypothesis testing and analyze their asymptotic properties relating to type I and type II errors of a Bayesian classifier. We then introduce a new divergence measure, called the Jeffreys-Multi-Hypothesis (JMH) divergence, to measure multi-distribution divergence for multi-class classification. Based on the JMH-divergence, we develop two efficient feature selection methods, termed maximum discrimination ($MD$) and methods, for text categorization. The promising results of extensive experiments demonstrate the effectiveness of the proposed approaches.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
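    The abstract above ranks features using Kullback-Leibler and Jeffreys divergences. A minimal two-class illustration of these measures is sketched below; it is not the paper's JMH divergence or its feature selection methods, and the toy term counts are invented.
    ```python
    import numpy as np

    def kl(p, q, eps=1e-12):
        """Kullback-Leibler divergence D(p || q) between discrete distributions."""
        p = np.asarray(p, dtype=float) + eps
        q = np.asarray(q, dtype=float) + eps
        p, q = p / p.sum(), q / q.sum()
        return float(np.sum(p * np.log(p / q)))

    def jeffreys(p, q):
        """Jeffreys divergence: the symmetrized KL, J(p, q) = D(p||q) + D(q||p)."""
        return kl(p, q) + kl(q, p)

    # toy two-class text data: term frequency vs. rest-of-class frequency per feature
    class_pos = {"good": 40, "bad": 2, "film": 30}
    class_neg = {"good": 3, "bad": 45, "film": 28}
    for term in class_pos:
        p = [class_pos[term], sum(class_pos.values()) - class_pos[term]]
        q = [class_neg[term], sum(class_neg.values()) - class_neg[term]]
        print(term, round(jeffreys(p, q), 3))           # discriminative terms score higher
    ```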
  • 20
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-07-08
    Description: The proliferation of location-based social networks, such as Foursquare and Facebook Places, offers a variety of ways to record human mobility, including user-generated geo-tagged content, check-in services, and mobile apps. Although trajectory data is of great value to many applications, it is challenging to analyze and mine trajectory data due to the complex characteristics reflected in human mobility, which is affected by multiple kinds of contextual information. In this paper, we propose a Multi-Context Trajectory Embedding Model, called MC-TEM, to explore contexts in a systematic way. MC-TEM is developed in the distributed representation learning framework, and it is flexible enough to characterize various kinds of useful contexts for different applications. To the best of our knowledge, this is the first time that distributed representation learning methods have been applied to trajectory data. We formally incorporate multiple kinds of context information of trajectory data into the proposed model, including user-level, trajectory-level, location-level, and temporal contexts. All the context information is represented in the same embedding space. We apply MC-TEM to two challenging tasks, namely location recommendation and social link prediction. We conduct extensive experiments on three real-world datasets. The experimental results demonstrate the superiority of our MC-TEM model over several state-of-the-art methods.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 21
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-07-08
    Description: The study of urban networks reveals that the accessibility of important city objects for vehicle traffic and pedestrians is significantly correlated with the popularity, micro-criminality, micro-economic vitality, and social liveability of the city, and is always the chief factor in regulating the growth and expansion of the city. The accessibility between different components of an urban structure is frequently measured along the streets and routes considered as edges of a planar graph, while the ultimate traffic destination points and street junctions are treated as vertices. To estimate the accessibility of a destination vertex $j$ from a vertex $i$ through the urban network, random walks are used to calculate the expected distance a random walker starting from $i$ covers before $j$ is visited (known as the access time). State-of-the-art access time computation is costly in large planar graphs since it involves matrix operations over the entire graph. The time complexity is $O(n^{2.376})$, where $n$ is the number of vertices in the planar graph. To enable efficient access time query answering in large planar graphs, this work proposes the first access time oracle, which is based on the proposed access time decomposition and reconstruction scheme. The oracle is a hierarchical data structure with a deliberate design of the relationships between different hierarchical levels. The storage requirement of the proposed oracle is $O(n^{\frac{4}{3}}\log\log n)$ and the access time query response time is $O(n^{\frac{2}{3}})$. Extensive tests on a number of large real-world road networks (with up to about 2 million vertices) have verified the superiority of the proposed oracle.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
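    The abstract above defines the access time as the expected number of steps a random walker starting at $i$ needs before visiting $j$. As a plain Monte Carlo illustration of that quantity (the very computation the proposed oracle is designed to avoid at query time), consider the sketch below; the tiny graph is invented.
    ```python
    import random
    from statistics import mean

    def estimate_access_time(adj, i, j, walks=2000, max_steps=10_000):
        """Monte Carlo estimate of the access (hitting) time from i to j: the expected
        number of steps a random walker starting at i needs before reaching j."""
        lengths = []
        for _ in range(walks):
            v, steps = i, 0
            while v != j and steps < max_steps:
                v = random.choice(adj[v])                # move to a uniformly random neighbor
                steps += 1
            lengths.append(steps)
        return mean(lengths)

    # tiny undirected "street network": four junctions on a cycle plus one chord
    adj = {0: [1, 3], 1: [0, 2, 3], 2: [1, 3], 3: [0, 1, 2]}
    random.seed(42)
    print(estimate_access_time(adj, 0, 2))
    ```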
  • 22
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-07-08
    Description: Clustering is one of the research hotspots in the field of data mining and has extensive applications in practice. Recently, Rodriguez and Laio [1] published a clustering algorithm in Science that identifies cluster centers in an intuitive way and clusters objects efficiently and effectively. However, the algorithm is sensitive to a preassigned parameter and struggles to identify the “ideal” number of clusters. To overcome these shortcomings, this paper proposes a new clustering algorithm that can detect the cluster centers automatically via statistical testing. Specifically, the proposed algorithm first defines a new metric to measure the density of an object that is more robust to the preassigned parameter, and further derives a metric to evaluate the centrality of each object. Afterwards, it identifies the objects with extremely large centrality metrics as the cluster centers via an outward statistical testing method. Finally, it groups the remaining objects into clusters containing their nearest neighbors with higher density. Extensive experiments are conducted over different kinds of clustering data sets to evaluate the performance of the proposed algorithm and to compare it with the algorithm published in Science. The results show the effectiveness and robustness of the proposed algorithm.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
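    The abstract above extends the density-peak idea of Rodriguez and Laio, in which candidate centers have both high local density and a large distance to any denser point. A minimal sketch of those two quantities is given below; it uses a fixed cutoff radius rather than the paper's statistical testing, and the toy data are invented.
    ```python
    import numpy as np

    def density_peak_stats(X, dc):
        """For each point: local density rho (neighbors within radius dc) and delta,
        the distance to the nearest point of strictly higher density."""
        D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
        rho = (D < dc).sum(axis=1) - 1                   # exclude the point itself
        delta = np.zeros(len(X))
        for i in range(len(X)):
            higher = np.where(rho > rho[i])[0]
            delta[i] = D[i].max() if higher.size == 0 else D[i, higher].min()
        return rho, delta

    rng = np.random.default_rng(3)
    X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(4, 0.3, (50, 2))])
    rho, delta = density_peak_stats(X, dc=0.5)
    print(np.argsort(rho * delta)[-2:])                  # candidate centers: high rho and high delta
    ```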
  • 23
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-07-08
    Description: A representative skyline contains $k$ skyline points that can represent its corresponding full skyline. The existing measuring criteria of $k$ representative skylines are specifically designed for static data, and they cannot effectively handle streaming data. In this paper, we focus on the problem of calculating the $k$ representative skyline over data streams. First, we propose a new criterion to choose $k$ skyline points as the $k$ representative skyline for data stream environments, termed the $k$-largest dominance skyline ($k$-LDS), which is representative of the entire data set and is highly stable over the streaming data. Second, we propose an efficient exact algorithm, called the Prefix-based Algorithm (PBA), to solve the $k$-LDS problem in a 2-dimensional space. The time complexity of PBA is only $\mathcal{O}((M-k)\times k)$, where $M$ is the size of the full skyline set. Third, the $k$-LDS problem for a $d$-dimensional ($d\ge 3$) space turns out to be very complex. Therefore, a greedy algorithm is designed to answer $k$-LDS queries. To further accelerate the calculation, we propose an $\epsilon$-greedy algorithm which can achieve an approximation factor of $\frac{1}{1+\epsilon}\big(1-\frac{1}{\sqrt{e}}\big)$.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 24
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-07-08
    Description: Domain adaptation generalizes a learning model across a source domain and a target domain that are sampled from different distributions. It is widely applied to cross-domain data mining for reusing labeled information and mitigating labeling effort. Recent studies reveal that deep neural networks can learn abstract feature representations, which can reduce, but not remove, the cross-domain discrepancy. To enhance the invariance of deep representations and make them more transferable across domains, we propose a unified deep adaptation framework for jointly learning a transferable representation and classifier to enable scalable domain adaptation, taking advantage of both deep learning and optimal two-sample matching. The framework comprises two inter-dependent paradigms: unsupervised pre-training for effective training of deep models using deep denoising autoencoders, and supervised fine-tuning for effective exploitation of discriminative information using deep neural networks, both learned by embedding the deep representations into reproducing kernel Hilbert spaces (RKHSs) and optimally matching different domain distributions. To enable scalable learning, we develop a linear-time algorithm using an unbiased estimate that scales linearly to large samples. Extensive empirical results show that the proposed framework significantly outperforms state-of-the-art methods on diverse adaptation tasks: sentiment polarity prediction, email spam filtering, newsgroup content categorization, and visual object recognition.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 25
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-07-08
    Description: With the rapid growth of various applications on the Internet, recommender systems have become fundamental in helping users alleviate the problem of information overload. Since contextual information is a significant factor in modeling user behavior, various context-aware recommendation methods have been proposed recently. The state-of-the-art context modeling methods usually treat contexts as certain dimensions similar to those of users and items, and capture relevances between contexts and users/items. However, such relevance is difficult to explain. Some works on multi-domain relation prediction can also be used for context-aware recommendation, but they have limitations in generating recommendations under a large amount of contextual information. Motivated by recent works in natural language processing, we represent each context value with a latent vector, and model the contextual information as a semantic operation on the user and item. In addition, we use the contextual operating tensor to capture the common semantic effects of contexts. Experimental results show that the proposed Context Operating Tensor (COT) model yields significant improvements over competitive baseline methods on three typical datasets. From the experimental results of COT, we also obtain some interesting observations that follow our intuition.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 26
    Publication Date: 2016-07-08
    Description: In many applications, one can obtain descriptions about the same objects or events from a variety of sources. As a result, this will inevitably lead to data or information conflicts. One important problem is to identify the true information (i.e., the truths ) among conflicting sources of data. It is intuitive to trust reliable sources more when deriving the truths, but it is usually unknown which one is more reliable a priori . Moreover, each source possesses a variety of properties with different data types. An accurate estimation of source reliability has to be made by modeling multiple properties in a unified model. Existing conflict resolution work either does not conduct source reliability estimation, or models multiple properties separately. In this paper, we propose to resolve conflicts among multiple sources of heterogeneous data types. We model the problem using an optimization framework where truths and source reliability are defined as two sets of unknown variables. The objective is to minimize the overall weighted deviation between the truths and the multi-source observations where each source is weighted by its reliability. Different loss functions can be incorporated into this framework to recognize the characteristics of various data types, and efficient computation approaches are developed. The proposed framework is further adapted to deal with streaming data in an incremental fashion and large-scale data in MapReduce model. Experiments on real-world weather, stock, and flight data as well as simulated multi-source data demonstrate the advantage of jointly modeling different data types in the proposed framework.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
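    The abstract above formulates truth discovery as minimizing the reliability-weighted deviation between truths and multi-source observations. A heavily simplified, numeric-only sketch of that alternating idea is shown below; the weight update is a generic choice, not necessarily the paper's, and the toy observations are invented.
    ```python
    import numpy as np

    def truth_discovery(obs, iters=20, eps=1e-9):
        """obs[s, e] is the numeric value source s reports for entity e.  Alternate
        between estimating truths (reliability-weighted mean) and source weights
        (smaller deviation from the truths -> larger weight)."""
        S, _ = obs.shape
        w = np.ones(S) / S
        for _ in range(iters):
            truths = (w[:, None] * obs).sum(axis=0) / w.sum()    # weighted aggregation
            err = ((obs - truths) ** 2).sum(axis=1) + eps        # per-source deviation
            w = -np.log(err / err.sum())                         # reliable sources get larger weight
        return truths, w

    obs = np.array([[20.1, 30.2, 9.9],      # accurate source
                    [20.0, 30.0, 10.0],     # accurate source
                    [25.0, 28.0, 14.0]])    # noisy source
    truths, weights = truth_discovery(obs)
    print(truths.round(2), weights.round(2))
    ```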
  • 27
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-07-08
    Description: Detection of non-overlapping and overlapping communities is essentially the same problem. However, current algorithms focus on finding either overlapping or non-overlapping communities. We present a generalized framework that can identify both non-overlapping and overlapping communities, without any prior input about the network or its community distribution. To do so, we introduce a vertex-based metric, GenPerm, that quantifies how much a vertex belongs to each of its constituent communities. Our community detection algorithm is based on maximizing GenPerm over all the vertices in the network. We demonstrate, through experiments over synthetic and real-world networks, that GenPerm is more effective than other metrics in evaluating community structure. Further, we show that due to its vertex-centric property, GenPerm can be used to unfold several inferences beyond community detection, such as core-periphery analysis and message spreading. Our algorithm for maximizing GenPerm outperforms six state-of-the-art algorithms in accurately predicting the ground-truth labels. Finally, we discuss the problem of resolution limit in overlapping communities and demonstrate that maximizing GenPerm can mitigate this problem.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 28
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-07-08
    Description: Social hierarchy (i.e., the pyramid structure of societies) is a fundamental concept in sociology and social network analysis. The importance of social hierarchy in a social network is that the topological structure of the social hierarchy is essential both in shaping the nature of social interactions between individuals and in unfolding the structure of the social networks. The social hierarchy found in a social network can be utilized to improve the accuracy of link prediction, provide better query results, rank web pages, and study information flow and spread in complex networks. In this paper, we model a social network as a directed graph $G$, and consider the social hierarchy as a DAG (directed acyclic graph) of $G$, denoted as $G_D$. In the DAG, all the vertices in $G$ can be partitioned into different levels, the vertices at the same level represent a disjoint group in the social hierarchy, and all the edges in the DAG follow one direction. The main issue we study in this paper is how to find the DAG $G_D$ in $G$. The approach we take is to find $G_D$ by removing all possible cycles from $G$ such that $G = \mathcal{U}(G) \cup G_D$, where $\mathcal{U}(G)$ is a maximum Eulerian subgraph which contains all possible cycles. We give the reasons for doing so, investigate the properties of the $G_D$ found, and discuss the applications. In addition, we develop a novel two-phase algorithm, called Greedy-&-Refine, which greedily computes an Eulerian subgraph and then refines this greedy solution to find the maximum Eulerian subgraph. We give a bound between the greedy solution and the optimal one. The quality of our greedy approach is high. We conduct comprehensive experimental studies over 14 real-world datasets. The results show that our algorithms are at least two orders of magnitude faster than the baseline algorithm.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 29
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-07-08
    Description: Visualization provides a powerful means for data analysis. But to be practical, visual analytics tools must support smooth and flexible use of visualizations at a fast rate. This becomes increasingly onerous with the ever-increasing size of real-world datasets. First, large databases make interaction more difficult once query response time exceeds several seconds. Second, any attempt to show all data points will overload the visualization, resulting in chaos that will only confuse the user. Over the last few years, substantial effort has been put into addressing both of these issues and many innovative solutions have been proposed. Indeed, data visualization is a topic that is too large to be addressed in a single survey paper. Thus, we restrict our attention here to interactive visualization of large data sets. Our focus then is skewed in a natural way towards query processing problem—provided by an underlying database system—rather than to the actual data visualization problem.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 30
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-07-08
    Description: Almost all of the existing domain adaptation methods assume that all test data belong to a single stationary target distribution. However, in many real world applications, data arrive sequentially and the data distribution is continuously evolving. In this paper, we tackle the problem of adaptation to a continuously evolving target domain that has been recently introduced. We assume that the available data for the source domain are labeled but the examples of the target domain can be unlabeled and arrive sequentially. Moreover, the distribution of the target domain can evolve continuously over time. We propose the Evolving Domain Adaptation (EDA) method that first finds a new feature space in which the source domain and the current target domain are approximately indistinguishable. Therefore, source and target domain data are similarly distributed in the new feature space and we use a semi-supervised classification method to utilize both the unlabeled data of the target domain and the labeled data of the source domain. Since test data arrives sequentially, we propose an incremental approach both for finding the new feature space and for semi-supervised classification. Experiments on several real datasets demonstrate the superiority of our proposed method in comparison to the other recent methods.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 31
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-07-08
    Description: The widespread use of online social networks (OSNs) to disseminate information and exchange opinions, by the general public, news media, and political actors alike, has enabled new avenues of research in computational political science. In this paper, we study the problem of quantifying and inferring the political leaning of Twitter users. We formulate political leaning inference as a convex optimization problem that incorporates two ideas: (a) users are consistent in their actions of tweeting and retweeting about political issues, and (b) similar users tend to be retweeted by similar audience. We then apply our inference technique to 119 million election-related tweets collected in seven months during the 2012 U.S. presidential election campaign. On a set of frequently retweeted sources, our technique achieves 94 percent accuracy and high rank correlation as compared with manually created labels. By studying the political leaning of 1,000 frequently retweeted sources, 232,000 ordinary users who retweeted them, and the hashtags used by these sources, our quantitative study sheds light on the political demographics of the Twitter population, and the temporal dynamics of political polarization as events unfold.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 32
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-07-08
    Description: Comparing two clustering results of a data set is a challenging task in cluster analysis. Many external validity measures have been proposed in the literature. A good measure should be invariant to changes in data size, cluster size, and number of clusters. We give an overview of existing set matching indexes and analyze their properties. Set matching measures are based on matching clusters from two clusterings. We analyze the measures in three parts: 1) cluster similarity, 2) matching, and 3) overall measurement. Correction for chance is also investigated, and we prove that normalized mutual information and variation of information are intrinsically corrected. We propose a new scheme of experiments based on synthetic data for evaluating an external validity index. Accordingly, popular external indexes are evaluated and compared when applied to clusterings of different data size, cluster size, and number of clusters. The experiments show that set matching measures are clearly better than the others tested. Based on the analytical comparisons, we introduce a new index called the Pair Sets Index (PSI).
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
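    The abstract above discusses external validity indexes such as normalized mutual information (NMI). For reference, a minimal sketch of computing NMI between two clusterings from their contingency table is given below; it is a standard definition, not the proposed Pair Sets Index.
    ```python
    import numpy as np

    def nmi(labels_a, labels_b):
        """Normalized mutual information between two clusterings of the same points."""
        a, b = np.asarray(labels_a), np.asarray(labels_b)
        n = len(a)
        ca, cb = np.unique(a), np.unique(b)
        # contingency table: points falling in cluster i of A and cluster j of B
        cont = np.array([[np.sum((a == i) & (b == j)) for j in cb] for i in ca], dtype=float)
        pij = cont / n
        pi = pij.sum(axis=1, keepdims=True)
        pj = pij.sum(axis=0, keepdims=True)
        nz = pij > 0
        mi = np.sum(pij[nz] * np.log(pij[nz] / (pi @ pj)[nz]))
        h = lambda p: -np.sum(p[p > 0] * np.log(p[p > 0]))        # entropy
        return mi / np.sqrt(h(pi.ravel()) * h(pj.ravel()))

    print(nmi([0, 0, 1, 1, 2, 2], [1, 1, 0, 0, 2, 2]))   # same partition up to relabeling -> 1.0
    print(nmi([0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]))   # unrelated partition -> 0.0
    ```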
  • 33
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-07-08
    Description: Many commercial products and academic research activities are embracing behavior analysis as a technique for improving detection of attacks of many sorts—from retweet boosting, hashtag hijacking to link advertising. Traditional approaches focus on detecting dense blocks in the adjacency matrix of graph data, and recently, the tensors of multimodal data. No method gives a principled way to score the suspiciousness of dense blocks with different numbers of modes and rank them to draw human attention accordingly. In this paper, we first give a list of axioms that any metric of suspiciousness should satisfy; we propose an intuitive, principled metric that satisfies the axioms, and is fast to compute; moreover, we propose CrossSpot , an algorithm to spot dense blocks that are worth inspecting, typically indicating fraud or some other noteworthy deviation from the usual, and sort them in the order of importance (“suspiciousness”). Finally, we apply CrossSpot to the real data, where it improves the F1 score over previous techniques by 68 percent and finds suspicious behavioral patterns in social datasets spanning 0.3 billion posts.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 34
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-07-08
    Description: With the rapid development of mobile devices and crowdsourcing platforms, spatial crowdsourcing has attracted much attention from the database community. Specifically, spatial crowdsourcing refers to sending location-based requests to workers based on their current positions. In this paper, we consider a spatial crowdsourcing scenario in which each worker has a set of qualified skills, whereas each spatial task (e.g., repairing a house, decorating a room, or performing entertainment shows for a ceremony) is time-constrained, subject to a budget constraint, and requires a set of skills. Under this scenario, we study an important problem, namely multi-skill spatial crowdsourcing (MS-SC), which finds an optimal worker-and-task assignment strategy such that the skills of workers and tasks match each other and workers' benefits are maximized under the budget constraint. We prove that the MS-SC problem is NP-hard and intractable. Therefore, we propose three effective heuristic approaches, including greedy, $g$-divide-and-conquer, and cost-model-based adaptive algorithms, to obtain worker-and-task assignments. Through extensive experiments, we demonstrate the efficiency and effectiveness of our MS-SC processing approaches on both real and synthetic data sets.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 35
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-07-08
    Description: Twitter has become one of the largest microblogging platforms for users around the world to share anything happening around them with friends and beyond. A bursty topic in Twitter is one that triggers a surge of relevant tweets within a short period of time, which often reflects important events of mass interest. How to leverage Twitter for early detection of bursty topics has therefore become an important research problem with immense practical value. Despite the wealth of research work on topic modelling and analysis in Twitter, it remains a challenge to detect bursty topics in real time. As existing methods can hardly scale to handle the task with the tweet stream in real time, we propose in this paper TopicSketch, a sketch-based topic model together with a set of techniques to achieve real-time detection. We evaluate our solution on a tweet stream with over 30 million tweets. Our experimental results show both the efficiency and the effectiveness of our approach. In particular, we also demonstrate that TopicSketch on a single machine can potentially handle hundreds of millions of tweets per day, which is on the same scale as the total number of daily tweets in Twitter, and present bursty events at a finer granularity.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
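    Illustrative sketch: as a loose analogy to sketch-based processing of a tweet stream, the Python snippet below keeps a count-min sketch of term counts for the previous and current time windows and flags terms whose estimated counts grow sharply. The window layout, thresholds, and hash scheme are assumptions; this is a generic burst heuristic, not the TopicSketch model.

import hashlib

class CountMinSketch:
    """Fixed-size sketch for approximate term counts (illustrative)."""
    def __init__(self, width=2048, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _buckets(self, term):
        for row in range(self.depth):
            digest = hashlib.md5(f"{row}:{term}".encode()).hexdigest()
            yield row, int(digest, 16) % self.width

    def add(self, term, count=1):
        for row, col in self._buckets(term):
            self.table[row][col] += count

    def estimate(self, term):
        return min(self.table[row][col] for row, col in self._buckets(term))

def bursty_terms(previous_window, current_window, candidates, min_growth=5.0):
    """Flag candidate terms whose estimated count grows sharply between windows."""
    bursts = []
    for term in candidates:
        before = previous_window.estimate(term) + 1  # +1 smoothing
        after = current_window.estimate(term)
        if after / before >= min_growth:
            bursts.append((term, after / before))
    return sorted(bursts, key=lambda x: -x[1])

prev, curr = CountMinSketch(), CountMinSketch()
for _ in range(3):
    prev.add("football")
for _ in range(40):
    curr.add("earthquake")
print(bursty_terms(prev, curr, ["football", "earthquake"]))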
  • 36
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-07-08
    Description: Recent studies show that disk-based graph computation systems on just a single PC can be highly competitive with cluster-based systems on large-scale problems. Inspired by this remarkable progress, we develop VENUS, a disk-based graph computation system that is able to handle billion-scale graphs efficiently on a commodity PC. VENUS adopts a novel computing architecture that features vertex-centric “streamlined” processing: the graph is sequentially loaded and an update function is executed for each vertex in parallel on the fly. VENUS deliberately avoids loading batch edge data by separating read-only structure data from mutable vertex data on disk, and minimizes random IOs by caching vertex data in main memory whenever possible. The streamlined processing is realized with an efficient sequential scan over the massive structure data and fast feeding of the update function for a large number of vertices. Extensive evaluation on large real-world and synthetic graphs has demonstrated the efficiency of VENUS. For example, to run the PageRank algorithm on a Twitter graph of 42 million vertices and 1.4 billion edges, Spark needs 8.1 minutes with 50 machines and GraphChi spends 13 minutes using a high-speed SSD, while VENUS takes only 5 minutes on one machine with an ordinary hard disk.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
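    Illustrative sketch: the vertex-centric "streamlined" idea above can be pictured with a minimal in-memory emulation in Python, where a read-only adjacency structure is scanned sequentially while a mutable per-vertex value is recomputed by a PageRank-style update function. The disk layout, sharding, and IO scheduling of VENUS are far more involved; the data structures here are assumptions for illustration.

def pagerank_vertex_centric(out_edges, num_iters=20, damping=0.85):
    """Vertex-centric PageRank sketch.

    out_edges: {vertex: [neighbor, ...]} plays the role of the read-only
    structure data; `rank` plays the role of the mutable vertex data.
    """
    vertices = list(out_edges)
    n = len(vertices)
    rank = {v: 1.0 / n for v in vertices}

    for _ in range(num_iters):
        # Scatter: each vertex spreads its rank over its out-edges.
        incoming = {v: 0.0 for v in vertices}
        for v in vertices:                      # sequential scan over structure
            neighbors = out_edges[v]
            if not neighbors:
                continue
            share = rank[v] / len(neighbors)
            for u in neighbors:
                incoming[u] += share
        # Gather: the update function recomputes each vertex value.
        rank = {v: (1 - damping) / n + damping * incoming[v] for v in vertices}
    return rank

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
print(pagerank_vertex_centric(graph))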
  • 37
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-07-08
    Description: The problem of counting triangles in graphs has been well studied in the literature. However, all existing algorithms, exact or approximate, spend at least linear time in the size of the graph (except a recent theoretical result), which can be prohibitive on today's large graphs. Nevertheless, we observe that the ideas in many existing triangle counting algorithms can be coupled with random sampling to yield potentially sublinear-time algorithms that return an approximation of the triangle count without looking at the whole graph. This paper makes these random sampling algorithms more explicit, and presents an experimental and analytical comparison of different approaches, identifying the best performers among a number of candidates.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
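    Illustrative sketch: one standard way to couple triangle counting with random sampling, in the spirit described above, is wedge sampling: sample length-two paths and check how often they close into triangles. The Python snippet below assumes an adjacency-set representation; the paper's candidate algorithms and their analysis differ in detail.

import random

def estimate_triangles(adj, num_samples=100_000, seed=0):
    """Wedge-sampling estimate of the number of triangles.

    adj: {vertex: set(neighbors)} for an undirected simple graph.
    A wedge is a path u - v - w centered at v; the fraction of closed
    wedges times the total wedge count, divided by 3, estimates the
    triangle count.
    """
    rng = random.Random(seed)
    degrees = {v: len(ns) for v, ns in adj.items()}
    total_wedges = sum(d * (d - 1) // 2 for d in degrees.values())
    if total_wedges == 0:
        return 0.0

    centers = list(adj)
    weights = [degrees[v] * (degrees[v] - 1) // 2 for v in centers]
    closed = 0
    for _ in range(num_samples):
        v = rng.choices(centers, weights=weights, k=1)[0]   # center ~ #wedges
        u, w = rng.sample(sorted(adj[v]), 2)                # two distinct neighbors
        if w in adj[u]:
            closed += 1
    return (closed / num_samples) * total_wedges / 3.0

# Toy graph: a triangle {a, b, c} plus a pendant edge c - d.
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
print(estimate_triangles(adj, num_samples=20_000))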
  • 38
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-07-08
    Description: Predicting plausible links that may emerge between pairs of nodes is an important task in social network analysis, with over a decade of active research. Here, we propose a novel framework for link prediction. It integrates signals from node features, the existing local link neighborhood of a node pair, community-level link density, and global graph properties. Our framework uses a stacked two-level learning paradigm. At the lower level, the first two kinds of features are processed by a novel local learner. Its outputs are then integrated with the last two kinds of features by a conventional discriminative learner at the upper-level. We also propose a new stratified sampling scheme for evaluating link prediction algorithms in the face of an extremely large number of potential edges, out of which very few will ever materialize. It is not tied to a specific application of link prediction, but robust to a range of application requirements. We report on extensive experiments with seven benchmark datasets and over five competitive baseline systems. The system we present consistently shows at least 10 percent accuracy improvement over state-of-the-art, and over 30 percent improvement in some cases. We also demonstrate, through ablation, that our features are complementary in terms of the signals and accuracy benefits they provide.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 39
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-07-08
    Description: Max-flow has been adopted for semi-supervised data modelling, yet existing algorithms were derived only for learning from static data. This paper proposes an online max-flow algorithm for semi-supervised learning from data streams. Consider a graph learned from labelled and unlabelled data, with the graph being updated dynamically to accommodate online data adding and retiring. In learning from the resulting non-stationary graph, we augment and de-augment paths to update the max-flow, with a theoretical guarantee that the updated max-flow equals that obtained from batch retraining. For classification, we compute the min-cut over the current max-flow, so that a minimal number of similar sample pairs is classified into distinct classes. Empirical evaluation on real-world data reveals that our algorithm outperforms state-of-the-art stream classification algorithms.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 40
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-07-08
    Description: Because existing database systems are increasingly difficult to use, improving their quality and usability has gained tremendous momentum over the last few years. In particular, the feature of explaining why some expected tuples are missing from the result of a query has received increasing attention. In this paper, we study the problem of explaining missing answers to top-k queries in the context of SQL (i.e., with selection, projection, join, and aggregation). To approach this problem, we use the query-refinement method: given as inputs the original top-k SQL query and a set of missing tuples, our algorithms return to the user a refined query that includes both the missing tuples and the original query results. Case studies and experimental results show that our algorithms are able to return high-quality explanations efficiently.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 41
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-05-03
    Description: The number of applications based on Apache Hadoop is increasing dramatically due to the robustness and dynamic features of this system. At the heart of Apache Hadoop, the Hadoop Distributed File System (HDFS) provides reliability and high availability for computation by applying static replication by default. However, because of the characteristics of parallel operations on the application layer, the access rate of each data file in HDFS is completely different. Consequently, maintaining the same replication mechanism for every data file has detrimental effects on performance. By rigorously considering the drawbacks of HDFS replication, this paper proposes an approach to replicate data files dynamically based on predictive analysis. With the help of probability theory, the utilization of each data file can be predicted to create a corresponding replication strategy. Popular files can then be replicated according to their access potential, while an erasure code is applied to the remaining low-potential files to maintain reliability. Hence, our approach improves availability while keeping reliability comparable to the default scheme. Furthermore, complexity reduction is applied to enhance the effectiveness of the prediction when dealing with Big Data.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
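    Illustrative sketch: as a toy version of predicting per-file access potential and then choosing between extra replicas and erasure coding, the Python snippet below smooths historical access counts exponentially and maps the predicted rate to a replication decision. The thresholds, the smoothing model, and the function names are assumptions, not the paper's method or any HDFS API.

def predict_access_rate(history, alpha=0.5):
    """Exponentially smoothed prediction of a file's next-period access count."""
    prediction = 0.0
    for observed in history:
        prediction = alpha * observed + (1 - alpha) * prediction
    return prediction

def replication_plan(file_histories, hot_threshold=20.0, default_replicas=3):
    """Decide, per file, how many replicas to keep or whether to erasure-code.

    file_histories: {path: [access counts per past period]}
    Hot files get extra replicas in proportion to their predicted rate;
    cold files fall back to erasure coding to preserve reliability cheaply.
    """
    plan = {}
    for path, history in file_histories.items():
        rate = predict_access_rate(history)
        if rate >= hot_threshold:
            extra = int(rate // hot_threshold)          # crude proportional rule
            plan[path] = {"scheme": "replication",
                          "replicas": default_replicas + extra}
        else:
            plan[path] = {"scheme": "erasure_code", "replicas": 1}
    return plan

histories = {"/logs/clicks.csv": [10, 35, 60, 80],
             "/archive/2009.tar": [2, 1, 0, 0]}
print(replication_plan(histories))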
  • 42
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-05-03
    Description: As remote sensing equipment and networked observational devices continue to proliferate, their corresponding data volumes have surpassed the storage and processing capabilities of commodity computing hardware. This trend has led to the development of distributed storage frameworks that incrementally scale out by assimilating resources as necessary. While challenging in its own right, storing and managing voluminous datasets is only the precursor to a broader field of research: extracting insights, relationships, and models from the underlying datasets. The focus of this study is twofold: exploratory and predictive analytics over voluminous, multidimensional datasets in a distributed environment. Both of these types of analysis represent a higher-level abstraction over standard query semantics; rather than indexing every discrete value for subsequent retrieval, our framework autonomously learns the relationships and interactions between dimensions in the dataset and makes the information readily available to users. This functionality includes statistical synopses, correlation analysis, hypothesis testing, probabilistic structures, and predictive models that not only enable the discovery of nuanced relationships between dimensions, but also allow future events and trends to be predicted. The algorithms presented in this work were evaluated empirically on a real-world geospatial time-series dataset in a production environment, and are broadly applicable across other storage frameworks.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 43
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-05-03
    Description: Researchers have recently shown that declarative database query languages, such as Datalog, could naturally be used to specify and implement network protocols and services. In this paper, we present a declarative framework for the specification, execution, simulation, and analysis of distributed applications. Distributed applications, including routing protocols, can be specified using a Declarative Networking language, called D2C, whose semantics capture the notion of a Distributed State Machine (DSM), i.e., a network of computational nodes that communicate with each other through the exchange of data. The D2C specification can be directly executed using the DSM computational infrastructure of our framework. The same specification can be simulated and formally verified. The simulation component integrates the DSM tool within a network simulation environment and allows developers to simulate network dynamics and collect data about the execution in order to evaluate application responses to network changes. The formal analysis component of our framework, instead, complements the empirical testing by supporting the verification of different classes of properties of distributed algorithms, including convergence of network routing protocols. To demonstrate the generality of our framework, we show how it can be used to analyze two classes of network routing protocols, a path vector and a Mobile Ad-Hoc Network (MANET) routing protocol, and execute a distributed algorithm for pattern formation in multi-robot systems.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 44
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-05-03
    Description: Additive models are regression methods which model the response variable as the sum of univariate transfer functions of the input variables. Key benefits of additive models are their accuracy and interpretability on many real-world tasks. Additive models are however not adapted to problems involving a large number (e.g., hundreds) of input variables, as they are prone to overfitting in addition to losing interpretability. In this paper, we introduce a novel framework for applying additive models to a large number of input variables. The key idea is to reduce the task dimensionality by deriving a small number of new covariates obtained by linear combinations of the inputs, where the linear weights are estimated with regard to the regression problem at hand. The weights are moreover constrained to prevent overfitting and facilitate the interpretation of the derived covariates. We establish identifiability of the proposed model under mild assumptions and present an efficient approximate learning algorithm. Experiments on synthetic and real-world data demonstrate that our approach compares favorably to baseline methods in terms of accuracy, while resulting in models of lower complexity and yielding practical insights into high-dimensional real-world regression tasks. Our framework broadens the applicability of additive models to high-dimensional problems while maintaining their interpretability and potential to provide practical insights.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 45
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-04-01
    Description: A large class of entity extraction tasks from text that is either semistructured or fully unstructured may be addressed by regular expressions, because in many practical cases the relevant entities follow an underlying syntactical pattern and this pattern may be described by a regular expression. In this work, we consider the long-standing problem of synthesizing such expressions automatically, based solely on examples of the desired behavior. We present the design and implementation of a system capable of addressing extraction tasks of realistic complexity. Our system is based on an evolutionary procedure carefully tailored to the specific needs of regular expression generation by examples. The procedure executes a search driven by a multiobjective optimization strategy aimed at simultaneously improving multiple performance indexes of candidate solutions while at the same time ensuring an adequate exploration of the huge solution space. We assess our proposal experimentally in great depth, on a number of challenging datasets. The accuracy of the obtained solutions seems to be adequate for practical usage and improves over earlier proposals significantly. Most importantly, our results are highly competitive even with respect to human operators. A prototype is available as a web application at http://regex.inginf.units.it .
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 46
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-04-01
    Description: Sequence classification is an important task in data mining. We address the problem of sequence classification using rules composed of interesting patterns found in a dataset of labelled sequences and accompanying class labels. We measure the interestingness of a pattern in a given class of sequences by combining the cohesion and the support of the pattern. We use the discovered patterns to generate confident classification rules, and present two different ways of building a classifier. The first classifier is based on an improved version of the existing method of classification based on association rules, while the second ranks the rules by first measuring their value specific to the new data object. Experimental results show that our rule based classifiers outperform existing comparable classifiers in terms of accuracy and stability. Additionally, we test a number of pattern feature based models that use different kinds of patterns as features to represent each sequence as a feature vector. We then apply a variety of machine learning algorithms for sequence classification, experimentally demonstrating that the patterns we discover represent the sequences well, and prove effective for the classification task.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 47
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-04-01
    Description: This work presents a feature selection method based on proximity relation learning. Each single feature is treated as a binary classifier that predicts, for any three objects X, A, and B, whether X is closer to A or to B. The performance of the classifier is a direct measure of feature quality. Any linear combination of feature-based binary classifiers naturally corresponds to feature selection. Thus, the feature selection problem is transformed into an ensemble learning problem of combining many weak classifiers into an optimized strong classifier. We provide a theoretical analysis of the generalization error of the proposed method, which validates its effectiveness. Experiments conducted on synthetic data, four UCI data sets, and 12 microarray data sets demonstrate the success of our approach when applied to feature selection. A weakness of our algorithm is its high time complexity.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
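    Illustrative sketch: the core idea above, treating each feature as a weak classifier over (X, A, B) triplets, can be illustrated in a few lines of Python: for each sampled triplet in which X shares A's label but not B's, a feature scores a hit if it places X nearer to A than to B. The triplet sampling and the toy data are simplified assumptions, not the paper's ensemble construction.

import random

def feature_quality(X, y, feature, num_triplets=2000, seed=0):
    """Accuracy of a single feature as a proximity classifier on triplets.

    X: list of feature vectors, y: list of labels, feature: column index.
    For a triplet (x, a, b) with label(x) == label(a) != label(b), the
    feature is 'correct' if |x_f - a_f| < |x_f - b_f|.
    """
    rng = random.Random(seed)
    by_label = {}
    for i, label in enumerate(y):
        by_label.setdefault(label, []).append(i)
    labels = list(by_label)
    correct = 0
    for _ in range(num_triplets):
        pos_label, neg_label = rng.sample(labels, 2)
        x, a = rng.sample(by_label[pos_label], 2)
        b = rng.choice(by_label[neg_label])
        if abs(X[x][feature] - X[a][feature]) < abs(X[x][feature] - X[b][feature]):
            correct += 1
    return correct / num_triplets

# Feature 0 separates the two classes; feature 1 is pure noise.
X = [[0.1, 5.0], [0.2, 1.0], [0.0, 3.0], [5.1, 2.0], [5.3, 4.0], [4.9, 0.5]]
y = [0, 0, 0, 1, 1, 1]
print(feature_quality(X, y, feature=0), feature_quality(X, y, feature=1))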
  • 48
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-04-01
    Description: Although many successful ensemble clustering approaches have been developed in recent years, there are still two limitations to most of the existing approaches. First, they mostly overlook the issue of uncertain links, which may mislead the overall consensus process. Second, they generally lack the ability to incorporate global information to refine the local links. To address these two limitations, in this paper, we propose a novel ensemble clustering approach based on sparse graph representation and probability trajectory analysis. In particular, we present the elite neighbor selection strategy to identify the uncertain links by locally adaptive thresholds and build a sparse graph with a small number of probably reliable links. We argue that a small number of probably reliable links can lead to significantly better consensus results than using all graph links regardless of their reliability. The random walk process driven by a new transition probability matrix is utilized to explore the global information in the graph. We derive a novel and dense similarity measure from the sparse graph by analyzing the probability trajectories of the random walkers, based on which two consensus functions are further proposed. Experimental results on multiple real-world datasets demonstrate the effectiveness and efficiency of our approach.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 49
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-04-01
    Description: The problem we aim to address is the optimization of cost management for executing multiple continuous queries on data streams, where each query is defined by several filters, each of which monitors certain status of the data stream. In particular, a filter can be shared by different queries and can be expensive to evaluate. The conventional objective for such a problem is to minimize the overall execution cost of solving all queries by planning the order of filter evaluation under a shared strategy. However, in the streaming scenario, the characteristics of data items may change over time, which brings uncertainty to the outcome of individual filter evaluations and affects the plan of query execution as well as the overall execution cost. In our work, considering the influence of such uncertain variation of data characteristics, we propose a framework for the dynamic adjustment of filter ordering for query execution on data streams, and focus on the issues of cost management. By incrementally monitoring and analyzing the results of filter evaluation, our proposed approach adapts effectively to varying stream behavior and adjusts the ordering of filter evaluation so as to optimize the execution cost. In order to achieve satisfactory performance and efficiency, we also discuss the trade-off between the adaptivity of our framework and the overhead incurred by filter adaptation. The experimental results on synthetic and two real data sets (traffic and multimedia) show that our framework can effectively reduce and balance the overall query execution cost while remaining highly adaptive in the streaming scenario.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
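    Illustrative sketch: a classic way to order shared, expensive filters, in the spirit of the adaptive framework above, is to rank them by cost divided by their observed drop rate and to keep refreshing those observations from the stream. The statistics and the update policy in the Python snippet below are simplified assumptions, not the paper's cost model.

class AdaptiveFilterPipeline:
    """Orders filters by cost / (1 - pass_rate), refreshing stats online."""

    def __init__(self, filters, costs):
        # filters: {name: predicate}, costs: {name: estimated unit cost}
        self.filters = filters
        self.costs = costs
        self.passed = {name: 1 for name in filters}    # smoothed counters
        self.evaluated = {name: 2 for name in filters}

    def _rank(self, name):
        pass_rate = self.passed[name] / self.evaluated[name]
        drop_power = max(1e-6, 1.0 - pass_rate)        # how much the filter drops
        return self.costs[name] / drop_power

    def process(self, item):
        """Return True if the item satisfies every filter, updating stats."""
        for name in sorted(self.filters, key=self._rank):
            self.evaluated[name] += 1
            if self.filters[name](item):
                self.passed[name] += 1
            else:
                return False                           # short-circuit on first drop
        return True

pipeline = AdaptiveFilterPipeline(
    filters={"cheap_rare": lambda x: x % 97 == 0,      # drops most items cheaply
             "costly_common": lambda x: x % 2 == 0},   # drops only half
    costs={"cheap_rare": 1.0, "costly_common": 10.0},
)
kept = [x for x in range(1000) if pipeline.process(x)]
print(len(kept))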
  • 50
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-04-01
    Description: Community detection is an important task in network analysis. A community (also referred to as a cluster) is a set of cohesive vertices that have more connections inside the set than outside. In many social and information networks, these communities naturally overlap. For instance, in a social network, each vertex in a graph corresponds to an individual who usually participates in multiple communities. In this paper, we propose an efficient overlapping community detection algorithm using a seed expansion approach. The key idea of our algorithm is to find good seeds, and then greedily expand these seeds based on a community metric. Within this seed expansion method, we investigate the problem of how to determine good seed nodes in a graph. In particular, we develop new seeding strategies for a personalized PageRank clustering scheme that optimizes the conductance community score. An important step in our method is the neighborhood inflation step where seeds are modified to represent their entire vertex neighborhood. Experimental results show that our seed expansion algorithm outperforms other state-of-the-art overlapping community detection methods in terms of producing cohesive clusters and identifying ground-truth communities. We also show that our new seeding strategies are better than existing strategies, and are thus effective in finding good overlapping communities in real-world networks.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
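    Illustrative sketch: the abstract above centers on expanding seeds with personalized PageRank (PPR) scores and cutting by conductance. The Python snippet below runs a small power-iteration PPR from a seed set and sweeps vertices in decreasing score order, returning the prefix with the lowest conductance. It assumes an undirected adjacency-set graph and omits the paper's seeding and neighborhood-inflation steps.

def personalized_pagerank(adj, seeds, alpha=0.15, iters=50):
    """Power-iteration PPR restarted at the seed set (undirected graph)."""
    restart = {v: (1.0 / len(seeds) if v in seeds else 0.0) for v in adj}
    score = dict(restart)
    for _ in range(iters):
        nxt = {v: alpha * restart[v] for v in adj}
        for v, neighbors in adj.items():
            if not neighbors:
                continue
            share = (1 - alpha) * score[v] / len(neighbors)
            for u in neighbors:
                nxt[u] += share
        score = nxt
    return score

def sweep_cut(adj, score):
    """Return the prefix of high-PPR vertices with minimum conductance."""
    total_volume = sum(len(ns) for ns in adj.values())
    order = sorted(adj, key=lambda v: -score[v])
    in_set, volume, cut = set(), 0, 0
    best, best_phi = None, float("inf")
    for v in order[:-1]:                    # skip the trivial full-graph set
        volume += len(adj[v])
        cut += sum(1 for u in adj[v] if u not in in_set)   # new boundary edges
        cut -= sum(1 for u in adj[v] if u in in_set)       # edges now internal
        in_set.add(v)
        phi = cut / min(volume, total_volume - volume)
        if phi < best_phi:
            best, best_phi = set(in_set), phi
    return best, best_phi

# Two triangles joined by a single bridge edge c - d.
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
       "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"}}
print(sweep_cut(adj, personalized_pagerank(adj, seeds={"a"})))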
  • 51
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-04-01
    Description: Feature selection, selecting the most informative subset of features, is an important research direction in dimension reduction. The combinatorial search in feature selection is essentially a binary optimization problem, known to be NP-hard, which can be alleviated by learning feature weights. Traditional feature weight algorithms rely on heuristic search paths. These approaches neglect the interaction and dependency between different features, and thus provide no guarantee of optimality. In this paper, we propose a novel joint feature weights learning framework, which imposes both nonnegative and $\ell_{2,1}$-norm constraints on the feature weights matrix. The nonnegative property ensures the physical significance of learned feature weights. Meanwhile, $\ell_{2,1}$-norm minimization achieves joint selection of the most relevant features by exploiting the whole feature space. More importantly, an efficient iterative algorithm with proven convergence is designed to optimize a convex objective function. Using this framework as a platform, we propose new supervised and unsupervised joint feature selection methods. In particular, in the proposed unsupervised method, nonnegative graph embedding is developed to exploit intrinsic structure in the weighted space. Comparative experiments on seven real-world data sets indicate that our framework is both effective and efficient.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
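    Illustrative sketch: for readers unfamiliar with the $\ell_{2,1}$-norm used above, the Python snippet below computes it for a feature-weight matrix and ranks features by their row norms, which is how joint sparsity across all outputs translates into feature selection. The matrix and the ranking rule are purely illustrative.

import math

def l21_norm(W):
    """Sum of Euclidean norms of the rows of W (features x outputs)."""
    return sum(math.sqrt(sum(value * value for value in row)) for row in W)

def rank_features(W):
    """Features sorted by row norm, largest first: joint relevance across outputs."""
    row_norms = [math.sqrt(sum(value * value for value in row)) for row in W]
    return sorted(range(len(W)), key=lambda i: -row_norms[i]), row_norms

# Three features, two outputs: feature 1 has a near-zero row, so an
# l2,1-norm penalty would push it to exactly zero first.
W = [[0.9, 0.4],
     [0.01, 0.02],
     [0.5, 0.7]]
order, norms = rank_features(W)
print(l21_norm(W), order, norms)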
  • 52
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-04-01
    Description: Efficiently answering XML keyword queries has attracted much research effort in the last decade. The key factors resulting in the inefficiency of existing methods are the common-ancestor-repetition (CAR) and visiting-useless-nodes (VUN) problems. To address the CAR problem, we propose a generic top-down processing strategy to answer a given keyword query w.r.t. LCA/SLCA/ELCA semantics. By “ top-down ”, we mean that we visit all common ancestor (CA) nodes in a depth-first, left-to-right order; by “ generic ”, we mean that our method is independent of the query semantics. To address the VUN problem, we propose to use child nodes, rather than descendant nodes to test the satisfiability of a node $v$ w.r.t. the given semantics. We propose two algorithms that are based on either traditional inverted lists or our newly proposed LLists to improve the overall performance. We further propose several algorithms that are based on hash search to simplify the operation of finding CA nodes from all involved LLists. The experimental results verify the benefits of our methods according to various evaluation metrics.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 53
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-04-01
    Description: Frequent sequence mining is a well-known and well-studied problem in data mining. The output of the algorithm is used in many other areas such as bioinformatics, chemistry, and market basket analysis. Unfortunately, frequent sequence mining is computationally quite expensive. In this paper, we present a novel parallel algorithm for mining frequent sequences based on static load-balancing. The static load-balancing is done by measuring the computational time using a probabilistic algorithm. For instances of reasonable size, the algorithm achieves speedups of up to $\approx 3/4 \cdot P$, where $P$ is the number of processors. In the experimental evaluation, we show that our method performs significantly better than the current state-of-the-art methods. The presented approach is very general: it can be used for static load-balancing of other pattern mining algorithms such as itemset/tree/graph mining algorithms.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
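    Illustrative sketch: the static load-balancing idea above, estimate the work behind each search-tree branch first and only then assign branches to processors, can be pictured with a greedy longest-processing-time partition of sampled runtime estimates, as in the Python snippet below. The cost estimates are assumptions; the paper's probabilistic measurement is more refined.

import heapq

def balance_partitions(branch_costs, num_processors):
    """Assign branches to processors greedily, heaviest branch first (LPT).

    branch_costs: {branch_id: estimated running time from a sampling pass}
    Returns ({processor: [branch_id, ...]}, {processor: total_cost}).
    """
    # Min-heap of (current load, processor id); heaviest jobs are placed first.
    heap = [(0.0, p) for p in range(num_processors)]
    heapq.heapify(heap)
    assignment = {p: [] for p in range(num_processors)}
    for branch, cost in sorted(branch_costs.items(), key=lambda kv: -kv[1]):
        load, p = heapq.heappop(heap)
        assignment[p].append(branch)
        heapq.heappush(heap, (load + cost, p))
    loads = {p: sum(branch_costs[b] for b in branches)
             for p, branches in assignment.items()}
    return assignment, loads

# Estimated cost of expanding each frequent 1-sequence prefix (made-up numbers).
estimates = {"<a>": 9.0, "<b>": 7.5, "<c>": 3.0, "<d>": 2.5, "<e>": 2.0, "<f>": 1.0}
print(balance_partitions(estimates, num_processors=3))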
  • 54
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-01-12
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 55
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-01-12
    Description: We address the problem of finding query facets, which are multiple groups of words or phrases that explain and summarize the content covered by a query. We assume that the important aspects of a query are usually presented and repeated in the query’s top retrieved documents in the style of lists, and that query facets can be mined out by aggregating these significant lists. We propose a systematic solution, which we refer to as QDMiner, to automatically mine query facets by extracting and grouping frequent lists from free text, HTML tags, and repeat regions within top search results. Experimental results show that a large number of lists do exist and that useful query facets can be mined by QDMiner. We further analyze the problem of list duplication, and find that better query facets can be mined by modeling fine-grained similarities between lists and penalizing the duplicated lists.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 56
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-01-12
    Description: Unsupervised Cross-domain Sentiment Classification is the task of adapting a sentiment classifier trained on a particular domain ( source domain ), to a different domain ( target domain ), without requiring any labeled data for the target domain. By adapting an existing sentiment classifier to previously unseen target domains, we can avoid the cost for manual data annotation for the target domain. We model this problem as embedding learning, and construct three objective functions that capture: (a) distributional properties of pivots (i.e., common features that appear in both source and target domains), (b) label constraints in the source domain documents, and (c) geometric properties in the unlabeled documents in both source and target domains. Unlike prior proposals that first learn a lower-dimensional embedding independent of the source domain sentiment labels, and next a sentiment classifier in this embedding, our joint optimisation method learns embeddings that are sensitive to sentiment classification. Experimental results on a benchmark dataset show that by jointly optimising the three objectives we can obtain better performances in comparison to optimising each objective function separately, thereby demonstrating the importance of task-specific embedding learning for cross-domain sentiment classification. Among the individual objective functions, the best performance is obtained by (c). Moreover, the proposed method reports cross-domain sentiment classification accuracies that are statistically comparable to the current state-of-the-art embedding learning methods for cross-domain sentiment classification.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 57
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-01-12
    Description: Microblogging services have become among the most popular services on the web in the last few years. This has led to a significant increase in data size, speed, and applications. This paper presents Venus, a system that supports real-time spatial queries on microblogs. Venus supports queries on a spatial boundary $R$ and a temporal boundary $T$, from which only the top-$k$ microblogs are returned in the query answer based on a spatio-temporal ranking function. Supporting such queries requires Venus to digest hundreds of millions of real-time microblogs in main memory at high rates, while still providing low query response times and efficient memory utilization. To this end, Venus employs: (1) an efficient in-memory spatio-temporal index that digests high rates of incoming microblogs in real time, (2) a scalable query processor that prunes the search space, $R$ and $T$, effectively to provide low query latency over millions of items in real time, and (3) a group of memory optimization techniques that provide system administrators with different options to save significant memory resources while keeping the query accuracy almost perfect. The Venus memory optimization techniques make use of the local arrival rates of microblogs to smartly shed microblogs that are old enough not to contribute to any query answer. In addition, Venus can adaptively, in real time, adjust its load shedding based on both the spatial distribution and the parameters of incoming query loads. All Venus components can accommodate different spatial and temporal ranking functions that are able to capture the importance of each dimension differently depending on application requirements. Extensive experimental results based on real Twitter data and actual locations of Bing search queries show that Venus supports high arrival rates of up to 64 K microblogs/second with an average query latency of 4 msec.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 58
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-01-12
    Description: Merlin supports exploratory search in large databases. The user interacts with it by specifying probability distributions over attributes, which express imprecise conditions about the entities of interest. Merlin helps the user home in on the right query conditions by addressing three key challenges: (1) efficiently computing results for an imprecise query, (2) providing feedback about the sensitivity of the result to changes of individual conditions, and (3) suggesting new conditions. We formally introduce the notion of sensitivity and prove structural properties that enable efficient algorithms for quantifying the effect of uncertainty in user-specified conditions. To support interactive responses, we also develop techniques that can deliver probability estimates within a given realtime limit and are able to adapt automatically as interactive query refinement proceeds.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 59
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-01-12
    Description: Instance coreference resolution is an essential problem in the study of the semantic web, and it is also critical for the implementation of the web of data and the future integration and application of semantic data. In this paper, we propose to use a Memetic Algorithm (MA) to solve this instance coreference problem in a sequential manner, i.e., the instance-level matching is carried out using the result of schema-level matching. We first give the optimization model for schema-level matching and instance-level matching. Then we present profile similarity measures and rough evaluation metrics, under the assumption that the golden alignment for both schema-level matching and instance-level matching is one-to-one. Furthermore, we give the details of the MA. Finally, experiments comparing our approach with state-of-the-art systems on OAEI benchmarks and real-world datasets are conducted, and the results demonstrate that our approach is effective.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 60
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-01-12
    Description: The SQL group-by operator plays an important role in summarizing and aggregating large datasets in a data analytics stack. While the standard group-by operator, which is based on equality, is useful in several applications, allowing similarity aware grouping provides a more realistic view on real-world data that could lead to better insights. The Similarity SQL-based Group-By operator (SGB, for short) extends the semantics of the standard SQL Group-by by grouping data with similar but not necessarily equal values. While existing similarity-based grouping operators efficiently realize these approximate semantics, they primarily focus on one-dimensional attributes and treat multi-dimensional attributes independently. However, correlated attributes, such as in spatial data, are processed independently, and hence, groups in the multi-dimensional space are not detected properly. To address this problem, we introduce two new SGB operators for multi-dimensional data. The first operator is the clique (or distance-to-all) SGB, where all the tuples in a group are within some distance from each other. The second operator is the distance-to-any SGB, where a tuple belongs to a group if the tuple is within some distance from any other tuple in the group. Since a tuple may satisfy the membership criterion of multiple groups, we introduce three different semantics to deal with such a case: (i) eliminate the tuple, (ii) put the tuple in any one group, and (iii) create a new group for this tuple. We implement and test the new SGB operators and their algorithms inside PostgreSQL. The overhead introduced by these operators proves to be minimal and the execution times are comparable to those of the standard Group-by. The experimental study, based on TPC-H and a social check-in data, demonstrates that the proposed algorithms can achieve up to three orders of magnitude enhancement in performance over baseline methods developed to solve the same problem.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
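    Illustrative sketch: under the distance-to-any semantics above, a tuple joins a group as soon as it is within the threshold of any member, which is the connected-components behavior the Python union-find snippet below produces over 2-D points. The clique (distance-to-all) variant and the overlap-resolution semantics need more machinery, and nothing here reflects the PostgreSQL implementation.

import math

def distance_to_any_groups(points, eps):
    """Group 2-D points so that each point is within eps of some group member.

    Equivalent to connected components of the 'within eps' graph, computed
    here with a tiny union-find (quadratic, fine for a sketch).
    """
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= eps:
                union(i, j)

    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(points[i])
    return list(groups.values())

checkins = [(0.0, 0.0), (0.5, 0.1), (1.0, 0.2), (5.0, 5.0), (5.3, 5.2)]
print(distance_to_any_groups(checkins, eps=0.7))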
  • 61
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-03-08
    Description: This paper presents a project called KnowIng camera prototype SyStem (KISS) for real-time places-of-interest (POI) recognition and annotation for smartphone photos, with online geotagged images of POIs available as our knowledge base. We propose a “Spatial+Visual” (S+V) framework, which consists of a probabilistic field-of-view (pFOV) model in the spatial phase and a sparse coding similarity metric in the visual phase, to recognize phone-captured POIs. Moreover, we put forward an offline Collaborative Salient Area (COSTAR) mining algorithm to detect common visual features (called Costars) among the noisy photos geotagged on each POI, thus cleaning the geotagged image database. The mining result can be utilized to annotate the region-of-interest on the query image during online query processing. Besides, this mining procedure also improves the efficiency and accuracy of the S+V framework. Furthermore, we extend the pFOV model into a Bayesian FOV ($\beta$FOV) model, which improves the spatial recognition accuracy by more than 30 percent and further alleviates the visual computation. From a Bayesian point of view, the likelihood of a certain POI being captured by a phone is a prior probability in the pFOV model, which is represented as a posterior probability in the $\beta$FOV model. Our experiments on the real-world and Oxford 5K datasets show promising recognition results. In order to provide a fine-grained annotation ground truth, we labeled a new dataset based on Oxford 5K and make it publicly available on the web. Our COSTAR mining technique outperforms the state-of-the-art approach on both datasets.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 62
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-03-08
    Description: Information Retrieval (IR) is concerned with indexing and retrieving documents that include information relevant to a user's information need. Relevance Feedback (RF) is a class of effective algorithms for improving IR; it consists of gathering further data representing the user's information need and automatically creating a new query. In this paper, we propose a class of RF algorithms inspired by quantum detection to re-weight the query terms and to re-rank the documents retrieved by an IR system. These algorithms project the query vector onto a subspace spanned by the eigenvector that maximizes the distance between the distribution of quantum probability of relevance and the distribution of quantum probability of non-relevance. The experiments showed that the RF algorithms inspired by quantum detection can outperform state-of-the-art algorithms.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 63
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-03-08
    Description: The ubiquity of smartphones has led to the emergence of mobile crowdsourcing tasks such as the detection of spatial events when smartphone users move around in their daily lives. However, the credibility of those detected events can be negatively impacted by unreliable participants with low-quality data. Consequently, a major challenge in mobile crowdsourcing is truth discovery, i.e., to discover true events from diverse and noisy participants’ reports. This problem is uniquely distinct from its online counterpart in that it involves uncertainties in both participants’ mobility and reliability . Decoupling these two types of uncertainties through location tracking will raise severe privacy and energy issues, whereas simply ignoring missing reports or treating them as negative reports will significantly degrade the accuracy of truth discovery. In this paper, we propose two new unsupervised models, i.e., Truth finder for Spatial Events (TSE) and Personalized Truth finder for Spatial Events (PTSE), to tackle this problem. In TSE, we model location popularity, location visit indicators, truths of events, and three-way participant reliability in a unified framework. In PTSE, we further model personal location visit tendencies. These proposed models are capable of effectively handling various types of uncertainties and automatically discovering truths without any supervision or location tracking. Experimental results on both real-world and synthetic datasets demonstrate that our proposed models outperform existing state-of-the-art truth discovery approaches in the mobile crowdsourcing environment.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 64
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-03-08
    Description: Denormalization is a common tactic for enhancing the performance of data warehouses, though its side-effects are quite obvious: besides being prone to update anomalies, denormalization consumes additional storage space. As a result, this tactic is rarely used in main memory databases, which regard storage space, i.e., RAM, as a scarce resource. Nevertheless, our research reveals that main memory databases can benefit enormously from denormalization, as it remarkably simplifies query processing plans and reduces computation cost. In this paper, we present A-Store, a main memory OLAP engine customized for star/snowflake schemas. Instead of generating a fully materialized denormalization, A-Store resorts to virtual denormalization by treating array indexes as primary keys. This design allows us to harvest the benefit of denormalization without sacrificing additional RAM space. A-Store uses a generic query processing model for all SPJGA queries. It applies a number of state-of-the-art optimization methods, such as vectorized scan and aggregation, to achieve superior performance. Our experiments show that A-Store significantly outperforms the most prestigious MMDB systems in star/snowflake-schema-based query processing.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
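    Illustrative sketch: virtual denormalization as described above can be pictured with ordinary arrays, as in the Python snippet below: fact rows store the array positions of their dimension rows, so a 'join' is just an index lookup during the scan and no widened table is materialized. The star-schema toy data and column names are invented for illustration, not A-Store's engine.

from collections import defaultdict

# Dimension tables stored as arrays; a fact row keeps array positions
# (surrogate keys) instead of copying dimension attributes.
customers = [{"name": "Ann", "region": "EU"},
             {"name": "Bob", "region": "US"},
             {"name": "Eve", "region": "EU"}]
dates = [{"year": 2015}, {"year": 2016}]

# Fact table: (customer_index, date_index, revenue)
sales = [(0, 1, 120.0), (1, 1, 80.0), (2, 0, 40.0), (2, 1, 60.0)]

def revenue_by_region_and_year(facts):
    """Group-by over the fact table using array-index lookups as the 'join'."""
    totals = defaultdict(float)
    for customer_idx, date_idx, revenue in facts:       # single sequential scan
        region = customers[customer_idx]["region"]      # index lookup, no join
        year = dates[date_idx]["year"]
        totals[(region, year)] += revenue
    return dict(totals)

print(revenue_by_region_and_year(sales))
# e.g. {('EU', 2016): 180.0, ('US', 2016): 80.0, ('EU', 2015): 40.0}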
  • 65
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-03-08
    Description: Peer-to-peer networking offers a scalable solution for sharing multimedia data across the network. With a large amount of visual data distributed among different nodes, it is an important but challenging issue to perform content-based retrieval in peer-to-peer networks. While most of the existing methods focus on indexing high dimensional visual features and have limitations of scalability, in this paper we propose a scalable approach for content-based image retrieval in peer-to-peer networks by employing the bag-of-visual-words model. Compared with centralized environments, the key challenge is to efficiently obtain a global codebook, as images are distributed across the whole peer-to-peer network. In addition, a peer-to-peer network often evolves dynamically, which makes a static codebook less effective for retrieval tasks. Therefore, we propose a dynamic codebook updating method by optimizing the mutual information between the resultant codebook and relevance information, and the workload balance among nodes that manage different codewords. In order to further improve retrieval performance and reduce network cost, indexing pruning techniques are developed. Our comprehensive experimental results indicate that the proposed approach is scalable in evolving and distributed peer-to-peer networks, while achieving improved retrieval accuracy.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 66
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-03-08
    Description: Time pervades reality, and many relational database approaches have been developed to cope with it. In practical applications, facts can repeat several times, and only the overall period of time containing all the repetitions may be known (consider, e.g., In January, John attended five meetings of the Bioinformatics project ). While some temporal relational databases have dealt with facts repeated at (known) periodic times, or with single facts occurring at temporally indeterminate times, the conjunction of non-periodic repetitions and temporal indeterminacy has not been addressed yet. Coping with this problem requires an in-depth extension of current techniques. In this paper, we introduce a new data model and new definitions of relational algebraic operators coping with the above issues. We study the properties of the new model and algebra (with emphasis on the reducibility property), and how it can be integrated with other models in the literature.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 67
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-03-08
    Description: Hashing methods have proven useful for a variety of tasks and have attracted extensive attention in recent years. Various hashing approaches have been proposed to capture similarities between textual, visual, and cross-media information. However, most existing works use bag-of-words methods to represent textual information. Since words with different forms may have similar meanings, semantic-level text similarities cannot be processed well by these methods. To address these challenges, in this paper we propose a novel method called semantic cross-media hashing (SCMH), which uses continuous word representations to capture textual similarity at the semantic level and a deep belief network (DBN) to construct the correlation between different modalities. To demonstrate the effectiveness of the proposed method, we evaluate it on three commonly used cross-media data sets. Experimental results show that the proposed method achieves significantly better performance than state-of-the-art approaches. Moreover, the efficiency of the proposed method is comparable to or better than that of some other hashing methods.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 68
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-03-08
    Description: A wide spectrum of Internet-scale mobile applications, ranging from social networking, gaming and entertainment to emergency response and crisis management, all require efficient and scalable All k Nearest Neighbor (AkNN) computations over millions of moving objects every few seconds to be operational. Most traditional techniques for computing AkNN queries are centralized, lacking both scalability and efficiency. Only recently, distributed techniques for shared-nothing cloud infrastructures have been proposed to achieve scalability for large datasets. These batch-oriented algorithms are sub-optimal due to inefficient data space partitioning and data replication among processing units. In this paper, we present Spitfire , a distributed algorithm that provides a scalable and high-performance AkNN processing framework. Our proposed algorithm deploys a fast load-balanced partitioning scheme along with an efficient replication-set selection algorithm, to provide fast main-memory computations of the exact AkNN results in a batch-oriented manner. We evaluate, both analytically and experimentally, how the pruning efficiency of the Spitfire algorithm plays a pivotal role in reducing communication and response time up to an order of magnitude, compared to three other state-of-the-art distributed AkNN algorithms executed in distributed main-memory.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 69
    Publication Date: 2016-03-08
    Description: Metric learning is a key problem for many data mining and machine learning applications, and has long been dominated by Mahalanobis methods. Recent advances in nonlinear metric learning have demonstrated the potential power of non-Mahalanobis distance functions, particularly tree-based functions. We propose a novel nonlinear metric learning method that uses an iterative, hierarchical variant of semi-supervised max-margin clustering to construct a forest of cluster hierarchies, where each individual hierarchy can be interpreted as a weak metric over the data. By introducing randomness during hierarchy training and combining the output of many of the resulting semi-random weak hierarchy metrics, we can obtain a powerful and robust nonlinear metric model. This method has two primary contributions: first, it is semi-supervised, incorporating information from both constrained and unconstrained points. Second, we take a relaxed approach to constraint satisfaction, allowing the method to satisfy different subsets of the constraints at different levels of the hierarchy rather than attempting to satisfy all of them simultaneously. This leads to a more robust learning algorithm. We compare our method to a number of state-of-the-art benchmarks on $k$-nearest neighbor classification, large-scale image retrieval, and semi-supervised clustering problems, and find that our algorithm yields results comparable or superior to the state-of-the-art.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 70
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-03-08
    Description: In this paper, we study the multi-task learning problem with a new perspective of considering the link structure of data and task relationship modeling simultaneously. In particular, we first introduce the Matrix Gaussian (MG) distribution and Matrix Generalized Inverse Gaussian (MGIG) distribution, then define a Matrix Gaussian Matrix Generalized Inverse Gaussian (MG-MGIG) prior. Based on this prior, we propose a novel multi-task learning algorithm, the Bayesian Multi-task Relationship Learning (BMTRL) algorithm. To incorporate the link structure into the framework of BMTRL, we propose link constraints between samples. Through combining the BMTRL algorithm with the link constraints, we propose the Bayesian Multi-task Relationship Learning with Link Constraints (BMTRL-LC) algorithm. Further, we apply the manifold theory to provide an extension of BMTRL-LC to data with no link structure. Specifically, BMTRL-LC is effective for multi-task learning with only limited training samples, which is not addressed in the existing literature. To make the computation tractable, we simultaneously use a convex optimization method and sampling techniques. In particular, we adopt two stochastic EM algorithms for BMTRL and BMTRL-LC, respectively. The experimental results on three real datasets demonstrate the promise of the proposed algorithms.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 71
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-03-08
    Description: In the age of Web 2.0, community user contributed questions and answers provide an important alternative for knowledge acquisition through web search. Question retrieval in current community-based question answering (CQA) services do not, in general, work well for long and complex queries, such as the questions. The main reasons are the verboseness in natural language queries and the word mismatch between the queries and the candidate questions in the CQA archive during retrieval. To address these two problems, existing solutions try to refine the search queries by distinguishing the key concepts in the queries and expanding the queries with relevant content. However, using the existing query refinement approaches can only identify the key and non-key concepts, while the differences between the key concepts are overlooked. Moreover, the existing query expansion approaches, not only overlook the weights of key concepts in the queries, but also fail to consider concept level expansion for them. In this paper, we explore a key concept identification approach for query refinement and a pivot language translation based approach to explore key concept paraphrasing. We further propose a new question retrieval model which can seamlessly integrate the key concepts and their paraphrases. The experimental results demonstrate that the integrated retrieval model significantly outperforms the state-of-the-art models in question retrieval.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 72
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-03-08
    Description: Crowdsourcing refers to solving large problems by involving human workers that solve component sub-problems or tasks. In data crowdsourcing, the problem involves data acquisition, management, and analysis. In this paper, we provide an overview of data crowdsourcing, giving examples of problems that the authors have tackled, and presenting the key design steps involved in implementing a crowdsourced solution. We also discuss some of the open challenges that remain to be solved.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 73
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-03-08
    Description: Collaborative Filtering (CF) is one of the most successful recommendation approaches for coping with information overload in the real world. However, typical CF methods treat every user and item equally and cannot capture the variation of a user's interests across different domains. This ignores the reality that a user's interests usually center on a few specific domains, and that users with similar tastes in one domain may have entirely different tastes in another. Motivated by this observation, we propose a novel Domain-sensitive Recommendation (DsRec) algorithm that makes rating predictions while simultaneously performing user-item subgroup analysis, where a user-item subgroup is treated as a domain consisting of a subset of items with similar attributes and a subset of users who are interested in these items. The proposed DsRec framework includes three components: a matrix factorization model for reconstructing the observed ratings (sketched after this entry), a bi-clustering model for the user-item subgroup analysis, and two regularization terms that connect the above two components into a unified formulation. Extensive experiments on Movielens-100K and two real-world product review datasets show that our method achieves better prediction accuracy than the state-of-the-art methods.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
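As a point of reference for the first DsRec component only, here is a minimal sketch of plain regularized matrix factorization trained with SGD on toy ratings. The bi-clustering subgroup model and the coupling regularization terms are omitted, and all names and hyperparameters are illustrative rather than taken from the paper.

```python
# Minimal sketch of the rating-reconstruction component only (plain regularized
# matrix factorization via SGD). The bi-clustering subgroup model and the
# coupling regularizers of DsRec are omitted; hyperparameters are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, rank = 50, 40, 8
# toy observed ratings: list of (user, item, rating)
ratings = [(rng.integers(n_users), rng.integers(n_items), rng.integers(1, 6))
           for _ in range(500)]

P = 0.1 * rng.standard_normal((n_users, rank))   # user factors
Q = 0.1 * rng.standard_normal((n_items, rank))   # item factors
lr, reg = 0.01, 0.05

for epoch in range(30):
    for u, i, r in ratings:
        err = r - P[u] @ Q[i]
        P[u] += lr * (err * Q[i] - reg * P[u])
        Q[i] += lr * (err * P[u] - reg * Q[i])

u, i, r = ratings[0]
print(f"observed {r}, predicted {P[u] @ Q[i]:.2f}")
```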
  • 74
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-03-08
    Description: Many websites offering Location Based Services (LBS) provide a $k$NN search interface that returns the top-$k$ nearest-neighbor objects (e.g., nearest restaurants) for a given query location. This paper addresses the problem of efficiently crawling all objects from an LBS website through the public $k$NN web search interface it provides. Specifically, we develop crawling algorithms for 2D and for higher-dimensional spaces, and demonstrate through theoretical analysis that the overhead of our algorithms can be bounded by a function of the number of dimensions and the number of crawled objects, regardless of the underlying distribution of the objects. We also extend the algorithms to leverage scenarios where auxiliary information about the underlying data distribution is available, e.g., the population density of an area, which is often positively correlated with the density of LBS objects. Extensive experiments on real-world datasets demonstrate the superiority of our algorithms over the state-of-the-art competitors in the literature.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 75
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-03-08
    Description: Owing to the wide availability of the global positioning system (GPS) and digital road maps, road network navigation services have become a basic application on many mobile devices. Path planning, a fundamental function of road network navigation services, finds a route between a specified start location and destination. The efficiency of this path planning function is critical for mobile users on roads due to various dynamic scenarios, such as a sudden change in driving direction, unexpected traffic conditions, lost or unstable GPS signals, and so on. In these scenarios, the path planning service needs to be delivered in a timely fashion. In this paper, we propose a system, namely Path Planning by Caching (PPC), that answers a new path planning query in real time by efficiently caching and reusing historical queried paths. Unlike conventional cache-based path planning systems, where a cached path is used only when it matches the new query perfectly, PPC leverages partially matched queries to answer part(s) of the new query (the sub-path reuse idea is sketched after this entry). As a result, the server only needs to compute the unmatched path segments, thus significantly reducing the overall system workload. Comprehensive experiments on a real road network database show that our system outperforms the state-of-the-art path planning techniques, reducing computation latency by 32 percent on average.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
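The sketch below illustrates only the reuse principle that makes caching attractive for path planning: any sub-path of a cached shortest path is itself a shortest path, so a new query whose endpoints both lie on a cached path can be answered without recomputation. It is a toy illustration on a grid graph with networkx, not the PPC system or its partial-matching machinery.

```python
# Minimal sketch of the sub-path reuse idea behind cache-based path planning:
# any sub-path of a cached shortest path is itself a shortest path, so a new
# query whose endpoints both lie on a cached path can be answered from cache.
# This is an illustration of the principle, not the PPC system itself.
import networkx as nx

G = nx.grid_2d_graph(10, 10)           # toy road network with unit edge costs

cache = []                             # previously computed shortest paths

def plan(src, dst):
    for path in cache:                 # try to answer from cache first
        if src in path and dst in path:
            i, j = path.index(src), path.index(dst)
            return path[i:j + 1] if i <= j else list(reversed(path[j:i + 1]))
    path = nx.shortest_path(G, src, dst)   # cache miss: compute and store
    cache.append(path)
    return path

print(plan((0, 0), (9, 9)))   # computed and cached
print(plan((2, 2), (5, 5)))   # may be served entirely from the cached path
```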
  • 76
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-03-08
    Description: Given a query graph $q$, retrieving the data graphs $g$ from a set $D$ of data graphs such that $q$ contains $g$, namely supergraph containment search, is fundamental in graph data analysis with a wide range of real applications. It is very challenging due to the NP-Completeness of subgraph isomorphism testing. Driven by many real applications, in this paper, we study the problem of probabilistic supergraph search; that is, given a set $D$ of uncertain data graphs, a certain query graph $q$ and a probability threshold $\theta$, we retrieve the data graphs $g^{u}$ from $D$ such that the probability of $q$ containing $g^{u}$ is not smaller than $\theta$. We show that besides the NP-Completeness of subgraph isomorphism testing, the problem of calculating probabilities is #P-Complete; thus, it is even more challenging than supergraph containment search. To tackle the computational hardness, we first propose two novel pruning rules, based on probabilistic connectivity and features, respectively, to efficiently prune non-promising data graphs. Then, efficient verification algorithms are developed with the aim of sharing computation and terminating non-promising computation as early as possible. Extensive performance studies on both real and synthetic data demonstrate the efficiency and effectiveness of our techniques in practice.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 77
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-03-08
    Description: The availability of many heterogeneous but related views of data has arisen in numerous clustering problems. Different views encode distinct representations of the same data, which often admit the same underlying cluster structure. The goal of multi-view clustering is to properly combine information from multiple views so as to generate high quality clustering results that are consistent across different views. Based on max-product belief propagation, we propose a novel multi-view clustering algorithm termed multi-view affinity propagation (MVAP). The basic idea is to establish a multi-view clustering model consisting of two components, which measure the within-view clustering quality and the explicit clustering consistency across different views, respectively. Solving this model is NP-hard, so a multi-view affinity propagation scheme is proposed, which works by passing messages both within individual views and across different views. However, the exemplar consistency constraint makes direct optimization of the message updates intractable. We therefore use previously developed mathematical techniques to simplify both the messages and the cluster assignment vector computations into simpler yet functionally equivalent forms. Experimental results on several real-world multi-view datasets show that MVAP outperforms existing multi-view clustering algorithms. It is especially suitable for clustering more than two views.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 78
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-03-08
    Description: Due to the low quality of crowdsourced labelers, the integrated label of each example is usually inferred from its multiple noisy labels provided by different labelers. This paper proposes a novel algorithm, Ground Truth Inference using Clustering (GTIC), to improve the quality of integrated labels for multi-class labeling. For a K-class labeling task, GTIC utilizes the multiple noisy label sets of the examples to generate features. Then, it uses a K-Means algorithm to cluster all examples into K different groups, each of which is mapped to a specific class; examples in the same cluster are assigned the corresponding class label (a simplified sketch follows this entry). We compare GTIC with four existing multi-class ground truth inference algorithms, majority voting (MV), Dawid & Skene's (DS), ZenCrowd (ZC) and Spectral DS (SDS), on one synthetic and eight real-world datasets. Experimental results show that the performance of GTIC is significantly superior to the others in terms of both accuracy and M-AUC. Moreover, GTIC runs about twenty times faster than the more complex EM-based inference algorithms.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
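A simplified, hypothetical re-implementation of the GTIC idea is sketched below: each example is represented by the distribution of its noisy labels, examples are clustered with K-Means into K groups, and each cluster is mapped to a class. The cluster-to-class mapping used here (majority vote inside each cluster) is an assumption made for the sketch, not necessarily the paper's exact rule.

```python
# Sketch of the GTIC idea: represent each example by the distribution of its
# noisy crowd labels, cluster with K-Means into K groups, and map each cluster
# to a class. The majority-vote cluster-to-class mapping below is a simplifying
# assumption for illustration.
import numpy as np
from collections import Counter
from sklearn.cluster import KMeans

K = 3
# noisy label sets: one list of labeler votes per example (toy data)
noisy_labels = [[0, 0, 1], [0, 0, 0], [1, 1, 2], [2, 2, 2], [1, 1, 1], [2, 1, 2]]

# feature vector = normalized count of each class among an example's labels
features = np.array([[Counter(ls)[c] / len(ls) for c in range(K)]
                     for ls in noisy_labels])

clusters = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(features)

# map each cluster to the class that dominates its members' votes
integrated = np.empty(len(noisy_labels), dtype=int)
for c in range(K):
    members = [l for i, ls in enumerate(noisy_labels) if clusters[i] == c for l in ls]
    integrated[clusters == c] = Counter(members).most_common(1)[0][0]

print(integrated)   # one inferred label per example
```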
  • 79
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-01-12
    Description: Given a set of objects $\mathcal{O}$, each with $d$ numeric attributes, a top-$k$ preference scores these objects using a linear combination of their attribute values, where the weight on each attribute reflects the interest in this attribute. Given a query preference $q$, a top-$k$ query finds the $k$ objects in $\mathcal{O}$ with the highest scores with respect to $q$. Given a query object $o$ and a set of preferences $\mathcal{Q}$, a reverse top-$k$ query finds all preferences $q \in \mathcal{Q}$ for which $o$ becomes one of the top $k$ objects with respect to $q$ (a naive full-scan sketch of both queries follows this entry). Previous solutions to these problems are effective only in low dimensions. In this paper, we develop a solution for much higher dimensions (up to high tens), if many preferences exhibit sparsity, i.e., each specifies non-zero weights for only a handful (say 5-7) of attributes (though the subsets of such attributes and their weights can vary greatly). Our idea is to select carefully a set of low-dimensional core subspaces to “cover” the sparse preferences in a workload. These subspaces allow us to index them more effectively than the full-dimensional space. Being multi-dimensional, each subspace covers many possible preferences; furthermore, multiple subspaces can jointly cover …
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
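The naive baseline below only illustrates the semantics of top-$k$ and reverse top-$k$ preference queries with sparse weight vectors via a full scan; the paper's actual contribution, indexing a set of low-dimensional core subspaces, is not reproduced. The data and preference construction are arbitrary toy choices.

```python
# Naive full-scan baseline that only illustrates the query semantics; the
# paper's core-subspace indexing is not reproduced here.
import numpy as np

rng = np.random.default_rng(1)
n, d, k = 1000, 40, 5
objects = rng.random((n, d))

def sparse_pref(weights):              # e.g., {3: 0.7, 17: 0.3}
    q = np.zeros(d)
    for attr, w in weights.items():
        q[attr] = w
    return q

def top_k(q):
    scores = objects @ q
    return np.argsort(-scores)[:k]     # indices of the k highest-scoring objects

def reverse_top_k(obj_id, preferences):
    return [i for i, q in enumerate(preferences) if obj_id in top_k(q)]

prefs = [sparse_pref({rng.integers(d): 1.0}) for _ in range(20)]
print(top_k(prefs[0]))
print(reverse_top_k(int(top_k(prefs[0])[0]), prefs))
```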
  • 80
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-01-12
    Description: Answering queries using views has proven effective for querying relational and semistructured data. This paper investigates this issue for graph pattern queries based on graph simulation. We propose a notion of pattern containment to characterize graph pattern matching using graph pattern views. We show that a pattern query can be answered using a set of views if and only if it is contained in the views. Based on this characterization, we develop efficient algorithms to answer graph pattern queries. We also study problems for determining (minimal, minimum) containment of pattern queries. We establish their complexity (from cubic-time to NP-complete) and provide efficient checking algorithms (approximation when the problem is intractable). In addition, when a pattern query is not contained in the views, we study maximally contained rewriting to find approximate answers; we show that it is in cubic-time to compute such rewriting, and present a rewriting algorithm. We experimentally verify that these methods are able to efficiently answer pattern queries on large real-world graphs.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 81
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-01-12
    Description: Transactional memory, although a promising technique, has so far suffered from the absence of an efficient hardware implementation. Intel's Haswell microarchitecture introduced hardware transactional memory (HTM) in mainstream CPUs. HTM allows for efficient concurrent, atomic operations, which is also highly desirable in the context of databases. On the other hand, HTM has several limitations that, in general, prevent a one-to-one mapping of database transactions to HTM transactions. In this work, we devise several building blocks that can be used to exploit HTM in main-memory databases. We show that HTM allows for nearly lock-free processing of database transactions by carefully controlling the data layout and the access patterns. The HTM component is used for detecting the (infrequent) conflicts, which allows for an optimistic, and thus very low-overhead, execution of concurrent transactions. We evaluate our approach on a four-core desktop and a 28-core server system and find that HTM indeed provides a scalable, powerful, and easy-to-use synchronization primitive.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 82
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-04-01
    Description: How can users who know neither programming nor statistics explore large databases? We present a novel interface, designed to guide explorers through their data: Blaeu. Blaeu is a database front-end, “boosted” with unsupervised learning primitives. Thanks to these primitives, it can summarize and recommend queries. Our first contribution is Blaeu's interaction model. With Blaeu, users explore the data through data maps. A data map is an interactive set of clusters, which users navigate with zooms and projections. Our second contribution is Blaeu's engine. We present three mapping algorithms, for three different settings. The first algorithm deals with small to medium databases, the second one targets high dimensional spaces, and the last one focuses on speed and interaction. We then present an optimization strategy based on sampling. Our experiments reveal that Blaeu can cluster millions of tuples with hundreds of columns in a few seconds on commodity hardware.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 83
    Publication Date: 2016-04-01
    Description: In recent years, the boundaries between e-commerce and social networking have become increasingly blurred. Many e-commerce websites support the mechanism of social login, where users can sign in to the websites using their social network identities such as their Facebook or Twitter accounts. Users can also post their newly purchased products on microblogs with links to the e-commerce product web pages. In this paper, we propose a novel solution for cross-site cold-start product recommendation, which aims to recommend products from e-commerce websites to users at social networking sites in “cold-start” situations, a problem which has rarely been explored before. A major challenge is how to leverage knowledge extracted from social networking sites for cross-site cold-start product recommendation. We propose to use the linked users across social networking sites and e-commerce websites (users who have social networking accounts and have made purchases on e-commerce websites) as a bridge to map users' social networking features to another feature representation for product recommendation. Specifically, we propose to learn both users' and products' feature representations (called user embeddings and product embeddings, respectively) from data collected from e-commerce websites using recurrent neural networks, and then apply a modified gradient boosting trees method to transform users' social networking features into user embeddings. We then develop a feature-based matrix factorization approach which can leverage the learnt user embeddings for cold-start product recommendation. Experimental results on a large dataset constructed from the largest Chinese microblogging service Sina Weibo and the largest Chinese B2C e-commerce website JingDong show the effectiveness of our proposed framework.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 84
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-04-01
    Description: High-dimensional $k$-nearest-neighbor ($k$NN) search has a wide range of applications in multimedia information retrieval. Existing disk-based $k$NN search methods incur significant I/O costs in the candidate refinement phase. In this paper, we propose to cache compact approximate representations of data points in main memory in order to reduce the candidate refinement time during $k$NN search (a simplified sketch of this scheme follows this entry). This problem raises two challenging issues: (i) what is the most effective encoding scheme for data points to support $k$NN search? and (ii) what is the optimal number of bits for encoding a data point? For (i), we formulate and solve a novel histogram optimization problem that decides the most effective encoding scheme. For (ii), we develop a cost model for automatically tuning the optimal number of bits for encoding points. In addition, our approach is generic and applicable to both exact and approximate $k$NN search methods. Extensive experimental results on real datasets demonstrate that our proposal can accelerate the candidate refinement time of $k$NN search by at least an order of magnitude.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
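A rough sketch of the overall caching scheme is shown below under simplifying assumptions: each dimension is uniformly quantized into a fixed number of bits (a crude stand-in for the paper's histogram-optimized encoding and cost-model-tuned bit budget), approximate distances are computed from the compact in-memory codes, and only a shortlist of candidates is refined with exact distances.

```python
# Candidate refinement with compact in-memory codes (simplified): uniform
# per-dimension quantization stands in for the paper's optimized histograms,
# and the bit budget is fixed rather than tuned by a cost model.
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((10000, 32)).astype(np.float32)
query = rng.random(32).astype(np.float32)
bits, k, shortlist = 8, 10, 200

# uniform per-dimension quantization into 2^bits levels
lo, hi = X.min(axis=0), X.max(axis=0)
scale = (2**bits - 1) / (hi - lo)
codes = np.round((X - lo) * scale).astype(np.uint8)          # compact codes kept in memory

approx = codes / scale + lo                                  # reconstruct points from codes
approx_dist = ((approx - query) ** 2).sum(axis=1)            # cheap approximate distances

cand = np.argpartition(approx_dist, shortlist)[:shortlist]   # candidate shortlist
exact = ((X[cand] - query) ** 2).sum(axis=1)                 # refine with exact distances
print(cand[np.argsort(exact)[:k]])                           # final top-k ids
```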
  • 85
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-04-01
    Description: Obtaining the items highly relevant to a given set of query items is a key task for various applications, such as recommendation and relationship prediction. A family of path-based relevance metrics, which quantify item relevance based on the paths in an item graph, has been shown to be effective in capturing relevance in many applications. Despite their effectiveness, path-based relevance normally requires time-consuming iterative computation. We propose an approach to quickly obtain the top-k most relevant items for a given query item set. Our approach uses novel score bounds to detect the emergence of the top-k items during the computation. The approach is designed for a distributed environment, which makes it scale to massive graphs having billions of nodes. Our experimental results show that the proposed approach can provide results up to two orders of magnitude faster than previously proposed approaches and can scale well with both the size of the input and the number of machines used in the computation.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 86
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-04-01
    Description: Travel planning and recommendation are important aspects of transportation. We propose and investigate a novel Collective Travel Planning (CTP) query that finds the lowest-cost route connecting multiple sources and a destination via at most $k$ meeting points (a brute-force illustration of the query semantics follows this entry). When multiple travelers target the same destination (e.g., a stadium or a theater), they may want to assemble at meeting points and then go together to the destination by public transport to reduce their global travel cost (e.g., energy, money, or greenhouse-gas emissions). This type of functionality holds the potential to bring significant benefits to society and the environment, such as reducing energy consumption and greenhouse-gas emissions, enabling smarter and greener transportation, and reducing traffic congestion. The CTP query is Max SNP-hard. To compute the query efficiently, we develop two algorithms: an exact algorithm and an approximation algorithm. The exact algorithm is capable of finding the optimal result for small values of $k$ (e.g., $k = 2$) in interactive time, while the approximation algorithm, which has a 5-approximation ratio, is suitable for other situations. The performance of the CTP query is studied experimentally with real and synthetic spatial data.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
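To make the query semantics concrete, the following brute-force sketch solves a heavily simplified CTP instance with a single meeting point: it scans all nodes and picks the one minimizing the sum of the travelers' shortest-path costs to it plus the cost from it to the destination. The cost model and the exhaustive scan are assumptions made for illustration; this is neither the paper's exact algorithm nor its 5-approximation.

```python
# Brute-force illustration of the CTP query with one meeting point (k = 1):
# choose the node minimizing the total cost of all travelers reaching it plus
# one shared leg from it to the destination. Not the paper's algorithms.
import networkx as nx

G = nx.grid_2d_graph(8, 8)                      # toy road network, unit edge costs
sources = [(0, 0), (0, 7), (7, 0)]
destination = (7, 7)

# undirected graph, so distance to the destination equals distance from it
dist_to_dest = nx.shortest_path_length(G, source=destination)
dist_from_src = [nx.shortest_path_length(G, source=s) for s in sources]

best = min(G.nodes,
           key=lambda m: sum(d[m] for d in dist_from_src) + dist_to_dest[m])
cost = sum(d[best] for d in dist_from_src) + dist_to_dest[best]
print("meeting point:", best, "total cost:", cost)
```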
  • 87
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-04-01
    Description: Language learners' limited lexical knowledge leads to imprecise wording. This is especially true when they attempt to express their emotions. Many learners rely heavily on the traditional thesaurus. Unfortunately, this fails to provide appropriate suggestions for lexical choices. To better aid English-as-a-second-language learners with word choices, we propose RESOLVE, which provides ranked synonyms of emotion words based on contextual information. RESOLVE suggests precise emotion words with respect to the events in the relevant context. Patterns are learned to capture emotion events, and various factors are considered in the scoring function for ranking emotion words. We also describe an online writing system developed using RESOLVE and evaluate its effectiveness for learning assistance with a writing task. Experimental results showed that RESOLVE yielded superior performance on NDCG@5, significantly outperforming both PMI and SVM approaches, and offered better suggestions than Roget's Thesaurus and PIGAI (an online automated essay scoring system). Moreover, when applied to the writing task, the appropriateness of students' emotion word choices improved by 30 percent. Less-proficient learners benefited more from RESOLVE than highly proficient learners. Post-tests also showed that after using RESOLVE, less-proficient learners' ability to use emotion words approached that of highly proficient learners. RESOLVE thus enables learners to use precise emotion words.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 88
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-04-01
    Description: Business intelligence (BI) systems depend on efficient integration of disparate and often heterogeneous data. The integration of data is governed by data-intensive flows and is driven by a set of information requirements. Designing such flows is in general a complex process that, due to the complexity of business environments, is hard to do manually. In this paper, we deal with the challenge of efficient design and maintenance of data-intensive flows and propose an incremental approach, namely CoAl, for semi-automatically consolidating data-intensive flows satisfying a given set of information requirements. CoAl works at the logical level and consolidates data flows from either high-level information requirements or platform-specific programs. As CoAl integrates a new data flow, it opts for maximal reuse of existing flows and applies a customizable cost model tuned for minimizing the overall cost of a unified solution. We demonstrate the efficiency and effectiveness of our approach through an experimental evaluation using our implemented prototype.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 89
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-04-01
    Description: Utility mining is a new development of data mining technology. Among utility mining problems, utility mining with the itemset share framework is a hard one, as no anti-monotonicity property holds for the interestingness measure. Prior works on this problem all employ a two-phase, candidate-generation approach, with one exception that is, however, inefficient and not scalable to large databases. The two-phase approach suffers from a scalability issue due to the huge number of candidates. This paper proposes a novel algorithm that finds high utility patterns in a single phase without generating candidates. The novelties lie in a high utility pattern growth approach, a lookahead strategy, and a linear data structure. Concretely, our pattern growth approach searches a reverse set enumeration tree and prunes the search space by utility upper bounding. We also look ahead to identify high utility patterns without enumeration via a closure property and a singleton property. Our linear data structure enables us to compute a tight bound for powerful pruning and to directly identify high utility patterns in an efficient and scalable way, which addresses the root cause of the inefficiency of prior algorithms. Extensive experiments on sparse and dense, synthetic and real-world data suggest that our algorithm is one to three orders of magnitude more efficient and is more scalable than the state-of-the-art algorithms.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 90
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-04-01
    Description: Meso-scale structural analysis, such as core decomposition, has uncovered groups of nodes that play important roles in the underlying complex systems. Existing core decomposition approaches generally focus on node properties such as degree and strength; these node-centric approaches can only capture limited information about the local neighborhood topology. In the present work, we propose a group-density-based core analysis approach that overcomes the drawbacks of the node-centric approaches. The proposed algorithmic approach focuses on the weight density, cohesiveness, and stability of a substructure. The method also assigns a unique score to every node, ranking the nodes by their degree of core-ness. To assess the correctness of the proposed method, we propose a synthetic benchmark with a planted core structure. A performance test on a null model is carried out using a weighted lattice without core structures. We further test the stability of the approach against random noise. The experimental results demonstrate the superiority of our algorithm over the state-of-the-art methods. We finally analyze the core structures of several popular weighted network models and real-life weighted networks. The experimental results reveal important node rankings and the hierarchical organization of the complex networks, which gives better insight into the underlying systems.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 91
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-04-08
    Description: Top- $k$ proximity query in large graphs is a fundamental problem with a wide range of applications. Various random walk based measures have been proposed to measure the proximity between different nodes. Although these measures are effective, efficiently computing them on large graphs is a challenging task. In this paper, we develop an efficient and exact local search method, FLoS (Fast Local Search), for top- $k$ proximity query in large graphs. FLoS guarantees the exactness of the solution. Moreover, it can be applied to a variety of commonly used proximity measures. FLoS is based on the no local optimum property of proximity measures. We show that many measures have no local optimum. Utilizing this property, we introduce several operations to manipulate transition probabilities and develop tight lower and upper bounds on the proximity values. The lower and upper bounds monotonically converge to the exact proximity value when more nodes are visited. We further extend FLoS to measures having local optimum by utilizing relationship among different measures. We perform comprehensive experiments on real and synthetic large graphs to evaluate the efficiency and effectiveness of the proposed method.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 92
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-06-03
    Description: Listwise learning to rank (LTR) is aimed at constructing a ranking model from listwise training data to order objects. In most existing studies, each training instance consists of a set of objects described by preference features. In a preference feature space for the objects in training, the structure of the objects is associated with the absolute preference degrees for the objects. The degrees significantly influence the ordering of the objects. Nevertheless, the structure of the training objects in their preference feature space has rarely been studied. In addition, most listwise LTR algorithms yield a single linear ranking model for all objects, but this ranking model may be insufficient to capture the underlying nonlinear ranking mechanism among all objects. This study proposes a divide-and-train method to learn a nonlinear ranking model from listwise training data. First, a rank-preserving clustering approach is used to infer the structure of objects in their preference feature space and all the objects in training data are divided into several clusters. Each cluster is assumed to correspond to a preference degree and an ordinal regression function is then learned. Second, considering that relations exist among the clusters, a multi-task listwise ranking approach is then employed to train linear ranking functions for all the clusters (or preference degrees) simultaneously. Our proposed method utilizes both the (relative) preferences among objects and the intrinsic structure of objects. Experimental results on benchmark data sets suggest that the proposed method outperforms state-of-the-art listwise LTR algorithms.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 93
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-06-03
    Description: Data exchange is the process of generating an instance of a target schema from an instance of a source schema such that the source data is reflected in the target. Generally, data exchange is performed using schema mappings, which represent high-level relations between the source and target schemas. In this paper, we argue that data exchange based solely on schema-level information limits the ability to express semantics in data exchange. We show that such schema-level mappings not only may result in entity fragmentation but are also unable to resolve some ambiguous data exchange scenarios. To address this problem, we propose Scalable Entity Preserving Data Exchange (SEDEX), a hybrid method based on data and schema mapping that employs similarities between relation trees of source and target relations to find the best relations that can host source instances. Our experiments show that SEDEX outperforms other methods in terms of the quality and scalability of data exchange.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 94
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-06-03
    Description: We propose TrustSVD, a trust-based matrix factorization technique for recommendations. TrustSVD integrates multiple information sources into the recommendation model in order to alleviate the data sparsity and cold-start problems and the degradation of recommendation performance they cause. An analysis of social trust data from four real-world data sets suggests that not only the explicit but also the implicit influence of both ratings and trust should be taken into consideration in a recommendation model. TrustSVD therefore builds on top of a state-of-the-art recommendation algorithm, SVD++ (which uses the explicit and implicit influence of rated items), by further incorporating both the explicit and implicit influence of trusted and trusting users on the prediction of items for an active user (a simplified prediction rule is sketched after this entry). The proposed technique is the first to extend SVD++ with social trust information. Experimental results on the four data sets demonstrate that TrustSVD achieves better accuracy than ten other recommendation techniques.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
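The sketch below shows a simplified prediction rule in the spirit of SVD++ extended with the factors of trusted users, which is the flavour of model TrustSVD builds on. The actual TrustSVD objective, its regularization, and its training procedure are not reproduced; all parameters are randomly initialized toy values.

```python
# Simplified SVD++-style predictor extended with trusted-user factors, in the
# spirit of TrustSVD. Not the paper's full model or training procedure.
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, rank = 20, 30, 5
mu = 3.6                                         # global average rating
bu = rng.normal(0, 0.1, n_users)                 # user biases
bi = rng.normal(0, 0.1, n_items)                 # item biases
P = rng.normal(0, 0.1, (n_users, rank))          # explicit user factors
Q = rng.normal(0, 0.1, (n_items, rank))          # item factors
Y = rng.normal(0, 0.1, (n_items, rank))          # implicit-feedback item factors
W = rng.normal(0, 0.1, (n_users, rank))          # factors of users in the role of trustees

def predict(u, j, rated_items, trusted_users):
    p = P[u].copy()
    if rated_items:                               # implicit influence of rated items
        p += Y[rated_items].sum(axis=0) / np.sqrt(len(rated_items))
    if trusted_users:                             # implicit influence of trusted users
        p += W[trusted_users].sum(axis=0) / np.sqrt(len(trusted_users))
    return mu + bu[u] + bi[j] + Q[j] @ p

print(predict(u=0, j=5, rated_items=[1, 2, 7], trusted_users=[3, 9]))
```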
  • 95
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-06-03
    Description: Truth discovery is an important technique for enabling reliable crowdsourcing applications. It aims to automatically discover the truths from possibly conflicting crowdsourced claims. Most existing truth discovery approaches focus on categorical applications, such as image classification. They use the accuracy, i.e., the rate of exactly correct claims, to capture the reliability of participants. As a consequence, they are not effective for truth discovery in quantitative applications, such as percentage annotation and object counting, where similarity rather than exact matching between crowdsourced claims and latent truths should be considered. In this paper, we propose two unsupervised Quantitative Truth Finders (QTFs) for truth discovery in quantitative crowdsourcing applications. One QTF explores an additive model and the other a multiplicative model to capture different relationships between crowdsourced claims and latent truths in different classes of quantitative tasks. These QTFs naturally incorporate the similarity between variables. Moreover, they use the bias and the confidence instead of the accuracy to capture participants' abilities in quantity estimation (a generic weighted-averaging baseline in this spirit is sketched after this entry). These QTFs are thus capable of accurately discovering quantitative truths in particular domains. Through extensive experiments, we demonstrate that these QTFs outperform other state-of-the-art approaches for truth discovery in quantitative crowdsourcing applications and that they are also quite efficient.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
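As a generic point of comparison, the sketch below implements a common iterative baseline for quantitative truth discovery: truths are estimated as precision-weighted averages of worker claims, and worker weights are re-estimated from their error variance. It captures the general spirit of reliability-weighted aggregation but is not the paper's additive or multiplicative QTF models.

```python
# Generic iterative truth finder for quantitative claims: precision-weighted
# averaging with per-worker variance estimates. A baseline in the same spirit
# as the paper's models, not the QTF algorithms themselves.
import numpy as np

rng = np.random.default_rng(0)
n_workers, n_tasks = 8, 50
truths = rng.uniform(0, 100, n_tasks)                       # unknown ground truth
noise = rng.uniform(0.5, 10, n_workers)                     # per-worker noise level
claims = truths + rng.normal(0, noise[:, None], (n_workers, n_tasks))

weights = np.ones(n_workers)
for _ in range(20):
    est = (weights[:, None] * claims).sum(axis=0) / weights.sum()   # weighted average
    var = ((claims - est) ** 2).mean(axis=1) + 1e-9                 # per-worker error variance
    weights = 1.0 / var                                             # more reliable => higher weight

print("mean abs error of naive average :", np.abs(claims.mean(axis=0) - truths).mean())
print("mean abs error of weighted truth:", np.abs(est - truths).mean())
```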
  • 96
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-06-03
    Description: Sentiment analysis of microblog texts has drawn much attention in both academia and industry. However, most current work focuses only on polarity classification. In this paper, we present an opinion mining system for Chinese microblogs called CMiner. Instead of polarity classification, CMiner focuses on more complicated opinion mining tasks: opinion target extraction and opinion summarization. Novel algorithms are developed for the two tasks and integrated into an end-to-end system. CMiner helps to effectively understand users' opinions towards different opinion targets in a microblog topic. Specifically, we develop an unsupervised label propagation algorithm for opinion target extraction. The opinion targets of all messages in a topic are collectively extracted based on the assumption that similar messages tend to focus on similar opinion targets. In addition, we build an aspect-based opinion summarization framework for microblog topics. After obtaining the opinion targets of all the microblog messages in a topic, we cluster the opinion targets into several groups and extract representative targets and summaries for each group. A co-ranking algorithm is proposed to rank both the opinion targets and microblog sentences simultaneously. Experimental results on a benchmark dataset show the effectiveness of our system and algorithms.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 97
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-06-03
    Description: In this paper, we present a supervised dictionary learning method for optimizing the feature-based Bag-of-Words (BoW) representation towards Information Retrieval. Following the cluster hypothesis, which states that points in the same cluster are likely to fulfill the same information need, we propose the use of an entropy-based optimization criterion that is better suited for retrieval instead of classification. We demonstrate the ability of the proposed method, abbreviated as EO-BoW, to improve the retrieval performance by providing extensive experiments on two multi-class image datasets. The BoW model can be applied to other domains as well, so we also evaluate our approach using a collection of 45 time-series datasets, a text dataset, and a video dataset. The gains are three-fold since the EO-BoW can improve the mean Average Precision, while reducing the encoding time and the database storage requirements. Finally, we provide evidence that the EO-BoW maintains its representation ability even when used to retrieve objects from classes that were not seen during the training.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 98
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-06-03
    Description: This paper elaborates on the ranked nodes method (RNM), which is used for constructing conditional probability tables (CPTs) for Bayesian networks consisting of a class of nodes called ranked nodes. Such nodes typically represent continuous quantities that lack well-established interval scales and are hence expressed on ordinal scales. Based on expert elicitation, the CPT of a child node is generated in RNM by aggregating the weighted states of the parent nodes with a weight expression (a small sketch of this style of CPT generation follows this entry). RNM is also applied to nodes that are expressed on interval scales. However, using the method in this way may be ineffective due to challenges that are not addressed in the existing literature but are demonstrated through an illustrative example in this paper. To overcome these challenges, the paper introduces a novel approach that facilitates the use of RNM. It consists of guidelines concerning the discretization of the interval scales into ordinal ones and the determination of a weight expression and weights based on the expert's assessments of the mode of the child node. The determination is premised on interpretations and feasibility conditions of the weights derived in the paper. The utilization of the approach is demonstrated with the illustrative example throughout the paper.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
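The following sketch illustrates the general style of RNM-like CPT generation for a single parent configuration, assuming a weighted-mean aggregation and a doubly truncated normal on [0, 1]; the weights, variance, and discretization are invented for illustration, and the paper's specific guidelines for interval-scale nodes are not reproduced.

```python
# Sketch of ranked-nodes-style CPT generation for one parent configuration:
# the child's column is a truncated normal on [0, 1] centred on the weighted
# mean of the parents' state values, discretized over the child's ordinal
# states. Weights, variance, and scales are illustrative assumptions.
import numpy as np
from scipy.stats import truncnorm

child_states = 5                                   # e.g., "very low" ... "very high"
parent_values = np.array([0.7, 0.3])               # parent states mapped to [0, 1]
weights = np.array([2.0, 1.0])                     # elicited parent weights
sigma = 0.15                                       # elicited uncertainty

mean = weights @ parent_values / weights.sum()     # weighted-mean aggregation
a, b = (0 - mean) / sigma, (1 - mean) / sigma      # standardized truncation bounds
dist = truncnorm(a, b, loc=mean, scale=sigma)

edges = np.linspace(0, 1, child_states + 1)        # equal-width child intervals
cpt_column = np.diff(dist.cdf(edges))              # probability mass per child state
print(cpt_column, cpt_column.sum())                # sums to 1 (up to rounding)
```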
  • 99
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-06-03
    Description: Graphs have been widely used to represent objects and object connections in applications such as the web, social networks, and citation networks. Mining influence relationships from graphs has gained increasing interest in recent years, because information on how graph objects influence each other can facilitate graph exploration, graph search, and connection recommendations. In this paper, we study the problem of detecting influence aspects, on which objects are connected, and influence degrees (or influence strengths), with which one graph node influences another on a given aspect. Existing techniques focus on inferring either the overall influence degrees or the influence types from graphs. In this paper, we propose a systematic approach to extract influence aspects and learn aspect-level influence strength. In particular, we first present a novel instance-merging-based method to extract influence aspects from the context of object connections. We then introduce two generative models, the Observed Aspect Influence Model (OAIM) and the Latent Aspect Influence Model (LAIM), to model the topological structure of graphs, the text content associated with graph objects, and the context in which the objects are connected. To learn OAIM and LAIM, we design both non-parallel and parallel Gibbs sampling algorithms. We conduct extensive experiments on synthetic and real data sets to show the effectiveness and efficiency of our methods. The experimental results show that our models can discover better results than existing approaches. Our learning algorithms also scale well on large data sets.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 100
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-06-03
    Description: With advances in geo-positioning technologies and geo-location services, a rapidly growing amount of spatio-textual objects is being collected in many applications such as location-based services and social networks, in which an object is described by its spatial location and a set of keywords (terms). Consequently, the study of spatial keyword search, which exploits both the location and the textual description of the objects, has attracted great attention from commercial organizations and research communities. In this paper, we study two fundamental problems for spatial keyword queries: top-$k$ spatial keyword search (TOPK-SK) and batch top-$k$ spatial keyword search (BTOPK-SK). Given a set of spatio-textual objects, a query location, and a set of query keywords, TOPK-SK retrieves the $k$ closest objects, each of which contains all keywords in the query (a naive full-scan baseline follows this entry). BTOPK-SK is the batch processing of sets of TOPK-SK queries. Based on the inverted index and the linear quadtree, we propose a novel index structure, called the inverted linear quadtree (IL-Quadtree), which is carefully designed to exploit both spatial and keyword-based pruning techniques to effectively reduce the search space. An efficient algorithm is then developed to tackle top-$k$ spatial keyword search. To further enhance the filtering capability of the signature of the linear quadtree, we propose a partition-based method. In addition, to deal with BTOPK-SK, we design a new computing paradigm that partitions the queries into groups based on both spatial proximity and textual relevance between queries. We show that the IL-Quadtree technique can also efficiently support BTOPK-SK. Comprehensive experiments on real and synthetic data clearly demonstrate the efficiency of our methods.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
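A full-scan baseline illustrating only the TOPK-SK semantics is given below: keep the objects that contain every query keyword, then rank the survivors by distance to the query location. The IL-Quadtree index and its pruning techniques are not shown; the toy objects and keywords are invented.

```python
# Full-scan baseline for the TOPK-SK semantics: keyword-containment filter,
# then rank by Euclidean distance to the query location. No indexing/pruning.
import math

objects = [
    {"loc": (1.0, 2.0), "keywords": {"coffee", "wifi"}},
    {"loc": (0.5, 1.5), "keywords": {"coffee", "cake", "wifi"}},
    {"loc": (5.0, 5.0), "keywords": {"coffee"}},
    {"loc": (1.2, 1.9), "keywords": {"tea", "wifi"}},
]

def topk_sk(query_loc, query_keywords, k):
    matches = [o for o in objects if query_keywords <= o["keywords"]]
    matches.sort(key=lambda o: math.dist(query_loc, o["loc"]))
    return matches[:k]

print(topk_sk((1.0, 2.0), {"coffee", "wifi"}, k=2))
```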