ALBERT

All Library Books, journals and Electronic Records Telegrafenberg

  • 1
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication date: 2018-04-04
    Description: Graph algorithms are pervasive in many applications ranging from targeted advertising to natural language processing. Recently, Asynchronous Graph Processing (AGP) has become a promising model for supporting graph algorithms on large-scale distributed computing platforms, because it offers faster convergence and lower synchronization cost than the synchronous model: there is no barrier between iterations. However, existing AGP methods still suffer from poor performance due to inefficient vertex state propagation. In this paper, we propose an effective and low-cost forward and backward sweeping execution method to accelerate state propagation for AGP, based on the key observation that states in AGP can be propagated between vertices much faster when the vertices are processed sequentially along the graph path within each round. By dividing the graph into paths and asynchronously processing the vertices on each path in an alternating forward and backward way according to their order on this path, vertex states in our approach can be quickly propagated to other vertices and converge faster with only little additional overhead. In order to support it efficiently on distributed platforms, we also propose a scheme to reduce the communication overhead, along with a static priority ordering scheme to further improve the convergence speed. Experimental results on a cluster with 1,024 cores show that our approach achieves excellent scalability for large-scale graph algorithms and that the overall execution time is reduced by at least 39.8 percent in comparison with the most cutting-edge methods.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Subject: Computer Science
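The key observation in the abstract above (states spread faster when vertices on a path are updated in order, in place) can be illustrated with a toy, single-machine sketch. Everything here is an assumption made for illustration: the graph is one unit-weight path, the vertex state is a shortest-path distance, and the function names are hypothetical. It is not the paper's distributed method.

```python
INF = float("inf")


def synchronous_rounds(n, source):
    """Rounds until convergence when every vertex reads last round's states."""
    dist = [INF] * n
    dist[source] = 0
    rounds = 0
    while True:
        new = dist[:]
        for v in range(n):
            for u in (v - 1, v + 1):  # neighbours of v on the path
                if 0 <= u < n and dist[u] + 1 < new[v]:
                    new[v] = dist[u] + 1
        if new == dist:
            return rounds, dist
        dist = new
        rounds += 1


def sweeping_rounds(n, source):
    """Rounds until convergence with in-place forward then backward sweeps."""
    dist = [INF] * n
    dist[source] = 0
    rounds = 0
    order = list(range(n))
    while True:
        changed = False
        for v in order + order[::-1]:  # forward sweep, then backward sweep
            for u in (v - 1, v + 1):
                if 0 <= u < n and dist[u] + 1 < dist[v]:
                    dist[v] = dist[u] + 1  # updated state is visible at once
                    changed = True
        if not changed:
            return rounds, dist
        rounds += 1
```

On an 8-vertex path with the source at vertex 3, the synchronous variant needs one round per hop, while a single forward and backward sweep already reaches every vertex.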
  • 2
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication date: 2018-04-04
    Description: Deal selection on Groupon is a typical social learning and decision making process, where the quality of a deal is usually unknown to the customers. The customers must acquire this knowledge through social learning from other social media, such as reviews on Yelp. Additionally, the quality of a deal depends on both the state of the vendor and the decisions of other customers on Groupon. How social learning and network externality affect the decisions of customers in deal selection on Groupon is our main interest. We develop a data-driven game-theoretic framework to understand rational deal selection behavior across social media. The sufficient condition of the Nash equilibrium is identified. A value-iteration algorithm is proposed to find the optimal deal selection strategy. We conduct a year-long experiment to trace the competition among deals on Groupon and the corresponding Yelp ratings. We use the dataset to analyze the deal selection game under realistic settings. Finally, the performance of the proposed social learning framework is evaluated with real data. The results suggest that customers do make decisions in a rational way instead of following naive strategies, and that there is still room to improve their decisions with assistance from the proposed framework.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Subject: Computer Science
  • 3
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication date: 2018-04-04
    Description: RDF question/answering (Q/A) allows users to ask questions in natural language over a knowledge base represented in RDF. To answer a natural language question, existing work takes a two-stage approach: question understanding and query evaluation. Its focus is on question understanding, to deal with the disambiguation of natural language phrases. The most common technique is joint disambiguation, which has an exponential search space. In this paper, we propose a systematic framework to answer natural language questions over an RDF repository (RDF Q/A) from a graph data-driven perspective. We propose a semantic query graph to model the query intention of the natural language question in a structural way, based on which RDF Q/A is reduced to a subgraph matching problem. More importantly, we resolve the ambiguity of natural language questions at the time when matches of the query are found. The cost of disambiguation is saved if no matches are found. More specifically, we propose two different frameworks to build the semantic query graph: one is relation (edge)-first and the other is node-first. We compare our method with some state-of-the-art RDF Q/A systems on the benchmark dataset. Extensive experiments confirm that our method not only improves the precision but also speeds up query performance greatly.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Subject: Computer Science
  • 4
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication date: 2018-04-04
    Description: Spatial clustering deals with the unsupervised grouping of places into clusters and finds important applications in urban planning and marketing. Current spatial clustering models disregard information about the people who are related to the clustered places and the times at which they are related to them. In this paper, we show how the density-based clustering paradigm can be extended to apply to places which are visited by users of a geo-social network. Our model considers spatio-temporal information and the social relationships between users who visit the clustered places. After formally defining the model and the distance measure it relies on, we provide alternatives to our model and the distance measure. We evaluate the effectiveness of our model via a case study on real data; in addition, we design two quantitative measures, called social entropy and community score, to evaluate the quality of the discovered clusters. The results show that temporal-geo-social clusters have special properties and cannot be found by applying simple spatial clustering approaches and other alternatives.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Subject: Computer Science
  • 5
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication date: 2018-04-04
    Description: Semantic drift is a common problem in iterative information extraction. Previous approaches for minimizing semantic drift may incur a substantial loss in recall. We observe that most semantic drifts are introduced by a small number of questionable extractions in the earlier rounds of iteration. These extractions subsequently introduce a large number of questionable results, which lead to the semantic drift phenomenon. We call these questionable extractions Drifting Points (DPs). If erroneous extractions are the “symptoms” of semantic drift, then DPs are the “causes” of semantic drift. In this paper, we propose a method to minimize semantic drift by identifying the DPs and removing the effect introduced by the DPs. We use isA (concept-instance) extraction as an example to describe our approach to cleaning information extraction errors caused by semantic drift, but we perform experiments on different relation extraction processes on three large real data extraction collections. The experimental results show that our DP cleaning method enables us to clean around 90 percent of incorrect instances or patterns with about 90 percent precision, which outperforms the previous approaches we compare with.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Subject: Computer Science
  • 6
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication date: 2018-04-04
    Description: Graphs are used for representing and understanding objects and their relationships in numerous applications such as social networks, Semantic Webs, and biological networks. Integrity assurance of data and query results for graph databases is an essential security requirement. In this paper, we propose two efficient integrity verification schemes: HMACs for graphs (gHMAC) for two-party data sharing, and redactable HMACs for graphs (rgHMAC) for third-party data sharing, such as a cloud-based graph database service. We compute one HMAC value for both schemes, plus two other verification objects for the rgHMAC scheme that are shared with the verifier. We show that the proposed schemes are provably secure with respect to integrity attacks on the structure and/or content of graphs and query results. The proposed schemes have linear complexity in terms of the number of vertices and edges in the graphs, which is shown to be optimal. Our experimental results corroborate that the proposed HMAC-based schemes for graphs are highly efficient compared to digital signature-based schemes: computation of HMAC tags is about 10 times faster than the computation of digital signatures.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Subject: Computer Science
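The idea of protecting both the structure and the content of a graph with a single HMAC value can be sketched with Python's standard `hmac` module. This is a minimal illustration under stated assumptions, not the gHMAC or rgHMAC construction from the paper: the canonical JSON serialization, the string-valued vertex contents, and the names `graph_tag` and `verify` are all hypothetical choices made here.

```python
import hashlib
import hmac
import json


def graph_tag(key, vertices, edges):
    """Return one HMAC-SHA256 tag over a canonical form of the graph.

    vertices: dict mapping vertex id (str) to content (str)
    edges:    iterable of (source id, target id) pairs
    Sorting makes the tag independent of the order in which vertices
    and edges happen to be listed (an assumption of this sketch).
    """
    canonical = json.dumps(
        {"v": sorted(vertices.items()), "e": sorted(edges)},
        separators=(",", ":"),
    ).encode("utf-8")
    return hmac.new(key, canonical, hashlib.sha256).hexdigest()


def verify(key, vertices, edges, tag):
    """Recompute the tag and compare it in constant time."""
    return hmac.compare_digest(graph_tag(key, vertices, edges), tag)
```

A verifier holding the shared key recomputes the tag over the received graph; a changed edge or a changed vertex content produces a different tag. The cost is a single pass over vertices and edges plus sorting, which is in the spirit of, but not identical to, the linear complexity the abstract states.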
  • 7
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication date: 2018-04-04
    Description: Reliable propagation of information through large networks, e.g., communication networks, social networks, or sensor networks, is very important in many applications concerning marketing, social networks, and wireless sensor networks. However, social ties of friendship may be obsolete, and communication links may fail, inducing the notion of uncertainty in such networks. In this paper, we address the problem of optimizing information propagation in uncertain networks given a constrained budget of edges. We show that this problem requires solving two NP-hard subproblems: the computation of expected information flow, and the optimal choice of edges. To compute the expected information flow to a source vertex, we propose the F-tree as a specialized data structure that identifies independent components of the graph for which the information flow can either be computed analytically and efficiently, or for which traditional Monte-Carlo sampling can be applied independently of the remaining network. For the problem of finding the optimal edges, we propose a series of heuristics that exploit properties of this data structure. Our evaluation shows that these heuristics lead to high-quality solutions, thus yielding high information flow, while maintaining low running time.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Subject: Computer Science
  • 8
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication date: 2018-04-04
    Description: Business to Business (B2B) marketing aims at meeting the needs of other businesses instead of individual consumers, and thus entails management of more complex business needs than consumer marketing. The buying processes of business customers involve a series of different marketing campaigns providing multifaceted information about the products or services. While most existing studies focus on individual consumers, little has been done to guide business customers, due to the dynamic and complex nature of these business buying processes. To this end, in this paper, we focus on providing a unified view of social and temporal modeling for B2B marketing campaign recommendation. Along this line, we first exploit the temporal behavior patterns in the B2B buying processes and develop a marketing campaign recommender system. Specifically, we start by constructing a temporal graph as the knowledge representation of the buying process of each business customer. The temporal graph can effectively extract and integrate the campaign order preferences of individual business customers. It is also worth noting that our system is backward compatible, since the participation frequency used in conventional static recommender systems is naturally embedded in our temporal graph. The campaign recommender is then built in a low-rank graph reconstruction framework based on probabilistic graphical models. Our framework can identify the common graph patterns and predict missing edges in the temporal graphs. In addition, since business customers very often have different decision makers from the same company, we also incorporate social factors, such as community relationships of the business customers, to further improve the overall performance of missing edge prediction and recommendation. Finally, we have performed extensive empirical studies on real-world B2B marketing data sets, and the results show that the proposed method can effectively improve the quality of the campaign recommendations for challenging B2B marketing tasks.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Subject: Computer Science
  • 9
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication date: 2018-04-04
    Description: In this paper, we propose a prototype-based classification model for evolving data streams, called SyncStream, which allows dynamically modeling time-changing concepts and making predictions in a local fashion. Instead of learning a single model on a fixed or adaptive sliding window of historical data, or ensemble learning a set of weighted base classifiers, SyncStream captures evolving concepts by dynamically maintaining a set of prototypes in a proposed P-Tree, which are obtained based on error-driven representativeness learning and synchronization-inspired constrained clustering. To identify abrupt concept drifts in data streams, heuristic approaches based on PCA and statistical analysis have been introduced. To further learn the associations among distributed data streams, the extended P-Tree structure and a KNN-style strategy are introduced. We demonstrate that our new data stream classification approach has several attractive benefits: (a) SyncStream is capable of dynamically modeling the evolving concepts from even a small set of prototypes. (b) Owing to synchronization-based constrained clustering and the P-Tree, SyncStream supports efficient and effective data representation and maintenance. (c) SyncStream is also tolerant of inappropriate or noisy examples via error-driven representativeness learning. (d) SyncStream allows learning relationships among distributed data streams at the instance level. The experimental results indicate its efficiency and effectiveness.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Subject: Computer Science
  • 10
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication date: 2018-04-04
    Description: Community search is important in graph analysis and can be used in many real applications. In the literature, various community models have been proposed. However, most of them cannot identify well the overlaps between communities, which are an essential feature of real graphs. To address this issue, the $k$-clique percolation community model was proposed and has been proven effective in many applications. Motivated by this, in this paper, we adopt the $k$-clique percolation community model and study the densest clique percolation community search problem, which aims to find the $k$-clique percolation community with the maximum $k$ value that contains a given set of query nodes. We adopt an index-based approach to solve this problem. Based on the observation that a $k$-clique percolation community is a union of maximal cliques, we devise a novel compact index, $\mathsf{DCPC}$-$\mathsf{Index}$, to preserve the maximal cliques and their connectivity information of the input graph. With $\mathsf{DCPC}$-$\mathsf{Index}$, we can answer the densest clique percolation community query efficiently. Besides, we also propose an index construction algorithm based on the definition of $\mathsf{DCPC}$-$\mathsf{Index}$ and further improve the algorithm in terms of efficiency and memory consumption. We conduct extensive performance studies on real graphs, and the experimental results demonstrate the efficiency of our index-based query processing algorithm and index construction algorithm.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Subject: Computer Science
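The $k$-clique percolation model named in the abstract above groups $k$-cliques that share $k-1$ nodes, so a node may belong to several communities. The following brute-force sketch shows the model on tiny graphs only; it enumerates every $k$-subset of nodes and is not the index-based approach of the paper. The function names and the adjacency-dictionary input are assumptions of this sketch.

```python
from itertools import combinations


def build_adj(edges):
    """Build an undirected adjacency dict (node -> set of neighbours)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def k_clique_communities(adj, k):
    """Return the k-clique percolation communities as sorted node lists."""
    nodes = sorted(adj)
    # Brute force: every k-subset whose nodes are pairwise adjacent.
    cliques = [
        frozenset(c)
        for c in combinations(nodes, k)
        if all(b in adj[a] for a, b in combinations(c, 2))
    ]
    parent = list(range(len(cliques)))  # union-find over clique indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Two k-cliques are adjacent when they share exactly k - 1 nodes.
    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) == k - 1:
            parent[find(i)] = find(j)

    groups = {}
    for i, clique in enumerate(cliques):
        groups.setdefault(find(i), set()).update(clique)
    return sorted(sorted(g) for g in groups.values())
```

Two triangles that share an edge merge into one community for $k = 3$, while two triangles that share only a single node stay separate and overlap in that node.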
  • 11
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication date: 2018-04-04
    Description: The class imbalance problem in machine learning occurs when certain classes are underrepresented relative to the others, leading to a learning bias toward the majority classes. To cope with the skewed class distribution, many learning methods featuring minority oversampling have been proposed and have proved effective. To reduce information loss during feature space projection, this study proposes a novel oversampling algorithm, named minority oversampling in kernel adaptive subspaces (MOKAS), which exploits the invariant feature extraction capability of a kernel version of the adaptive subspace self-organizing maps. The synthetic instances are generated from well-trained subspaces, and then their pre-images are reconstructed in the input space. Additionally, these instances characterize nonlinear structures present in the minority class data distribution and help the learning algorithms to counterbalance the skewed class distribution in a desirable manner. Experimental results on both real and synthetic data show that the proposed MOKAS is capable of modeling complex data distributions and outperforms a set of state-of-the-art oversampling algorithms.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Subject: Computer Science
  • 12
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication date: 2018-04-04
    Description: In this paper, we study a novel variant of obstructed nearest neighbor queries, namely, range-based obstructed nearest neighbor (RONN) search. As a natural generalization of continuous obstructed nearest-neighbor (CONN), an RONN query retrieves a set of obstructed nearest neighbors corresponding to every point in a specified range. We propose a new index, namely the binary obstructed tree (called OB-tree), for indexing complex objects in the obstructed space. The novelty of the OB-tree lies in the idea of dividing the obstructed space into non-obstructed subspaces, aiming to efficiently retrieve highly qualified candidates for RONN processing. We develop an algorithm for construction of the OB-tree and propose a space division scheme, called the optimal obstacle balance (OOB2) scheme, to address the tree balance problem. Accordingly, we propose an efficient algorithm, called RONN by OB-tree Acceleration (RONN-OBA), which exploits the OB-tree and a binary traversal order of data objects to accelerate query processing of RONN. In addition, we extend our work in several aspects regarding the shape of obstacles and range-based $k$NN queries in obstructed space. Finally, we conduct a comprehensive performance evaluation using both real and synthetic datasets to validate our ideas and the proposed algorithms. The experimental results show that the RONN-OBA algorithm significantly outperforms the two R-tree based algorithms and RONN-OA.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Subject: Computer Science
  • 13
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication date: 2018-04-04
    Description: Existing graph classification usually relies on an exhaustive enumeration of substructure patterns, where the number of substructures expands exponentially with respect to the size of the graph set. Recently, the Weisfeiler-Lehman (WL) graph kernel has achieved the best performance in terms of both accuracy and efficiency among state-of-the-art methods. However, it is still time-consuming, especially for large-scale graph classification tasks. In this paper, we present a $K$-Ary Tree based Hashing (KATH) algorithm, which is able to obtain competitive accuracy with a very fast runtime. The main idea of KATH is to construct a traversal table to quickly approximate the subtree patterns in WL using $K$-ary trees. Based on the traversal table, KATH employs a recursive indexing process that performs only $r$ rounds of matrix indexing to generate all $(r-1)$-depth $K$-ary trees, where the leaf node labels of a tree can uniquely specify the pattern. After that, the MinHash scheme is used to fingerprint the acquired subtree patterns for a graph. Our experimental results on both real-world and synthetic data sets show that KATH runs significantly faster than state-of-the-art methods while achieving competitive or better accuracy.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Subject: Computer Science
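The MinHash step mentioned in the abstract above fingerprints a set so that the fraction of matching signature positions estimates the Jaccard similarity of two sets. The sketch below shows generic MinHash over sets of strings; how KATH encodes subtree patterns is not described in this record, so the string patterns, the seeded BLAKE2b hash family, and the function names are assumptions made for illustration.

```python
import hashlib


def minhash_signature(items, num_hashes=64):
    """Return a MinHash signature (list of ints) for a non-empty set of strings.

    Hash function number `seed` is simulated by prefixing the seed to the
    item before hashing with BLAKE2b (an assumption of this sketch).
    """
    signature = []
    for seed in range(num_hashes):
        prefix = str(seed).encode("utf-8") + b"|"
        signature.append(
            min(
                int.from_bytes(
                    hashlib.blake2b(
                        prefix + item.encode("utf-8"), digest_size=8
                    ).digest(),
                    "big",
                )
                for item in items
            )
        )
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which two signatures agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Two graphs with the same set of patterns receive identical signatures regardless of the order in which the patterns were produced, and the signature length stays fixed however many patterns a graph has.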
  • 14
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication date: 2018-04-07
    Description: SimRank is an important measure of vertex-pair similarity according to the structure of graphs. Although progress has been achieved, existing methods still face challenges in handling large graphs. Besides huge index construction and maintenance cost, existing methods may require considerable search space and time overheads in the online SimRank query. In this paper, we design a Monte Carlo based method, UniWalk, to enable fast top-$k$ SimRank computation over large undirected graphs. UniWalk directly locates the top-$k$ similar vertices for any single source vertex $u$ via $R$ sampling paths originating from $u$, which avoids selecting a candidate vertex set $\mathcal{C}$ and the subsequent $O(|\mathcal{C}|R)$ bidirectional sampling paths. We also devise a path enumeration strategy to improve the SimRank precision by using path probabilities instead of path frequencies when sampling, a space-efficient method to reduce intermediate results, and a path-sharing strategy to lower the redundant path sampling cost for multiple source vertices. Furthermore, we extend UniWalk to existing distributed graph processing frameworks to improve its scalability. We conduct extensive experiments to illustrate that UniWalk has high scalability and outperforms the state-of-the-art methods by orders of magnitude.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Subject: Computer Science
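For orientation, the pairwise Monte Carlo view of SimRank that the abstract above contrasts itself with can be sketched as follows: the similarity of $u$ and $v$ is estimated as the average of $c^{\tau}$, where $\tau$ is the step at which two independent random walks started at $u$ and $v$ first meet. This is the classic two-walk estimator, not UniWalk, which samples paths from a single source only. The decay factor, the truncation at `max_steps`, and the function name are assumptions of this sketch.

```python
import random


def simrank_mc(adj, u, v, c=0.6, num_samples=2000, max_steps=10, seed=0):
    """Estimate SimRank(u, v) on an undirected graph by paired random walks.

    adj maps each node to a list of its neighbours. Walks that have not
    met after max_steps contribute 0, so the estimate is truncated.
    """
    if u == v:
        return 1.0
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    total = 0.0
    for _ in range(num_samples):
        a, b = u, v
        for step in range(1, max_steps + 1):
            if not adj[a] or not adj[b]:
                break  # a walk is stuck at an isolated node
            a = rng.choice(adj[a])
            b = rng.choice(adj[b])
            if a == b:
                total += c ** step  # first meeting after `step` steps
                break
    return total / num_samples
```

On a star graph, two leaves always meet at the centre after one step, so the estimate equals the decay factor; the centre and a leaf swap sides on every step and never meet.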
  • 15
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication date: 2018-02-07
    Description: Given a sequence of snapshots of flu propagating over a population network, can we find a segmentation when the patterns of the disease spread change, possibly due to interventions? In this paper, we study the problem of segmenting graph sequences with labeled nodes. Memes on the Twitter network, diseases over a contact network, movie-cascades over a social network, etc. are all graph sequences with labeled nodes. Most related work on this subject is on plain graphs and hence ignores the label dynamics. Others require fixed parameters or feature engineering. We propose SnapNETS, to automatically find segmentations of such graph sequences, with different characteristics of nodes of each label in adjacent segments. It satisfies all the desired properties (being parameter-free, comprehensive and scalable) by leveraging a principled, multi-level, flexible framework which maps the problem to a path optimization problem over a weighted DAG. Also, we develop a parallel framework for SnapNETS which speeds up its running time. Finally, we propose an extension of SnapNETS to handle dynamic graph structures and use it to detect anomalies (and events) in network sequences. Extensive experiments on several diverse real datasets show that it finds cut points matching ground truth or meaningful external signals, and detects anomalies better than non-trivial baselines. We also show that the segmentations are easily interpretable, and that SnapNETS scales near-linearly with the size of the input. Finally, we show how to use SnapNETS to detect anomalies in a sequence of dynamic networks.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Subject: Computer Science
  • 16
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication date: 2018-02-07
    Description: A document network is an intriguing kind of dataset which can provide both topical (textual content) and topological (relational link) information. A key point in modeling such datasets is to discover proper denominators beneath the text and links. Most previous work introduces the assumption that documents closely linked with each other share common latent topics. However, the heterophily (i.e., tendency to link to different others) of nodes, which is pervasive in social networks, is neglected. In this paper, we simultaneously incorporate community detection and topic modeling in a unified framework, and appeal to Canonical Correlation Analysis (CCA) to capture the latent semantic correlations between the two heterogeneous factors, community and topic. Regardless of homophily (i.e., tendency to link to similar others) or heterophily, CCA can properly capture the inherent correlations which fit the dataset itself without any prior hypothesis. We also impose auxiliary word embeddings to improve the quality of topics. The effectiveness of our proposed model is comprehensively verified on three different types of datasets: hyperlinked networks of web pages, social networks of friends, and coauthor networks of publications. Experimental results show that our approach achieves significant improvements compared with the current state of the art.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Subject: Computer Science
  • 17
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication date: 2018-02-07
    Description: The classical K Shortest Paths (KSP) problem, which identifies the $k$ shortest paths in a directed graph, plays an important role in many application domains, such as providing alternative paths for vehicle routing services. However, the returned $k$ shortest paths may be highly similar, i.e., share a significant number of edges, thus adversely affecting service quality. In this paper, we formalize the K Shortest Paths with Diversity (KSPD) problem, which identifies the top-$k$ shortest paths such that the paths are dissimilar to each other and the total length of the paths is minimized. We first prove that the KSPD problem is NP-hard and then propose a generic greedy framework to solve the KSPD problem, in the sense that (1) it supports a wide variety of path similarity metrics which are widely adopted in the literature and (2) it is also able to efficiently solve the traditional KSP problem if no path similarity metric is specified. The core of the framework includes the use of two judiciously designed lower bounds, where one is dependent on and the other is independent of the chosen path similarity metric, which effectively reduces the search space and significantly improves efficiency. Empirical studies on five real-world and synthetic graphs and five different path similarity metrics offer insight into the design properties of the proposed general framework and offer evidence that the proposed lower bounds are effective.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Subject: Computer Science
  • 18
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication date: 2018-02-07
    Description: Today, many scientific data sets are open to the public. For their operators, it is important to know what the users are interested in. In this paper, we study the problem of extracting and analyzing patterns from the query log of a database. We focus on design errors (antipatterns), which typically lead to unnecessary SQL statements. Such antipatterns not only have a negative effect on performance; they also introduce bias into any subsequent analysis of the SQL log. We propose a framework designed to discover patterns and antipatterns in arbitrary SQL query logs and to clean antipatterns. To study the usefulness of our approach and to reveal insights regarding the existence of antipatterns in real-world systems, we examine the SQL log of the SkyServer project, containing more than 40 million queries. Among the top 15 patterns, we have found six antipatterns. This result, together with others, leads to the conclusion that antipatterns might falsify refactoring and any other downstream analyses.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Subject: Computer Science
  • 19
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication date: 2018-02-07
    Description: This paper proposes a new unsupervised spectral feature selection method to preserve both the local and global structure of the features as well as the samples. Specifically, our method uses the self-expressiveness of the features to represent each feature by other features for preserving the local structure of features, and a low-rank constraint on the weight matrix to preserve the global structure among samples as well as features. Our method also proposes to learn the graph matrix measuring the similarity of samples for preserving the local structure among samples. Furthermore, we propose a new optimization algorithm for the resulting objective function, which iteratively updates the graph matrix and the intrinsic space so that each collaboratively improves the other. Experimental analysis on 12 benchmark datasets showed that the proposed method outperformed the state-of-the-art feature selection methods in terms of classification performance.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Subject: Computer Science
  • 20
    Institute of Electrical and Electronics Engineers (IEEE)
    Publikationsdatum: 2018-02-07
    Beschreibung: Generalization is an effective technique for protecting confidential information of individuals, and numerous algorithms have been proposed for it. However, previous works do not separate the protection against identity disclosure from the protection against sensitive disclosure. Thus, when the requirement of attribute protection is higher than that of identity protection, generalization for $l$-diversity causes overprotection of identity and a large loss of information utility. This paper presents a novel approach, called cross-bucket generalization, to address this problem. The rationale is to divide microdata into equivalence groups and buckets. First, it provides separate protection for identity and sensitive values, and the level of protection can be flexibly adjusted based on actual demands. Second, the sizes of equivalence groups and buckets are kept as small as the protection requirements allow, which avoids overprotection of identity and reduces information loss. The experiments we conducted illustrate the effectiveness of our solution.
    Print ISSN: 1041-4347
    Digitale ISSN: 1558-2191
    Thema: Informatik
  • 21
    Institute of Electrical and Electronics Engineers (IEEE)
    Publikationsdatum: 2018-02-07
    Beschreibung: Joint clustering of multiple networks has been shown to be more accurate than performing clustering on individual networks separately. This is because multi-network clustering algorithms typically assume there is a common clustering structure shared by all networks, and different networks can provide compatible and complementary information for uncovering this underlying clustering structure. However, this assumption is too strict to hold in many emerging applications, where multiple networks usually have diverse data distributions. More commonly, the networks under consideration belong to different underlying groups. Only networks in the same underlying group share similar clustering structures. Better clustering performance can be achieved by considering such groups differently. As a result, an ideal method should be able to automatically detect network groups so that networks in the same group share a common clustering structure. To address this problem, we propose a new method, ComClus, to simultaneously group and cluster multiple networks. ComClus is novel in combining the clustering approach of non-negative matrix factorization (NMF) and the feature subspace learning approach of metric learning. Specifically, it treats node clusters as features of networks and learns proper subspaces from such features to differentiate different network groups. During the learning process, the two procedures of network grouping and clustering are coupled and mutually enhanced. Moreover, ComClus can effectively leverage prior knowledge on how to group networks such that network grouping can be conducted in a semi-supervised manner. This will enable users to guide the grouping process using domain knowledge so that network clustering accuracy can be further boosted. Extensive experimental evaluations on a variety of synthetic and real datasets demonstrate the effectiveness and scalability of the proposed method.
    Print ISSN: 1041-4347
    Digitale ISSN: 1558-2191
    Thema: Informatik
  • 22
    Institute of Electrical and Electronics Engineers (IEEE)
    Publikationsdatum: 2018-02-07
    Beschreibung: Utilizing large-scale GPS data to improve taxi services has become a popular research problem in the areas of data mining, intelligent transportation, geographical information systems, and the Internet of Things. In this paper, we utilize a large-scale GPS data set generated by over 7,000 taxis in a period of one month in Nanjing, China, and propose TaxiRec: a framework for evaluating and discovering the passenger-finding potentials of road clusters, which is incorporated into a recommender system for taxi drivers to seek passengers. In TaxiRec, the underlying road network is first segmented into a number of road clusters, a set of features for each road cluster is extracted from real-life data sets, and then a ranking-based extreme learning machine (ELM) model is proposed to evaluate the passenger-finding potential of each road cluster. In addition, TaxiRec can use this model with a training cluster selection algorithm to provide road cluster recommendations when taxi trajectory data is incomplete or unavailable. Experimental results demonstrate the feasibility and effectiveness of TaxiRec.
    Print ISSN: 1041-4347
    Digitale ISSN: 1558-2191
    Thema: Informatik
  • 23
    Institute of Electrical and Electronics Engineers (IEEE)
    Publikationsdatum: 2018-02-07
    Beschreibung: Non-negative matrix factorization (NMF) is the problem of determining two non-negative low-rank factors $\mathbf{W}$ and $\mathbf{H}$, for the given input matrix $\mathbf{A}$, such that $\mathbf{A}\approx \mathbf{W}\mathbf{H}$. NMF is a useful tool for many applications in different domains such as topic modeling in text mining, background separation in video analysis, and community detection in social networks. Despite its popularity in the data mining community, there is a lack of efficient parallel algorithms to solve the problem for big data sets. The main contribution of this work is a new, high-performance parallel computational framework for a broad class of NMF algorithms that iteratively solves alternating non-negative least squares (NLS) subproblems for $\mathbf{W}$ and $\mathbf{H}$. It maintains the data and factor matrices in memory (distributed across processors), uses MPI for interprocessor communication, and, in the dense case, provably minimizes communication costs (under mild assumptions). The framework is flexible and able to leverage a variety of NMF and NLS algorithms, including Multiplicative Update, Hierarchical Alternating Least Squares, and Block Principal Pivoting. Our implementation allows us to benchmark and compare different algorithms on massive dense and sparse data matrices of sizes ranging from a few hundred million to billions. We demonstrate the scalability of our algorithm and compare it with baseline implementations, showing significant performance improvements. The code and the datasets used for conducting the experiments are available online.
    Print ISSN: 1041-4347
    Digitale ISSN: 1558-2191
    Thema: Informatik
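For orientation, the Multiplicative Update algorithm named in the abstract above is classically written as the pair of update rules below. This is only a sketch: the squared Frobenius-norm objective and the dimensions $m$, $n$, $k$ are assumptions, since the abstract states neither, and the framework described in the paper also covers other NLS solvers.

```latex
% Assumed objective: squared Frobenius norm with non-negativity constraints.
\[
\min_{\mathbf{W}\ge 0,\ \mathbf{H}\ge 0}\ \lVert \mathbf{A}-\mathbf{W}\mathbf{H}\rVert_F^2,
\qquad
\mathbf{A}\in\mathbb{R}_{\ge 0}^{m\times n},\quad
\mathbf{W}\in\mathbb{R}_{\ge 0}^{m\times k},\quad
\mathbf{H}\in\mathbb{R}_{\ge 0}^{k\times n}
\]
% Classical multiplicative updates; \odot and the fractions are element-wise.
\[
\mathbf{H} \leftarrow \mathbf{H}\odot\frac{\mathbf{W}^{\top}\mathbf{A}}{\mathbf{W}^{\top}\mathbf{W}\mathbf{H}},
\qquad
\mathbf{W} \leftarrow \mathbf{W}\odot\frac{\mathbf{A}\mathbf{H}^{\top}}{\mathbf{W}\mathbf{H}\mathbf{H}^{\top}}
\]
```

Each update alternates between the two factors, which matches the alternating structure the abstract describes for $\mathbf{W}$ and $\mathbf{H}$.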
  • 24
    Institute of Electrical and Electronics Engineers (IEEE)
    Publikationsdatum: 2018-02-07
    Beschreibung: Computing shortest distances is a central task in many domains. The growing number of applications dealing with dynamic graphs calls for incremental algorithms, as it is impractical to recompute shortest distances from scratch every time updates occur. In this paper, we address the problem of maintaining all-pairs shortest distances in dynamic graphs. We propose efficient incremental algorithms to process sequences of edge deletions/insertions/updates and vertex deletions/insertions. The proposed approach relies on some general operators that can be easily “instantiated” both in main memory and on top of different underlying DBMSs. We provide complexity analyses of the proposed algorithms. Experimental results on several real-world datasets show that current main-memory algorithms soon become impractical, that disk-based ones are needed for larger graphs, and that our approach significantly outperforms state-of-the-art algorithms.
    Print ISSN: 1041-4347
    Digitale ISSN: 1558-2191
    Thema: Informatik
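To illustrate why incremental maintenance can beat recomputation, the following minimal sketch updates an all-pairs distance matrix after a single edge insertion. It is a textbook technique, not the operators proposed in the paper, and it assumes a directed graph with non-negative edge weights; the function names `all_pairs` and `insert_edge` are hypothetical.

```python
import math


def all_pairs(n, edges):
    """Floyd-Warshall from scratch: O(n^3). edges are (u, v, weight) triples."""
    dist = [[0 if i == j else math.inf for j in range(n)] for i in range(n)]
    for u, v, w in edges:
        dist[u][v] = min(dist[u][v], w)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist


def insert_edge(dist, u, v, w):
    """Update dist in place after inserting edge u -> v with weight w: O(n^2).

    Assumes non-negative weights, so a shortest path uses the new edge
    at most once and has the form i -> ... -> u -> v -> ... -> j.
    """
    n = len(dist)
    if w >= dist[u][v]:
        return  # the new edge cannot shorten any path
    for i in range(n):
        for j in range(n):
            candidate = dist[i][u] + w + dist[v][j]
            if candidate < dist[i][j]:
                dist[i][j] = candidate


edges = [(0, 1, 5), (1, 2, 5)]
dist = all_pairs(3, edges)
insert_edge(dist, 0, 2, 3)   # shortcut from vertex 0 to vertex 2
print(dist[0][2])
```

Insertions are the easy case; deletions can lengthen many paths at once, which is one reason dedicated incremental algorithms such as those in the paper are needed.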
  • 25
    Institute of Electrical and Electronics Engineers (IEEE)
    Publikationsdatum: 2018-02-07
    Beschreibung: Privacy concerns in data sharing, especially for health data, are gaining increasing attention. Some patients now agree to open their information for research use, which gives rise to a new question of how to effectively use the public information to better understand the private dataset without breaching privacy. In this paper, we specialize this question as selecting an optimal subset of the public dataset for M-estimators in the framework of differential privacy (DP) in [1]. From a perspective of non-interactive learning, we first construct the weighted private density estimation from the hybrid datasets under DP. Along the same line as [2], we analyze the accuracy of the DP M-estimators based on the hybrid datasets. Our main contributions are as follows: (i) we find that the bias-variance tradeoff in the performance of our M-estimators can be characterized in terms of the sample size of the released dataset; (ii) based on this finding, we develop an algorithm to select the optimal subset of the public dataset to release under DP. Our simulation studies and application to real datasets confirm our findings and provide a guideline for real applications.
    Print ISSN: 1041-4347
    Digitale ISSN: 1558-2191
    Thema: Informatik
  • 26
    Institute of Electrical and Electronics Engineers (IEEE)
    Publikationsdatum: 2018-02-07
    Beschreibung: Episode Rule Mining is a popular framework for discovering sequential rules from event sequential data. However, traditional episode rule mining methods only tell that the consequent event is likely to happen within a given time interval after the occurrence of the antecedent events. As a result, they cannot satisfy the requirement of many time sensitive applications, such as program security trading and intelligent transportation management due to the lack of fine-grained response time. In this study, we come up with the concept of fixed-gap episode to address this problem. A fixed-gap episode consists of an ordered set of events where the elapsed time between any two consecutive events is a constant. Based on this concept, we formulate the problem of mining precise-positioning episode rules in which the occurrence time of each event in the consequent is clearly specified. In addition, we develop a trie-based data structure to mine such precise-positioning episode rules with several pruning strategies incorporated for improving the performance as well as reducing memory consumption. Experimental results on real datasets show the superiority of our proposed algorithms.
    Print ISSN: 1041-4347
    Digitale ISSN: 1558-2191
    Thema: Informatik
  • 27
    Institute of Electrical and Electronics Engineers (IEEE)
    Publikationsdatum: 2018-02-07
    Beschreibung: The problem of deriving lower and upper bounds for the edit distance between undirected, labeled graphs has recently received increasing attention. However, only one algorithm has been proposed that allegedly computes not only an upper but also a lower bound for non-uniform edit costs and incorporates information about both node and edge labels. In this paper, we demonstrate that this algorithm is incorrect. We present a corrected version $\mathsf{B\scriptstyle{RANCH}}$ that runs in $\mathcal{O}(n^2\Delta^3+n^3)$ time, where $\Delta$ is the maximum of the maximum degrees of input graphs $G$ and $H$. We also develop a speed-up $\mathsf{B\scriptstyle{RANCH}}\mathsf{F\scriptstyle{AST}}$ that runs in $\mathcal{O}(n^2\Delta^2+n^3)$ time and computes an only slightly less accurate lower bound. The lower bounds produced by $\mathsf{B\scriptstyle{RANCH}}$ and $\mathsf{B\scriptstyle{RANCH}}\mathsf{F\scriptstyle{AST}}$ are shown to be pseudo-metrics on a collection of graphs. Finally, we suggest an anytime algorithm $\mathsf{B\scriptstyle{RANCH}}\mathsf{T\scriptstyle{IGHT}}$ that iteratively improves $\mathsf{B\scriptstyle{RANCH}}$’s lower bound. $\mathsf{B\scriptstyle{RANCH}}\mathsf{T\scriptstyle{IGHT}}$ runs in $\mathcal{O}(n^3\Delta^2+I(n^2\Delta^3+n^3))$ time, where the number of iterations $I$ is controlled by the user. A detailed experimental evaluation shows that all suggested algorithms are Pareto optimal, that they are very effective when used as filters for edit distance range queries, and that they perform excellently when used within classification frameworks.
    Print ISSN: 1041-4347
    Digitale ISSN: 1558-2191
    Thema: Informatik
  • 28
    Institute of Electrical and Electronics Engineers (IEEE)
    Publikationsdatum: 2018-02-07
    Beschreibung: Clustering of customer transaction data is an important procedure to analyze customer behaviors in retail and e-commerce companies. Note that products from companies are often organized as a product tree, in which the leaf nodes are goods to sell, and the internal nodes (except root node) could be multiple product categories. Based on this tree, we propose the “personalized product tree”, named purchase tree, to represent a customer’s transaction records. So the customers’ transaction data set can be compressed into a set of purchase trees. We propose a partitional clustering algorithm, named PurTreeClust, for fast clustering of purchase trees. A new distance metric is proposed to effectively compute the distance between two purchase trees. To cluster the purchase tree data, we first rank the purchase trees as candidate representative trees with a novel separate density, and then select the top $k$ customers as the representatives of $k$ customer groups. Finally, the clustering results are obtained by assigning each customer to the nearest representative. We also propose a gap statistic based method to evaluate the number of clusters. A series of experiments were conducted on ten real-life transaction data sets, and experimental results show the superior performance of the proposed method.
    Print ISSN: 1041-4347
    Digitale ISSN: 1558-2191
    Thema: Informatik
  • 29
    Institute of Electrical and Electronics Engineers (IEEE)
    Publikationsdatum: 2018-02-07
    Beschreibung: Database provenance explains how results are derived by queries. However, many use cases such as auditing and debugging of transactions require understanding of how the current state of a database was derived by a transactional history. We present MV-semirings, a provenance model for queries and transactional histories that supports two common multi-version concurrency control protocols: snapshot isolation (SI) and read committed snapshot isolation (RC-SI). Furthermore, we introduce an approach for retroactively capturing such provenance using reenactment, a novel technique for replaying a transactional history with provenance capture. Reenactment exploits the time travel and audit logging capabilities of modern DBMS to replay parts of a transactional history using queries. Importantly, our technique requires no changes to the transactional workload or underlying DBMS and results in only moderate runtime overhead for transactions. We have implemented our approach on top of a commercial DBMS and our experiments confirm that by applying novel optimizations we can efficiently capture provenance for complex transactions over large data sets.
    Print ISSN: 1041-4347
    Digitale ISSN: 1558-2191
    Thema: Informatik
  • 30
    Institute of Electrical and Electronics Engineers (IEEE)
    Publikationsdatum: 2018-03-09
    Beschreibung: Recording and querying time-stamped trajectories incurs high data storage and computing costs. In this paper, we explore several characteristics of the trajectories in road networks, which have motivated the idea of coding trajectories by associating timestamps with relative spatial path and locations. Such a representation contains a large amount of duplicate information and thus achieves a lower entropy than the existing representations, thereby drastically cutting the storage cost. We propose several techniques to compress spatial path and locations separately, which can support fast positioning and achieve a better compression ratio. For locations, we propose two novel encoding schemes such that the binary code can preserve distance information, which is very helpful for LBS applications. In addition, an unresolved question in this area is whether it is possible to perform a search directly on the compressed trajectories, and if the answer is yes, then how. Here, we show that directly querying compressed trajectories based on our encoding scheme is possible and can be done efficiently. We design a set of primitive operations for this purpose, and propose index structures to reduce query response time. We demonstrate the advantage of our method and compare it against existing ones through a thorough experimental study on real trajectories in road networks.
    Print ISSN: 1041-4347
    Digitale ISSN: 1558-2191
    Thema: Informatik
  • 31
    Institute of Electrical and Electronics Engineers (IEEE)
    Publikationsdatum: 2018-03-09
    Beschreibung: Real-time bidding (RTB) based display advertising has become one of the key technological advances in computational advertising. RTB enables advertisers to buy individual ad impressions via an auction in real-time and facilitates the evaluation and the bidding of individual impressions across multiple advertisers. In RTB, the advertisers face three main challenges when optimizing their bidding strategies, namely (i) estimating the utility (e.g., conversions, clicks) of the ad impression, (ii) forecasting the market value (thus the cost) of the given ad impression, and (iii) deciding the optimal bid for the given auction based on the first two. Previous solutions assume the first two are solved before addressing the bid optimization problem. However, these challenges are strongly correlated and dealing with any individual problem independently may not be globally optimal. In this paper, we propose Bidding Machine, a comprehensive learning-to-bid framework, which consists of three optimizers dealing with each challenge above, and as a whole, jointly optimizes these three parts. We show that such a joint optimization would largely increase the campaign effectiveness and the profit. From the learning perspective, we show that the bidding machine can be updated smoothly with either offline periodical batch or online sequential training schemes. Our extensive offline empirical study and online A/B testing verify the high effectiveness of the proposed bidding machine.
    Print ISSN: 1041-4347
    Digitale ISSN: 1558-2191
    Thema: Informatik
  • 32
    Institute of Electrical and Electronics Engineers (IEEE)
    Publikationsdatum: 2018-03-09
    Beschreibung: The skyline of a data point set is made up of the best points in the set, and is very important for multi-criteria decision making. In recent years, the skyline problem has attracted increasing attention, and many variants of the traditional skyline have emerged in the database field. One recent and important variant is the group-based skyline, which aims to find the best groups of points in a given set. In this paper, we bring forward an efficient approach, called minimum dominance search (MDS), to solve the g-skyline problem, a recent group-based skyline problem. MDS consists of two steps: In the first step, we construct a novel g-skyline support structure, i.e., minimum dominance graph (MDG), which proves to be a minimum g-skyline support structure. In the second step, we search for g-skyline groups based on the MDG through two searching algorithms, and a skyline-combination based optimization strategy is employed to improve these two algorithms. We conduct comprehensive experiments on both synthetic and real-world data sets, and show that our algorithms are orders of magnitude faster than the state-of-the-art in most cases.
    Print ISSN: 1041-4347
    Digitale ISSN: 1558-2191
    Thema: Informatik
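As background for the abstract above, the traditional (point-based) skyline can be sketched in a few lines. This is not the g-skyline or the MDS approach from the paper; it assumes that smaller values are better in every dimension, and the names `dominates`, `skyline`, and `hotels` are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, ...]


def dominates(p: Point, q: Point) -> bool:
    """p dominates q if p is no worse in every dimension and strictly better in one."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def skyline(points: List[Point]) -> List[Point]:
    """Return the points that no other point dominates (naive O(n^2) scan)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical data: (price, distance to the beach); lower is better for both.
hotels = [(50, 8), (60, 3), (80, 2), (70, 5), (90, 9)]
print(skyline(hotels))  # [(50, 8), (60, 3), (80, 2)]
```

Here (70, 5) is dominated by (60, 3) and (90, 9) by (50, 8), so neither is on the skyline. A group-based skyline applies an analogous dominance test to groups of points rather than to single points.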
  • 33
    Institute of Electrical and Electronics Engineers (IEEE)
    Publikationsdatum: 2018-03-09
    Beschreibung: Modern networks are very large in size and also evolve with time. As their sizes grow, the complexity of performing network analysis grows as well. Getting a smaller representation of a temporal network with similar properties will help in various data mining tasks. In this paper, we study the novel problem of getting a smaller diffusion-equivalent representation of a set of time-evolving networks. We first formulate a well-founded and general temporal-network condensation problem based on the so-called system-matrix of the network. We then propose NetCondense, a scalable and effective algorithm which solves this problem using careful transformations with sub-quadratic running time and linear space complexity. Our extensive experiments show that we can reduce the size of large real temporal networks (from multiple domains such as social, co-authorship, and email) significantly without much loss of information. We also show the wide applicability of NetCondense by leveraging it for several tasks: for example, we use it to understand, explore, and visualize the original datasets and also to speed up algorithms for the influence-maximization and event detection problems on temporal networks.
    Print ISSN: 1041-4347
    Digitale ISSN: 1558-2191
    Thema: Informatik
  • 34
    Institute of Electrical and Electronics Engineers (IEEE)
    Publikationsdatum: 2018-03-09
    Beschreibung: Cloud-based data-intensive applications have to process high volumes of transactional and analytical requests on large-scale data. Businesses base their decisions on the results of analytical requests, creating a need for real-time analytical processing. We propose Janus, a hybrid scalable cloud datastore, which enables the efficient execution of diverse workloads by storing data in different representations. Janus manages big datasets in the context of datacenters, thus supporting scaling out by partitioning the data across multiple servers. This requires Janus to efficiently support distributed transactions. In order to support the different datacenter requirements, Janus also allows diverse partitioning strategies for the different representations. Janus proposes a novel data movement pipeline to continuously ensure up to date data between the different representations. Unlike existing multi-representation storage systems and Change Data Capture (CDC) pipelines, the data movement pipeline in Janus supports partitioning and handles both distributed transactions and diverse partitioning strategies. In this paper, we focus on supporting Online Transaction Processing (OLTP) and Online Analytical Processing (OLAP) workloads, and hence use row and column-oriented representations, which are the most efficient representations for these workloads. Our evaluations over Amazon AWS illustrate that Janus can provide real-time analytical results, in addition to processing high-throughput transactional workloads.
    Print ISSN: 1041-4347
    Digitale ISSN: 1558-2191
    Thema: Informatik
  • 35
    Institute of Electrical and Electronics Engineers (IEEE)
    Publikationsdatum: 2018-03-09
    Beschreibung: In this paper, we propose a class of dynamic conditional Gaussian graphical models (DCGGMs) based on a set of non-identically distributed observations that change smoothly with time or condition. Specifically, the DCGGMs model the dynamic output network influenced by conditioning input variables, which are encoded by a set of varying parameters. Moreover, we propose a joint smooth graphical Lasso to estimate the DCGGMs, which combines a kernel smoother with a sparse group Lasso penalty. At the same time, we design an efficient accelerated proximal gradient algorithm to solve this estimator. Theoretically, we establish the asymptotic properties of our model on consistency and sparsistency under high-dimensional settings. In particular, we highlight a class of consistency theory for dynamic graphical models, in which the sample size can be seen as $n^{4/5}$ for estimating a local graphical model when the bandwidth parameter $h$ of the kernel smoother is chosen as $h \asymp n^{-1/5}$ for describing the dynamics. Finally, extensive numerical experiments on both synthetic and real datasets are provided to support the effectiveness of the proposed method.
    Print ISSN: 1041-4347
    Digitale ISSN: 1558-2191
    Thema: Informatik
  • 36
    Institute of Electrical and Electronics Engineers (IEEE)
    Publikationsdatum: 2018-03-09
    Beschreibung: In this paper, we study the community deception problem. Tackling this problem consists in developing techniques to hide a target community ($\mathcal{C}$) from community detection algorithms. This need emerges whenever a group (e.g., activists, police enforcements, or network participants in general) wants to observe and cooperate in a social network while avoiding detection. We introduce and formalize the community deception problem and devise an efficient algorithm that achieves deception by identifying a certain number ($\beta$) of $\mathcal{C}$’s members’ connections to be rewired. Deception can be practically achieved in social networks like Facebook by friending or unfriending network members as indicated by our algorithm. We compare our approach with another technique based on modularity. By considering a variety of (large) real networks, we provide a systematic evaluation of the robustness of community detection algorithms to deception techniques. Finally, we open some challenging research questions about the design of detection algorithms robust to deception techniques.
    Print ISSN: 1041-4347
    Digitale ISSN: 1558-2191
    Thema: Informatik
  • 37
    Institute of Electrical and Electronics Engineers (IEEE)
    Publikationsdatum: 2018-03-09
    Beschreibung: GPS enables mobile devices to continuously provide new opportunities to improve our daily lives. For example, the data collected in applications created by Uber or Public Transport Authorities can be used to plan transportation routes, estimate capacities, and proactively identify low coverage areas. In this paper, we study a new kind of query, Reverse $k$ Nearest Neighbor Search over Trajectories ($\mathbf{R}k\mathbf{NNT}$), which can be used for route planning and capacity estimation. Given a set of existing routes $\mathcal{D}_{\mathcal{R}}$, a set of passenger transitions $\mathcal{D}_{\mathcal{T}}$, and a query route $Q$, an $\mathbf{R}k\mathbf{NNT}$ query returns all transitions that take $Q$ as one of their $k$ nearest travel routes. To solve the problem, we first develop an index to handle dynamic trajectory updates, so that the most up-to-date transition data are available for answering an $\mathbf{R}k\mathbf{NNT}$ query. Then we introduce a filter refinement framework for processing $\mathbf{R}k\mathbf{NNT}$ queries using the proposed indexes. Next, we show how to use $\mathbf{R}k\mathbf{NNT}$ to solve the optimal route planning problem $\mathbf{MaxR}k\mathbf{NNT}$ ($\mathbf{MinR}k\mathbf{NNT}$), which is to search for the optimal route from a start location to an end location that could attract the maximum (or minimum) number of passengers based on a predefined travel distance threshold. Experiments on real datasets demonstrate the efficiency and scalability of our approaches. To the best of our knowledge, this is the first work to study the $\mathbf{R}k\mathbf{NNT}$ problem for route planning.
    Print ISSN: 1041-4347
    Digitale ISSN: 1558-2191
    Thema: Informatik
  • 38
    Institute of Electrical and Electronics Engineers (IEEE)
    Publikationsdatum: 2018-03-09
    Beschreibung: Finding the shortest path in road networks has become one of the important issues in location based services (LBS). The problem of finding the optimal meeting point for a group of users has also been well studied in existing works. In this paper, we investigate a new problem for two users. Each user has his/her own source and destination. However, whether they will meet before going to their destinations is uncertain. We model it as a minimum path pair (MPP) query, which consists of two pairs of source and destination and a user-specified weight $\alpha$ to balance the two different needs. The result is a pair of paths connecting the two sources and destinations respectively, with minimal overall cost of the two paths and the shortest route between them. To solve MPP queries, we devise algorithms by enumerating node pairs. We adopt a location-based pruning strategy to reduce the number of node pairs for enumeration. An efficient algorithm based on point-to-point shortest path calculation is proposed to further improve query efficiency. We also give two fast approximate algorithms with approximation bounds. Extensive experiments are conducted to show the effectiveness and efficiency of our methods.
    Print ISSN: 1041-4347
    Digitale ISSN: 1558-2191
    Thema: Informatik
  • 39
    Institute of Electrical and Electronics Engineers (IEEE)
    Publikationsdatum: 2018-03-09
    Beschreibung: We propose topic models for unsupervised cluster matching, which is the task of finding matchings between clusters in different domains without correspondence information. For example, the proposed model finds correspondence between document clusters in English and German without alignment information, such as dictionaries and parallel sentences/documents. The proposed model assumes that documents in all languages have a common latent topic structure, and that there is a potentially infinite number of topic proportion vectors in a latent topic space that is shared by all languages. Each document is generated using one of the topic proportion vectors and language-specific word distributions. By inferring the topic proportion vector used for each document, we can allocate documents in different languages into common clusters, where each cluster is associated with a topic proportion vector. Documents assigned to the same cluster are considered to be matched. We develop an efficient inference procedure for the proposed model based on collapsed Gibbs sampling. The effectiveness of the proposed model is demonstrated with real data sets including multilingual corpora of Wikipedia and product reviews.
    Print ISSN: 1041-4347
    Digital ISSN: 1558-2191
    Subject: Computer science
  • 40
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication date: 2018-03-09
    Description: Fuzzy Petri nets (FPNs) are a vital modeling technique for the construction of knowledge-based systems and have been commonly used in many fields, such as fault diagnosis, risk assessment, workflow management, and disassembly process planning. However, conventional FPNs have been criticized for the following reasons: 1) the representation parameters in FPNs cannot precisely model experts' experience, since it is difficult to manage the fuzziness and randomness of knowledge assessments simultaneously, and 2) the weight coefficients in the existing approximate reasoning algorithms are insufficient to reflect the associated weights of reordered places. In response, we propose a new type of FPN, called cloud reasoning Petri nets (CRPNs), based on the concept of interval clouds and the hybrid averaging operator. The cloud production rules in a knowledge-based system are modeled by CRPNs, where the truth degrees of places, the certainty factors of rules, and the thresholds of transitions are represented by interval clouds. Moreover, a matrix operation-based reasoning algorithm is proposed to improve the efficiency of calculating final truth degrees, in which both local and ordered weight coefficients are taken into consideration. Finally, a practical example concerning a power system is provided to demonstrate the usefulness and advantages of the proposed CRPN model.
    Print ISSN: 1041-4347
    Digital ISSN: 1558-2191
    Subject: Computer science
  • 41
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication date: 2018-03-09
    Description: With recent advances in data-as-a-service (DaaS) and cloud computing, aggregate query services over set-valued data are becoming widely available for business intelligence that drives decision making. However, as the service provider is often a third-party delegate of the data owner, the integrity of the query results cannot be guaranteed and must therefore be authenticated. Unfortunately, existing query authentication techniques either do not work for set-valued data or lack data confidentiality. In this paper, we propose authenticated aggregate queries over set-valued data that not only ensure the integrity of query results but also preserve the confidentiality of source data. As many aggregate queries are composed of multiset operations such as set union and subset, we first develop a family of privacy-preserving authentication protocols for primitive multiset operations. Using these protocols as building blocks, we present a privacy-preserving authentication framework for various aggregate queries and further optimize their authentication performance. Security analysis and empirical evaluation show that our proposed privacy-preserving authentication techniques are feasible and robust under a wide range of system workloads.
    Print ISSN: 1041-4347
    Digital ISSN: 1558-2191
    Subject: Computer science
  • 42
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication date: 2018-03-09
    Description: A multitude of contemporary applications heavily involve graph data whose size appears to be ever-increasing. This trend shows no signs of subsiding and has caused the emergence of a number of distributed graph processing systems, including Pregel, Apache Giraph, and GraphX. However, the unprecedented scale now reached by real-world graphs makes graph processing harder because of excessive memory demands, even in distributed environments. By and large, such contemporary graph processing systems employ ineffective in-memory representations of adjacency lists. Therefore, memory usage patterns emerge as a primary concern in distributed graph processing. We seek to address this challenge by exploiting empirically observed properties of graphs generated by human activity. In this paper, we propose 1) three compressed adjacency list representations that can be applied to any distributed graph processing system, 2) a variable-byte encoded representation of out-edge weights for space-efficient support of weighted graphs, and 3) a tree-based compact out-edge representation that allows for efficient mutations on the graph elements. We experiment with publicly available graphs whose size reaches two billion edges and report our findings in terms of both space efficiency and execution time. Our suggested compact representations reduce the memory required to accommodate the graph elements by up to 5 times compared with state-of-the-art methods. At the same time, our memory-optimized methods retain the efficiency of uncompressed structures and enable the execution of algorithms for large-scale graphs in settings where contemporary alternative structures fail due to memory errors.
    Print ISSN: 1041-4347
    Digital ISSN: 1558-2191
    Subject: Computer science
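The variable-byte encoding of out-edge weights mentioned in the abstract above can be illustrated with a generic sketch. This is the textbook scheme (7 payload bits per byte plus a continuation bit), not necessarily the paper's exact layout; the function names and the assumption of non-negative integer weights are illustrative only.

```python
from typing import Iterable, List


def encode_varbyte(n: int) -> bytes:
    """Encode one non-negative integer, 7 payload bits per byte.

    The high bit of each byte is a continuation flag: 1 means that
    more bytes follow, 0 marks the last byte of the value.
    """
    if n < 0:
        raise ValueError("only non-negative integers are supported")
    out = bytearray()
    while True:
        low = n & 0x7F
        n >>= 7
        if n:
            out.append(low | 0x80)  # more bytes follow
        else:
            out.append(low)  # final byte
            return bytes(out)


def encode_weights(weights: Iterable[int]) -> bytes:
    """Concatenate the variable-byte codes of all (hypothetical) edge weights."""
    return b"".join(encode_varbyte(w) for w in weights)


def decode_weights(data: bytes) -> List[int]:
    """Decode a byte string produced by encode_weights."""
    values: List[int] = []
    current, shift = 0, 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7  # value continues in the next byte
        else:
            values.append(current)
            current, shift = 0, 0
    return values
```

With this sketch, the five weights `[1, 127, 128, 300, 70000]` occupy 9 bytes instead of the 20 bytes that fixed 4-byte integers would need; small weights, which dominate many real graphs, cost a single byte each.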
  • 43
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication date: 2018-03-09
    Description: With the continued proliferation of location-based services, a growing number of web-accessible data objects are geo-tagged and have text descriptions. An important query over such web objects is the direction-aware spatial keyword query, which aims to retrieve the top-$k$ objects that best match the query parameters in terms of spatial distance and textual similarity in a given query direction. In some cases, it can be difficult for users to specify appropriate query parameters. After getting a query result, users may find that some desired objects are unexpectedly missing and may therefore question the entire result. Enabling why-not questions in this setting may help users retrieve better results, thus improving the overall utility of the query functionality. This paper studies the direction-aware why-not spatial keyword top-$k$ query problem. We propose efficient query refinement techniques to revive missing objects by minimally modifying users' direction-aware queries. We prove that the best refined query directions lie in a finite solution space for a special case and reduce the search for the optimal refinement to a linear programming problem for the general case. Extensive experimental studies demonstrate that the proposed techniques outperform a baseline method by two orders of magnitude and are robust in a broad range of settings.
    Print ISSN: 1041-4347
    Digital ISSN: 1558-2191
    Subject: Computer science
  • 44
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication date: 2018-01-12
    Description: Ride-sharing (RS) has great value in saving energy and alleviating traffic pressure. Existing studies can be improved for better efficiency. Therefore, we propose a new ride-sharing model, where each driver requires that, if the driver shares a ride with a rider, the shared route percentage (i.e., the ratio of the shared route's distance to the driver's total travel distance) exceeds an expectation rate of the driver, e.g., 0.8. We consider two variants of this problem. The first considers multiple drivers and multiple riders and aims to compute driver-rider pairs that maximize the overall shared route percentage (SRP). We model this problem as the maximum weighted bigraph matching problem, where the vertices are drivers and riders, the edges are driver-rider pairs, and the edge weights are the SRPs of the driver-rider pairs. However, it is rather expensive to compute the SRP values for large numbers of driver-rider pairs on road networks. To address this problem, we propose an efficient method to prune many unnecessary driver-rider pairs and avoid computing the SRP values for every pair. To improve the efficiency, we propose an approximate method with an error bound guarantee. The basic idea is that we compute an upper bound and a lower bound for each driver-rider pair in constant time. Then, we estimate an upper bound and a lower bound of the graph matching. Next, we select some driver-rider pairs, compute their real shortest-route distance, and update the lower and upper bounds of the maximum graph matching. We repeat the above steps until the ratio of the upper bound to the lower bound is not larger than a given approximate rate. The second variant considers multiple drivers and a single rider and aims to find the top-$k$ drivers for the rider with the largest SRP. We first prune a large number of drivers that cannot meet the SRP requirements. Then, we propose a best-first algorithm that progressively selects the drivers with a high probability of being in the top-$k$ results and prunes the drivers that cannot be in the top-$k$ results. Extensive experiments on real-world datasets demonstrate the superiority of our method.
    Print ISSN: 1041-4347
    Digital ISSN: 1558-2191
    Subject: Computer science
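The shared route percentage (SRP) defined in the abstract above, together with the single-rider top-$k$ variant, can be sketched as follows. This is a naive baseline that assumes the shared and total distances are already known; it does not reproduce the paper's pruning or best-first search. The `Offer` record and all field names are hypothetical.

```python
import heapq
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Offer:
    """Hypothetical record for one driver with respect to a fixed rider."""
    driver: str
    shared_km: float       # length of the route shared with the rider
    total_km: float        # driver's total travel distance
    expectation: float = 0.8  # driver's expectation rate (0.8 as in the abstract)


def shared_route_percentage(shared_km: float, total_km: float) -> float:
    """SRP = shared route distance / driver's total travel distance."""
    if total_km <= 0:
        raise ValueError("total distance must be positive")
    return shared_km / total_km


def top_k_drivers(offers: Iterable[Offer], k: int) -> List[Tuple[str, float]]:
    """Return the k feasible drivers with the largest SRP, best first.

    A driver is feasible only if the SRP strictly exceeds the
    driver's own expectation rate.
    """
    feasible = []
    for offer in offers:
        srp = shared_route_percentage(offer.shared_km, offer.total_km)
        if srp > offer.expectation:
            feasible.append((offer.driver, srp))
    return heapq.nlargest(k, feasible, key=lambda pair: pair[1])
```

For example, given drivers with shared/total distances of 9/10, 8/10, 19/20, and 5/10 km and the default expectation rate, `top_k_drivers(..., 2)` keeps only the first and third drivers, because an SRP of exactly 0.8 does not exceed the expectation rate.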
  • 45
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication date: 2018-01-12
    Description: Aspect-based opinion mining is the task of finding elaborate opinions towards a subject such as a product or an event. With the explosive growth of opinionated texts on the Web, mining aspect-level opinions has become a promising means for online public opinion analysis. In particular, the boom of various types of online media provides diverse yet complementary information, bringing unprecedented opportunities for cross-media aspect-opinion mining. Along this line, we propose CAMEL, a novel topic model for complementary aspect-based opinion mining across asymmetric collections. CAMEL gains information complementarity by modeling both common and specific aspects across collections, while keeping all the corresponding opinions for contrastive study. An auto-labeling scheme called AME is also proposed to help discriminate between aspect and opinion words without elaborate human labeling, which is further enhanced by adding word embedding-based similarity as a new feature. Moreover, CAMEL-DP, a nonparametric alternative to CAMEL, is also proposed based on coupled Dirichlet Processes. Extensive experiments on real-world multi-collection review data demonstrate the superiority of our methods over competitive baselines. This is particularly true when the information shared by different collections becomes seriously fragmented. Finally, a case study on the public event "2014 Shanghai Stampede" demonstrates the practical value of CAMEL for real-world applications.
    Print ISSN: 1041-4347
    Digital ISSN: 1558-2191
    Subject: Computer science
  • 46
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication date: 2018-01-12
    Description: Traditionally, recommender systems modelled the physical and cyber contextual influence on people's moving, querying, and browsing behaviors in isolation. Yet, searching, querying, and moving behaviors are intricately linked, especially indoors. Here, we introduce a tripartite location-query-browse graph (LQB) for nuanced contextual recommendations. The LQB graph consists of three kinds of nodes: locations, queries, and Web domains. Directed connections only between heterogeneous nodes represent the contextual influences, while connections of homogeneous nodes are inferred from the contextual influences of the other nodes. This tripartite LQB graph is more reliable than any monopartite or bipartite graph in contextual location, query, and Web content recommendations. We validate this LQB graph in an indoor retail scenario with an extensive dataset of three logs collected from over 120,000 anonymized, opt-in users over a 1-year period in a large inner-city mall in Sydney, Australia. We characterize the contextual influences that correspond to the arcs in the LQB graph, and evaluate the usefulness of the LQB graph for location, query, and Web content recommendations. The experimental results show that the LQB graph successfully captures the contextual influence and significantly outperforms the state of the art in these applications.
    Print ISSN: 1041-4347
    Digital ISSN: 1558-2191
    Subject: Computer science
  • 47
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication date: 2018-01-12
    Description: Data visualization is a common and effective technique for data exploration. However, for complex data, it is infeasible for an analyst to manually generate and browse all possible visualizations for insights. This observation motivated the need for automated solutions that can effectively recommend such visualizations. The main idea underlying those solutions is to evaluate the utility of all possible visualizations and then recommend the top-k visualizations. This process incurs a high data processing cost, which is further aggravated by the presence of numerical dimension attributes. To address that challenge, we propose novel view recommendation schemes, which incorporate a hybrid multi-objective utility function that captures the impact of numerical dimension attributes. Our first scheme, Multi-Objective View Recommendation for Data Exploration (MuVE), adopts an incremental evaluation of our multi-objective utility function, which allows pruning of a large number of low-utility views and avoids unnecessary objective evaluations. Our second scheme, upper MuVE (uMuVE), further improves the pruning power by setting upper bounds on the utility of views and allowing interleaved processing of views, at the expense of increased memory usage. Finally, our third scheme, Memory-aware uMuVE (MuMuVE), provides pruning power close to that of uMuVE, while keeping memory usage within a pre-specified limit.
    Print ISSN: 1041-4347
    Digital ISSN: 1558-2191
    Subject: Computer science
  • 48
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication date: 2018-01-12
    Description: Given a set of facilities and a set of users, a reverse nearest neighbors (RNN) query retrieves every user $u$ for which the query facility $q$ is its closest facility. Since $q$ is the closest facility to $u$, the user $u$ is said to be influenced by $q$. In this paper, we propose a relaxed definition of influence where a user $u$ is said to be influenced by not only its closest facility but also every other facility that is almost as close to $u$ as its closest facility is. Based on this definition of influence, we propose reverse approximate nearest neighbors (RANN) queries. Formally, given a value $x > 1$, an RANN query $q$ returns every user $u$ for which $dist(u,q) \leq x \times NNDist(u)$, where $NNDist(u)$ denotes the distance between a user $u$ and its nearest facility, i.e., $q$ is an approximate nearest neighbor of $u$. In this paper, we study both snapshot and continuous versions of RANN queries. In a snapshot RANN query, the underlying data sets do not change and the results of a query are to be computed only once. In the continuous version, the users continuously change their locations and the results of RANN queries are to be continuously monitored. Based on effective pruning techniques and several non-trivial observations, we propose efficient RANN query processing algorithms for both the snapshot and continuous RANN queries. We conduct extensive experiments on both real and synthetic da
    Print ISSN: 1041-4347
    Digital ISSN: 1558-2191
    Subject: Computer science
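The RANN condition from the abstract above, $dist(u,q) \leq x \times NNDist(u)$, can be checked directly with a brute-force snapshot query. This is a minimal sketch, assuming points in the Euclidean plane and a facility set that contains the query facility $q$; it is a naive baseline, not the pruning-based algorithm the paper proposes, and the function name is illustrative.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def rann_query(users: Sequence[Point], facilities: Sequence[Point],
               q: Point, x: float) -> List[Point]:
    """Return every user u with dist(u, q) <= x * NNDist(u).

    NNDist(u) is the distance from u to its nearest facility.
    The facility list is assumed to include q itself.
    """
    if x <= 1:
        raise ValueError("the RANN definition requires x > 1")
    influenced = []
    for u in users:
        nn_dist = min(math.dist(u, f) for f in facilities)
        if math.dist(u, q) <= x * nn_dist:
            influenced.append(u)
    return influenced
```

With facilities at (0, 0) and (10, 0), query facility (0, 0), and users at (4, 0), (6, 0), and (9, 0), the call with `x = 1.5` returns the first two users: the user at (6, 0) is nearer to the other facility (distance 4) but still within 1.5 times that distance of $q$. This check costs one pass over all facilities per user, which is the cost that pruning techniques aim to avoid.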
  • 49
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication date: 2018-01-12
    Description: Heterogeneous information networks that consist of multi-type, interconnected objects, such as social media networks and bibliographic networks, are becoming increasingly popular. The task of linking named entity mentions detected in unstructured Web text with their corresponding entities in a heterogeneous information network is of practical importance for the problem of information network population. This task is challenging due to name ambiguity and the limited knowledge existing in the network. Most existing entity linking methods focus on linking entities with Wikipedia and cannot be applied to our task. In this paper, we present SHINE+, a general framework for linking named entitieS in Web free text with a Heterogeneous Information NEtwork. We propose a probabilistic linking model, which unifies an entity popularity model with an entity object model. As the entity knowledge contained in the information network is insufficient, we propose a knowledge population algorithm to iteratively enrich the network entity knowledge by leveraging the context information of mentions mapped by the linking model with high confidence, which subsequently boosts the linking performance. Experimental results over two real heterogeneous information networks (i.e., DBLP and IMDb) demonstrate the effectiveness and efficiency of our proposed framework in comparison with the baselines.
    Print ISSN: 1041-4347
    Digital ISSN: 1558-2191
    Subject: Computer science
  • 50
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication date: 2018-01-12
    Description: A social approach can be exploited for the Internet of Things (IoT) to manage a large number of connected objects. These objects operate as autonomous agents to request and provide information and services to users. Establishing trustworthy relationships among the objects greatly improves the effectiveness of node interaction in the social IoT and helps nodes overcome perceptions of uncertainty and risk. However, there are limitations in the existing trust models. In this paper, a comprehensive model of trust is proposed that is tailored to the social IoT. The model includes ingredients such as trustor, trustee, goal, trustworthiness evaluation, decision, action, result, and context. Building on this trust model, we clarify the concepts of trust in the social IoT in five aspects: 1) mutuality of trustor and trustee; 2) inferential transfer of trust; 3) transitivity of trust; 4) trustworthiness update; and 5) trustworthiness affected by dynamic environment. With network connectivities taken from real-world social networks, a series of simulations is conducted to evaluate the performance of the social IoT operated with the proposed trust model. An experimental IoT network is used to further validate the proposed trust model.
    Print ISSN: 1041-4347
    Digital ISSN: 1558-2191
    Subject: Computer science
  • 51
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication date: 2018-01-12
    Description: Entity linking is one of the problems to be handled in order to process natural language and to enrich existing unstructured text with metadata. The generation of assignments between knowledge base entities and lexical units is called entity linking. Although a number of systems have been proposed for linking entity mentions in various languages, there is currently no publicly available entity linking system specific to the Turkish language. This paper presents a novel entity linking system, THINKER, for linking Turkish content with entities defined in the Turkish dictionary (tdk.gov.tr) or Turkish Wikipedia (tr.wikipedia.org). Specifically, we first propose a novel machine learning based entity detection algorithm for the Turkish language. Then, we propose a collective disambiguation algorithm which utilizes a set of metrics for the linking task and which is optimized using a genetic algorithm. The effectiveness of THINKER is validated empirically over generated data sets. The experimental results show that THINKER outperformed the state-of-the-art cross-lingual and multilingual entity linking systems in the literature. High entity linking performance (74.81 percent F1 score) is achieved by extending previous methods with some features specific to the Turkish language and by developing a novel method that can learn better representations of entity embeddings.
    Print ISSN: 1041-4347
    Digital ISSN: 1558-2191
    Subject: Computer science
  • 52
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication date: 2018-01-12
    Description: Presents a listing of reviewers who contributed to IEEE Transactions on Knowledge and Data Engineering in 2017.
    Print ISSN: 1041-4347
    Digital ISSN: 1558-2191
    Subject: Computer science
  • 53
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication date: 2018-01-12
    Description: Stock market volatility is influenced by information release, dissemination, and public acceptance. With the increasing volume and speed of social media, the effects of Web information on stock markets are becoming increasingly salient. However, studies of the effects of Web media on stock markets lack both depth and breadth due to the challenges in automatically acquiring and analyzing massive amounts of relevant information. In this study, we systematically reviewed 229 research articles on quantifying the interplay between Web media and stock markets from the fields of Finance, Management Information Systems, and Computer Science. In particular, we first categorized the representative works in terms of media type and then summarized the core techniques for converting textual information into machine-friendly forms. Finally, we compared the analysis models used to capture the hidden relationships between Web media and stock movements. Our goal is to clarify current cutting-edge research and its possible future directions to fully understand the mechanisms of Web information percolation and its impact on stock markets from the perspectives of investors' cognitive behaviors, corporate governance, and stock market regulation.
    Print ISSN: 1041-4347
    Digital ISSN: 1558-2191
    Subject: Computer science
  • 54
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication date: 2018-01-12
    Description: As the worlds of commerce and Internet technology become more inextricably linked, a large number of user consumption series become available for online market intelligence analysis. A critical demand along this line is to predict the future product adoption state of each user, which enables a wide range of applications such as targeted marketing. Nevertheless, previous works only aimed at predicting whether a user would adopt a particular product, using a binary buy-or-not representation. The problem of tracking and predicting users' adoption rates, i.e., the frequency and regularity of using each product over time, is still under-explored. To this end, we present a comprehensive study of product adoption rate prediction in a competitive market. This task is nontrivial as there are three major challenges in modeling users' complex adoption states: the heterogeneous data sources around users, the unique user preference, and the competitive product selection. To deal with these challenges, we first introduce a flexible factor-based decision function to capture the change of users' product adoption rate over time, where various factors that may influence users' decisions from heterogeneous data sources can be leveraged. Using this factor-based decision function, we then provide two corresponding models to learn the parameters of the decision function with both generalized and personalized assumptions of users' preferences. We further study how to leverage the competition among different products and simultaneously learn product competition and users' preferences with both generalized and personalized assumptions. Finally, extensive experiments on two real-world datasets show the superiority of our proposed models.
    Print ISSN: 1041-4347
    Digital ISSN: 1558-2191
    Subject: Computer science
  • 55
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication date: 2018-01-12
    Description: Triadic closure, which is ubiquitous in social networks, refers to the property among three individuals, A, B, and C, such that if there exist strong ties between A-B and A-C, then there must be a strong or weak tie between B-C. Related to triadic closure, the number of triangles has been extensively studied since it can be effectively used as a metric to analyze the structure and function of a network. In this paper, from a different viewpoint, we study triangle-free dense structures, which have received little attention. We focus on $K_{3,3}$, where there are two subsets of three vertices, and a vertex in one subset has an edge to every vertex in the other subset but no edge to any other vertex in the same subset. Such a $K_{n,n}$ in general implies a philosophical contradiction: (a) any two individuals are friends if they have no common friends, and (b) any two individuals are not friends if they have common friends. However, we find that such induced $K_{3,3}$ do exist frequently, and they do not disappear over time in a real academic collaboration network. In addition, in the real datasets tested, nearly all edges appearing in $K_{3,3}$ appear in some triangles. We analyze the expected numbers of induced $K_{3,3}$ and triangles ($\Delta$) in four representative random graph models, namely, the Erdős-Rényi random graph model, the Watts-Strogatz small-world model, the Barabási-Albert preferential attachment model, and the configuration model, and give an algorithm to enumerate all distinct $K_{3,3}$ in an undirected social network. We conduct extensive experiments on both real and synthetic datasets to confirm our findings. As an application, the $K_{3,3}$ found help to identify new stars with whom well-known figures collaborate even though those figures do not collaborate with each other.
    Print ISSN: 1041-4347
    Digital ISSN: 1558-2191
    Subject: Computer science
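The definition of an induced $K_{3,3}$ given in the abstract above (two triples of vertices, all nine cross edges present, no edge inside either triple) can be made concrete with a brute-force enumeration. This is a minimal sketch for small graphs under the assumption of a simple undirected graph with sortable vertex labels; it is not the enumeration algorithm of the paper, and the helper names are illustrative.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Hashable, Iterable, Set, Tuple

Adjacency = Dict[Hashable, Set[Hashable]]


def build_adjacency(edges: Iterable[Tuple[Hashable, Hashable]]) -> Adjacency:
    """Build an undirected adjacency map from an edge list."""
    adj: Adjacency = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def is_independent(adj: Adjacency, triple: Iterable[Hashable]) -> bool:
    """True if no two vertices of the triple are adjacent."""
    return not any(b in adj[a] for a, b in combinations(triple, 2))


def induced_k33(adj: Adjacency) -> Set[FrozenSet[FrozenSet[Hashable]]]:
    """Enumerate all distinct induced K_{3,3} as unordered pairs of triples."""
    found = set()
    for left in combinations(sorted(adj), 3):
        if not is_independent(adj, left):
            continue
        # Candidates for the other side must be adjacent to all of `left`.
        common = set.intersection(*(adj[v] for v in left))
        for right in combinations(sorted(common), 3):
            if is_independent(adj, right):
                # The same subgraph is reached from both sides, so store
                # it as an unordered pair to avoid counting it twice.
                found.add(frozenset((frozenset(left), frozenset(right))))
    return found
```

On the complete bipartite graph between {0, 1, 2} and {3, 4, 5} this returns exactly one structure; adding the edge (0, 1) creates triangles and the result becomes empty, which matches the triangle-free requirement.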
  • 56
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication date: 2018-01-12
    Description: In privacy-preserving record linkage, a number of data custodians encode their records and submit them to a trusted third party that is responsible for identifying those records that refer to the same real-world entity. In this paper, we propose FEDERAL, a novel record linkage framework that implements methods for anonymizing both string and numerical data values, which are typically present in data records. These methods rely on a strong theoretical foundation for rigorously specifying the dimensionality of the anonymization space, into which the original values are embedded, to provide accuracy and privacy guarantees under various models of privacy attacks. A key component of the applied embedding process is the threshold that is required by the distance computations, which we prove can be formally specified to guarantee accurate results. We evaluate our framework using three real-world data sets with varying characteristics. Our experimental findings show that FEDERAL offers a complete and effective solution for accurately identifying matching anonymized record pairs (with recall rates consistently above 93 percent) in large-scale privacy-preserving record linkage tasks.
    Print ISSN: 1041-4347
    Digital ISSN: 1558-2191
    Subject: Computer science
  • 57
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication date: 2018-01-12
    Description: Efficient processing of large-scale graphs in distributed environments has been an increasingly popular topic of research in recent years. Interconnected data that can be modeled as graphs appear in application domains such as machine learning, recommendation, web search, and social network analysis. Writing distributed graph applications is inherently hard and requires programming models that can cover a diverse set of problems, including iterative refinement algorithms, graph transformations, graph aggregations, pattern matching, ego-network analysis, and graph traversals. Several high-level programming abstractions have been proposed and adopted by distributed graph processing systems and big data platforms. Even though significant work has been done to experimentally compare distributed graph processing frameworks, no qualitative study and comparison of graph programming abstractions has been conducted yet. In this survey, we review and analyze the most prevalent high-level programming models for distributed graph processing, in terms of their semantics and applicability. We review 34 distributed graph processing systems with respect to the graph processing models they implement, and we survey applications that appear in recent distributed graph systems papers. Finally, we discuss trends and open research questions in the area of distributed graph processing.
    Print ISSN: 1041-4347
    Digital ISSN: 1558-2191
    Subject: Computer science
  • 58
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication date: 2017-09-13
    Description: Social recommender systems, which use social relation networks as additional input to improve the accuracy of traditional recommender systems, have become an important research topic. However, most existing methods utilize the entire user relationship network with no consideration of its huge size, sparsity, imbalance, and noise issues. This may degrade the efficiency and accuracy of social recommender systems. This study proposes a new approach to manage the complexity of adding social relation networks to recommender systems. Our method first generates an individual relationship network (IRN) for each user and item by developing a novel fitting algorithm of relationship networks to control the relationship propagation and contraction. We then fuse matrix factorization with social regularization and the neighborhood model using IRNs to generate recommendations. Our approach is quite general and can also be applied to the item-item relationship network by switching the roles of users and items. Experiments on four datasets with different sizes, sparsity levels, and relationship types show that our approach can improve predictive accuracy and achieve better scalability compared with state-of-the-art social recommendation methods.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Subject: Computer science
  • 59
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication date: 2017-09-13
    Description: A heterogeneous graph is a popular data model for representing real-world relations with abundant semantics. An important step in analyzing heterogeneous graphs is extracting homogeneous graphs from them, called homogeneous graph extraction. In an extracted homogeneous graph, the relation is defined by a line pattern on the heterogeneous graph and the new attribute values of the relation are calculated by user-defined aggregate functions. The key challenges of the extraction problem are how to efficiently enumerate paths matched by the line pattern and how to aggregate values for each pair of vertices from the matched paths. To address these two challenges, we propose a parallel graph extraction framework, where we use a vertex-centric model to enumerate paths and compute aggregate functions in parallel. The framework compiles the line pattern into a path concatenation plan, which determines the order of concatenating paths and generates the final paths in a divide-and-conquer manner. We introduce a cost model to estimate the cost of a plan and discuss three plan selection strategies, among which the best plan can enumerate paths in $\mathcal{O}(\log l)$ iterations, where $l$ is the length of a pattern. Furthermore, to improve the performance of evaluating aggregate functions, we classify the aggregate functions into three categories, i.e., distributive aggregation, algebraic aggregation, and holistic aggregation. Since the distributive and algebraic aggregations can be computed from the partial paths, we speed up the aggregation by computing partial aggregate values during the path enumeration.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Subject: Computer science
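Note on record 59: the logarithmic iteration bound quoted in the description can be made plausible with a small sketch. It is not the framework's plan selection or cost model; it only assumes that a pattern of `l` edges is built by joining adjacent partial paths pairwise in each round. The function name `concatenation_rounds` is hypothetical.

```python
def concatenation_rounds(pattern_length: int) -> int:
    """Count rounds needed to assemble a path of `pattern_length` edges
    when every round joins neighbouring partial paths pairwise."""
    if pattern_length < 1:
        raise ValueError("pattern_length must be at least 1")
    # Each partial path is represented only by its length; start with single edges.
    segments = [1] * pattern_length
    rounds = 0
    while len(segments) > 1:
        # Join neighbours (0,1), (2,3), ...; an unpaired last segment carries over.
        segments = [sum(segments[i:i + 2]) for i in range(0, len(segments), 2)]
        rounds += 1
    return rounds


if __name__ == "__main__":
    for l in (1, 2, 5, 8, 100):
        print(l, concatenation_rounds(l))
```

Under this assumption the number of rounds equals the ceiling of log2(l), which matches the order of growth stated in the description.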
  • 60
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication date: 2017-09-13
    Description: Recently, social networks have witnessed a massive surge in popularity. A key issue in social network research is network evolution analysis, which assumes that all the autonomous nodes in a social network follow uniform evolution mechanisms. However, different nodes in a social network should have different evolution mechanisms to generate different edges. This paper proposes this as the underlying idea of the nodes’ evolution diversity. Our approach involves identifying the micro-level node evolution that generates different edges by introducing the existing link prediction methods from the perspectives of nodes. We also propose the edge generation coefficient to evaluate the extent to which an edge's generation can be explained by a link prediction method. To quantify the nodes’ evolution diversity, we define the diverse evolution distance. Furthermore, a diverse node adaption algorithm is proposed to indirectly analyze the evolution of the entire network based on the nodes’ evolution diversity. Extensive experiments on disparate real-world networks demonstrate that the introduction of the nodes’ evolution diversity is important and beneficial for analyzing the network evolution. The diverse node adaption algorithm outperforms other state-of-the-art link prediction algorithms in terms of both accuracy and universality. The greater the nodes’ evolution diversity, the more obvious its advantages.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Subject: Computer science
  • 61
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication date: 2017-09-13
    Description: Probabilistic top-$k$ ranking is an important and well-studied query operator in uncertain databases. However, the quality of top-$k$ results might be heavily affected by the ambiguity and uncertainty of the underlying data. Uncertainty reduction techniques have been proposed to improve the quality of top-$k$ results by cleaning the original data. Unfortunately, most data cleaning models aim to probe the exact values of the objects individually and therefore do not work well for subjective data types, such as user ratings, which are inherently probabilistic. In this paper, we propose a novel pairwise crowdsourcing model to reduce the uncertainty of top-$k$ ranking using a crowd of domain experts. Given a crowdsourcing task with a limited budget, we propose efficient algorithms to select the best object pairs for crowdsourcing that will bring in the highest quality improvement. Extensive experiments show that our proposed solutions outperform a random selection method by up to 30 times in terms of quality improvement of probabilistic top-$k$ ranking queries. In terms of efficiency, our proposed solutions can reduce the elapsed time of a brute-force algorithm from several days to one minute.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Subject: Computer science
  • 62
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication date: 2017-08-12
    Description: The increase of interest in using social media as a source for research has motivated tackling the challenge of automatically geolocating tweets, given the lack of explicit location information in the majority of tweets. In contrast to much previous work that has focused on location classification of tweets restricted to a specific country, here we undertake the task in a broader context by classifying global tweets at the country level, which is so far unexplored in a real-time scenario. We analyze the extent to which a tweet’s country of origin can be determined by making use of eight tweet-inherent features for classification. Furthermore, we use two datasets, collected a year apart from each other, to analyze the extent to which a model trained from historical tweets can still be leveraged for classification of new tweets. With classification experiments on all 217 countries in our datasets, as well as on the top 25 countries, we offer some insights into the best use of tweet-inherent features for an accurate country-level classification of tweets. We find that the use of a single feature, such as tweet content alone (the most widely used feature in previous work), leaves much to be desired. Choosing an appropriate combination of both tweet content and metadata can actually lead to substantial improvements of between 20 and 50 percent. We observe that tweet content, the user’s self-reported location and the user’s real name, all of which are inherent in a tweet and available in a real-time scenario, are particularly useful to determine the country of origin. We also experiment on the applicability of a model trained on historical tweets to classify new tweets, finding that the choice of a particular combination of features whose utility does not fade over time can actually lead to comparable performance, avoiding the need to retrain. However, the difficulty of achieving accurate classification increases slightly for countries with multiple commonalities, especially for English and Spanish speaking countries.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Subject: Computer science
  • 63
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication date: 2017-08-12
    Description: The query logs from an on-line map query system provide rich cues to understand the behaviors of human crowds. With the growing ability to collect large-scale query logs, query suggestion has been a topic of recent interest. In general, query suggestion aims at recommending a list of relevant queries w.r.t. users’ inputs via an appropriate learning of crowds’ query logs. In this paper, we are particularly interested in map query suggestions (e.g., the predictions of location-related queries) and propose a novel model, Hierarchical Contextual Attention Recurrent Neural Network (HCAR-NN), for map query suggestion in an encoding-decoding manner. Given crowds’ map query logs, our proposed HCAR-NN not only learns the local temporal correlation among map queries in a query session (e.g., queries in a short-term interval are relevant to accomplish a search mission), but also captures the global longer range contextual dependencies among map query sessions in query logs (e.g., how a sequence of queries within a short-term interval has an influence on another sequence of queries). We evaluate our approach over millions of queries from a commercial search engine (i.e., Baidu Map). Experimental results show that the proposed approach provides significant performance improvements over the competitive existing methods in terms of classical metrics (i.e., Recall@K and MRR) as well as the prediction of crowds’ search missions.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Subject: Computer science
  • 64
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication date: 2017-09-13
    Description: Data science models, although successful in a number of commercial domains, have had limited applicability in scientific problems involving complex physical phenomena. Theory-guided data science (TGDS) is an emerging paradigm that aims to leverage the wealth of scientific knowledge for improving the effectiveness of data science models in enabling scientific discovery. The overarching vision of TGDS is to introduce scientific consistency as an essential component for learning generalizable models. Further, by producing scientifically interpretable models, TGDS aims to advance our scientific understanding by discovering novel domain insights. Indeed, the paradigm of TGDS has started to gain prominence in a number of scientific disciplines such as turbulence modeling, material discovery, quantum chemistry, bio-medical science, bio-marker discovery, climate science, and hydrology. In this paper, we formally conceptualize the paradigm of TGDS and present a taxonomy of research themes in TGDS. We describe several approaches for integrating domain knowledge in different research themes using illustrative examples from different disciplines. We also highlight some of the promising avenues of novel research for realizing the full potential of theory-guided data science.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Subject: Computer science
  • 65
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication date: 2017-09-13
    Description: Many feature extraction methods reduce the dimensionality of data based on the input graph matrix. The graph construction, which reflects relationships among raw data points, is crucial to the quality of the resulting low-dimensional representations. To improve the quality of the graph and make it more suitable for feature extraction tasks, we incorporate a new graph learning mechanism into feature extraction and add an interaction between the learned graph and the low-dimensional representations. Based on this learning mechanism, we propose a novel framework, termed unsupervised single view feature extraction with structured graph (FESG), which learns both a transformation matrix and an ideal structured graph containing the clustering information. Moreover, we propose a novel way to extend the FESG framework for multi-view learning tasks. The extension is named unsupervised multiple views feature extraction with structured graph (MFESG), which learns an optimal weight for each view automatically without requiring an additional parameter. To show the effectiveness of the framework, we design two concrete formulations within FESG and MFESG, together with two efficient solving algorithms. Promising experimental results on many real-world datasets have validated the effectiveness of our proposed algorithms.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Subject: Computer science
  • 66
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication date: 2017-09-13
    Description: We propose a parametric network generation model, which we call the network reconstruction model (NRM), for structural reconstruction of scale-free real networks with a power-law exponent greater than 2 in the tail of their degree distribution. The reconstruction method for a real network is concerned with finding the optimal values of the model parameters by utilizing the power-law exponents of the model network and the real network. The method is validated for certain real-world networks. The usefulness of NRM for solving the structural reconstruction problem is demonstrated by comparing its performance with some existing popular network generative models. We show that NRM can generate networks which follow edge-densification and densification power-law when the model parameters satisfy an inequality. Computable expressions of the expected number of triangles and expected diameter are obtained for model networks generated by NRM. Finally, we numerically establish that NRM can generate networks with shrinking diameter and modular structure when specific model parameters are chosen.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Subject: Computer science
  • 67
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication date: 2017-09-13
    Description: Censoring is a common phenomenon that arises in many longitudinal studies where an event of interest could not be recorded within the given time frame. Censoring causes missing time-to-event labels, and this effect is compounded when dealing with datasets which have high amounts of censored instances. In addition, dependent censoring in the data, where censoring is dependent on the covariates in the data, leads to bias in standard survival estimators. This motivates us to develop an approach for pre-processing censored data which calibrates the right censored (RC) times in an attempt to reduce the bias in the survival estimators. This calibration is done using an imputation method which estimates the sparse inverse covariance matrix over the dataset in an iterative convergence framework. During estimation, we apply row and column-based regularization to account for both row and column-wise correlations between different instances while imputing them. This is followed by comparing these imputed censored times with the original RC times to obtain the final calibrated RC times. These calibrated RC times can now be used in the survival dataset in place of the original RC times for more effective prediction. One of the major benefits of our calibration approach is that it is a pre-processing method for censored data which can be used in conjunction with any survival prediction algorithm and improve its performance. We evaluate the goodness of our approach using a wide array of survival prediction algorithms which are applied over crowdfunding data, electronic health records (EHRs), and synthetic censored datasets. Experimental results indicate that our calibration method improves the AUC values of survival prediction algorithms, compared to applying them directly on the original survival data.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Subject: Computer science
  • 68
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication date: 2017-09-13
    Description: Business processes are prone to unexpected changes, as process workers may suddenly or gradually start executing a process differently in order to adjust to changes in workload, season, or other external factors. Early detection of business process changes enables managers to identify and act upon changes that may otherwise affect process performance. Business process drift detection refers to a family of methods to detect changes in a business process by analyzing event logs extracted from the systems that support the execution of the process. Existing methods for business process drift detection are based on an explorative analysis of a potentially large feature space and in some cases they require users to manually identify specific features that characterize the drift. Depending on the explored feature space, these methods miss various types of changes. Moreover, they are designed to detect either sudden drifts or gradual drifts, but not both. This paper proposes an automated and statistically grounded method for detecting sudden and gradual business process drifts under a unified framework. An empirical evaluation shows that the method detects typical change patterns with significantly higher accuracy and lower detection delay than existing methods, while accurately distinguishing between sudden and gradual drifts.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Subject: Computer science
  • 69
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication date: 2017-09-13
    Description: With the soaring development of large-scale online social networks, online information sharing is becoming ubiquitous. Both positive and negative information propagates through online social networks. In this paper, we focus on negative information problems such as online rumors. Rumor blocking is a serious problem in large-scale social networks. Malicious rumors could cause chaos in society and hence need to be blocked as soon as possible after being detected. In this paper, we propose a model of dynamic rumor influence minimization with user experience (DRIMUX). Our goal is to minimize the influence of the rumor (i.e., the number of users that have accepted and sent the rumor) by blocking a certain subset of nodes. A dynamic Ising propagation model considering both the global popularity and individual attraction of the rumor is presented based on a realistic scenario. In addition, different from existing problems of influence minimization, we take into account the constraint of user experience utility. Specifically, each node is assigned a tolerance time threshold. If the blocking time of a user exceeds that threshold, the utility of the network will decrease. Under this constraint, we then formulate the problem as a network inference problem with survival theory, and propose solutions based on the maximum likelihood principle. Experiments are implemented based on large-scale real-world networks and validate the effectiveness of our method.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Subject: Computer science
  • 70
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication date: 2017-09-13
    Description: Monotonic classification is a kind of classification task in which a monotonicity constraint exists between features and class, i.e., if sample $x_i$ has a higher value in each feature than sample $x_j$, it should be assigned to a class with a higher level than the level of $x_j$'s class. Several methods have been proposed, but they have limitations, such as handling only limited kinds of data or achieving limited classification accuracy. In our former work, the classification accuracy on monotonic classification was improved by fusing monotonic decision trees, but the resulting classification model is always complex. This work aims to find a monotonic classifier that processes both nominal and numeric data by fusing complete monotonic decision trees. By finding the complete feature subsets based on the discernibility matrix of an ordinal dataset, a set of monotonic decision trees can be obtained directly and automatically, on which the rank is still preserved. Fewer decision trees are needed, which serve as base classifiers to construct a decision forest of fused complete monotonic decision trees. The experimental results on 10 datasets demonstrate that the proposed method can effectively reduce the number of base classifiers, thereby simplifying the classification model, while still obtaining good classification performance.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Subject: Computer science
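Note on record 70: the monotonicity constraint in the description can be checked on a labeled dataset with a few lines of code. The sketch below is only an illustration of the constraint, not the paper's decision-forest method. It uses the common non-strict form (a sample that is at least as large in every feature must not receive a lower class), which is an assumption; the function names are hypothetical.

```python
from itertools import permutations
from typing import List, Sequence, Tuple


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Return True if `a` is at least as large as `b` in every feature."""
    return all(x >= y for x, y in zip(a, b))


def monotonicity_violations(
    samples: Sequence[Sequence[float]], labels: Sequence[int]
) -> List[Tuple[int, int]]:
    """List ordered index pairs (i, j) where sample i dominates sample j
    but carries a strictly lower class level."""
    return [
        (i, j)
        for i, j in permutations(range(len(samples)), 2)
        if dominates(samples[i], samples[j]) and labels[i] < labels[j]
    ]


if __name__ == "__main__":
    samples = [(1.0, 2.0), (2.0, 3.0), (3.0, 1.0)]
    labels = [1, 0, 2]
    # Sample 1 dominates sample 0 but has the lower class, so (1, 0) is reported.
    print(monotonicity_violations(samples, labels))
```

The pairwise check is quadratic in the number of samples, which is fine for illustration but not for large datasets.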
  • 71
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication date: 2017-09-13
    Description: In the fields of pattern recognition, data analysis, and machine learning, data points are usually modeled as high-dimensional vectors. Due to the curse of dimensionality, it is non-trivial to efficiently process the original data directly. Given the unique properties of nonlinear dimensionality reduction techniques, nonlinear learning methods are widely adopted to reduce the dimension of data. However, existing nonlinear learning methods fail in many real applications because of requirements that are too strict for real data or because of the difficulty of parameter tuning. Therefore, in this paper, we investigate the manifold learning methods which belong to the family of nonlinear dimensionality reduction methods. Specifically, we propose a new manifold learning principle for dimensionality reduction named Curved Cosine Mapping (CCM). Based on the law of cosines in Euclidean space, CCM applies a brand new mapping pattern to manifold learning. In CCM, the nonlinear geometric relationships are obtained by utilizing the law of cosines, and then quantified as the dimensionality-reduced features. Compared with the existing approaches, the model has weaker theoretical assumptions over the input data. Moreover, to further reduce the computation cost, an optimized version of CCM is developed. Finally, we conduct extensive experiments over both artificial and real-world datasets to demonstrate the performance of the proposed techniques.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Subject: Computer science
  • 72
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication date: 2017-09-13
    Description: Database design is critical for high performance in relational databases and a myriad of tools exist to aid application designers in selecting an appropriate schema. While the problem of schema optimization is also highly relevant for NoSQL databases, existing tools for relational databases are inadequate in that setting. Application designers wishing to use a NoSQL database instead rely on rules of thumb to select an appropriate schema. We present a system for recommending database schemas for NoSQL applications. Our cost-based approach uses a novel binary integer programming formulation to guide the mapping from the application's conceptual data model to a database schema. We implemented a prototype of this approach for the Cassandra extensible record store. Our prototype, the NoSQL Schema Evaluator (NoSE), is able to capture rules of thumb used by expert designers without explicitly encoding the rules. Automating the design process allows NoSE to produce efficient schemas and to examine more alternatives than would be possible with a manual rule-based approach.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Subject: Computer science
  • 73
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication date: 2017-09-13
    Description: During the last decade, community-based question answering (CQA) sites have accumulated a vast amount of questions and their crowdsourced answers over time. How to efficiently identify the quality of answers that are relevant to a given question has become an active line of research in CQA. The major challenge of CQA is the accurate selection of high-quality answers w.r.t. given questions. Previous approaches tend to model the semantic matching between an individual pair of one question and its corresponding answer (how fitting an answer is to a posted question). However, these works ignore the temporal interactions between answers (how previous answers influence the later posted answers). For example, a rational user likely adapts others’ opinions, revises his inclinations, and posts a more appropriate answer after understanding the given question and previously posted answers. As a result, this paper devises an architecture named Temporal Interaction and Causal Influence LSTM (TC-LSTM) to effectively leverage not only the causal influence between question-answer (how appropriate an answer is for a given question) but also the temporal interactions between answers-answer (how a high-quality answer gradually forms). In particular, long short-term memory (LSTM) is used to capture the explicit question-answer influence and the implicit answers-answer interactions. Experiments are conducted on the SemEval 2015 CQA dataset for the answer classification task and the Baidu Zhidao dataset for the answer ranking task. The experimental results show the advantage of our model compared with other state-of-the-art methods.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Subject: Computer science
  • 74
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication date: 2017-09-13
    Description: The recent big data and IoT era has presented a number of applications that generate objects in a streaming fashion. It is well known that real-time mining of important patterns from data streams supports many domains. In retail markets and social network services, for example, such patterns are itemsets and words that frequently appear in many user accounts, i.e., co-occurrence patterns. To efficiently monitor co-occurrence patterns, we address the novel problem of mining top-k closed co-occurrence patterns across multiple streams. We employ a sliding window setting in this problem, and each pattern is ranked based on its count, which is the number of streams that have generated the pattern. Since objects are consecutively generated and deleted, the count of a given pattern is dynamic, which may change the rank of the pattern. This makes monitoring the top-k answer in real time challenging. We propose an index-based algorithm that addresses the challenge and provides the exact answer. Specifically, we propose the CP-Graph, a hybrid index of graph and inverted file structures. The CP-Graph can efficiently compute the count of a given pattern and update the answer while pruning unnecessary patterns. Our experimental study on real datasets demonstrates the efficiency and scalability of our solution.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Subject: Computer science
  • 75
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication date: 2017-09-13
    Description: Networks are prevalent in many high impact domains. Moreover, cross-domain interactions are frequently observed in many applications, which naturally form the dependencies between different networks. Such highly coupled network systems are referred to as multi-layered networks, and have been used to characterize various complex systems, including critical infrastructure networks, cyber-physical systems, collaboration platforms, biological systems, and many more. Different from single-layered networks where the functionality of their nodes is mainly affected by within-layer connections, multi-layered networks are more vulnerable to disturbance as the impact can be amplified through cross-layer dependencies, leading to the cascade failure of the entire system. To manipulate the connectivity in multi-layered networks, some recent methods have been proposed based on two-layered networks with specific types of connectivity measures. In this paper, we address the above challenges in multiple dimensions. First, we propose a family of connectivity measures (SubLine) that unifies a wide range of classic network connectivity measures. Third, we reveal that the connectivity measures in the SubLine family enjoy the diminishing returns property, which guarantees a near-optimal solution with linear complexity for the connectivity optimization problem. Finally, we evaluate our proposed algorithm on real data sets to demonstrate its effectiveness and efficiency.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Subject: Computer science
  • 76
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication date: 2017-09-13
    Description: The increasing popularity of social media has encouraged health consumers to share, explore, and validate health and wellness information on social networks, which provide a rich repository of Patient Generated Wellness Data (PGWD). While data-driven healthcare has attracted a lot of attention from academia and industry for improving care delivery through personalized healthcare, limited research has been done on harvesting and utilizing PGWD available on social networks. Recently, representation learning has been widely used in many applications to learn low-dimensional embeddings of users. However, existing approaches for representation learning are not directly applicable to PGWD due to its domain nature as characterized by longitudinality, incompleteness, and sparsity of observed data as well as heterogeneity of the patient population. To tackle these problems, we propose an approach which directly learns the embedding from longitudinal data of users, instead of vector-based representation. In particular, we simultaneously learn a low-dimensional latent space as well as the temporal evolution of users in the wellness space. The proposed method takes into account two types of wellness prior knowledge: (1) temporal progression of wellness attributes; and (2) heterogeneity of wellness attributes in the patient population. Our approach scales well to large datasets using parallel stochastic gradient descent. We conduct extensive experiments to evaluate our framework on three major tasks in the wellness domain: attribute prediction, success prediction, and community detection. Experimental results on two real-world datasets demonstrate the ability of our approach to learn effective user representations.
    Print ISSN: 1041-4347
    Digital ISSN: 1558-2191
    Subject: Computer Science
  • 77
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication date: 2017-09-13
    Description: Widespread location-aware applications produce a vast amount of spatio-textual data that contains both spatial and textual attributes. To let users draw on this enriched information when describing their preferences for travel routes, we propose a Bounded-Cost Informative Route (BCIR) query that retrieves the routes most textually relevant to the user-specified query keywords, subject to a travel cost constraint. The BCIR query is particularly helpful for tourists and city explorers planning their travel routes. We show that the BCIR query is an NP-hard problem. To answer BCIR queries efficiently, we propose an exact solution with effective pruning techniques and two approximate solutions with performance guarantees. Extensive experiments over real data sets demonstrate that the proposed solutions achieve the expected performance.
    Print ISSN: 1041-4347
    Digital ISSN: 1558-2191
    Subject: Computer Science
  • 78
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication date: 2017-09-13
    Description: In partial label learning, each training example is associated with a set of candidate labels, among which only one is the ground-truth label. The common strategy to induce a predictive model is to disambiguate the candidate label set, i.e., to differentiate the modeling outputs of individual candidate labels. Specifically, disambiguation by differentiation can be conducted either by identifying the ground-truth label iteratively or by treating each candidate label equally. Nonetheless, the disambiguation strategy is prone to be misled by the false positive labels co-occurring with the ground-truth label. In this paper, a new partial label learning strategy is studied which refrains from conducting disambiguation. Specifically, by adapting error-correcting output codes (ECOC), a simple yet effective approach named PL-ECOC is proposed that utilizes the candidate label set as an entirety. During the training phase, to build the binary classifier for each column coding, a partially labeled example is regarded as a positive or negative training example only if its candidate label set falls entirely into the coding dichotomy. During the testing phase, the class label for an unseen instance is determined via loss-based decoding, which considers the binary classifiers’ empirical performance and predictive margin. Extensive experiments show that PL-ECOC performs favorably against state-of-the-art partial label learning approaches.
    Print ISSN: 1041-4347
    Digital ISSN: 1558-2191
    Subject: Computer Science
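The training rule described in the abstract above (an example is used for a column's binary classifier only if its whole candidate label set lies on one side of the dichotomy) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the dictionary encoding of a coding column, and the assumption that candidate sets are non-empty are all hypothetical choices made for illustration.

```python
from typing import Dict, FrozenSet, Hashable, List, Optional, Sequence, Tuple

Label = Hashable


def assign_binary_label(
    candidates: FrozenSet[Label],
    positive_classes: FrozenSet[Label],
    negative_classes: FrozenSet[Label],
) -> Optional[int]:
    """Return +1 or -1 if the candidate set lies entirely on one side, else None."""
    if candidates <= positive_classes:
        return 1
    if candidates <= negative_classes:
        return -1
    # The candidate set straddles the dichotomy, so the example is skipped.
    return None


def build_binary_training_set(
    examples: Sequence[Tuple[list, set]],
    column: Dict[Label, int],
) -> List[Tuple[list, int]]:
    """Build the training set for one coding column (hypothetical encoding:
    a dict mapping each class label to +1 or -1)."""
    positives = frozenset(c for c, bit in column.items() if bit == 1)
    negatives = frozenset(c for c, bit in column.items() if bit == -1)
    training_set = []
    for features, candidates in examples:
        label = assign_binary_label(frozenset(candidates), positives, negatives)
        if label is not None:
            training_set.append((features, label))
    return training_set


examples = [([0.1], {"a", "b"}), ([0.2], {"a", "c"}), ([0.3], {"c"})]
column = {"a": 1, "b": 1, "c": -1}
binary_set = build_binary_training_set(examples, column)
```

In this toy run the second example is dropped for this column because its candidates `{"a", "c"}` fall on both sides of the dichotomy; no attempt is made to guess which candidate is the true label.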
  • 79
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication date: 2017-09-13
    Description: Convolutional Neural Networks (CNNs) have attracted attention in image analytics and speech recognition in recent years. However, employing CNNs for the classification of graphs remains challenging. This paper presents the Ngram graph-block based convolutional neural network model for classification of graphs. Our Ngram deep learning framework consists of three novel components. First, we introduce the concept of the $n$-gram block to transform each raw graph object into a sequence of $n$-gram blocks connected through overlapping regions. Second, we introduce a diagonal convolution step to extract local patterns and connectivity features hidden in these $n$-gram blocks by performing $n$-gram normalization. Finally, we develop deeper global patterns based on the local patterns and the ways that they respond to overlapping regions by building an $n$-gram deep learning model using a convolutional neural network. We evaluate the effectiveness of our approach by comparing it with the existing state-of-the-art methods using five real graph repositories from the bioinformatics and social network domains. Our results show that the Ngram approach outperforms existing methods with high accuracy and comparable performance.
    Print ISSN: 1041-4347
    Digital ISSN: 1558-2191
    Subject: Computer Science
  • 80
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication date: 2017-09-13
    Description: Linear discriminant analysis (LDA) is one of the most important supervised linear dimensionality reduction techniques. It seeks to learn a low-dimensional representation from the original high-dimensional feature space through a transformation matrix, while preserving the discriminative information by maximizing the between-class scatter matrix and minimizing the within-class scatter matrix. However, conventional LDA is formulated to maximize the arithmetic mean of trace ratios, which suffers from the domination of the largest objectives and might deteriorate the recognition accuracy in practical applications with a large number of classes. In this paper, we propose a new criterion that maximizes the weighted harmonic mean of trace ratios, which effectively avoids the domination problem without raising any difficulties in the formulation. An efficient algorithm with fast convergence is used to solve the proposed challenging problem; it might always find the globally optimal solution using only eigenvalue decomposition in each iteration. Finally, we conduct extensive experiments to illustrate the effectiveness and superiority of our method on both synthetic and real-life datasets for various tasks, including face recognition, human motion recognition, and head pose recognition. The experimental results indicate that our algorithm consistently outperforms the other compared methods on all of the datasets.
    Print ISSN: 1041-4347
    Digital ISSN: 1558-2191
    Subject: Computer Science
  • 81
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication date: 2017-09-13
    Description: Social media plays a major role in helping people affected by natural calamities. These people use social media to request information and help in situations where time is a critical commodity. However, generic social media platforms like Twitter and Facebook are not conducive to obtaining answers promptly. Algorithms that aim to find prompt responders for questions in social media have to understand and model the factors affecting response time. In this paper, we draw from sociological studies on information seeking and organizational behavior to identify users who can provide timely and relevant responses to questions posted on social media. We first draw from these theories to model the future availability and past response behavior of the candidate responders and integrate these criteria with user relevance. We propose a learning algorithm that uses these criteria to derive optimal rankings of responders for a given question. We treat questions posted on Twitter as a form of information-seeking activity in social media and use them to evaluate our framework. Our experiments demonstrate that the proposed framework is useful in identifying timely and relevant responders for questions in social media.
    Print ISSN: 1041-4347
    Digital ISSN: 1558-2191
    Subject: Computer Science
  • 82
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication date: 2017-06-07
    Description: Over the last decades, several studies have demonstrated the importance of co-clustering to simultaneously produce groups of objects and features. Even to obtain object clusters only, using co-clustering is often more effective than one-way clustering, especially when considering sparse high-dimensional data. In this paper, we present a novel generative mixture model for co-clustering such data. This model, the Sparse Poisson Latent Block Model (SPLBM), is based on the Poisson distribution, which arises naturally for contingency tables, such as document-term matrices. The advantages of SPLBM are two-fold. First, it is a rigorous statistical model which is also very parsimonious. Second, it has been designed from the ground up to deal with data sparsity problems. As a consequence, in addition to seeking homogeneous blocks, as other available algorithms do, it also filters out homogeneous but noisy ones due to the sparsity of the data. Experiments on various datasets of different sizes and structures show that an algorithm based on SPLBM clearly outperforms state-of-the-art algorithms. Most notably, the SPLBM-based algorithm presented here succeeds in retrieving the natural cluster structure of difficult, unbalanced datasets which other known algorithms are unable to handle effectively.
    Print ISSN: 1041-4347
    Digital ISSN: 1558-2191
    Subject: Computer Science
  • 83
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication date: 2017-06-07
    Description: With the increasing availability of moving-object tracking data, trajectory search is increasingly important. We propose and investigate a novel query type named trajectory search by regions of interest (TSR query). Given an argument set of trajectories, a TSR query takes a set of regions of interest as a parameter and returns the trajectory in the argument set with the highest spatial-density correlation to the query regions. This type of query is useful in many popular applications such as trip planning and recommendation, and location-based services in general. TSR query processing faces three challenges: how to model the spatial-density correlation between query regions and data trajectories, how to effectively prune the search space, and how to effectively schedule multiple so-called query sources. To tackle these challenges, a series of new metrics is defined to model spatial-density correlations. An efficient trajectory search algorithm is developed that exploits upper and lower bounds to prune the search space, adopts a query-source selection strategy, and integrates a heuristic search strategy based on priority ranking to schedule multiple query sources. The performance of TSR query processing is studied in extensive experiments based on real and synthetic spatial data.
    Print ISSN: 1041-4347
    Digital ISSN: 1558-2191
    Subject: Computer Science
  • 84
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication date: 2017-06-07
    Description: Traditional relational topic models provide a successful way to discover the hidden topics from a document network. Many theoretical and practical tasks, such as dimensionality reduction, document clustering, and link prediction, could benefit from this revealed knowledge. However, existing relational topic models are based on the assumption that the number of hidden topics is known a priori, which is impractical in many real-world applications. Therefore, to relax this assumption, we propose a nonparametric relational topic model that uses stochastic processes instead of fixed-dimensional probability distributions. Specifically, each document is assigned a Gamma process, which represents the topic interest of this document. Although this method provides an elegant solution, it brings additional challenges when mathematically modeling the inherent network structure of a typical document network, i.e., two spatially closer documents tend to have more similar topics. Furthermore, we require that the topics are shared by all the documents. To resolve these challenges, we use a subsampling strategy to assign each document a different Gamma process from the global Gamma process, and the subsampling probabilities of documents are assigned with a Markov Random Field constraint that inherits the document network structure. Through the designed posterior inference algorithm, we can discover the hidden topics and their number simultaneously. Experimental results on both synthetic and real-world network datasets demonstrate the capability of learning the hidden topics and, more importantly, the number of topics.
    Print ISSN: 1041-4347
    Digital ISSN: 1558-2191
    Subject: Computer Science
  • 85
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication date: 2017-06-07
    Description: We present a new two-level composition model for crowdsourced Sensor-Cloud services based on dynamic features such as spatio-temporal aspects. The proposed approach is defined based on a formal Sensor-Cloud service model that abstracts the functionality and non-functional aspects of sensor data on the cloud in terms of spatio-temporal features. We propose a spatio-temporal indexing technique based on the 3D R-tree to enable fast identification of appropriate Sensor-Cloud services. A novel quality model is introduced that considers dynamic features of sensors to select and compose Sensor-Cloud services. The quality model defines Coverage as a Service, which is formulated as a composition of crowdsourced Sensor-Cloud services. We present two new QoS-aware spatio-temporal composition algorithms to select the optimal composition plan. Experimental results validate the performance of the proposed algorithms.
    Print ISSN: 1041-4347
    Digital ISSN: 1558-2191
    Subject: Computer Science
  • 86
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication date: 2017-06-07
    Description: Web search engines are composed of thousands of query processing nodes, i.e., servers dedicated to processing user queries. So many servers consume a significant amount of energy, mostly attributable to their CPUs, but they are necessary to ensure low latencies, since users expect sub-second response times (e.g., 500 ms). However, users can hardly notice response times that are faster than their expectations. Hence, we propose the Predictive Energy Saving Online Scheduling Algorithm (PESOS) to select the most appropriate CPU frequency to process a query on a per-core basis. PESOS aims to process queries by their deadlines and leverages high-level scheduling information to reduce the CPU energy consumption of a query processing node. PESOS bases its decision on query efficiency predictors, which estimate the processing volume and processing time of a query. We experimentally evaluate PESOS on the TREC ClueWeb09B collection and the MSN2006 query log. Results show that PESOS can reduce the CPU energy consumption of a query processing node by up to ~48 percent compared to a system running at maximum CPU core frequency. PESOS also outperforms the best state-of-the-art competitor with a ~20 percent energy saving, while the competitor requires fine parameter tuning and may incur uncontrollable latency violations.
    Print ISSN: 1041-4347
    Digital ISSN: 1558-2191
    Subject: Computer Science
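The idea in the abstract above of choosing a per-core CPU frequency so that a query still meets its deadline can be illustrated with a toy sketch. This is not the PESOS algorithm itself: the function name, the parameters, and in particular the assumption that processing time scales inversely with clock frequency are simplifications introduced here for illustration.

```python
from typing import Sequence


def pick_frequency(
    frequencies_ghz: Sequence[float],
    predicted_time_at_max_ms: float,
    budget_ms: float,
) -> float:
    """Return the lowest frequency whose estimated processing time fits the budget.

    Assumption (not from the abstract): processing time is inversely
    proportional to frequency, so time(f) = time(f_max) * f_max / f.
    """
    f_max = max(frequencies_ghz)
    for f in sorted(frequencies_ghz):
        estimated_ms = predicted_time_at_max_ms * f_max / f
        if estimated_ms <= budget_ms:
            # Slowest setting that still meets the deadline.
            return f
    # No setting meets the deadline; run as fast as possible.
    return f_max


levels = [1.0, 2.0, 3.0]
relaxed = pick_frequency(levels, predicted_time_at_max_ms=100.0, budget_ms=500.0)
moderate = pick_frequency(levels, predicted_time_at_max_ms=100.0, budget_ms=160.0)
tight = pick_frequency(levels, predicted_time_at_max_ms=100.0, budget_ms=50.0)
```

With a generous budget the sketch selects the slowest level, with a moderate budget an intermediate one, and when no level can meet the deadline it falls back to the maximum frequency.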
  • 87
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication date: 2017-06-07
    Description: To harness the rich amount of information available on the web today, many organizations aggregate public (and private) data to derive knowledge repositories for real-world entities. This paper aims to build historical profiles of real-world entities by integrating temporal records collected from different sources. This problem is challenging not only because entities may change their attribute values over time, but also because information provided by the sources could be unreliable. In this paper, we present a new solution for profiling entities over time. To understand the evolution of entities, we describe a novel transition model which gives the probability that an entity will change to a particular attribute value after some time period. Next, a set of quality metrics is defined for the data sources to capture the exactness and timeliness of their provided values. The transition model and the quality metrics are then built into a source-aware temporal matching algorithm that can link temporal records to entities at the right time and augment entity profiles with correct values. Our suite of experiments demonstrates that the proposed approach outperforms the state-of-the-art techniques by constructing more complete and accurate profiles for entities.
    Print ISSN: 1041-4347
    Digital ISSN: 1558-2191
    Subject: Computer Science
  • 88
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication date: 2017-06-07
    Description: Query expansion has been widely adopted in Web search as a way of tackling the ambiguity of queries. Personalized search utilizing folksonomy data has demonstrated an extreme vocabulary mismatch problem that requires even more effective query expansion methods. Co-occurrence statistics, tag-tag relationships, and semantic matching approaches are among those favored by previous research. However, user profiles which only contain a user's past annotation information may not be enough to support the selection of expansion terms, especially for users with limited previous activity with the system. We propose a novel model to construct enriched user profiles with the help of an external corpus for personalized query expansion. Our model integrates the current state-of-the-art text representation learning framework, known as word embeddings, with topic models in two groups of pseudo-aligned documents. Based on user profiles, we build two novel query expansion techniques. These two techniques are based on topical weights-enhanced word embeddings, and the topical relevance between the query and the terms inside a user profile, respectively. The results of an in-depth experimental evaluation, performed on two real-world datasets using different external corpora, show that our approach outperforms traditional techniques, including existing non-personalized and personalized query expansion methods.
    Print ISSN: 1041-4347
    Digital ISSN: 1558-2191
    Subject: Computer Science
  • 89
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication date: 2017-06-07
    Description: YouTube, with millions of content creators, has become the preferred destination for viewing videos online. Through the Partner program, YouTube allows content creators to monetize their popular videos. Of significant importance for content creators is which meta-level features (title, tag, thumbnail, and description) are most sensitive for promoting video popularity. The popularity of videos also depends on the social dynamics, i.e., the interaction of the content creators (or channels) with YouTube users. Using real-world data consisting of about 6 million videos spread over 25 thousand channels, we empirically examine the sensitivity of YouTube meta-level features and social dynamics. We identify the key meta-level features that impact the view count of a video (first-day view count, number of subscribers, contrast of the video thumbnail, Google hits, number of keywords, video category, title length, and number of upper-case letters in the title) and illustrate that these meta-level features can be used to estimate the popularity of a video. In addition, optimizing the meta-level features after a video is posted increases the popularity of videos. In the context of social dynamics, we discover that there is a causal relationship between views to a channel and the associated number of subscribers. Additionally, insights into the effects of scheduling and video playthrough in a channel are also provided. Our findings provide a useful understanding of user engagement in YouTube.
    Print ISSN: 1041-4347
    Digital ISSN: 1558-2191
    Subject: Computer Science
  • 90
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication date: 2017-06-07
    Description: Text classification is the process of classifying documents into predefined categories through different classifiers learned from labeled or unlabeled training samples. Many researchers who work on binary text classification attempt to find a more effective way to separate relevant texts from a large data set. However, current text classifiers cannot unambiguously describe the decision boundary between positive and negative objects because of uncertainties caused by text feature selection and the knowledge learning process. This paper proposes a three-way decision model for dealing with the uncertain boundary to improve binary text classification performance, based on rough set techniques and a centroid solution. It aims to understand the uncertain boundary by partitioning the training samples into three regions (the positive, boundary, and negative regions) using two main boundary vectors C~P and C~N, created from the labeled positive and negative training subsets, respectively, and to further resolve the objects in the boundary region using two derived boundary vectors B~P and B~N, produced according to the structure of the boundary region. It involves an indirect strategy composed of two successive steps in the whole classification process: ‘two-way to three-way’ and ‘three-way to two-way’. Four decision rules are derived from the training process and applied to incoming documents for more precise classification. A large number of experiments have been conducted on the standard data sets RCV1 and Reuters-21578. The experimental results show that the use of boundary vectors is very effective and efficient for dealing with uncertainties of the decision boundary, and that the proposed model significantly improves the performance of binary text classification in terms of F1 measure and AUC compared with six other popular baseline models.
    Print ISSN: 1041-4347
    Digital ISSN: 1558-2191
    Subject: Computer Science
  • 91
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication date: 2017-06-07
    Description: Uncertain graph models are widely used in real-world applications such as knowledge graphs and social networks. To capture the uncertainty, each edge in an uncertain graph is associated with an existential probability that signifies the likelihood of the existence of the edge. One notable issue of querying uncertain graphs is that the results are sometimes uninformative because of the edge uncertainty. In this paper, we consider probabilistic reachability queries, which are one of the fundamental classes of graph queries. To make the results more informative, we adopt a crowdsourcing-based approach to clean the uncertain edges. However, considering the time and monetary cost of crowdsourcing, the challenge is to efficiently select a limited set of edges for cleaning that maximizes the quality improvement. We prove that the edge selection problem is #P-hard. In light of the hardness of the problem, we propose a series of edge selection algorithms, followed by a number of optimization techniques and pruning heuristics for reducing the computation time. Our experimental results demonstrate that our proposed techniques outperform random selection by up to 27 times in terms of the result quality improvement and the brute-force solution by up to 60 times in terms of the elapsed time.
    Print ISSN: 1041-4347
    Digital ISSN: 1558-2191
    Subject: Computer Science
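To make the notion of a probabilistic reachability query from the abstract above concrete, the following sketch estimates the probability that a target is reachable from a source by sampling possible worlds. It uses plain Monte Carlo sampling, a standard technique that is swapped in here for illustration and is not the edge selection method of the paper; the function name, the directed `(u, v, p)` edge encoding, and the independence of edges are assumptions.

```python
import random
from collections import deque
from typing import Hashable, Iterable, Tuple

Edge = Tuple[Hashable, Hashable, float]  # (source node, target node, existence probability)


def estimate_reachability(
    edges: Iterable[Edge],
    source: Hashable,
    target: Hashable,
    samples: int = 1000,
    seed: int = 0,
) -> float:
    """Estimate P(target reachable from source), assuming independent directed edges."""
    edge_list = list(edges)
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        # Sample one possible world: keep each edge with its existence probability.
        adjacency = {}
        for u, v, p in edge_list:
            if rng.random() < p:
                adjacency.setdefault(u, []).append(v)
        # Breadth-first search in the sampled world.
        seen = {source}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            if node == target:
                hits += 1
                break
            for nxt in adjacency.get(node, []):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
    return hits / samples


certain = estimate_reachability([("s", "a", 1.0), ("a", "t", 1.0)], "s", "t")
impossible = estimate_reachability([("s", "t", 0.0)], "s", "t")
coin_flip = estimate_reachability([("s", "t", 0.5)], "s", "t", samples=20000)
```

An estimate close to 0.5, as for the single coin-flip edge, is the kind of uninformative answer the abstract refers to; cleaning an edge corresponds to replacing its probability with 0 or 1 before querying again.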
  • 92
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication date: 2017-06-07
    Description: We propose a collaborative multi-domain sentiment classification approach to train sentiment classifiers for multiple domains simultaneously. In our approach, the sentiment information in different domains is shared to train more accurate and robust sentiment classifiers for each domain when labeled data is scarce. Specifically, we decompose the sentiment classifier of each domain into two components, a global one and a domain-specific one. The global model can capture the general sentiment knowledge and is shared by various domains. The domain-specific model can capture the specific sentiment expressions in each domain. In addition, we extract domain-specific sentiment knowledge from both labeled and unlabeled samples in each domain and use it to enhance the learning of domain-specific sentiment classifiers. We also incorporate the similarities between domains into our approach as regularization over the domain-specific sentiment classifiers to encourage the sharing of sentiment information between similar domains. Two kinds of domain similarity measures are explored, one based on textual content and the other based on sentiment expressions. Moreover, we introduce two efficient algorithms to solve the model of our approach. Experimental results on benchmark datasets show that our approach can effectively improve the performance of multi-domain sentiment classification and significantly outperform baseline methods.
    Print ISSN: 1041-4347
    Digital ISSN: 1558-2191
    Subject: Computer Science
  • 93
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication date: 2017-06-07
    Description: Transfer learning techniques have been broadly applied in applications where labeled data in a target domain are difficult to obtain while a lot of labeled data are available in related source domains. In practice, there can be multiple source domains that are related to the target domain, and how to combine them is still an open problem. In this paper, we seek to leverage labeled data from multiple source domains to enhance classification performance in a target domain where the target data are received in an online fashion. This problem is known as the online transfer learning problem. To achieve this, we propose novel online transfer learning paradigms in which the source and target domains are leveraged adaptively. We consider two different problem settings: homogeneous transfer learning and heterogeneous transfer learning. The proposed methods work in an online manner, where the weights of the source domains are adjusted dynamically. We provide the mistake bounds of the proposed methods and perform comprehensive experiments on real-world data sets to demonstrate the effectiveness of the proposed algorithms.
    Print ISSN: 1041-4347
    Digital ISSN: 1558-2191
    Subject: Computer Science
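The abstract above mentions that the weights of the source domains are adjusted dynamically as target data arrive. One generic way to do this is a multiplicative-weights (Hedge-style) update, sketched below. This is a standard technique used here purely for illustration and should not be read as the paper's method; the function names, the decay factor `beta`, and the binary labels in {-1, +1} are assumptions.

```python
from typing import List, Sequence


def weighted_vote(weights: Sequence[float], predictions: Sequence[int]) -> int:
    """Combine per-source predictions in {-1, +1} by a weighted majority vote."""
    score = sum(w * pred for w, pred in zip(weights, predictions))
    return 1 if score >= 0 else -1


def hedge_update(
    weights: Sequence[float],
    predictions: Sequence[int],
    true_label: int,
    beta: float = 0.5,
) -> List[float]:
    """Shrink the weight of every source that erred by a factor beta, then renormalize."""
    updated = [
        w * (beta if pred != true_label else 1.0)
        for w, pred in zip(weights, predictions)
    ]
    total = sum(updated)
    return [w / total for w in updated]


weights = [0.5, 0.5]
predictions = [1, -1]  # hypothetical outputs of two source-domain classifiers
weights = hedge_update(weights, predictions, true_label=1)
combined = weighted_vote(weights, predictions)
```

After one round in which only the first source is correct, its weight rises to two thirds and it dominates the vote on the next example.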
  • 94
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication date: 2017-06-07
    Description: We study the problem of preserving user privacy in the publication of location sequences. Consider a database of trajectories, corresponding to movements of people, captured by their transactions when they use credit cards, RFID debit cards, or NFC (http://en.wikipedia.org/wiki/Near_field_communication) compliant devices. We show that, if such trajectories are published exactly (by only hiding the identities of the persons that followed them), one can use partial trajectory knowledge as a quasi-identifier for the remaining locations in the sequence. We devise four intuitive techniques, based on combinations of location suppression and trajectory splitting, and we show that they can prevent privacy breaches while keeping published data accurate for aggregate query answering and frequent-subset data mining.
    Print ISSN: 1041-4347
    Digital ISSN: 1558-2191
    Subject: Computer Science
  • 95
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication date: 2017-06-07
    Description: Automatic discovery of newsworthy themes from sequenced data can relieve journalists of manually poring over a large amount of data in order to find interesting news. In this paper, we propose a novel $k$-Sketch query that aims to find the $k$ striking streaks that best summarize a subject. Our scoring function takes into account streak strikingness and streak coverage at the same time. We study $k$-Sketch query processing in both offline and online scenarios, and propose various streak-level pruning techniques to find striking candidates. Among those candidates, we then develop approximate methods to discover the $k$ most representative streaks with theoretical bounds. We conduct experiments on four real datasets, and the results demonstrate the efficiency and effectiveness of our proposed algorithms: the running time achieves up to a 500-fold speedup, and the quality of the generated summaries is endorsed by anonymous users from Amazon Mechanical Turk.
    Print ISSN: 1041-4347
    Digital ISSN: 1558-2191
    Subject: Computer Science
  • 96
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication date: 2017-06-07
    Description: Getting back to previously viewed web pages is a common yet difficult task for users due to the large volume of personally accessed information on the web. This paper leverages humans' natural recall process of using episodic and semantic memory cues to facilitate recall, and presents a personal web revisitation technique called WebPagePrev that works through context and content keywords. Underlying techniques for context and content memories’ acquisition, storage, decay, and utilization for page re-finding are discussed. A relevance feedback mechanism is also included to adapt to an individual's memory strength and revisitation habits. Our 6-month user study shows that: (1) Compared with the existing web revisitation tool Memento, the History List Searching method, and the Search Engine method, the proposed WebPagePrev delivers the best re-finding quality in finding rate (92.10 percent), average F1-measure (0.4318), and average rank error (0.3145). (2) Our dynamic management of context and content memories, including the decay and reinforcement strategy, can mimic users’ retrieval and recall mechanism. With relevance feedback, the finding rate of WebPagePrev increases by 9.82 percent, the average F1-measure increases by 47.09 percent, and the average rank error decreases by 19.44 percent compared to a stable memory management strategy. Among the time, location, and activity context factors in WebPagePrev, activity is the best recall cue, and context+content based re-finding delivers the best performance, compared to context based re-finding and content based re-finding.
    Print ISSN: 1041-4347
    Digital ISSN: 1558-2191
    Subject: Computer Science
  • 97
    Material type: Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication date: 2017-06-07
    Description: This paper presents a spectral analysis of signed networks from both theoretical and practical aspects. On the theoretical side, we use results from matrix perturbation to analyze the community structures of complex signed networks and show how negative edges affect the distributions and patterns of node spectral coordinates in the spectral space. We prove and demonstrate that node spectral coordinates form orthogonal clusters for two types of signed networks: graphs with dense inter-community mixed sign edges, and $k$-dispute graphs, where inner-community connections are absent or very sparse but inter-community connections are dense with negative edges. The cluster orthogonality pattern is different from the line orthogonality pattern (i.e., node spectral coordinates form orthogonal lines) observed in networks with $k$-block structure. We show why the line orthogonality pattern does not hold in the spectral space for these two types of networks. On the practical side, we have developed a clustering method to study signed networks and $k$-dispute networks. Empirical evaluations on both synthetic networks (with up to one million nodes) and real networks show that our algorithm outperforms existing clustering methods on signed networks in terms of accuracy and efficiency.
    Print ISSN: 1041-4347
    Digital ISSN: 1558-2191
    Subject: Computer Science
  • 98
    Material type: Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication date: 2017-07-08
    Description: Conventional semi-supervised clustering approaches have several shortcomings, such as (1) not fully utilizing all useful must-link and cannot-link constraints, (2) not considering how to deal with high dimensional data with noise, and (3) not fully addressing the need to use an adaptive process to further improve the performance of the algorithm. In this paper, we first propose the transitive closure based constraint propagation approach, which makes use of the transitive closure operator and affinity propagation to address the first limitation. Then, the random subspace based semi-supervised clustering ensemble framework, with a set of proposed confidence factors, is designed to address the second limitation and provide more stable, robust, and accurate results. Next, the adaptive semi-supervised clustering ensemble framework is proposed to address the third limitation; it adopts a newly designed adaptive process to search for the optimal subspace set. Finally, we adopt a set of nonparametric tests to compare different semi-supervised clustering ensemble approaches over multiple datasets. The experimental results on 20 real high dimensional cancer datasets with noisy genes and 10 datasets from the UCI and KEEL datasets show that (1) the proposed approaches work well on most of the real-world datasets, and (2) they outperform other state-of-the-art approaches on 12 out of 20 cancer datasets and 8 out of 10 UCI machine learning datasets.
    Print ISSN: 1041-4347
    Digital ISSN: 1558-2191
    Subject: Computer Science
  • 99
    Material type: Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication date: 2017-07-08
    Description: An ongoing challenge in the rapidly evolving app market ecosystem is to maintain the integrity of app categories. At the time of registration, app developers have to select what they believe is the most appropriate category for their apps. Besides the inherent ambiguity of selecting the right category, the approach leaves open the possibility of misuse and potential gaming by the registrant. Periodically, the app store will refine the list of categories available and potentially reassign the apps. However, it has been observed that the mismatch between the description of an app and the category it belongs to continues to persist. Although some common mechanisms (e.g., complaint-driven or manual checking) exist, they are slow to detect miscategorized apps and leave the categorization challenge open. We introduce FRAC+: (FR)amework for (A)pp (C)ategorization. FRAC+ has the following salient features: (i) it is based on a data-driven topic model and automatically suggests the categories appropriate for the app store, and (ii) it can detect miscategorized apps. Extensive experiments attest to the performance of FRAC+. Experiments on Google Play show that FRAC+’s topics are more aligned with Google’s new categories and that 0.35-1.10 percent of game apps are detected to be miscategorized.
    Print ISSN: 1041-4347
    Digital ISSN: 1558-2191
    Subject: Computer Science
  • 100
    Material type: Unknown
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication date: 2017-07-08
    Description: This paper investigates an important problem in stream mining, i.e., classification under streaming emerging new classes, or SENC. The SENC problem can be decomposed into three subproblems: detecting emerging new classes, classifying known classes, and updating models to integrate each new class as part of the known classes. The common approach is to treat it as a classification problem and solve it using either a supervised learner or a semi-supervised learner. We propose an alternative approach that uses unsupervised learning as the basis for solving this problem. The proposed method employs completely-random trees, which have been shown in the literature to work well in unsupervised learning and supervised learning independently. The completely-random trees are used as a single common core to solve all three subproblems: unsupervised learning, supervised learning, and model update on data streams. We show that the proposed unsupervised-learning-focused method often achieves significantly better outcomes than existing classification-focused methods.
    Print ISSN: 1041-4347
    Digital ISSN: 1558-2191
    Subject: Computer Science