ALBERT

All Library Books, journals and Electronic Records Telegrafenberg

  • 101
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2015-06-13
    Description: Vetting a mobile app vendor's processes via a questionnaire is a poor substitute for vetting the app itself, but situations often arise when it's the only or most practical option.
    Print ISSN: 0018-9162
    Electronic ISSN: 1558-0814
    Topics: Computer Science
  • 102
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2015-06-13
    Description: Realizing the full potential of virtualized computation (the cloud) requires rethinking software development. Deployment decisions, and their validation, can and should be moved up the development chain into the design phase.
    Print ISSN: 0018-9162
    Electronic ISSN: 1558-0814
    Topics: Computer Science
  • 103
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2015-06-13
    Description: News of interest to Computer Society members.
    Print ISSN: 0018-9162
    Electronic ISSN: 1558-0814
    Topics: Computer Science
  • 104
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2015-06-13
    Description: Information about upcoming events of interest to Computer Society members.
    Print ISSN: 0018-9162
    Electronic ISSN: 1558-0814
    Topics: Computer Science
  • 105
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2015-06-13
    Description: Classified advertisement for job postings.
    Print ISSN: 0018-9162
    Electronic ISSN: 1558-0814
    Topics: Computer Science
  • 106
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2015-06-13
    Description: Advertisement, IEEE.
    Print ISSN: 0018-9162
    Electronic ISSN: 1558-0814
    Topics: Computer Science
  • 107
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2015-06-13
    Description: Prospective authors are requested to submit new, unpublished manuscripts for inclusion in the upcoming event described in this call for papers.
    Print ISSN: 0018-9162
    Electronic ISSN: 1558-0814
    Topics: Computer Science
  • 108
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2015-06-13
    Description: Prospective authors are requested to submit new, unpublished manuscripts for inclusion in the upcoming event described in this call for papers.
    Print ISSN: 0018-9162
    Electronic ISSN: 1558-0814
    Topics: Computer Science
  • 109
    Publication Date: 2016-07-22
    Description: Clustering is a fundamental task in data mining. Affinity propagation clustering (APC) is an effective and efficient clustering technique that has been applied in various domains. APC iteratively propagates information between affinity samples, updates the responsibility matrix and availability matrix, and employs these matrices to choose cluster centers (or exemplars) of respective clusters. However, since it mainly uses negative Euclidean distance between exemplars and samples as the similarity between them, it is difficult to identify clusters with complex structure. Therefore, the performance of APC deteriorates on samples distributed with complex structure. To mitigate this problem, we propose an improved APC based on a path-based similarity (APC-PS). APC-PS firstly utilizes negative Euclidean distance to find exemplars of clusters. Then, it employs the path-based similarity to measure the similarity between exemplars and samples, and to explore the underlying structure of clusters. Next, it assigns non-exemplar samples to their respective clusters via that similarity. Our empirical study on synthetic and UCI datasets shows that the proposed APC-PS significantly outperforms original APC and other related approaches.
    Electronic ISSN: 1999-4893
    Topics: Computer Science
    Published by MDPI Publishing
  • 110
    Publication Date: 2016-07-23
    Description: Graph-based semi-supervised classification uses a graph to capture the relationship between samples and exploits label propagation techniques on the graph to predict the labels of unlabeled samples. However, it is difficult to construct a graph that faithfully describes the relationship between high-dimensional samples. Recently, low-rank representation has been introduced to construct a graph, which can preserve the global structure of high-dimensional samples and help to train accurate transductive classifiers. In this paper, we take advantage of low-rank representation for graph construction and propose an inductive semi-supervised classifier called Semi-Supervised Classification based on Low-Rank Representation (SSC-LRR). SSC-LRR first utilizes a linearized alternating direction method with adaptive penalty to compute the coefficient matrix of low-rank representation of samples. Then, the coefficient matrix is adopted to define a graph. Finally, SSC-LRR incorporates this graph into a graph-based semi-supervised linear classifier to classify unlabeled samples. Experiments are conducted on four widely used facial datasets to validate the effectiveness of the proposed SSC-LRR and the results demonstrate that SSC-LRR achieves higher accuracy than other related methods.
    Electronic ISSN: 1999-4893
    Topics: Computer Science
    Published by MDPI Publishing
  • 111
    Publication Date: 2016-07-23
    Description: This research proposes a two-stage user-based collaborative filtering process using an artificial immune system for the prediction of student grades, along with a filter for professor ratings in the course recommendation for college students. We test for cosine similarity and Karl Pearson (KP) correlation in affinity calculations for clustering and prediction. This research uses student information and professor information datasets of Yuan Ze University from the years 2005–2009 for the purpose of testing and training. The mean average error and confusion matrix analysis form the testing parameters. A minimum professor rating was tested to check the results, and we observed that the recommendation systems herein provide highly accurate results for students with higher mean grades.
    Electronic ISSN: 1999-4893
    Topics: Computer Science
    Published by MDPI Publishing
  • 112
    Publication Date: 2016-07-31
    Description: This paper is concerned with the application of computational intelligence techniques to the conceptual design and development of a large-scale floating settlement. The settlement in question is a design for the area of Urla, which is a rural touristic region located on the west coast of Turkey, near the metropolis of Izmir. The problem at hand includes both engineering and architectural aspects that need to be addressed in a comprehensive manner. We thus treat it as a multi-objective constrained real-parameter optimization problem. Specifically, we consider three objectives, which are conflicting. The first one aims at maximizing accessibility of urban functions such as housing and public spaces, as well as special functions, such as a marina for yachts and a yacht club. The second one aims at ensuring the wind protection of the general areas of the settlement, by adequately placing them in between neighboring land masses. The third one aims at maximizing visibility of the settlement from external observation points, so as to maximize the exposure of the settlement. To address this complex multi-objective optimization problem and identify lucrative alternative design solutions, a multi-objective harmony search algorithm (MOHS) is developed and applied in this paper. When compared to the Differential Evolution algorithm developed for the problem in the literature, we demonstrate that MOHS achieves competitive or slightly better performance in terms of hypervolume calculation, and gives promising results when the Pareto front approximation is examined.
    Electronic ISSN: 1999-4893
    Topics: Computer Science
    Published by MDPI Publishing
  • 113
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-08-05
    Description: Since Jeff Howe introduced the term Crowdsourcing in 2006, this human-powered problem-solving paradigm has gained a lot of attention and has been a hot research topic in the field of computer science. Even though a lot of work has been conducted on this topic, so far we do not have a comprehensive survey of the most relevant work done in the crowdsourcing field. In this paper, we aim to offer an overall picture of the current state-of-the-art techniques in general-purpose crowdsourcing. According to their focus, we divide this work into three parts, which are: incentive design, task assignment, and quality control. For each part, we start with different problems faced in that area followed by a brief description of existing work and a discussion of pros and cons. In addition, we also present a real scenario on how the different techniques are used in implementing a location-based crowdsourcing platform, gMission. Finally, we highlight the limitations of the current general-purpose crowdsourcing techniques and present some open problems in this area.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 114
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-08-05
    Description: Many important data management and analytics tasks cannot be completely addressed by automated processes. These tasks, such as entity resolution, sentiment analysis, and image recognition, can be enhanced through the use of human cognitive ability. Crowdsourcing platforms are an effective way to harness the capabilities of people (i.e., the crowd) to apply human computation for such tasks. Thus, crowdsourced data management has become an area of increasing interest in research and industry. We identify three important problems in crowdsourced data management. (1) Quality Control: Workers may return noisy or incorrect results, so effective techniques are required to achieve high quality; (2) Cost Control: The crowd is not free, and cost control aims to reduce the monetary cost; (3) Latency Control: The human workers can be slow, particularly compared to automated computing time scales, so latency-control techniques are required. There has been significant work addressing these three factors for designing crowdsourced tasks, developing crowdsourced data manipulation operators, and optimizing plans consisting of multiple operators. In this paper, we survey and synthesize a wide spectrum of existing studies on crowdsourced data management. Based on this analysis we then outline key factors that need to be considered to improve crowdsourced data management.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 115
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-08-05
    Description: Principal component analysis and the residual error is an effective anomaly detection technique. In an environment where anomalies are present in the training set, the derived principal components can be skewed by the anomalies. A further aspect of anomaly detection is that data might be distributed across different nodes in a network and their communication to a centralized processing unit is prohibited due to communication cost. Current solutions to distributed anomaly detection rely on a hierarchical network infrastructure to aggregate data or models; however, in this environment, links close to the root of the tree become critical and congested. In this paper, an algorithm is proposed that is more robust in its derivation of the principal components of a training set containing anomalies. A distributed form of the algorithm is then derived where each node in a network can iterate towards the centralized solution by exchanging small matrices with neighboring nodes. Experimental evaluations on both synthetic and real-world data sets demonstrate the superior performance of the proposed approach in comparison to principal component analysis and alternative anomaly detection techniques. In addition, it is shown that in a variety of network infrastructures, the distributed form of the anomaly detection model is able to derive a close approximation of the centralized model.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 116
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-08-05
    Description: Many applications deal with moving object datasets, e.g., mobile phone social networking, scientific simulations, and ride-sharing services. These applications need to handle a tremendous number of spatial objects that continuously move and execute spatial queries to explore their surroundings. To manage such update-heavy workloads, several throwaway index structures have recently been proposed, where a static index is rebuilt periodically from scratch rather than updated incrementally. It has been shown that throwaway indices outperform specialized moving-object indices that maintain location updates incrementally. However, throwaway indices suffer from limited scalability due to their single-server design, and the only distributed throwaway index (D-MOVIES), an extension of a centralized approach, does not scale out as the number of servers increases, especially during the query processing phase.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 117
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-08-05
    Description: Collaborative filtering (CF) is beyond question the most widely adopted and successful recommendation approach. A typical CF-based recommender system associates a user with a group of like-minded users based on their individual preferences over all the items, either explicit or implicit, and then recommends to the user some unobserved items enjoyed by the group. However, we find that two users with similar tastes on one item subset may have totally different tastes on another set. In other words, there exist many user-item subgroups, each consisting of a subset of items and a group of like-minded users on these items. It is more reasonable to predict preferences through one user's correlated subgroups than through the entire user-item matrix. In this paper, to find meaningful subgroups, we formulate a new Multiclass Co-Clustering (MCoC) model, which captures relations of user-to-item, user-to-user, and item-to-item simultaneously. Then, we combine traditional CF algorithms with subgroups for improving their top-$N$ recommendation performance. Our approach can be seen as a new extension of traditional clustering CF models. Systematic experiments on several real data sets have demonstrated the effectiveness of our proposed approach.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 118
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-08-05
    Description: With the rapid development of Web 2.0 and the Online To Offline (O2O) marketing model, various online event-based social networks (EBSNs) are getting popular. An important task of EBSNs is to facilitate the most satisfactory event-participant arrangement for both sides, i.e., events enroll more participants and participants are arranged with personally interesting events. Existing approaches usually focus on the arrangement of each single event to a set of potential users, or ignore the conflicts between different events, which leads to infeasible or redundant arrangements. In this paper, to address the shortcomings of existing approaches, we first identify a more general and useful event-participant arrangement problem, called the Global Event-participant Arrangement with Conflict and Capacity (GEACC) problem, focusing on the conflicts of different events and making event-participant arrangements in a global view. We find that the GEACC problem is NP-hard due to the conflicts among events. Thus, we design two approximation algorithms with provable approximation ratios and an exact algorithm with a pruning technique to address this problem. In addition, we propose an online setting of GEACC, called OnlineGEACC, which is also practical in real-world scenarios. We further design an online algorithm with a provable performance guarantee. Finally, we verify the effectiveness and efficiency of the proposed methods through extensive experiments on real and synthetic datasets.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 119
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-08-05
    Description: Given a point $p$ and a set of points $S$, the kNN operation finds the $k$ closest points to $p$ in $S$. It is a computationally intensive task with a large range of applications such as knowledge discovery or data mining. However, as the volume and the dimension of data increase, only distributed approaches can perform such a costly operation in a reasonable time. Recent works have focused on implementing efficient solutions using the MapReduce programming model because it is suitable for distributed large-scale data processing. Although these works provide different solutions to the same problem, each one has particular constraints and properties. In this paper, we compare the different existing approaches for computing kNN on MapReduce, first theoretically, and then by performing an extensive experimental evaluation. To be able to compare solutions, we identify three generic steps for kNN computation on MapReduce: data pre-processing, data partitioning, and computation. We then analyze each step from load balancing, accuracy, and complexity aspects. Experiments in this paper use a variety of datasets, and analyze the impact of data volume, data dimension, and the value of $k$ from many perspectives like time and space complexity, and accuracy. The experimental part brings new advantages and shortcomings that are discussed for each algorithm. To the best of our knowledge, this is the first paper that compares kNN computing methods on MapReduce both theoretically and experimentally with the same setting. Overall, this paper can be used as a guide to tackle kNN-based practical problems in the context of big data.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 120
    Publication Date: 2016-08-05
    Description: Mining communities or clusters in networks is valuable in analyzing, designing, and optimizing many natural and engineering complex systems, e.g., protein networks, power grids, and transportation systems. Most of the existing techniques view the community mining problem as an optimization problem based on a given quality function (e.g., modularity); however, none of them is grounded in a systematic theory to identify the central nodes in the network. Moreover, how to reconcile the mining efficiency and the community quality still remains an open problem. In this paper, we attempt to address the above challenges by introducing a novel algorithm. First, a kernel function with a tunable influence factor is proposed to measure the leadership of each node; those nodes with the highest local leadership can be viewed as the candidate central nodes. Then, we use a discrete-time dynamical system to describe the dynamical assignment of community membership, and formulate several conditions to guarantee the convergence of each node's dynamic trajectory, by which the hierarchical community structure of the network can be revealed. The proposed dynamical system is independent of the quality function used, so it could also be applied in other community mining models. Our algorithm is highly efficient: the computational complexity analysis shows that the execution time is nearly linearly dependent on the number of nodes in sparse networks. We finally give demonstrative applications of the algorithm to a set of synthetic benchmark networks and also real-world networks to verify the algorithmic performance.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 121
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-08-05
    Description: When a microblogging user adopts some content propagated to her, we can attribute that to three behavioral factors, namely, topic virality, user virality, and user susceptibility. Topic virality measures the degree to which a topic attracts propagations by users. User virality and susceptibility refer to the ability of a user to propagate content to other users, and the propensity of a user adopting content propagated to her, respectively. In this paper, we study the problem of mining these behavioral factors specific to topics from microblogging content propagation data. We first construct a three-dimensional tensor for representing the propagation instances. We then propose a tensor factorization framework to simultaneously derive the three sets of behavioral factors. Based on this framework, we develop a numerical factorization model and another probabilistic factorization variant. We also develop an efficient algorithm for learning the models' parameters. Our experiments on a large Twitter dataset and synthetic datasets show that the proposed models can effectively mine the topic-specific behavioral factors of users and tweet topics. We further demonstrate that the proposed models consistently outperform the other state-of-the-art content-based models in retweet prediction over time.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 122
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-08-05
    Description: We propose an algorithm for detecting patterns exhibited by anomalous clusters in high dimensional discrete data. Unlike most anomaly detection (AD) methods, which detect individual anomalies, our proposed method detects groups (clusters) of anomalies; i.e., sets of points which collectively exhibit abnormal patterns. In many applications, this can lead to a better understanding of the nature of the atypical behavior and to identifying the sources of the anomalies. Moreover, we consider the case where the atypical patterns are exhibited on only a small (salient) subset of the very high dimensional feature space. Individual AD techniques and techniques that detect anomalies using all the features typically fail to detect such anomalies, but our method can detect such instances collectively, discover the shared anomalous patterns exhibited by them, and identify the subsets of salient features. In this paper, we focus on detecting anomalous topics in a batch of text documents, developing our algorithm based on topic models. Results of our experiments show that our method can accurately detect anomalous topics and salient features (words) under each such topic in a synthetic data set and two real-world text corpora and achieves better performance compared to both standard group AD and individual AD techniques. All required code to reproduce our experiments is available from https://github.com/hsoleimani/ATD.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 123
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-08-05
    Description: Multilabel classification is prevalent in many real-world applications where data instances may be associated with multiple labels simultaneously. In multilabel classification, exploiting label correlations is an essential but nontrivial task. Most of the existing multilabel learning algorithms are either ineffective or computationally demanding and less scalable in exploiting label correlations. In this paper, we propose a co-evolutionary multilabel hypernetwork (Co-MLHN) as an attempt to exploit label correlations in an effective and efficient way. To this end, we firstly convert the traditional hypernetwork into a multilabel hypernetwork (MLHN) where label correlations are explicitly represented. We then propose a co-evolutionary learning algorithm to learn an integrated classification model for all labels. The proposed Co-MLHN exploits arbitrary order label correlations and has linear computational complexity with respect to the number of labels. Empirical studies on a broad range of multilabel data sets demonstrate that Co-MLHN achieves competitive results against state-of-the-art multilabel learning algorithms, in terms of both classification performance and scalability with respect to the number of labels.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 124
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-08-05
    Description: Query auto completion (QAC) methods recommend queries to search engine users when they start entering a query. Current QAC methods mostly rank query completions based on their past popularity, i.e., on the number of times they have previously been submitted as a query. However, query popularity changes over time and may vary drastically across users. Accordingly, the ranking of query completions should be adjusted. Previous time-sensitive and user-specific QAC methods have been developed separately, yielding significant improvements over methods that are neither time-sensitive nor personalized. We propose a hybrid QAC method that is both time-sensitive and personalized. We extend it to handle long-tail prefixes, which we achieve by assigning optimal weights to the contribution from time-sensitivity and personalization. Using real-world search log datasets, we return top $N$ query suggestions ranked by predicted popularity as estimated from popularity trends and cyclic popularity behavior; we rerank them by integrating similarities to a user's previous queries (both in the current session and in previous sessions). Our method outperforms state-of-the-art time-sensitive QAC baselines, achieving total improvements of between 3 and 7 percent in terms of mean reciprocal rank (MRR). After optimizing the weights, our extended model achieves MRR improvements of between 4 and 8 percent.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 125
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-08-05
    Description: Wrappers are pieces of software used to extract data from websites and structure them for further application processing. Unfortunately, websites are continuously evolving and structural changes happen with no forewarning, which usually results in wrappers working incorrectly. Thus, wrapper maintenance is necessary to detect whether a wrapper is extracting erroneous data. The solution consists of using verification models to detect whether the wrapper output is statistically similar to the output produced by the wrapper itself when it was successfully invoked in the past. Current proposals present some weaknesses, as the data used to build these models are supposed to be homogeneous, independent, or representative enough, or to follow a single predefined mathematical model. In this paper, we present MAVE, a novel multilevel wrapper verification system that is based on one-class classification techniques to overcome previous weaknesses. The experimental results show that our proposal outperforms current solutions in accuracy.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 126
    Publication Date: 2016-07-19
    Description: During a construction project life cycle, project costs and time estimations contribute greatly to baseline scheduling. Besides, schedule risk analysis and project control are also influenced by the above factors. Although many papers have offered estimation techniques, little attempt has been made to generate project time series data as daily progressive estimations in different project environments that could help researchers in generating general and customized formulae in further studies. This paper, however, is an attempt to introduce a new simulation approach to reflect the data regarding time series progress of the project, considering the specifications and the complexity of the project and the environment where the project is performed. Moreover, this simulator can equip project managers with estimated information, which reassures them of the execution stages of the project although they lack historical data. A case study is presented to show the usefulness of the model and its applicability in practice. In this study, singular spectrum analysis has been employed to analyze the simulated outputs, and the results are separated based on their signal and noise trends. The signal trend is used as a point-of-reference to compare the outputs of a simulation employing S-curve technique results and the formulae corresponding to earned value management, as well as the life of a given project.
    Electronic ISSN: 1999-4893
    Topics: Computer Science
    Published by MDPI Publishing
  • 127
    Publication Date: 2016-07-27
    Description: This paper discusses the parameter estimation problems of multi-input output-error autoregressive (OEAR) systems. By combining the auxiliary model identification idea and the data filtering technique, a data filtering based recursive generalized least squares (F-RGLS) identification algorithm and a data filtering based iterative least squares (F-LSI) identification algorithm are derived. Compared with the F-RGLS algorithm, the proposed F-LSI algorithm is more effective and can generate more accurate parameter estimates. The simulation results confirm this conclusion.
    Electronic ISSN: 1999-4893
    Topics: Computer Science
    Published by MDPI Publishing
  • 128
    Publication Date: 2016-08-05
    Description: The force-directed paradigm is one of the few generic approaches to drawing graphs. Since force-directed algorithms can be extended easily, they are used frequently. Most of these algorithms are, however, quite slow on large graphs, as they compute a quadratic number of forces in each iteration. We give a new algorithm that takes only $O(m + n \log n)$ time per iteration when laying out a graph with $n$ vertices and $m$ edges. Our algorithm approximates the true forces using the so-called well-separated pair decomposition. We perform experiments on a large number of graphs and show that we can strongly reduce the runtime, even on graphs with fewer than a hundred vertices, without a significant influence on the quality of the drawings (in terms of the number of crossings and deviation in edge lengths).
    Electronic ISSN: 1999-4893
    Topics: Computer Science
    Published by MDPI Publishing
  • 129
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-08-05
    Description: Online portfolio selection has attracted increasing attention from data mining and machine learning communities in recent years. An important theory in financial markets is mean reversion, which plays a critical role in some state-of-the-art portfolio selection strategies. Although existing mean reversion strategies have been shown to achieve good empirical performance on certain datasets, they seldom carefully deal with noise and outliers in the data, leading to suboptimal portfolios, and consequently yielding poor performance in practice. In this paper, we propose to exploit the reversion phenomenon by using robust $L_1$-median estimators, and design a novel online portfolio selection strategy named “Robust Median Reversion” (RMR), which constructs optimal portfolios based on the improved reversion estimator. We examine the performance of the proposed algorithms on various real markets with extensive experiments. Empirical results show that RMR can overcome the drawbacks of existing mean reversion algorithms and achieve significantly better results. Finally, RMR runs in linear time, and thus is suitable for large-scale real-time algorithmic trading applications.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
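The $L_1$-median named in the abstract above is the point that minimizes the sum of Euclidean distances to a set of samples, which makes it less sensitive to outliers than the mean. A minimal sketch of one standard way to approximate it, plain Weiszfeld iterations, is given below. It is an illustration only and is not claimed to be the estimator variant used in RMR; the function name `l1_median` and its parameters are hypothetical.

```python
import math


def l1_median(points, iters=200, tol=1e-9):
    """Approximate the L1-median (geometric median) of d-dimensional points
    using plain Weiszfeld iterations. Hypothetical helper, illustration only."""
    dim = len(points[0])
    # Start from the coordinate-wise mean.
    mu = [sum(p[i] for p in points) / len(points) for i in range(dim)]
    for _ in range(iters):
        num = [0.0] * dim
        den = 0.0
        for p in points:
            dist = math.dist(p, mu)
            if dist < tol:
                # Skip samples that coincide with the estimate (avoids division by zero).
                continue
            w = 1.0 / dist
            den += w
            for i in range(dim):
                num[i] += w * p[i]
        if den == 0.0:
            break  # every sample coincides with the current estimate
        new_mu = [x / den for x in num]
        converged = math.dist(new_mu, mu) < tol
        mu = new_mu
        if converged:
            break
    return mu
```

For the one-dimensional samples 1, 2, 3, 4, 100 the mean is 22, whereas the estimate returned by this sketch stays close to 3, which is the robustness property the abstract relies on.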
  • 130
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-08-05
    Description: The development of a topic in a set of topic documents is constituted by a series of person interactions at a specific time and place. Knowing the interactions of the persons mentioned in these documents is helpful for readers to better comprehend the documents. In this paper, we propose a topic person interaction detection method called SPIRIT, which classifies the text segments in a set of topic documents that convey person interactions. We design the rich interactive tree structure to represent syntactic, context, and semantic information of text, and this structure is incorporated into a tree-based convolution kernel to identify interactive segments. Experiment results based on real world topics demonstrate that the proposed rich interactive tree structure effectively detects the topic person interactions and that our method outperforms many well-known relation extraction and protein-protein interaction methods.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 131
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-08-05
    Description: Social recommendation systems have attracted a lot of attention recently in the research communities of information retrieval, machine learning, and data mining. Traditional social recommendation algorithms are often based on batch machine learning methods, which suffer from several critical limitations, e.g., extremely expensive model retraining whenever new user ratings arrive and an inability to capture changes in user preferences over time. Therefore, it is important to make social recommendation systems suitable for real-world online applications where data often arrive sequentially and user preferences may change dynamically and rapidly. In this paper, we present a new framework of online social recommendation from the viewpoint of online graph regularized user preference learning (OGRPL), which incorporates both collaborative user-item relationships and item content features into a unified preference learning process. We further develop an efficient iterative procedure, OGRPL-FW, which utilizes the Frank-Wolfe algorithm, to solve the proposed online optimization problem. We conduct extensive experiments on several large-scale datasets, in which the encouraging results demonstrate that the proposed algorithms obtain significantly lower errors (in terms of both RMSE and MAE) than the state-of-the-art online recommendation methods when receiving the same amount of training data in the online learning process.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 132
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-08-05
    Description: In this paper, we revisit the private over-threshold data aggregation problem. We formally define the problem's security requirements as both data and user privacy goals. To achieve both goals, and to strike a balance between efficiency and functionality, we devise an efficient cryptographic construction and its proxy-based variant. Both schemes are provably secure in the semi-honest model. Our key idea for the constructions and their malicious variants is to compose two encryption functions tightly coupled in a way that the two functions are commutative and one public-key encryption has an additive homomorphism. We call that double encryption. We analyze the computational and communication complexities of our construction, and show that it is much more efficient than the existing protocols in the literature. Specifically, our protocol has linear complexity in computation and communication with respect to the number of users. Its round complexity is also linear in the number of users. Finally, we show that our basic protocol is efficiently transformed into a stronger protocol secure in the presence of malicious adversaries, and provide the resulting protocol's performance and security analysis.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 133
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-08-05
    Description: General health examination is an integral part of healthcare in many countries. Identifying the participants at risk is important for early warning and preventive intervention. The fundamental challenge of learning a classification model for risk prediction lies in the unlabeled data that constitutes the majority of the collected dataset. Particularly, the unlabeled data describes the participants in health examinations whose health conditions can vary greatly from healthy to very-ill. There is no ground truth for differentiating their states of health. In this paper, we propose a graph-based, semi-supervised learning algorithm called SHG-Health (Semi-supervised Heterogeneous Graph on Health) for risk predictions to classify a progressively developing situation with the majority of the data unlabeled. An efficient iterative algorithm is designed and the proof of convergence is given. Extensive experiments based on both real health examination datasets and synthetic datasets are performed to show the effectiveness and efficiency of our method.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 134
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-08-05
    Description: Automated feature selection is important for text categorization to reduce the feature size and to speed up the learning process of classifiers. In this paper, we present a novel and efficient feature selection framework based on information theory, which aims to rank the features by their discriminative capacity for classification. We first revisit two information measures, Kullback-Leibler divergence and Jeffreys divergence for binary hypothesis testing, and analyze their asymptotic properties relating to type I and type II errors of a Bayesian classifier. We then introduce a new divergence measure, called Jeffreys-Multi-Hypothesis (JMH) divergence, to measure multi-distribution divergence for multi-class classification. Based on the JMH-divergence, we develop two efficient feature selection methods, termed maximum discrimination ( $MD$ ) and methods, for text categorization. The promising results of extensive experiments demonstrate the effectiveness of the proposed approaches.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 135
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-07-08
    Description: The proliferation of location-based social networks, such as Foursquare and Facebook Places, offers a variety of ways to record human mobility, including user-generated geo-tagged content, check-in services, and mobile apps. Although trajectory data is of great value to many applications, it is challenging to analyze and mine because of the complex characteristics of human mobility, which is affected by multiple kinds of contextual information. In this paper, we propose a Multi-Context Trajectory Embedding Model, called MC-TEM, to explore contexts in a systematic way. MC-TEM is developed in the distributed representation learning framework, and it is flexible enough to characterize various kinds of useful contexts for different applications. To the best of our knowledge, this is the first time that distributed representation learning methods have been applied to trajectory data. We formally incorporate multiple kinds of context information of trajectory data into the proposed model, including user-level, trajectory-level, location-level, and temporal contexts. All the context information is represented in the same embedding space. We apply MC-TEM to two challenging tasks, namely location recommendation and social link prediction. We conduct extensive experiments on three real-world datasets, and the results demonstrate the superiority of our MC-TEM model over several state-of-the-art methods.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 136
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-07-08
    Description: The study of urban networks reveals that the accessibility of important city objects for vehicle traffic and pedestrians is significantly correlated with the popularity, micro-criminality, micro-economic vitality, and social liveability of the city, and is always the chief factor in regulating the growth and expansion of the city. The accessibility between different components of an urban structure is frequently measured along the streets and routes considered as edges of a planar graph, while the ultimate traffic destination points and street junctions are treated as vertices. To estimate the accessibility of destination vertex $j$ from vertex $i$ through urban networks, random walks are used to calculate the expected distance a random walker starting from $i$ travels before $j$ is visited (known as access time). State-of-the-art access time computation is costly in large planar graphs since it involves matrix operations over the entire graph. The time complexity is $O(n^{2.376})$, where $n$ is the number of vertices in the planar graph. To enable efficient access time query answering in large planar graphs, this work proposes the first access time oracle, which is based on the proposed access time decomposition and reconstruction scheme. The oracle is a hierarchical data structure with a deliberate design of the relationships between different hierarchical levels. The storage requirement of the proposed oracle is $O(n^{\frac{4}{3}} \log \log n)$ and the access time query response time is $O(n^{\frac{2}{3}})$. Extensive tests on a number of large real-world road networks (with up to about 2 million vertices) have verified the superiority of the proposed oracle.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
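To make the notion of access time in the abstract above concrete: for a simple random walk, the expected number of steps $h(v)$ to first reach a target $j$ satisfies $h(j) = 0$ and $h(v) = 1 + \frac{1}{\deg(v)} \sum_{u \sim v} h(u)$ for every other vertex. The sketch below solves this linear system directly on a small graph. It illustrates the costly whole-graph matrix approach that the abstract contrasts with, not the proposed oracle. It assumes a connected, unweighted, undirected graph (so "distance" is counted in steps), and the name `access_times` is hypothetical.

```python
from fractions import Fraction


def access_times(adj, target):
    """Expected number of steps for a simple random walk to first reach
    `target` from each vertex. `adj` maps a vertex to its neighbour list.
    Assumes a connected, undirected, unweighted graph (illustration only)."""
    nodes = [v for v in adj if v != target]
    index = {v: r for r, v in enumerate(nodes)}
    n = len(nodes)
    # Augmented matrix for: h(v) - (1/deg v) * sum of h(u) over neighbours u != target = 1.
    rows = [[Fraction(0)] * (n + 1) for _ in range(n)]
    for v in nodes:
        r = index[v]
        rows[r][r] += 1
        rows[r][n] = Fraction(1)
        for u in adj[v]:
            if u != target:
                rows[r][index[u]] -= Fraction(1, len(adj[v]))
    # Gauss-Jordan elimination with exact rational arithmetic.
    for c in range(n):
        pivot = next(r for r in range(c, n) if rows[r][c] != 0)
        rows[c], rows[pivot] = rows[pivot], rows[c]
        rows[c] = [x / rows[c][c] for x in rows[c]]
        for r in range(n):
            if r != c and rows[r][c] != 0:
                factor = rows[r][c]
                rows[r] = [x - factor * y for x, y in zip(rows[r], rows[c])]
    times = {v: rows[index[v]][n] for v in nodes}
    times[target] = Fraction(0)
    return times
```

On the three-vertex path `{0: [1], 1: [0, 2], 2: [1]}` with target 2, the system is $h(0) = 1 + h(1)$ and $h(1) = 1 + h(0)/2$, which gives $h(1) = 3$ and $h(0) = 4$. Elimination of this kind is cubic in the number of vertices, which is why it does not scale to road networks with millions of vertices.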
  • 137
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-07-08
    Description: Clustering is one of the research hotspots in the field of data mining and has extensive applications in practice. Recently, Rodriguez and Laio [1] published a clustering algorithm in Science that identifies the clustering centers in an intuitive way and clusters objects efficiently and effectively. However, the algorithm is sensitive to a preassigned parameter and has difficulty identifying the “ideal” number of clusters. To overcome these shortcomings, this paper proposes a new clustering algorithm that can detect the clustering centers automatically via statistical testing. Specifically, the proposed algorithm first defines a new metric to measure the density of an object that is more robust to the preassigned parameter, and then derives a metric to evaluate the centrality of each object. Afterwards, it identifies the objects with extremely large centrality metrics as the clustering centers via an outward statistical testing method. Finally, it groups the remaining objects into clusters containing their nearest neighbors with higher density. Extensive experiments are conducted over different kinds of clustering data sets to evaluate the performance of the proposed algorithm and compare it with the algorithm in Science. The results show the effectiveness and robustness of the proposed algorithm.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
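The final grouping step described in the abstract above (each remaining object joins the cluster of its nearest neighbor with higher density) can be sketched as follows. This is a rough illustration in the spirit of the original Rodriguez and Laio scheme, not the paper's method: it assumes a simple cutoff-count density, picks centers by the largest product of density and distance-to-denser-neighbor, and takes both the cutoff and the number of centers as inputs, which are exactly the preassigned quantities the paper aims to avoid. The name `density_peaks` and its parameters are hypothetical.

```python
import math


def density_peaks(points, cutoff, n_centers):
    """Label points by density-peak style clustering (illustrative sketch).
    Assumes n_centers >= 1; cutoff and n_centers are user-chosen."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # Local density: number of other points closer than the cutoff (assumed definition).
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < cutoff)
           for i in range(n)]
    # Visit points from densest to sparsest; ties are broken by index.
    order = sorted(range(n), key=lambda i: (-rho[i], i))
    delta = [0.0] * n
    parent = [-1] * n
    for rank, i in enumerate(order):
        if rank == 0:
            # The densest point has no denser neighbour; use its largest distance.
            delta[i] = max(dist[i])
            continue
        # Nearest point that precedes i in the density ordering.
        j = min(order[:rank], key=lambda k: dist[i][k])
        delta[i] = dist[i][j]
        parent[i] = j
    # Centers: largest rho * delta (assumed selection rule; the densest point always qualifies).
    centers = sorted(range(n), key=lambda i: -(rho[i] * delta[i]))[:n_centers]
    labels = [-1] * n
    for cluster_id, c in enumerate(centers):
        labels[c] = cluster_id
    # Remaining points inherit the label of their nearest denser neighbour.
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[parent[i]]
    return labels
```

Because points are labeled in decreasing density order, the neighbor each point inherits from has always been labeled already.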
  • 138
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-07-08
    Description: A representative skyline contains $k$ skyline points that can represent its corresponding full skyline. The existing measuring criteria of $k$ representative skylines are specifically designed for static data, and they cannot effectively handle streaming data. In this paper, we focus on the problem of calculating the $k$ representative skyline over data streams. First, we propose a new criterion to choose $k$ skyline points as the $k$ representative skyline for data stream environments, termed the $k$ largest dominance skyline ($k$-LDS), which is representative of the entire data set and is highly stable over the streaming data. Second, we propose an efficient exact algorithm, called Prefix-based Algorithm (PBA), to solve the $k$-LDS problem in a 2-dimensional space. The time complexity of PBA is only $\mathcal{O}((M-k) \times k)$, where $M$ is the size of the full skyline set. Third, the $k$-LDS problem for a $d$-dimensional ($d \ge 3$) space turns out to be very complex. Therefore, a greedy algorithm is designed to answer $k$-LDS queries. To further accelerate the calculation, we propose an $\epsilon$-greedy algorithm which can achieve an approximation factor of $\frac{1}{(1+\epsilon)}(1-\frac{1}{\sqrt{e}})$
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 139
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-07-08
    Description: Domain adaptation generalizes a learning model across source domain and target domain that are sampled from different distributions. It is widely applied to cross-domain data mining for reusing labeled information and mitigating labeling consumption. Recent studies reveal that deep neural networks can learn abstract feature representation, which can reduce, but not remove, the cross-domain discrepancy. To enhance the invariance of deep representation and make it more transferable across domains, we propose a unified deep adaptation framework for jointly learning transferable representation and classifier to enable scalable domain adaptation, by taking the advantages of both deep learning and optimal two-sample matching. The framework constitutes two inter-dependent paradigms, unsupervised pre-training for effective training of deep models using deep denoising autoencoders, and supervised fine-tuning for effective exploitation of discriminative information using deep neural networks, both learned by embedding the deep representations to reproducing kernel Hilbert spaces (RKHSs) and optimally matching different domain distributions. To enable scalable learning, we develop a linear-time algorithm using unbiased estimate that scales linearly to large samples. Extensive empirical results show that the proposed framework significantly outperforms state of the art methods on diverse adaptation tasks: sentiment polarity prediction, email spam filtering, newsgroup content categorization, and visual object recognition.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 140
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-07-08
    Description: With the rapid growth of various applications on the Internet, recommender systems have become fundamental for helping users alleviate the problem of information overload. Since contextual information is a significant factor in modeling user behavior, various context-aware recommendation methods have been proposed recently. The state-of-the-art context modeling methods usually treat contexts as certain dimensions similar to those of users and items, and capture relevances between contexts and users/items. However, this kind of relevance is difficult to interpret. Some works on multi-domain relation prediction can also be used for context-aware recommendation, but they have limitations in generating recommendations under a large amount of contextual information. Motivated by recent works in natural language processing, we represent each context value with a latent vector, and model the contextual information as a semantic operation on the user and item. In addition, we use the contextual operating tensor to capture the common semantic effects of contexts. Experimental results show that the proposed Context Operating Tensor (COT) model yields significant improvements over the competitive compared methods on three typical datasets. From the experimental results of COT, we also obtain some interesting observations which follow our intuition.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 141
    Publication Date: 2016-07-08
    Description: In many applications, one can obtain descriptions about the same objects or events from a variety of sources. As a result, this will inevitably lead to data or information conflicts. One important problem is to identify the true information (i.e., the truths) among conflicting sources of data. It is intuitive to trust reliable sources more when deriving the truths, but it is usually unknown which one is more reliable a priori. Moreover, each source possesses a variety of properties with different data types. An accurate estimation of source reliability has to be made by modeling multiple properties in a unified model. Existing conflict resolution work either does not conduct source reliability estimation, or models multiple properties separately. In this paper, we propose to resolve conflicts among multiple sources of heterogeneous data types. We model the problem using an optimization framework where truths and source reliability are defined as two sets of unknown variables. The objective is to minimize the overall weighted deviation between the truths and the multi-source observations where each source is weighted by its reliability. Different loss functions can be incorporated into this framework to recognize the characteristics of various data types, and efficient computation approaches are developed. The proposed framework is further adapted to deal with streaming data in an incremental fashion and large-scale data in the MapReduce model. Experiments on real-world weather, stock, and flight data as well as simulated multi-source data demonstrate the advantage of jointly modeling different data types in the proposed framework.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
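The optimization view in the abstract above, with truths and source weights as two sets of unknowns and a weighted deviation as the objective, suggests an alternating scheme: fix the weights and update the truths, then fix the truths and update the weights. The sketch below illustrates that idea for continuous data only, under stated assumptions: squared loss (so the truth update is a weighted mean) and a negative-log-of-normalized-loss weight rule chosen here for illustration. It is not claimed to reproduce the paper's exact updates, and the name `resolve_conflicts` is hypothetical.

```python
import math


def resolve_conflicts(observations, iters=20, eps=1e-9):
    """Alternate between truth and source-weight updates (illustrative sketch).
    observations[k][i] is source k's numeric claim about object i.
    Assumes at least two sources and complete observations."""
    n_sources = len(observations)
    n_objects = len(observations[0])
    weights = [1.0] * n_sources
    truths = [0.0] * n_objects
    for _ in range(iters):
        # Truth step: under squared loss the minimizer is the weighted mean.
        total = sum(weights)
        truths = [
            sum(weights[k] * observations[k][i] for k in range(n_sources)) / total
            for i in range(n_objects)
        ]
        # Weight step (assumed rule): a smaller loss yields a larger weight.
        losses = [
            sum((observations[k][i] - truths[i]) ** 2 for i in range(n_objects)) + eps
            for k in range(n_sources)
        ]
        loss_sum = sum(losses)
        weights = [-math.log(loss / loss_sum) for loss in losses]
    return truths, weights
```

With two sources that agree on the values 10, 20, 30 and a third that reports 15, 28, 41, the estimated truths move from the plain average toward the agreeing sources as the deviating source's weight shrinks. Categorical properties would need a different loss, such as a 0-1 loss with a weighted vote, which this sketch does not cover.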
  • 142
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-07-08
    Description: Detecting non-overlapping and overlapping communities is essentially the same problem. However, current algorithms focus on finding either overlapping or non-overlapping communities. We present a generalized framework that can identify both non-overlapping and overlapping communities, without any prior input about the network or its community distribution. To do so, we introduce a vertex-based metric, GenPerm, that quantifies how much a vertex belongs to each of its constituent communities. Our community detection algorithm is based on maximizing GenPerm over all the vertices in the network. We demonstrate, through experiments over synthetic and real-world networks, that GenPerm is more effective than other metrics in evaluating community structure. Further, we show that due to its vertex-centric property, GenPerm can be used to draw several inferences beyond community detection, such as core-periphery analysis and message spreading. Our algorithm for maximizing GenPerm outperforms six state-of-the-art algorithms in accurately predicting the ground-truth labels. Finally, we discuss the problem of resolution limit in overlapping communities and demonstrate that maximizing GenPerm can mitigate this problem.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 143
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-07-08
    Description: Social hierarchy (i.e., pyramid structure of societies) is a fundamental concept in sociology and social network analysis. The importance of social hierarchy in a social network is that the topological structure of the social hierarchy is essential in both shaping the nature of social interactions between individuals and unfolding the structure of the social networks. The social hierarchy found in a social network can be utilized to improve the accuracy of link prediction, provide better query results, rank web pages, and study information flow and spread in complex networks. In this paper, we model a social network as a directed graph $G$, and consider the social hierarchy as a DAG (directed acyclic graph) of $G$, denoted as $G_D$. With a DAG, all the vertices in $G$ can be partitioned into different levels, the vertices at the same level represent a disjoint group in the social hierarchy, and all the edges in the DAG follow one direction. The main issue we study in this paper is how to find the DAG $G_D$ in $G$. The approach we take is to find $G_D$ by removing all possible cycles from $G$ such that $G = \mathcal{U}(G) \cup G_D$, where $\mathcal{U}(G)$ is a maximum Eulerian subgraph which contains all possible cycles. We give the reasons for doing so, investigate the properties of the $G_D$ found, and discuss the applications. In addition, we develop a novel two-phase algorithm, called Greedy-&-Refine, which greedily computes an Eulerian subgraph and then refines this greedy solution to find the maximum Eulerian subgraph. We give a bound between the greedy solution and the optimal. The quality of our greedy approach is high. We conduct comprehensive experimental studies over 14 real-world datasets. The results show that our algorithms are at least two orders of magnitude faster than the baseline algorithm.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 144
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-07-08
    Description: Visualization provides a powerful means for data analysis. But to be practical, visual analytics tools must support smooth and flexible use of visualizations at a fast rate. This becomes increasingly onerous with the ever-increasing size of real-world datasets. First, large databases make interaction more difficult once query response time exceeds several seconds. Second, any attempt to show all data points will overload the visualization, resulting in chaos that will only confuse the user. Over the last few years, substantial effort has been put into addressing both of these issues and many innovative solutions have been proposed. Indeed, data visualization is a topic that is too large to be addressed in a single survey paper. Thus, we restrict our attention here to interactive visualization of large data sets. Our focus then is skewed in a natural way towards query processing problem—provided by an underlying database system—rather than to the actual data visualization problem.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 145
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-07-08
    Description: Almost all of the existing domain adaptation methods assume that all test data belong to a single stationary target distribution. However, in many real world applications, data arrive sequentially and the data distribution is continuously evolving. In this paper, we tackle the problem of adaptation to a continuously evolving target domain that has been recently introduced. We assume that the available data for the source domain are labeled but the examples of the target domain can be unlabeled and arrive sequentially. Moreover, the distribution of the target domain can evolve continuously over time. We propose the Evolving Domain Adaptation (EDA) method that first finds a new feature space in which the source domain and the current target domain are approximately indistinguishable. Therefore, source and target domain data are similarly distributed in the new feature space and we use a semi-supervised classification method to utilize both the unlabeled data of the target domain and the labeled data of the source domain. Since test data arrives sequentially, we propose an incremental approach both for finding the new feature space and for semi-supervised classification. Experiments on several real datasets demonstrate the superiority of our proposed method in comparison to the other recent methods.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 146
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-07-08
    Description: The widespread use of online social networks (OSNs) to disseminate information and exchange opinions, by the general public, news media, and political actors alike, has enabled new avenues of research in computational political science. In this paper, we study the problem of quantifying and inferring the political leaning of Twitter users. We formulate political leaning inference as a convex optimization problem that incorporates two ideas: (a) users are consistent in their actions of tweeting and retweeting about political issues, and (b) similar users tend to be retweeted by similar audience. We then apply our inference technique to 119 million election-related tweets collected in seven months during the 2012 U.S. presidential election campaign. On a set of frequently retweeted sources, our technique achieves 94 percent accuracy and high rank correlation as compared with manually created labels. By studying the political leaning of 1,000 frequently retweeted sources, 232,000 ordinary users who retweeted them, and the hashtags used by these sources, our quantitative study sheds light on the political demographics of the Twitter population, and the temporal dynamics of political polarization as events unfold.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 147
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-07-08
    Description: Comparing two clustering results of a data set is a challenging task in cluster analysis. Many external validity measures have been proposed in the literature. A good measure should be invariant to changes in data size, cluster size, and number of clusters. We give an overview of existing set matching indexes and analyze their properties. Set matching measures are based on matching clusters from two clusterings. We analyze the measures in three parts: 1) cluster similarity, 2) matching, and 3) overall measurement. Correction for chance is also investigated and we prove that normalized mutual information and variation of information are intrinsically corrected. We propose a new scheme of experiments based on synthetic data for evaluation of an external validity index. Accordingly, popular external indexes are evaluated and compared when applied to clusterings of different data size, cluster size, and number of clusters. The experiments show that set matching measures are clearly better than the others tested. Based on the analytical comparisons, we introduce a new index called Pair Sets Index (PSI).
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 148
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-07-08
    Description: Many commercial products and academic research activities are embracing behavior analysis as a technique for improving detection of attacks of many sorts, from retweet boosting and hashtag hijacking to link advertising. Traditional approaches focus on detecting dense blocks in the adjacency matrix of graph data, and recently, the tensors of multimodal data. No method gives a principled way to score the suspiciousness of dense blocks with different numbers of modes and rank them to draw human attention accordingly. In this paper, we first give a list of axioms that any metric of suspiciousness should satisfy; we propose an intuitive, principled metric that satisfies the axioms, and is fast to compute; moreover, we propose CrossSpot, an algorithm to spot dense blocks that are worth inspecting, typically indicating fraud or some other noteworthy deviation from the usual, and sort them in the order of importance (“suspiciousness”). Finally, we apply CrossSpot to real data, where it improves the F1 score over previous techniques by 68 percent and finds suspicious behavioral patterns in social datasets spanning 0.3 billion posts.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 149
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-07-08
    Description: With the rapid development of mobile devices and crowdsourcing platforms, spatial crowdsourcing has attracted much attention from the database community. Specifically, spatial crowdsourcing refers to sending location-based requests to workers, based on their current positions. In this paper, we consider a spatial crowdsourcing scenario, in which each worker has a set of qualified skills, whereas each spatial task (e.g., repairing a house, decorating a room, and performing entertainment shows for a ceremony) is time-constrained, subject to a budget constraint, and requires a set of skills. Under this scenario, we study an important problem, namely multi-skill spatial crowdsourcing (MS-SC), which finds an optimal worker-and-task assignment strategy, such that skills between workers and tasks match with each other, and workers’ benefits are maximized under the budget constraint. We prove that the MS-SC problem is NP-hard and intractable. Therefore, we propose three effective heuristic approaches, including greedy, $g$-divide-and-conquer, and cost-model-based adaptive algorithms to get worker-and-task assignments. Through extensive experiments, we demonstrate the efficiency and effectiveness of our MS-SC processing approaches on both real and synthetic data sets.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 150
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-07-08
    Description: Twitter has become one of the largest microblogging platforms for users around the world to share anything happening around them with friends and beyond. A bursty topic in Twitter is one that triggers a surge of relevant tweets within a short period of time, which often reflects important events of mass interest. How to leverage Twitter for early detection of bursty topics has therefore become an important research problem with immense practical value. Despite the wealth of research work on topic modelling and analysis in Twitter, it remains a challenge to detect bursty topics in real-time. As existing methods can hardly scale to handle the task with the tweet stream in real-time, we propose in this paper TopicSketch, a sketch-based topic model together with a set of techniques to achieve real-time detection. We evaluate our solution on a tweet stream with over 30 million tweets. Our experimental results show both the efficiency and effectiveness of our approach. In particular, we demonstrate that TopicSketch on a single machine can potentially handle hundreds of millions of tweets per day, which is on the same scale as the total number of daily tweets on Twitter, and present bursty events at finer granularity.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 151
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-07-08
    Description: Recent studies show that disk-based graph computation systems on just a single PC can be as highly competitive as cluster-based systems on large-scale problems. Inspired by this remarkable progress, we develop VENUS, a disk-based graph computation system which is able to handle billion-scale graphs efficiently on a commodity PC. VENUS adopts a novel computing architecture that features vertex-centric “streamlined” processing—the graph is sequentially loaded and an update function is executed for each vertex in parallel on the fly. VENUS deliberately avoids loading batch edge data by separating read-only structure data from mutable vertex data on disk, and minimizes random IOs by caching vertex data in the main memory whenever possible. The streamlined processing is realized with efficient sequential scan over massive structure data and fast feeding the update function for a large number of vertices. Extensive evaluation on large real-world and synthetic graphs has demonstrated the efficiency of VENUS. For example, to run the PageRank algorithm on a Twitter graph of 42 million vertices and 1.4 billion edges, Spark needs 8.1 minutes with 50 machines and GraphChi spends 13 minutes using high-speed SSD, while VENUS only takes 5 minutes on one machine with an ordinary hard disk.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 152
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-07-08
    Description: The problem of counting triangles in graphs has been well studied in the literature. However, all existing algorithms, exact or approximate, spend at least linear time in the size of the graph (except a recent theoretical result), which can be prohibitive on today's large graphs. Nevertheless, we observe that the ideas in many existing triangle counting algorithms can be coupled with random sampling to yield potentially sublinear-time algorithms that return an approximation of the triangle count without looking at the whole graph. This paper makes these random sampling algorithms more explicit, and presents an experimental and analytical comparison of different approaches, identifying the best performers among a number of candidates.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 153
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-07-08
    Description: Predicting plausible links that may emerge between pairs of nodes is an important task in social network analysis, with over a decade of active research. Here, we propose a novel framework for link prediction. It integrates signals from node features, the existing local link neighborhood of a node pair, community-level link density, and global graph properties. Our framework uses a stacked two-level learning paradigm. At the lower level, the first two kinds of features are processed by a novel local learner. Its outputs are then integrated with the last two kinds of features by a conventional discriminative learner at the upper-level. We also propose a new stratified sampling scheme for evaluating link prediction algorithms in the face of an extremely large number of potential edges, out of which very few will ever materialize. It is not tied to a specific application of link prediction, but robust to a range of application requirements. We report on extensive experiments with seven benchmark datasets and over five competitive baseline systems. The system we present consistently shows at least 10 percent accuracy improvement over state-of-the-art, and over 30 percent improvement in some cases. We also demonstrate, through ablation, that our features are complementary in terms of the signals and accuracy benefits they provide.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 154
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-07-08
    Description: Max-flow has been adopted for semi-supervised data modelling, yet existing algorithms were derived only for learning from static data. This paper proposes an online max-flow algorithm for semi-supervised learning from data streams. Consider a graph learned from labelled and unlabelled data, with the graph being updated dynamically to accommodate online data adding and retiring. In learning from the resulting non-stationary graph, we augment and de-augment paths to update max-flow with a theoretical guarantee that the updated max-flow equals that from batch retraining. For classification, we compute min-cut over the current max-flow, so that a minimal number of similar sample pairs are classified into distinct classes. Empirical evaluation on real-world data reveals that our algorithm outperforms state-of-the-art stream classification algorithms.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 155
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-07-08
    Description: Due to the fact that existing database systems are increasingly more difficult to use, improving the quality and the usability of database systems has gained tremendous momentum over the last few years. In particular, the feature of explaining why some expected tuples are missing in the result of a query has received more attention. In this paper, we study the problem of explaining missing answers to top-k queries in the context of SQL (i.e., with selection, projection, join, and aggregation). To approach this problem, we use the query-refinement method. That is, given as inputs the original top-k SQL query and a set of missing tuples, our algorithms return to the user a refined query that includes both the missing tuples and the original query results. Case studies and experimental results show that our algorithms are able to return high quality explanations efficiently.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 156
    Publication Date: 2016-06-22
    Description: Sentiment analysis of online social media has attracted significant interest recently. Many studies have been performed, but most existing methods focus on either only textual content or only visual content. In this paper, we utilize deep learning models in a convolutional neural network (CNN) to analyze the sentiment in Chinese microblogs from both textual and visual content. We first train a CNN on top of pre-trained word vectors for textual sentiment analysis and employ a deep convolutional neural network (DNN) with generalized dropout for visual sentiment analysis. We then evaluate our sentiment prediction framework on a dataset collected from a famous Chinese social media network (Sina Weibo) that includes text and related images and demonstrate state-of-the-art results on this Chinese sentiment analysis benchmark.
    Electronic ISSN: 1999-4893
    Topics: Computer Science
    Published by MDPI Publishing
  • 157
    Publication Date: 2016-06-23
    Description: We investigate the problem of minimizing the total power consumption under the constraint of the signal-to-noise ratio (SNR) requirement for the physical layer multicasting system with large-scale antenna arrays. In contrast with existing work, we explicitly consider both the transmit power and the circuit power scaling with the number of antennas. The joint antenna selection and beamforming technique is proposed to minimize the total power consumption. The problem is a challenging one, which aims to minimize the linear combination of $\ell_0$-norm and $\ell_2$-norm. To the best of our knowledge, this minimization problem has not yet been well solved. A random decremental antenna selection algorithm is designed, which is further modified by an approximation of the minimal transmit power based on the asymptotic orthogonality of the channels. Then, a more efficient decremental antenna selection algorithm is proposed based on minimizing the $\ell_0$-norm. Performance results show that the $\ell_0$-norm minimization algorithm greatly outperforms the random selection algorithm in terms of the total power consumption and the average run time.
    Electronic ISSN: 1999-4893
    Topics: Computer Science
    Published by MDPI Publishing
  • 158
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-05-03
    Description: The number of applications based on Apache Hadoop is dramatically increasing due to the robustness and dynamic features of this system. At the heart of Apache Hadoop, the Hadoop Distributed File System (HDFS) provides the reliability and high availability for computation by applying a static replication by default. However, because of the characteristics of parallel operations on the application layer, the access rate for each data file in HDFS is completely different. Consequently, maintaining the same replication mechanism for every data file leads to detrimental effects on the performance. By rigorously considering the drawbacks of the HDFS replication, this paper proposes an approach to dynamically replicate the data file based on the predictive analysis. With the help of probability theory, the utilization of each data file can be predicted to create a corresponding replication strategy. Eventually, the popular files can be subsequently replicated according to their own access potentials. For the remaining low potential files, an erasure code is applied to maintain the reliability. Hence, our approach simultaneously improves the availability while keeping the reliability in comparison to the default scheme. Furthermore, the complexity reduction is applied to enhance the effectiveness of the prediction when dealing with Big Data.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 159
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-05-03
    Description: As remote sensing equipment and networked observational devices continue to proliferate, their corresponding data volumes have surpassed the storage and processing capabilities of commodity computing hardware. This trend has led to the development of distributed storage frameworks that incrementally scale out by assimilating resources as necessary. While challenging in its own right, storing and managing voluminous datasets is only the precursor to a broader field of research: extracting insights, relationships, and models from the underlying datasets. The focus of this study is twofold: exploratory and predictive analytics over voluminous, multidimensional datasets in a distributed environment. Both of these types of analysis represent a higher-level abstraction over standard query semantics; rather than indexing every discrete value for subsequent retrieval, our framework autonomously learns the relationships and interactions between dimensions in the dataset and makes the information readily available to users. This functionality includes statistical synopses, correlation analysis, hypothesis testing, probabilistic structures, and predictive models that not only enable the discovery of nuanced relationships between dimensions, but also allow future events and trends to be predicted. The algorithms presented in this work were evaluated empirically on a real-world geospatial time-series dataset in a production environment, and are broadly applicable across other storage frameworks.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 160
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-05-03
    Description: Researchers have recently shown that declarative database query languages, such as Datalog, could naturally be used to specify and implement network protocols and services. In this paper, we present a declarative framework for the specification, execution, simulation, and analysis of distributed applications. Distributed applications, including routing protocols, can be specified using a Declarative Networking language, called D2C, whose semantics capture the notion of a Distributed State Machine (DSM), i.e., a network of computational nodes that communicate with each other through the exchange of data. The D2C specification can be directly executed using the DSM computational infrastructure of our framework. The same specification can be simulated and formally verified. The simulation component integrates the DSM tool within a network simulation environment and allows developers to simulate network dynamics and collect data about the execution in order to evaluate application responses to network changes. The formal analysis component of our framework, instead, complements the empirical testing by supporting the verification of different classes of properties of distributed algorithms, including convergence of network routing protocols. To demonstrate the generality of our framework, we show how it can be used to analyze two classes of network routing protocols, a path vector and a Mobile Ad-Hoc Network (MANET) routing protocol, and execute a distributed algorithm for pattern formation in multi-robot systems.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 161
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2016-05-03
    Description: Additive models are regression methods which model the response variable as the sum of univariate transfer functions of the input variables. Key benefits of additive models are their accuracy and interpretability on many real-world tasks. Additive models are however not adapted to problems involving a large number (e.g., hundreds) of input variables, as they are prone to overfitting in addition to losing interpretability. In this paper, we introduce a novel framework for applying additive models to a large number of input variables. The key idea is to reduce the task dimensionality by deriving a small number of new covariates obtained by linear combinations of the inputs, where the linear weights are estimated with regard to the regression problem at hand. The weights are moreover constrained to prevent overfitting and facilitate the interpretation of the derived covariates. We establish identifiability of the proposed model under mild assumptions and present an efficient approximate learning algorithm. Experiments on synthetic and real-world data demonstrate that our approach compares favorably to baseline methods in terms of accuracy, while resulting in models of lower complexity and yielding practical insights into high-dimensional real-world regression tasks. Our framework broadens the applicability of additive models to high-dimensional problems while maintaining their interpretability and potential to provide practical insights.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 162
    Publication Date: 2016-05-27
    Description: Recently manifold learning has received extensive interest in the community of pattern recognition. Despite their appealing properties, most manifold learning algorithms are not robust in practical applications. In this paper, we address this problem in the context of the Hessian locally linear embedding (HLLE) algorithm and propose a more robust method, called RHLLE, which aims to be robust against both outliers and noise in the data. Specifically, we first propose a fast outlier detection method for high-dimensional datasets. Then, we employ a local smoothing method to reduce noise. Furthermore, we reformulate the original HLLE algorithm by using the truncation function from differentiable manifolds. In the reformulated framework, we explicitly introduce a weighted global functional to further reduce the undesirable effect of outliers and noise on the embedding result. Experiments on synthetic as well as real datasets demonstrate the effectiveness of our proposed algorithm.
    Electronic ISSN: 1999-4893
    Topics: Computer Science
    Published by MDPI Publishing
  • 163
    Publication Date: 2016-02-07
    Description: A new orthogonal projection method for computing the minimum distance between a point and a spatial parametric curve is presented. It consists of a geometric iteration which converges faster than the existing Newton’s method, and it is insensitive to the choice of initial values. We prove that projecting a point onto a spatial parametric curve under the method is globally second-order convergent.
    Electronic ISSN: 1999-4893
    Topics: Computer Science
    Published by MDPI Publishing
  • 164
    Publication Date: 2016-07-30
    Description: We consider the problem of estimating the measure of subsets in very large networks. A prime tool for this purpose is the Markov Chain Monte Carlo (MCMC) algorithm. This algorithm, while extremely useful in many cases, still often suffers from the drawback of very slow convergence. We show that in a special, but important case, it is possible to obtain significantly better bounds on the convergence rate. This special case is when the huge state space can be aggregated into a smaller number of clusters, in which the states behave approximately the same way (but their behavior still may not be identical). A Markov chain with this structure is called quasi-lumpable. This property allows the aggregation of states (nodes) into clusters. Our main contribution is a rigorously proved bound on the rate at which the aggregated state distribution approaches its limit in quasi-lumpable Markov chains. We also demonstrate numerically that in certain cases this can indeed lead to a significantly accelerated way of estimating the measure of subsets. The result can be a useful tool in the analysis of complex networks, whenever they have a clustering that aggregates nodes with similar (but not necessarily identical) behavior.
    Electronic ISSN: 1999-4893
    Topics: Computer Science
    Published by MDPI Publishing
  • 165
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2015-05-02
    Description: Domain transfer learning generalizes a learning model across training data and testing data with different distributions. A general principle to tackle this problem is reducing the distribution difference between training data and testing data such that the generalization error can be bounded. Current methods typically model the sample distributions in input feature space, which depends on nonlinear feature mapping to embody the distribution discrepancy. However, this nonlinear feature space may not be optimal for the kernel-based learning machines. To this end, we propose a transfer kernel learning (TKL) approach to learn a domain-invariant kernel by directly matching source and target distributions in the reproducing kernel Hilbert space (RKHS). Specifically, we design a family of spectral kernels by extrapolating target eigensystem on source samples with Mercer’s theorem. The spectral kernel minimizing the approximation error to the ground truth kernel is selected to construct domain-invariant kernel machines. Comprehensive experimental evidence on a large number of text categorization, image classification, and video event recognition datasets verifies the effectiveness and efficiency of the proposed TKL approach over several state-of-the-art methods.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 166
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2015-05-02
    Description: We present an evolutionary multi-branch tree clustering method to model hierarchical topics and their evolutionary patterns over time. The method builds evolutionary trees in a Bayesian online filtering framework. The tree construction is formulated as an online posterior estimation problem, which well balances both the fitness of the current tree and the smoothness between trees. The state-of-the-art multi-branch clustering method, Bayesian rose trees, is employed to generate a topic tree with a high fitness value. A constraint model is also introduced to preserve the smoothness between trees. A set of comprehensive experiments on real world news data demonstrates that the proposed method better incorporates historical tree information and is more efficient and effective than the traditional evolutionary hierarchical clustering algorithm. In contrast to our previous method [31], we implement two additional baseline algorithms to compare them with our algorithm. We also evaluate the performance of the clustering algorithm based on multiple constraint trees. Furthermore, two case studies are conducted to demonstrate the effectiveness and usefulness of our algorithm in helping users understand the major hierarchical topic evolutionary patterns in text data.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 167
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2015-05-02
    Description: The discovery of regions of interest in large cities is an important challenge. We propose and investigate a novel query called the path nearby cluster (PNC) query that finds regions of potential interest (e.g., sightseeing places and commercial districts) with respect to a user-specified travel route. Given a set of spatial objects $O$ (e.g., POIs, geo-tagged photos, or geo-tagged tweets) and a query route $q$, if a cluster $c$ has high spatial-object density and is spatially close to $q$, it is returned by the query (a cluster is a circular region defined by a center and a radius). This query aims to bring important benefits to users in popular applications such as trip planning and location recommendation. Efficient computation of the PNC query faces two challenges: how to prune the search space during query processing, and how to identify clusters with high density effectively. To address these challenges, a novel collective search algorithm is developed. Conceptually, the search process is conducted in the spatial and density domains concurrently. In the spatial domain, network expansion is adopted, and a set of vertices are selected from the query route as expansion centers. In the density domain, clusters are sorted according to their density distributions and they are scanned from the maximum to the minimum. A pair of upper and lower bounds are defined to prune the search space in the two domains globally. The performance of the PNC query is studied in extensive experiments based on real and synthetic spatial data.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 168
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2015-05-02
    Description: Given learning samples from a raster data set, spatial decision tree learning aims to find a decision tree classifier that minimizes classification errors as well as salt-and-pepper noise. The problem has important societal applications such as land cover classification for natural resource management. However, the problem is challenging due to the fact that learning samples show spatial autocorrelation in class labels, instead of being independently identically distributed. Related work relies on local tests (i.e., testing feature information of a location) and cannot adequately model the spatial autocorrelation effect, resulting in salt-and-pepper noise. In contrast, we recently proposed a focal-test-based spatial decision tree (FTSDT), in which the tree traversal direction of a sample is based on both local and focal (neighborhood) information. Preliminary results showed that FTSDT reduces classification errors and salt-and-pepper noise. This paper extends our recent work by introducing a new focal test approach with adaptive neighborhoods that avoids over-smoothing in wedge-shaped areas. We also conduct computational refinement on the FTSDT training algorithm by reusing focal values across candidate thresholds. Theoretical analysis shows that the refined training algorithm is correct and more scalable. Experiment results on real world data sets show that new FTSDT with adaptive neighborhoods improves classification accuracy, and that our computational refinement significantly reduces training time.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 169
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2015-05-02
    Description: As a newly emerging network model, heterogeneous information networks (HINs) have received growing attention. Many data mining tasks have been explored in HINs, including clustering, classification, and similarity search. Similarity join is a fundamental operation required for many problems. It is attracting attention from various applications on network data, such as friend recommendation, link prediction, and online advertising. Although similarity join has been well studied in homogeneous networks, it has not yet been studied in heterogeneous networks. Especially, none of the existing research on similarity join takes different semantic meanings behind paths into consideration and almost all completely ignore the heterogeneity and diversity of the HINs. In this paper, we propose a path-based similarity join (PS-join) method to return the top $k$ similar pairs of objects based on any user specified join path in a heterogeneous information network. We study how to prune expensive similarity computation by introducing bucket pruning based locality sensitive hashing (BPLSH) indexing. Compared with existing Link-based Similarity join (LS-join) method, PS-join can derive various similarity semantics. Experimental results on real data sets show the efficiency and effectiveness of the proposed approach.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 170
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2015-05-02
    Description: Time-series classification has attracted considerable research attention due to the various domains where time-series data are observed, ranging from medicine to econometrics. Traditionally, the focus of time-series classification has been on short time-series data composed of a few patterns exhibiting variabilities, while recently there have been attempts to focus on longer series composed of multiple local patterns repeating with an arbitrary irregularity. The primary contribution of this paper is a method that detects local patterns in repetitive time-series by fitting local polynomial functions of a specified degree. We capture the repetitiveness degrees of time-series datasets via a new measure. Furthermore, our method approximates local polynomials in linear time and ensures an overall linear running time complexity. The coefficients of the polynomial functions are converted to symbolic words via equi-area discretizations of the coefficients’ distributions. The symbolic polynomial words enable the detection of similar local patterns by assigning the same word to similar polynomials. Moreover, a histogram of the frequencies of the words is constructed from each time-series’ bag of words. Each row of the histogram enables a new representation for the series and symbolizes the occurrence of local patterns and their frequencies. In an experimental comparison against state-of-the-art baselines on repetitive datasets, our method demonstrates significant improvements in terms of prediction accuracy.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
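    The pipeline described in this abstract (local polynomial fits, discretized coefficients, word histogram) can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes degree-1 polynomials only, a hypothetical window length of 4, a three-letter alphabet, and cut points taken from a single series rather than from a whole dataset. All function names are invented for illustration.

```python
from bisect import bisect_right
from collections import Counter


def linear_fit(window):
    """Least-squares line a + b*t through the points (t, window[t])."""
    n = len(window)
    t_mean = (n - 1) / 2
    y_mean = sum(window) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(window))
    slope = sxy / sxx
    return (y_mean - slope * t_mean, slope)  # (intercept, slope)


def equal_frequency_cuts(values, alphabet_size):
    """Cut points that split the sorted values into equally populated bins."""
    ordered = sorted(values)
    return [ordered[(k * len(ordered)) // alphabet_size]
            for k in range(1, alphabet_size)]


def to_words(series, window=4, alphabet="abc"):
    """One two-letter word per sliding window: one letter per coefficient."""
    fits = [linear_fit(series[i:i + window])
            for i in range(len(series) - window + 1)]
    # One set of cut points per coefficient (intercept, slope).
    cuts = [equal_frequency_cuts([f[j] for f in fits], len(alphabet))
            for j in range(2)]
    return ["".join(alphabet[bisect_right(cuts[j], f[j])] for j in range(2))
            for f in fits]


def word_histogram(series, **kwargs):
    """Bag-of-words representation: how often each word occurs."""
    return Counter(to_words(series, **kwargs))
```

    Identical windows always map to the same word, so a repetitive series yields a histogram concentrated on a few words.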
  • 171
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2015-05-02
    Description: Sentiment classification is a topic-sensitive task, i.e., a classifier trained on one topic will perform worse on another. This is especially a problem for tweet sentiment analysis. Since the topics in Twitter are very diverse, it is impossible to train a universal classifier for all topics. Moreover, compared to product reviews, Twitter lacks data labeling and a rating mechanism to acquire sentiment labels. The extremely sparse text of tweets also brings down the performance of a sentiment classifier. In this paper, we propose a semi-supervised topic-adaptive sentiment classification (TASC) model, which starts with a classifier built on common features and mixed labeled data from various topics. It minimizes the hinge loss to adapt to unlabeled data and features including topic-related sentiment words, authors’ sentiments and sentiment connections derived from “@” mentions of tweets, named as topic-adaptive features. Text and non-text features are extracted and naturally split into two views for co-training. The TASC learning algorithm updates topic-adaptive features based on the collaborative selection of unlabeled data, which in turn helps to select more reliable tweets to boost the performance. We also design the adapting model along a timeline (TASC-t) for dynamic tweets. An experiment on 6 topics from published tweet corpora demonstrates that TASC outperforms other well-known supervised and ensemble classifiers. It also beats those semi-supervised learning methods without feature adaptation. Meanwhile, TASC-t can also achieve impressive accuracy and F-score. Finally, with timeline visualization of a “river” graph, people can intuitively grasp the ups and downs of the sentiments’ evolution, and the intensity by color gradation.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 172
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2015-05-02
    Description: This paper presents a new robust EM algorithm for the finite mixture learning procedures. The proposed Spatial-EM algorithm utilizes median-based location and rank-based scatter estimators to replace sample mean and sample covariance matrix in each M step, hence enhancing stability and robustness of the algorithm. It is robust to outliers and initial values. Compared with many robust mixture learning methods, the Spatial-EM has the advantages of simplicity in implementation and statistical efficiency. We apply Spatial-EM to supervised and unsupervised learning scenarios. More specifically, robust clustering and outlier detection methods based on Spatial-EM have been proposed. We apply the outlier detection to taxonomic research on fish species novelty discovery. Two real datasets are used for clustering analysis. Compared with the regular EM and many other existing methods such as K-median, X-EM and SVM, our method demonstrates superior performance and high robustness.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 173
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2015-05-02
    Description: It is a big challenge to guarantee the quality of discovered relevance features in text documents for describing user preferences because of large scale terms and data patterns. Most existing popular text mining and classification methods have adopted term-based approaches. However, they have all suffered from the problems of polysemy and synonymy. Over the years, the hypothesis has often been held that pattern-based methods should perform better than term-based ones in describing user preferences; yet, how to effectively use large scale patterns remains a hard problem in text mining. To make a breakthrough in this challenging issue, this paper presents an innovative model for relevance feature discovery. It discovers both positive and negative patterns in text documents as higher level features and deploys them over low-level features (terms). It also classifies terms into categories and updates term weights based on their specificity and their distributions in patterns. Substantial experiments using this model on RCV1, TREC topics and Reuters-21578 show that the proposed model significantly outperforms both the state-of-the-art term-based methods and the pattern-based methods.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 174
    Publication Date: 2015-05-08
    Description: The construction of a similarity matrix is one significant step in the spectral clustering algorithm, and the Gaussian kernel function is one of the most common measures for constructing the similarity matrix. However, with a fixed scaling parameter, the similarity between two data points is not adaptive and not appropriate for multi-scale datasets. In this paper, by quantifying the importance of each vertex of the similarity graph, the Gaussian kernel function is scaled, and an adaptive Gaussian kernel similarity measure is proposed. Then, an adaptive spectral clustering algorithm is obtained based on the importance of shared nearest neighbors. The idea is that the greater the importance of the shared neighbors between two vertices, the more likely it is that these two vertices belong to the same cluster. The importance value of the shared neighbors is obtained with an iterative method, which considers both the local structural information and the distance similarity information, so as to improve the algorithm’s performance. Experimental results on different datasets show that our spectral clustering algorithm outperforms other spectral clustering algorithms, such as self-tuning spectral clustering and adaptive spectral clustering based on shared nearest neighbors, in clustering accuracy on most datasets.
    Electronic ISSN: 1999-4893
    Topics: Computer Science
    Published by MDPI Publishing
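    The limitation of a fixed scaling parameter on multi-scale data can be seen in a small sketch. It contrasts the fixed-sigma Gaussian kernel with the local scaling used by self-tuning spectral clustering (one of the baselines named above), where each point gets its own scale from the distance to its k-th nearest neighbour. The paper's own importance-based measure is not reproduced here; function names and the choice k=1 in the usage note are assumptions, and the points are assumed to be distinct.

```python
import math


def gaussian_similarity(points, sigma):
    """Fixed-scale Gaussian kernel: exp(-d(i, j)^2 / (2 * sigma^2))."""
    return [[math.exp(-math.dist(p, q) ** 2 / (2 * sigma ** 2))
             for q in points] for p in points]


def local_scaling_similarity(points, k=1):
    """Locally scaled kernel: exp(-d(i, j)^2 / (sigma_i * sigma_j)).

    sigma_i is the distance from point i to its k-th nearest neighbour
    (index 0 of the sorted distances is the point itself).
    """
    sigmas = [sorted(math.dist(p, q) for q in points)[k] for p in points]
    return [[math.exp(-math.dist(p, q) ** 2 / (sigmas[i] * sigmas[j]))
             for j, q in enumerate(points)] for i, p in enumerate(points)]
```

    With a tight cluster (spacing 0.1) and a loose cluster (spacing 2), a sigma tuned to the tight cluster makes similarities inside the loose cluster vanish, while local scaling gives nearest neighbours in both clusters the same similarity.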
  • 175
    Publication Date: 2015-05-09
    Description: In this paper, we propose a detection method for pulmonary nodules in X-ray computed tomography (CT) scans using three image filters and appearance-based k-means clustering. First, voxel values are suppressed in radial directions so as to eliminate extra regions in the volumes of interest (VOIs). Globular regions are enhanced by moment-of-inertia tensors where the voxel values in the VOIs are regarded as mass. Excessively enhanced voxels are reduced based on displacement between the VOI centers and the gravity points of the voxel values in the VOIs. Initial nodule candidates are determined by these filtering steps. False positives are reduced by, first, normalizing the directions of intensity distributions in the VOIs by rotating the VOIs based on the eigenvectors of the moment-of-inertia tensors, and then applying an appearance-based two-step k-means clustering technique to the rotated VOIs. The proposed method is applied to actual CT scans and experimental results are shown.
    Electronic ISSN: 1999-4893
    Topics: Computer Science
    Published by MDPI Publishing
  • 176
    Publication Date: 2015-05-09
    Description: We propose a linear time algorithm, called G2DLP, for generating 2D lattice L(n1, n2) paths, equivalent to two-item  multiset permutations, with a given number of turns. The usage of turn has three meanings: in the context of multiset permutations, it means that two consecutive elements of a permutation belong to two different items; in lattice path enumerations, it means that the path changes its direction, either from eastward to northward or from northward to eastward; in open shop scheduling, it means that we transfer a job from one type of machine to another. The strategy of G2DLP is divide-and-combine; the division is based on the enumeration results of a previous study and is achieved by aid of an integer partition algorithm and a multiset permutation algorithm; the combination is accomplished by a concatenation algorithm that constructs the paths we require. The advantage of G2DLP is twofold. First, it is optimal in the sense that it directly generates all feasible paths without visiting an infeasible one. Second, it can generate all paths in any specified order of turns, for example, a decreasing order or an increasing order. In practice, two applications, scheduling and cryptography, are discussed.
    Electronic ISSN: 1999-4893
    Topics: Computer Science
    Published by MDPI Publishing
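    The notion of a turn used in this abstract can be made concrete with a brute-force sketch. Unlike G2DLP, it enumerates every lattice path and filters afterwards, so it does visit infeasible paths; it only illustrates the definition. The function name is hypothetical.

```python
from itertools import combinations


def paths_with_turns(n1, n2, turns):
    """Yield lattice paths with n1 east (E) and n2 north (N) steps
    that change direction exactly `turns` times.

    Brute force: every path is generated, then filtered by turn count.
    """
    length = n1 + n2
    for east_positions in combinations(range(length), n1):
        east = set(east_positions)
        path = "".join("E" if i in east else "N" for i in range(length))
        # A turn is a pair of consecutive steps in different directions.
        if sum(a != b for a, b in zip(path, path[1:])) == turns:
            yield path
```

    For L(2, 2), the six paths split evenly: two with one turn (EENN, NNEE), two with two turns, and two with three turns.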
  • 177
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2015-05-02
    Description: Numerous theories and algorithms have been developed to solve vectorial data learning problems by searching for the hypothesis that best fits the observed training sample. However, many real-world applications involve samples that are not described as feature vectors, but as (dis)similarity data. Converting vectorial data into (dis)similarity data is more easily performed than converting (dis)similarity data into vectorial data. This study proposes a stochastic iterative distance transformation model for similarity-based learning. The proposed model can be used to identify a clear class boundary in data by modifying the (dis)similarities between examples. The experimental results indicate that the performance of the proposed method is comparable with those of various vector-based and proximity-based learning algorithms.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 178
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2015-05-02
    Description: Although graph embedding has been a powerful tool for modeling data intrinsic structures, simply employing all features for data structure discovery may result in noise amplification. This is particularly severe for high-dimensional data with small samples. To meet this challenge, this paper proposes a novel efficient framework to perform feature selection for graph embedding, in which a category of graph embedding methods is cast as a least squares regression problem. In this framework, a binary feature selector is introduced to naturally handle the feature cardinality in the least squares formulation. The resultant integer programming problem is then relaxed into a convex Quadratically Constrained Quadratic Program (QCQP) learning problem, which can be efficiently solved via a sequence of accelerated proximal gradient (APG) methods. Since each APG optimization is w.r.t. only a subset of features, the proposed method is fast and memory efficient. The proposed framework is applied to several graph embedding learning problems, including supervised, unsupervised, and semi-supervised graph embedding. Experimental results on several high-dimensional data sets demonstrate that the proposed method outperforms the considered state-of-the-art methods.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 179
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2015-05-02
    Description: Boosting is an iterative process that improves the predictive accuracy for supervised (machine) learning algorithms. Boosting operates by learning multiple functions with subsequent functions focusing on incorrect instances where the previous functions predicted the wrong label. Despite considerable success, boosting still has difficulty on data sets with certain types of problematic training data (e.g., label noise) and when complex functions overfit the training data. We propose a novel cluster-based boosting (CBB) approach to address limitations in boosting for supervised learning systems. Our CBB approach partitions the training data into clusters containing highly similar member data and integrates these clusters directly into the boosting process. CBB boosts selectively (using a high learning rate, low learning rate, or not boosting) on each cluster based on both the additional structure provided by the cluster and previous function accuracy on the member data. Selective boosting allows CBB to improve predictive accuracy on problematic training data. In addition, boosting separately on clusters reduces function complexity to mitigate overfitting. We provide comprehensive experimental results on 20 UCI benchmark data sets with three different kinds of supervised learning systems. These results demonstrate the effectiveness of our CBB approach compared to a popular boosting algorithm, an algorithm that uses clusters to improve boosting, and two algorithms that use selective boosting without clustering.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 180
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2015-05-02
    Description: Discovering trajectory patterns is shown to be very useful in learning interactions between moving objects. Many types of trajectory patterns have been proposed in the literature, but previous methods were developed for only a specific type of trajectory patterns. This limitation could make pattern discovery tedious and inefficient since users typically do not know which types of trajectory patterns are hidden in their data sets. Our main observation is that many trajectory patterns can be arranged according to the strength of temporal constraints. In this paper, we propose a unifying framework of mining trajectory patterns of various temporal tightness, which we call unifying trajectory patterns (UT-patterns). This framework consists of two phases: initial pattern discovery and granularity adjustment. A set of initial patterns are discovered in the first phase, and their granularities (i.e., levels of detail) are adjusted by split and merge to detect other types in the second phase. As a result, the structure called a pattern forest is constructed to show various patterns. Both phases are guided by an information-theoretic formula without user intervention. Experimental results demonstrate that our framework facilitates easy discovery of various patterns from real-world trajectory data.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 181
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2015-05-02
    Description: With the increasing availability of graph data and the widely adopted cloud computing paradigm, graph partitioning has become an efficient pre-processing technique to balance the computing workload and cope with the large scale of input data. Since the cost of partitioning the entire graph is strictly prohibitive, there are some recent tentative works towards streaming graph partitioning, which runs faster, is easily parallelized, and can be incrementally updated. Most of the existing works on streaming partitioning assume that worker nodes within a cluster are homogeneous in nature. Unfortunately, this assumption does not always hold. Experiments show that these homogeneous algorithms suffer a significant performance degradation when running in a heterogeneous environment. In this paper, we propose a novel adaptive streaming graph partitioning approach to cope with heterogeneous environments. We first formally model the heterogeneous computing environment with the consideration of the imbalance of computing ability (e.g., the CPU frequency) and communication ability (e.g., the network bandwidth) for each node. Based on this model, we propose a new graph partitioning objective function that aims to minimize the total execution time of the graph-processing job. We then explore some simple yet effective streaming algorithms for this objective function that can achieve a balanced and efficient partitioning result. Extensive experiments are conducted on a moderate sized computing cluster with real-world web and social network graphs. The results demonstrate that the proposed approach achieves significant improvement compared with the state-of-the-art solutions.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 182
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2015-05-02
    Description: Matrix factorization is useful for extracting the essential low-rank structure from a given matrix and has received increasing attention. A typical example is non-negative matrix factorization (NMF), a type of unsupervised learning that has been successfully applied to a variety of data including documents, images and gene expression, where values are usually non-negative. We propose a new model of NMF which is trained by using auxiliary information of overlapping groups. This setting is very reasonable in many applications, a typical example being gene function estimation where functional gene groups are heavily overlapped with each other. To estimate true groups from given overlapping groups efficiently, our model incorporates latent matrices with the regularization term using a mixed norm. This regularization term allows group-wise sparsity on the optimized low-rank structure. The latent matrices and other parameters are efficiently estimated by a block coordinate gradient descent method. We empirically evaluated the performance of our proposed model and algorithm from a variety of viewpoints, comparing with four methods including MMF for auxiliary graph information, by using both synthetic and real world document and gene expression data sets.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 183
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2015-05-02
    Description: This paper focuses on the measure of recommendation stability, which reflects the consistency of recommender system predictions. Stability is a desired property of recommendation algorithms and has important implications on users’ trust and acceptance of recommendations. Prior research has reported that some popular recommendation algorithms can suffer from a high degree of instability. In this study, we explore two scalable, general-purpose meta-algorithmic approaches—based on bagging and iterative smoothing—that can be used in conjunction with different traditional recommendation algorithms to improve their stability. Our experimental results on real-world rating data demonstrate that both approaches can achieve substantially higher stability as compared to the original recommendation algorithms. Furthermore, perhaps as importantly, the proposed approaches not only do not sacrifice the predictive accuracy in order to improve recommendation stability, but are actually able to provide additional accuracy improvements.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 184
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2015-05-02
    Description: We propose selective supervised Latent Dirichlet Allocation (ssLDA) to boost the prediction performance of the widely studied supervised probabilistic topic models. We introduce a Bernoulli distribution for each word in one given document to select this word as a strongly or weakly discriminative one with respect to its assigned topic. The Bernoulli distribution is parameterized by the discrimination power of the word for its assigned topic. As a result, the document is represented as a “bag-of-selective-words” instead of the probabilistic “bag-of-topics” in the topic modeling domain or the flat “bag-of-words” in the traditional natural language processing domain to form a new perspective. Inheriting the general framework of supervised LDA (sLDA), ssLDA can also predict many types of response specified by a Gaussian Linear Model (GLM). Focusing on the utilization of this word selection mechanism for single-label document classification in this paper, we conduct the variational inference for approximating the intractable posterior and derive a maximum-likelihood estimation of parameters in ssLDA. The experiments reported on textual documents show that ssLDA not only performs competitively over “state-of-the-art” classification approaches based on both the flat “bag-of-words” and probabilistic “bag-of-topics” representation in terms of classification performance, but also has the ability to discover the discrimination power of the words specified in the topics (compatible with our rational knowledge).
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 185
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2015-05-02
    Description: Metric learning, the task of learning a good distance metric, is a key problem in machine learning with ample applications. This paper introduces a novel framework for nonlinear metric learning, called kernel density metric learning (KDML), which is easy to use and provides nonlinear, probability-based distance measures. KDML constructs a direct nonlinear mapping from the original input space into a feature space based on kernel density estimation. The nonlinear mapping in KDML embodies established distance measures between probability density functions, and leads to accurate classification on datasets for which existing linear metric learning methods would fail. It addresses the severe challenge to distance-based classifiers when features are from heterogeneous domains and, as a result, the Euclidean or Mahalanobis distance between original feature vectors is not meaningful. We also propose two ways to determine the kernel bandwidths, including an adaptive local scaling approach and an integrated optimization algorithm that learns the Mahalanobis matrix and kernel bandwidths together. KDML is a general framework that can be combined with any existing metric learning algorithm. As concrete examples, we combine KDML with two leading metric learning algorithms, large margin nearest neighbors (LMNN) and neighborhood component analysis (NCA). KDML can naturally handle not only numerical features, but also categorical ones, which is rarely found in previous metric learning algorithms. Extensive experimental results on various datasets show that KDML significantly improves existing metric learning algorithms in terms of classification accuracy.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 186
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2015-05-02
    Description: The Earth Mover’s Distance (EMD) is a well-known distance metric for data represented as probability distributions over a predefined feature space. Supporting EMD-based similarity search has attracted intensive research effort. Despite the plethora of literature, most existing solutions are optimized for $L^p$ feature spaces (e.g., Euclidean space); while in a spectrum of applications, the relationships between features are better captured using networks. In this paper, we study the problem of answering $k$-nearest neighbor ($k$-NN) queries under network-based EMD metrics (NEMD). We propose Oasis, a new access method which leverages the network structure of feature space and enables efficient NEMD-based similarity search. Specifically, Oasis employs three novel techniques: (i) Range Oracle, a scalable model to estimate the range of the $k$-th nearest neighbor under NEMD, (ii) Boundary Index, a structure that efficiently fetches candidates within a given range, and (iii) Network Compression Hierarchy, an incremental filtering mechanism that effectively prunes false positive candidates to save unnecessary computation. Through extensive experiments using both synthetic and real data sets, we confirmed that Oasis significantly outperforms the state-of-the-art methods in query processing cost.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 187
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2015-05-02
    Description: Many mature term-based or pattern-based approaches have been used in the field of information filtering to generate users’ information needs from a collection of documents. A fundamental assumption for these approaches is that the documents in the collection are all about one topic. However, in reality users’ interests can be diverse and the documents in the collection often involve multiple topics. Topic modelling, such as Latent Dirichlet Allocation (LDA), was proposed to generate statistical models to represent multiple topics in a collection of documents, and this has been widely utilized in the fields of machine learning and information retrieval, etc. But its effectiveness in information filtering has not been so well explored. Patterns are always thought to be more discriminative than single terms for describing documents. However, the enormous number of discovered patterns hinders their effective and efficient use in real applications; therefore, selecting the most discriminative and representative patterns from the huge number of discovered patterns becomes crucial. To deal with the above-mentioned limitations and problems, in this paper, a novel information filtering model, Maximum matched Pattern-based Topic Model (MPBTM), is proposed. The main distinctive features of the proposed model include: (1) user information needs are generated in terms of multiple topics; (2) each topic is represented by patterns; (3) patterns are generated from topic models and are organized in terms of their statistical and taxonomic features; and (4) the most discriminative and representative patterns, called Maximum Matched Patterns, are proposed to estimate the document relevance to the user’s information needs in order to filter out irrelevant documents. Extensive experiments are conducted to evaluate the effectiveness of the proposed model by using the TREC data collection Reuters Corpus Volume 1. The results show that the proposed model significantly outperforms both state-of-the-art term-based models and pattern-based models.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 188
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2015-05-02
    Description: Trust plays a crucial role in helping online users collect reliable information and it has gained increasing attention from the computer science community in recent years. Traditionally, research about online trust assumes static trust relations between users. However, trust, as a social concept, evolves as people interact. Most existing studies about trust evolution are from sociologists in the physical world while little work exists in an online world. Studying online trust evolution faces unique challenges because more often than not, available data is from passive observation. In this work, we leverage social science theories to develop a methodology that enables the study of online trust evolution. In particular, we identify the differences of trust evolution study in physical and online worlds and propose a framework, eTrust, to study trust evolution using online data from passive observation in the context of product review sites by exploiting the dynamics of user preferences. We present technical details about modeling trust evolution, and perform experiments to show how the exploitation of trust evolution can help improve the performance of online applications such as trust prediction, rating prediction and ranking evolution.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 189
    Publication Date: 2015-05-09
    Description: In this work we generate numerical solutions of Burgers’ equation by applying the Crank-Nicolson method and different schemes for solving nonlinear systems, instead of using the Hopf-Cole transformation to reduce Burgers’ equation to the linear heat equation. The method is analyzed on two test problems in order to check its efficiency on different kinds of initial conditions. Numerical solutions as well as exact solutions for different values of viscosity are calculated, and the numerical results are found to be very close to the exact solutions.
    Electronic ISSN: 1999-4893
    Topics: Computer Science
    Published by MDPI Publishing
  • 190
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2015-04-04
    Description: Learning to rank arises in many data mining applications, ranging from web search engines and online advertising to recommendation systems. In learning to rank, the performance of a ranking model is strongly affected by the number of labeled examples in the training set; on the other hand, obtaining labeled examples for training data is very expensive and time-consuming. This creates a strong need for active learning approaches that select the most informative examples for ranking learning; however, there is still very limited work in the literature addressing active learning for ranking. In this paper, we propose a general active learning framework, expected loss optimization (ELO), for ranking. The ELO framework is applicable to a wide range of ranking functions. Under this framework, we derive a novel algorithm, expected discounted cumulative gain (DCG) loss optimization (ELO-DCG), to select the most informative examples. Then, we investigate both query-level and document-level active learning for ranking and propose a two-stage ELO-DCG algorithm which incorporates both query and document selection into active learning. Furthermore, we show that the algorithm can flexibly handle the skewed grade distribution problem through a modification of the loss function. Extensive experiments on real-world web search data sets have demonstrated the great potential and effectiveness of the proposed framework and algorithms.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 191
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2015-04-04
    Description: Advanced technology in GPS and sensors enables us to track physical events, such as human movements and facility usage. Periodicity analysis from the recorded data is an important data mining task which provides useful insights into the physical events and enables us to report outliers and predict future behaviors. To mine periodicity in an event, we have to face real-world challenges of inherently complicated periodic behaviors and imperfect data collection problem. Specifically, the hidden temporal periodic behaviors could be oscillating and noisy, and the observations of the event could be incomplete. In this paper, we propose a novel probabilistic measure for periodicity and design a practical algorithm, ePeriodicity, to detect periods. Our method has thoroughly considered the uncertainties and noises in periodic behaviors and is provably robust to incomplete observations. Comprehensive experiments on both synthetic and real datasets demonstrate the effectiveness of our method.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 192
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2015-04-04
    Description: Duplicate detection is the process of identifying multiple representations of the same real-world entities. Today, duplicate detection methods need to process ever larger datasets in ever shorter time, and maintaining the quality of a dataset becomes increasingly difficult. We present two novel, progressive duplicate detection algorithms that significantly increase the efficiency of finding duplicates if the execution time is limited: they maximize the gain of the overall process within the time available by reporting most results much earlier than traditional approaches. Comprehensive experiments show that our progressive algorithms can double the efficiency over time of traditional duplicate detection and significantly improve upon related work.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 193
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2015-04-04
    Description: The classification of patterns into naturally ordered labels is referred to as ordinal regression or ordinal classification. Usually, this classification setting is by nature highly imbalanced, because there are classes in the problem that are a priori more probable than others. Although standard over-sampling methods can improve the classification of minority classes in ordinal classification, they tend to introduce severe errors in terms of the ordinal label scale, given that they do not take the ordering into account. A specific ordinal over-sampling method is developed in this paper for the first time in order to improve the performance of machine learning classifiers. The method proposed includes ordinal information by approaching over-sampling from a graph-based perspective. The results presented in this paper show the good synergy of a popular ordinal regression method (a reformulation of support vector machines) with the graph-based proposed algorithms, and the possibility of improving both the classification and the ordering of minority classes. A cost-sensitive version of the ordinal regression method is also introduced and compared with the over-sampling proposals, showing in general lower performance for minority classes.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 194
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2015-04-04
    Description: This paper presents a novel approach that transforms the feature space into a new feature space such that a range query in the original space is mapped into an equivalent box query in the transformed space. Since box queries are axis aligned, there are several implementational advantages that can be exploited to speed up the retrieval of query results using R-Tree-like [9] indexing schemes. For two-dimensional data, the transformation is precise. For more than two dimensions, we propose a space transformation scheme based on disjoint planar rotation and a new type of query, the pruning box query, to get the precise results. Experimental results with large synthetic databases and some real databases show the effectiveness of the proposed transformation scheme. These experimental results have been corroborated with suitable mathematical models. In disjoint planar rotation, additional computation time is required to remove the false positives produced because the bounding box is not precise. A second topological transformation scheme is presented based on an optimized bounding box, which reduces the number of false positives. This reduction grows as the number of dimensions increases. The optimized bounding box for higher dimensions is computed based on a novel approach of simultaneous local optimal projections.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 195
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2015-04-04
    Description: In a relational database, tuples are called “duplicate” if they describe the same real-world entity. If such duplicate tuples are observed, it is recommended to remove them and to replace them with one tuple that represents the joint information of the duplicate tuples to a maximal extent. This remove-and-replace operation is called a fusion operation. Within the setting of a relational database management system, the removal of the original duplicate tuples can breach referential integrity. In this paper, a strategy is proposed to maintain referential integrity in a semantically correct manner, thereby optimizing the quality of relationships in the database. An algorithm is proposed that is able to propagate a fusion operation through the entire database. The algorithm is based on a framework of first and second order fusion functions on the one hand, and conflict resolution strategies on the other hand. It is shown how classical strategies for maintaining referential integrity, such as delete cascading, are highly specialized cases of the proposed framework. Experimental results are reported that (i) show the efficiency of the proposed algorithm and (ii) show the differences in quality between several second order fusion functions. It is shown that some strategies easily outperform delete cascading.
    Print ISSN: 1041-4347
    Electronic ISSN: 1558-2191
    Topics: Computer Science
  • 196
    Publication Date: 2015-03-28
    Description: An image analysis procedure based on a two-dimensional Gaussian fitting is presented and applied to satellite maps describing the surface urban heat island (SUHI). The application of this fitting technique allows us to parameterize the SUHI pattern in order to better understand its intensity trend and also to perform quantitative comparisons among different images in time and space. The proposed procedure is computationally rapid and stable, executing an initial guess parameter estimation by a multiple regression before the iterative nonlinear fitting. The Gaussian fit was applied to both low and high resolution images (1 km and 30 m pixel size) and the results of the SUHI parameterization are shown. As expected, a reduction of the correlation coefficient between the map values and the Gaussian surface was observed for the image with the higher spatial resolution due to the greater variability of the SUHI values. Since the fitting procedure provides a smoothed Gaussian surface, it performs better when applied to low resolution images, although the reliability of the SUHI pattern representation can also be preserved for high resolution images.
    Electronic ISSN: 1999-4893
    Topics: Computer Science
    Published by MDPI Publishing
  • 197
    Publication Date: 2015-04-23
    Description: The auxiliary problem principle is a powerful tool for solving the multi-area economic dispatch problem. One of the main drawbacks of the auxiliary problem principle method is that its convergence performance depends on the selection of the penalty parameter. In this paper, we propose a self-adaptive strategy to adjust the penalty parameter based on the iterative information. The proposed approach is verified on two given test systems. The corresponding simulation results demonstrate that the proposed self-adaptive auxiliary problem principle iterative scheme is robust with respect to the selection of the penalty parameter and has a better convergence rate than the traditional auxiliary problem principle method.
    Electronic ISSN: 1999-4893
    Topics: Computer Science
    Published by MDPI Publishing
  • 198
    Publication Date: 2015-04-14
    Description: Aiming to improve the well-known fuzzy compactness and separation algorithm (FCS), this paper proposes a new clustering algorithm based on feature weighting fuzzy compactness and separation (WFCS). In view of the contribution of features to clustering, the proposed algorithm introduces feature weighting into the objective function. We first formulate the membership and feature weighting, and analyze the membership of data points falling on the crisp boundary, then give the adjustment strategy. The proposed WFCS is validated on both simulated and real datasets. The experimental results demonstrate that the proposed WFCS has the characteristics of both hard clustering and fuzzy clustering, and outperforms many existing clustering algorithms with respect to three metrics: Rand Index, Xie-Beni Index and Within-Between (WB) Index.
    Electronic ISSN: 1999-4893
    Topics: Computer Science
    Published by MDPI Publishing
  • 199
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2015-04-18
    Description: This installment of Computer's series highlighting the work published in IEEE Computer Society journals comes from IEEE Transactions on Pattern Analysis and Machine Intelligence.
    Print ISSN: 0018-9162
    Electronic ISSN: 1558-0814
    Topics: Computer Science
  • 200
    Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2015-04-18
    Description: A summary of articles published in Computer 32 and 16 years ago.
    Print ISSN: 0018-9162
    Electronic ISSN: 1558-0814
    Topics: Computer Science