ALBERT — All Library Books, journals and Electronic Records Telegrafenberg

11

Unknown

Graph Learning for Combinatorial Optimization: A Survey of State-of-the-Art (2021)

Peng, Yun ; Choi, Byron ; Xu, Jianliang

Springer

In: Data Science and Engineering. 2021; Published 2021 Apr 28. doi: 10.1007/s41019-021-00155-3. [early online release]

Publication Date: 2021-04-28

Description: Graphs have been widely used to represent complex data in many applications, such as e-commerce, social networks, and bioinformatics. Efficient and effective analysis of graph data is important for graph-based applications. However, most graph analysis tasks are combinatorial optimization (CO) problems, which are NP-hard. Many recent studies have focused on the potential of using machine learning (ML) to solve graph-based CO problems. Most recent methods follow a two-stage framework. The first stage is graph representation learning, which embeds the graphs into low-dimensional vectors. The second stage uses machine learning to solve the CO problems using the embeddings of the graphs learned in the first stage. The works for the first stage can be classified into two categories: graph embedding methods and end-to-end learning methods. For graph embedding methods, the learning of the embeddings of the graphs has its own objective, which may not rely on the CO problems to be solved. The CO problems are solved by independent downstream tasks. For end-to-end learning methods, the learning of the embeddings of the graphs does not have its own objective and is an intermediate step of the learning procedure of solving the CO problems. The works for the second stage can also be classified into two categories: non-autoregressive methods and autoregressive methods. Non-autoregressive methods predict a solution for a CO problem in one shot. A non-autoregressive method predicts a matrix that denotes the probability of each node/edge being a part of a solution of the CO problem. The solution can be computed from the matrix using search heuristics such as beam search. Autoregressive methods iteratively extend a partial solution step by step. At each step, an autoregressive method predicts a node/edge conditioned on the current partial solution, and the predicted node/edge is used to extend that partial solution.
In this survey, we provide a thorough overview of recent studies of the graph learning-based CO methods. The survey ends with several remarks on future research directions.
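
As a rough illustration of the one-shot (non-autoregressive) decoding step described above, the sketch below turns per-node probabilities into a feasible solution for the maximum independent set problem. The probabilities are made-up numbers standing in for a model's output, the function name `decode_independent_set` is hypothetical, and a plain greedy pass is used here in place of beam search.

```python
def decode_independent_set(adj, prob):
    """Greedily build an independent set from per-node probabilities.

    adj  -- dict mapping each node to a list of its neighbours
    prob -- dict mapping each node to a (hypothetical) predicted probability
            of belonging to the solution
    """
    chosen = set()
    # Visit nodes from most to least probable.
    for v in sorted(prob, key=prob.get, reverse=True):
        # Keep v only if none of its neighbours was already selected.
        if all(u not in chosen for u in adj[v]):
            chosen.add(v)
    return chosen


# Path graph 0 - 1 - 2 with illustrative probabilities.
adj = {0: [1], 1: [0, 2], 2: [1]}
solution = decode_independent_set(adj, {0: 0.9, 1: 0.8, 2: 0.7})
```

An autoregressive method would instead recompute the probabilities after every selection, conditioning them on the nodes chosen so far.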

Print ISSN: 2364-1185

Electronic ISSN: 2364-1541

Topics: Computer Science

Published by Springer

12

Unknown

A Workload-Adaptive Streaming Partitioner for Distributed Graph Stores (2021)

Davoudian, Ali ; Chen, Liu ; Tu, Hongwei ; [et al.]

Springer

In: Data Science and Engineering. 2021; Published 2021 Apr 15. doi: 10.1007/s41019-021-00156-2. [early online release]

Publication Date: 2021-04-15

Description: Streaming graph partitioning methods have recently gained attention due to their ability to scale to very large graphs with limited resources. However, many such methods do not consider workload and graph characteristics. This may degrade the performance of queries by increasing inter-node communication and computational load imbalance. Moreover, existing workload-aware methods cannot consistently provide good performance as they do not consider dynamic workloads that keep emerging in graph applications. We address these issues by proposing a novel workload-adaptive streaming partitioner named WASP, which aims to achieve low-latency and high-throughput online graph queries. As each workload typically contains frequent query patterns, WASP exploits the existing workload to capture active vertices and edges, which are frequently visited and traversed, respectively. This information is used to heuristically improve the quality of partitions, either by avoiding the concentration of active vertices in a few partitions in proportion to their visit frequencies or by reducing the probability of cutting active edges in proportion to their traversal frequencies. In order to assess the impact of WASP on a graph store and to show how easily the approach can be plugged on top of the system, we exploit it in a distributed graph-based RDF store. Our experiments over three synthetic and real-world graph datasets and the corresponding static and dynamic query workloads show that WASP achieves better query performance than state-of-the-art graph partitioners, especially in dynamic query workloads.

Print ISSN: 2364-1185

Electronic ISSN: 2364-1541

Topics: Computer Science

Published by Springer

13

Unknown

Keyword Search on Large Graphs: A Survey (2021)

Yang, Jianye ; Yao, Wu ; Zhang, Wenjie

Springer

In: Data Science and Engineering. 2021; Published 2021 Mar 31. doi: 10.1007/s41019-021-00154-4. [early online release]

Publication Date: 2021-03-31

Description: With the prevalence of Internet access and online services, various big graphs are generated in many real applications (e.g., online social networks and knowledge graphs). An important task in analyzing and mining these graphs is keyword search. Essentially, given a graph G and a query Q associated with a set of keywords, keyword search aims to find a substructure (e.g., rooted tree or subgraph) S in G such that the nodes in S collectively cover part of or all keywords in Q and, at the same time, S is optimal under some user-specified semantics. Keyword search on graphs can be applied in many real-life applications, such as point-of-interest recommendation and web search. Despite the great importance of graph keyword search, we notice that the latest survey on this topic is far out of date. Consequently, there is a pressing need to conduct a comprehensive survey in this research direction. Motivated by this, in this survey, we systematically review graph keyword search studies by classifying the existing works into different categories according to the specific problem definition. This survey aims to provide researchers with a comprehensive understanding of existing graph keyword search solutions.

Print ISSN: 2364-1185

Electronic ISSN: 2364-1541

Topics: Computer Science

Published by Springer

14

Unknown

Querying Optimal Routes for Group Meetup (2021)

Chen, Bo ; Zhu, Huaijie ; Liu, Wei ; [et al.]

Springer

In: Data Science and Engineering. 2021; Published 2021 Mar 15. doi: 10.1007/s41019-021-00153-5. [early online release]

Publication Date: 2021-03-15

Description: Motivated by location-based social networks which allow people to access location-based services as a group, we study a novel variant of optimal sequenced route (OSR) queries, optimal sequenced route for group meetup (OSR-G) queries. An OSR-G query aims to find the optimal meeting POI (point of interest) such that the maximum route distance of any user to the meeting POI is minimized after each user visits a number of POIs of specific categories (e.g., gas stations, restaurants, and shopping malls) in a particular order. To process OSR-G queries, we first propose an OSR-Based (OSRB) algorithm as our baseline, which examines every POI in the meeting category and utilizes an existing OSR algorithm (called E-OSR) to compute the optimal route for each user to the meeting POI. To address the shortcoming of OSRB (i.e., having to examine every POI in the meeting category), we propose an upper bound based filtering algorithm, called the circle filtering (CF) algorithm, which exploits the circle property to filter out unpromising meeting POIs. In addition, we propose a lower bound based pruning (LBP) algorithm, namely LBP-SP, which exploits a shortest path lower bound to prune unqualified meeting POIs and reduce the search space. Furthermore, we develop an approximate algorithm, namely APS, to accelerate OSR-G queries with a good approximation ratio. Finally, the experimental results show that both CF and LBP-SP outperform the OSRB algorithm and have high pruning rates. Moreover, the proposed approximate algorithm runs faster than the exact OSR-G algorithms and has a good approximation ratio.
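
To make the min-max objective concrete, here is a minimal brute-force sketch in the spirit of the baseline described above: it examines every candidate meeting POI and, for each user, computes the shortest route that visits one POI per category in order before reaching the meeting POI. It is not the authors' implementation. Euclidean distance, the point-tuple data layout, and the function names are all assumptions made for illustration, and the per-user route is found with a simple dynamic program rather than the E-OSR algorithm.

```python
import math


def sequenced_route_length(start, category_pois, target):
    """Length of the shortest route from start to target that visits
    one POI from each category, in the given category order."""
    # best[p] = shortest length of a partial route that currently ends at p
    best = {start: 0.0}
    for pois in category_pois:
        best = {
            p: min(d + math.dist(q, p) for q, d in best.items())
            for p in pois
        }
    return min(d + math.dist(q, target) for q, d in best.items())


def best_meeting_poi(users, category_pois, meeting_pois):
    """Return the meeting POI that minimizes the longest user route."""
    def worst_route(m):
        return max(sequenced_route_length(u, category_pois, m) for u in users)
    return min(meeting_pois, key=worst_route)


# Two users, one intermediate category, two candidate meeting POIs.
users = [(0, 0), (10, 0)]
category_pois = [[(2, 0), (8, 0)]]
meeting = best_meeting_poi(users, category_pois, [(5, 0), (0, 0)])
```

Filtering and pruning strategies such as those named in the abstract would aim to skip most of the candidates that this sketch evaluates exhaustively.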

Print ISSN: 2364-1185

Electronic ISSN: 2364-1541

Topics: Computer Science

Published by Springer

15

Unknown

Exploring Means to Enhance the Efficiency of GPU Bitmap Index Query Processing (2020)

Tran, Brandon ; Schaffner, Brennan ; Myre, Joseph M. ; [et al.]

Springer

In: Data Science and Engineering. 2020; Published 2020 Nov 30. doi: 10.1007/s41019-020-00148-8. [early online release]

Publication Date: 2020-11-30

Description: Once exotic, computational accelerators are now commonly available in many computing systems. Graphics processing units (GPUs) are perhaps the most frequently encountered computational accelerators. Recent work has shown that GPUs are beneficial when analyzing massive data sets. Specifically related to this study, it has been demonstrated that GPUs can significantly reduce the query processing time of database bitmap index queries. Bitmap indices are typically used for large, read-only data sets and are often compressed using some form of hybrid run-length compression. In this paper, we present three GPU algorithm enhancement strategies for executing queries of bitmap indices compressed using word aligned hybrid compression: (1) data structure reuse (2) metadata creation with various type alignment and (3) a preallocated memory pool. The data structure reuse greatly reduces the number of costly memory system calls. The use of metadata exploits the immutable nature of bitmaps to pre-calculate and store necessary intermediate processing results. This metadata reduces the number of required query-time processing steps. Preallocating a memory pool can reduce or entirely remove the overhead of memory operations during query processing. Our empirical study showed that performing a combination of these strategies can achieve 32.4× to 98.7× speedup over the current state-of-the-art implementation. Our study also showed that by using our enhancements, a common gaming GPU can achieve a 15.0× speedup over a more expensive high-end CPU.

Print ISSN: 2364-1185

Electronic ISSN: 2364-1541

Topics: Computer Science

Published by Springer

16

Unknown

How Good Are Modern Spatial Libraries? (2020)

Pandey, Varun ; van Renen, Alexander ; Kipf, Andreas ; [et al.]

Springer

In: Data Science and Engineering. 2020; Published 2020 Nov 07. doi: 10.1007/s41019-020-00147-9. [early online release]

Publication Date: 2020-11-07

Description: Many applications today, such as Uber, Yelp, and Tinder, rely on spatial data or locations from their users. These applications and services either build their own spatial data management systems or rely on existing solutions. JTS Topology Suite (JTS), its C++ port GEOS, Google S2, ESRI Geometry API, and Java Spatial Index (JSI) are some of the spatial processing libraries that these systems build upon. These applications and services depend on indexing capabilities available in these libraries for high-performance spatial query processing. In this work, we compare these libraries qualitatively and quantitatively based on four different spatial queries using two real-world datasets. We also compare these libraries with an open-source implementation of the Vantage Point Tree, an index structure that has been well studied in image retrieval and nearest-neighbor search algorithms for high-dimensional data. We found that Vantage Point Trees are very competitive and even outperform the aforementioned libraries in two queries.

Print ISSN: 2364-1185

Electronic ISSN: 2364-1541

Topics: Computer Science

Published by Springer

17

Unknown

Blocking Techniques for Entity Linkage: A Semantics-Based Approach (2020)

Azzalini, Fabio ; Jin, Songle ; Renzi, Marco ; [et al.]

Springer

In: Data Science and Engineering. 2020; Published 2020 Nov 03. doi: 10.1007/s41019-020-00146-w. [early online release]

Publication Date: 2020-11-03

Description: Nowadays, data integration must often manage noisy data, including attribute values written in natural language such as product descriptions or book reviews. In the data integration process, Entity Linkage has the role of identifying records that contain information referring to the same object. Modern Entity Linkage methods, in order to reduce the dimension of the problem, partition the initial search space into “blocks” of records that can be considered similar according to some metrics, and then compare only the records belonging to the same block, thus greatly reducing the overall complexity of the algorithm. In this paper, we propose two automatic blocking strategies that, unlike the traditional methods, aim at capturing the semantic properties of data by means of recent deep learning frameworks. Both methods, in a first phase, exploit recent research on tuple and sentence embeddings to transform the database records into real-valued vectors; in a second phase, to arrange the tuples inside the blocks, one of them adopts approximate nearest neighbourhood algorithms, while the other one uses dimensionality reduction techniques combined with clustering algorithms. We train our blocking models on an external, independent corpus, and then we directly apply them to new datasets in an unsupervised fashion. Our choice is motivated by the fact that, in most data integration scenarios, no training data are actually available. We tested our systems on six popular datasets and compared their performance against five traditional blocking algorithms. The test results demonstrated that our deep-learning-based blocking solutions outperform standard blocking algorithms, especially on textual and noisy data.

Print ISSN: 2364-1185

Electronic ISSN: 2364-1541

Topics: Computer Science

Published by Springer

18

Unknown

Parrot: A Progressive Analysis System on Large Text Collections (2020)

Zhang, Yazhong ; Zhang, Hanbing ; He, Zhenying ; [et al.]

Springer

In: Data Science and Engineering. 2020; 6(1): 1-19. Published 2020 Oct 22. doi: 10.1007/s41019-020-00144-y.

Publication Date: 2020-10-22

Description: The size of textual data continues to grow along with the need for timely and cost-effective analysis, while the growth of computation power cannot keep up with the growth of data. The delays when processing huge textual data can negatively impact user activity and insight. This calls for a paradigm shift from blocking to progressive processing. In this paper, we propose a sample-based progressive processing model that focuses on term frequency calculation on text. The model is based on an incremental execution engine and calculates a series of approximate results for a single query in a progressive way to provide a smooth trade-off between accuracy and latency. As part of this model, we propose a new variant of the bootstrap technique to quantify result error progressively. We implemented this method in our system called Parrot on top of Apache Spark and used real-world data to test its performance. Experiments demonstrate that our method is 2.4×–19.7× faster to get a result within 1% error while the confidence interval always covers the accurate results very well.
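
The idea of returning a series of approximate results for one query can be sketched in a few lines. The generator below is only an illustration under simplifying assumptions and is not Parrot's engine: the documents are assumed to be already in random order so that each prefix acts as a sample, whitespace tokenization is used, the function name is hypothetical, and the bootstrap-based error estimate mentioned in the abstract is left out.

```python
def progressive_term_frequency(docs, term, batch_size):
    """Yield (fraction_processed, estimated_total_count) after each batch.

    docs is assumed to be shuffled, so the documents seen so far form a
    random sample of the whole collection.
    """
    total = len(docs)
    seen = 0
    hits = 0
    for start in range(0, total, batch_size):
        batch = docs[start:start + batch_size]
        hits += sum(doc.split().count(term) for doc in batch)
        seen += len(batch)
        # Scale the count observed in the sample up to the full collection.
        yield seen / total, hits * total / seen


docs = ["a b a", "b", "a", "c"]
estimates = list(progressive_term_frequency(docs, "a", batch_size=2))
```

Each yielded estimate is available as soon as its batch is processed, and the last one equals the exact count because the whole collection has been read by then.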

Print ISSN: 2364-1185

Electronic ISSN: 2364-1541

Topics: Computer Science

Published by Springer

19

Unknown

Heterogeneous CPU-GPU Epsilon Grid Joins: Static and Dynamic Work Partitioning Strategies (2020)

Gallet, Benoit ; Gowanlock, Michael

Springer

In: Data Science and Engineering. 2020; Published 2020 Oct 21. doi: 10.1007/s41019-020-00145-x. [early online release]

Publication Date: 2020-10-21

Description: Given two datasets (or tables) A and B and a search distance ϵ, the distance similarity join, denoted as A ⋉_ϵ B, finds the pairs of points (p_a, p_b), where p_a ∈ A and p_b ∈ B, and such that the distance between p_a and p_b is ≤ ϵ. If A = B, then the similarity join is equivalent to a similarity self-join, denoted as A ⋈_ϵ A. We propose in this paper Heterogeneous Epsilon Grid Joins (HEGJoin), a heterogeneous CPU-GPU distance similarity join algorithm. Efficiently partitioning the work between the CPU and the GPU is a challenge. Indeed, the work partitioning strategy needs to consider the different characteristics and computational throughput of the processors (CPU and GPU), as well as the data-dependent nature of the similarity join that accounts in the overall execution time (e.g., the number of queries, their distribution, the dimensionality, etc.). In addition to HEGJoin, we design in this paper a dynamic and two static work partitioning strategies. We also propose a performance model for each static partitioning strategy to perform the distribution of the work between the processors. We evaluate the performance of all three partitioning methods by considering the execution time and the load imbalance between the CPU and GPU as performance metrics. HEGJoin achieves a speedup of up to 5.46× (3.97×) over the GPU-only (CPU-only) algorithms on our first test platform and up to 1.97× (12.07×) on our second test platform over the GPU-only (CPU-only) algorithms.
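
The join definition above can be illustrated with a small single-threaded sketch that indexes B in a grid of cells with side length ϵ, so that each point of A is compared only against points in its own and adjacent cells. This is not HEGJoin and involves no CPU-GPU work partitioning. Euclidean distance, the tuple-based point representation, and the function name `epsilon_join` are assumptions made for illustration.

```python
import math
from collections import defaultdict
from itertools import product


def epsilon_join(A, B, eps):
    """Return sorted index pairs (i, j) with dist(A[i], B[j]) <= eps."""
    dim = len(A[0]) if A else 0
    # Bucket the points of B into grid cells of side length eps.
    grid = defaultdict(list)
    for j, q in enumerate(B):
        grid[tuple(math.floor(c / eps) for c in q)].append(j)
    # Offsets to the cell itself and all adjacent cells.
    offsets = list(product((-1, 0, 1), repeat=dim))
    result = []
    for i, p in enumerate(A):
        cell = tuple(math.floor(c / eps) for c in p)
        # Points within eps of p can only lie in these neighbouring cells.
        for off in offsets:
            neighbour = tuple(c + o for c, o in zip(cell, off))
            for j in grid.get(neighbour, ()):
                if math.dist(p, B[j]) <= eps:
                    result.append((i, j))
    return sorted(result)


A = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
B = [(0.5, 0.0), (1.0, 2.0), (5.0, 5.9)]
pairs = epsilon_join(A, B, eps=1.0)
```

Calling `epsilon_join(A, A, eps)` gives the self-join case, in which every point is also paired with itself.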

Print ISSN: 2364-1185

Electronic ISSN: 2364-1541

Topics: Computer Science

Published by Springer
