ACM Transactions on Database Systems
Latest Publications





Published By Association For Computing Machinery


2021, Vol 46 (4), pp. 1-49
Alejandro Grez, Cristian Riveros, Martín Ugarte, Stijn Vansummeren

Complex event recognition (CER) has emerged as the unifying field for technologies that require processing and correlating distributed data sources in real time. CER finds applications in diverse domains, which has resulted in a large number of proposals for expressing and processing complex events. Existing CER languages lack a clear semantics, however, which makes them hard to understand and generalize. Moreover, there are no general techniques for evaluating CER query languages with clear performance guarantees. In this article, we embark on the task of giving a rigorous and efficient framework to CER. We propose a formal language for specifying complex events, called complex event logic (CEL), that contains the main features used in the literature and has a denotational and compositional semantics. We also formalize the so-called selection strategies, which had only been presented as by-design extensions to existing frameworks. We give insight into the language design trade-offs regarding the strict sequencing operators of CEL and selection strategies. With a well-defined semantics at hand, we discuss how to efficiently process complex events by evaluating CEL formulas with unary filters. We start by introducing a formal computational model for CER, called complex event automata (CEA), and study how to compile CEL formulas with unary filters into CEA. Furthermore, we provide efficient algorithms for evaluating CEA over event streams using constant time per event followed by output-linear delay enumeration of the results.
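To make the idea of a sequencing operator with unary filters concrete, here is a minimal Python sketch. It is not CEL or the CEA evaluation algorithm from the article: the function name `match_sequence`, the event layout (dicts with `type` and `value`), and the "report every pair" behaviour are assumptions chosen for illustration, and the sketch makes no claim about constant time per event.

```python
def match_sequence(stream, first, second):
    """Yield (i, j) with i < j where stream[i] passes `first` and stream[j] passes `second`.

    `first` and `second` are unary filters: each looks at one event only.
    Other events may occur between positions i and j (non-strict sequencing).
    """
    open_positions = []  # positions of earlier events that passed `first`
    for j, event in enumerate(stream):
        if second(event):
            for i in open_positions:
                yield (i, j)
        if first(event):
            open_positions.append(j)


# Hypothetical sensor stream: temperature (T) and humidity (H) readings.
events = [
    {"type": "T", "value": -2},
    {"type": "H", "value": 30},
    {"type": "T", "value": -5},
    {"type": "H", "value": 96},
]
is_freezing = lambda e: e["type"] == "T" and e["value"] < 0
is_humid = lambda e: e["type"] == "H" and e["value"] > 95
print(list(match_sequence(events, is_freezing, is_humid)))
```

Each reported pair is one complex event, identified by the positions of the events that form it. A selection strategy, in the sense of the abstract, would restrict which of these pairs are reported.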

2021, Vol 46 (4), pp. 1-45
Chenhao Ma, Yixiang Fang, Reynold Cheng, Laks V. S. Lakshmanan, Wenjie Zhang, et al.

Given a directed graph G, the directed densest subgraph (DDS) problem is to find a subgraph of G whose density is the highest among all subgraphs of G. The DDS problem is fundamental to a wide range of applications, such as fraud detection, community mining, and graph compression. However, existing DDS solutions suffer from efficiency and scalability problems: on a 3,000-edge graph, it takes three days for one of the best exact algorithms to complete. In this article, we develop an efficient and scalable DDS solution. We introduce the notion of the [x, y]-core, which is a dense subgraph of G, and show that the densest subgraph can be accurately located through the [x, y]-core with theoretical guarantees. Based on the [x, y]-core, we develop exact and approximation algorithms. We further study the problems of maintaining the DDS over dynamic directed graphs and finding the weighted DDS on weighted directed graphs, and we develop efficient non-trivial algorithms to solve these two problems by extending our DDS algorithms. We have performed an extensive evaluation of our approaches on 15 large real datasets. The results show that our proposed solutions are up to six orders of magnitude faster than the state of the art.
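The abstract does not spell out its definitions, so the following Python sketch rests on two assumptions taken from the usual formulation of the problem: the density of a vertex-set pair (S, T) is |E(S, T)| / sqrt(|S| · |T|), and the [x, y]-core is the largest (S, T)-induced subgraph in which every vertex of S has at least x out-edges into T and every vertex of T has at least y in-edges from S. The function names are hypothetical, and the simple peeling loop is for illustration only; it is not the article's algorithm.

```python
from math import sqrt


def density(edges, S, T):
    """Assumed density measure: edges from S to T, divided by sqrt(|S| * |T|)."""
    if not S or not T:
        return 0.0
    count = sum(1 for u, v in edges if u in S and v in T)
    return count / sqrt(len(S) * len(T))


def xy_core(edges, x, y):
    """Peel vertices until all of S have out-degree >= x and all of T have in-degree >= y."""
    edges = set(edges)
    S = {u for u, _ in edges}
    T = {v for _, v in edges}
    while True:
        out_deg = {u: 0 for u in S}
        in_deg = {v: 0 for v in T}
        for u, v in edges:
            if u in S and v in T:
                out_deg[u] += 1
                in_deg[v] += 1
        drop_S = {u for u in S if out_deg[u] < x}
        drop_T = {v for v in T if in_deg[v] < y}
        if not drop_S and not drop_T:
            return S, T
        S -= drop_S
        T -= drop_T


edges = [("a", "c"), ("a", "d"), ("b", "c"), ("b", "d"), ("e", "c")]
S, T = xy_core(edges, 2, 2)
print(sorted(S), sorted(T), density(edges, S, T))
```

On this toy graph the [2, 2]-core keeps sources a and b and targets c and d, since vertex e has only one outgoing edge. The core is a candidate region for the densest subgraph, not necessarily the densest subgraph itself.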

2021, Vol 46 (4), pp. 1-40
Michael Benedikt, Pierre Bourhis, Louis Jachiet, Efthymia Tsamoura

We study the design of data publishing mechanisms that allow a collection of autonomous distributed data sources to collaborate to support queries. A common mechanism for data publishing is via views: functions that expose derived data to users, usually specified as declarative queries. Our autonomy assumption is that the views must be on individual sources, but with the intention of supporting integrated queries. In deciding what data to expose to users, two considerations must be balanced. The views must be sufficiently expressive to support the queries that users want to ask; this is the utility of the publishing mechanism. But there may also be some expressiveness restrictions. Here, we consider two restrictions: a minimal information requirement, saying that the views should reveal as little as possible while supporting the utility query, and a non-disclosure requirement, formalizing the need to prevent external users from computing information that data owners do not want revealed. We investigate the problem of designing views that satisfy both expressiveness and inexpressiveness requirements, for views in a restricted query language (conjunctive queries), and for arbitrary views.

2021, Vol 46 (4), pp. 1-35
Shikha Singh, Prashant Pandey, Michael A. Bender, Jonathan W. Berry, Martín Farach-Colton, et al.

Given an input stream S of size N, a ɸ-heavy hitter is an item that occurs at least ɸN times in S. The problem of finding heavy hitters is extensively studied in the database literature. We study a real-time heavy-hitters variant in which an element must be reported shortly after we see its T = ɸN-th occurrence (and hence it becomes a heavy hitter). We call this the Timely Event Detection (TED) Problem. The TED problem models the needs of many real-world monitoring systems, which demand accurate (i.e., no false negatives) and timely reporting of all events from large, high-speed streams with a low reporting threshold (high sensitivity). Like the classic heavy-hitters problem, solving the TED problem without false positives requires large space (Ω(N) words). Thus in-RAM heavy-hitters algorithms typically sacrifice accuracy (i.e., allow false positives), sensitivity, or timeliness (i.e., use multiple passes). We show how to adapt heavy-hitters algorithms to external memory to solve the TED problem on large high-speed streams while guaranteeing accuracy, sensitivity, and timeliness. Our data structures are limited only by I/O bandwidth (not latency) and support a tunable tradeoff between reporting delay and I/O overhead. With a small bounded reporting delay, our algorithms incur only a logarithmic I/O overhead. We implement and validate our data structures empirically using the Firehose streaming benchmark. Multi-threaded versions of our structures can scale to process 11M observations per second before becoming CPU bound. In comparison, a naive adaptation of the standard heavy-hitters algorithm to external memory would be limited by the storage device’s random I/O throughput, i.e., ≈100K observations per second.
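The reporting rule in the problem statement can be sketched in a few lines of Python. This is the exact in-memory baseline, with one counter per distinct item, so it shows the large-space behaviour that the abstract attributes to exact solutions; it is not the external-memory data structure from the article. The function name and the assumption that N is known in advance are illustrative.

```python
from collections import Counter
from math import ceil


def timely_heavy_hitters(stream, phi):
    """Yield (position, item) at the moment an item reaches its T-th occurrence, T = ceil(phi * N)."""
    stream = list(stream)                 # assumes the stream size N is known up front
    threshold = ceil(phi * len(stream))
    counts = Counter()
    for position, item in enumerate(stream):
        counts[item] += 1
        if counts[item] == threshold:     # report exactly once, with zero delay
            yield position, item


print(list(timely_heavy_hitters("abacabad", 0.5)))
```

With N = 8 and ɸ = 0.5 the threshold is T = 4, and the item `a` is reported at position 6, where its fourth occurrence arrives.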

2021, Vol 46 (3), pp. 1-39
Mahmoud Abo Khamis, Phokion G. Kolaitis, Hung Q. Ngo, Dan Suciu

The query containment problem is a fundamental algorithmic problem in data management. While this problem is well understood under set semantics, it is far less understood under bag semantics. In particular, it is a long-standing open question whether the conjunctive query containment problem under bag semantics is decidable. We unveil tight connections between information theory and conjunctive query containment under bag semantics. These connections are established using information inequalities, which are considered to be the laws of information theory. Our first main result asserts that deciding the validity of a generalization of information inequalities is many-one equivalent to the restricted case of conjunctive query containment in which the containing query is acyclic; thus, either both these problems are decidable or both are undecidable. Our second main result identifies a new decidable case of the conjunctive query containment problem under bag semantics. Specifically, we give an exponential-time algorithm for conjunctive query containment under bag semantics, provided the containing query is chordal and admits a simple junction tree.
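To illustrate what bag semantics means for a Boolean conjunctive query, the Python sketch below counts satisfying assignments weighted by tuple multiplicities. The query encoding (a list of `(relation, variables)` atoms), the function name, and the brute-force enumeration are assumptions made for illustration; this is not the article's decision procedure. Comparing counts on a single database can refute containment, but it can never prove it.

```python
from collections import Counter
from itertools import product


def bag_count(query, db):
    """Multiplicity of a Boolean conjunctive query's answer under bag semantics.

    query: list of (relation_name, tuple_of_variable_names)
    db:    dict relation_name -> list of tuples (duplicates are meaningful)
    """
    variables = sorted({v for _, args in query for v in args})
    domain = sorted({c for rows in db.values() for row in rows for c in row})
    mult = {rel: Counter(rows) for rel, rows in db.items()}
    total = 0
    for values in product(domain, repeat=len(variables)):
        env = dict(zip(variables, values))
        weight = 1
        for rel, args in query:
            weight *= mult.get(rel, Counter())[tuple(env[a] for a in args)]
            if weight == 0:
                break
        total += weight
    return total


db = {"R": [(1, 2), (1, 2), (2, 3)]}          # (1, 2) occurs twice
q1 = [("R", ("x", "y"))]
q2 = [("R", ("x", "y")), ("R", ("x", "z"))]
print(bag_count(q1, db), bag_count(q2, db))
```

Here `q1` has multiplicity 3 and `q2` has multiplicity 5, which is consistent with `q1` being contained in `q2` on this database. A database on which the first count exceeded the second would be a counterexample.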

2021, Vol 46 (3), pp. 1-44
Xuelian Lin, Shuai Ma, Jiahao Jiang, Yanchen Hou, Tianyu Wo

Nowadays, various sensors are collecting, storing, and transmitting tremendous amounts of trajectory data, and it is well known that storage, network bandwidth, and computing resources can be heavily wasted if raw trajectory data is used directly. Line simplification algorithms are an effective approach to this issue: they compress a trajectory into a set of continuous line segments and are commonly used in practice. In this article, we first classify the error-bounded line simplification algorithms into different categories and review each category of algorithms. We then study the data aging problem of line simplification algorithms and distance metrics from the views of aging friendliness and aging errors. Finally, we present a systematic experimental evaluation of representative error-bounded line simplification algorithms, including both compression-optimal and sub-optimal methods, in terms of the commonly adopted perpendicular Euclidean, synchronous Euclidean, and direction-aware distances. Using real-life trajectory datasets, we systematically evaluate and analyze the performance (compression ratio, average error, running time, aging friendliness, and query friendliness) of error-bounded line simplification algorithms with respect to distance metrics, trajectory sizes, and error bounds. Our study provides a full picture of error-bounded line simplification algorithms, which leads to guidelines on how to choose appropriate algorithms and distance metrics for practical applications.
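As a concrete instance of error-bounded line simplification, the sketch below implements the classic Douglas-Peucker algorithm with the perpendicular Euclidean distance. Choosing this particular algorithm is an assumption made for illustration: the abstract does not list which algorithms the article evaluates. Points are plain `(x, y)` tuples and timestamps are ignored.

```python
from math import hypot


def perpendicular_distance(p, a, b):
    """Distance from point p to the line through a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length = hypot(dx, dy)
    if length == 0.0:                      # a and b coincide
        return hypot(px - ax, py - ay)
    return abs(dx * (py - ay) - dy * (px - ax)) / length


def douglas_peucker(points, epsilon):
    """Keep a subset of points so every dropped point is within epsilon of its segment's line."""
    if len(points) < 3:
        return list(points)
    first, last = points[0], points[-1]
    index, worst = 0, -1.0
    for i in range(1, len(points) - 1):
        d = perpendicular_distance(points[i], first, last)
        if d > worst:
            index, worst = i, d
    if worst <= epsilon:                   # the whole run fits one segment
        return [first, last]
    left = douglas_peucker(points[: index + 1], epsilon)
    right = douglas_peucker(points[index:], epsilon)
    return left[:-1] + right               # the split point appears once


track = [(0, 0), (1, 0.1), (2, 0), (3, 2), (4, 0)]
print(douglas_peucker(track, 0.5))
```

With an error bound of 0.5 the point `(1, 0.1)` is dropped and the other four are kept, so five points compress to four. A larger bound trades accuracy for a higher compression ratio.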

2021, Vol 46 (3), pp. 1-45
Immanuel Trummer, Junxiong Wang, Ziyun Wei, Deepak Maram, Samuel Moseley, et al.

SkinnerDB uses reinforcement learning for reliable join ordering, exploiting an adaptive processing engine with specialized join algorithms and data structures. It maintains no data statistics and uses no cost or cardinality models. Also, it uses no training workloads nor does it try to link the current query to seemingly similar queries in the past. Instead, it uses reinforcement learning to learn optimal join orders from scratch during the execution of the current query. To that purpose, it divides the execution of a query into many small time slices. Different join orders are tried in different time slices. SkinnerDB merges result tuples generated according to different join orders until a complete query result is obtained. By measuring execution progress per time slice, it identifies promising join orders as execution proceeds. Along with SkinnerDB, we introduce a new quality criterion for query execution strategies. We upper-bound expected execution cost regret, i.e., the expected amount of execution cost wasted due to sub-optimal join order choices. SkinnerDB features multiple execution strategies that are optimized for that criterion. Some of them can be executed on top of existing database systems. For maximal performance, we introduce a customized execution engine, facilitating fast join order switching via specialized multi-way join algorithms and tuple representations. We experimentally compare SkinnerDB’s performance against various baselines, including MonetDB, Postgres, and adaptive processing methods. We consider various benchmarks, including the join order benchmark, TPC-H, and JCC-H, as well as benchmark variants with user-defined functions. Overall, the overheads of reliable join ordering are negligible compared to the performance impact of the occasional, catastrophic join order choice.
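The time-slicing idea can be sketched as a multi-armed bandit in which each candidate join order is an arm and the reward is the progress measured in one time slice. The Python sketch below uses a flat UCB1 bandit, a simplification swapped in for illustration: the abstract does not name the learning algorithm, and the sketch does not model result merging or the execution engine. The names `pick_join_orders` and `progress_of` and the reward values are hypothetical.

```python
from math import log, sqrt


def pick_join_orders(orders, progress_of, slices, explore=sqrt(2)):
    """Run `slices` time slices; return how many slices each join order received.

    progress_of(order) must return the progress made in one slice, scaled to [0, 1].
    """
    pulls = {order: 0 for order in orders}
    total = {order: 0.0 for order in orders}
    for t in range(1, slices + 1):
        untried = [order for order in orders if pulls[order] == 0]
        if untried:
            choice = untried[0]            # try every order once first
        else:
            choice = max(
                orders,
                key=lambda o: total[o] / pulls[o] + explore * sqrt(log(t) / pulls[o]),
            )
        total[choice] += progress_of(choice)
        pulls[choice] += 1
    return pulls


# Hypothetical per-slice progress for three join orders over tables R, S, T.
progress = {("R", "S", "T"): 0.9, ("S", "T", "R"): 0.1, ("T", "R", "S"): 0.2}
usage = pick_join_orders(list(progress), progress.get, slices=200)
print(usage)
```

Most slices go to the order with the highest measured progress, while the other orders are still retried occasionally. The slices spent on the weaker orders are the kind of wasted cost that the regret bound in the abstract limits.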

2021, Vol 46 (3), pp. 1-44
Shaoxu Song, Fei Gao, Aoqian Zhang, Jianmin Wang, Philip S. Yu

Stream data are often dirty, for example, owing to unreliable sensor readings or erroneous extraction of stock prices. Most stream data cleaning approaches employ a smoothing filter, which may seriously alter the data without preserving the original information. We argue that cleaning should avoid changing data that are originally correct/clean, a.k.a. the minimum modification rule in data cleaning. To capture the knowledge about what is clean, we consider the (widely existing) constraints on the speed and acceleration of data changes, such as fuel consumption per hour, the daily limit of stock prices, or the top speed and acceleration of a car. Guided by these semantic constraints, in this article, we propose a constraint-based approach for cleaning stream data. Notably, existing data repair techniques clean (a sequence of) data as a whole and fail to support stream computation. To support streams, we have to relax the global optimum over the entire sequence to a local optimum in a window. In contrast to the commonly observed NP-hardness of general data repairing problems, our major contributions include (1) a polynomial-time algorithm for the global optimum, (2) a linear-time algorithm for the local optimum based on an efficient median-based solution, and (3) experiments on real datasets demonstrating that our method achieves significantly lower L1 error than existing approaches such as smoothing.
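A speed constraint bounds how fast a value may change between consecutive readings. The Python sketch below repairs a stream greedily by clamping each reading into the range allowed by the previous repaired reading. This is a deliberately simplified stand-in: it is neither the global-optimum algorithm nor the median-based local solution mentioned in the abstract, and it ignores acceleration constraints. The function name and parameters are hypothetical, and timestamps are assumed to be strictly increasing.

```python
def repair_stream(points, s_min, s_max):
    """Clamp each (t, x) reading so that s_min <= (x - prev_x) / (t - prev_t) <= s_max."""
    repaired = []
    for t, x in points:
        if repaired:
            prev_t, prev_x = repaired[-1]
            dt = t - prev_t
            low = prev_x + s_min * dt      # slowest allowed change
            high = prev_x + s_max * dt     # fastest allowed change
            x = min(max(x, low), high)     # readings already in range stay untouched
        repaired.append((t, x))
    return repaired


readings = [(0, 10.0), (1, 11.0), (2, 30.0), (3, 13.0)]
print(repair_stream(readings, s_min=-2.0, s_max=2.0))
```

Only the spike at t = 2 is modified, from 30.0 to 13.0. The readings that already satisfy the constraint are left as they are, in the spirit of the minimum modification rule.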

2021, Vol 46 (2), pp. 1-45
Amine Mhedhbi, Chathura Kankanamge, Semih Salihoglu

We study the problem of optimizing one-time and continuous subgraph queries using the new worst-case optimal join plans. Worst-case optimal plans evaluate queries by matching one query vertex at a time using multiway intersections. The core problem in optimizing worst-case optimal plans is to pick an ordering of the query vertices to match. We make two main contributions:
1. A cost-based dynamic programming optimizer for one-time queries that (i) picks efficient query vertex orderings for worst-case optimal plans and (ii) generates hybrid plans that mix traditional binary joins with worst-case optimal style multiway intersections. In addition to our optimizer, we describe an adaptive technique that changes the query vertex orderings of the worst-case optimal subplans during query execution for more efficient query evaluation. The plan space of our one-time optimizer contains plans that are not in the plan spaces based on tree decompositions from prior work.
2. A cost-based greedy optimizer for continuous queries that builds on the delta subgraph query framework. Given a set of continuous queries, our optimizer decomposes these queries into multiple delta subgraph queries, picks a plan for each delta query, and generates a single combined plan that evaluates all of the queries. Our combined plans share computations across operators of the plans for the delta queries if the operators perform the same intersections. To increase the amount of computation shared, we describe an additional optimization that shares partial intersections across operators.
Our optimizers use a new cost metric for worst-case optimal plans called intersection-cost. When generating hybrid plans, our dynamic programming optimizer for one-time queries combines intersection-cost with the cost of binary joins. We demonstrate the effectiveness of our plans, adaptive technique, and partial intersection sharing optimization through extensive experiments. Our optimizers are integrated into GraphflowDB.
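Matching one query vertex at a time with intersections can be illustrated on the directed triangle query a→b, b→c, a→c. The Python sketch below fixes the vertex ordering a, b, c by hand and uses built-in set intersection. The function name and the fixed ordering are assumptions for illustration; choosing the ordering is precisely what the optimizers in the abstract do, and this sketch is not GraphflowDB code.

```python
from collections import defaultdict


def triangles(edges):
    """Return all (a, b, c) with edges a->b, b->c and a->c, matched in the order a, b, c."""
    out = defaultdict(set)                 # adjacency sets of outgoing neighbours
    for u, v in edges:
        out[u].add(v)
    results = []
    for a in list(out):                    # match query vertex a
        for b in out[a]:                   # extend to b along a->b
            # c must be an out-neighbour of both a and b: one intersection
            for c in out[a] & out.get(b, set()):
                results.append((a, b, c))
    return sorted(results)


graph = [(1, 2), (2, 3), (1, 3), (3, 4), (2, 4)]
print(triangles(graph))
```

The candidates for `c` come from intersecting two adjacency sets rather than from joining two edge tables and filtering afterwards. This is the per-vertex extension step that the multiway intersections in the abstract refer to.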

2021, Vol 46 (2), pp. 1-50
Yangjun Chen, Gagandeep Singh

Given a directed edge-labeled graph G, checking whether vertex v is reachable from vertex u under a label set S means determining whether there is a path from u to v whose edge labels all belong to S. Such a query is referred to as a label-constrained reachability (LCR) query. In this article, we present a new approach to storing a compressed transitive closure of G in the form of intervals over spanning trees (forests). The basic idea is to associate each vertex v with two sequences of some other vertices: one is used to check reachability from v to any other vertex, by using intervals, while the other is used to check reachability to v from any other vertex. We will show that such sequences are in general much shorter than the number of vertices in G. Extensive experiments have been conducted, which demonstrate that our method is much better than all the previous methods for this problem in all the important aspects, including index construction times, index sizes, and query times.
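The definition of an LCR query can be made concrete with an index-free baseline: a breadth-first search that only follows edges whose label is in S. The Python sketch below is that baseline, not the interval-based index from the article. The function name and the `(source, target, label)` edge encoding are assumptions for illustration.

```python
from collections import defaultdict, deque


def lcr_reachable(edges, u, v, allowed):
    """True if some path from u to v uses only edges whose label is in `allowed`."""
    adjacency = defaultdict(list)
    for source, target, label in edges:
        if label in allowed:               # drop edges with forbidden labels
            adjacency[source].append(target)
    seen = {u}
    queue = deque([u])
    while queue:
        node = queue.popleft()
        if node == v:
            return True
        for neighbour in adjacency[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return False


edges = [("u", "w", "a"), ("w", "v", "b"), ("u", "v", "c")]
print(lcr_reachable(edges, "u", "v", {"a", "b"}))
print(lcr_reachable(edges, "u", "v", {"a"}))
```

The first query succeeds through the path u, w, v, and the second fails because the edge labelled `b` is not allowed. Each query costs time linear in the graph size, which is the cost a precomputed index aims to avoid.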
