Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences
Perspectives

Making big data small

    Abstract

    Big data analytics is often prohibitively costly and is typically conducted by parallel processing with a cluster of machines. Is big data analytics beyond the reach of small companies that can only afford limited resources? This paper tackles this question by presenting Boundedly EvAluable SQL (BEAS), a system for querying big relations with constrained resources. The idea is to make big data small. To answer a query posed on a dataset, it often suffices to access a small fraction of the data no matter how big the dataset is. In the light of this, BEAS answers queries on big data by identifying and fetching a small set of the data needed. Under available resources, it computes exact answers whenever possible and otherwise approximate answers with accuracy guarantees. Underlying BEAS are principled approaches of bounded evaluation and data-driven approximation, the focus of this paper.

    1. Introduction

    Big data analytics appeals for a departure from the classical computational complexity theory and conventional query evaluation paradigm. Textbooks tell us that a computational problem is tractable if its cost can be expressed as a polynomial in the size n of the input [1]. When it comes to big data, however, it may no longer be the case. As an example, let us consider queries on relational datasets, i.e. tables of records (a.k.a. tuples). It was reported that it took days to join two tables of 100 million tuples each [2]. While join is a routine operation in relational queries and its cost is in O(n²) time on tables of n tuples, it can no longer be considered ‘tractable’ if it takes days on big relations. That is, computational problems that are labelled tractable in the classical complexity theory may become infeasible or ‘intractable’ in the context of big data [3].

    The industrial solution to big data analytics is typically parallel computation with a cluster of machines. When data grows big, it scales out the cluster by adding more machines. The solution assumes parallel scalability, i.e. the computation would get faster when given more machines. However, few parallel algorithms in the literature offer this performance guarantee. Worse still, some computational problems are not parallel scalable, i.e. there exist no algorithms for these problems that guarantee to substantially reduce running time by adding more machines; for instance, no parallel algorithm for graph simulation, an O(n²) problem for social network analyses, is able to further reduce running time by adding more processors, after the number of processors used reaches a certain point [4] (see [5] for more real-life examples). Furthermore, even when dealing with computational problems that are parallel scalable, small businesses typically have limited resources and cannot afford large-scale parallel computation.

    Is big data analytics beyond the reach of small companies with limited resources?

    We tackle this question by presenting an alternative solution, as an effort to provide small companies with a capacity of big data analytics under constrained resources. We focus on relational query answering, since relational data accounts for 75% of business data in the real world [6], relational databases dominate big data [7], and analytics of the data is typically carried out with queries expressed in SQL (structured query language, the standard query language for relational databases). Given an SQL query Q and a relational dataset D, our job is to compute the answer to Q in D, denoted by Q(D), which is a set of tuples. The query answering problem is non-trivial. It is NP-complete (intractable) to decide whether a given tuple is in the answer Q(D) when Q is in SPC, a special case of SQL defined with selection, projection and Cartesian product (join) operators only.

    The idea is to make big data small, based on the following observation. Given a query Q posed on a dataset D, in practice it often suffices to compute Q(D) by accessing a small fraction DQ of D, even when D is big. In the light of this, we propose (a) bounded evaluation [8–12]. Given a query Q, it identifies DQ such that Q(DQ) = Q(D) and computes the exact answer to Q in D by accessing small data DQ; moreover, it fetches DQ of a bounded size when possible. When exact answers are beyond the reach of a bounded fraction of D, we propose (b) a data-driven approximation scheme [13]. Also identifying and fetching a small fraction DQ of D, it computes Q(DQ) as an approximate answer to Q in D and offers a deterministic accuracy bound. Both (a) and (b) are carried out under available limited resources and reduce queries on big data to computations on small data.

    As proof of concept, we have developed a prototype system BEAS (Boundedly EvAluable SQL) [14], which implements the new query evaluation paradigm to answer SQL queries under constrained resources. Our industry collaborators have evaluated BEAS with call-detail-record (CDR) queries and found that BEAS is able to compute exact answers to more than 90% of their CDR queries, reduce big datasets from PB (10^15 bytes) to GB (10^9 bytes) and improve the performance by orders of magnitude [15].

    Organization. In the remainder of the paper, we give a brief introduction to bounded evaluation (§2) and the data-driven approximation scheme (§3). Putting these together, we present BEAS, our resource-bounded query evaluation system (§4). We focus on SQL queries on big data in this paper, which account for a large part of big data analytics.

    2. Bounded evaluation

    Given an SQL query Q posed on a big dataset D, bounded evaluation [9] aims to compute the exact answer Q(D) to Q in D by accessing only a bounded subset DQ of D that includes necessary information for answering Q in D, instead of the entire D. The idea is simple, but it is non-trivial to identify DQ. To this end, it makes use of an access schema A, which is a set of access constraints, i.e. a combination of simple cardinality constraints and their associated indices on the data.

    Bounded evaluation. Under an access schema A, query Q is boundedly evaluable if for each dataset D that conforms to A, there exists a fraction DQ ⊆ D such that

    Q(DQ) = Q(D), i.e. it suffices to compute the exact answer Q(D) by using only DQ; and

    the time for identifying DQ and the size |DQ| of DQ are determined by the query Q and the schema A only. That is, the cost of computing Q(DQ) is independent of the size |D| of D.

    Intuitively, query Q is boundedly evaluable under A if Q(D) can be computed by accessing a bounded fraction DQ of D, no matter how big the dataset D grows. We identify DQ by reasoning about the cardinality constraints in A and fetch DQ by using the indices in A. A dataset D conforms to A if D satisfies the cardinality constraints of A and has indices of A built on it.

    Example 2.1

    Consider a database schema R0 that consists of four relations (tables):

    (a)

    person(pid, city, country), each tuple specifies a person pid who lives in city of country;

    (b)

    friend(pid, fid), stating that person fid is a friend of person pid;

    (c)

    restaurant(rid, name, rating, price, street, city, country), specifying the name, rating, average cost per person and the address of a restaurant and

    (d)

    visit(pid, rid, dd, mm, yy), saying that person pid visited restaurant rid on date (dd, mm, yy).

    Consider a query Q1 to find all restaurants in London that were rated A and were visited in 2018 by one of my friends who lives in London. This is a personalized search query taken from Graph Search of Facebook [16]. Written in SQL, query Q1 can be expressed as follows:

    • select restaurant.rid, restaurant.name

    • from friend, person, visit, restaurant

    • where friend.pid = p0 and friend.fid = person.pid and person.city = ‘London’ and

      friend.fid = visit.pid and visit.yy = 2018 and

      visit.rid = restaurant.rid and restaurant.rating = ‘A’ and restaurant.city = ‘London’

    where p0 indicates ‘me’ for personalized search. The query uses three join operations: (a) a join between tables friend and person to find friends of p0; (b) one between table visit and the set (table) of friends of p0 retrieved in (a) above, to find rids visited ‘by my friends’ and (c) another join between table restaurant and the set of the rids selected, to find attributes name, city and rating of these restaurants. It employs conditions to select friends of p0 who live in London and restaurants in London that were rated A and were visited in 2018. Finally, it returns a table of tuples with rid and name attributes of restaurant, by applying projection on the intermediate result.

    An instance D0 of database schema R0 may be ‘big’, e.g. Facebook has billions of users and trillions of friend links [17]. On such a dataset D0, it is prohibitively costly to compute Q1(D0).

    However, we can do better by making use of access constraints. An example access schema A1 consists of the following five access constraints:

    ψ1: friend(pid → fid, 5000);

    ψ2: visit(yy → (dd, mm, yy), 366);

    ψ3: visit((pid, dd, mm, yy) → rid, 1);

    ψ4: person(pid → city, 1) and

    ψ5: restaurant(rid → (name, city, rating), 1).

    Here, ψ1 is a cardinality constraint imposed by Facebook [17], which puts a limit of 5000 friends per person; ψ2 says that each year has at most 366 days; ψ3 states that on a given day, each person dines out at most once; and ψ4 is a simple constraint known as a key, stating that the pid of a person uniquely determines the city where the person lives, similarly for ψ5. An index is built for ψ1 such that given a pid, it returns all friend fids of the pid from friend, i.e. ψ1 is a combination of the cardinality constraint (5000 fids for each pid) and the index, similarly for ψ2–ψ5. These are simple cardinality constraints commonly found in the real world, along with their associated indices.

    Under the set A1 of access constraints, query Q1 is boundedly evaluable. Indeed, we can (1) first identify and fetch at most 5000 friend fids of person p0 from relation friend by using the index for ψ1 and (2) for each fid fetched, we do the following: (a) get 1 city where the person fid lives from relation person by using the index for ψ4; (b) using the index for ψ2 and constant 2018, fetch at most 366 (dd, mm, yy) partial tuples from table visit; (c) putting the fid together with each (dd, mm, yy), fetch at most 366 rids from visit by using the index for ψ3 and (d) for each of the rids found, retrieve at most 1 (name, city, rating) tuple from the restaurant table by using the index for ψ5. In total, we fetch a set DQ consisting of 5000 × 2 + 366 + 5000 × 366 × 2 tuples, instead of trillions. The set DQ contains all the data that are needed for computing Q1(D0), and hence, it suffices to answer query Q1 in the big dataset D0 by using the set DQ of retrieved data only. Better yet, under the access schema A1, the size of DQ remains stable no matter how big dataset D0 grows.
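The bounded plan sketched above can be simulated in a few lines. The following is an illustrative sketch only, not part of BEAS: each access-constraint index is modelled as an in-memory dictionary, and all data values and variable names are hypothetical.

```python
# Toy indices standing in for the indices of ψ1-ψ5 (hypothetical data).
friend_idx = {"p0": ["f1", "f2"]}                      # ψ1: pid → fids (≤ 5000)
person_idx = {"f1": "London", "f2": "Paris"}           # ψ4: pid → city (≤ 1)
visit_days_idx = {2018: [(1, 1, 2018), (2, 1, 2018)]}  # ψ2: yy → (dd,mm,yy) (≤ 366)
visit_rid_idx = {("f1", 1, 1, 2018): "r1"}             # ψ3: (pid,dd,mm,yy) → rid (≤ 1)
restaurant_idx = {"r1": ("Bistro", "London", "A")}     # ψ5: rid → (name,city,rating)

def bounded_plan_q1(p0):
    """Evaluate Q1 by index fetches only, never scanning base tables."""
    answers = set()
    for fid in friend_idx.get(p0, []):                 # step (1): ≤ 5000 fids
        if person_idx.get(fid) != "London":            # step (2a): filter by city
            continue
        for day in visit_days_idx.get(2018, []):       # step (2b): ≤ 366 days
            rid = visit_rid_idx.get((fid,) + day)      # step (2c): ≤ 1 rid per day
            if rid is None:
                continue
            name, city, rating = restaurant_idx[rid]   # step (2d): ≤ 1 tuple per rid
            if city == "London" and rating == "A":
                answers.add((rid, name))
    return answers

print(bounded_plan_q1("p0"))  # → {('r1', 'Bistro')}
```

Note that the total number of index fetches is bounded by the constants in A1, independently of how large the underlying tables are.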

    Bounded evaluation suggests that given an application that requires querying big datasets, we can first discover an access schema A offline, based on an analysis of historical queries. We then take SQL queries Q online and check whether Q is boundedly evaluable under A. If so, query Q is guaranteed to be answered on big datasets by accessing a bounded amount of data.

    Effective algorithms are in place for discovering access schema from historical queries, incrementally maintaining the access schema in response to updates, checking the bounded evaluability of input queries and generating query plans for bounded queries by accessing a bounded amount of data. As an example, below we consider how to check the bounded evaluability.

    Deciding bounded evaluability. A challenge is that it is undecidable to check whether an SQL query is boundedly evaluable under an access schema A [9]. The undecidability remains intact for relational algebra, denoted by RA, which is SPC extended with the set union and difference operators and is the first-order logic fragment of SQL. To cope with this, an effective syntax L was developed [12]. That is, L is a class of RA queries such that under access schema A,

    (a)

    an RA query Q is boundedly evaluable if and only if it is equivalent to a query Q′ in L and

    (b)

    it takes PTIME (polynomial time) in the size |Q| of query Q and the size |A| of access schema A to check whether Q is in L, reducing the problem to syntactic checking.

    That is, L identifies the core subclass of boundedly evaluable RA queries under access schema A, without sacrificing their expressive power up to equivalence. This is analogous to how database management systems (DBMSs) support relational calculus: since the safety of relational calculus queries is undecidable [18], DBMSs support the syntactic class of range-safe queries instead. Based on the effective syntax, we can efficiently check whether a query Q is boundedly evaluable, and if so, generate a query plan to answer Q by accessing a bounded amount of data [12].

    The unique features of bounded evaluation. The foundation of bounded evaluation was established in [8–10]. The theory was first put in action for SPC [11] and then extended to RA [12]. A prototype system was developed in [14] and offers the following:

    (1) Bounded evaluability. It is able to check whether an input SQL query Q is boundedly evaluable, and if so, answer Q by accessing a bounded amount of data from a (possibly big) dataset.

    (2) Plug and play. Bounded evaluation can be readily built on top of commercial DBMSs and provides the DBMSs with a convenient capability of querying big relations under limited resources.

    (3) Join reduction. Access constraints provide us with new optimization rules, to replace costly join operations with simple data fetch by leveraging their associated indices at the query level.

    We found that about 77% of SPC queries [11] and 67% of SQL queries [12] are boundedly evaluable. Our industry collaborators found that more than 90% of their CDR queries are boundedly evaluable.

    The study of bounded evaluation is among the first efforts to compute exact answers to queries in big datasets under limited resources. In contrast to commercial DBMSs, it decides whether an input query is boundedly evaluable, and if so, it generates a bounded query plan, by reasoning about access schema. Access schema is implemented with indices, but as opposed to conventional indices, it allows us to reason about cardinality constraints at the query level, reduce costly join operations and deduce bounded query plans. Moreover, conventional indices retrieve entire tuples [19]. In contrast, with access constraints, we fetch distinct partial tuples consisting of only attributes needed for answering a query. For instance, as shown in example 2.1, we only fetch the name, rating and city attributes of restaurant, but not price, street and country, since answering Q1 does not need the latter. That is, bounded evaluation does not retrieve duplicated and unnecessary attributes or tuples; the redundancies get inflated rapidly with joins in conventional query evaluation.

    3. Data-driven approximation

    For queries Q that are boundedly evaluable under an access schema, we can answer Q by accessing a bounded amount of data. What can we do if query Q is not boundedly evaluable? Is it still possible to compute Q(D) in a big dataset D under constrained resources? We tackle this question next.

    Approximation. We propose a data-driven scheme for approximate query answering. It is parameterized with a resource ratio α ∈ (0, 1], indicating that our available resources can only afford to access an α-fraction of a big dataset D. Given ratio α, dataset D and an SQL query Q over D, the approximation scheme identifies DQ ⊆ D and computes Q(DQ) and a ratio η ∈ (0, 1] such that

    (1)

    |DQ| ≤ α|D|, where |DQ| is measured in its number of tuples and

    (2)

    accuracy(Q(DQ), Q, D) ≥ η, where η is the deduced accuracy bound.

    Intuitively, the scheme computes a set Q(DQ) of tuples as the approximate answer to query Q in big dataset D, by accessing at most α|D| tuples in the entire process. Thus, it can scale with D when D grows big by setting the ratio α small. Moreover, Q(DQ) assures a deterministic accuracy bound η [13]:

    (a)

    for each tuple s in the approximate answer Q(DQ), there exists a tuple t in the exact answer Q(D) that is within distance η of s, and conversely,

    (b)

    for each tuple t in the exact answer Q(D), there exists a tuple s in the approximate answer Q(DQ) that is within distance η of t.

    That is, the set Q(DQ) consists only of ‘relevant’ answers, i.e. each tuple in Q(DQ) is a sensible answer to users, up to distance η; here, the distance between s and t is the sum of the pairwise distances of their attributes, which is in turn measured in terms of a distance function in the domain of the corresponding attribute (see an example shortly and [13] for a definition). Moreover, Q(DQ) ‘covers’ the exact answer, i.e. each tuple in the exact answer Q(D) finds a tuple s ∈ Q(DQ) within distance η, and no tuple in Q(D) is missed by Q(DQ) up to bound η. It finds sensible answers in users' interest and suffices for exploratory queries, e.g. real-time problem diagnosis on logs [20]. The larger the ratio α is, i.e. the more resources we can afford, the higher the accuracy bound η we can get.
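The two directions of the accuracy bound can be checked mechanically. The sketch below is illustrative only: it assumes a simple per-attribute distance (absolute difference for numbers, 0/1 exact match for strings), whereas [13] defines the distance functions precisely.

```python
# Illustrative check of the relevance/coverage accuracy bound η of §3.
# The attribute distances are simplifying assumptions, not the ones in [13].

def attr_dist(a, b):
    """Absolute difference for numbers, 0/1 exact match otherwise."""
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        return abs(a - b)
    return 0 if a == b else 1

def tuple_dist(s, t):
    """Distance between tuples: sum of pairwise attribute distances."""
    return sum(attr_dist(a, b) for a, b in zip(s, t))

def within_bound(approx, exact, eta):
    """(a) relevance: every approximate tuple is within eta of some exact tuple;
    (b) coverage: every exact tuple is within eta of some approximate tuple."""
    relevant = all(any(tuple_dist(s, t) <= eta for t in exact) for s in approx)
    covered = all(any(tuple_dist(s, t) <= eta for s in approx) for t in exact)
    return relevant and covered

exact = {("Bistro", 35), ("Cafe", 30)}
approx = {("Bistro", 39), ("Cafe", 30)}
print(within_bound(approx, exact, eta=4))  # → True
```

Here the approximate price 39 differs from the exact price 35 by 4, so the bound η = 4 holds in both directions, while η = 3 would fail.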

    As observed in a recent survey [21], approximate query answering is challenging, and little practical use has emerged from it. Nonetheless, we can make data-driven approximation work in practice by employing a mild extension of the access schema that we have seen earlier.

    Example 3.1

    Continuing with example 2.1, consider query Q2 to find me restaurants that cost at most £35 per person on average and are in a city where one of my friends lives:

    • select restaurant.name, restaurant.price

    • from restaurant, friend, person

    • where friend.pid = p0 and friend.fid = person.pid and

      person.city = restaurant.city and restaurant.price ≤ 35

    Here, again p0 indicates ‘me’. The query is defined with two join operations, one between tables friend and person to find the cities where ‘my friends’ live, and the other between table restaurant and the set of ‘my friends’ to make sure that they are in the same city. It is costly to compute the exact answer Q2(D0) in a big dataset D0 with billions of person tuples and trillions of friend links.

    Suppose that our available resources can afford to access at most 10^−4 · |D0| tuples, i.e. α = 10^−4. That is, we have to reduce a dataset of PB size to a dataset of 100 GB, while guaranteeing a deterministic accuracy bound. Under these constraints, we can compute approximate answers to query Q2 in D0 based on data-driven approximation, by making use of an access schema A2, which includes ψ1–ψ5 of example 2.1 and, in addition, the following extended access constraints:

    ϕ1: restaurant(city → (name, price), 2^1, (e_n^1, e_p^1)); …

    ϕm: restaurant(city → (name, price), 2^m, (e_n^m, e_p^m)), where m = ⌈log₂ M⌉.

    Here, M is the maximum number of distinct restaurant tuples in dataset D0 grouped by attribute city. Observe that there exists no constant bound on the number of restaurants in a city, and hence, it is hard to define access constraints along the same lines as ψ1–ψ5 of example 2.1. Instead, we build an index for each ϕi in A2 such that for each i ∈ [1, m], given any city-value c, we can retrieve a set T of at most 2^i (name, price) values from D0 by using the index. Moreover, for each restaurant tuple t = (id, n, r, p, s, c, c1) in D0 (recall that such tuples are specified by (rid, name, rating, price, street, city, country)), there exists a tuple (n′, p′) ∈ T such that the (name, price)-value (n, p) of t differs from (n′, p′) by distances at most (e_n^i, e_p^i); for instance, when e_p^i = 4, the price p′ returned is at most £4 more expensive than the real price p. That is, T represents the (name, price) values that correspond to city-attribute c with at most 2^i tuples, subject to distances (e_n^i, e_p^i). Intuitively, the indices give us a hierarchical representation of relation restaurant with different resolutions i ∈ [1, m]. The higher the resolution level i is, the smaller the distances (e_n^i, e_p^i) are, and the more accurately the index for ϕi represents the data in D0.
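A multi-resolution index of this kind can be sketched as follows. This is an illustrative construction under our own simplifying assumption that representatives at level i are chosen by sorting each city's tuples on price and sampling evenly; [13] defines the actual construction and the resulting distance bounds.

```python
import math

# Illustrative sketch: build levels φ_1..φ_m, keeping at most 2^i
# representative (name, price) tuples per city at level i.
# The sampling strategy is a hypothetical stand-in for the one in [13].

def build_levels(restaurants_by_city):
    M = max(len(ts) for ts in restaurants_by_city.values())
    m = math.ceil(math.log2(M)) if M > 1 else 1     # m = ⌈log2 M⌉
    levels = {}
    for i in range(1, m + 1):
        idx = {}
        for city, tuples in restaurants_by_city.items():
            ts = sorted(tuples, key=lambda t: t[1])  # sort by price
            k = min(2 ** i, len(ts))                 # ≤ 2^i representatives
            step = len(ts) / k
            idx[city] = [ts[int(j * step)] for j in range(k)]
        levels[i] = idx
    return levels

data = {"London": [("A", 10), ("B", 20), ("C", 30), ("D", 40)],
        "Paris": [("E", 15)]}
levels = build_levels(data)
print(len(levels[1]["London"]))  # → 2 (coarse level), level 2 keeps all 4
```

The higher the level, the more representatives are kept and the smaller the distance between any real tuple and its nearest representative, matching the intuition of increasing resolution.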

    Under access schema A2, we can find restaurants of interest by accessing at most α|D0| tuples as follows: (a) fetch a set T1 of friend.fid's of p0 by accessing at most 5000 friend tuples, using the access constraint ψ1 of example 2.1; (b) for each fid in T1, fetch the 1 associated city from relation person by using the index for ψ4, which yields a set T2 of at most 5000 city values; (c) for each city c in T2, fetch at most 2^kα (name, price) pairs corresponding to c from relation restaurant by using the index for ϕkα, where kα = ⌊log₂(α|D0| − 10 000)⌋ and (d) return a set S of those (name, price) values with price at most (35 + e_p^kα), as the approximate answer to Q2 in D0. The process accesses at most 5000 + 5000 + 2^kα ≤ α|D0| tuples in total, instead of trillions.
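Steps (a)-(d) can be sketched as a small self-contained program. The toy indices below are hypothetical stand-ins for the indices of ψ1, ψ4 and ϕ at level kα; we assume a price error bound of 4 at that level, consistent with the running example.

```python
# Illustrative sketch of the approximate plan for Q2 (not the BEAS code).
# All index contents and the error bound E_P are hypothetical.

friend_idx = {"p0": ["f1", "f2"]}              # ψ1: pid → fids
person_idx = {"f1": "London", "f2": "Paris"}   # ψ4: pid → city
phi_k_idx = {"London": [("Bistro", 32), ("Deli", 39)],  # φ at level kα:
             "Paris": [("Cafe", 50)]}                   #   city → representatives
E_P = 4                                         # price distance bound at level kα

def approx_plan_q2(p0, budget=35):
    # (a)+(b): cities where friends of p0 live, via index fetches only
    cities = {person_idx[f] for f in friend_idx.get(p0, []) if f in person_idx}
    answers = set()
    for c in cities:                            # (c): ≤ 2^kα reps per city
        for name, price in phi_k_idx.get(c, []):
            # (d): accept up to the error bound, i.e. price ≤ 35 + e_p
            if price <= budget + E_P:
                answers.add((name, price))
    return answers

print(approx_plan_q2("p0"))
```

The represented price may overstate the real one by at most E_P, which is why the filter is relaxed to budget + E_P: otherwise a restaurant whose true price is within budget could be missed.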

    The approximate answer S has a deterministic accuracy bound: (1) for each tuple (n, p) in the exact answer Q2(D0), there exists a tuple (n′, p′) in S such that n′ and p′ are within distances e_n^kα and e_p^kα of n and p, respectively, and (2) for each tuple in S, its price p′ exceeds 35 by at most e_p^kα, e.g. p′ = 39 when e_p^kα = 4, and n′ is the name of a restaurant. Moreover, the larger the resource ratio α is, the smaller the distances e_p^kα and e_n^kα are, and the more accurate the approximate answer S is.

    It has been shown [13] that for any dataset D, there exists an access schema A such that for any resource ratio α ∈ (0, 1] and any SQL query Q over D, one can effectively identify a subset DQ of D by reasoning about the constraints in A and deduce a deterministic accuracy bound η such that |DQ| ≤ α|D| and accuracy(Q(DQ), Q, D) ≥ η. Moreover, the larger the resource ratio α is, the higher the accuracy bound η is. The access schema A can be effectively computed offline and efficiently maintained online in response to updates to dataset D [13].

    Unique features of data-driven approximation. There has been a host of work on approximate query answering, typically based on one of the following two approaches (see [21,22] for surveys): (a) synopses [23–27] or (b) dynamic sampling, e.g. [20,28–30]. The first approach computes a one-size-fits-all synopsis D′ of a dataset D and uses D′ to answer all queries posed on D. A representative method of the second approach is BlinkDB [20]. Assuming predictable QCSs (query column sets), i.e. ‘the frequency of columns used for grouping and filtering does not change over time’, BlinkDB selects samples from historical QCS patterns and caches them as views. It targets aggregate queries, i.e. queries to compute max, min, count, sum and average values grouped by certain attributes. It answers such aggregate queries by using the samples instead of accessing the original datasets and offers probabilistic error rates for the approximate answers computed.

    As opposed to the prior approaches, the data-driven approximation scheme makes use of an access schema to identify an α-fraction DQ of dataset D. It offers the following:

    (1) Unpredictable queries. The scheme does not assume the availability of any prior knowledge about user queries Q. By contrast, previous approaches make various assumptions on future queries, e.g. they assume that workloads, query predicates or QCSs of future queries are known in advance.

    (2) Generic queries. The scheme is able to handle generic SQL queries, aggregate or not. For instance, query Q2 in example 3.1 is not an aggregate query. By contrast, previous approaches ‘substantially limit the types of queries they can execute’ [20] and typically focus on aggregate queries.

    (3) Deterministic accuracy bound. As remarked earlier, the data-driven scheme proposes an accuracy measure in terms of both relevance and coverage, to ensure that each tuple in the approximate answer computed is sensible and all tuples in the exact answer are covered. By contrast, previous approaches provide either no accuracy guarantee at all or probabilistic error rates for aggregate queries only. Such error rates do not tell us how ‘good’ each approximate answer is, which is one of the reasons why approximate query answering has not been widely used in practice.

    Our preliminary study using real-life data finds that the data-driven approximation scheme computes approximate answers to SQL queries, aggregate or not, with accuracy η ≥ 0.82, even when α is as small as 5.5 × 10^−4 [13]. That is, it reduces D of PB size to DQ of 550 GB.

    4. A resource-bounded query evaluation framework

    Putting bounded evaluation and data-driven approximation together, we develop BEAS (Boundedly EvAluable SQL), a system for querying big relations under constrained resources.

    BEAS. For a big dataset D in an application, BEAS takes a resource ratio α ∈ (0, 1] as a parameter, which is estimated by an analysis of available resources and the size of dataset D. It discovers an access schema A offline, by striking a balance between the speedup of query evaluation and the space cost introduced by the indices in the access schema A. Then, BEAS answers SQL queries posed on dataset D online. Given an SQL query Q, BEAS works as follows:

    (1)

    it first checks whether query Q is boundedly evaluable under A, i.e. the exact answer Q(D) can be computed by accessing DQ ⊆ D such that |DQ| is independent of the size |D| of D;

    (2)

    if so, it computes the exact answer Q(D) by accessing a bounded fraction DQ of D;

    (3)

    otherwise, BEAS identifies a fraction DQ of D with |DQ| ≤ α|D| and computes Q(DQ) with a deterministic accuracy bound η, based on data-driven approximation.

    That is, under the resource constraint α, BEAS computes exact answers Q(D) when possible, and approximate answers Q(DQ) with accuracy bound η otherwise. In the entire process, it accesses a small fraction of the data that can be afforded by available resources. BEAS also supports incremental maintenance of the indices in A in response to updates to dataset D and guarantees that the maintenance cost is determined by the size of the changes, not by the size of dataset D [11,13].
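The dispatch logic of steps (1)-(3) can be sketched as follows. This is an illustrative outline only: the checker and plan generators are stubs, whereas the real system reasons about the access schema to make these decisions.

```python
# Illustrative sketch of the BEAS dispatch of §4 (stubs, not the real system).
# Exact answers carry accuracy 1.0; approximate answers carry the deduced η.

def beas_answer(query, access_schema, alpha,
                is_bounded, bounded_plan, approx_plan):
    if is_bounded(query, access_schema):
        # exact: the data accessed depends only on Q and A, not on |D|
        return bounded_plan(query, access_schema), 1.0
    # approximate: access at most an alpha-fraction of D,
    # returning (answer, eta) with a deterministic accuracy bound eta
    return approx_plan(query, access_schema, alpha)

# usage with trivial stubs: Q1 is boundedly evaluable, Q2 is not
ans, eta = beas_answer("Q1", None, 1e-4,
                       is_bounded=lambda q, a: q == "Q1",
                       bounded_plan=lambda q, a: {"exact"},
                       approx_plan=lambda q, a, r: ({"approx"}, 0.82))
print(ans, eta)  # → {'exact'} 1.0
```

The key point is that the caller always learns whether the answer is exact (η = 1.0) or approximate with a deduced bound η, mirroring feature (2) below.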

    Unique features of BEAS. BEAS departs from commercial DBMS by promoting an unconventional query evaluation paradigm. It is unique in the following:

    (1) Querying big relations. It is parameterized with a resource ratio α, i.e. it is able to scale with arbitrarily large datasets D by adjusting ratio α based on available resources.

    (2) Separating exact answers from approximate answers. By effectively deciding whether an input query is boundedly evaluable under an access schema, BEAS is capable of telling its users whether the answers to their queries are exact or approximate. Moreover, for approximate answers, it provides deterministic accuracy bounds η in terms of both relevance and coverage.

    (3) Generic. BEAS is able to answer SQL queries that are unpredictable, without assuming prior knowledge about future queries, and generic, no matter whether the queries are aggregate or not.

    (4) Ease of implementation. BEAS can be built on top of commercial DBMS and provide the DBMS with an immediate capacity to query big relations under constrained resources. It offers its own optimization strategies and is also able to further improve its query plans, bounded or approximate, by making use of existing DBMS optimizers that have been developed for decades.

    In light of these, BEAS is promising for providing small companies with the capability of big data analytics under limited resources. It can also help big companies to reduce the cost and improve efficiency [15]. In particular, for computational problems that are not parallel scalable, bounded evaluation and data-driven approximation offer a feasible solution that does not rely on large-scale parallel computation. This said, bounded evaluation and data-driven approximation can be parallelized to benefit from parallel processing provided that resources are available.

    The idea of resource-bounded query answering is not limited to querying big relations. It has been shown that bounded evaluation improves the performance of graph pattern matching via subgraph isomorphism, an intractable problem [1] that is widely used in social media marketing, knowledge base extension and graph data cleaning, by four orders of magnitude on average [31]. For personalized social search via subgraph isomorphism, data-driven approximation retains 100% accuracy (i.e. η = 1) when α is as small as 1.5 × 10^−6 [32]; that is, when processing a graph G of 1 PB, it accesses only 15 GB of data, reducing G from PB to GB while retaining high accuracy.

    Data accessibility

    This article has no additional data.

    Competing interests

    I declare I have no competing interests.

    Funding

    W.F. is supported in part by ERC 652976, Royal Society Wolfson Research Merit Award WRM/R1/180014, NSFC 61421003, EPSRC EP/M025268/1, Foundation for Innovative Research Groups of NSFC, Joint Lab between Edinburgh and Huawei, Beijing Advanced Innovation Centre for Big Data and Brain Computing and Shenzhen Institute of Computing Sciences.

    Footnotes

    Published by the Royal Society. All rights reserved.

    References