Abstract
Big data analytics is often prohibitively costly and is typically conducted by parallel processing with a cluster of machines. Is big data analytics beyond the reach of small companies that can only afford limited resources? This paper tackles this question by presenting Boundedly EvAluable SQL (BEAS), a framework for querying big data under bounded resources: it computes exact answers whenever possible and approximate answers with a deterministic accuracy bound otherwise, by accessing only a small fraction of the data.
1. Introduction
Big data analytics appeals for a departure from the classical computational complexity theory and conventional query evaluation paradigm. Textbooks tell us that a computational problem is tractable if its cost can be expressed as a polynomial in the size n of the input [1]. When it comes to big data, however, it may no longer be the case. As an example, let us consider queries on relational datasets, i.e. tables of records (a.k.a. tuples). It was reported that it took days to join two tables of 100 million tuples each [2]. While join is a routine operation in relational queries and its cost is in O(n²) time on tables of n tuples, it can no longer be considered ‘tractable’ if it takes days on big relations. That is, computational problems that are labelled tractable in the classical complexity theory may become infeasible or ‘intractable’ in the context of big data [3].
The industrial solution to big data analytics is typically parallel computation with a cluster of machines. When data grows big, it scales out the cluster by adding more machines. The solution assumes parallel scalability, i.e. that the computation gets faster when given more machines. However, few parallel algorithms in the literature offer this performance guarantee. Worse still, some computational problems are not parallel scalable, i.e. there exist no algorithms for these problems that guarantee to substantially reduce running time by adding more machines; for instance, no parallel algorithm for graph simulation, an O(n²) problem for social network analyses, is able to further reduce running time by adding more processors, after the number of processors used reaches a point [4] (see [5] for more real-life examples). Furthermore, even when dealing with computational problems that are parallel scalable, small businesses typically have limited resources and cannot afford large-scale parallel computation.
Is big data analytics beyond the reach of small companies with limited resources?
We tackle this question by presenting an alternative solution, as an effort to provide small companies with a capacity of big data analytics under constrained resources. We focus on relational query answering, since relational data accounts for 75% of business data in the real world [6], relational databases dominate big data [7], and analytics of the data is typically carried out with queries expressed in SQL.
The idea is to make big data small, based on the following observation. Given a query Q posed on a dataset D, in practice it often suffices to compute Q(D) by accessing a small fraction DQ of D, even when D is big. In the light of this, we propose (a) bounded evaluation [8–12]. Given a query Q, it identifies DQ such that Q(DQ) = Q(D) and computes the exact answer to Q in D by accessing small data DQ; moreover, it fetches DQ of a bounded size when possible. When exact answers are beyond the reach of a bounded fraction of D, we propose (b) a data-driven approximation scheme [13]. Also identifying and fetching a small fraction DQ of D, it computes Q(DQ) as an approximate answer to Q in D and offers a deterministic accuracy bound. Both (a) and (b) are carried out under available limited resources and reduce queries on big data to computations on small data.
As proof of concept, we have developed a prototype system, BEAS (Boundedly EvAluable SQL) [14].
Organization. In the remainder of the paper, we give a brief introduction to bounded evaluation (§2) and the data-driven approximation scheme (§3). Putting these together, we present BEAS, a resource-bounded query evaluation framework (§4).
2. Bounded evaluation
Given an SQL query Q and a big dataset D, bounded evaluation aims to compute the exact answer Q(D) by accessing only a small fraction of D. It is based on the notion of access schemas. An access schema A is a set of access constraints of the form R(X → Y, N), where R is a relation, X and Y are sets of attributes of R, and N is a natural number. The constraint states that for any given values of the X-attributes, there exist at most N distinct corresponding values of the Y-attributes in R; moreover, an index is built on X that, given an X-value, fetches these Y-values from R.
Bounded evaluation. Under an access schema A, query Q is boundedly evaluable if for each dataset D that conforms to A, there exists a fraction DQ ⊆ D such that

— Q(DQ) = Q(D), i.e. it suffices to compute the exact answer Q(D) by using DQ only; and

— the time for identifying DQ and the size |DQ| of DQ are determined by the query Q and the schema A only. That is, the cost of computing Q(DQ) is independent of the size |D| of D.
Intuitively, query Q is boundedly evaluable under A if Q(D) can be computed by accessing a bounded fraction DQ of D, no matter how big the dataset D grows. We identify DQ by reasoning about the cardinality constraints in A and fetch DQ by using the indices in A. A dataset D conforms to A if D satisfies the cardinality constraints of A and has the indices of A built on it.
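To make the notion concrete, the following is a minimal sketch (our own illustration, not BEAS code; class and method names are assumptions) of an access constraint R(X → Y, N): it checks that a relation conforms to the cardinality bound, and supports the index-based fetch of the at most N Y-values associated with an X-value.

```python
from collections import defaultdict

class AccessConstraint:
    """Sketch of an access constraint R(X -> Y, N) with its index."""

    def __init__(self, x_attrs, y_attrs, n):
        self.x_attrs, self.y_attrs, self.n = x_attrs, y_attrs, n
        self.index = defaultdict(set)  # maps each X-value to its set of Y-values

    def build(self, relation):
        """Index the relation; return False if it violates the cardinality bound N."""
        for t in relation:
            x = tuple(t[a] for a in self.x_attrs)
            y = tuple(t[a] for a in self.y_attrs)
            self.index[x].add(y)
            if len(self.index[x]) > self.n:
                return False  # the dataset does not conform to the access schema
        return True

    def fetch(self, x_value):
        """Fetch the at most N Y-values for a given X-value via the index."""
        return self.index.get(tuple(x_value), set())
```

For instance, ψ1 of example 2.1 would be `AccessConstraint(("pid",), ("fid",), 5000)` over the friend relation: building it verifies the 5000-friend limit, and `fetch(("p0",))` returns p0's friends without scanning the relation.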
Example 2.1. Consider a database schema that consists of four relations (tables):

(a) friend(pid, fid), stating that person fid is a friend of person pid;

(b) person(pid, name, city), recording the name and city of residence of each person pid;

(c) visit(pid, yy, dd, rid), recording that person pid dined at restaurant rid on day dd of year yy; and

(d) restaurant(rid, name, city, rating, price), giving the name, city, rating and average price of each restaurant rid.
Consider a query Q1 to find all restaurants in London that were rated A and were visited in 2018 by one of my friends who lives in London. This is a personalized search query taken from Graph Search of Facebook [16]. Written in SQL, Q1 is:
select restaurant.rid, restaurant.name
from friend, person, visit, restaurant
where friend.pid = p0 and friend.fid = person.pid and person.city = ‘London’
and friend.fid = visit.pid and visit.yy = 2018 and visit.rid = restaurant.rid
and restaurant.rating = ‘A’ and restaurant.city = ‘London’
An instance D0 of database schema may be ‘big’, e.g. Facebook has billions of users and trillions of friend links [17]. On such a dataset D0, it is prohibitively costly to compute Q1(D0).
However, we can do better by making use of access constraints. An example access schema A consists of the following five access constraints:

— ψ1: friend(pid → fid, 5000);

— ψ2: visit(yy → dd, 366);

— ψ3: visit((pid, dd) → rid, 1);

— ψ4: restaurant(rid → (name, city, rating, price), 1); and

— ψ5: person(pid → (name, city), 1).
Here, ψ1 is a cardinality constraint imposed by Facebook [17], which puts a limit of 5000 friends per person; ψ2 says that each year has at most 366 days; ψ3 states that on a given day, each person dines out at most once; ψ4 is a simple constraint known as a key, stating that the rid attribute is a key of relation restaurant, i.e. each rid value determines at most one restaurant tuple; and similarly, ψ5 states that pid is a key of relation person.
Under the set A of access constraints, query Q1 is boundedly evaluable. Indeed, we can (1) first identify and fetch at most 5000 friends fid of p0 by using the index of ψ1; (2) fetch at most 366 days dd of 2018 by using ψ2; (3) for each pair (fid, dd) obtained, fetch the restaurant rid visited, if any, by using ψ3; (4) fetch the person tuple of each fid by using ψ5, and the restaurant tuple of each rid by using ψ4; and (5) return those (rid, name) pairs for which the friend lives in London and the restaurant is in London with rating ‘A’. The entire process accesses a number of tuples bounded by Q1 and A only, no matter how big the underlying dataset D0 is.
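The bounded plan sketched above can be mimicked in a few lines of Python. This is our own toy illustration, with dictionaries standing in for the indices of ψ1–ψ5 (the names and the accessed-tuple counter are assumptions, not BEAS's actual plan generator); the point is that the count of tuples accessed depends only on the constraints, never on |D0|.

```python
def bounded_q1(p0, idx_friend, idx_year_days, idx_visit, key_person, key_restaurant):
    """Evaluate Q1 via index fetches only; also count the tuples accessed."""
    answers, accessed = set(), 0
    fids = idx_friend.get(p0, set())        # psi1: at most 5000 friends of p0
    days = idx_year_days.get(2018, set())   # psi2: at most 366 days of 2018
    accessed += len(fids) + len(days)
    for fid in fids:
        person = key_person.get(fid)        # psi5: pid is a key of person
        accessed += 1
        if person is None or person["city"] != "London":
            continue
        for dd in days:
            rid = idx_visit.get((fid, dd))  # psi3: at most one restaurant per day
            accessed += 1
            if rid is None:
                continue
            r = key_restaurant.get(rid)     # psi4: rid is a key of restaurant
            accessed += 1
            if r and r["rating"] == "A" and r["city"] == "London":
                answers.add((rid, r["name"]))
    return answers, accessed
```

Whatever the size of the full dataset, the plan touches at most 5000 + 366 + O(5000 × 366) tuples.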
Bounded evaluation suggests that given an application that requires querying big datasets, we can first discover an access schema A offline, based on an analysis of historical queries. We then take A as part of the database schema, and evaluate queries posed on the datasets by accessing a bounded amount of data whenever possible.
Effective algorithms are in place for discovering access schemas from historical queries, incrementally maintaining the access schemas in response to updates, checking the bounded evaluability of input queries and generating query plans that access a bounded amount of data for boundedly evaluable queries. As an example, below we consider how to check bounded evaluability.
Deciding bounded evaluability. A challenge is that it is undecidable to check whether an SQL query is boundedly evaluable under an access schema [9]. To cope with the undecidability, an effective syntax was developed for boundedly evaluable queries [12]. That is, there exists a syntactically defined class of queries, referred to as queries covered by an access schema A, such that

(a) an SQL query Q is boundedly evaluable under A if and only if Q is equivalent to a query covered by A; and

(b) it takes PTIME (polynomial time) in the size of Q and A to check whether Q is covered by A, independent of any dataset.

That is, the covered queries identify the core subclass of boundedly evaluable SQL queries, and checking whether a query belongs to this subclass is efficient.
The unique features of bounded evaluation. The foundation of bounded evaluation was established in [8–10]. The theory was first put in action for SPC (select-project-Cartesian-product) queries [11], and was then extended to relational algebra, i.e. full SQL [12]. Compared with conventional query evaluation, it offers the following.
(1) Bounded evaluability. It is able to check whether an input SQL query Q is boundedly evaluable under an access schema A and, if so, to generate a query plan for Q that accesses a bounded amount of data.
(2) Plug and play. Bounded evaluation can be readily built on top of commercial DBMS (database management systems), without the need to modify their internals.
(3) Join reduction. Access constraints provide us with new optimization rules, to replace costly join operations with simple data fetch by leveraging their associated indices at the query level.
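As a toy illustration of the join-reduction rule (our own sketch, with assumed data structures), a join on the key rid of restaurant can be replaced by index fetches, turning a quadratic scan into one lookup per tuple:

```python
def join_scan(visits, restaurants):
    """Nested-loop join visit x restaurant on rid: O(|visits| * |restaurants|)."""
    return [(v, r) for v in visits for r in restaurants if v["rid"] == r["rid"]]

def join_by_index(visits, restaurant_by_rid):
    """Same join via the key index of psi4: one fetch per visit tuple."""
    out = []
    for v in visits:
        r = restaurant_by_rid.get(v["rid"])  # at most one tuple, since rid is a key
        if r is not None:
            out.append((v, r))
    return out
```

Both functions return the same pairs; only the cost differs, which is exactly what the optimization rule exploits.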
We found that about 77% of SPC queries in practice are boundedly evaluable [12], and for these queries, bounded evaluation improves the performance of conventional query evaluation by orders of magnitude.
The study of bounded evaluation is among the first efforts to compute exact answers to queries in big datasets under limited resources. In contrast to commercial DBMS, which may have to scan an entire big dataset D to answer a query, bounded evaluation computes exact answers by accessing only a bounded fraction of D.
3. Data-driven approximation
For queries Q that are boundedly evaluable under an access schema, we can answer Q by accessing a bounded amount of data. What can we do if query Q is not boundedly evaluable? Is it still possible to compute Q(D) in a big dataset D under constrained resources? We tackle this question next.
Approximation. We propose a data-driven scheme for approximate query answering. It is parameterized with a resource ratio α ∈ (0, 1], indicating that our available resources can only afford to access an α-fraction of a big dataset D. Given ratio α, dataset D and an SQL query Q posed on D, the scheme identifies a fraction DQ of D such that

(1) |DQ| ≤ α|D|, where |DQ| is measured in its number of tuples; and

(2) Q(DQ) is an approximate answer to Q in D with a deterministic accuracy bound (see below).
Intuitively, the scheme computes a set Q(DQ) of tuples as the approximate answer to query Q in big dataset D, by accessing at most α|D| tuples in the entire process. Thus, it can scale with D when D grows big by setting the ratio α small. Moreover, Q(DQ) assures a deterministic accuracy bound η [13]:
(a) for each tuple s in the approximate answer Q(DQ), there exists a tuple t in the exact answer Q(D) that is within distance η of s; and conversely,

(b) for each tuple t in the exact answer Q(D), there exists a tuple s in the approximate answer Q(DQ) that is within distance η of t.
That is, the set Q(DQ) consists only of ‘relevant’ answers, i.e. each tuple in Q(DQ) is a sensible answer to users, up to distance η; here, the distance between s and t is the sum of the pairwise distances of their attributes, which is in turn measured in terms of a distance function in the domain of the corresponding attribute (see an example shortly and [13] for a definition). Moreover, Q(DQ) ‘covers’ the exact answer, i.e. each tuple in the exact answer Q(D) finds a tuple s ∈ Q(DQ) within distance η, and no tuple in Q(D) is missed by Q(DQ) up to bound η. The scheme thus finds sensible answers of interest to users and suffices for exploratory queries, e.g. real-time problem diagnosis on logs [20]. The larger the ratio α is, i.e. the more resources we can afford, the better (smaller) the accuracy bound η we can get.
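Conditions (a) and (b) can be checked mechanically. Below is a minimal sketch (our own, with an assumed per-attribute distance: absolute difference for numbers, 0/1 for other values) of the relevance-and-coverage test behind the bound η:

```python
def dist(s, t):
    """Distance between tuples: sum of per-attribute distances (assumed metrics)."""
    return sum(abs(a - b) if isinstance(a, (int, float)) else (0 if a == b else 1)
               for a, b in zip(s, t))

def within_bound(approx, exact, eta):
    """Check the deterministic accuracy bound: relevance (a) and coverage (b)."""
    relevance = all(any(dist(s, t) <= eta for t in exact) for s in approx)
    coverage = all(any(dist(s, t) <= eta for s in approx) for t in exact)
    return relevance and coverage
```

For example, the singleton answer {("A", 30)} satisfies the bound η = 2 against the exact answer {("A", 32)}, but not η = 1.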
As observed in a recent survey [21], approximate query answering is challenging, and little practical use has emerged from it. Nonetheless, we can make data-driven approximation work in practice by employing a mild extension of the access schema that we have seen earlier.
Continuing with example 2.1, consider query Q2 to find me restaurants that cost at most £35 per person on average and are in a city where one of my friends lives:
select restaurant.name, restaurant.price
from restaurant, friend, person
where friend.pid = p0 and friend.fid = person.pid
and person.city = restaurant.city and restaurant.price ≤ 35
Suppose that our available resources can afford to access at most 10⁻⁴ × |D0| tuples, i.e. α = 10⁻⁴. That is, we have to reduce a dataset of PB size to a dataset of 100 GB, while guaranteeing a deterministic accuracy bound. Under these constraints, we can compute approximate answers to query Q2 in D0 by data-driven approximation, making use of an access schema that includes ψ1–ψ5 of example 2.1 and, in addition, the following extended access constraints:
— ϕ1: restaurant(city → (name, price), 2, d1); …

— ϕm: restaurant(city → (name, price), 2^m, dm), where m = ⌈log2 M⌉.

Here, M is the maximum number of distinct (name, price) values of restaurants in any city. For each i ∈ [1, m], constraint ϕi states that for each city, there exists a set of at most 2^i (name, price) values such that every restaurant in the city has a (name, price) value within distance di of some value in the set; an index is built for fetching this representative set. The more values a constraint keeps, the smaller its distance gets; in particular, ϕm keeps all distinct (name, price) values of each city.
Under this access schema, we can find restaurants of interest by accessing at most α|D0| tuples as follows: (a) fetch a set T1 of at most 5000 fids of p0's friends by using ψ1; (b) for each fid fetched, get the city where the friend lives by using ψ5; (c) for each such city, pick an extended constraint ϕk, with k determined by the ratio α, and fetch the representative (name, price) values of the restaurants in the city by using the index of ϕk; and (d) return the set S of fetched (name, price) values whose price does not exceed £35 by more than the distance associated with ϕk.
The approximate answer S has a deterministic accuracy bound: (1) for each tuple (n, p) in the exact answer Q2(D0), there exists a tuple (n′, p′) in S such that n′ and p′ are within distance ε^n_α and ε^p_α of n and p, respectively; and (2) for each tuple in S, its price p′ exceeds 35 by at most ε^p_α, e.g. p′ = 39 when ε^p_α = 4, and n′ is the name of a restaurant. Moreover, the larger the resource ratio α is, the smaller the distances ε^p_α and ε^n_α are, and the more accurate the approximate answer S is.
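To illustrate the role of the representative sets, here is a toy sketch (ours; the greedy construction and all names are assumptions, not the system's implementation): build a d-cover of the (name, price) values per city, then answer Q2 from the representatives, relaxing the price predicate by the distance d.

```python
def build_cover(values, d):
    """Greedy d-cover: every (name, price) is within price-distance d of a representative."""
    reps = []
    for name, price in values:
        if not any(abs(price - rep_price) <= d for _, rep_price in reps):
            reps.append((name, price))
    return reps

def approx_q2(covers, cities, budget=35, d=4):
    """Approximate Q2 from per-city covers; representatives priced up to budget + d qualify."""
    return {(n, p) for city in cities
            for n, p in covers.get(city, [])
            if p <= budget + d}
```

The cover is much smaller than the full restaurant relation, which is what lets the scheme stay within the α-fraction budget while guaranteeing that every true answer has a nearby representative.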
It has been shown [13] that for any dataset D, there exists an access schema A such that for any resource ratio α ∈ (0, 1] and any SPC query Q posed on D, the scheme computes an approximate answer Q(DQ) by accessing at most α|D| tuples, with the deterministic accuracy bound described above.
Unique features of data-driven approximation. There has been a host of work on approximate query answering, typically based on one of the following two approaches (see [21,22] for surveys): (a) synopsis [23–27] or (b) dynamic sampling, e.g. [20,28–30]. The first approach computes a one-size-fits-all synopsis D′ of a dataset D and uses D′ to answer all queries posed on D. A representative method of the second approach is BlinkDB [20]. Assuming predictable QCSs (query column sets), i.e. ‘the frequency of columns used for grouping and filtering does not change over time’, BlinkDB selects samples from historical QCS patterns and caches them as views. It targets aggregate queries, i.e. queries to compute max, min, count, sum and average values grouped by certain attributes. It answers such aggregate queries by using the samples instead of accessing the original datasets and offers probabilistic error rates for the approximate answers computed.
As opposed to the prior approaches, the data-driven approximation scheme makes use of an access schema to identify an α-fraction DQ of dataset D. It offers the following:
(1) Unpredictable queries. The scheme does not assume the availability of any prior knowledge about user queries Q. By contrast, previous approaches make various assumptions on future queries, e.g. they assume that workloads, query predicates or QCSs of future queries are known in advance.
(2) Generic queries. The scheme is able to handle generic SQL queries, aggregate or not. By contrast, prior approaches mostly target aggregate queries.
(3) Deterministic accuracy bound. As remarked earlier, the data-driven scheme proposes an accuracy measure in terms of both relevance and coverage, to ensure that each tuple in the approximate answer computed is sensible and all tuples in the exact answer are covered. By contrast, previous approaches provide either no accuracy guarantee at all or probabilistic error rates for aggregate queries only. Such error rates do not tell us how ‘good’ each approximate answer is, which is one of the reasons why approximate query answering has not been widely used in practice.
Our preliminary study using real-life data finds that the data-driven approximation scheme computes approximate answers to SQL queries efficiently and accurately, even when the resource ratio α is small.
4. A resource-bounded query evaluation framework
Putting bounded evaluation and data-driven approximation together, we develop BEAS, a resource-bounded framework for querying big relations [14].
BEAS. For a big dataset D in an application, BEAS discovers an access schema A offline and maintains it in response to updates. Given an SQL query Q posed on D and a resource ratio α,

(1) it first checks whether query Q is boundedly evaluable under A, i.e. whether the exact answer Q(D) can be computed by accessing DQ ⊆ D such that |DQ| is independent of the size |D| of D;

(2) if so, it computes the exact answer Q(D) by accessing a bounded fraction DQ of D;

(3) otherwise, it computes an approximate answer Q(DQ) to Q by accessing at most α|D| tuples of D, with the deterministic accuracy bound of §3.
That is, under the resource constraint α, BEAS computes exact answers whenever possible, and approximate answers with a deterministic accuracy bound otherwise.
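The dispatch logic of steps (1)–(3) can be sketched in a few lines (function and parameter names are ours, standing in for components the text describes, not the system's actual interfaces):

```python
def answer(query, dataset, schema, alpha, is_bounded, bounded_plan, approx_plan):
    """BEAS-style dispatch: exact answers when Q is boundedly evaluable, else approximate."""
    if is_bounded(query, schema):                                   # step (1)
        return "exact", bounded_plan(query, dataset, schema)        # step (2)
    return "approximate", approx_plan(query, dataset, schema, alpha)  # step (3)
```

Because the check in step (1) depends only on the query and the access schema, the dispatch itself never touches the dataset.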
Unique features of BEAS.
(1) Querying big relations. BEAS is parameterized with a resource ratio α, i.e. it is able to scale with arbitrarily large datasets D by adjusting ratio α based on available resources.
(2) Separating exact answers from approximate answers. By effectively deciding whether an input query is boundedly evaluable under an access schema, BEAS distinguishes queries that can be answered exactly under the resource constraint from those that can only be answered approximately, and it tells users which of the two kinds of answers they get.
(3) Generic. BEAS works on generic SQL queries, aggregate or not, without assuming any prior knowledge about the queries.
(4) Ease of implementation. BEAS can be built on top of commercial DBMS, without modifying the underlying database systems.
In light of these, BEAS provides small businesses with a capacity of big data analytics under the resources they can afford.
The idea of resource-bounded query answering is not limited to querying big relations. It has been shown that bounded evaluation improves the performance of graph pattern matching via subgraph isomorphism, an intractable problem [1] that is widely used in social media marketing, knowledge base extension and graph data cleaning, by four orders of magnitude on average [31]. For personalized social search via subgraph isomorphism, data-driven approximation retains 100% accuracy (i.e. η = 1) when α is as small as 1.5 × 10⁻⁶ [32]; that is, when processing graphs G of 1 PB, it accesses only 1.5 GB of data, reducing G from PB to GB while retaining high accuracy.
Data accessibility
This article has no additional data.
Competing interests
I declare I have no competing interests.
Funding
W.F. is supported in part by ERC 652976, Royal Society Wolfson Research Merit Award WRM/R1/180014, NSFC 61421003, EPSRC EP/M025268/1, Foundation for Innovative Research Groups of NSFC, Joint Lab between Edinburgh and Huawei, Beijing Advanced Innovation Centre for Big Data and Brain Computing and Shenzhen Institute of Computing Sciences.
References
- 1.
- 2. Stack Overflow. 2017 SQL: inner joining two massive tables. See http://stackoverflow.com/questions/1750001/sql-inner-joining-two-massive-tables.
- 3. Fan W, Geerts F, Neven F. 2013 Making queries tractable on big data with preprocessing. Proc. VLDB Endowment 6, 685–696. (doi:10.14778/2536360.2536368)
- 4. Fan W, Wang X, Wu Y. 2014 Distributed graph simulation: impossibility and possibility. Proc. VLDB Endowment 7, 1083–1094. (doi:10.14778/2732977.2732983)
- 5. Xie C, Chen R, Guan H, Zang B, Chen H. 2015 SYNC or ASYNC: time to fuse for distributed graph-parallel computation. In PPoPP, pp. 194–204. See https://doi.org/10.1145/2688500.2688508.
- 6. Unisphere Research. 2016 See http://www.unisphereresearch.com/Content/ReportDetail.aspx?IssueID=6559.
- 7. Asay M. 2016 NoSQL keeps rising, but relational databases still dominate big data. See https://www.techrepublic.com/article/nosql-keeps-rising-but-relational-databases-still-dominate-big-data/.
- 8. Fan W, Geerts F, Libkin L. 2014 On scale independence for querying big data. In PODS. See https://doi.org/10.1145/2594538.2594551.
- 9. Fan W, Geerts F, Cao Y, Deng T. 2015 Querying big data by accessing small data. In PODS. See https://doi.org/10.1145/2745754.2745771.
- 10. Cao Y, Fan W, Geerts F, Lu P. 2016 Bounded query rewriting using views. In PODS. See https://doi.org/10.1145/2902251.2902294.
- 11. Cao Y, Fan W, Wo T, Yu W. 2014 Bounded conjunctive queries. Proc. VLDB Endowment 7, 1231–1242. (doi:10.14778/2732977)
- 12. Cao Y, Fan W. 2016 An effective syntax for bounded relational queries. In SIGMOD. See https://doi.org/10.1145/2882903.2882942.
- 13. Cao Y, Fan W. 2017 Data driven approximation with bounded resources. Proc. VLDB Endowment 10, 973–984. (doi:10.14778/3099622)
- 14. Cao Y, Fan W, Wang Y, Yuan T, Li Y, Chen LY. 2017 BEAS: bounded evaluation of SQL queries. In SIGMOD, pp. 1667–1670. See https://doi.org/10.1145/3035918.3058748.
- 15. University of Edinburgh. 2017 Huawei deal to advance expertise in data science. See http://www.ed.ac.uk/news/2017/huawei-deal-to-advance-expertise-in-data-science.
- 16. Facebook. 2013 Introducing graph search. See https://en-gb.facebook.com/about/graphsearch.
- 17. Grujic I, Bogdanovic-Dinic S, Stoimenov L. 2014 Collecting and analyzing data from e-government Facebook pages. In ICT Innovations.
- 18. Abiteboul S, Hull R, Vianu V. 1995 Foundations of databases. Reading, MA: Addison-Wesley.
- 19. Ramakrishnan R, Gehrke J. 2000 Database management systems. New York, NY: McGraw-Hill Higher Education.
- 20. Agarwal S, Mozafari B, Panda A, Milner H, Madden S, Stoica I. 2013 BlinkDB: queries with bounded errors and bounded response times on very large data. In EuroSys. See https://doi.org/10.1145/2465351.2465355.
- 21. Chaudhuri S, Ding B, Kandula S. 2017 Approximate query processing: no silver bullet. In SIGMOD, pp. 511–519. See https://doi.org/10.1145/3035918.3056097.
- 22. Cormode G, Garofalakis MN, Haas PJ, Jermaine C. 2012 Synopses for massive data: samples, histograms, wavelets, sketches. Found. Trends Databases 4, 1–294. (doi:10.1561/1900000004)
- 23. Acharya S, Gibbons PB, Poosala V. 2000 Congressional samples for approximate answering of group-by queries. In SIGMOD. See https://doi.org/10.1145/342009.335450.
- 24. Ioannidis YE, Poosala V. 1999 Histogram-based approximation of set-valued query-answers. In VLDB. See https://doi.org/10.1109/69.250091.
- 25. Jagadish HV, Koudas N, Muthukrishnan S, Poosala V, Sevcik KC, Suel T. 2009 Optimal histograms with quality guarantees. In VLDB. See https://doi.org/10.1145/130283.130335.
- 26. Chakrabarti K, Garofalakis MN, Rastogi R, Shim K. 2001 Approximate query processing using wavelets. VLDB J. 10, 199–223. (doi:10.1007/s007780100049)
- 27. Cormode G, Garofalakis M. 2005 Sketching streams through the net: distributed approximate query tracking. In VLDB. See https://doi.org/10.1016/j.jalgor.2003.12.001.
- 28. Babcock B, Chaudhuri S, Das G. 2003 Dynamic sample selection for approximate query processing. In SIGMOD. See https://doi.org/10.1145/872757.872822.
- 29. Ding B, Huang S, Chaudhuri S, Chakrabarti K, Wang C. 2016 Sample + Seek: approximating aggregates with distribution precision guarantee. In SIGMOD. See https://doi.org/10.1145/2882903.2915249.
- 30. Kandula S, Shanbhag A, Vitorovic A, Olma M, Grandl R, Chaudhuri S, Ding B. 2016 Quickr: lazily approximating complex ad-hoc queries in big data clusters. In SIGMOD. See https://doi.org/10.1145/304181.304581.
- 31. Cao Y, Fan W, Huang R. 2015 Making pattern queries bounded in big graphs. In ICDE. See https://doi.org/10.1109/ICDE.2015.7113281.
- 32. Fan W, Wang X, Wu Y. 2014 Querying big graphs within bounded resources. In SIGMOD. See https://doi.org/10.1145/2588555.2610513.


