High-Concurrency OLAP Workloads with StarRocks Query Cache

Publish date: Sep 19, 2023 5:19:12 AM

Application performance can suffer when thousands of users run lots of potentially heavy aggregation queries against large complex datasets. Luckily, StarRocks provides a query cache that speeds up aggregations by reusing the intermediate results of previous queries. When a query is executed, the query cache is checked first. If the query matches, the result is returned from the query cache skipping disk access and local computation. Subsequently, the result from the query cache is returned directly.

By storing the intermediate results rather than only the entire queries, the StarRocks query cache achieves a very high hit rate, even for queries that are not identical. Cache hits save a lot of time and resources for similar queries. Performance tests show that query cache can improve query efficiency in high-concurrency scenarios by 3 to 17 times.

Introduction

In user-facing analytics, achieving high concurrency is crucial and can present considerable obstacles. Conventional systems may encounter difficulties in retrieving data promptly and performing efficient computations, particularly with intricate OLAP queries, leading to delayed decision-making and potential inaccuracies.

StarRocks addresses these issues head-on. Its superior raw query performance allows rapid data processing irrespective of the number of concurrent interactions. Moreover, its intelligent design bypasses unnecessary computations, enhancing efficiency.

In a typical user-facing scenario, many users execute similar queries on the same data. Often, the queries are completely or partially semantically equivalent. For example, in a dashboard application, many users will want to see a certain business domain's latest metrics or trends. If these queries can be cached, the system can serve more requests with less latency and less resource usage.

StarRocks query cache, what is it?

StarRocks query cache is a mechanism that speeds up aggregation and multi-table join queries by reusing the results of queries launched earlier. It is not a final result cache, but instead caches the intermediate output results of local aggregation in memory.

The result of local aggregations (intermediate results) are cached in memory

Engineered for maximum cache-hits rate

Consider these three scenarios where the query cache is effective even when the queries aren't identical:

Semantically equivalent queries
Queries with overlapping scanned partitions: Predicate-based query splitting
Append-only data changes (no update or delete): Multi-version caching

Semantic equivalence

These two queries make effective use of the query cache because the subquery of the first example is semantically equivalent (or identical) to the second, so the results are reused and pulled from the cache.

SELECT
     (ifnull(sum(murmur_hash3_32(hour)), 0) 
      + ifnull(sum(murmur_hash3_32(k0)), 0)
      + ifnull( sum( murmur_hash3_32(__c_0)), 0)
     ) AS fingerprint
FROM
     ( SELECT  date_trunc( hour', ts) AS hour,
               ko,
               sum(v1).                      AS __с_0
       FROM
               tO
       WHERE 
               ts
       BETWEEN '2022-01-03 00:00:00' and '2022-01-03 23:59:59'
       GROUP BY
               date_trunc("hour', ts),
               k0
) AS t;

and

SELECT date_trunc('hour', ts) AS hour,
       k0,
sum(v1) AS __c_0
FROM t0
WHERE ts
BETWEEN '2022-01-03 00:00:00' AND '2022-01-03 23:59:59'
GROUP BY date_trunc('hour', ts),
         k0;

More details on the query cache are in the reference section, but finish reading the blog to get the whole story.

Overlap in partition: Predicate-based query splitting

In these two queries, only the range differs, and some of the range overlaps. The first part of the query references the column "ts," used to partition the table. StarRocks can detect when a query has a predicate referencing a table's partitioning column that specifies a value range. StarRocks splits the range into multiple intervals for these queries based on table partitioning. Other queries can reuse computation results from each interval.

SELECT date_trunc( 'day', ts) as day,
       sum (v0)
FROM t0
WHERE ts
BETWEEN '2022-01-02 12:30:00' AND '2022-01-14 23:59:59'
GROUP BY day;

and

SELECT date_trunc( 'day', ts) as day,
       sum(v0)
FROM t0
WHERE ts >= '2022-01-02 12:30:00'
      AND ts < '2022-01-05 00:00:00'
GROUP BY day;

Append-only data loading: Multi-version caching

Handling different query permutations and avoiding computation is great, but what about when data changes? Query results can be re-used if the data is appended, but nothing is deleted or updated.

Combining with the query result cache

The query cache is used for aggregations, similar but not identical queries, and volatile data. The similarly named query result cache is more suitable for other queries that are exactly alike and not aggregated against relatively static data. Having different mechanisms, StarRocks query cache and query result cache are designed to serve different purposes.

StarRocks' query cache is more suitable for scenarios where:

Aggregation queries are frequently executed
Queries are similar but not necessarily identical
Data is volatile, as long as the changes are append-only

On the other hand, the query result cache is more suitable for scenarios where:

Queries are non-aggregated
Queries are repeated and identical
Data has very low volatility

The two approaches can be combined to maximize performance. While the query cache stores intermediate results, the result cache stores final query results. A dual-layer caching approach ensures that both intermediate and final results are readily available for reuse, further minimizing the need for redundant computations and disk access.

For businesses, this combination of StarRocks' query cache and an application-layer result cache can lead to substantial improvements in query response times and overall system performance, especially in user-facing analytics scenarios. Faster query processing translates to real-time insights and data-driven decision-making, ultimately enabling businesses to be more agile and responsive to market trends and customer needs. Moreover, the efficient use of resources reduces operational costs and allows businesses to scale their data infrastructure more effectively.

When new data is loaded, new versions of tablets are created. This causes cached computation results from previous tablet versions to become outdated. To address this, the multi-version caching mechanism automatically merges the outdated cached results with the latest tablet versions to generate accurate results for new queries. More details are linked at the bottom of the page; keep reading to get the full picture.

How to use it

In this section, we will use a table from the SSB_flat benchmark to demonstrate how to use the StarRocks query cache.

Preparation

The query cache is disabled by default. You can enable it for the session:

SET enable_query_cache=true;

An example

We will show results using both the Star Schema Benchmark (SSB) and the SSB_flat table that ClickHouse uses. Both of these benchmarks are linked in the references. Some example queries will be shown in the next section that refer to the table `lineorder_flat`, so the DDL for this table is shown here:

SELECT date_trunc( 'day', ts) as day,
sum(v0 )
FROM t0
WHERE ts >= '2022-01-02 12:30:00'
AND ts < '2022-01-05 00:00:00'
GROUP BY day;

CREATE TABLE `lineorder_flat` (
  `lo_orderdate` date NOT NULL COMMENT "",
  `lo_orderkey` int(11) NOT NULL COMMENT "",
  `lo_orderpriority` varchar(100) NOT NULL COMMENT "",
  `lo_shippriority` tinyint(4) NOT NULL COMMENT "",
  `lo_quantity` tinyint(4) NOT NULL COMMENT "",
  `lo_extendedprice` int(11) NOT NULL COMMENT "",
  `lo_ordtotalprice` int(11) NOT NULL COMMENT "",
  `lo_discount` tinyint(4) NOT NULL COMMENT "",
  `lo_revenue` int(11) NOT NULL COMMENT "",
  `lo_supplycost` int(11) NOT NULL COMMENT "",
  `lo_tax` tinyint(4) NOT NULL COMMENT "",
  `lo_commitdate` date NOT NULL COMMENT "",
  `lo_shipmode` varchar(100) NOT NULL COMMENT "",
  `s_nation` varchar(100) NOT NULL COMMENT "",
  `s_region` varchar(100) NOT NULL COMMENT "",
  `p_category` varchar(100) NOT NULL COMMENT "",
  `p_brand` varchar(100) NOT NULL COMMENT ""
) ENGINE=OLAP
DUPLICATE KEY(`lo_orderdate`, `lo_orderkey`)
COMMENT "olap"
PARTITION BY RANGE(`lo_orderdate`)
(PARTITION p1 VALUES [("0000-01-01"), ("1993-01-01")),
PARTITION p2 VALUES [("1993-01-01"), ("1994-01-01")),
PARTITION p3 VALUES [("1994-01-01"), ("1995-01-01")),
PARTITION p4 VALUES [("1995-01-01"), ("1996-01-01")),
PARTITION p5 VALUES [("1996-01-01"), ("1997-01-01")),
PARTITION p6 VALUES [("1997-01-01"), ("1998-01-01")),
PARTITION p7 VALUES [("1998-01-01"), ("1999-01-01")))
DISTRIBUTED BY HASH(`lo_orderkey`) BUCKETS 48
PROPERTIES (
"replication_num" = "1",
"colocate_with" = "groupxx1",
"in_memory" = "false",
"storage_format" = "DEFAULT",
"enable_persistent_index" = "false",
"replicated_storage" = "false",
"compression" = "LZ4"
);

Note: This query is from the ClickHouse SSB_flat benchmark, reference at end of page.

Typical Scenarios

There are several cases where the StarRocks query cache can be especially useful, including:

Aggregated queries on flat table data models.
Aggregated JOIN queries on star-schema data models.

The query cache can greatly help in cases where we need to speed up queries that have the same or similar structure and are executed frequently and concurrently. For example, in the code blocks below you will find that queries Q1, Q2, Q3, and Q4 all have the same pattern and can benefit from the query cache (the range for the filter on l0_orderdate changes, but the rest of the query is the same across Q1 - Q4).

-- Q1 : lo_orderdate ranges ['1993-01-01', '1993-12-31']
SELECT sum(lo_revenue), year(lo_orderdate) AS year,  p_brand
FROM lineorder_flat
WHERE p_category = 'MFGR#12' AND
      s_region = 'AMERICA' AND
      lo_orderdate between '1993-01-01' and '1993-12-31'
GROUP BY year,  p_brand

Q1 : lo_orderdate ranges ['1993-01-01', '1993-12-31']

-- Q2: widen the time range， o_orderdate ranges ['1993-01-01', '1994-12-31']
SELECT sum(lo_revenue), year(lo_orderdate) AS year,  p_brand
FROM lineorder_flat
WHERE p_category = 'MFGR#12' AND
      s_region = 'AMERICA' AND
      lo_orderdate between '1993-01-01' and '1994-12-31'
GROUP BY year,  p_brand

Q2: widen the time range， o_orderdate ranges ['1993-01-01', '1994-12-31']

-- Q3: narrow the time range, o_orderdate ranges ['1993-01-01', '1993-06-30']
SELECT sum(lo_revenue), year(lo_orderdate) AS year,  p_brand
FROM lineorder_flat
WHERE p_category = 'MFGR#12' AND
      s_region = 'AMERICA' AND
      lo_orderdate between '1993-01-01' and '1993-06-30'
GROUP BY year,  p_brand

Q3: narrow the time range, o_orderdate ranges ['1993-01-01', '1993-06-30']

-- Q4: slide time range, o_orderdate ranges ['1993-06-30', '1994-06-30']
SELECT sum(lo_revenue), year(lo_orderdate) AS year,  p_brand
FROM lineorder_flat
WHERE p_category = 'MFGR#12' AND
      s_region = 'AMERICA' AND
      lo_orderdate between '1993-06-30' and '1994-06-30'
GROUP BY year,  p_brand

Q4: slide time range, o_orderdate ranges ['1993-06-30', '1994-06-30']

Using benchmark datasets is good, but here are some business use cases where the StarRocks query cache can be helpful in terms of reducing the load on the underlying data source and increasing the scalability of the system.

For monitoring or dashboard platforms, the dataset is ingested incrementally as time goes by, and users are interested in the summary of the data in different periods
For web servers that generate dynamic pages based on database content
For user-facing analytics with high QPS involving summarization and grouping of data by certain dimensions or measures

Warming up the cache

The query cache can be used to speed up queries after it is populated with the intermediate results of previous queries. Each BE node maintains its own local query cache in memory, and the query can be assigned to any BE that has tablet replicas. Therefore, to fill up the query cache on all BEs, the query should be executed at least as many times as the replication number (replication_num). Just note that the cache does not need to be fully warmed up to take effect.

Best practices

To use the query cache effectively, you need to design your table schema carefully. You should consider:

Select a single date/datetime column as the partition-by column. This column should reflect how the data is incrementally ingested over time and how the queries are filtered by time ranges. If you have multiple date/datetime type columns, choose the one that best matches these criteria.
Choose an appropriate partition size. The most recent data will update the latest partitions of the table, so the cache entries for those partitions are likely to be invalidated frequently. You want to avoid having too large or too small partitions that affect the cache hit rate.
Specify a bucket count in the range of several dozen in the data distribution description. This will ensure that each query can cover enough tablets on each BE to enable the query cache. If the bucket count is too low, the query cache will not work because it requires the tablet number on each BE to be greater than or equal to pipeline_dop.

Just note that tuning a database is all about trade-offs. Tune it in the way that best fits your scenario.

Performance benchmark

Although StarRocks query cache is not a result cache, re-using the intermediate computation result can still yield great performance gains. In addition to aggregate queries, JOIN operations can also be accelerated. Now let's look at some performance data. For simplicity, we represent the result as RT ratio, which is the ratio ofno_cache/cache_hitin terms of query latency. The detailed performance test results are in the appendix

Note that for all tests, the query cache is warmed up fully

Flat table schema test

The test uses the SSB_flat dataset (link in references). The tests were run on three servers, each with 16 cores and 64GB memory. 10 concurrent clients were used. The ratio of query latency without the query cache to with the query cache was reported.

responseTimeRatio

Query Cache on vs. Query Cache off

We can see that for single table (aggregate) queries with 10 concurrent clients, query cache hits can yield up to 10x performance gain.

Star schema test

The test uses the SSB original multi-table dataset (link in references). The tests were run on three servers, each with 16 cores and 64GB memory. 10 concurrent clients were used. The ratio of query latency without the query cache to with the query cache was reported.

RatioStarSchema

Query Cache on vs. Query Cache off

We can see that for multi-table (aggregate) queries with 10 concurrent clients, query cache hits can yield up to 17x performance gain.

Conclusion

The query cache is a powerful feature of StarRocks that can greatly enhance the performance of aggregation queries. By storing the intermediate results of local aggregation in memory, the query cache can avoid unnecessary disk access and computation for new queries that are similar to previous ones. The query cache can also handle queries and data that are not exactly the same, which makes it more flexible and adaptable than a final result cache. The query cache is especially useful for high-concurrency scenarios where many users run similar queries on large and complex data sets. With its query cache, StarRocks can deliver fast and accurate results for aggregation queries, saving time and resources and enabling better scalability.

Reference

StarRocks Benchmarks
Query cache docs
Pat O'Neil, Betty O'Neil, Xuedong Chen: Star Schema Benchmark.
Star Schema flat benchmark in ClickHouse documentation.