Some additional information in one line

Sitting on top of columnar files, analytical systems are built to run fast analytical queries. Unfortunately, processing real-time data, especially with mutable data that includes frequent upserts and deletes, continues to be a persistent challenge for many.

This article dives into what's behind what makes upserts so difficult for real-time data warehouses, why this continues to be a problem for many enterprises, and how engineers have finally solved these issues in actual production systems.

 

 

Problems With Data Upserts

Most data that needs to be analyzed today is naturally mutable, like orders, customer profiles, and inventory records. As business operations evolve, so do the entries in these tables, with changes needing to be reflected in real time for accurate and up-to-date insights.

However, due to a lack of specialized tools for this use case, traditional data warehouses are either:

  1. Designed to handle mutable data only in big batches, not in real-time. Think read optimized tables such as copy-on-write.

  2. Can perform real-time upserts but at the cost of severely degraded query performance. Think about write-optimized tables such as merge-on-read.

 

These shortcomings have pushed many users away from real-time analytics, hindering its adoption, and leaving enterprises missing out on the benefits of real-time.

 

Improving Data Upserts

These data upsert challenges can be addressed, however, but a direct and robust approach is required. One popular approach that has been gaining momentup is delete-and-insert.

This approach allows the system to efficiently manage data updates in real time by marking outdated records for deletion and immediately inserting updated records. This ensures that the data remains current without the overhead of traditional merging operations that affect query performance.

 

A Proven Approach to Data Upserts

A great example of delete-and-insert in action is StarRocks, a real-time data warehouse designed to handle real-time mutable data. StarRocks' primary key table's specific implementation of the delete-and-insert upsert mechanism employs several innovative technologies that set it apart in its approach:

 

Mark delete

StarRocks uses roaring bitmaps to manage and store the deletion markers for each columnar storage file. For instance, if a file contains 10,000 rows, of which 200 are marked as deleted, a bitmap can efficiently mark these deletions. Roaring bitmaps are stored within a local RocksDB and remote storage to enable efficient storage and quick access. Additionally, roaring bitmaps are also cached in memory to boost performance further.

 

In-memory and on-disk primary key index implementations

StarRocks also employs primary key indexes to accelerate the mapping from primary keys to their record locations. This aids in generating columnar storage file-specific deletion markers during the data ingestion and deletion processes. Two implementations of primary key indexes are utilized:

  • In-memory primary key index: Built on-demand and released during periods of inactivity to conserve memory, In-memory primary key indexes are ideal for environments needing quick updates when there is plenty of memory

  • Disk-persistent primary key index: This method sinks primary key indexes to disk and remote storage, optimizing memory use without sacrificing performance. It is ideal for when memory is tight while using high-performance storage devices such as NVME SSDs.

 

This sounds great, but how does it play out in the real world? Let's look at an example from Airbnb to see StarRocks' approach to data upserts in action.

 

Data Upsert Example: Airbnb Fraud Detection

To safeguard the legitimate interests of both their users and the platform, Airbnb's data scientists are tasked with the timely detection of violations to mitigate potential losses. This introduced specific requirements for the OLAP system:

  • Real-time: The system must immediately capture and integrate business data, ensuring real-time data ingestion and querying capabilities.

  • Mutable data: Airbnb's statuses, transactions, and more are mutable and must land in the data warehouse in real-time.

  • Complex queries: Due to the uncertainty of violations, the dataset to query and the query method cannot be predicted. Therefore, the OLAP system must be able to run complex ad-hoc queries on real-time updated data and return results within seconds.

Current technologies like Druid and Presto fell short of fulfilling these criteria, so Airbnb deployed a new Trust Analytics architecture based on StarRocks. This enhanced data processing and analytics capabilities significantly. This architecture encompasses:

Airbnb Architecture

An example of Airbnb's data architecture

 

  • Real-time data ingestion: Utilizing Kafka for logging events and integrating data tables directly into StarRocks allows Airbnb's team to implement real-time data updates through a primary key table, ensuring the required data freshness.

  • Complex query execution: Powered by StarRocks' primary key table, Airbnb enjoys the best of both worlds: second-level data freshness and uncompromising query performance. This enables flexibility and ensures fraudulent activities are identified in time.

 

Thanks to StarRocks, Airbnb achieved the real-time performance needed to power its fraud detection, saving the company from significant losses and highlighting the value of its Trust Analytics engineers.

 

It's Time to Embrace Real-Time Analytics

The challenges of managing real-time data in analytical systems, especially those requiring frequent upserts and deletions, have historically hindered the broader adoption of real-time analytics. The delete-and-insert approach introduced by StarRocks represents a promising solution, ensuring real-time analytics work on any kind of data, not limited to immutable data.

If you have been held back from investing in real-time analytics, approaches like delete-and-instert ensure you can leverage timely insights to drive further success, just like Airbnb did with StarRocks. Join the community on Slack to learn more about how other enterprises are utilizing StarRocks.