Reduced 80% cost! Didi's Journey from multiple OLAP engines to StarRocks
Introduction
Didi Chuxing Technology Co., Ltd., commonly known as Didi, is a Chinese multinational ride-sharing company headquartered in Beijing. It is one of the world's largest ride-hailing companies, serving 550 million users across over 400 cities. A total of 7.43 billion rides were completed on DiDi's platform in 2017.
Didi's IT environment is highly complex and consists of a large number of systems and applications. The company's data infrastructure is massive, with petabytes of data generated every day. Didi's IT team is responsible for maintaining and operating this complex infrastructure, as well as developing new applications and services.
Here are some specific examples of how Didi's IT environment is used:
- Real-time ride matching: Didi's IT infrastructure is used to match riders with drivers in real time. This requires the company to have a high-performance database that can handle thousands of requests per second.
- Demand forecasting: Didi uses its data and analytics capabilities to forecast demand for rides in its service areas. This information is used to optimize the deployment of drivers and maximize the utilization of its fleet.
- Fraud detection: Didi's IT systems are used to detect and prevent fraudulent activities, such as unauthorized ride requests and fake accounts. This helps to maintain the integrity of the Didi platform and protect its users.
To effectively analyze this data and derive valuable insights, Didi initially relied on various combinations of Apache Druid, Apache Kylin, Presto and then ClickHouse for their OLAP (Online Analytical Processing) needs. However, as the company's data volume and analytics requirements grew, the limitations of these platforms became apparent.
Challenges
Didi faced several challenges with its Presto, Clickhouse and Apache Druid OLAP environment:
- Complexity of management: Managing multiple OLAP systems, such as Presto, Clickhouse and Apache Druid, became increasingly complex for Didi's data engineering team. This resulted in a high overhead in maintaining and troubleshooting the infrastructure.
- Feature differences in a multi-OLAP environment: Users were confused on which system to use for various business use cases and did not understand why queries had to be written differently when using different systems and some data could be deleted but not in the other system.
- Data management complexity: Managing multiple OLAP systems added to the operational overhead, requiring specialized expertise and increasing the risk of data inconsistencies.
- Scalability: The rapid growth of Didi's business led to an exponential increase in data volume. The Presto, Apache Druid and ClickHouse systems were unable to scale effectively to handle this growth, resulting in performance bottlenecks and increased query latency.
- Lack of key features to support real time ingesting and querying: Didi needed to perform real-time analytics on its massive datasets to gain insights into user behavior and optimize its operations. However, the existing system lacked the necessary capabilities for low-latency, real-time analytics; specifically, the ability to do real time streaming inserts, updates and deletes.
- Performance bottlenecks: Presto, Clickhouse and Apache Druid struggled to handle the high volume and complexity of Didi's data. This resulted in slow query performance and long response times, which hindered the company's ability to make data-driven decisions in real time.
- Rapid adoption of innovation: Didi wanted to adopt the emerging open table formats. At the time, Presto, Clickhouse and Apache Druid did not support this.
Solution
To address these challenges, Didi embarked on a journey to modernize its OLAP infrastructure. After evaluating various options, Didi chose to migrate to StarRocks, a unified OLAP platform that combines the strengths of Presto, Apache Druid and ClickHouse.
StarRocks offered several advantages over the existing setup:
- Unified OLAP platform: StarRocks seamlessly integrated real-time and historical data, eliminating the need for separate OLAP systems.
- High performance: StarRocks' columnar storage, vectorized execution engine, and distributed architecture enabled blazing-fast query performance for both ad-hoc and complex analytics.
- Elastic scalability: StarRocks' horizontal scalability allowed Didi to easily scale its OLAP capacity to handle increasing data volumes and workloads.
- Unified data lakehouse architecture: StarRocks natively supports data warehouse capabilities on top of Apache Iceberg, Apache Hudi, Apache Hive and Delta Lake.
Results
Didi's Monitoring and Alarm System business unit
For Didi's monitoring and alarm system business unit, the migration to StarRocks brought about significant improvements. With an intake of 450,000 entries per second, and a daily data volume of 12TB, the new environment provided:
- Reduced query latency: StarRocks delivered faster query performance compared to the previous OLAP setup, enabling real-time insights and decision-making.
- Query performance improved 4x.
- Query P90 time improved from 500ms to 150ms.
- Simplified data management: The unified nature of StarRocks streamlined data management, reduced operational complexity and improved data consistency. In addition, with the ability to better support JOINS at scale, the data movement pipelines were radically simpified.
- Reduced infrastructure costs: StarRocks' elasticity and efficient resource utilization significantly lowered Didi's OLAP infrastructure costs.
- The cluster size reduced from 60+ nodes to less than 10 nodes.
- Data storage volume reduced by about 40%.
- Overall cost reduced by more than 80%.
Didi's Financial Data Portal
The Financial Data Portal is an internal financial product data platform within Didi, which mainly provides data analysis functions for the financial business team to drive the continued optimization of operations. The requirements for the OLAP engine include: supporting custom time dimensions and JOINS across datasets with query response times within seconds. The daily data volume is about 1 billion records.
StarRocks was able to meet the single seconds level query response times through the use of:
- Data bucketing and partitioning
- Use of prefix, ZoneMap, Bloom Filter and Bitmap indices
- Materialized views
Across Didi's entire IT infrastructure
As of 2023, the number of StarRocks clusters is about 30+, the data volume has reached 300+ Terabytes, and the daily average query per second (QPS) is 4,000,000+. StarRocks now serves multiple business lines within the company, including online taxis, carpooling, two-wheeled vehicles, finance, and energy business units.
What's Next for Didi?
Didi continues to leverage StarRocks as its core OLAP platform, enabling the company to:
- Expand real-time analytics use cases: With StarRocks' low latency and high throughput, Didi can extend real-time analytics to more areas of the business, driving real-time decision-making and optimization.
- Explore machine learning applications: StarRocks' integration with machine learning frameworks allows Didi to build and deploy machine learning models directly on top of its OLAP data, enabling advanced predictive analytics.
- Collaborate with the StarRocks community: Didi actively contributes to the StarRocks open-source project, sharing its expertise and helping to shape the future of the platform.
Conclusion
Didi's journey to StarRocks demonstrates the transformative power of a unified OLAP platform. By embracing StarRocks, Didi has achieved significant performance improvements, simplified data management, and reduced infrastructure costs, enabling the company to derive deeper insights from its vast data and make real-time decisions that drive business success.