StarRocks Best Practices: Data Ingestion
Over my years as a DBA and StarRocks contributor, I've gained a lot of experience working alongside a diverse array of community members and picked up plenty of best practices. In that time, I've found five areas that stand out as absolutely critical: deployment, data modeling, data ingestion, querying, and monitoring.
In my previous article, I shared some tips on StarRocks data modeling. In this one, I'll cover data ingestion.
Data Ingestion
Data ingestion is important, but it can also be a huge headache if you go about it the wrong way. With StarRocks, you have it easy, and there are only a few key things you need to remember:
Usage Recommendations
- Required: Do not use INSERT INTO VALUES() for production data ingestion.
- Recommended: A minimum interval of 5 seconds between ingestion batches.
- Recommended: For update scenarios in primary key tables, consider enabling the persistent index, but only if you have high-performance storage such as NVMe SSDs.
- Recommended: For scenarios with frequent ETL operations (INSERT INTO SELECT), consider enabling the spill-to-disk feature to avoid exceeding memory limits.
- Recommended: When batch-ingesting into a partitioned table, especially when loading a large volume of historical data from Iceberg, Hudi, or Hive, ingest partition by partition to avoid generating small files.
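The partition-by-partition and minimum-interval recommendations above can be sketched as a small Python helper. This is only an illustration under stated assumptions: the target table is partitioned by day on a date column, the names `sales`, `hive_catalog.db.sales_history`, and `dt` are hypothetical, and `execute` stands in for whatever function your client uses to submit SQL to StarRocks.

```python
import time
from datetime import date, timedelta

# Minimum gap between ingestion batches, taken from the recommendation above.
MIN_INTERVAL_SECONDS = 5.0


def daily_partition_inserts(target, source, date_column, start, end):
    """Yield one INSERT ... SELECT per day in [start, end).

    Assumes the target table is partitioned by day on `date_column`,
    so each statement writes into exactly one partition.
    """
    day = start
    while day < end:
        next_day = day + timedelta(days=1)
        yield (
            f"INSERT INTO {target} "
            f"SELECT * FROM {source} "
            f"WHERE {date_column} >= '{day.isoformat()}' "
            f"AND {date_column} < '{next_day.isoformat()}'"
        )
        day = next_day


def run_spaced(statements, execute, min_interval=MIN_INTERVAL_SECONDS,
               clock=time.monotonic, sleep=time.sleep):
    """Run statements one at a time, starting consecutive ones
    at least `min_interval` seconds apart.

    `execute` is a hypothetical callback that submits one SQL string.
    `clock` and `sleep` are injectable so the pacing can be tested.
    """
    last_start = None
    for sql in statements:
        if last_start is not None:
            wait = min_interval - (clock() - last_start)
            if wait > 0:
                sleep(wait)
        last_start = clock()
        execute(sql)


if __name__ == "__main__":
    # Print the statements instead of executing them.
    for sql in daily_partition_inserts(
        "sales", "hive_catalog.db.sales_history", "dt",
        date(2024, 1, 1), date(2024, 1, 4),
    ):
        print(sql)
```

In a real job you would pass your own `execute` function to `run_spaced`. Because each statement covers a single day, a failed load can be retried for that one partition without redoing the rest.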
Data Lifecycle
- Recommended: Use TRUNCATE rather than DELETE to clear a whole table or partition.
- Required: Full UPDATE syntax is available only for primary key tables in version 3.0 and later. High-concurrency updates are prohibited, and update operations should be spaced at least one minute apart.
- Required: If you use DELETE to remove data, the statement must include a WHERE clause, and concurrent deletes are prohibited. For example, avoid executing 1000 separate statements of the form DELETE FROM tbl1 WHERE id=1; instead, use DELETE FROM tbl1 WHERE id IN (1,2,3,...,1000).
- Required: By default, a DROP operation moves the dropped object to the FE trash, where it is retained for 86400 seconds (1 day) and can be recovered in case of accidental deletion. This retention time is controlled by the catalog_trash_expire_second parameter. After one day, the files move to the BE's trash directory, where they are retained for 259200 seconds (3 days) by default. From versions 2.5.17, 3.0.9, and 3.1.6 onward, this default is 86400 seconds (1 day). It is controlled by the trash_file_expire_time_sec parameter. If you need to free disk space quickly after a drop, consider reducing the FE and BE trash retention times.
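The DELETE advice above, one statement with an IN list instead of many single-row statements, can be sketched in Python as follows. The helper name `batched_delete_statements` and the default batch size of 1000 are assumptions for illustration, and the sketch handles integer keys only. It builds SQL strings and leaves execution to your client.

```python
def batched_delete_statements(table, column, ids, batch_size=1000):
    """Return DELETE statements that each remove up to `batch_size` keys.

    Every statement carries a WHERE ... IN (...) clause, so a list of
    keys becomes a handful of statements rather than one per key.
    Keys are coerced with int() so that only integer literals are emitted.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    keys = [int(v) for v in ids]
    statements = []
    for i in range(0, len(keys), batch_size):
        chunk = keys[i:i + batch_size]
        in_list = ", ".join(str(k) for k in chunk)
        statements.append(f"DELETE FROM {table} WHERE {column} IN ({in_list})")
    return statements


if __name__ == "__main__":
    # 1000 keys collapse into a single statement.
    print(len(batched_delete_statements("tbl1", "id", range(1, 1001))))
```

Since concurrent deletes are prohibited, the resulting statements should be run one after another rather than in parallel.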
As you can see, there aren't too many secrets to smart data ingestion with StarRocks. That said, the tips that do exist are golden, and knowing them early can save you from a ton of frustration.
This sums up my advice for data ingestion, but there's a lot more to share. Head on over to the fourth article in this series, which takes a look at queries with StarRocks. If you have questions, I invite you to join me on the StarRocks Slack, where you can learn more.