Databricks' decision to open source Unity Catalog and donate it to the LF AI & Data Foundation is great news for lakehouse users. By providing a universal interface across data and AI, Unity Catalog ensures interoperability between many popular formats and compute engines. However, this announcement goes beyond simply providing new tools for the lakehouse; it's also a big step toward making open architecture truly viable.

In this blog, we'll break down the news, explain what it means for your work, and show why there has never been a more opportune time to take a more open approach to your infrastructure.

Why Unity Catalog Is a Big Deal for Open Lakehouse Architectures

Open-sourcing Unity Catalog brings with it two significant improvements in how organizations manage and interact with their data:

  1. Flexibility, free from vendor lock-in: Many lakehouse data governance solutions tie users to a specific vendor or platform, limiting flexibility and control over their own data. With Unity Catalog, you own your data and metadata, giving you the freedom to choose the solution that best fits your needs without being confined to a single vendor. This open approach ensures that organizations can stay ahead of the competition without being held back by proprietary systems.

  2. Interoperability between formats and engines: Unity Catalog also delivers seamless interoperability between data formats and compute engines. Whether it's Delta Lake, Apache Iceberg, or Apache Hudi, Unity Catalog ensures that data can be read and managed consistently across different systems. This capability is crucial for modern data architectures, especially for AI applications, where diverse data must be integrated and analyzed by multiple engines on top of the same tables. This interoperability delivers a consistent user experience and simplifies integrations between systems, saving engineering hours and letting you build faster.

What Does Unity Catalog Look Like in Action?

While the benefits sound promising, what does this look like in practice? This section will use StarRocks, an open-source query engine that supports Delta Lake through Delta Kernel Java, to demonstrate how easily we can interoperate between different table formats using Delta UniForm and Unity Catalog.

First, we'll use Docker to bring up a StarRocks service:

docker run -p 9030:9030 -p 8030:8030 -p 8040:8040 -itd --name quickstart starrocks/allin1-ubuntu
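
Before moving on, it's worth making sure the container came up cleanly. A minimal check, using nothing beyond standard Docker commands and the quickstart container name from above:

# Confirm the StarRocks all-in-one container is running
docker ps --filter "name=quickstart"

# Watch the logs until the FE and BE report that they are ready
docker logs -f quickstart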

Then, we follow Unity Catalog's quickstart guide.

In another terminal window, we clone the repository and start the Unity Catalog server:

git clone https://github.com/unitycatalog/unitycatalog.git
cd unitycatalog
bin/start-uc-server
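
Once the server is up, we can sanity-check the sample data that ships with the quickstart. The bundled unity catalog contains a default schema with the marksheet_uniform table we'll query later; it should appear when listing tables with the UC CLI (the exact CLI syntax may differ slightly across versions, so check the quickstart guide for yours):

# List the sample tables bundled with the quickstart server
bin/uc table list --catalog unity --schema default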

Next, we can use the MySQL CLI from the StarRocks container to access StarRocks:

docker exec -it quickstart \
    mysql -P 9030 -h 127.0.0.1 -u root --prompt="StarRocks > "
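
From the StarRocks prompt, a quick sanity check confirms the session works and shows which catalogs already exist before we add our own:

-- Verify the connection and list the catalogs StarRocks currently knows about
SHOW CATALOGS;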

Finally, we can create a REST external catalog to read the Iceberg tables. With Delta UniForm, Iceberg (and Hudi) metadata is automatically generated alongside the Delta metadata whenever data is written, so any client in the Iceberg ecosystem can read the table directly as Iceberg.

-- Create the external catalog
CREATE EXTERNAL CATALOG `uc_rest`
PROPERTIES (
    "type" = "iceberg",
    "iceberg.catalog.type" = "rest",
    "iceberg.catalog.uri" = "http://127.0.0.1:8080/api/2.1/unity-catalog/iceberg",
    "iceberg.catalog.warehouse" = "unity",
    "iceberg.rest-catalog.security" = "OAUTH2",
    "iceberg.rest-catalog.oauth2.token" = "not_used"
);


-- Read the data through the external catalog
SELECT * FROM `uc_rest`.`default`.marksheet_uniform;
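
If the fully qualified name doesn't resolve on your setup, it helps to browse what the Unity Catalog endpoint actually exposes. The statements below are standard StarRocks catalog navigation (SET CATALOG requires a recent StarRocks release) against the uc_rest catalog created above:

-- Switch into the external catalog and browse its contents
SET CATALOG uc_rest;
SHOW DATABASES;
SHOW TABLES FROM `default`;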

How Does This Impact Compute Engines?

While the release of Unity Catalog has many implications, its impact on the compute engine landscape is worth calling out specifically:

  • More competition and innovation: Open-sourcing Unity Catalog dramatically reduces the friction of moving between solutions. This, in turn, will increase competition and create opportunities for new challengers. Ultimately, you can expect to see more options in this space, which translates into more choice for you and lower costs.

  • Greater specialization of compute engines: Unity Catalog enables interoperability between lakehouse formats and compute engines, allowing more engines to coexist on a single source of truth. Specialized compute engines that excel at specific tasks, such as batch processing or low-latency queries, will more easily find their niche, letting organizations adopt technologies finely tuned to their operational needs. This is a big step up for efficiency and performance.

  • A stronger open-source foundation: The significance of open-sourcing Unity Catalog ties directly into the broader push against vendor lock-in, offering users freedom in their choice of technologies. Strong open-source query engines keep the ecosystem vibrant and complete, providing a comprehensive suite of tools that meet your needs without forcing you into a single vendor's solution.

Where Do We Go From Here?

By fostering an open and flexible compute engine landscape, Unity Catalog enhances the open lakehouse ecosystem, enabling compute engines like StarRocks to thrive. To learn more, you can read about Unity Catalog here.

If you're ready to join the open lakehouse conversation, we recommend jumping into the StarRocks Slack channel and the Unity Catalog Slack channel to connect with like-minded engineers who are building the next generation of open lakehouse architectures.