Databricks Steps Up Open-Source Data Lakehouse Contributions

At the Databricks Data + AI Summit this week the company expanded its contributions to the Delta Lake initiative, unleashed a wave of new data lakehouse, machine learning and data governance technologies, and unveiled plans for a marketplace for monetizing data and analytics assets.

Data lakehouse technology developer Databricks is taking its lakehouse platform to the next level with a series of new capabilities – including expanded data governance and machine learning pipeline functionality – that boost analytic workload performance.

Databricks, which held its Data + AI Summit in San Francisco this week, also said it is contributing all Delta Lake enhancements to the Linux Foundation with the release of Delta Lake 2.0, and it unveiled Project Lightspeed, an initiative to develop next-generation Spark streaming data analytics technology.

The company also unveiled an online marketplace where businesses – including channel partners – can package and distribute data and analytics assets.

[Related: Databricks And AWS Extend Alliance With Pay-As-You-Go Data Lakehouse Option]

Databricks, with its Databricks Lakehouse Platform, is one of the fastest-growing companies in the IT space. But it faces stiff competition in the data lakehouse arena from such companies as Snowflake, Dremio and Cloudera – as well as from the major cloud service providers AWS, Microsoft and Google. And the Apache Iceberg and Apache Hudi technologies are seen as open-source alternatives to the Databricks-supported Delta Lake.

Delta Lake is a data storage and table format technology developed by Databricks and open-sourced by the company in 2019. This week the company released Delta Lake 2.0 and said it will contribute all Delta Lake features and enhancements to the Linux Foundation and open source all Delta Lake APIs.
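
For context, here is a minimal sketch of writing and reading a Delta table with the open-source delta-spark package, following the Delta Lake quickstart pattern; the path and sample data are illustrative, not from Databricks’ announcement:

```python
from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip  # from the delta-spark package

# A local Spark session with the open-source Delta Lake extensions enabled.
builder = (SparkSession.builder.appName("delta-demo")
           .config("spark.sql.extensions",
                   "io.delta.sql.DeltaSparkSessionExtension")
           .config("spark.sql.catalog.spark_catalog",
                   "org.apache.spark.sql.delta.catalog.DeltaCatalog"))
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Write a small DataFrame as a Delta table, then read it back;
# the /tmp path is a placeholder.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.write.format("delta").mode("overwrite").save("/tmp/delta/events")
spark.read.format("delta").load("/tmp/delta/events").show()
```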

“There are a lot of performance enhancements we built on top of Delta Lake over the last couple of years. But at the end of the day, we are an open-source company at heart and wanted to bring those back to the community,” said Joel Minnick, Databricks marketing vice president, in an interview with CRN.

The move, Minnick said, means that businesses and organizations can get the same Delta Lake performance whether they choose to run it on Databricks or some other platform.

“With the announcement of Delta Lake 2.0, we’re very pleased to be able to say all those performance enhancements that we have built on top of Delta Lake, we are giving those to the Linux Foundation,” Minnick said. “All that performance, the query performance in particular that you get out of Databricks when you’re using Delta Lake, you will now be able to achieve that anywhere you deploy Delta Lake.”

Minnick said Databricks’ revenue is largely generated by developing high-level data lakehouse functionality and software for specific use cases and vertical industries.

In recent months the company has debuted a number of data lakehouse packages targeting such vertical industries as financial services, healthcare and retail, and it has developed its Brickbuilder Solutions program to work with partners within specific verticals.

“If you don’t out-innovate yourself, somebody else will,” Minnick said. “From a revenue standpoint, things we’ve been investing in like Delta Live Tables [and] Databricks SQL – these are services that really enable the lakehouse paradigm to come to life, enable those really advanced data engineering and data warehousing use cases on a data lake. That’s where we tend to see customers perceiving the most value out of the Databricks platform. And it’s where we put a lot of our engineering efforts.”

Speedier MLOps

At the Data + AI Summit, Databricks also unveiled MLflow 2.0, a new release of the company’s machine learning operations platform, with new MLflow Pipelines functionality designed to provide standardization and repeatability for developing and deploying large numbers of machine learning models and provisioning them with data.

“As a result, I’m going to be able to get more models into production faster and more successfully,” Minnick said. “That’s what we’re trying to do now with MLflow Pipelines; it’s about making that MLOps process much, much easier by instituting repeatability and standardization.”
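
MLflow Pipelines layers that standardization on top of MLflow’s existing tracking primitives. As a rough illustration of those underlying calls (this is the core tracking API rather than Pipelines itself, and the parameter and metric names below are made up):

```python
import mlflow

# A minimal MLflow tracking run; "max_depth" and "rmse" are
# illustrative names, not part of the MLflow Pipelines templates.
with mlflow.start_run(run_name="demo-model"):
    mlflow.log_param("max_depth", 5)
    mlflow.log_metric("rmse", 0.27)
```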

Databricks was founded by the developers of the Spark analytics engine for processing huge volumes of streaming data. This week Databricks debuted Spark Connect, a client and server interface for Apache Spark based on the DataFrame API that makes it possible to access Spark from any device.
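
As a sketch of the intended usage pattern (the endpoint below is a placeholder, and the remote() builder reflects the API as it later landed in PySpark), a thin client attaches to a remote Spark cluster like this:

```python
from pyspark.sql import SparkSession

# Spark Connect: the client speaks the DataFrame API over gRPC to a
# remote Spark server; "sc://spark-server:15002" is a placeholder
# (15002 is the default Spark Connect port).
spark = SparkSession.builder.remote("sc://spark-server:15002").getOrCreate()

spark.range(10).filter("id % 2 = 0").show()
```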

Lightspeed Ahead

The company also announced Project Lightspeed, an effort to engage with the Spark community on what the next generation of the Spark streaming architecture will look like.

“The core focus of Project Lightspeed is to enable really, really low-latency streaming workloads on the lakehouse as I’m moving towards a world where more and more of what I’m trying to do is machine learning decisions in real time. That way the latency between when I see the data and when I make a decision on the data is as small as it possibly can be,” Minnick said.
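
Project Lightspeed centers on Spark Structured Streaming. For context, a minimal Structured Streaming job today looks roughly like the sketch below, with a synthetic rate source, a console sink and a short micro-batch trigger, all of which are illustrative choices:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stream-demo").getOrCreate()

# A toy streaming pipeline: the built-in synthetic "rate" source feeds
# a console sink on a one-second micro-batch trigger. Lightspeed's
# stated goal is to shrink exactly this kind of end-to-end latency.
events = (spark.readStream.format("rate")
          .option("rowsPerSecond", 100)
          .load())

query = (events.selectExpr("value % 10 AS key", "value")
         .writeStream
         .format("console")
         .trigger(processingTime="1 second")
         .start())

query.awaitTermination()
```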

The Databricks Marketplace, slated to launch over the next few months, will provide an open marketplace for packaging and distributing data and analytics assets including data tables, files, machine learning models, notebooks and analytics dashboards.

Built on Delta Lake and powered by the open-source Delta Sharing protocol for securely sharing data, the marketplace will give data providers an opportunity to monetize data assets without moving or replicating data out of their cloud storage.
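
Delta Sharing already ships client connectors; here is a rough sketch using the open-source Python connector (the profile file and table coordinates below are placeholders that a data provider would supply):

```python
import delta_sharing  # open-source Delta Sharing Python connector

# A provider-issued profile file describes the share endpoint and
# credentials; tables are addressed as <share>.<schema>.<table>.
# All names here are placeholders.
table_url = "config.share#demo_share.demo_schema.demo_table"

# Read the shared table into a pandas DataFrame; only the rows read
# are transferred, with no copy of the dataset replicated out of the
# provider's cloud storage.
df = delta_sharing.load_as_pandas(table_url)
print(df.head())
```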

Other Databricks announcements at Data + AI Summit included:

* Databricks SQL Serverless on AWS, which Databricks said removes the need to manage, configure or scale cloud infrastructure on a lakehouse.

* Enzyme, a new optimization layer within Delta Live Tables that speeds up extract, transform and load (ETL) processing.

* Cleanrooms, available in the coming months, which will provide a way to share and join data across organizations in a secure, hosted environment without the need for data replication.

* Data Lineage for Unity Catalog, which the company announced in early June, provides centralized data lineage and governance capabilities for data and AI assets in the lakehouse.