Big Data Intro.

All content is from public documents and open source projects.

Brief Introductions

Data Warehouses

Data warehouses are for analyzing structured or semi-structured data. They were purpose-built for BI and reporting; however, they come with limitations:

  • No support for video, audio, text
  • No support for data science, machine learning
  • Limited support for streaming
  • Closed & proprietary formats

Data Lakes

Data lakes are for storing data at scale. They can hold all kinds of data for data science and machine learning; however, they have their own drawbacks:

  • Poor BI support
  • Complex to set up
  • Poor performance (operations are file-based)
  • Unreliable data swamps

Databricks

Databricks is a unified analytics platform that combines the benefits of both the data lake and the data warehouse by providing a Lakehouse architecture. The Lakehouse (Data Lake + Data Warehouse) layer built on top of the data lake is called Delta Lake. Below are a few aspects that describe the need for Databricks' Delta Lake:

  • It is an open-format storage layer that delivers reliability, security, and performance on your data lake for both streaming and batch operations.
  • It not only houses structured, semi-structured, and unstructured data but also provides low-cost data management solutions.
  • Databricks Delta Lake also supports ACID (Atomicity, Consistency, Isolation, and Durability) transactions, scalable metadata handling, time travel, and data processing on existing data lakes.
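
The points above can be made concrete with a short PySpark sketch. It assumes a local Spark session with the open source delta-spark package configured, and uses a hypothetical /tmp/delta/events path to show batch writes, ACID appends, and time travel:

    from pyspark.sql import SparkSession

    # Assumes the delta-spark package is installed; the path below is hypothetical.
    spark = (
        SparkSession.builder
        .appName("delta-lake-intro")
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        .getOrCreate()
    )

    path = "/tmp/delta/events"  # hypothetical data lake location

    # Version 0: initial batch write in the open Delta format (Parquet + transaction log).
    spark.range(0, 5).write.format("delta").mode("overwrite").save(path)

    # Version 1: an ACID append; readers never observe a partially written commit.
    spark.range(5, 10).write.format("delta").mode("append").save(path)

    # Time travel: read the table as it was at version 0.
    v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)
    print(v0.count())  # 5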

Parquet

Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language.
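
As a quick illustration (the file path is hypothetical), the PySpark snippet below writes a DataFrame to Parquet and reads back a single column, which is exactly the access pattern a columnar format accelerates:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("parquet-intro").getOrCreate()

    path = "/tmp/parquet/users"  # hypothetical location

    df = spark.createDataFrame(
        [(1, "alice", 30), (2, "bob", 25)],
        ["id", "name", "age"],
    )

    # Parquet stores data column by column, with per-column compression and statistics.
    df.write.mode("overwrite").parquet(path)

    # Selecting one column only scans that column's data, not whole rows.
    spark.read.parquet(path).select("name").show()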

Detailed Introductions

Databricks

ACID Transaction

ACID transaction guarantees are provided by Databricks (Delta Lake) as a storage layer on top of the data lake. This means that:

  • Multiple writers across multiple clusters can simultaneously modify a table partition, each sees a consistent snapshot view of the table, and there is a serial order for these writes.
  • Readers continue to see a consistent snapshot view of the table that the Azure Databricks job started with, even when a table is modified during a job.

Delta Lake (Databricks) uses optimistic concurrency control to provide ACID transaction guarantees. Under this mechanism, writes operate in three stages:

  • Read: Reads (if needed) the latest available version of the table to identify which files need to be modified (that is, rewritten).
  • Write: Stages all the changes by writing new data files.
  • Validate and commit: Before committing the changes, checks whether the proposed changes conflict with any other changes that may have been concurrently committed since the snapshot that was read. If there are no conflicts, all the staged changes are committed as a new versioned snapshot, and the write operation succeeds. However, if there are conflicts, the write operation fails with a concurrent modification exception rather than corrupting the table as would happen with the write operation on a Parquet table.
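
Below is a hedged sketch of how a writer might respond when the validate-and-commit stage fails. It assumes the delta-spark Python package (whose delta.exceptions module exposes the concurrent-modification errors) and a hypothetical table with date and clicks columns; on conflict, the whole read/write/validate cycle is simply retried:

    from delta.exceptions import (
        ConcurrentAppendException,
        ConcurrentDeleteReadException,
    )
    from delta.tables import DeltaTable

    def increment_clicks(spark, path, max_attempts=3):
        """Update a Delta table, retrying when commit validation finds a conflict."""
        for attempt in range(1, max_attempts + 1):
            try:
                table = DeltaTable.forPath(spark, path)   # read the latest snapshot
                # Write + validate/commit: new files are staged and committed only
                # if no conflicting commit landed since the snapshot was read.
                table.update(
                    condition="date = '2024-01-01'",      # hypothetical columns
                    set={"clicks": "clicks + 1"},
                )
                return
            except (ConcurrentAppendException, ConcurrentDeleteReadException):
                # Validation detected a conflicting concurrent commit; the table is
                # left intact, so the operation can simply be retried.
                if attempt == max_attempts:
                    raise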

The default isolation level of a Delta table is WriteSerializable.
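
If a table needs the stricter Serializable level instead, it can be set per table through a Delta table property; a minimal sketch in Spark SQL against a hypothetical events table:

    # Raise the isolation level of a (hypothetical) Delta table named `events`
    # from the default WriteSerializable to Serializable.
    spark.sql("""
        ALTER TABLE events
        SET TBLPROPERTIES ('delta.isolationLevel' = 'Serializable')
    """)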

Refer