Partitioning for Delta tables

Hive-style partitioning divides a Delta table into physical subdirectories based on the values of one or more partition columns. Each unique combination of partition column values creates a separate directory. This layout enables partition pruning: the engine skips entire directories when a query filters on the partition column.

Tip

For most workloads starting in Fabric Runtime 2.0, liquid clustering is the recommended data layout strategy for read performance. The primary reason to use partitioning is to enable concurrent write operations that don't conflict.

For full liquid clustering guidance, see Liquid clustering.

When to use partitioning

The primary use case for partitioning in Delta Lake is enabling concurrent write operations that don't conflict. Delta Lake uses optimistic concurrency control, and two operations that touch the same files can conflict. Partitioning makes it possible for concurrent operations to target disjoint sets of files by operating on separate partitions.

Use partitioning when:

You have concurrent writers that need to update, delete, or merge into the same table without conflicting—for example, multiple pipelines processing different business units or regions simultaneously.
Your partition column has low to moderate cardinality (tens to hundreds of distinct values, not thousands). Note: larger tables can accommodate more partitions. Target at least 1 GB of data in each partition.
Partition values align with your write patterns—each writer naturally targets a specific partition.

Important

For file skipping and read performance alone, liquid clustering is more effective than partitioning. Liquid clustering eliminates the risk of small-file issues from high-cardinality partition columns and lets you change clustering strategy over the table's lifecycle. Choose partitioning primarily when you need to isolate concurrent writers.

Create a partitioned table

CREATE TABLE sales.orders (
    order_id BIGINT,
    order_date DATE,
    region STRING,
    amount DECIMAL(10,2)
)
USING DELTA
PARTITIONED BY (region)

from delta.tables import DeltaTable

DeltaTable.create(spark) \
    .tableName("sales.orders") \
    .addColumn("order_id", "BIGINT") \
    .addColumn("order_date", "DATE") \
    .addColumn("region", "STRING") \
    .addColumn("amount", "DECIMAL(10,2)") \
    .partitionedBy("region") \
    .execute()

import io.delta.tables.DeltaTable

DeltaTable.create(spark)
    .tableName("sales.orders")
    .addColumn("order_id", "BIGINT")
    .addColumn("order_date", "DATE")
    .addColumn("region", "STRING")
    .addColumn("amount", "DECIMAL(10,2)")
    .partitionedBy("region")
    .execute()

Partitioning and concurrent writes

Partitioning is the primary mechanism in Delta Lake for avoiding conflicts between concurrent write operations. When a table is partitioned, operations that target different partitions operate on disjoint sets of files and don't conflict with each other.

For example, two concurrent MERGE INTO operations on a table partitioned by region don't conflict as long as each targets a different region—provided the partition column is explicitly included in the merge condition:

-- Pipeline A: processes North America only
MERGE INTO sales.orders AS target
USING staged_orders AS source
ON target.order_id = source.order_id
    AND target.region = 'NA'
    AND source.region = 'NA'
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *

spark.sql("""
MERGE INTO sales.orders AS target
USING staged_orders AS source
ON target.order_id = source.order_id
    AND target.region = 'NA'
    AND source.region = 'NA'
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
""")

spark.sql("""
MERGE INTO sales.orders AS target
USING staged_orders AS source
ON target.order_id = source.order_id
    AND target.region = 'NA'
    AND source.region = 'NA'
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
""")

Without partitioning—or without including the partition column in the operation condition—these same operations could conflict even if they logically modify different rows. The partition column must appear in the merge condition itself, not just in the source data. Without it, Delta Lake can't determine at validation time that the two operations touched disjoint file sets.

For a complete guide to conflict types and resolution strategies, see Concurrency control.

Common pitfalls

High cardinality partition columns (for example, user_id with millions of values) create thousands of tiny directories and files, which degrades both write and read performance.
- Date columns should be chosen with caution. For many tables, partitioning by a date column results in too many small partitions. Target at least 1 GB of data in each partition.
Partition columns can't be changed after table creation without rewriting the entire table.
Small file problem is common with streaming or frequent appends into many partitions, because each write creates at least one file per partition.
Partitioning and liquid clustering are incompatible on the same table. You must choose one strategy.

Compare partitioning and liquid clustering

Aspect	Hive-style partitioning	Liquid clustering
Best for	Concurrent writer isolation	General-purpose file skipping and read optimization
Granularity	One directory per distinct value (or combination)	File-level value ranges, no directories
High cardinality	Creates thousands of small files/directories	Handles naturally; bins data into right-sized files
Column changes	Requires full table rewrite	`ALTER TABLE CLUSTER BY` applies on next `OPTIMIZE`
Write path	Partition column must be known at write time	Any column can be clustered after the fact
Concurrent writes	Disjoint partitions avoid conflicts	Append-only without conflicts; updates/deletes/merges can conflict on unpartitioned tables
Small file problem	Common with streaming or frequent inserts	Managed by `OPTIMIZE` compaction

Feedback

Was this page helpful?

Last updated on 2026-05-20