Clone a table on Azure Databricks

Create a copy of an existing table on Azure Databricks at a specific version using the clone command. Clones can be either deep or shallow.

Azure Databricks also supports cloning Parquet and Apache Iceberg tables. See Incrementally clone Parquet and Apache Iceberg tables to Delta Lake.

For details on using clone with Unity Catalog, see Shallow clone for Unity Catalog tables.

Note

Databricks recommends using Delta Sharing to provide read-only access to tables across different organizations. See What is Delta Sharing?.

Clone types

The following clone types are available:

Type SQL Syntax Description
Deep clone CLONE Copies both the data and metadata from the source table to the clone target, including stream metadata. A stream that writes to the source table can be stopped and continued on the clone target from where it left off.
Shallow clone SHALLOW CLONE Copies only the metadata from the source table to the clone target. Data files are not copied. Shallow clones are cheaper to create, because the operation uses less compute resources and storage space.

Cloned metadata includes: schema, partitioning information, invariants, nullability, and TBLPROPERTIES. For deep clones only, stream and COPY INTO metadata are also cloned. Metadata not cloned are the table description and user-defined commit metadata.

Important

Clones don't copy Delta Lake table history or Unity Catalog properties, such as tags. See Work with table history and Apply tags to Unity Catalog securable objects.

Note

Streaming tables and materialized views don't support CLONE. You cannot use a streaming table or materialized view as the source or target of a deep or shallow clone. See Limitations and Limitations.

Clone operation behavior for Hive metastore

Important

In Databricks Runtime 13.3 LTS and above, Unity Catalog managed tables support shallow clones. Clone behavior for Unity Catalog tables differs from clone behavior in other environments. See Shallow clone for Unity Catalog tables.

For a Delta Lake table registered to the Hive metastore or a collection of files not registered as a table, clone has the following behaviors:

  • Changes to deep or shallow clones don't affect the source table.
  • Shallow clones reference data files in the source directory. If you run VACUUM on the source table, clients can no longer read those data files and this raises a FileNotFoundException. To repair, run clone with replace on the shallow clone. If this happens often, consider using a deep clone, which doesn't depend on the source table.
  • Deep clones don't depend on the source table but are expensive to create because they copy both the data and the metadata.
  • Cloning with replace to a target that already has a table at that path creates a Delta log if one doesn't exist. Run VACUUM to clean up any existing data.
  • For existing Delta tables, cloning creates a new incremental commit that includes only new metadata and data from the source table since the last clone.
  • Cloning a table differs from Create Table As Select (CTAS). A clone copies the source table metadata in addition to the data. You don't need to specify partitioning, format, invariants, nullability, or other settings.
  • A cloned table has an independent history from its source table. Time travel queries on a cloned table don't work with the same inputs as on the source table.

Clone metrics

CLONE reports the following metrics as a single row DataFrame once the operation is complete:

  • source_table_size: Size of the source table that's being cloned in bytes.
  • source_num_of_files: The number of files in the source table.
  • num_removed_files: If the table is being replaced, how many files are removed from the current table.
  • num_copied_files: Number of files that were copied from the source (0 for shallow clones).
  • removed_files_size: Size in bytes of the files that are being removed from the current table.
  • copied_files_size: Size in bytes of the files copied to the table.

Clone metrics example

Permissions

You must configure permissions for Azure Databricks table access control and your cloud provider.

Table access control

The following permissions are required for both deep and shallow clones:

  • SELECT permission on the source table.
  • If you are using CLONE to create a new table, CREATE permission on the database in which you are creating the table.
  • If you are using CLONE to replace a table, you must have MODIFY permission on the table.

Cloud provider permissions

Readers of a deep clone need read access to the clone's directory. Writers need write access to the clone's directory.

Readers of a shallow clone need read access to both the source table's data files and the clone's directory, because data files remain in the source. Writers need write access to the clone's directory.

Examples

Create deep or shallow clones

The following code examples demonstrates how to create deep and shallow clones:

SQL

Create a deep clone:

CREATE TABLE target_table CLONE source_table;

Replace an existing target:

CREATE OR REPLACE TABLE target_table CLONE source_table;

Create a deep clone, or skip if the target already exists:

CREATE TABLE IF NOT EXISTS target_table CLONE source_table;

Create a shallow clone at the latest version, at a specific version, or at a specific timestamp. The timestamp can be a date string like '2019-01-01' or an expression like date_sub(current_date(), 1).

CREATE TABLE target_table SHALLOW CLONE source_table;

CREATE TABLE target_table SHALLOW CLONE source_table VERSION AS OF version;

CREATE TABLE target_table SHALLOW CLONE source_table TIMESTAMP AS OF timestamp_expression;

Python

The Python DeltaTable API is specific to Delta Lake.

Clone the source at the latest version:

from delta.tables import *

deltaTable = DeltaTable.forName(spark, "source_table")
deltaTable.clone(target="target_table", isShallow=True, replace=False)

Clone the source at a specific version:

deltaTable.cloneAtVersion(version=1, target="target_table", isShallow=True, replace=False)

Clone the source at a specific timestamp:

deltaTable.cloneAtTimestamp(timestamp="2019-01-01", target="target_table", isShallow=True, replace=False)

Scala

The Scala DeltaTable API is specific to Delta Lake.

Clone the source at the latest version:

import io.delta.tables._

val deltaTable = DeltaTable.forName(spark, "source_table")
deltaTable.clone(target="target_table", isShallow=true, replace=false)

Clone the source at a specific version:

deltaTable.cloneAtVersion(version=1, target="target_table", isShallow=true, replace=false)

Clone the source at a specific timestamp:

deltaTable.cloneAtTimestamp(timestamp="2019-01-01", target="target_table", isShallow=true, replace=false)

For syntax details, see CREATE TABLE CLONE.

Check metadata copied during CLONE

This example confirms which metadata is copied during CLONE operations, specifically TBLPROPERTIES, Unity Catalog tags, and Delta Lake history.

Set up a source table with a custom property and a non-default log retention duration, then insert data to generate table history:

CREATE OR REPLACE TABLE test_clone_source (id INT, val STRING)
TBLPROPERTIES ('my.custom.prop' = 'hello', 'delta.logRetentionDuration' = '12 days');

ALTER TABLE test_clone_source SET TAGS ('team' = 'data-eng', 'env' = 'prod');
INSERT INTO test_clone_source VALUES (1, 'a');
INSERT INTO test_clone_source VALUES (2, 'b');

Create a deep clone and a shallow clone:

CREATE OR REPLACE TABLE test_clone_deep DEEP CLONE test_clone_source;

CREATE OR REPLACE TABLE test_clone_shallow SHALLOW CLONE test_clone_source;

Confirm that TBLPROPERTIES are copied to both clones:

SHOW TBLPROPERTIES test_clone_source;
SHOW TBLPROPERTIES test_clone_deep;
SHOW TBLPROPERTIES test_clone_shallow;

Confirm that Unity Catalog tags are not copied to clones:

SELECT catalog_name, schema_name, table_name, tag_name, tag_value FROM information_schema.table_tags WHERE table_name = 'test_clone_source';
SELECT catalog_name, schema_name, table_name, tag_name, tag_value FROM information_schema.table_tags WHERE table_name = 'test_clone_deep';
SELECT catalog_name, schema_name, table_name, tag_name, tag_value FROM information_schema.table_tags WHERE table_name = 'test_clone_shallow';

Confirm that Delta Lake history is not copied to clones:

DESCRIBE HISTORY test_clone_source;
DESCRIBE HISTORY test_clone_deep;
DESCRIBE HISTORY test_clone_shallow;

Clean up:

DROP TABLE IF EXISTS test_clone_shallow;
DROP TABLE IF EXISTS test_clone_source;
DROP TABLE IF EXISTS test_clone_deep;

Data archiving

You can use deep clone to preserve the state of a table at a certain point in time for archival purposes. You can sync deep clones incrementally to maintain an updated state of a source table for disaster recovery.

Run the following command once a month to sync the archive:

CREATE OR REPLACE TABLE archive_table CLONE my_prod_table

ML model reproduction

For machine learning use cases, you may want to archive a version of a table that was used to train an ML model. Future models can be tested using this archived dataset. To archive a dataset version with CLONE, do the following:

For example, to archive the version of a table used to train a model at version 15:

CREATE TABLE model_dataset CLONE entire_dataset VERSION AS OF 15

Short-term experiments on a production table

To test a workflow on a production table without corrupting the table, create a shallow clone. Shallow clones allow you to run workloads on the cloned table, which contains all the production data, but doesn't affect any production workloads.

Example

Create a shallow clone of the production table:

CREATE OR REPLACE TABLE my_test SHALLOW CLONE my_prod_table;

Run updates and validations on the clone:

UPDATE my_test WHERE user_id is null SET invalid=true;

When ready, merge changes back. The merge uses update information in the clone to prune to only changed files where possible:

MERGE INTO my_prod_table
USING my_test
ON my_test.user_id <=> my_prod_table.user_id
WHEN MATCHED AND my_test.user_id is null THEN UPDATE *;

Drop the clone when finished:

DROP TABLE my_test;

Override table properties

Table property overrides are particularly useful for:

  • Annotating tables with owner or user information when sharing data with different business units.
  • Archiving Delta tables and table history or time travel is required. You can specify the data and log retention periods independently for the archive table. For example:

SQL

For a Delta Lake table:

CREATE OR REPLACE TABLE archive_table CLONE prod.my_table
TBLPROPERTIES (
delta.logRetentionDuration = '3650 days',
delta.deletedFileRetentionDuration = '3650 days'
)

For an Iceberg table:

CREATE OR REPLACE TABLE archive_table CLONE prod.my_table
TBLPROPERTIES (
iceberg.logRetentionDuration = '3650 days',
iceberg.deletedFileRetentionDuration = '3650 days'
)

Python

The Python DeltaTable API is Delta Lake-specific.

dt = DeltaTable.forName(spark, "prod.my_table")
tblProps = {
"delta.logRetentionDuration": "3650 days",
"delta.deletedFileRetentionDuration": "3650 days"
}
dt.clone(target="archive_table", isShallow=False, replace=True, tblProps)

Scala

The Scala DeltaTable API is Delta Lake-specific.

val dt = DeltaTable.forName(spark, "prod.my_table")
val tblProps = Map(
"delta.logRetentionDuration" -> "3650 days",
"delta.deletedFileRetentionDuration" -> "3650 days"
)
dt.clone(target="archive_table", isShallow = false, replace = true, properties = tblProps)