Auto Loader can securely ingest data from external locations configured with Unity Catalog. To learn more about securely connecting storage with Unity Catalog, see Connect to cloud object storage using Unity Catalog. Auto Loader relies on Structured Streaming for incremental processing; for recommendations and limitations see Using Unity Catalog with Structured Streaming.
Note
In Databricks Runtime 11.3 LTS and above, you can use Auto Loader with either standard or dedicated access modes (formerly shared and single-user access modes).
Directory listing mode is supported by default. File notification mode is only supported on compute with dedicated access mode.
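As a rough illustration of the access-mode rule above, the sketch below builds the reader options for a stream and refuses to enable file notifications unless the compute uses dedicated access mode. The helper name `auto_loader_options` and its parameters are hypothetical; `cloudFiles.format`, `cloudFiles.schemaLocation`, and `cloudFiles.useNotifications` are Auto Loader option names, and the access mode is assumed to be passed in by the caller rather than detected.

```python
def auto_loader_options(file_format, schema_location, access_mode,
                        use_notifications=False):
    """Build Auto Loader reader options (hypothetical helper, not a Databricks API)."""
    if access_mode not in ("standard", "dedicated"):
        raise ValueError(f"Unknown access mode: {access_mode!r}")

    options = {
        "cloudFiles.format": file_format,
        "cloudFiles.schemaLocation": schema_location,
    }

    # Directory listing is the default, so nothing is set unless the caller
    # explicitly opts in to file notification mode.
    if use_notifications:
        if access_mode != "dedicated":
            raise ValueError(
                "File notification mode requires dedicated access mode"
            )
        options["cloudFiles.useNotifications"] = "true"
    return options
```

On a cluster, the resulting dictionary could be passed to the reader with `spark.readStream.format("cloudFiles").options(**opts)`.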
Specify locations for Auto Loader resources for Unity Catalog
The Unity Catalog security model assumes that all storage locations referenced in a workload will be managed by Unity Catalog. Databricks recommends always storing checkpoint and schema evolution information in storage locations managed by Unity Catalog. Unity Catalog does not allow you to nest checkpoint or schema inference and evolution files under the table directory.
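The nesting restriction can be checked before a stream starts. The following is a minimal sketch, assuming both locations are plain URI strings; the function names and the `examplestorage` account are hypothetical, and the check is a simple path-prefix comparison rather than anything Unity Catalog itself performs.

```python
def is_nested_under(path, parent):
    """Return True if `path` equals `parent` or sits below it."""
    child = path.rstrip("/")
    root = parent.rstrip("/")
    # Append "/" so that ".../dev_table_v2" is not treated as nested
    # under ".../dev_table".
    return child == root or child.startswith(root + "/")


def validate_checkpoint(checkpoint_path, table_path):
    """Raise if the checkpoint or schema location is inside the table directory."""
    if is_nested_under(checkpoint_path, table_path):
        raise ValueError(
            f"{checkpoint_path} is nested under table directory {table_path}"
        )


# Hypothetical storage account name, for illustration only.
base = "abfss://dev-bucket@examplestorage.dfs.core.windows.net"
validate_checkpoint(f"{base}/_checkpoint/dev_table", f"{base}/external/dev_table")
```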
Ingest data from cloud storage using Unity Catalog
The following examples assume the executing user has READ FILES permissions on the external location, owner privileges on the target tables, and the following configurations and grants.
Note
Azure Data Lake Storage is the only Azure storage type supported by Unity Catalog.
| Storage location | Grant |
|---|---|
| abfss://autoloader-source@<storage-account>.dfs.core.windows.net/json-data | READ FILES |
| abfss://dev-bucket@<storage-account>.dfs.core.windows.net | READ FILES, WRITE FILES, CREATE TABLE |
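One way grants like these might be issued is with `GRANT ... ON EXTERNAL LOCATION` statements. The sketch below only builds the SQL text; the external location names (`autoloader_source_loc`, `dev_bucket_loc`) and the principal (`data_engineers`) are hypothetical, and the table-creation privilege is written as `CREATE EXTERNAL TABLE`, the name this page uses when it describes external tables. Check the exact privilege names against your workspace before running anything.

```python
def grant_statement(location_name, principal, privileges):
    """Build a GRANT statement for a Unity Catalog external location."""
    if not privileges:
        raise ValueError("At least one privilege is required")
    joined = ", ".join(privileges)
    return (
        f"GRANT {joined} ON EXTERNAL LOCATION `{location_name}` "
        f"TO `{principal}`"
    )


# Hypothetical location and principal names, mirroring the table above.
statements = [
    grant_statement("autoloader_source_loc", "data_engineers", ["READ FILES"]),
    grant_statement(
        "dev_bucket_loc",
        "data_engineers",
        ["READ FILES", "WRITE FILES", "CREATE EXTERNAL TABLE"],
    ),
]
```

In a notebook, each statement could then be run with `spark.sql(statement)` by a user who is allowed to grant privileges on those locations.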
Use Auto Loader to load to a Unity Catalog managed table
The following examples demonstrate how to use Auto Loader to ingest data to a Unity Catalog managed table.
Python
```python
# Checkpoint and schema information share one path outside the table directory.
checkpoint_path = "abfss://dev-bucket@<storage-account>.dfs.core.windows.net/_checkpoint/dev_table"

(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", checkpoint_path)
    .load("abfss://autoloader-source@<storage-account>.dfs.core.windows.net/json-data")
    .writeStream
    .option("checkpointLocation", checkpoint_path)
    # Process all files available now, then stop.
    .trigger(availableNow=True)
    .toTable("dev_catalog.dev_database.dev_table"))
```
SQL
```sql
CREATE OR REFRESH STREAMING TABLE dev_catalog.dev_database.dev_table
AS SELECT * FROM STREAM read_files(
  'abfss://autoloader-source@<storage-account>.dfs.core.windows.net/json-data',
  format => 'json'
);
```
When you use read_files in a CREATE STREAMING TABLE statement inside a Lakeflow Spark Declarative Pipelines pipeline, checkpoint and schema locations are managed automatically.
Use Auto Loader to load to a Unity Catalog external table
To keep data in a specific storage location, use a Unity Catalog external table instead of a managed table. For example, use an external table to share data with non-Databricks clients or to register existing data. With external tables, you set the storage path. See Work with external tables.
To use Auto Loader with a Unity Catalog external table, first register the table with CREATE TABLE ... LOCATION, then stream into it by name. The table location must be inside an external location where you have CREATE EXTERNAL TABLE permissions. The checkpoint location must also be in an external location managed by Unity Catalog, on a path separate from the table data.
```python
# Keep the checkpoint path separate from the table data path.
checkpoint_path = "abfss://dev-bucket@<storage-account>.dfs.core.windows.net/_checkpoint/dev_table"
table_path = "abfss://dev-bucket@<storage-account>.dfs.core.windows.net/external/dev_table"

# One-time: register the external table in UC.
spark.sql(f"""
    CREATE TABLE IF NOT EXISTS dev_catalog.dev_database.dev_table
    USING DELTA
    LOCATION '{table_path}'
""")

# Stream into the registered table by name.
(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", checkpoint_path)
    .load("abfss://autoloader-source@<storage-account>.dfs.core.windows.net/json-data")
    .writeStream
    .option("checkpointLocation", checkpoint_path)
    .trigger(availableNow=True)
    .toTable("dev_catalog.dev_database.dev_table"))
```