How to Load Unstructured Data in Snowflake: Files, Documents, and AI Pipelines

Last Updated: March 27, 2026

Nikesh Vora

Technical Product Manager @ Coefficient

Most Snowflake guides assume your data arrives as rows and columns. Increasingly, the data that matters most does not. PDFs, contracts, images, audio recordings, emails, invoices, support tickets. All of it is sitting in cloud storage or shared drives, untouched by your SQL queries because it does not fit in a table.

Snowflake has invested significantly in changing that. Between stages, directory tables, the FILE data type, and a full suite of Cortex AI functions that went GA in late 2025, there are now practical, SQL-native ways to store, catalog, access, and extract structured insights from unstructured files without moving data to external systems.

This guide covers the full workflow: how Snowflake stores unstructured data, how to catalog it with directory tables, how to control access with scoped URLs, how to process files with UDFs and Snowpark, and how to use Cortex AI functions to turn raw documents and audio into queryable, structured data.

What counts as unstructured data in Snowflake

Snowflake’s own documentation draws a useful distinction across three categories:

| Type | Examples | How Snowflake stores it |
| --- | --- | --- |
| Structured | CSV, database tables | Typed columns in standard tables |
| Semi-structured | JSON, XML, Avro, Parquet | VARIANT columns, queryable with dot notation |
| Unstructured | PDFs, images, audio, video, contracts, emails | Files in stages, referenced via FILE data type or directory tables |

The key thing to understand about unstructured data in Snowflake: Snowflake does not physically copy your files into the database. It leaves them where they are in an internal Snowflake stage or your own cloud storage bucket and gives you SQL-based tools to catalog, access, and process them from there. No base64 encoding, no binary column hacks. Your files stay in cloud storage; Snowflake gives you governed access to them.

Step 1: Create a stage for your unstructured files

Everything starts with a stage. Both internal stages (Snowflake-managed) and external stages (your S3, Azure Blob, or GCS bucket) support unstructured data.

For unstructured data, two things matter beyond a standard stage setup:

Enable server-side encryption so files can be accessed via URLs. By default, internal stages use client-side encryption with keys owned by Snowflake. Client-side encrypted files cannot be read via scoped URLs, file URLs, or pre-signed URLs. Switching to server-side encryption (SNOWFLAKE_SSE) fixes this.

Enable directory tables if you want a SQL-queryable catalog of your files — which you almost always do.

-- Internal stage for documents with directory table and SSE encryption
CREATE OR REPLACE STAGE document_stage
  ENCRYPTION = (TYPE = 'SNOWFLAKE_SSE')
  DIRECTORY = (ENABLE = TRUE);

-- External stage pointing to an S3 bucket with directory table
CREATE OR REPLACE STAGE s3_document_stage
  URL = 's3://my-bucket/documents/'
  STORAGE_INTEGRATION = my_s3_integration
  DIRECTORY = (ENABLE = TRUE, AUTO_REFRESH = TRUE);

For external stages, use a storage integration rather than embedding credentials. Storage integrations use IAM roles or service principals and are the secure, production-ready approach.

Upload files to an internal stage

Use the PUT command from SnowSQL to upload local files:

PUT file:///local/path/contracts/*.pdf @document_stage/contracts/;

Or use the Snowsight UI. Navigate to your stage under Data > Databases, select the stage, and use the upload button to drag and drop files.
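If you are scripting bulk uploads, the PUT statements can be generated from a local directory listing and then run through SnowSQL or the Python connector. A minimal sketch, assuming a local folder of PDFs and the `document_stage` from above; adjust the glob and stage name for your setup:

```python
from pathlib import Path

def build_put_commands(local_root: str, stage: str, subfolder: str = "") -> list[str]:
    """Generate one PUT command per PDF found under local_root."""
    commands = []
    for pdf in sorted(Path(local_root).rglob("*.pdf")):
        target = f"@{stage}/{subfolder}".rstrip("/")
        commands.append(f"PUT file://{pdf.as_posix()} {target}/;")
    return commands
```

Each generated statement can then be executed via SnowSQL or `cursor.execute()` in the Python connector.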

Step 2: Catalog files with directory tables

A directory table is an implicit metadata object layered on top of a stage. It gives you a SQL-queryable catalog of every file on the stage, including path, size, last modified timestamp, etag, and hosted file URL. You do not create it separately. It is enabled when you set DIRECTORY = (ENABLE = TRUE) on the stage.

Refresh the directory

Before querying, refresh the directory to pick up files added since the last refresh:

ALTER STAGE document_stage REFRESH;

For external stages with AUTO_REFRESH = TRUE, Snowflake uses cloud storage event notifications to refresh automatically when new files land.

Query the directory

-- List all files on the stage
SELECT RELATIVE_PATH, SIZE, LAST_MODIFIED, FILE_URL
FROM DIRECTORY(@document_stage);

-- Filter for PDFs modified in the last 7 days
SELECT RELATIVE_PATH, SIZE, LAST_MODIFIED
FROM DIRECTORY(@document_stage)
WHERE RELATIVE_PATH LIKE '%.pdf'
  AND LAST_MODIFIED > DATEADD(DAY, -7, CURRENT_TIMESTAMP());

-- Get a count and total size by file type
SELECT
  SPLIT_PART(RELATIVE_PATH, '.', -1)  AS file_type,
  COUNT(*)                             AS file_count,
  SUM(SIZE) / 1024 / 1024             AS total_size_mb
FROM DIRECTORY(@document_stage)
GROUP BY file_type
ORDER BY file_count DESC;

Directory tables return these columns:

| Column | Contents |
| --- | --- |
| RELATIVE_PATH | File path relative to the stage root |
| SIZE | File size in bytes |
| LAST_MODIFIED | Timestamp of last modification |
| ETAG | Hash of the file contents; useful for detecting changes |
| FILE_URL | Snowflake-hosted file URL |
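Because every row carries an ETAG, a directory snapshot fetched into Python (for example via the Snowflake connector) can be diffed against a previous snapshot to find new or changed files. A hypothetical sketch; the dictionaries map RELATIVE_PATH to ETAG, as selected from DIRECTORY(@stage):

```python
def changed_files(previous: dict[str, str], current: dict[str, str]) -> list[str]:
    """Return paths that are new, or whose etag differs from the last snapshot.

    Both arguments map RELATIVE_PATH -> ETAG.
    """
    return sorted(
        path for path, etag in current.items()
        if previous.get(path) != etag
    )
```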

Join directory tables with structured metadata

A common pattern is to join the directory table with a metadata table that holds structured information about each file (document owner, category, review status) to create a unified view:

-- contract_metadata holds structured fields about each contract
CREATE OR REPLACE VIEW contracts_with_metadata AS
SELECT
  d.RELATIVE_PATH,
  d.FILE_URL,
  d.SIZE,
  d.LAST_MODIFIED,
  m.vendor_name,
  m.contract_value,
  m.expiry_date,
  m.status
FROM DIRECTORY(@document_stage) d
JOIN contract_metadata m
  ON d.RELATIVE_PATH = m.file_path;

Step 3: Access files with scoped URLs, file URLs, and pre-signed URLs

Snowflake gives you three URL types for accessing staged files. Each serves a different access pattern.

| URL type | Authentication | Expiry | Use case |
| --- | --- | --- | --- |
| Scoped URL | Snowflake session token | 24 hours (configurable) | Short-lived user access without stage privileges |
| File URL | Snowflake login required | No expiry | Authenticated internal access for users with stage privileges |
| Pre-signed URL | None (public) | Configurable | Sharing with external parties who do not have Snowflake credentials |

Generate each URL type

-- Scoped URL: short-lived, no stage privilege required
SELECT BUILD_SCOPED_FILE_URL(@document_stage, 'contracts/vendor_a.pdf');

-- File URL: permanent, requires Snowflake login and READ privilege on stage
SELECT BUILD_STAGE_FILE_URL(@document_stage, 'contracts/vendor_a.pdf');

-- Pre-signed URL: public, no authentication, expires in 3600 seconds
SELECT GET_PRESIGNED_URL(@document_stage, 'contracts/vendor_a.pdf', 3600);

Access control note: Users with READ privilege on a stage can generate file URLs for any file on that stage. Scoped URLs allow you to grant access to a specific file without granting any stage privilege. For compliance-sensitive environments, scoped URLs are the right default.
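The decision logic above can be captured in a small helper that emits the right URL-building SQL for a given audience. A sketch only: the three function names are the real Snowflake built-ins shown above, but the audience categories are assumptions for illustration:

```python
def url_sql(audience: str, stage: str, path: str, expiry_seconds: int = 3600) -> str:
    """Return the Snowflake SQL that generates the appropriate URL type."""
    if audience == "external":      # no Snowflake credentials at all
        return f"SELECT GET_PRESIGNED_URL(@{stage}, '{path}', {expiry_seconds});"
    if audience == "privileged":    # has READ on the stage, needs a stable link
        return f"SELECT BUILD_STAGE_FILE_URL(@{stage}, '{path}');"
    # default: short-lived access, no stage privilege required
    return f"SELECT BUILD_SCOPED_FILE_URL(@{stage}, '{path}');"
```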

Step 4: The FILE data type

Introduced in preview in April 2025, the FILE data type lets you store references to staged files directly in Snowflake table columns. A FILE column does not contain the file’s bytes. It holds metadata: stage name, relative path, MIME type, size, etag, and last modified timestamp. Think of it as a typed pointer to a staged file.

-- Create a table with a FILE column
CREATE OR REPLACE TABLE invoice_documents (
  invoice_id    INTEGER,
  vendor        VARCHAR,
  submitted_at  TIMESTAMP_NTZ,
  document      FILE
);

-- Load from an external stage using TO_FILE
INSERT INTO invoice_documents (invoice_id, vendor, submitted_at, document)
SELECT
  seq4()                                                      AS invoice_id,
  'Vendor A'                                                  AS vendor,
  CURRENT_TIMESTAMP()                                         AS submitted_at,
  TO_FILE('@s3_document_stage', RELATIVE_PATH)                AS document
FROM DIRECTORY(@s3_document_stage)
WHERE RELATIVE_PATH LIKE 'invoices/%.pdf';

-- Query file metadata from the table column
SELECT
  invoice_id,
  vendor,
  document:SIZE::INTEGER          AS file_size_bytes,
  document:CONTENT_TYPE::STRING   AS mime_type,
  document:LAST_MODIFIED::STRING  AS last_modified
FROM invoice_documents;

The FILE type is particularly useful when you want to join file references with structured business data and pass those references to Cortex AI functions for processing.
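On the client side, a FILE value comes back as JSON metadata rather than file bytes. A sketch of post-processing that metadata in Python; the key names mirror the query above, but the sample values and the helper itself are invented for illustration:

```python
def summarize_file_ref(file_ref: dict) -> str:
    """Render a one-line summary of a FILE column value fetched as JSON.

    Assumes the metadata keys shown in the query above (RELATIVE_PATH,
    CONTENT_TYPE, SIZE).
    """
    size_mb = file_ref["SIZE"] / (1024 * 1024)
    return f"{file_ref['RELATIVE_PATH']} ({file_ref['CONTENT_TYPE']}, {size_mb:.2f} MB)"
```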

Step 5: Process unstructured data with Cortex AI functions

This is where Snowflake’s unstructured data story has changed most significantly. Cortex AI functions (the AI_* function family) went generally available in November 2025. They let you run LLM-powered extraction, summarization, classification, transcription, and more on staged files and text columns, entirely within Snowflake SQL. No external APIs, no data movement, no infrastructure to manage.

AI_PARSE_DOCUMENT: extract raw text from documents

AI_PARSE_DOCUMENT reads a staged file and returns the text content. Use OCR mode for scanned documents and images. Use LAYOUT mode when you need to preserve document structure – tables, headers, sections.

-- Extract text from a single staged PDF
SELECT
  SNOWFLAKE.CORTEX.AI_PARSE_DOCUMENT(
    @document_stage,
    'contracts/vendor_a.pdf',
    {'mode': 'OCR'}
  ):content::VARCHAR AS extracted_text;

-- Process all PDFs in the stage and store extracted text
CREATE OR REPLACE TABLE contracts_text AS
SELECT
  RELATIVE_PATH,
  SNOWFLAKE.CORTEX.AI_PARSE_DOCUMENT(
    @document_stage,
    RELATIVE_PATH,
    {'mode': 'LAYOUT'}
  ):content::VARCHAR AS document_text
FROM DIRECTORY(@document_stage)
WHERE RELATIVE_PATH LIKE '%.pdf';

AI_EXTRACT: pull structured fields from documents and text

AI_EXTRACT is the function to use when you want specific structured data out of a document. It accepts a staged file or a text string, a set of questions or a JSON schema describing what to extract, and returns a structured JSON object. GA as of October 2025.

-- Extract key fields from a staged invoice PDF
SELECT
  AI_EXTRACT(
    TO_FILE('@document_stage', 'invoices/invoice_001.pdf'),
    {
      'fields': [
        {'name': 'vendor_name',    'description': 'Name of the vendor or supplier'},
        {'name': 'invoice_number', 'description': 'Invoice or reference number'},
        {'name': 'invoice_date',   'description': 'Date of the invoice'},
        {'name': 'total_amount',   'description': 'Total amount due including taxes'},
        {'name': 'due_date',       'description': 'Payment due date'}
      ]
    }
  ) AS extracted_fields;

The result is a JSON object. Parse it into columns for downstream use:

-- Process all invoices and extract structured fields into a table
CREATE OR REPLACE TABLE invoices_structured AS
SELECT
  RELATIVE_PATH,
  AI_EXTRACT(
    TO_FILE('@document_stage', RELATIVE_PATH),
    {
      'fields': [
        {'name': 'vendor_name',    'description': 'Vendor or supplier name'},
        {'name': 'invoice_number', 'description': 'Invoice reference number'},
        {'name': 'total_amount',   'description': 'Total amount due'},
        {'name': 'due_date',       'description': 'Payment due date'}
      ]
    }
  )                                    AS raw_extract,
  raw_extract:vendor_name::STRING      AS vendor_name,
  raw_extract:invoice_number::STRING   AS invoice_number,
  raw_extract:total_amount::FLOAT      AS total_amount,
  raw_extract:due_date::DATE           AS due_date
FROM DIRECTORY(@document_stage)
WHERE RELATIVE_PATH LIKE 'invoices/%.pdf';

Now your invoice PDFs are a queryable table. Join it with your ERP data, flag overdue payments, calculate vendor spend. All in SQL.
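If you pull the raw_extract JSON into Python instead of casting in SQL, the typed conversion looks like this. The field names follow the example above, but the parsing rules (currency-symbol stripping, ISO dates) are assumptions about what the model returns; validate against your own output:

```python
from datetime import date

def parse_invoice(extract: dict) -> dict:
    """Convert AI_EXTRACT's string fields into typed Python values."""
    # Strip currency formatting before casting to float (an assumption
    # about the model's output format)
    amount = extract["total_amount"].replace("$", "").replace(",", "")
    return {
        "vendor_name": extract["vendor_name"],
        "invoice_number": extract["invoice_number"],
        "total_amount": float(amount),
        "due_date": date.fromisoformat(extract["due_date"]),
    }
```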

AI_TRANSCRIBE: extract text from audio and video

AI_TRANSCRIBE converts audio and video files in a stage into text with timestamps and speaker labels. GA as of November 2025.

-- Transcribe a call recording
SELECT
  AI_TRANSCRIBE(
    TO_FILE('@audio_stage', 'calls/support_call_001.mp3')
  ) AS transcript;

-- Extract specific fields from transcribed call content
WITH transcribed AS (
  SELECT
    RELATIVE_PATH,
    AI_TRANSCRIBE(TO_FILE('@audio_stage', RELATIVE_PATH)):text::VARCHAR AS transcript_text
  FROM DIRECTORY(@audio_stage)
  WHERE RELATIVE_PATH LIKE 'calls/%.mp3'
)
SELECT
  RELATIVE_PATH,
  AI_EXTRACT(
    transcript_text,
    {
      'fields': [
        {'name': 'issue_type',    'description': 'Primary issue the customer called about'},
        {'name': 'resolution',    'description': 'How the issue was resolved'},
        {'name': 'sentiment',     'description': 'Customer sentiment: positive, neutral, or negative'}
      ]
    }
  ) AS call_summary
FROM transcribed;

AI_COMPLETE: general-purpose analysis on extracted text

Once you have extracted text from a document, AI_COMPLETE handles anything that requires reasoning – summarization, classification, answering specific questions, generating structured output from free-form content.

-- Summarize contract terms from extracted text
SELECT
  file_path,
  SNOWFLAKE.CORTEX.AI_COMPLETE(
    'mistral-large',
    CONCAT(
      'Summarize the key obligations, payment terms, and termination clauses from this contract in three bullet points: ',
      document_text
    )
  ) AS contract_summary
FROM contracts_text
WHERE document_text IS NOT NULL;

Cortex AI function reference

| Function | What it does | GA status |
| --- | --- | --- |
| AI_PARSE_DOCUMENT | Extracts raw text from PDFs and documents using OCR or layout mode | GA |
| AI_EXTRACT | Extracts structured fields from documents, images, or text | GA Oct 2025 |
| AI_TRANSCRIBE | Transcribes audio and video files to text with timestamps | GA Nov 2025 |
| AI_COMPLETE | General LLM completion: summarize, classify, answer questions on text | GA |
| AI_SENTIMENT | Returns sentiment score from text | GA |
| AI_CLASSIFY | Classifies text or images into user-defined categories | GA Nov 2025 |
| AI_SUMMARIZE_AGG | Summarizes across multiple rows without context window limits | GA |
| AI_EMBED | Generates vector embeddings for semantic search and RAG | GA Nov 2025 |

Step 6: Process files with Python UDFs and Snowpark

For processing that Cortex functions do not cover (custom ML models, specialized parsing libraries, proprietary file formats), Python UDFs and Snowpark give you full access to the Python ecosystem, running inside Snowflake's compute.

Python UDF for PDF text extraction with PyPDF2

CREATE OR REPLACE FUNCTION extract_pdf_text(file_path STRING)
  RETURNS STRING
  LANGUAGE PYTHON
  RUNTIME_VERSION = '3.9'
  PACKAGES = ('snowflake-snowpark-python', 'pypdf2')
  HANDLER = 'extract_text'
AS $$
import io
from PyPDF2 import PdfReader
from snowflake.snowpark.files import SnowflakeFile

def extract_text(file_path):
    # file_path is a scoped URL; SnowflakeFile reads the staged file's bytes
    with SnowflakeFile.open(file_path, 'rb') as f:
        reader = PdfReader(io.BytesIO(f.read()))
        texts = (page.extract_text() for page in reader.pages)
        return ' '.join(t for t in texts if t)
$$;

-- Apply the UDF across all PDFs in the stage
SELECT
  RELATIVE_PATH,
  extract_pdf_text(BUILD_SCOPED_FILE_URL(@document_stage, RELATIVE_PATH)) AS pdf_text
FROM DIRECTORY(@document_stage)
WHERE RELATIVE_PATH LIKE '%.pdf';
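The UDF's core loop (join page texts, skip empty pages) can be exercised locally without Snowflake by stubbing out the reader. A minimal sketch; MockPage is a hypothetical stand-in for a PyPDF2 page object:

```python
def join_page_texts(pages) -> str:
    """Concatenate non-empty page texts, mirroring the UDF body above."""
    texts = (page.extract_text() for page in pages)
    return ' '.join(t for t in texts if t)

class MockPage:
    """Stand-in for a PyPDF2 page; returns canned text."""
    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text
```

Keeping the pure-Python logic separable from the Snowflake file-access layer makes this kind of local testing straightforward.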

Building an automated pipeline with streams and tasks

For production workflows where you want new files to be processed automatically as they land:

-- Step 1: Create a stream on the directory table to detect new files
CREATE OR REPLACE STREAM document_stream
  ON STAGE document_stage;

-- Step 2: Create a task that triggers when new files appear
CREATE OR REPLACE TASK process_new_documents
  WAREHOUSE = my_warehouse
  SCHEDULE = '1 MINUTE'
  WHEN SYSTEM$STREAM_HAS_DATA('document_stream')
AS
INSERT INTO processed_documents (file_path, extracted_text, processed_at)
SELECT
  RELATIVE_PATH,
  extract_pdf_text(BUILD_SCOPED_FILE_URL(@document_stage, RELATIVE_PATH)),
  CURRENT_TIMESTAMP()
FROM document_stream
WHERE METADATA$ACTION = 'INSERT';

-- Step 3: Start the task
ALTER TASK process_new_documents RESUME;

New files land in the stage, the stream detects them, the task fires within a minute, and extracted text is stored in your table automatically. No manual triggers, no scheduled jobs outside Snowflake.
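The task's filtering rule (only act on stream rows whose action is INSERT) can be sanity-checked locally with a toy simulation. The row shape here is an assumption that mirrors the stream columns used in the task above:

```python
def files_to_process(stream_rows: list[dict]) -> list[str]:
    """Pick the new files a task run should handle, skipping deletes."""
    return [
        row["RELATIVE_PATH"]
        for row in stream_rows
        if row["METADATA$ACTION"] == "INSERT"
    ]
```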

Step 7: Structured vs. unstructured. The right Snowflake data type for each

A quick reference before you start building:

| Data | Right approach | Snowflake type |
| --- | --- | --- |
| Structured records (users, orders, transactions) | Standard typed columns | INTEGER, VARCHAR, TIMESTAMP, etc. |
| Semi-structured (API responses, JSON logs) | VARIANT column | VARIANT |
| File references (PDFs, images, audio) | Stage + directory table + FILE type | FILE / staged file |
| Extracted text from files | VARCHAR column after UDF or Cortex processing | VARCHAR |
| Vector embeddings from documents | VECTOR column for similarity search | VECTOR |

Best practices for loading unstructured data

Enable server-side encryption on internal stages from day one. Client-side encrypted stages cannot serve files via scoped, file, or pre-signed URLs. Migrating encryption type after the fact requires re-staging all files.

Use separate stages per data type or team. A single shared stage for all unstructured data creates governance headaches and complicates directory table queries. One stage per domain (contracts, invoices, audio recordings) makes access control and file management far cleaner.

Refresh directory tables explicitly after bulk uploads. For external stages without AUTO_REFRESH, Snowflake does not know about new files until you run ALTER STAGE <name> REFRESH. Build this into your load scripts.

Use scoped URLs as your default for user-facing access. Scoped URLs expire in 24 hours by default, require no stage privilege to generate, and are revocable. File URLs and pre-signed URLs require more careful governance because they do not expire automatically (file URLs) or carry no authentication (pre-signed URLs).

Process documents in batches when using Cortex functions. Cortex AI functions bill per token. For large document sets, test on a small sample first to understand cost before running across millions of files. Use CTEs or staged tables to avoid reprocessing already-extracted content.
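Before running a Cortex function across a large corpus, a rough cost projection helps size the job. The sketch below uses the common heuristic of roughly 4 characters per token; the per-million-token rate is a placeholder, not a real Snowflake price, so substitute your model's actual rate from the Snowflake consumption table:

```python
def estimate_cortex_cost(total_chars: int, rate_per_million_tokens: float) -> float:
    """Rough input-token cost estimate using the ~4 chars-per-token heuristic."""
    tokens = total_chars / 4
    return tokens / 1_000_000 * rate_per_million_tokens
```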

Store extracted structured output in regular Snowflake tables. Once you have run AI_EXTRACT or a Python UDF to get structured fields from a document, write those fields to a proper typed table. Do not re-run extraction on every query; it is expensive and slow. Extract once, store, query the structured output.

Filter at the directory table level before processing. Applying WHERE clauses on the directory table (file type, date range, path prefix) before passing files to UDFs or Cortex functions avoids paying to process files you do not need.

Getting structured outputs from Snowflake into a spreadsheet

Once your unstructured data has been processed into structured Snowflake tables (extracted invoice fields, contract summaries, call-transcript analysis, sentiment scores), the results need to reach the teams who act on them: finance teams working invoice data, legal teams reviewing contract terms, QA teams analyzing call sentiment. Most of those teams live in spreadsheets.

Coefficient connects Snowflake directly to Google Sheets or Excel. Write a SQL query against your extracted data table, paste it into Coefficient’s Custom SQL input, set a refresh schedule, and the results stay current automatically. The data engineer processes the unstructured data in Snowflake; the business user sees the structured output in a live spreadsheet that updates daily without anyone re-running anything.

For more on connecting Snowflake to spreadsheets, see How to Export Snowflake Data into Google Sheets.

Start with Coefficient for free

Frequently asked questions

Does Snowflake store unstructured files inside the database?

No. Snowflake leaves files in their original cloud storage location, either an internal Snowflake-managed stage or your own S3, Azure Blob, or GCS bucket. What Snowflake provides is governed SQL access to those files via stages, directory tables, URL generation functions, and AI processing functions. No file bytes are stored in database columns.

What is a directory table in Snowflake?

A directory table is an implicit metadata object enabled on a stage that provides a SQL-queryable catalog of all files on that stage. Query it with SELECT * FROM DIRECTORY(@stage_name) to see file paths, sizes, timestamps, etags, and hosted URLs. It is not a separate database object. It is enabled via DIRECTORY = (ENABLE = TRUE) on the CREATE STAGE command.

What is the difference between a scoped URL, a file URL, and a pre-signed URL in Snowflake?

Scoped URLs are short-lived (24 hours by default) and require a Snowflake session but no stage privilege. They are ideal for granting temporary access to specific files. File URLs are permanent and require Snowflake login and READ privilege on the stage. Pre-signed URLs require no authentication and expire after a configurable time. Use them for sharing files with external parties who do not have Snowflake credentials.

What is Snowflake AI_EXTRACT and when should I use it?

AI_EXTRACT is a Cortex AI function (GA October 2025) that extracts specific structured fields from documents, images, or text using LLMs. You define what you want to extract, such as vendor name, invoice number, or contract value, and it returns a JSON object with those fields populated. Use it instead of building regex parsers or custom ML models for document extraction tasks.

What file formats does Snowflake support for unstructured data?

Snowflake can stage and reference any file type including PDFs, JPEGs, PNGs, MP3s, MP4s, WAVs, and Word documents. Processing capabilities vary by format. AI_PARSE_DOCUMENT handles PDFs and common document formats. AI_TRANSCRIBE handles audio and video. AI_EXTRACT works across documents and images. Custom Python UDFs can handle any format your Python libraries support.

How does Snowflake charge for Cortex AI functions on unstructured data?

Cortex AI functions bill per token processed. Input and output tokens are both counted for generative functions like AI_COMPLETE. For extraction functions like AI_EXTRACT, only input tokens are counted. Token costs vary by function and model. Test on a representative sample before running across large file sets to understand cost at scale.

Can I automate unstructured data processing when new files arrive?

Yes. Create a stream on the stage’s directory table and pair it with a Snowflake task. The stream detects new file additions. The task fires when stream data is present and runs your UDF or Cortex function against the new files. This gives you a fully automated ingestion-to-extraction pipeline inside Snowflake.