
Configuration

| Field | Required | Description |
|-------|----------|-------------|
| `bucket` | Yes | S3 bucket name (3-63 characters, lowercase) |
| `paths` | Yes | Array of S3 object paths (supports glob patterns) |
| `format` | No | File format: `parquet`, `json`, `csv`, or `file` |
| `region` | No | AWS region (e.g., `us-east-1`) |
| `endpoint` | No | Custom endpoint for S3-compatible services |
| `secret_name` | No | Reference to a project secret containing AWS credentials |
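Putting the fields together, a source definition might look like the following sketch (the bucket name, path, and secret name are illustrative, not real values):

```json
{
  "bucket": "my-data-bucket",
  "paths": ["data/*.parquet"],
  "format": "parquet",
  "region": "us-east-1",
  "secret_name": "aws-credentials"
}
```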

Credentials

Store your AWS credentials as a project secret in the following JSON format:
```json
{
  "aws_access_key_id": "AKIA...",
  "aws_secret_access_key": "..."
}
```
If no secret is specified, Daft Cloud will attempt to use environment credentials.
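One way to produce a payload in this shape is to assemble it from the standard AWS environment variables before storing it as a secret. This is a minimal sketch, assuming those variables are set in your shell (placeholders are used as fallbacks here so the snippet runs standalone):

```python
import json
import os

# Build the secret payload in the JSON shape shown above.
# Falls back to placeholder strings if the variables are unset.
secret = json.dumps({
    "aws_access_key_id": os.environ.get("AWS_ACCESS_KEY_ID", "AKIA..."),
    "aws_secret_access_key": os.environ.get("AWS_SECRET_ACCESS_KEY", "..."),
})
```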

Glob Pattern Support

Paths support standard glob patterns:
| Pattern | Description |
|---------|-------------|
| `*` | Matches any characters except `/` |
| `?` | Matches any single character |
| `[...]` | Matches any character in the brackets |
| `**` | Matches any number of path segments (recursive) |

Examples:

- `data/*.parquet` - All Parquet files in the `data` folder
- `logs/2024/**/*.json` - All JSON files in any `2024` subdirectory
- `images/batch_[0-9].png` - Specific numbered batch files
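The semantics in the table above can be sketched as a small translator from glob patterns to regular expressions. This is an illustration of the matching rules, not Daft Cloud's actual implementation; the function names are ours:

```python
import re

def glob_to_regex(pattern: str) -> str:
    """Translate the glob syntax from the table into a regex:
    * stays within one path segment, ** spans segments."""
    i, out = 0, []
    while i < len(pattern):
        if pattern.startswith("**", i):
            out.append(".*")          # ** crosses / boundaries
            i += 2
        elif pattern[i] == "*":
            out.append("[^/]*")       # * stops at /
            i += 1
        elif pattern[i] == "?":
            out.append("[^/]")        # exactly one character
            i += 1
        elif pattern[i] == "[":
            j = pattern.index("]", i) # pass bracket class through as-is
            out.append(pattern[i:j + 1])
            i = j + 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return "^" + "".join(out) + "$"

def matches(pattern: str, path: str) -> bool:
    return re.match(glob_to_regex(pattern), path) is not None
```

For instance, `matches("data/*.parquet", "data/sub/a.parquet")` is false because `*` does not cross the `/` separator, while the `**` in `logs/2024/**/*.json` does.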

Example

```python
import daft

def process_s3_data(sales_data: daft.DataFrame):
    """Process data from an S3 data source."""
    return sales_data.select("product_id", "revenue").to_pydict()
```

File Formats

| Format | Extension | Description |
|--------|-----------|-------------|
| Parquet | `.parquet` | Columnar format, best for analytics |
| JSON | `.json` | JSON files, one object per line (JSONL) |
| CSV | `.csv` | Comma-separated values |
| File | Any | Binary files (images, PDFs, audio, video, etc.) |
The format is automatically detected from the file extension, or you can specify it explicitly via the `format` field.
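Extension-based detection can be sketched as a simple lookup that falls back to the generic `file` format. This is a hypothetical helper mirroring the behavior described above, not Daft Cloud's actual API:

```python
from pathlib import PurePosixPath

# Map known extensions to formats; anything else is treated as a
# binary "file". The table above defines the supported formats.
EXTENSION_FORMATS = {
    ".parquet": "parquet",
    ".json": "json",
    ".csv": "csv",
}

def detect_format(path: str) -> str:
    """Return the format for an S3 object path based on its extension."""
    suffix = PurePosixPath(path).suffix.lower()
    return EXTENSION_FORMATS.get(suffix, "file")
```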