
What are Data Sources?

Data sources are named connections to external data storage systems. Once configured, they can be automatically injected into your functions as Daft DataFrames.

Quick Start

1. Create a Data Source

In the Daft Cloud dashboard:
  1. Navigate to Data Sources in your project sidebar
  2. Click Create data source
  3. Select your source type
  4. Enter a name like sales_data
  5. Configure the connection (bucket, paths, credentials)
  6. Click Create

2. Use It in Your Code

Reference the data source by annotating a function parameter with daft.DataFrame:
import daft

def process_sales(sales_data: daft.DataFrame):
    """
    The `sales_data` parameter will be automatically injected
    with the configured data source.
    """
    # sales_data is already a Daft DataFrame pointing to your data
    result = sales_data.select("product_id", "revenue", "quantity")

    # Process the data
    summary = result.groupby("product_id").agg(
        daft.col("revenue").sum(),
        daft.col("quantity").sum(),
    )

    return summary.to_pydict()

3. Create a Run

When creating a run in the dashboard, map your function parameters to data sources using keyword arguments:
  1. Select the Function entrypoint type
  2. Enter your file path and function name (e.g., my_script.py:process_sales)
  3. In the Keyword Arguments section, add an argument where:
    • The key matches your function parameter name (e.g., sales_data)
    • The value is your configured data source name
  4. Click Create
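
Because each key must match a parameter name exactly, a typo in the Keyword Arguments section is an easy mistake to make. The sketch below is one way to check a mapping locally before creating a run. It relies only on Python's standard `inspect` module; the helper name `unmatched_keys` and the example mapping are hypothetical and are not part of Daft Cloud. The annotation is written as a string so the snippet runs without Daft installed.

```python
import inspect


def unmatched_keys(func, keyword_arguments):
    """Return the keys that do not name a parameter of `func` (hypothetical helper)."""
    params = inspect.signature(func).parameters
    return sorted(key for key in keyword_arguments if key not in params)


# Function shaped like the Quick Start example. The string annotation is not
# evaluated by inspect.signature, so Daft does not need to be importable here.
def process_sales(sales_data: "daft.DataFrame"):
    ...


# Key -> data source name, mirroring the Keyword Arguments section of a run.
good_mapping = {"sales_data": "sales_data"}
bad_mapping = {"sale_data": "sales_data"}  # key has a typo

print(unmatched_keys(process_sales, good_mapping))  # []
print(unmatched_keys(process_sales, bad_mapping))   # ['sale_data']
```

An empty list means every key corresponds to a parameter of the function.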

How Injection Works

Daft Cloud uses type annotations to inject data sources:
  1. Annotate a parameter with daft.DataFrame
  2. Map the parameter to a data source when creating a run
  3. At runtime, the system:
    • Fetches the data source configuration
    • Loads credentials from your project secrets
    • Creates a Daft DataFrame pointing to your data
    • Passes it to your function
No special decorators or syntax are required; the type annotation and the run's keyword-argument mapping are enough.
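
The runtime steps above can be pictured with a small, self-contained sketch. It is an illustration of annotation-driven injection in general and not Daft Cloud's actual implementation: the `DataFrame` class is a stand-in for `daft.DataFrame`, and `DATA_SOURCES`, `SECRETS`, `load_data_source`, and `call_with_injection` are all hypothetical names invented for this example.

```python
from typing import Any, Callable, Dict, get_type_hints


class DataFrame:
    """Stand-in for daft.DataFrame so the sketch runs without Daft installed."""

    def __init__(self, source_name: str, config: Dict[str, Any]):
        self.source_name = source_name
        self.config = config


# Hypothetical stores for data source configuration and project secrets.
DATA_SOURCES = {"sales_data": {"bucket": "example-bucket", "path": "sales/"}}
SECRETS = {"sales_data": {"access_key": "placeholder"}}


def load_data_source(name: str) -> DataFrame:
    config = dict(DATA_SOURCES[name])              # fetch the configuration
    config["credentials"] = SECRETS.get(name, {})  # load credentials
    return DataFrame(name, config)                 # create the DataFrame


def call_with_injection(func: Callable, keyword_arguments: Dict[str, Any]):
    """Replace values of DataFrame-annotated parameters, then call `func`."""
    hints = get_type_hints(func)
    resolved = {}
    for param, value in keyword_arguments.items():
        if hints.get(param) is DataFrame:
            resolved[param] = load_data_source(value)  # value is a source name
        else:
            resolved[param] = value                    # ordinary argument
    return func(**resolved)


def process_sales(sales_data: DataFrame, region: str = "all"):
    return f"{sales_data.source_name}:{region}"


result = call_with_injection(
    process_sales, {"sales_data": "sales_data", "region": "emea"}
)
print(result)  # sales_data:emea
```

Only parameters annotated with the DataFrame type are resolved as data source names in this sketch; other keyword arguments pass through unchanged.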

Multiple Data Sources

You can inject multiple data sources into a single function:
import daft

def combine_data(
    orders: daft.DataFrame,
    products: daft.DataFrame,
    customers: daft.DataFrame,
):
    """Combine data from multiple sources."""
    result = orders.join(products, on="product_id")
    result = result.join(customers, on="customer_id")
    return result.to_pydict()
When creating the run, map each parameter (orders, products, customers) to its corresponding data source in the Keyword Arguments section.

Supported Data Sources