Access loaded data in Python
This guide explains how to access and manipulate data that has been loaded into your destination using the dlt Python library. After running your pipeline and loading data, you can use pipeline.dataset() together with dataframe expressions, Ibis, or plain SQL to query the data and read it as Python records, Pandas DataFrames, or Arrow tables.
Quick start example
Here's a full example of how to retrieve data from a pipeline and load it into a Pandas DataFrame or a PyArrow Table.
# Assuming you have a Pipeline object named 'pipeline', which you can create with the dlt CLI: dlt init fruitshop duckdb
# and you have loaded the fruitshop example source data into the destination.
# the tables available in the destination are:
# - customers
# - inventory
# - purchases
# Step 1: Get the readable dataset from the pipeline
dataset = pipeline.dataset()
# Step 2: Access a table as a ReadableRelation
customers_relation = dataset.customers # Or dataset["customers"]
# Step 3: Fetch the entire table as a Pandas DataFrame
df = customers_relation.df() # or customers_relation.df(chunk_size=50)
# Alternatively, fetch as a PyArrow Table
arrow_table = customers_relation.arrow()
Getting started
Assuming you have a Pipeline object (let's call it pipeline), you can obtain a Dataset, which contains the credentials and schema of your destination dataset. You can then construct a query and execute it on the dataset to obtain a Relation, which you use to read data from the Dataset.
Note: The Dataset and Relation objects are lazy-loading. They will only query and retrieve data when you perform an action that requires it, such as fetching data into a DataFrame or iterating over the data. This means that simply creating these objects does not load data into memory, making your code more efficient.
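For example, creating the dataset and a relation is cheap, and the query only runs when you request data. A minimal sketch, reusing the fruitshop pipeline from the quick start above:
# No query is executed yet; these objects are lazy
dataset = pipeline.dataset()
customers_relation = dataset.customers
# Only this call runs a query against the destination and loads data
df = customers_relation.df()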
Access the dataset
# Get the readable dataset from the pipeline
dataset = pipeline.dataset()
# print the row counts of all tables in the destination as dataframe
print(dataset.row_counts().df())
Access tables as relations
The simplest way to get a Relation from a Dataset is to reference a full table:
# Using attribute access
customers_relation = dataset.customers
# Using item access
customers_relation = dataset["customers"]
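Relations can usually be narrowed before any data is fetched. A sketch, assuming the select and limit methods that dlt documents on relations, and that the fruitshop customers table has id and name columns:
# Keep only two columns and at most 10 rows; both calls return new,
# still-lazy relations, so nothing runs until .df() is called
first_customers = dataset.customers.select("id", "name").limit(10)
df = first_customers.df()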
Creating relations with SQL query strings
# Join 'customers' and 'purchases' tables and filter by quantity
query = """
SELECT *
FROM customers
JOIN purchases
ON customers.id = purchases.customer_id
WHERE purchases.quantity > 1
"""
joined_relation = dataset(query)
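The joined relation behaves like any table relation: nothing is executed until you request data. For example:
# Execute the SQL query and read the joined result into a DataFrame
joined_df = joined_relation.df()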
Reading data
Once you have a Relation, you can read its data in several formats, either all at once or in chunks.
Fetch the entire table
Loading full tables into memory without limiting or iterating over them can consume a large amount of memory and may cause your program to crash if the table is too large. It's recommended to use chunked iteration or apply limits when dealing with large datasets.
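If you only need a sample, you can cap the result before fetching it. A minimal sketch, assuming the limit method dlt documents on relations:
# Fetch at most 100 rows instead of the whole table
sample_df = customers_relation.limit(100).df()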
As a Pandas DataFrame
df = customers_relation.df()
As a PyArrow Table
arrow_table = customers_relation.arrow()
As a list of Python tuples
items_list = customers_relation.fetchall()
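For cursor-style reads, single rows and fixed-size batches should also be available; a sketch, assuming the fetchone and fetchmany methods dlt documents on relations:
# Fetch a single row as a tuple
first_row = customers_relation.fetchone()
# Fetch up to 10 rows as a list of tuples
some_rows = customers_relation.fetchmany(10)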
Lazy loading behavior
The Dataset and Relation objects are lazy-loading. This means that they do not immediately fetch data when you create them. Data is only retrieved when you perform an action that requires it, such as calling .df(), .arrow(), or iterating over the data. This approach optimizes performance and reduces unnecessary data loading.
Iterating over data in chunks
To handle large datasets efficiently, you can process data in smaller chunks.
Iterate as Pandas DataFrames
for df_chunk in customers_relation.iter_df(chunk_size=5):
    # Process each DataFrame chunk
    pass
Iterate as PyArrow Tables
for arrow_chunk in customers_relation.iter_arrow(chunk_size=5):
    # Process each PyArrow chunk
    pass
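You can also iterate over chunks of raw tuples; a sketch, assuming the iter_fetch method dlt documents on relations:
for rows_chunk in customers_relation.iter_fetch(chunk_size=5):
    # Each chunk is a list of Python tuples
    pass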