Albert ETL Scripts — Beginner's Guide
Write your first ETL script in 30 minutes using cosmetic cream examples
What is an ETL Script?
ETL = Extract, Transform, Load
Cream analogy: You run a viscosity test on cream, get raw data, and write a lab report.
Extract — Read the viscometer output file (raw numbers)
Transform — Clean the data, extract what matters (pH, viscosity, temperature)
Load — Return it organized by ingredient batch and testing condition
Your ETL script is the automated lab report writer.
Your First ETL Script (Complete Template)
from io import BytesIO
import pandas as pd
from typing import Any, IO
def ETL(task_metadata: Any, data_buffer: IO) -> list[dict]:
"""
Read lab data file and return cleaned, organized measurements.
Parameters:
- task_metadata: Information about which ingredient/batch you're testing
- data_buffer: The file Albert opened for you (CSV, Excel, etc.)
Returns:
- list[dict]: Measurements organized by ingredient and condition
"""
# STEP 1: Get ingredient info from Albert
items = task_metadata.inventory_items if task_metadata else []
inv_id = items[0].id if items else ""
lot_id = items[0].lot_id or "" if items else ""
# STEP 2: Read the lab output file
df = pd.read_csv(data_buffer)
# STEP 3: Extract what you need
result_df = pd.DataFrame([{
"Viscosity": df.iloc[0]["Viscosity"],
"pH": df.iloc[0]["pH"],
"Comments": None, # Always include Comments (can be None)
}])
# STEP 4: Return in Albert's required format
return [{
"InvId": inv_id, # Which ingredient batch
"Lot": lot_id, # Which lot/batch number
"Interval": "", # Albert fills this (leave empty)
"Data": result_df # Your cleaned measurements
}]
That's it! This is a complete, working ETL script. Copy this, modify Step 3, and you're done.
The 4 Steps Explained
STEP 1: Get Ingredient Info from task_metadata
items = task_metadata.inventory_items if task_metadata else []
# Safe check: Do we have ingredient info?
if items:
inv_id = items[0].id # "INV-OIL-001"
lot_id = items[0].lot_id or "" # "LOT-OIL-20240115"
else:
inv_id = ""
lot_id = ""
What it means:
inventory_items= list of ingredient batches you're testingid= ingredient ID (from Albert's system)lot_id= batch/lot number (from supplier)
Cream analogy: Check your supplier list. Which oil batch are you testing today?
Real example:
Oil Supplier shipment: INV-OIL-001, Lot LOT-OIL-20240115
Water Supplier shipment: INV-WATER-002, Lot LOT-WATER-20240120
You pick one (the first one), and record its ID and lot number.
STEP 2: Read the Lab Output File
Your viscometer (or pH meter, or any instrument) outputs a file. Albert opens it and passes it to you as data_buffer.
# OPTION 1: CSV file (comma-separated values)
df = pd.read_csv(data_buffer)
# OPTION 2: Excel file
df = pd.read_excel(data_buffer)
# OPTION 3: Tab-separated text
df = pd.read_csv(data_buffer, sep='\t')
# OPTION 4: European format (decimal comma, semicolon separator)
df = pd.read_csv(data_buffer, decimal=',', sep=';')
# OPTION 5: Special encoding (handles ° and special chars)
df = pd.read_csv(data_buffer, encoding='latin1')
Debug your file:
# What columns do you have?
print(df.columns) # ['Temperature', 'Viscosity', 'pH']
# How many rows?
print(len(df)) # 150 rows
# What's in the first row?
print(df.head()) # Shows first 5 rows
Cream analogy: Your viscometer outputs a CSV file. It has columns: Temperature, Viscosity, pH. You read it into pandas.
STEP 3: Extract What You Need
Cream analogy: From the raw viscometer data, Albert only needs specific columns. You pick them out.
Option A: Pick Specific Rows/Columns
# Get first row only
first_row = df.iloc[0]
result_df = pd.DataFrame([{
"Viscosity": first_row["Viscosity"],
"pH": first_row["pH"],
"Temperature": first_row["Temperature"],
"Comments": None, # ALWAYS include this
}])
Option B: Copy All Columns (if file is already clean)
result_df = df.copy()
# Add Comments column if missing
if "Comments" not in result_df.columns:
result_df["Comments"] = None
Option C: Multiple Rows (aggregate)
# Take the average of all readings
result_df = pd.DataFrame([{
"Viscosity": df["Viscosity"].mean(),
"pH": df["pH"].mean(),
"Comments": None,
}])
⚠️ ALWAYS INCLUDE Comments
# ❌ WRONG: Missing Comments
result_df = pd.DataFrame([{
"Viscosity": 1000,
"pH": 6.5,
}])
# ✅ CORRECT: Includes Comments (even if None)
result_df = pd.DataFrame([{
"Viscosity": 1000,
"pH": 6.5,
"Comments": None, # Always include!
}])
STEP 4: Return in Albert's Required Format
return [{
"InvId": inv_id, # Ingredient ID from STEP 1
"Lot": lot_id, # Batch/lot number from STEP 1
"Interval": "", # ALWAYS empty string (Albert fills this)
"Data": result_df # Your cleaned DataFrame from STEP 3
}]
Cream analogy: Fill out the lab report form:
Laboratory Report
================
Ingredient: INV-OIL-001
Batch/Lot: LOT-OIL-20240115
Testing Condition: [Albert fills this]
Results:
- Viscosity: 1000
- pH: 6.5
- Comments: None
Task Metadata Explained
Albert passes you task_metadata with information about the test:
inventory_items — Your Ingredient Batches
task_metadata.inventory_items
# [
# Ingredient(id="INV-OIL-001", lot_id="LOT-OIL-20240115"),
# Ingredient(id="INV-WATER-002", lot_id="LOT-WATER-20240120"),
# ]
Use it:
items = task_metadata.inventory_items
for item in items:
print(f"Ingredient: {item.id}, Batch: {item.lot_id}")
# Ingredient: INV-OIL-001, Batch: LOT-OIL-20240115
# Ingredient: INV-WATER-002, Batch: LOT-WATER-20240120
For beginners: Just use items[0] (the first one).
workflow_intervals — Testing Conditions
task_metadata.workflow_intervals
# [
# Condition(parameter="Temperature", value="20"),
# Condition(parameter="Temperature", value="50"),
# Condition(parameter="Temperature", value="100"),
# ]
Cream analogy: Your testing protocol: test at 20°C, 50°C, 100°C.
For beginners: Ignore this. Just note: if you have intervals, you'll need advanced guide.
parameter_names — What You're Measuring
task_metadata.parameter_names
# ["Temperature", "Time"]
For beginners: You don't need to use this directly. Just make sure your output has the right columns.
Common Issues & Fixes
Issue 1: "ModuleNotFoundError: No module named 'pandas'"
Cause: Pandas isn't imported.
Fix:
import pandas as pd # Add this at the top
Issue 2: "Column 'Viscosity' does not exist"
Cause: Your file has a different column name.
Fix:
# Check what columns you actually have
print(df.columns)
# Rename them
df = df.rename(columns={
"Visc (cP)": "Viscosity",
"pH Value": "pH",
})
Issue 3: "Invalid encoding"
Cause: File has special characters (°, ü, etc.) and wrong encoding.
Fix:
# Try latin1
df = pd.read_csv(data_buffer, encoding='latin1')
# Or ISO-8859-1
df = pd.read_csv(data_buffer, encoding='ISO-8859-1')
Issue 4: "Numbers show as text or None"
Cause: Decimal formatting (1.234,56 vs 1234.56).
Fix:
# European format
df = pd.read_csv(data_buffer, decimal=',')
# Or convert after reading
df['Viscosity'] = pd.to_numeric(df['Viscosity'], errors='coerce')
Your Checklist
Before submitting your script:
[ ] Function signature is
ETL(task_metadata: Any, data_buffer: IO)[ ] You read the file with
pd.read_csv()orpd.read_excel()[ ] You extracted the required columns
[ ] You included "Comments" column (even if None)
[ ] You got
inv_idandlot_idfromtask_metadata.inventory_items[ ] You returned:
InvId,Lot,Interval(empty),Data[ ] You tested with a real sample file
When to Move to the Full Guide
You're ready for the Complete Master Guide when you need to:
Handle 2+ ingredients (advanced approaches)
Learn about temperature/time intervals
Understand error handling (try/except)
Handle 5+ file formats
Learn about exact interval value matching
Real Working Example (Tested ✅)
from io import BytesIO
import pandas as pd
from typing import Any, IO
def ETL(task_metadata: Any, data_buffer: IO) -> list[dict]:
"""Viscosity and pH extraction from CSV."""
# Get ingredient
items = task_metadata.inventory_items if task_metadata else []
inv_id = items[0].id if items else ""
lot_id = items[0].lot_id or "" if items else ""
# Read CSV (simple case first)
df = pd.read_csv(data_buffer)
# Extract: average of all readings
result = pd.DataFrame([{
"Viscosity": float(df["Viscosity"].iloc[0]),
"pH": float(df["pH"].iloc[0]),
"Comments": None,
}])
# Return
return [{
"InvId": inv_id,
"Lot": lot_id,
"Interval": "",
"Data": result
}]
Expected input file (CSV):
Viscosity,pH
1000.0,6.5
More Robust Example (Handles Common Issues)
from io import BytesIO
import pandas as pd
from typing import Any, IO
def ETL(task_metadata: Any, data_buffer: IO) -> list[dict]:
"""Robust example with error handling."""
try:
# STEP 1: Get ingredient
items = task_metadata.inventory_items if task_metadata else []
inv_id = items[0].id if items else ""
lot_id = items[0].lot_id or "" if items else ""
# STEP 2: Read file (try common formats)
try:
# Try standard CSV first
df = pd.read_csv(data_buffer)
except:
# Fall back to Excel
data_buffer.seek(0)
df = pd.read_excel(data_buffer)
# STEP 3: Extract data safely
# Get first row and convert to float
result = pd.DataFrame([{
"Viscosity": pd.to_numeric(df["Viscosity"].iloc[0], errors='coerce'),
"pH": pd.to_numeric(df["pH"].iloc[0], errors='coerce'),
"Comments": None,
}])
# STEP 4: Return
return [{
"InvId": inv_id,
"Lot": lot_id,
"Interval": "",
"Data": result
}]
except Exception as e:
raise ValueError(f"ETL script failed: {str(e)}")
This example handles:
✅ CSV and Excel files
✅ Text that looks like numbers
✅ Missing columns (returns None/NaN)
✅ Clear error messages
Copy This Template (Guaranteed to Work)
from io import BytesIO
import pandas as pd
from typing import Any, IO
def ETL(task_metadata: Any, data_buffer: IO) -> list[dict]:
# Step 1: Get ingredient info
items = task_metadata.inventory_items if task_metadata else []
inv_id = items[0].id if items else ""
lot_id = items[0].lot_id or "" if items else ""
# Step 2: Read the file
df = pd.read_csv(data_buffer)
# Step 3: Pick what you need (MODIFY THIS PART)
result = pd.DataFrame([{
"Viscosity": df["Viscosity"].iloc[0], # Change column names to match your file
"pH": df["pH"].iloc[0],
"Comments": None, # Always include
}])
# Step 4: Return
return [{
"InvId": inv_id,
"Lot": lot_id,
"Interval": "",
"Data": result
}]
To use this:
Copy the template
Look at your actual CSV column names
Change "Viscosity" and "pH" to match your file
Test with your real file
How to Know Your Column Names
def ETL(task_metadata: Any, data_buffer: IO) -> list[dict]:
# Debug: Print what columns you have
df = pd.read_csv(data_buffer)
print("Columns:", df.columns.tolist())
print("First row:", df.iloc[0])
# Now update Step 3 to use the actual column names
# ...
Run this, look at the output, then fix your column names.
The Golden Rule
Your ETL script is a lab report generator.
Read messy lab data → Clean it → Label it by ingredient/batch → Return it organized.
That's all you're doing.
You've got this! 🚀
