Albert ETL Scripts — Beginner's Guide

Edited

Write your first ETL script in 30 minutes using cosmetic cream examples


What is an ETL Script?

ETL = Extract, Transform, Load

Cream analogy: You run a viscosity test on cream, get raw data, and write a lab report.

  1. Extract — Read the viscometer output file (raw numbers)

  2. Transform — Clean the data, extract what matters (pH, viscosity, temperature)

  3. Load — Return it organized by ingredient batch and testing condition

Your ETL script is the automated lab report writer.


Your First ETL Script (Complete Template)

from io import BytesIO
import pandas as pd
from typing import Any, IO

def ETL(task_metadata: Any, data_buffer: IO) -> list[dict]:
    """
    Read lab data file and return cleaned, organized measurements.
    
    Parameters:
    - task_metadata: Information about which ingredient/batch you're testing
    - data_buffer: The file Albert opened for you (CSV, Excel, etc.)
    
    Returns:
    - list[dict]: Measurements organized by ingredient and condition
    """
    
    # STEP 1: Get ingredient info from Albert
    items = task_metadata.inventory_items if task_metadata else []
    
    inv_id = items[0].id if items else ""
    lot_id = items[0].lot_id or "" if items else ""
    
    # STEP 2: Read the lab output file
    df = pd.read_csv(data_buffer)
    
    # STEP 3: Extract what you need
    result_df = pd.DataFrame([{
        "Viscosity": df.iloc[0]["Viscosity"],
        "pH": df.iloc[0]["pH"],
        "Comments": None,  # Always include Comments (can be None)
    }])
    
    # STEP 4: Return in Albert's required format
    return [{
        "InvId": inv_id,           # Which ingredient batch
        "Lot": lot_id,             # Which lot/batch number
        "Interval": "",            # Albert fills this (leave empty)
        "Data": result_df          # Your cleaned measurements
    }]

That's it! This is a complete, working ETL script. Copy this, modify Step 3, and you're done.


The 4 Steps Explained

STEP 1: Get Ingredient Info from task_metadata

items = task_metadata.inventory_items if task_metadata else []

# Safe check: Do we have ingredient info?
if items:
    inv_id = items[0].id           # "INV-OIL-001"
    lot_id = items[0].lot_id or "" # "LOT-OIL-20240115"
else:
    inv_id = ""
    lot_id = ""

What it means:

  • inventory_items = list of ingredient batches you're testing

  • id = ingredient ID (from Albert's system)

  • lot_id = batch/lot number (from supplier)

Cream analogy: Check your supplier list. Which oil batch are you testing today?

Real example:

Oil Supplier shipment: INV-OIL-001, Lot LOT-OIL-20240115
Water Supplier shipment: INV-WATER-002, Lot LOT-WATER-20240120

You pick one (the first one), and record its ID and lot number.


STEP 2: Read the Lab Output File

Your viscometer (or pH meter, or any instrument) outputs a file. Albert opens it and passes it to you as data_buffer.

# OPTION 1: CSV file (comma-separated values)
df = pd.read_csv(data_buffer)

# OPTION 2: Excel file
df = pd.read_excel(data_buffer)

# OPTION 3: Tab-separated text
df = pd.read_csv(data_buffer, sep='\t')

# OPTION 4: European format (decimal comma, semicolon separator)
df = pd.read_csv(data_buffer, decimal=',', sep=';')

# OPTION 5: Special encoding (handles ° and special chars)
df = pd.read_csv(data_buffer, encoding='latin1')

Debug your file:

# What columns do you have?
print(df.columns)  # ['Temperature', 'Viscosity', 'pH']

# How many rows?
print(len(df))     # 150 rows

# What's in the first row?
print(df.head())   # Shows first 5 rows

Cream analogy: Your viscometer outputs a CSV file. It has columns: Temperature, Viscosity, pH. You read it into pandas.


STEP 3: Extract What You Need

Cream analogy: From the raw viscometer data, Albert only needs specific columns. You pick them out.

Option A: Pick Specific Rows/Columns

# Get first row only
first_row = df.iloc[0]

result_df = pd.DataFrame([{
    "Viscosity": first_row["Viscosity"],
    "pH": first_row["pH"],
    "Temperature": first_row["Temperature"],
    "Comments": None,  # ALWAYS include this
}])

Option B: Copy All Columns (if file is already clean)

result_df = df.copy()

# Add Comments column if missing
if "Comments" not in result_df.columns:
    result_df["Comments"] = None

Option C: Multiple Rows (aggregate)

# Take the average of all readings
result_df = pd.DataFrame([{
    "Viscosity": df["Viscosity"].mean(),
    "pH": df["pH"].mean(),
    "Comments": None,
}])

⚠️ ALWAYS INCLUDE Comments

# ❌ WRONG: Missing Comments
result_df = pd.DataFrame([{
    "Viscosity": 1000,
    "pH": 6.5,
}])

# ✅ CORRECT: Includes Comments (even if None)
result_df = pd.DataFrame([{
    "Viscosity": 1000,
    "pH": 6.5,
    "Comments": None,  # Always include!
}])

STEP 4: Return in Albert's Required Format

return [{
    "InvId": inv_id,       # Ingredient ID from STEP 1
    "Lot": lot_id,         # Batch/lot number from STEP 1
    "Interval": "",        # ALWAYS empty string (Albert fills this)
    "Data": result_df      # Your cleaned DataFrame from STEP 3
}]

Cream analogy: Fill out the lab report form:

Laboratory Report
================
Ingredient: INV-OIL-001
Batch/Lot: LOT-OIL-20240115
Testing Condition: [Albert fills this]
Results:
- Viscosity: 1000
- pH: 6.5
- Comments: None

Task Metadata Explained

Albert passes you task_metadata with information about the test:

inventory_items — Your Ingredient Batches

task_metadata.inventory_items
# [
#   Ingredient(id="INV-OIL-001", lot_id="LOT-OIL-20240115"),
#   Ingredient(id="INV-WATER-002", lot_id="LOT-WATER-20240120"),
# ]

Use it:

items = task_metadata.inventory_items

for item in items:
    print(f"Ingredient: {item.id}, Batch: {item.lot_id}")

# Ingredient: INV-OIL-001, Batch: LOT-OIL-20240115
# Ingredient: INV-WATER-002, Batch: LOT-WATER-20240120

For beginners: Just use items[0] (the first one).


workflow_intervals — Testing Conditions

task_metadata.workflow_intervals
# [
#   Condition(parameter="Temperature", value="20"),
#   Condition(parameter="Temperature", value="50"),
#   Condition(parameter="Temperature", value="100"),
# ]

Cream analogy: Your testing protocol: test at 20°C, 50°C, 100°C.

For beginners: Ignore this. Just note: if you have intervals, you'll need advanced guide.


parameter_names — What You're Measuring

task_metadata.parameter_names
# ["Temperature", "Time"]

For beginners: You don't need to use this directly. Just make sure your output has the right columns.


Common Issues & Fixes

Issue 1: "ModuleNotFoundError: No module named 'pandas'"

Cause: Pandas isn't imported.

Fix:

import pandas as pd  # Add this at the top

Issue 2: "Column 'Viscosity' does not exist"

Cause: Your file has a different column name.

Fix:

# Check what columns you actually have
print(df.columns)

# Rename them
df = df.rename(columns={
    "Visc (cP)": "Viscosity",
    "pH Value": "pH",
})

Issue 3: "Invalid encoding"

Cause: File has special characters (°, ü, etc.) and wrong encoding.

Fix:

# Try latin1
df = pd.read_csv(data_buffer, encoding='latin1')

# Or ISO-8859-1
df = pd.read_csv(data_buffer, encoding='ISO-8859-1')

Issue 4: "Numbers show as text or None"

Cause: Decimal formatting (1.234,56 vs 1234.56).

Fix:

# European format
df = pd.read_csv(data_buffer, decimal=',')

# Or convert after reading
df['Viscosity'] = pd.to_numeric(df['Viscosity'], errors='coerce')

Your Checklist

Before submitting your script:

  • [ ] Function signature is ETL(task_metadata: Any, data_buffer: IO)

  • [ ] You read the file with pd.read_csv() or pd.read_excel()

  • [ ] You extracted the required columns

  • [ ] You included "Comments" column (even if None)

  • [ ] You got inv_id and lot_id from task_metadata.inventory_items

  • [ ] You returned: InvId, Lot, Interval (empty), Data

  • [ ] You tested with a real sample file


When to Move to the Full Guide

You're ready for the Complete Master Guide when you need to:

  • Handle 2+ ingredients (advanced approaches)

  • Learn about temperature/time intervals

  • Understand error handling (try/except)

  • Handle 5+ file formats

  • Learn about exact interval value matching


Real Working Example (Tested ✅)

from io import BytesIO
import pandas as pd
from typing import Any, IO

def ETL(task_metadata: Any, data_buffer: IO) -> list[dict]:
    """Viscosity and pH extraction from CSV."""
    
    # Get ingredient
    items = task_metadata.inventory_items if task_metadata else []
    inv_id = items[0].id if items else ""
    lot_id = items[0].lot_id or "" if items else ""
    
    # Read CSV (simple case first)
    df = pd.read_csv(data_buffer)
    
    # Extract: average of all readings
    result = pd.DataFrame([{
        "Viscosity": float(df["Viscosity"].iloc[0]),
        "pH": float(df["pH"].iloc[0]),
        "Comments": None,
    }])
    
    # Return
    return [{
        "InvId": inv_id,
        "Lot": lot_id,
        "Interval": "",
        "Data": result
    }]

Expected input file (CSV):

Viscosity,pH
1000.0,6.5

More Robust Example (Handles Common Issues)

from io import BytesIO
import pandas as pd
from typing import Any, IO

def ETL(task_metadata: Any, data_buffer: IO) -> list[dict]:
    """Robust example with error handling."""
    
    try:
        # STEP 1: Get ingredient
        items = task_metadata.inventory_items if task_metadata else []
        inv_id = items[0].id if items else ""
        lot_id = items[0].lot_id or "" if items else ""
        
        # STEP 2: Read file (try common formats)
        try:
            # Try standard CSV first
            df = pd.read_csv(data_buffer)
        except:
            # Fall back to Excel
            data_buffer.seek(0)
            df = pd.read_excel(data_buffer)
        
        # STEP 3: Extract data safely
        # Get first row and convert to float
        result = pd.DataFrame([{
            "Viscosity": pd.to_numeric(df["Viscosity"].iloc[0], errors='coerce'),
            "pH": pd.to_numeric(df["pH"].iloc[0], errors='coerce'),
            "Comments": None,
        }])
        
        # STEP 4: Return
        return [{
            "InvId": inv_id,
            "Lot": lot_id,
            "Interval": "",
            "Data": result
        }]
    
    except Exception as e:
        raise ValueError(f"ETL script failed: {str(e)}")

This example handles:

  • ✅ CSV and Excel files

  • ✅ Text that looks like numbers

  • ✅ Missing columns (returns None/NaN)

  • ✅ Clear error messages


Copy This Template (Guaranteed to Work)

from io import BytesIO
import pandas as pd
from typing import Any, IO

def ETL(task_metadata: Any, data_buffer: IO) -> list[dict]:
    
    # Step 1: Get ingredient info
    items = task_metadata.inventory_items if task_metadata else []
    inv_id = items[0].id if items else ""
    lot_id = items[0].lot_id or "" if items else ""
    
    # Step 2: Read the file
    df = pd.read_csv(data_buffer)
    
    # Step 3: Pick what you need (MODIFY THIS PART)
    result = pd.DataFrame([{
        "Viscosity": df["Viscosity"].iloc[0],  # Change column names to match your file
        "pH": df["pH"].iloc[0],
        "Comments": None,  # Always include
    }])
    
    # Step 4: Return
    return [{
        "InvId": inv_id,
        "Lot": lot_id,
        "Interval": "",
        "Data": result
    }]

To use this:

  1. Copy the template

  2. Look at your actual CSV column names

  3. Change "Viscosity" and "pH" to match your file

  4. Test with your real file


How to Know Your Column Names

def ETL(task_metadata: Any, data_buffer: IO) -> list[dict]:
    
    # Debug: Print what columns you have
    df = pd.read_csv(data_buffer)
    print("Columns:", df.columns.tolist())
    print("First row:", df.iloc[0])
    
    # Now update Step 3 to use the actual column names
    # ...

Run this, look at the output, then fix your column names.


The Golden Rule

Your ETL script is a lab report generator.

Read messy lab data → Clean it → Label it by ingredient/batch → Return it organized.

That's all you're doing.

You've got this! 🚀

Was this article helpful?

Sorry about that! Care to tell us more?

Thanks for the feedback!

There was an issue submitting your feedback
Please check your connection and try again.