Developing ETL Automation Scripts for Albert Data Templates (Python)


Audience: Developers, informatics teams, and automation engineers · Goal: Write robust ETL scripts that follow Albert Data Template Automation conventions.

This guide describes the standard ETL function signature, conventions for aligning with Data Templates, and typical patterns for building reliable automation scripts in Python.

Standard ETL function signature

All ETL scripts should follow this standardized skeleton, compatible with Albert’s Script Automation.

Some example ETL scripts use task_data as the first parameter. In this guide we call it task_metadata; both refer to the same object provided by Albert at runtime (with helpers like get_inventory_id_from_lot_id and split_dataframe_into_intervals).

from io import BytesIO
import pandas as pd
from typing import IO, Any


def ETL(task_metadata: Any, data_buffer: IO) -> list[dict]:
    """
    Perform ETL on data file associated with a task.

    Parameters
    ----------
    task_metadata : Any
        Metadata about the task to import data to.
    data_buffer : IO
        A Python IO handle on the data file to be processed.

    Returns
    -------
    list[dict]
        A list of dictionaries, each containing `Data` for a block in the task.
    """
    # 1) Read file
    df = pd.read_csv(data_buffer, encoding="utf-8-sig")

    # 2) Clean headers
    df.columns = (
        df.columns
        .map(str)
        .str.strip()
        .str.replace("\u200b", "", regex=True)
    )

    # 3) Transform df -> df_clean according to the file structure & Data Template
    df_clean = df  # replace with real logic

    return [
        {
            "Data": df_clean,
            # "InvId": "",   # Optional
            # "LotId": "",   # Optional
            # "Interval": "" # Optional
        }
    ]

You must replace df_clean = df with real ETL logic.

Alternative minimal example

A very simple ETL example reads a CSV, transposes it, and returns it as a single block:

def ETL(task_metadata: Any, data_buffer: IO) -> list[dict]:
    # Read the CSV into a DataFrame; treat column 0 as the index
    df = pd.read_csv(data_buffer, index_col=0)
    df = df.transpose()

    return [
        {
            "Data": df,   # Required
            # "InvId": "", # Optional
            # "LotId": "", # Optional
            # "Interval": "" # Optional
        }
    ]

A typical script file contains two clearly separated regions:

  • An # %% ETL Definition cell: only the ETL function that Albert will execute.

  • An if __name__ == "__main__": section (“Local testing”) that opens a local file, wraps it in BytesIO, and calls ETL(None, data_buffer). This never runs inside Albert but makes local debugging easy.

Treat the ETL skeleton as an interface contract: keep the function name, parameters, and return type stable so scripts remain compatible with Albert and “Import Results”.

When to prefer ETL scripts over manual Import Results mapping

  • Your Property Task uses multiple Inventories and the instrument exports one combined file or a .zip with multiple files (ETL can assign data to the right Inventory in one run).

  • You need custom “translation” logic from numeric values to “speaking” labels for scientists or customers (for example, color classes or pass/fail bands).

  • You want vendor-agnostic imports: similar data from different instruments should end up in the same Data Template, even if file formats differ.

  • You need automatic post-processing (for example, scaling, derived KPIs) as part of the import, not in a separate Excel step.

Core conventions you must follow

1. Column naming and Data Template alignment

  • The final Data DataFrame(s) must have columns that match the Data Template:

    • Exact spelling.

    • Exact case.

    • Exact units and symbols (for example, "Viscosity (10 s-1)", "Time Since Creation (days)").

  • Use an explicit rename_map and explicit column ordering:

    rename_map = {
        "Batch Name": "LotId",
        "Batch CIE L": "L",
        # ...
    }
    df = df.rename(columns=rename_map)
    
    expected_order = [
        "InventoryID", "LotId", "L", "A", "B", "C", "H*",
        "DL", "dA", "dB", "Dc", "dH", "dE",
    ]
    df = df[expected_order]
  • If you introduce derived columns (for example, deltas or ratios), keep the naming transparent so scientists can understand what each field represents.

2. Validation of required inputs

Before transforming, assert the presence of required columns in the raw file:

required_cols = ["Batch Name", "Batch CIE L", "Batch CIE a"]
missing = [c for c in required_cols if c not in df.columns]
if missing:
    raise ValueError("❌ Missing required column(s): " + ", ".join(missing))

Beyond column presence, also validate other assumptions that show up in real scripts:

  • No completely empty data sections after header rows.

  • Reasonable value ranges (for example, no negative times or physically impossible temperatures).

  • Uniqueness of identifiers where required (for example, one row per Lot × Time combination).

When such assumptions fail, prefer raising a clear ValueError or using assert with a descriptive message early instead of letting a later transformation crash with a low-level pandas error.
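
A minimal sketch of such checks, assuming example column names like "Time Since Creation (days)"; adapt them to your raw file:

if df.dropna(how="all").empty:
    raise ValueError("❌ Import failed: no data rows found below the header.")

times = pd.to_numeric(df["Time Since Creation (days)"], errors="coerce")
if (times < 0).any():
    raise ValueError("❌ Import failed: negative 'Time Since Creation (days)' values found.")

duplicated = df.duplicated(subset=["LotId", "Time Since Creation (days)"])
if duplicated.any():
    raise ValueError(
        "❌ Import failed: more than one row per Lot × Time combination: "
        + ", ".join(df.loc[duplicated, "LotId"].astype(str).unique())
    )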

3. Type conversion & decimal handling

Use pd.to_numeric(..., errors="coerce") for numeric fields. For European decimal formats, standardize:

def comma_to_dot(series: pd.Series) -> pd.Series:
    series = series.astype(str).str.replace(",", ".", regex=False)
    return pd.to_numeric(series, errors="coerce")

ETL scripts commonly import numpy to support operations like log10, handling sentinel values, and building composite keys. Prefer vectorised numpy / pandas operations over manual loops for any non-trivial numeric transformation.
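
For illustration, a short sketch of vectorised cleanup; the column names ("Frequency (Hz)", "Temperature (°C)") and the -9999 sentinel are assumptions, not fixed conventions:

import numpy as np

# Log-scale a frequency axis; non-positive values become NaN instead of raising.
df["Frequency (Hz)"] = pd.to_numeric(df["Frequency (Hz)"], errors="coerce")
df["log10(Frequency)"] = np.log10(df["Frequency (Hz)"].where(df["Frequency (Hz)"] > 0))

# Treat an instrument sentinel value as missing.
df = df.replace(-9999, np.nan)

# Build a composite key, for example for de-duplication or derived interval IDs.
df["MeasurementKey"] = df["LotId"].astype(str) + "_" + df["Temperature (°C)"].astype(str)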

4. Inventory & lot mapping

Use task_metadata.get_inventory_id_from_lot_id(lotid) where applicable:

df["InventoryID"] = df["LotId"].apply(
    lambda x: task_metadata.get_inventory_id_from_lot_id(str(x))
)

Apply any lot transformation rules (for example, B3160-56423 → LOTB56423) before that call.

In local tests, ETL functions are often called as ETL(None, data_buffer). That’s fine as long as you stub or skip the call to get_inventory_id_from_lot_id. In production, the same ETL function receives a real task object, and the helper can resolve Inventory IDs from the lot codes present in the file.
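
A hedged sketch combining a lot-code transformation with a guard for local runs; the normalise_lot_code rule below only illustrates the B3160-56423 → LOTB56423 example and must be adapted to your naming scheme:

import re

def normalise_lot_code(raw: str) -> str:
    # Example rule only: "B3160-56423" -> "LOTB56423".
    m = re.match(r"^B\d+-(\d+)$", str(raw).strip())
    return f"LOTB{m.group(1)}" if m else str(raw).strip()

df["LotId"] = df["LotId"].apply(normalise_lot_code)

if task_metadata is not None:  # skip the lookup when testing locally with ETL(None, ...)
    df["InventoryID"] = df["LotId"].apply(
        lambda x: task_metadata.get_inventory_id_from_lot_id(str(x))
    )
else:
    df["InventoryID"] = ""  # placeholder for local debugging only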

If your instrument export already contains a unique Inventory ID column, you may:

  • Fill "InvId" directly from the file and leave "LotId" empty, or

  • Fill both, if you want the Template to show both Inventory and Lot identifiers.

5. Intervals and block splitting

Use intervals when your Data Template represents a series of repeated measurements along one or more “interval axes”, such as:

  • Stability measurements over Time Since Creation (days).

  • Viscosity, EIS, or ICP curves versus Temperature (°C) or other conditions.

  • Any scenario where one lot has multiple curves, sweeps, or time points.

A typical example is a file like simulated_eis_data_intervals.csv: the instrument exports one big table where each lot has several “measurement intervals” (for example, separate sweeps) stacked under each other. Your ETL script:

  • Parses the raw file into one long output_df.

  • Computes or cleans the interval-defining columns (for example time since creation, temperature, sweep ID).

  • Passes that final output_df into the interval helper.

For interval-based Data Templates, use the built-in helper. It turns a single cleaned DataFrame into multiple Albert blocks, one per interval:

return task_metadata.split_dataframe_into_intervals(
    output_df,
    ["Time Since Creation", "Temperature"],
    inventory_id_column="InventoryID",
    lot_id_column="LotId",
)

What the helper expects

  • output_df is your final, cleaned DataFrame:

    • One row per measurement (for example, one row per time point or per EIS frequency row).

    • All required Template columns present and correctly typed.

  • The second argument is a list of interval columns:

    • Use the exact column names from the Data Template, for example "Time Since Creation", "Temperature".

    • These columns define how Albert slices the data into intervals for each Inventory / Lot.

    • For more complex experiments (for example multiple EIS sweeps per time point), you can include an additional discrete “interval ID” column you derive in ETL, such as "Sweep Index" (see the sketch after this list).

  • inventory_id_column and lot_id_column:

    • Point to the columns that carry the Inventory and Lot IDs.

    • These IDs are copied into the resulting block dictionaries.
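
A hedged sketch of deriving such a "Sweep Index" column, assuming each EIS sweep scans the frequency axis from high to low so that a jump back up marks the start of a new sweep (column names are placeholders):

freq = pd.to_numeric(output_df["Frequency (Hz)"], errors="coerce")
jump_up = (
    freq.groupby([output_df["InventoryID"], output_df["LotId"]]).diff() > 0
).astype(int)
output_df["Sweep Index"] = jump_up.groupby(
    [output_df["InventoryID"], output_df["LotId"]]
).cumsum()

"Sweep Index" can then be appended to the list of interval columns passed to split_dataframe_into_intervals.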

How the helper behaves conceptually

  • Groups output_df by [inventory_id_column, lot_id_column] + interval_columns.

  • For each group, creates a block (one element in the returned list[dict]).

  • Generates an Interval identifier that Albert uses to link the block back to the interval in the Data Template.

The helper returns a list[dict] where each dict looks like:

{
    "Interval": interval_id,
    "InvId": "INV123",
    "LotId": "LOTB56423",
    "Data": df_block,
}

Within each block, Data should contain the rows and columns that correspond to the Template’s result grid for that specific interval. The helper takes care of duplicating the correct subset of rows into each interval block.

When not to use intervals

  • If your Template has one row per lot / sample (no repeated time points or curves), skip intervals and use the grouped-blocks pattern from section 6 (Non-interval case).

  • If your raw file already represents one Albert block per file, you can simply return a single dictionary with "Data": df_clean and omit "Interval".

  • If you only have a single interval (for example, one fixed time point), intervals are optional. You can still use them for consistency, but a non-interval block is often simpler.

6. Non-interval case (grouped blocks)

For non-interval templates, group by inventory/lot or sample ID:

result: list[dict] = []
for (inv_id, lot_id), group in output_df.groupby(["InventoryID", "LotId"], dropna=False):
    data = group.drop(columns=["InventoryID", "LotId"])
    result.append(
        {
            "InvId": inv_id,
            "LotId": lot_id,
            "Data": data,
        }
    )
return result

7. Using the task_metadata / task_data context

The first parameter (task_metadata / task_data) provides two key capabilities:

  • Inventory mapping: infer Inventory ID from a given Lot ID using get_inventory_id_from_lot_id, so files don’t need to contain Albert IDs.

  • Intervals from parameter values: read the parameter values that were recorded on the Task / DT Blocks and use them to pick the correct interval definitions when splitting data. Note: intervals must already exist on the Property Task.

The task_metadata object is also where future helpers live (for example, more advanced mappings and convenience methods). ETL authors normally only need a small subset of these helpers.

Typical ETL patterns

A) Block-based parsing (“Test:” sections)

For exports that use headerless tables with Test: lines indicating blocks:

  • Read with header=None.

  • Identify sentinel rows where a cell equals Test:.

  • Extract metadata using regex from free-text descriptions, for example:

    matches = re.findall(r"\[([^\]]+)\]", test_str)
    lotid_extracted, temperature_extracted, day_info_extracted = matches[-3:]
  • Initialize a current_data dict with:

    • Inventory ID.

    • Lot ID.

    • Interval values (for example, temperature, time since creation).

    • Target output columns, all initialized to None.

  • Fill values row-by-row based on numeric conditions (for example, shear rate) and finally build a list of records.
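
A hedged sketch of this pattern; the column positions, the "Viscosity (10 s-1)" target column, and the bracketed metadata layout are assumptions you will need to adapt to your export:

import re

raw = pd.read_csv(data_buffer, header=None, dtype=str)

records = []
current_data = None

for _, row in raw.iterrows():
    first_cell = str(row.iloc[0]).strip()

    if first_cell == "Test:":
        if current_data is not None:
            records.append(current_data)
        # Free-text description assumed in the second column, for example
        # "... [LOTB56423] ... [25 °C] ... [7 days]".
        matches = re.findall(r"\[([^\]]+)\]", str(row.iloc[1]))
        lot_id, temperature, day_info = matches[-3:]
        current_data = {
            "LotId": lot_id,
            "Temperature": temperature,
            "Time Since Creation": day_info,
            "Viscosity (10 s-1)": None,  # target column, filled row by row below
        }
    elif current_data is not None:
        shear_rate = pd.to_numeric(first_cell, errors="coerce")
        if pd.notna(shear_rate) and abs(shear_rate - 10) < 0.5:
            current_data["Viscosity (10 s-1)"] = pd.to_numeric(row.iloc[1], errors="coerce")

if current_data is not None:
    records.append(current_data)

output_df = pd.DataFrame(records)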

B) Wide → tidy transformations

Use melt or pivot to convert between wide (many columns per sample) and long (one row per measurement) formats. Align the final layout with what the Data Template expects (for example, one row per time point).

For more complex numeric transformations (for example, log-scaling frequency axes in EIS data or computing deltas vs. reference), prefer numpy and pandas vectorisation over Python loops so the script stays fast and easy to read.
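
As a sketch, assuming a wide export with one "Day N" column per time point (column names are placeholders):

# Wide export: LotId, Day 0, Day 7, Day 14, ...
tidy = df.melt(
    id_vars=["LotId"],
    var_name="Time Since Creation (days)",
    value_name="Viscosity (10 s-1)",
)
# Strip the "Day " prefix so the interval column becomes numeric.
tidy["Time Since Creation (days)"] = pd.to_numeric(
    tidy["Time Since Creation (days)"].str.replace("Day ", "", regex=False),
    errors="coerce",
)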

C) Unit conversion

Perform units conversion in the script so Data Template units are consistent:

df["Time (min)"] = pd.to_numeric(df["Time (s)"], errors="coerce") / 60.0
df.drop(columns=["Time (s)"], inplace=True)

Keep unit conversions close to the column definition, and comment special cases (for example, “instrument stores °C + 273.15, convert back to °C here”) so other developers can see at a glance what has been normalised.

D) Local testing & file-type hints

ETL scripts used for local development often add two very practical pieces: a short docstring that explains what file type the script expects, and a small “local testing” harness for driving ETL with a real file.

1. Document the expected file format in the script header

"""
Works with file: simulated_eis_data_intervals.csv
Other files may have the wrong separator, so you need to change it.
"""

Use this to capture things like required separator (; vs ,), decimal style, or special export settings from the instrument software.

2. Local testing harness using BytesIO

from io import BytesIO

# %% Local testing
path = "path/to/your/simulated_eis_data_intervals.csv"  # or another export

with open(path, "rb") as file:
    all_bytes = file.read()

data_buffer = BytesIO(all_bytes)

# Call the ETL function exactly as Albert does on the server
transferred = ETL(None, data_buffer)

# Inspect the result (for example, in a notebook)
transferred  # or print(len(transferred)) for interval-based scripts

In a shared repository, prefer making path relative to the script file (for example ./examples/simulated_eis_data_intervals.csv) so the local testing section works out of the box for other developers.

E) Debugging interval-based output

When working with intervals, it’s worth explicitly checking what split_dataframe_into_intervals returns before deploying:

  • After calling the helper, print the number of blocks:

    blocks = task_metadata.split_dataframe_into_intervals(
        output_df,
        ["Time Since Creation", "Temperature"],
        inventory_id_column="InventoryID",
        lot_id_column="LotId",
    )
    print(len(blocks))  # how many intervals?
  • Inspect the first block:

    first = blocks[0]
    print(first.keys())      # should include "Interval", "InvId", "LotId", "Data"
    print(first["InvId"])    # expected Inventory ID
    print(first["LotId"])    # expected Lot ID
    print(first["Data"].head())
  • Confirm that each block’s data only contains rows for a single interval (for example, one time/temperature combination or a single EIS sweep).

  • If you see unexpected mixing of intervals inside one block, double-check:

    • The list of interval columns you pass into the helper.

    • That those columns are correctly cleaned and not NaN.

    • That Inventory and Lot IDs are stable and not accidentally duplicated.

F) “Speaking value” translations

One common use case is the “translation” of a numeric value into a more user-friendly category for customer communication—for example, mapping a color difference metric or a modulus value to labels like “Excellent / Good / Borderline / Out of spec”.

  • Implement this translation directly in the ETL, based on a user-defined ruleset (for example, thresholds stored in code or in a configuration file).

  • Write the user-friendly value into a separate column (for example, Color Class) so the raw numeric value stays available for detailed analysis.

  • Make the logic transparent and well-commented; the goal is a low entry barrier for future changes to the ruleset.
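
A minimal sketch of such a translation using pd.cut; the thresholds below are hypothetical and should come from your own ruleset:

import numpy as np

# Hypothetical thresholds for the color-difference metric dE.
bins = [-np.inf, 0.5, 1.5, 3.0, np.inf]
labels = ["Excellent", "Good", "Borderline", "Out of spec"]

df["Color Class"] = pd.cut(
    pd.to_numeric(df["dE"], errors="coerce"),
    bins=bins,
    labels=labels,
)
# The raw numeric dE column stays untouched for detailed analysis.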

G) Vendor-agnostic imports & “intervals gone wild”

Another useful pattern is vendor-agnostic import, especially for complex instruments like ICP or emission spectroscopy:

  • Different instruments or software versions may produce quite different file layouts for essentially the same experiment.

  • Your ETL can branch on file metadata, header patterns, or file naming (for example, “Source Type 1” vs “Source Type 2”) and normalise everything into a single Data Template schema.

  • This is especially powerful when capturing measurements broken down by element and wavelength (“intervals gone wild”): the ETL hides vendor-specific quirks while Albert always sees a consistent interval structure.

The end result is that users can import results irrespective of the equipment used, and downstream dashboards / reports only need to understand the Albert schema—not every vendor format.
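
A hedged sketch of the branching idea, reusing the imports from the standard skeleton; the "Element Full Name" marker column and the two parser names are assumptions used only to illustrate the dispatch:

def parse_source_type_1(df: pd.DataFrame) -> pd.DataFrame:
    # Vendor-specific cleanup goes here; end with Data Template column names.
    return df


def parse_source_type_2(df: pd.DataFrame) -> pd.DataFrame:
    return df


def ETL(task_metadata: Any, data_buffer: IO) -> list[dict]:
    df = pd.read_csv(data_buffer, sep=None, engine="python")  # auto-detect separator
    # Branch on a header pattern that identifies the source instrument.
    if "Element Full Name" in df.columns:  # assumed marker for "Source Type 1"
        df_clean = parse_source_type_1(df)
    else:
        df_clean = parse_source_type_2(df)
    return [{"Data": df_clean}]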

H) Coupling ETL with post-processing

It is also useful to couple ETL scripts with calculations: once the raw signals are parsed, the same script can apply user-defined scaling, normalisation, or KPI calculations.

  • Step 1: parse data from the file into a clean DataFrame aligned with the Template.

  • Step 2: apply automatic post-processing (for example, scaling, normalisation, derived KPIs) before returning the data block(s).

  • By embedding those calculations in the ETL script, you remove the need for a separate “evaluation Excel”, and Albert becomes the single source of truth.
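
A small sketch of step 2, assuming a hypothetical reference lot and the "Viscosity (10 s-1)" column used elsewhere in this guide:

# Step 2 (example): a KPI normalised against a reference lot.
reference = df_clean.loc[df_clean["LotId"] == "LOTB56423", "Viscosity (10 s-1)"].mean()
df_clean["Viscosity vs Reference (%)"] = (
    100.0 * df_clean["Viscosity (10 s-1)"] / reference
)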

Error handling & messages

Make errors understandable to scientists and power users:

if missing:
    raise ValueError(
        "❌ Import failed: missing required column(s) in file: "
        + ", ".join(missing)
        + ". Please check your export format."
    )

Avoid cryptic stack traces; rely on concise ValueError messages that can be pasted into an internal ticket. Well-structured ETL scripts raise high-level errors early in the ETL rather than letting low-level parsing failures bubble up.

For debugging, a useful pattern is to use your browser’s developer tools (Network tab) to inspect ETL runs and to include “fake print statements” via assertions: assert condition, "Message containing the diagnostic information". These messages surface as user-readable errors in the UI.
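
For example, a purely diagnostic assertion that surfaces the parsed shape and columns when nothing was extracted:

assert not df_clean.empty, (
    f"No data rows parsed -- shape={df_clean.shape}, columns={list(df_clean.columns)}"
)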

Tip: Surface what the user can actually fix (file type, export preset, missing columns) rather than low-level pandas errors. Use assertions and ValueError with copy-pasteable messages.

Best practices

  • Version control: Track scripts in Git or similar. Use meaningful names and comments.

  • One script per Data Template: Script Automation currently allows one ETL script per Data Template (with branching logic inside if you really need it). Design Templates and scripts together.

  • Results only: ETL scripts can fill results but cannot change intervals or parameters on the Property Task. Make sure intervals and parameter values are configured correctly in Albert before you rely on them in ETL.

  • Test with multiple real files: Include edge cases: blanks, missing data, alternative export presets, and future format changes.

  • Keep transformations transparent: Prefer simple, readable code over clever one-liners so others can maintain the script.

  • Check in example files: Store at least one sanitized example export (for example simulated_eis_data_intervals.csv) next to the ETL script so anyone can run the local testing cell without hunting for data.

  • Keep local test cells in source control: The # %% Local testing region is part of the documentation for how the ETL is meant to be used and debugged.

  • Use ETL where it adds real leverage: For simple one-file-per-sample workflows, Import Results with manual column mapping may be enough. For multi-inventory, multi-interval, vendor-agnostic or highly calculated workflows, prefer ETL.

  • Leverage AlbertSync and, where applicable, SAM: Use AlbertSync watch folders to get files into Tasks automatically; when instrument software supports it, SAM can also push parameter values into Tasks so your ETL always “knows” which method was run.

Need more help with ETL script development? Contact support.
