An LLM agent as an ETL pipeline
Nikis needed a clean dataset of kids' holiday activities across NSW. So I pointed an agent at the open web and made it behave like infrastructure: typed, checkpointed, and capped.
15 Jun 2026 · 4 min
Most people reach for an LLM and immediately build a chatbot. I needed a database. Nikis is the Nikis Active platform, a booking marketplace for kids' holiday activities across New South Wales, and a marketplace is only as good as its supply. Somebody has to know that a particular surf school in Byron runs a five-day camp in the January break, where it is, and how to find it. There is no clean feed of that information. It is scattered across thousands of small business websites, council pages, and PDFs.
So nikis-scraper is an LLM agent, but it is not a conversation. It is an ETL pipeline that happens to have a language model in the transform step, and I built it the way you build infrastructure: assume it will crash, assume it will cost money, and never trust its output until something has checked it.
Three phases, all resumable
The pipeline runs in three checkpointed phases. Discovery sweeps for candidate providers and activities. Enrichment goes back over each candidate and fills in the structured detail: dates, age ranges, location, price, booking link. Gap-fill is the cleanup pass that hunts down whatever the first two left thin.
The word that matters there is checkpointed. Each phase writes its state as it goes, so when the process dies at hour three (and it will, because the open web is hostile and rate limits are real), you restart and it picks up where it stopped instead of re-spending the whole budget from zero. A pipeline you cannot resume is one you are afraid to run, and a tool you are afraid to run does not get run.
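A minimal sketch of that resume behaviour, not the actual nikis-scraper code. Every name here (Phase, CheckpointStore, MemoryCheckpointStore, runPhase) is hypothetical, the store is in-memory where a real one would persist to disk, and the work callback is synchronous only to keep the example small.

```typescript
// Hypothetical names throughout; the post does not show the real implementation.
type Phase = "discovery" | "enrichment" | "gapfill";

interface CheckpointStore {
  load(phase: Phase): Set<string>;
  markDone(phase: Phase, itemId: string): void;
}

// In-memory stand-in. A real store would write to disk so that the
// recorded state survives a process crash.
class MemoryCheckpointStore implements CheckpointStore {
  private done = new Map<Phase, Set<string>>();

  load(phase: Phase): Set<string> {
    return new Set<string>(this.done.get(phase) ?? []);
  }

  markDone(phase: Phase, itemId: string): void {
    const set = this.done.get(phase) ?? new Set<string>();
    set.add(itemId);
    this.done.set(phase, set);
  }
}

// Runs one phase and skips items that an earlier run already finished.
// Returns the ids processed during this run only.
function runPhase(
  phase: Phase,
  itemIds: string[],
  store: CheckpointStore,
  work: (itemId: string) => void, // a real step would await a network call
): string[] {
  const alreadyDone = store.load(phase);
  const processed: string[] = [];
  for (const id of itemIds) {
    if (alreadyDone.has(id)) continue; // already paid for on a previous run
    work(id);                          // may throw; the item is then not marked
    store.markDone(phase, id);         // checkpoint only after success
    processed.push(id);
  }
  return processed;
}
```

The ordering is the point of the sketch: an item is marked done only after its work succeeds, so a crash mid-item costs one retry of that item and nothing more.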
The model only gets to search a little
The transform step uses a hosted model with a server-side web-search tool. The obvious failure mode is letting it search freely, at which point a single enrichment call can fire off twenty queries and you watch your bill climb in real time with no idea what you bought. So search is capped at a few calls per invocation. The cap is not a cost optimisation bolted on later. It is part of the contract: this call may consult the web up to N times, then it must answer with what it has.
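As a rough sketch, the cap can live in a small piece of configuration that travels with every request, plus a check on what the provider reports back. The shapes below (SearchToolConfig, InvocationResult, the maxUses and searchesUsed fields) are assumptions for illustration, since real field names depend on the provider, and the value 3 is a placeholder because the post only says "a few".

```typescript
// Hypothetical request and response shapes; real field names depend on the provider.
interface SearchToolConfig {
  name: "web_search";
  maxUses: number; // hard ceiling on searches for one invocation
}

interface InvocationResult {
  answer: string;
  searchesUsed: number; // as reported back by the provider
}

// Assumed value; the post only says "a few calls per invocation".
const MAX_SEARCHES_PER_CALL = 3;

// Builds the tool config that accompanies a single model call.
function searchTool(maxUses: number = MAX_SEARCHES_PER_CALL): SearchToolConfig {
  if (!Number.isInteger(maxUses) || maxUses < 0) {
    throw new RangeError(`maxUses must be a non-negative integer, got ${maxUses}`);
  }
  return { name: "web_search", maxUses };
}

// Second line of defence: verify after the call that the cap was honoured.
function assertWithinCap(
  result: InvocationResult,
  tool: SearchToolConfig,
): InvocationResult {
  if (result.searchesUsed > tool.maxUses) {
    throw new Error(
      `search cap breached: used ${result.searchesUsed}, allowed ${tool.maxUses}`,
    );
  }
  return result;
}
```

Keeping the cap in one typed value means the limit is visible in code review rather than buried in a prompt.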
That constraint also makes the agent better. Forced to be economical, it spends its searches on the thing it is unsure about instead of wandering.
Typed or it did not happen
Every result comes back as zod-validated structured output. The model does not hand me prose that I then regex into a row. It hands me an object that either matches the schema or gets rejected. Age range is a number range. The date is a date. The booking URL is a URL or it is null, never the string "see website".
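To make the boundary concrete, here is a dependency-free sketch of that kind of check. The post validates with zod; this example swaps in a hand-written parser that plays the same role so it runs without any packages. The Activity shape and its field names are assumptions, not the real schema, and the URL test is deliberately loose.

```typescript
// Assumed shape for illustration; the real schema is not shown in the post.
interface Activity {
  name: string;
  ageRange: { min: number; max: number };
  startDate: string; // ISO calendar date, YYYY-MM-DD
  bookingUrl: string | null;
}

type ParseResult =
  | { ok: true; value: Activity }
  | { ok: false; errors: string[] };

function isRecord(v: unknown): v is Record<string, unknown> {
  return typeof v === "object" && v !== null && !Array.isArray(v);
}

// Accepts only real calendar dates written as YYYY-MM-DD.
function isIsoDate(s: string): boolean {
  if (!/^\d{4}-\d{2}-\d{2}$/.test(s)) return false;
  const d = new Date(`${s}T00:00:00Z`);
  return !Number.isNaN(d.getTime()) && d.toISOString().slice(0, 10) === s;
}

// Loose check: an http(s) scheme followed by non-whitespace characters.
function isHttpUrl(s: string): boolean {
  return /^https?:\/\/\S+$/i.test(s);
}

function parseAgeRange(v: unknown): { min: number; max: number } | null {
  if (!isRecord(v)) return null;
  const min = v.min;
  const max = v.max;
  if (typeof min !== "number" || typeof max !== "number" || !(min <= max)) return null;
  return { min, max };
}

// Either returns a clean Activity or a list of reasons it was rejected.
function parseActivity(raw: unknown): ParseResult {
  if (!isRecord(raw)) return { ok: false, errors: ["expected an object"] };
  const errors: string[] = [];

  const n = raw.name;
  const name = typeof n === "string" && n.trim() !== "" ? n : null;
  if (name === null) errors.push("name: expected a non-empty string");

  const ageRange = parseAgeRange(raw.ageRange);
  if (ageRange === null) errors.push("ageRange: expected numbers with min <= max");

  const d = raw.startDate;
  const startDate = typeof d === "string" && isIsoDate(d) ? d : null;
  if (startDate === null) errors.push("startDate: expected YYYY-MM-DD");

  const u = raw.bookingUrl;
  const urlOk = u === null || (typeof u === "string" && isHttpUrl(u));
  if (!urlOk) errors.push("bookingUrl: expected an http(s) URL or null");

  if (name === null || ageRange === null || startDate === null || !urlOk) {
    return { ok: false, errors };
  }
  return { ok: true, value: { name, ageRange, startDate, bookingUrl: u as string | null } };
}
```

With this in place, a row whose bookingUrl is "see website" is rejected at the boundary, while an explicit null passes.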
This is the single decision that turns an LLM from a demo into a component you can build on. When the output is typed at the boundary, the rest of the pipeline gets to assume it is clean, and failures surface right where they happen instead of three steps downstream when something tries to geocode the word "various".
A prompt matrix, not a prompt
Coverage comes from sweeping a region-by-category matrix across NSW: every region crossed with every activity category, each cell its own targeted run. That is how you go from "the agent found some stuff" to "the agent has looked at the Central Coast for swimming, and Newcastle for art, and you can see which cells are thin." Systematic beats clever here. A matrix you can audit beats one big prompt that does everything and tells you nothing about what it missed.
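The sweep can be sketched as a plain cross product plus a coverage check. The region and category lists, the prompt wording, and the function names below are illustrative assumptions, not the lists the real pipeline uses.

```typescript
// Illustrative values; the real region and category lists are not given in the post.
const REGIONS = ["Central Coast", "Newcastle", "Byron Bay"] as const;
const CATEGORIES = ["swimming", "art", "surfing"] as const;

interface Cell {
  region: string;
  category: string;
}

// Every region crossed with every category, one cell per targeted run.
function buildMatrix(
  regions: readonly string[],
  categories: readonly string[],
): Cell[] {
  const cells: Cell[] = [];
  for (const region of regions) {
    for (const category of categories) {
      cells.push({ region, category });
    }
  }
  return cells;
}

function cellKey(cell: Cell): string {
  return `${cell.region}|${cell.category}`;
}

// Hypothetical prompt wording for one cell.
function promptFor(cell: Cell): string {
  return `Find kids' holiday ${cell.category} activities in ${cell.region}, NSW.`;
}

// Cells whose result count is below the threshold: the visible blind spots.
function thinCells(
  cells: Cell[],
  resultCounts: Map<string, number>,
  minResults: number,
): Cell[] {
  return cells.filter((c) => (resultCounts.get(cellKey(c)) ?? 0) < minResults);
}
```

Because each cell has a stable key, the list returned by thinCells doubles as a work queue for a later cleanup pass.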
Results dedupe into a local SQLite store, which is the boring, correct choice: a single file, queryable, easy to diff between runs, no server to babysit. Locations resolve through Google Maps geocoding so an address becomes coordinates the booking app can map. And every call tracks its own token and search cost, so the whole run has a price tag attached and I can answer "what did this dataset cost to build" with a number instead of a shrug.
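The per-call cost tracking can be pictured as a small ledger. This is a sketch under stated assumptions: the CallUsage and PriceSheet shapes are hypothetical, and the prices are made-up placeholders, not any provider's real rates.

```typescript
// Hypothetical shapes; real usage fields depend on the provider's response.
interface CallUsage {
  inputTokens: number;
  outputTokens: number;
  searches: number;
}

interface PriceSheet {
  usdPerMillionInputTokens: number;
  usdPerMillionOutputTokens: number;
  usdPerSearch: number;
}

// Placeholder prices for illustration only.
const EXAMPLE_PRICES: PriceSheet = {
  usdPerMillionInputTokens: 3,
  usdPerMillionOutputTokens: 15,
  usdPerSearch: 0.01,
};

// Cost of a single call: tokens priced per million, searches priced per use.
function callCostUsd(usage: CallUsage, prices: PriceSheet): number {
  return (
    (usage.inputTokens / 1e6) * prices.usdPerMillionInputTokens +
    (usage.outputTokens / 1e6) * prices.usdPerMillionOutputTokens +
    usage.searches * prices.usdPerSearch
  );
}

// Accumulates per-call costs so a whole run carries a price tag.
class CostLedger {
  private entries: { label: string; usd: number }[] = [];

  constructor(private prices: PriceSheet) {}

  record(label: string, usage: CallUsage): number {
    const usd = callCostUsd(usage, this.prices);
    this.entries.push({ label, usd });
    return usd;
  }

  totalUsd(): number {
    return this.entries.reduce((sum, e) => sum + e.usd, 0);
  }

  callCount(): number {
    return this.entries.length;
  }
}
```

Recording usage at the call site, rather than reading it off an invoice later, is what lets a run answer the cost question for any subset of calls.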
AI as a part, not the point
None of this is impressive in a demo. There is no chat window, nothing types back at you, the magic is invisible. What you get instead is a dataset that was assembled by a model and is nonetheless trustworthy, because the model was boxed in on every side: typed outputs, checkpoints between phases, a hard ceiling on spend, and a matrix that makes its blind spots visible.
That is the version of "AI-powered" I am interested in. Not a model you talk to. A model you wire into a pipeline, fence with guardrails, and then forget about, the same way you forget any other component that does its job.