The energy transition has a data problem
The UK’s energy grid is in the middle of its most significant structural transformation in decades. As renewables like wind and solar take a larger share of electricity generation, intermittency becomes a first-class problem: energy is cheap when the sun shines and expensive when it doesn’t.
The existing settlement model – built on monthly meter reads and averaged consumption profiles – cannot price that signal accurately. And if you can’t price it accurately, you can’t pass the signal to consumers, and demand never shifts to match supply.
Market-wide Half-Hourly Settlement (MHHS) is the regulatory response. Every household in Great Britain moves from two meter reads per month to 48 reads per day. That is not an incremental change. For a supplier like Octopus Energy serving over 8 million customers, it is a 48x increase in the data points driving every margin calculation, every settlement obligation, and every commercial decision.
The data engineering implication is direct: without re-architecture, the infrastructure cost to run Octopus Energy’s margin pipelines was projected to balloon by $1 million annually.
Why throwing compute at this doesn’t work
The instinct when data volumes increase 48x is to provision more infrastructure. For Octopus Energy’s margin data team, that instinct quickly proved untenable. The projected cost per settlement date under the legacy architecture was $23.63 – a 33x increase from historical norms. Multiply that across settlement windows, and the bill compounds fast.
However, the deeper problem was not compute cost – it was architecture mismatch. The legacy pipeline had been built around a single grain: monthly. Billing ran monthly. Settlement ran monthly. The entire pipeline was monolithic by design.
MHHS introduced a fundamental split. Industry cost data now arrives at half-hourly granularity – 48 data points per customer per day. Smart tariff customers with EVs and heat pumps need half-hourly revenue calculations. Standard tariff customers still settle monthly. Running all three through a single monolithic pipeline meant processing the entire dataset on every run, regardless of what had actually changed.
As Saad Ali, Lead of the Margin Data Team at Octopus Energy, framed it: “You can’t just throw more compute at a problem like this. You have to rebuild and rethink your logic from the ground up.”
The architecture: three streams, one source of truth
The team re-architected around three specialised streams, each optimised independently for its natural grain:
- Settlement – Half-hourly granularity for regulatory settlement and cost allocation. Industry costs arrive at 48 data points per day; this stream matches that grain exactly.
- Half-Hourly – Half-hourly processing for smart tariff customers: EV drivers, heat pump users, and time-of-use products where the half-hourly price signal is the entire commercial proposition.
- Monthly – Monthly processing for standard tariff customers, unchanged in grain but now reconcilable against the half-hourly data.
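One way to picture the split is as a routing step that sends each account to the streams matching its grain. The sketch below is illustrative only: the `Account` fields, the `Stream` enum, and the rule that every account feeds Settlement while revenue follows the tariff type are assumptions drawn from the description above, not Octopus Energy's actual data model.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Stream(Enum):
    SETTLEMENT = "settlement"    # half-hourly industry cost data
    HALF_HOURLY = "half_hourly"  # half-hourly revenue for smart tariffs
    MONTHLY = "monthly"          # monthly revenue for standard tariffs


@dataclass(frozen=True)
class Account:
    account_id: str
    smart_tariff: bool  # hypothetical flag: EV, heat pump, time-of-use


def revenue_stream(account: Account) -> Stream:
    """Pick the revenue stream whose grain matches the account's tariff."""
    return Stream.HALF_HOURLY if account.smart_tariff else Stream.MONTHLY


def streams_for(account: Account) -> List[Stream]:
    """Assumption: every account incurs half-hourly settlement costs,
    while its revenue is calculated at the grain of its tariff."""
    return [Stream.SETTLEMENT, revenue_stream(account)]
```

Keeping the routing rule in one small function means a change in tariff mix moves accounts between streams without touching the processing logic of any stream.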
A “Job of Jobs” orchestration pattern manages dependencies and parallel execution across all three streams. Each stream is independently tunable – what works as a Spark optimisation for Settlement is not necessarily right for the Monthly (non-half-hourly, NHH) stream.
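The "Job of Jobs" idea can be sketched in plain Python as a parent that runs child jobs in dependency order and executes independent children in parallel. This is a stand-in for the pattern, not the team's orchestration code or any Databricks API: the function name, the job names, and the wave-by-wave scheduling are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Set


def run_job_of_jobs(
    jobs: Dict[str, Callable[[], None]],
    depends_on: Dict[str, Set[str]],
    max_workers: int = 3,
) -> List[List[str]]:
    """Run child jobs wave by wave; jobs in the same wave run in parallel.

    Returns the waves in execution order so the schedule can be inspected.
    """
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises CycleError if the dependencies form a cycle
    waves: List[List[str]] = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            ready = sorted(sorter.get_ready())
            # list() waits for the whole wave and re-raises any child failure
            list(pool.map(lambda name: jobs[name](), ready))
            sorter.done(*ready)
            waves.append(ready)
    return waves


# Hypothetical wiring: the three streams all depend on the consumption layer.
completed: List[str] = []
child_jobs = {
    name: (lambda n=name: completed.append(n))
    for name in ("consumption", "settlement", "half_hourly", "monthly")
}
dependencies = {
    "consumption": set(),
    "settlement": {"consumption"},
    "half_hourly": {"consumption"},
    "monthly": {"consumption"},
}
schedule = run_job_of_jobs(child_jobs, dependencies)
```

Because each child is just a callable, each stream can carry its own configuration, which is the property the pattern relies on for independent tuning.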
Underpinning all three is the upstream consumption layer: a unified, multi-grain source of truth consolidating meter reads, smart meter data, and industry flows at multi-terabyte scale. This layer is the reconciliation bridge between monthly billing and half-hourly settlement – and it became the site of the single highest-leverage optimisation in the project.
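To make the "reconciliation bridge" concrete, here is a minimal sketch that rolls half-hourly reads up to monthly totals and flags months where the roll-up disagrees with the billed figure. The record shapes, the helper names, and the 0.5 kWh tolerance are assumptions chosen for illustration; the real layer works on Delta tables at multi-terabyte scale.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from typing import Dict, Iterable, List, Tuple

HalfHourlyRead = Tuple[str, datetime, float]  # (meter_id, period_start, kWh)
MonthKey = Tuple[str, str]                    # (meter_id, "YYYY-MM")


def day_of_reads(meter_id: str, day: datetime, kwh_each: float) -> List[HalfHourlyRead]:
    """Hypothetical helper: 48 half-hourly reads for one ordinary day."""
    return [(meter_id, day + timedelta(minutes=30 * i), kwh_each) for i in range(48)]


def monthly_totals(reads: Iterable[HalfHourlyRead]) -> Dict[MonthKey, float]:
    """Roll half-hourly kWh up to (meter_id, month) totals."""
    totals: Dict[MonthKey, float] = defaultdict(float)
    for meter_id, period_start, kwh in reads:
        totals[(meter_id, period_start.strftime("%Y-%m"))] += kwh
    return dict(totals)


def reconcile(
    half_hourly: Iterable[HalfHourlyRead],
    billed: Dict[MonthKey, float],
    tolerance_kwh: float = 0.5,
) -> Dict[MonthKey, float]:
    """Return the months where the half-hourly roll-up and the billed kWh differ."""
    rolled = monthly_totals(half_hourly)
    mismatches: Dict[MonthKey, float] = {}
    for key in rolled.keys() | billed.keys():
        diff = rolled.get(key, 0.0) - billed.get(key, 0.0)
        if abs(diff) > tolerance_kwh:
            mismatches[key] = round(diff, 3)
    return mismatches
```

The design point is that both grains are compared on a shared key, so a discrepancy can be traced to a specific meter and month rather than to the pipeline as a whole.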
Incremental processing: 98.8% fewer rows
The naive approach to the upstream consumption tables – reprocessing the entire multi-terabyte dataset on every run – would have meant unsustainable compute costs at the new volume.
Delta Lake’s Change Data Feed (CDF) made true incremental processing viable at this grain. Instead of full overwrites, the pipeline now reads only records that have actually changed since the last run. The result: rows processed per run dropped from 25 billion to 300 million – a 98.8% reduction.
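The mechanics can be modelled in a few lines of plain Python. This is a toy model of the idea, not Delta Lake's API: the tuple layout and function name are assumptions, although the `_change_type` values and the commit version mirror the metadata columns that the Change Data Feed exposes. In Delta Lake itself, the equivalent read is requested with the `readChangeFeed` and `startingVersion` reader options on a table that has `delta.enableChangeDataFeed` set.

```python
from typing import Dict, Iterable, Tuple

# (key, value, _change_type, _commit_version), assumed to be in commit order
Change = Tuple[str, float, str, int]


def apply_changes(
    target: Dict[str, float],
    change_feed: Iterable[Change],
    last_version: int,
) -> Tuple[int, int]:
    """Apply only changes committed after last_version.

    Returns (rows_processed, new_high_water_mark); the caller stores the
    high-water mark and passes it back in on the next run.
    """
    processed = 0
    high_water = last_version
    for key, value, change_type, version in change_feed:
        if version <= last_version or change_type == "update_preimage":
            continue  # already handled, or the "before" image of an update
        if change_type == "delete":
            target.pop(key, None)
        else:  # "insert" or "update_postimage"
            target[key] = value
        processed += 1
        high_water = max(high_water, version)
    return processed, high_water
```

The work done per run scales with the number of changed rows rather than the size of the table, which is where a reduction like 25 billion to 300 million rows comes from.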
Data freshness improved from weekly to daily. For the commercial team, that shift means margin visibility at the grain where pricing decisions are actually made – every morning, not once a week.
Note: the $1M annualised savings figure cited below excludes the additional savings from this move to incremental processing on upstream tables. The full efficiency gain is larger.
Spark & Delta optimisation – and what to remove
With 48x more data flowing through the system, the team applied targeted optimisations, each validated by measurement, across the following categories:
Lineage and I/O reduction
- Simplified lineage by consolidating data early in the pipeline, reducing downstream joins and shuffle operations
- Data pruning: selected only the columns strictly necessary for settlement and pruned rows at the earliest possible stage, reducing I/O overhead before expensive transformations
Join and partition tuning
- Broadcast joins for reference tables under 500MB, eliminating expensive shuffle operations on complex multi-key joins with date ranges
- Liquid clustering was enabled across multiple tables for columns frequently used in filters and joins. Liquid clustering dynamically co-locates related records on the specified clustering keys without requiring fixed partition boundaries. It avoids the small-file problem, higher memory consumption, and I/O overhead that come from over-partitioning.
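Why a broadcast join removes the shuffle is easiest to see in a plain-Python model: the small reference table is turned into an in-memory lookup that every worker holds, and the large table is streamed past it once. The record shapes and names below are hypothetical. In Spark the same effect is requested with the `broadcast` hint in `pyspark.sql.functions` or governed by `spark.sql.autoBroadcastJoinThreshold`, whose default is far below 500MB, so a threshold like the one described above would presumably need the hint or a raised setting.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical shapes, for illustration only.
Rate = Tuple[str, date, date, float]  # (tariff_code, valid_from, valid_to, pence_per_kwh)
Usage = Tuple[str, str, date, float]  # (meter_id, tariff_code, read_date, kwh)


def build_lookup(rates: Iterable[Rate]) -> Dict[str, List[Rate]]:
    """Index the small reference table by join key (the 'broadcast' side)."""
    lookup: Dict[str, List[Rate]] = defaultdict(list)
    for rate in rates:
        lookup[rate[0]].append(rate)
    return lookup


def broadcast_join(
    usage: Iterable[Usage],
    lookup: Dict[str, List[Rate]],
) -> Iterator[Tuple[str, date, float]]:
    """Stream the large side once, probing the in-memory lookup per row.

    The equality key narrows the candidates; the date range picks the rate
    that was valid on the read date.
    """
    for meter_id, tariff_code, read_date, kwh in usage:
        for _, valid_from, valid_to, pence in lookup.get(tariff_code, []):
            if valid_from <= read_date <= valid_to:
                yield meter_id, read_date, kwh * pence
```

The large side is never repartitioned by key, which is the cost a shuffle join would otherwise pay; the trade-off is that the reference table must fit comfortably in memory on every worker.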
Trusted the optimiser
- In several cases, Spark’s Adaptive Query Execution (AQE) outperformed hand-tuned logic. The team removed custom optimisation code and let AQE do its job.
That last point bears emphasis: removing unjustified compute operations was as impactful as adding new optimisations. If you are running Z-ordering or ANALYZE without measuring their effect, they may be costing you more than they are saving.
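A "measure before you keep" rule can be written down as a small decision function. The sketch below is one possible form, assuming repeated run times (or costs) are available for a baseline and for a candidate optimisation; the function name and the 5% threshold are hypothetical, not something the team describes.

```python
from statistics import median
from typing import Sequence


def worth_keeping(
    baseline_seconds: Sequence[float],
    candidate_seconds: Sequence[float],
    min_gain: float = 0.05,
) -> bool:
    """Keep an optimisation only if its median run time beats the
    baseline median by at least min_gain (5% by default)."""
    if not baseline_seconds or not candidate_seconds:
        raise ValueError("need at least one measurement for each variant")
    base = median(baseline_seconds)
    cand = median(candidate_seconds)
    return cand <= base * (1.0 - min_gain)
```

Using the median of several runs rather than a single timing makes the decision less sensitive to one noisy run, and the same comparison works if the inputs are costs per run instead of seconds.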
Serverless as a development accelerator
Databricks Serverless made the three-month delivery window viable. Zero cluster startup time meant the team could iterate rapidly – write, run, measure, adjust – without waiting for infrastructure to provision.
The Serverless UI enabled side-by-side run comparisons, making it practical to isolate the effect of individual optimisations.
In the team’s own words: “The testing and development process couldn’t have been done without serverless. Using the serverless UI helped us to identify bottlenecks and make easy comparisons between different runs.”
Results
| Metric | Before | After | Change |
| --- | --- | --- | --- |
| Rows processed per run | 25 billion | 300 million | 98.8% reduction |
| Cost per settlement date (projected MHHS) | $23.63 | $0.48 | ~50x reduction |
| Cost per settlement date (vs legacy) | $0.71 | $0.48 | 2x more efficient |
| Savings per month-end run | – | ~$83,000 | vs unoptimised projection |
| Annualised cost avoidance | – | ~$1,000,000 | excludes upstream savings |
| Data freshness | Weekly | Daily | 7x improvement |
| Build time | – | 3 months | Team of three |
The $0.48 per settlement date is not just a 50x reduction from the MHHS projected cost – it is 2x cheaper than the legacy system had ever been, despite processing 48x more data points. Re-architecture delivered regulatory compliance and made the system materially more efficient than the one it replaced.
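For readers who want to check the headline ratios, the arithmetic can be reproduced directly from the figures in the table. Only the variable names are invented here. Note that the legacy-to-optimised ratio works out to roughly 1.5x from the stated per-date costs, so the "2x" figure reads as a rounded-up headline rather than an exact quotient.

```python
projected_cost = 23.63  # $ per settlement date, legacy architecture at MHHS volume
legacy_cost = 0.71      # $ per settlement date, legacy architecture before MHHS
optimised_cost = 0.48   # $ per settlement date, re-architected pipeline

rows_before = 25_000_000_000
rows_after = 300_000_000

projected_vs_legacy = projected_cost / legacy_cost        # the "33x increase"
projected_vs_optimised = projected_cost / optimised_cost  # the "~50x reduction"
legacy_vs_optimised = legacy_cost / optimised_cost        # the "2x" comparison
row_reduction_pct = 100 * (1 - rows_after / rows_before)  # the "98.8%"
annualised_from_monthly = 83_000 * 12                     # the "~$1,000,000"

print(
    f"{projected_vs_legacy:.1f}x, {projected_vs_optimised:.1f}x, "
    f"{legacy_vs_optimised:.2f}x, {row_reduction_pct:.1f}%, "
    f"${annualised_from_monthly:,}"
)
```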
What this means beyond energy
MHHS is a UK energy regulation. However, the pattern it represents – a regulatory or business event that multiplies data volume at a finer grain – is not unique to energy. Any time a system moves from monthly to daily, daily to real-time, or aggregate to transactional, the same dynamics apply.
Four transferable takeaways from the Octopus Energy experience:
- Grain misalignment is the hidden cost driver. When a pipeline processes everything at the finest grain regardless of business need, you pay for it in compute, freshness, and maintenance complexity. Identify the natural grains in your data and align processing to them.
- Incremental processing transforms pipeline economics. The 98.8% row reduction came from CDF-based incremental logic, not Spark tuning. Start there – and remember the full savings are larger than the headline figure.
- Remove before you add. Audit existing optimisation decisions before assuming you need more compute. Z-ordering, ANALYZE, and custom shuffle logic applied without measurement may be costing you more than they save.
- Trust the optimiser. AQE outperformed hand-coded logic in several cases. Before writing custom optimisation, test whether Spark already handles your case.
The bigger picture
In the words of Saad: “By making our systems faster and more efficient, we can offer smarter tariffs that help our customers use energy when it’s cheapest and cleanest.”
The reduced cost base does something specific: it removes the economic barrier to high-frequency data processing. That makes grid balancing viable as a product. That makes smart tariffs commercially sustainable. That is how data engineering at scale connects to the energy transition – not as infrastructure overhead, but as the commercial foundation for it.
MHHS compliance was the mandate. Making sustainable energy the affordable choice is the mission. The data engineering is what connects the two.
———
Saad Ali is Lead of the Margin Data Team at Octopus Energy. Ismail Makhlouf, David Poulet, and Daniel Taylor are Solutions Architects at Databricks.
