Saturday, July 4, 2026

OpenLineage Integration: Bridging Open Standards with the Precisely Data Integrity Suite




Key Takeaways

  • Connect any OpenLineage-compatible orchestrator to the Precisely Data Integrity Suite in minutes, with no custom connector required.
  • Dataset-level and column-level lineage are both captured automatically based on the event payload.
  • Lineage is always complete: when a dataset hasn’t been formally discovered yet, the catalog creates placeholders and automatically enriches them when discovery runs.

Data pipelines have never been more complex. Modern data teams run workloads across a growing mix of orchestration tools (Airflow, Spark, dbt, Dagster), and every new tool traditionally meant a new custom connector just to capture lineage.

The result is fragmented visibility, brittle integrations, and lineage graphs that go stale the moment a tool version changes. There’s a better way, and at Precisely, we tackled this challenge directly.

Why Bespoke Lineage Connectors Hold Data Teams Back

Traditional lineage capture requires a dedicated connector for every orchestration tool: one for Dagster, one for Airflow, one for dbt, one for Spark. Each connector evolves on its own schedule, breaks on version upgrades, and multiplies the maintenance burden with every new tool added.

We solved this by building the Precisely Data Integrity Suite to speak a language that orchestrators already understand: OpenLineage.

What Is OpenLineage and Why Does It Matter for Data Teams?

OpenLineage is an open standard for metadata and lineage collection designed to instrument jobs as they run. When a pipeline job is executed, the orchestrator emits a structured event payload to any HTTP endpoint that supports the protocol.
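
To make the payload concrete, here is a minimal sketch of a run event built with the Python standard library. The top-level fields follow the RunEvent shape described in the OpenLineage specification (eventType, eventTime, run, job, inputs, outputs, producer, schemaURL). The build_run_event helper, the producer URL, and the schemaURL version are illustrative assumptions, not values from the Data Integrity Suite.

```python
import json
import uuid
from datetime import datetime, timezone


def build_run_event(job_namespace, job_name, inputs, outputs,
                    event_type="COMPLETE"):
    """Assemble a minimal OpenLineage-style RunEvent as a plain dict.

    `inputs` and `outputs` are lists of (namespace, name) tuples.
    """
    return {
        "eventType": event_type,
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": job_namespace, "name": job_name},
        "inputs": [{"namespace": ns, "name": n} for ns, n in inputs],
        "outputs": [{"namespace": ns, "name": n} for ns, n in outputs],
        # Hypothetical producer URI; real integrations set their own.
        "producer": "https://example.com/my-orchestrator",
        # Illustrative spec version; use the one your producer emits.
        "schemaURL": "https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunEvent",
    }


event = build_run_event(
    "dagster-prod", "daily_orders",
    inputs=[("postgres://db.example.com:5432", "public.orders")],
    outputs=[("snowflake://acct", "analytics.daily_orders")],
)
payload = json.dumps(event, indent=2)  # JSON body an orchestrator would POST
```

In practice the orchestrator integrations listed in this article build and send these events for you; the sketch only shows what travels over the wire.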

Because the standard is tool-agnostic and community-maintained, it has achieved broad adoption across the modern data stack. Rather than maintaining proprietary connectors, teams get lineage coverage that grows automatically as the ecosystem evolves.

Every major orchestration tool either ships with built-in support or has a mature community integration:

Tool | OpenLineage Support
Dagster | Built-in via openlineage-dagster
Apache Airflow | Built-in via apache-airflow-providers-openlineage
dbt | Built-in via dbt-core OpenLineage integration
Apache Spark | OpenLineage Spark integration (automatic column lineage)
Apache Flink | OpenLineage Flink integration
Trino / Starburst | OpenLineage Trino integration

If your team uses any of these tools, you are one configuration change away from automatic lineage capture.


Connecting Your Orchestrator

How Do You Connect an Orchestrator to the Precisely Data Integrity Suite?

Configure your orchestrator to send events to the Precisely API Gateway:

Endpoint: POST /v2/catalog/lineage

Authentication: API key or bearer token from your workspace credentials

Region | Value
US | https://api.cloud.precisely.com
EU | https://api.eu1.cloud.precisely.com
GB | https://api.gb1.cloud.precisely.com
AU | https://api.au1.cloud.precisely.com
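
For tools without a ready-made integration, the request can be assembled by hand. The following is a rough sketch using only the Python standard library: it joins a regional base URL from the table above with the /v2/catalog/lineage path and attaches a token. The build_lineage_request helper is hypothetical, and sending the token as an "Authorization: Bearer" header is an assumption; check your workspace documentation for the exact header your credentials require.

```python
import json
import urllib.request

LINEAGE_PATH = "/v2/catalog/lineage"


def build_lineage_request(base_url, token, event):
    """Build (but do not send) a POST request carrying one lineage event."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + LINEAGE_PATH,
        data=json.dumps(event).encode("utf-8"),
        headers={
            # Assumed header format for a bearer token.
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder base URL and token; substitute your region and credentials.
request = build_lineage_request(
    "https://api.example.com/", "my-token", {"eventType": "COMPLETE"}
)
# To actually send it: urllib.request.urlopen(request)
```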

 

openlineage.yml example:
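
As a rough guide to what such a file might contain, the sketch below renders one from Python. It assumes the HTTP transport keys used by the OpenLineage Python client (transport.type, url, endpoint, and an api_key auth block); other integrations may use different keys. The render_openlineage_yml helper and the placeholder values are hypothetical.

```python
def render_openlineage_yml(base_url, api_key):
    """Return the text of an openlineage.yml using the HTTP transport.

    Key names follow the OpenLineage Python client's HTTP transport
    configuration; treat them as an assumption for other producers.
    """
    lines = [
        "transport:",
        "  type: http",
        f"  url: {base_url}",
        "  endpoint: v2/catalog/lineage",
        "  auth:",
        "    type: api_key",
        f"    apiKey: {api_key}",
    ]
    return "\n".join(lines) + "\n"


# Placeholder values; use your regional base URL and workspace credential.
config_text = render_openlineage_yml("https://api.example.com", "<your-api-key>")
print(config_text)
```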

No additional setup is required on the catalog side. Events appear as soon as your next pipeline run completes.

How Events Flow

[Figure: Data Integrity Suite OpenLineage ingestion and process flow]

The endpoint acknowledges each event immediately and processes it asynchronously, so your orchestrator isn’t blocked waiting for catalog writes.
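
The acknowledge-then-process pattern can be illustrated with a queue and a background worker. This is a generic sketch of the idea, not a description of how the Suite is built: the LineageIngest class, the 202 status code, and the in-memory queue are all assumptions made for the example.

```python
import queue
import threading


class LineageIngest:
    """Accept events immediately; do the slow catalog work in the background."""

    def __init__(self):
        self._pending = queue.Queue()
        self.processed = []  # run IDs whose catalog writes have finished
        worker = threading.Thread(target=self._drain, daemon=True)
        worker.start()

    def accept(self, event):
        """Enqueue the event and return an 'accepted' status right away."""
        self._pending.put(event)
        return 202

    def _drain(self):
        while True:
            event = self._pending.get()
            # Stand-in for the real catalog write.
            self.processed.append(event["run"]["runId"])
            self._pending.task_done()

    def wait_until_idle(self):
        """Block until every accepted event has been processed."""
        self._pending.join()


ingest = LineageIngest()
status = ingest.accept({"run": {"runId": "run-1"}})  # returns without waiting
ingest.wait_until_idle()
```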

What Ends Up in the Catalog

After a pipeline run completes, you get:

  • Searchable, browsable Transformation Job assets for every pipeline run
  • Lineage edges connecting source and target datasets
  • Full column-level lineage with transformation labels
  • Placeholder assets that upgrade to fully enriched assets when discovery runs

The Catalog Concept Mapping

OpenLineage Concept | Catalog Concept
Job (namespace + name) | A Transformation Job asset, searchable and browsable
Run (unique run ID) | Tracked for audit
Dataset (namespace + name) | An existing catalog asset, or a placeholder
Input → Output edge | A lineage relation
Facets | Asset properties: schema, ownership, data quality, docs

What Happens When a Dataset Hasn’t Been Discovered Yet?

Pipelines often run before formal data source discovery completes. Rather than dropping lineage edges, the catalog creates placeholder assets: fully navigable catalog entries with provenance metadata from the event. When discovery runs later, the placeholder is enriched with harvested metadata; no lineage edges need rebuilding.

This means lineage is complete from day one, even in environments where data sources are still being cataloged. Teams can trust the graph without waiting for full discovery coverage.
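
The placeholder behavior can be modeled in a few lines. The sketch below is a toy in-memory catalog, with hypothetical Asset and Catalog classes, that shows why edges survive discovery: they reference a dataset by its (namespace, name) key, and discovery enriches the asset stored under that key in place.

```python
from dataclasses import dataclass, field


@dataclass
class Asset:
    namespace: str
    name: str
    placeholder: bool = True
    metadata: dict = field(default_factory=dict)


class Catalog:
    def __init__(self):
        self.assets = {}    # (namespace, name) -> Asset
        self.edges = set()  # (source_key, target_key)

    def resolve(self, namespace, name, provenance):
        """Return the asset key, creating a placeholder if none exists."""
        key = (namespace, name)
        if key not in self.assets:
            self.assets[key] = Asset(namespace, name, True, dict(provenance))
        return key

    def record_edge(self, source, target, provenance):
        """Store a lineage edge between two (namespace, name) pairs."""
        self.edges.add((self.resolve(*source, provenance),
                        self.resolve(*target, provenance)))

    def discover(self, namespace, name, harvested):
        """Enrich the asset in place; existing edges keep pointing at it."""
        asset = self.assets.setdefault((namespace, name), Asset(namespace, name))
        asset.metadata.update(harvested)
        asset.placeholder = False


catalog = Catalog()
catalog.record_edge(("pg", "public.orders"), ("sf", "analytics.daily"),
                    {"producer": "dagster"})
edges_before = set(catalog.edges)
catalog.discover("pg", "public.orders", {"columns": ["id", "amount"]})
```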

⚠  Pro tip: Dataset/field identifier matching is exact. A case difference, a missing port, or a domain prefix mismatch causes the catalog to create a placeholder instead of linking to an existing asset. Verify your OpenLineage producer’s namespace and name format against your catalog connection settings before enabling production lineage capture.
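
A small check illustrates how strict exact matching is. The find_existing_asset helper and the identifiers below are hypothetical; the point is only that a plain equality comparison on (namespace, name) treats each of the three mismatches named in the tip as a different dataset.

```python
def find_existing_asset(known_assets, namespace, name):
    """Return the matching key, or None when a placeholder would be created."""
    key = (namespace, name)
    return key if key in known_assets else None


known = {("postgres://db.example.com:5432", "public.orders")}

exact = find_existing_asset(known, "postgres://db.example.com:5432", "public.orders")
wrong_case = find_existing_asset(known, "postgres://db.example.com:5432", "Public.Orders")
missing_port = find_existing_asset(known, "postgres://db.example.com", "public.orders")
short_host = find_existing_asset(known, "postgres://db:5432", "public.orders")
```

Only the first lookup finds the existing asset; the other three would each produce a placeholder.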

Column-Level Lineage

How Does Column-Level Lineage Work?

Dataset-level lineage answers which table feeds into which table. Column-level lineage answers which column, transformed how, produces which output column, enabling root-cause analysis and change-impact analysis.

Column-level lineage travels in the columnLineage facet of a COMPLETE event. Tools like Spark and dbt emit this automatically.
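
As an illustration, the sketch below walks a columnLineage facet and flattens it into individual column mappings. The facet layout (fields, inputFields, transformations with type and subtype) follows the OpenLineage specification as I understand it; the sample event, dataset names, and the column_mappings helper are hypothetical.

```python
def column_mappings(event):
    """Yield (source_dataset, source_field, target_dataset, target_field, subtype)."""
    for output in event.get("outputs", []):
        target = (output["namespace"], output["name"])
        facet = output.get("facets", {}).get("columnLineage", {})
        for target_field, spec in facet.get("fields", {}).items():
            for src in spec.get("inputFields", []):
                # A mapping without transformation details still counts.
                subtypes = [t.get("subtype")
                            for t in src.get("transformations", [])] or [None]
                for subtype in subtypes:
                    yield ((src["namespace"], src["name"]), src["field"],
                           target, target_field, subtype)


SAMPLE_EVENT = {
    "eventType": "COMPLETE",
    "outputs": [{
        "namespace": "sf", "name": "analytics.daily",
        "facets": {"columnLineage": {"fields": {
            "total_amount": {"inputFields": [{
                "namespace": "pg", "name": "public.orders", "field": "amount",
                "transformations": [{"type": "DIRECT", "subtype": "AGGREGATION"}],
            }]},
            "order_date": {"inputFields": [{
                "namespace": "pg", "name": "public.orders", "field": "created_at",
            }]},
        }}},
    }],
}

mappings = list(column_mappings(SAMPLE_EVENT))
```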


Transformation Job: Full Transformation Context

Each column lineage relation links to a Transformation Job asset that captures:

Property | What It Tells You
Name | The pipeline that produced this column mapping
Type / Subtype | Transformation category (e.g., AGGREGATION / SUM, IDENTITY, TRANSFORMATION)
Column Masked | Whether the source value was masked or anonymized
Run ID | The specific run that generated this lineage
Namespace | The orchestrator environment (e.g., dagster-prod)
Event Time | When the pipeline run completed
Producer | Which tool emitted the event

Intelligent Graph: No Duplicate Paths

When column-level lineage is fully resolvable for a source–target pair, the catalog stores column-level relations only. Dataset-level lineage for those pairs is automatically inferred by rollup, so both views appear in the UI without duplicate edges in the graph. For orchestrators that don’t emit columnLineage, the catalog falls back to dataset-level lineage.
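
Rollup is easy to picture as a projection: drop the column names from each stored relation and deduplicate what remains. The sketch below shows that idea with a hypothetical rollup_to_dataset_edges helper and made-up dataset names; it is an illustration of the concept rather than the catalog’s actual implementation.

```python
def rollup_to_dataset_edges(column_relations):
    """Derive dataset-level edges from stored column-level relations.

    Each relation is (source_dataset, source_column, target_dataset, target_column).
    """
    return {(src_ds, tgt_ds)
            for src_ds, _src_col, tgt_ds, _tgt_col in column_relations}


column_relations = [
    ("orders", "amount", "daily_sales", "total_amount"),
    ("orders", "created_at", "daily_sales", "order_date"),
    ("customers", "region", "daily_sales", "region"),
]
dataset_edges = rollup_to_dataset_edges(column_relations)
```

Three column relations collapse into two dataset edges, so the dataset view never needs its own stored copies.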

Partial Event Resilience

Resolvable column mappings are captured immediately. Unresolvable ones (referencing not-yet-discovered columns) are retried after discovery. An incomplete column mapping never blocks the dataset-level lineage or data quality metadata for the same event.
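
One way to picture this is a split into two buckets, with the deferred bucket re-checked once discovery has added more columns. The split_mappings helper and the column names are hypothetical, and the real retry mechanism is not described in this article.

```python
def split_mappings(mappings, known_columns):
    """Partition (source, target) column pairs into resolved and deferred."""
    resolved, deferred = [], []
    for source, target in mappings:
        if source in known_columns and target in known_columns:
            resolved.append((source, target))
        else:
            deferred.append((source, target))
    return resolved, deferred


known_columns = {"orders.amount", "daily.total_amount"}
event_mappings = [
    ("orders.amount", "daily.total_amount"),
    ("orders.created_at", "daily.order_date"),  # columns not discovered yet
]

resolved, deferred = split_mappings(event_mappings, known_columns)

# Later, discovery harvests the missing columns and the deferred pairs are retried.
known_columns |= {"orders.created_at", "daily.order_date"}
resolved_on_retry, still_deferred = split_mappings(deferred, known_columns)
```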

Reliability You Can Count On

Safe replays: Re-sending the same event has no effect. Lineage relations are not duplicated, Transformation Job assets are not re-created, and metadata isn’t overwritten.

This matters more than it might seem. In practice, pipeline orchestrators retry on failure, CI/CD systems replay jobs during deployment, and disaster recovery procedures re-run historical events. Without idempotent event handling, each of those scenarios risks corrupting the lineage graph with duplicate edges or stale metadata. The Precisely Data Integrity Suite processes each event exactly once regardless of how many times it is received.
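
A common way to get this behavior is to fingerprint each event and skip any fingerprint already seen. The sketch below uses a SHA-256 hash of the canonical JSON as the deduplication key; that choice, and the IdempotentStore class itself, are assumptions for illustration, since the article does not say how the Suite identifies a repeated event.

```python
import hashlib
import json


class IdempotentStore:
    """Apply each distinct event once; replays leave state untouched."""

    def __init__(self):
        self._seen = set()
        self.edges = set()

    @staticmethod
    def fingerprint(event):
        # Sorted keys make the hash independent of dict ordering.
        canonical = json.dumps(event, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def apply(self, event):
        """Return True if the event changed state, False if it was a replay."""
        key = self.fingerprint(event)
        if key in self._seen:
            return False
        self._seen.add(key)
        for src in event.get("inputs", []):
            for dst in event.get("outputs", []):
                self.edges.add(((src["namespace"], src["name"]),
                                (dst["namespace"], dst["name"])))
        return True


store = IdempotentStore()
run_event = {
    "eventType": "COMPLETE",
    "run": {"runId": "run-1"},
    "inputs": [{"namespace": "pg", "name": "public.orders"}],
    "outputs": [{"namespace": "sf", "name": "analytics.daily"}],
}
first = store.apply(run_event)
replay = store.apply(dict(run_event))  # same content, re-sent
```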

Any tool that emits standard OpenLineage RunEvent payloads to an HTTP endpoint will work.

Summary

Capability | Detail
✓  Zero-connector integration | Any OpenLineage-compatible tool connects with a URL and a token
✓  Dataset lineage | Automatic lineage relations from every COMPLETE pipeline event
✓  Column lineage | Field-level lineage with transformation type, subtype, description, and masking context
✓  Placeholder assets | Lineage is complete from day one, even before discovery runs
✓  Metadata enrichment | Schema, ownership, data source, and documentation from OpenLineage facets
✓  Safe retries | Duplicate or replayed events never corrupt catalog state
✓  TransformationJob assets | Full provenance trail of what transformed each column and when

Data pipelines are only as trustworthy as the lineage behind them. By building on an open standard that the modern data stack already speaks, the Precisely Data Integrity Suite makes accurate, consistent, and contextual lineage automatic, so your teams can move fast without second-guessing where their data came from.

_____________________________________________________________________

Frequently Asked Questions

Q. Does OpenLineage work with my existing orchestrator?

A. If your orchestrator is Airflow, Spark, dbt, Dagster, Flink, or Trino/Starburst, built-in or mature community support is available. Configuration is a single YAML change pointing to the Precisely API endpoint. If your tool isn’t on this list, any tool that emits standard OpenLineage RunEvent payloads over HTTP will also work without modification.

Q. What happens if a dataset hasn’t been discovered yet?

A. The catalog creates a placeholder asset with provenance metadata from the event, keeping lineage edges intact. When discovery runs later, the placeholder is automatically enriched with full metadata. No lineage needs to be rebuilt.

Q. Is dataset-level lineage still available when column-level lineage is captured?

A. Yes. When column-level lineage is resolvable, dataset-level lineage is automatically inferred by rollup so both views are available in the catalog UI. There are no duplicate edges in the graph.

Q. What happens if an event is re-sent or replayed?

A. Nothing changes in the catalog. Events are processed idempotently: re-sending the same event doesn’t create duplicate lineage relations, re-create Transformation Job assets, or overwrite existing metadata.

The post OpenLineage Integration: Bridging Open Standards with the Precisely Data Integrity Suite appeared first on Precisely.
