
Capture data lineage from dbt, Apache Airflow, and Apache Spark with Amazon SageMaker


The next generation of Amazon SageMaker is the center for your data, analytics, and AI. SageMaker brings together AWS artificial intelligence and machine learning (AI/ML) and analytics capabilities and delivers an integrated experience for analytics and AI with unified access to data. From Amazon SageMaker Unified Studio, a single interface, you can access your data and use a suite of powerful tools for data processing, SQL analytics, model development, training, and inference, as well as generative AI development. This unified experience is assisted by Amazon Q and Amazon SageMaker Catalog (powered by Amazon DataZone), which delivers an embedded generative AI and governance experience at every step.

With data lineage, now part of SageMaker Catalog, domain administrators and data producers can centralize lineage metadata for their data assets in one place. You can track the flow of data over time, giving you a clear understanding of where it originated, how it has changed, and its ultimate use across the enterprise. By providing this level of transparency around the origin of data, data lineage helps data consumers gain trust that the data is fit for their use case. Because data lineage is captured at the table, column, and job level, data producers can also conduct impact analysis and respond to data issues when needed.

Capture of data lineage in SageMaker begins after connections and data sources are configured, and lineage events are generated when data is transformed in AWS Glue or Amazon Redshift. This capability is also fully compatible with OpenLineage, so you can further extend data lineage capture to other data processing tools. This post walks you through how to use the OpenLineage-compatible API of SageMaker or Amazon DataZone to push data lineage events programmatically from tools supporting the OpenLineage standard, like dbt, Apache Airflow, and Apache Spark.

Solution overview

Many third-party and open source tools that are used today to orchestrate and run data pipelines, like dbt, Airflow, and Spark, actively support the OpenLineage standard to provide interoperability across environments. With this capability, you only need to include and configure the right library for your environment to be able to stream lineage events from jobs running on the tool, either directly to their corresponding output logs or to a target HTTP endpoint that you specify.

With the target HTTP endpoint option, you can introduce a pattern to post lineage events from these tools into SageMaker or Amazon DataZone, to further help you centralize governance of your data assets and processes in one place. This pattern takes the form of a proxy, and its simplified architecture is illustrated in the following figure.

The way the proxy for OpenLineage works is straightforward:

  • Amazon API Gateway exposes an HTTP endpoint and path. Jobs running with the OpenLineage package on top of the supported data processing tools can be set up with the HTTP transport option pointing to this endpoint and path. If connectivity allows, lineage events will be streamed into this endpoint as the job runs.
  • An Amazon Simple Queue Service (Amazon SQS) queue buffers the events as they arrive. By storing them in a queue, you have the option to implement strategies for retries and errors when needed. For cases where event order is required, we recommend the use of first-in, first-out (FIFO) queues; however, SageMaker and Amazon DataZone are able to map incoming OpenLineage events even when they are out of order.
  • An AWS Lambda function retrieves events from the queue in batches. For every event in a batch, the function can perform transformations when needed and post the resulting event to the target SageMaker or Amazon DataZone domain (see the sketch after this list).
  • Although it’s not proven within the structure, AWS Id and Entry Administration (IAM) and Amazon CloudWatch are key capabilities that enable safe interplay between assets with minimal permissions and logging for troubleshooting and observability.

The AWS sample OpenLineage HTTP Proxy for Amazon SageMaker Governance and Amazon DataZone provides a working implementation of this simplified architecture that you can test and customize as needed. To deploy it in a test environment, follow the steps described in the repository. We use an AWS CloudFormation template to deploy the solution's resources.

After you have deployed the OpenLineage HTTP Proxy solution, you can use it to post lineage events from data processing tools like dbt, Airflow, and Spark into a SageMaker or Amazon DataZone domain, as shown in the following examples.

Set up the OpenLineage package for Spark in AWS Glue 4.0

AWS Glue added built-in support for OpenLineage with AWS Glue 5.0 (to learn more, see Introducing AWS Glue 5.0 for Apache Spark). For jobs that are still running on AWS Glue 4.0, you can still stream OpenLineage events into SageMaker or Amazon DataZone by using the OpenLineage HTTP Proxy solution. This serves as an example that can be applied to other platforms running Spark, like Amazon EMR, third-party solutions, or self-managed clusters (a generic PySpark sketch appears at the end of this section).

To add OpenLineage capabilities to an AWS Glue 4.0 job and configure it to stream lineage events into the OpenLineage HTTP Proxy solution, complete the following steps:

  1. Download the official OpenLineage package for Spark. For our example, we used the JAR package for Scala 2.12, release 1.9.1.
  2. Store the JAR file in an Amazon Simple Storage Service (Amazon S3) bucket that can be accessed by your AWS Glue job.
  3. On the AWS Glue console, open your job.
  4. Under Libraries, for Dependent JARs path, enter the path of the JAR package stored in your S3 bucket.

  5. In the Job parameters section, add the following parameters:
    1. Enable the OpenLineage package:
      1. Key: --user-jars-first
      2. Value: true
    2. Configure how the OpenLineage package will be used to stream lineage events. Replace <endpoint URL> and <endpoint path> with the corresponding values of the OpenLineage HTTP Proxy solution. These values can be found as outputs of the deployed CloudFormation stack. Replace <account ID> with your AWS account ID.
      1. Key: --conf
      2. Value:
        spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener 
        --conf spark.openlineage.transport.type=http 
        --conf spark.openlineage.transport.url=<endpoint URL> 
        --conf spark.openlineage.transport.endpoint=/<endpoint path> 
        --conf spark.openlineage.facets.custom_environment_variables=[AWS_DEFAULT_REGION;GLUE_VERSION;GLUE_COMMAND_CRITERIA;GLUE_PYTHON_VERSION;] 
        --conf spark.glue.accountId=<account ID>

With this setup, the AWS Glue 4.0 job will use the HTTP transport option of the OpenLineage package to stream lineage events into the OpenLineage proxy, which will post the events to the SageMaker or Amazon DataZone domain.

  6. Run the AWS Glue 4.0 job.

The job's resulting datasets should be sourced into SageMaker or Amazon DataZone so that OpenLineage events are mapped to them. As you explore the sourced dataset in SageMaker Unified Studio, you can observe its origin path as described by the OpenLineage events streamed through the OpenLineage proxy.

When working with Amazon DataZone, you will get the same result.

The origin path in this example is extensive and maps the resulting dataset all the way down to its origin, in this case a couple of tables hosted in a relational database and transformed through a data pipeline with two AWS Glue 4.0 (Spark) jobs.
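Because these settings are plain Spark configuration, the same approach carries over to Amazon EMR or self-managed clusters where you control the SparkSession. The following is a minimal PySpark sketch under the assumption that the openlineage-spark JAR is already on the cluster classpath; the <endpoint URL> and <endpoint path> placeholders stand in for the CloudFormation stack outputs, and the namespace value is purely illustrative.

from pyspark.sql import SparkSession

# Minimal sketch: attach the OpenLineage listener on a generic Spark
# environment, assuming the openlineage-spark JAR is on the classpath.
spark = (
    SparkSession.builder.appName("lineage-example")
    .config("spark.extraListeners",
            "io.openlineage.spark.agent.OpenLineageSparkListener")
    .config("spark.openlineage.transport.type", "http")
    .config("spark.openlineage.transport.url", "<endpoint URL>")
    .config("spark.openlineage.transport.endpoint", "/<endpoint path>")
    .config("spark.openlineage.namespace", "my-spark-cluster")  # illustrative
    .getOrCreate()
)

# Reads and writes performed by the job now emit OpenLineage run events
# to the HTTP proxy as the job executes.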

Set up the OpenLineage package for dbt

dbt has quickly become a popular framework to build data pipelines on top of data processing and data warehouse tools like Amazon Redshift, Amazon EMR, and AWS Glue, as well as other traditional and third-party solutions. The framework supports OpenLineage as a way to standardize the generation of lineage events and integrate with the growing data governance ecosystem.

dbt deployments might vary per environment, which is why we don't dive into the specifics in this post. However, to configure your dbt project to use the OpenLineage HTTP Proxy solution, complete the following steps:

  1. Install the OpenLineage package for dbt. You can learn more in the OpenLineage documentation.
  2. In the root folder of your dbt project, create an openlineage.yml file where you can specify the transport configuration. Replace <endpoint URL> and <endpoint path> with the values of the OpenLineage HTTP Proxy solution. These values can be found as outputs of the deployed CloudFormation stack.
transport:
  type: http
  url: <endpoint URL>
  endpoint: <endpoint path>
  timeout: 5

  3. Run your dbt pipeline. As explained in the OpenLineage documentation, instead of running the standard dbt run command, you run the dbt-ol run command. The latter is just a wrapper on top of the standard dbt run command, so that lineage events are captured and streamed as configured.

The job's resulting datasets should be sourced into SageMaker or Amazon DataZone so that OpenLineage events are mapped to them. As you explore the sourced dataset in SageMaker Unified Studio, you can observe its lineage path as described by the OpenLineage events streamed through the OpenLineage proxy.

When working with Amazon DataZone, you will get the same result.

In this example, the dbt project runs on top of Amazon Redshift, which is a common use case among customers. Amazon Redshift is integrated with SageMaker and Amazon DataZone for automated lineage capture, but those capabilities weren't used in this example, to illustrate how you can still integrate OpenLineage events from dbt using the pattern implemented in the OpenLineage HTTP Proxy solution.

The dbt pipeline is made up of two stages running sequentially, which are illustrated in the origin path as the nodes with the dbt type.

Set up the OpenLineage package for Airflow

Airflow is a well-positioned tool to orchestrate data pipelines at any scale. AWS offers Amazon Managed Workflows for Apache Airflow (Amazon MWAA) as a managed alternative for customers that want to reduce management overhead and accelerate the development of their data strategy with Airflow in a cost-effective way. Airflow also supports OpenLineage, so you can centralize lineage with tools like SageMaker and Amazon DataZone.

The following steps are specific to Amazon MWAA, but they can be extrapolated to other forms of Airflow deployment:

  1. Install the OpenLineage package for Airflow. You can learn more in the OpenLineage documentation. For versions 2.7 and later, it is recommended to use the native Airflow OpenLineage package (apache-airflow-providers-openlineage), which is what this example uses.
  2. To install the package, add it to the requirements.txt file that you store in Amazon S3 and point to when provisioning your Amazon MWAA environment. To learn more, refer to Managing Python dependencies in requirements.txt.
  3. As you install the OpenLineage package, or afterwards, configure it to send lineage events to the OpenLineage proxy:
    1. When filling out the form to create a new Amazon MWAA environment or edit an existing one, in the Airflow configuration options section, add the following. Replace <endpoint URL> and <endpoint path> with the values of the OpenLineage HTTP Proxy solution. These values can be found as outputs of the deployed CloudFormation stack:
      1. Configuration option: openlineage.transport
      2. Custom value: {"type": "http", "url": "<endpoint URL>", "endpoint": "<endpoint path>"}

  4. Run your pipeline. A minimal example DAG is sketched after these steps.
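For reference, the following is a minimal sketch of a DAG whose lineage the OpenLineage provider can capture; it assumes an Airflow connection named redshift_default exists, and the DAG ID, schema, and table names are purely illustrative.

from datetime import datetime

from airflow import DAG
from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator

# Minimal sketch: with apache-airflow-providers-openlineage installed and
# the openlineage.transport option configured, supported operators emit
# lineage automatically; no lineage-specific code is needed in the DAG.
with DAG(
    dag_id="customers_curated",  # illustrative name
    start_date=datetime(2025, 1, 1),
    schedule=None,
    catchup=False,
):
    # The SQL operator's inputs and outputs are extracted as OpenLineage
    # datasets; connection ID and table names are illustrative.
    SQLExecuteQueryOperator(
        task_id="build_curated_table",
        conn_id="redshift_default",
        sql="""
            CREATE TABLE analytics.customers_curated AS
            SELECT * FROM staging.customers;
        """,
    )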

The Airflow tasks will automatically use the transport configuration to stream lineage events into the OpenLineage proxy as they run. The tasks' resulting datasets should be sourced into SageMaker or Amazon DataZone so that OpenLineage events are mapped to them. As you explore the sourced dataset in SageMaker Unified Studio, you can observe its origin path as described by the OpenLineage events streamed through the OpenLineage proxy.

When working with Amazon DataZone, you will get the same result.

In this example, the Amazon MWAA Directed Acyclic Graph (DAG) works on top of Amazon Redshift, similar to the dbt example before. However, it still doesn't use the native integration for automated data capture between Amazon Redshift and SageMaker or Amazon DataZone. This way, we can illustrate how you can still integrate OpenLineage events from Airflow using the pattern implemented in the OpenLineage HTTP Proxy solution.

The Airflow DAG is made up of a single task that outputs the resulting dataset by using datasets that were created as part of the dbt pipeline in the previous example. This is illustrated in the origin path, which includes nodes with the dbt type and a node with the AIRFLOW type. With this final example, observe how SageMaker and Amazon DataZone map all datasets and jobs to reflect the reality of your data pipelines.

Additional considerations when implementing the OpenLineage proxy pattern

The OpenLineage proxy pattern implemented in the sample OpenLineage HTTP Proxy solution and presented in this post has proven to be a practical way to integrate a growing set of data processing tools into a centralized data governance strategy on top of SageMaker. We encourage you to dive into it and use it in your test environments to learn how it can best serve your specific setup.

If you are interested in taking this pattern to production, we suggest you first review it thoroughly and customize it to your particular needs. The following are some items worth reviewing as you evaluate this pattern's implementation:

  • The solution used in the examples of this post exposes a public API endpoint with no authentication or authorization mechanism. For a production workload, we recommend limiting access to the endpoint to a minimum, so only authorized resources are able to stream messages into it. To learn more, refer to Control and manage access to HTTP APIs in API Gateway.
  • The logic implemented in the Lambda function is meant to be customized depending on your use case. You might need to implement transformation logic, depending on how OpenLineage events are created by the tool you are using. As a reference, for the Amazon MWAA example in this post, some minor transformations were required on the name and namespace fields of the inputs and outputs elements of the event for full compatibility with the format expected for Amazon Redshift datasets, as described in the dataset naming conventions of OpenLineage (see the sketch after this list). You might also want to change how the function logs execution details, or include retry and error logic, among other changes.
  • The SQS queue used in the OpenLineage HTTP Proxy solution is a standard queue, meaning that events are not delivered in order. If ordering is a requirement, you can use FIFO queues instead.
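As an illustration of the kind of transformation the second bullet refers to, the following hedged sketch rewrites the namespace of datasets in an OpenLineage event to match OpenLineage's published dataset naming convention for Redshift (namespace redshift://<cluster>.<region>:<port>, name <database>.<schema>.<table>). The cluster, region, and port values are placeholders, and the real sample solution may transform events differently.

# Hypothetical helper: normalize dataset namespaces in an OpenLineage
# event so they follow the Redshift naming convention that SageMaker
# and Amazon DataZone expect.
REDSHIFT_NAMESPACE = "redshift://my-cluster.us-east-1:5439"  # placeholders

def normalize_redshift_datasets(lineage_event: dict) -> dict:
    for key in ("inputs", "outputs"):
        for dataset in lineage_event.get(key, []):
            # Airflow may emit a generic namespace such as
            # "postgres://host:port"; replace it with the
            # Redshift-style namespace.
            dataset["namespace"] = REDSHIFT_NAMESPACE
    return lineage_event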

For cases where you want to post OpenLineage events directly into SageMaker or Amazon DataZone, without using the proxy pattern explained in this post, a custom transport is available as an extension of the OpenLineage project starting with version 1.33.0. Use this feature when you don't need additional controls in your OpenLineage event stream, for example, when you don't need any custom transformation logic.
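The configuration of that custom transport is described in the OpenLineage documentation, so it isn't reproduced here. As a plain alternative illustration of direct submission, the following sketch posts an already-captured OpenLineage event using the same Amazon DataZone PostLineageEvent API the proxy's Lambda function uses; the domain ID and file name are placeholders.

import boto3

datazone = boto3.client("datazone")

# Placeholder inputs: a captured OpenLineage run event stored locally,
# and the target domain ID.
with open("lineage_event.json", "rb") as f:
    event_body = f.read()

datazone.post_lineage_event(
    domainIdentifier="dzd_xxxxxxxx",  # placeholder domain ID
    event=event_body,
)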

Summary

In this post, we showed how to use the OpenLineage-compatible APIs of SageMaker to capture data lineage from any tool supporting this standard, by following an architectural pattern introduced as the OpenLineage proxy. We provided examples of how you can set up tools like dbt, Airflow, and Spark to stream lineage events to the OpenLineage proxy, which subsequently posts them to a SageMaker or Amazon DataZone domain. Finally, we introduced a working implementation of this pattern that you can test, and discussed some considerations for taking this pattern to production.

SageMaker's compatibility with OpenLineage can help simplify governance of your data assets and increase trust in your data. This capability is among the features now available to build a comprehensive governance strategy powered by data lineage, data quality, business metadata, data discovery, access automation, and more. By bundling data governance capabilities with the growing set of tools available for data and AI development, you can derive value from your data faster and get closer to consolidating a data-driven culture. Try out this solution and get started with SageMaker to join the growing set of customers that are modernizing their data platforms.


About the authors

Jose Romero is a Senior Solutions Architect for Startups at AWS, based in Austin, Texas. He is passionate about helping customers architect modern platforms at scale for data, AI, and ML. As a former senior architect in AWS Professional Services, he enjoys building and sharing solutions for common complex problems so that customers can accelerate their cloud journey and adopt best practices. Connect with him on LinkedIn.

Priya Tiruthani is a Senior Technical Product Manager with Amazon SageMaker Catalog (Amazon DataZone) at AWS. She focuses on building products and their capabilities in data analytics and governance. She is passionate about building innovative products to address and simplify customers' challenges in their end-to-end data journey. Outside of work, she enjoys being outdoors to hike and capture nature's beauty. Connect with her on LinkedIn.
