One-time and complicated queries are two widespread situations in enterprise knowledge analytics. One-time queries are versatile and appropriate for immediate evaluation and exploratory analysis. Advanced queries, then again, confer with large-scale knowledge processing and in-depth evaluation primarily based on petabyte-level knowledge warehouses in huge knowledge situations. These complicated queries sometimes contain knowledge sources from a number of enterprise methods, requiring multilevel nested SQL or associations with quite a few tables for extremely subtle analytical duties.
Nevertheless, combining the information lineage of those two question sorts presents a number of challenges:
- Variety of information sources
- Various question complexity
- Inconsistent granularity in lineage monitoring
- Totally different real-time necessities
- Difficulties in cross-system integration
Furthermore, sustaining the accuracy and completeness of lineage data whereas offering system efficiency and scalability are essential issues. Addressing these challenges requires a fastidiously designed structure and superior technical options.
Amazon Athena gives serverless, versatile SQL analytics for one-time queries, enabling direct querying of Amazon Easy Storage Service (Amazon S3) knowledge for speedy, cost-effective on the spot evaluation. Amazon Redshift, optimized for complicated queries, gives high-performance columnar storage and massively parallel processing (MPP) structure, supporting large-scale knowledge processing and superior SQL capabilities. Amazon Neptune, as a graph database, is right for knowledge lineage evaluation, providing environment friendly relationship traversal and complicated graph algorithms to deal with large-scale, intricate knowledge lineage relationships. The mixture of those three providers gives a robust, complete answer for end-to-end knowledge lineage evaluation.
Within the context of complete knowledge governance, Amazon DataZone gives organization-wide knowledge lineage visualization utilizing Amazon Net Providers (AWS) providers, whereas dbt gives project-level lineage by mannequin evaluation and helps cross-project integration between knowledge lakes and warehouses.
On this submit, we use dbt for knowledge modeling on each Amazon Athena and Amazon Redshift. dbt on Athena helps real-time queries, whereas dbt on Amazon Redshift handles complicated queries, unifying the event language and considerably decreasing the technical studying curve. Utilizing a single dbt modeling language not solely simplifies the event course of but additionally routinely generates constant knowledge lineage data. This strategy gives strong adaptability, simply accommodating modifications in knowledge constructions.
By integrating Amazon Neptune graph database to retailer and analyze complicated lineage relationships, mixed with AWS Step Features and AWS Lambda features, we obtain a totally automated knowledge lineage era course of. This mixture promotes consistency and completeness of lineage knowledge whereas enhancing the effectivity and scalability of the complete course of. The result’s a robust and versatile answer for end-to-end knowledge lineage evaluation.
Structure overview
The experiment’s context includes a buyer already utilizing Amazon Athena for one-time queries. To higher accommodate huge knowledge processing and complicated question situations, they goal to undertake a unified knowledge modeling language throughout completely different knowledge platforms. This led to the implementation of each Athena on dbt and Amazon Redshift on dbt architectures.
AWS Glue crawler crawls knowledge lake data from Amazon S3, producing a Information Catalog to help dbt on Amazon Athena knowledge modeling. For complicated question situations, AWS Glue performs extract, remodel, and cargo (ETL) processing, loading knowledge into the petabyte-scale knowledge warehouse, Amazon Redshift. Right here, knowledge modeling makes use of dbt on Amazon Redshift.
Lineage knowledge authentic information from each elements are loaded into an S3 bucket, offering knowledge help for end-to-end knowledge lineage evaluation.
The next picture is the structure diagram for the answer.

Some essential issues:
This experiment makes use of the next knowledge dictionary:
| Supply desk | Software | Goal desk |
imdb.name_basics |
DBT/Athena | stg_imdb__name_basics |
imdb.title_akas |
DBT/Athena | stg_imdb__title_akas |
imdb.title_basics |
DBT/Athena | stg_imdb__title_basics |
imdb.title_crew |
DBT/Athena | stg_imdb__title_crews |
imdb.title_episode |
DBT/Athena | stg_imdb__title_episodes |
imdb.title_principals |
DBT/Athena | stg_imdb__title_principals |
imdb.title_ratings |
DBT/Athena | stg_imdb__title_ratings |
stg_imdb__name_basics |
DBT/Redshift | new_stg_imdb__name_basics |
stg_imdb__title_akas |
DBT/Redshift | new_stg_imdb__title_akas |
stg_imdb__title_basics |
DBT/Redshift | new_stg_imdb__title_basics |
stg_imdb__title_crews |
DBT/Redshift | new_stg_imdb__title_crews |
stg_imdb__title_episodes |
DBT/Redshift | new_stg_imdb__title_episodes |
stg_imdb__title_principals |
DBT/Redshift | new_stg_imdb__title_principals |
stg_imdb__title_ratings |
DBT/Redshift | new_stg_imdb__title_ratings |
new_stg_imdb__name_basics |
DBT/Redshift | int_primary_profession_flattened_from_name_basics |
new_stg_imdb__name_basics |
DBT/Redshift | int_known_for_titles_flattened_from_name_basics |
new_stg_imdb__name_basics |
DBT/Redshift | names |
new_stg_imdb__title_akas |
DBT/Redshift | titles |
new_stg_imdb__title_basics |
DBT/Redshift | int_genres_flattened_from_title_basics |
new_stg_imdb__title_basics |
DBT/Redshift | titles |
new_stg_imdb__title_crews |
DBT/Redshift | int_directors_flattened_from_title_crews |
new_stg_imdb__title_crews |
DBT/Redshift | int_writers_flattened_from_title_crews |
new_stg_imdb__title_episodes |
DBT/Redshift | titles |
new_stg_imdb__title_principals |
DBT/Redshift | titles |
new_stg_imdb__title_ratings |
DBT/Redshift | titles |
int_known_for_titles_flattened_from_name_basics |
DBT/Redshift | titles |
int_primary_profession_flattened_from_name_basics |
DBT/Redshift | |
int_directors_flattened_from_title_crews |
DBT/Redshift | names |
int_genres_flattened_from_title_basics |
DBT/Redshift | genre_titles |
int_writers_flattened_from_title_crews |
DBT/Redshift | names |
| genre_titles | DBT/Redshift | |
names |
DBT/Redshift | |
titles |
DBT/Redshift |
The lineage knowledge generated by dbt on Athena consists of partial lineage diagrams, as exemplified within the following photographs. The primary picture exhibits the lineage of name_basics in dbt on Athena. The second picture exhibits the lineage of title_crew in dbt on Athena.


The lineage knowledge generated by dbt on Amazon Redshift consists of partial lineage diagrams, as illustrated within the following picture.

Referring to the information dictionary and screenshots, it’s evident that the whole knowledge lineage data is extremely dispersed, unfold throughout 29 lineage diagrams. Understanding the end-to-end complete view requires vital time. In real-world environments, the scenario is commonly extra complicated, with full knowledge lineage doubtlessly distributed throughout lots of of information. Consequently, integrating a whole end-to-end knowledge lineage diagram turns into essential and difficult.
This experiment will present an in depth introduction to processing and merging knowledge lineage information saved in Amazon S3, as illustrated within the following diagram.

Stipulations
To carry out the answer, you must have the next stipulations in place:
- The Lambda perform for preprocessing lineage information will need to have permissions to entry Amazon S3 and Amazon Redshift.
- The Lambda perform for developing the directed acyclic graph (DAG) will need to have permissions to entry Amazon S3 and Amazon Neptune.
Answer walkthrough
To carry out the answer, observe the steps within the subsequent sections.
Preprocess uncooked lineage knowledge for DAG era utilizing Lambda features
Use Lambda to preprocess the uncooked lineage knowledge generated by dbt, changing it into key-value pair JSON information which might be simply understood by Neptune: athena_dbt_lineage_map.json and redshift_dbt_lineage_map.json.
- To create a brand new Lambda perform within the Lambda console, enter a Operate title, choose the Runtime (Python on this instance), configure the Structure and Execution position, then click on the “Create perform” button.

- Open the created Lambda perform and on the Configuration tab, within the navigation pane, choose Surroundings variables and select your configurations. Utilizing Athena on dbt processing for instance, configure the atmosphere variables as follows (the method for Amazon Redshift on dbt is analogous):
INPUT_BUCKET:data-lineage-analysis-24-09-22(change with the S3 bucket path storing the unique Athena on dbt lineage information)INPUT_KEY:athena_manifest.json(the unique Athena on dbt lineage file)OUTPUT_BUCKET:data-lineage-analysis-24-09-22(change with the S3 bucket path for storing the preprocessed output of Athena on dbt lineage information)OUTPUT_KEY:athena_dbt_lineage_map.json(the output file after preprocessing the unique Athena on dbt lineage file)

- On the Code tab, within the lambda_function.py file, enter the preprocessing code for the uncooked lineage knowledge. Right here’s a code reference utilizing Athena on dbt processing for instance (the method for Amazon Redshift on dbt is analogous). The preprocessing code for Athena on dbt’s authentic lineage file is as follows:
The athena_manifest.json, redshift_manifest.json, and different information used on this experiment could be obtained from the Information Lineage Graph Building GitHub repository.
Merge preprocessed lineage knowledge and write to Neptune utilizing Lambda features
- Earlier than processing knowledge with the Lambda perform, create a Lambda layer by importing the required Gremlin plugin. For detailed steps on creating and configuring Lambda Layers, see the AWS Lambda Layers documentation.
As a result of connecting Lambda to Neptune for developing a DAG requires the Gremlin plugin, it must be uploaded earlier than utilizing Lambda. The Gremlin package deal could be obtained from the Information Lineage Graph Building GitHub repository.

- Create a brand new Lambda perform. Select the perform to configure. To the lately created layer, on the backside of the web page, select Add a layer.

Create one other Lambda layer for the requests library, much like the way you created the layer for the Gremlin plugin. This library might be used for HTTP consumer performance within the Lambda perform.
- Select the lately created Lambda perform to configure. Hook up with Neptune by Lambda to merge the 2 datasets and assemble a DAG. On the Code tab, the reference code to execute is as follows:
Create Step Features workflow
- On the Step Features console, select State machines, after which select Create state machine. On the Select a template web page, choose Clean template.

- Within the Clean template, select Code to outline your state machine. Use the next instance code:
- After finishing the configuration, select the Design tab to view the workflow proven within the following diagram.

Create scheduling guidelines with Amazon EventBridge
Configure Amazon EventBridge to generate lineage knowledge each day throughout off-peak enterprise hours. To do that:
- Create a brand new rule within the EventBridge console with a descriptive title.
- Set the rule sort to “Schedule” and configure it to run as soon as each day (utilizing both a set price or the Cron expression “0 0 * * ? *”).
- Choose the AWS Step Features state machine because the goal and specify the state machine you created earlier.
Question ends in Neptune
- On the Neptune console, choose Notebooks. Open an present pocket book or create a brand new one.

- Within the pocket book, create a brand new code cell to carry out a question. The next code instance exhibits the question assertion and its outcomes:
Now you can see the end-to-end knowledge lineage graph data for each dbt on Athena and dbt on Amazon Redshift. The next picture exhibits the merged DAG knowledge lineage graph in Neptune.

You may question the generated knowledge lineage graph for knowledge associated to a particular desk, equivalent to title_crew.
The pattern question assertion and its outcomes are proven within the following code instance:
The next picture exhibits the filtered outcomes primarily based on title_crew desk in Neptune.

Clear up
To scrub up your sources, full the next steps:
- Delete EventBridge guidelines
- Delete Step Features state machine
- Delete Lambda features
- Clear up the Neptune database
- Observe the directions at Deleting a single object to wash up the S3 buckets
Conclusion
On this submit, we demonstrated how dbt allows unified knowledge modeling throughout Amazon Athena and Amazon Redshift, integrating knowledge lineage from each one-time and complicated queries. Through the use of Amazon Neptune, this answer gives complete end-to-end lineage evaluation. The structure makes use of AWS serverless computing and managed providers, together with Step Features, Lambda, and EventBridge, offering a extremely versatile and scalable design.
This strategy considerably lowers the training curve by a unified knowledge modeling methodology whereas enhancing improvement effectivity. The tip-to-end knowledge lineage graph visualization and evaluation not solely strengthen knowledge governance capabilities but additionally provide deep insights for decision-making.
The answer’s versatile and scalable structure successfully optimizes operational prices and improves enterprise responsiveness. This complete strategy balances technical innovation, knowledge governance, operational effectivity, and cost-effectiveness, thus supporting long-term enterprise progress with the adaptability to satisfy evolving enterprise wants.
With OpenLineage-compatible knowledge lineage now usually obtainable in Amazon DataZone, we plan to discover integration prospects to additional improve the system’s functionality to deal with complicated knowledge lineage evaluation situations.
If in case you have any questions, please be at liberty to go away a remark within the feedback part.
Concerning the authors

Nancy Wu is a Options Architect at AWS, accountable for cloud computing structure consulting and design for multinational enterprise prospects. Has a few years of expertise in massive knowledge, enterprise digital transformation analysis and improvement, consulting, and venture administration throughout telecommunications, leisure, and monetary industries.
Xu Feng is a Senior Trade Answer Architect at AWS, accountable for designing, constructing, and selling trade options for the Media & Leisure and Promoting sectors, equivalent to clever customer support and enterprise intelligence. With 20 years of software program trade expertise, presently targeted on researching and implementing generative AI and AI-powered knowledge options.
Xu Da is a Amazon Net Providers (AWS) Associate Options Architect primarily based out of Shanghai, China. He has greater than 25 years of expertise in IT trade, software program improvement and answer structure. He’s captivated with collaborative studying, data sharing, and guiding neighborhood of their cloud applied sciences journey.
