
Simplify real-time analytics with zero-ETL from Amazon DynamoDB to Amazon SageMaker Lakehouse


At AWS re:Invent 2024, we launched a no-code, zero-ETL integration between Amazon DynamoDB and Amazon SageMaker Lakehouse, simplifying how organizations handle data analytics and AI workflows. This integration removes the traditional challenges of building and maintaining complex extract, transform, and load (ETL) pipelines for transforming NoSQL data into analytics-ready formats, which previously required significant time and resources while introducing potential system vulnerabilities. Organizations can now seamlessly combine the strength of DynamoDB in handling rapid, concurrent transactions with fast analytical processing through the zero-ETL integration. For example, an ecommerce platform storing user session data and cart information in DynamoDB can now analyze this data in near real time without building custom pipelines. Gaming companies using DynamoDB for player data can instantly analyze user behavior as events occur, enabling real-time insights into game balance and player engagement patterns.

The zero-ETL capability uses built-in change data capture (CDC) to automatically synchronize data updates and schema changes between DynamoDB and SageMaker Lakehouse tables. By using the Apache Iceberg format, the integration provides reliable performance with ACID transaction support and efficient large-scale data handling. Data scientists can train ML models on fresh data and data analysts can generate reports using current information, with typical synchronization latency in minutes rather than hours.

In this post, we share how to set up this zero-ETL integration from DynamoDB to your SageMaker Lakehouse environment.

Solution overview

We use a SageMaker Lakehouse catalog, AWS Lake Formation, Amazon Athena, AWS Glue, and Amazon SageMaker Unified Studio for this integration. The following is the reference data flow diagram for the zero-ETL integration.

ref architecture

The workflow consists of the following components:

  1. The recently launched zero-ETL integration capability within the AWS Glue console enables direct integration between DynamoDB and SageMaker Lakehouse, storing data in Iceberg format. This streamlined approach opens up new possibilities for data teams by creating a large-scale, open, and secure data ecosystem without traditional ETL processing overhead.
  2. When building a SageMaker Lakehouse architecture, you can use an Amazon Simple Storage Service (Amazon S3) based managed catalog as your zero-ETL target, providing seamless data integration without transformation overhead. This approach creates a durable foundation for your SageMaker Lakehouse implementation while maintaining the cost-effectiveness and scalability inherent to Amazon S3 storage, enabling efficient analytics and machine learning workflows.
  3. Organizations can use a Redshift Managed Storage (RMS) based managed catalog when they need high-performance SQL analytics and multi-table transactions. This approach uses RMS for storage while maintaining data in the Iceberg format, providing an optimal balance of performance and flexibility.
  4. After you establish your Lakehouse infrastructure, you can access it through various analytics engines, including AWS services like Athena, Amazon Redshift, AWS Glue, and Amazon EMR as independent services. For a more streamlined experience, SageMaker Unified Studio offers centralized analytics management, where you can query your data from a single unified interface.

Prerequisites

In this section, we walk through the steps to set up your solution resources and confirm your permission settings.

Create a SageMaker Unified Studio domain, project, and IAM role

Before you begin, you need an AWS Identity and Access Management (IAM) role for enabling the zero-ETL integration. In this post, we use SageMaker Unified Studio, which offers a unified data platform experience. It automatically manages the required Lake Formation permissions on data and catalogs for you.

You first need to create a SageMaker Unified Studio domain, an administrative entity that controls user access, permissions, and resources for teams working within the SageMaker Unified Studio environment. Note down the SageMaker Unified Studio URL after you create the domain. You will use this URL later to log in to the SageMaker Unified Studio portal and query your data across multiple engines.

Then, you create a SageMaker Unified Studio project, an integrated development environment (IDE) that provides a unified experience for data processing, analytics, and AI development. As part of project creation, an IAM role is automatically generated. This role will be used when you access SageMaker Unified Studio later. For more details on how to create a SageMaker Unified Studio project and domain, refer to An integrated experience for all your data and AI with Amazon SageMaker Unified Studio.

Prepare a sample dataset within DynamoDB

To implement this solution, you need a DynamoDB table, which can either come from your existing resources or be created using a sample data file imported from an S3 bucket. For this post, we guide you through importing sample data from an S3 bucket into a new DynamoDB table, providing a practical foundation for the concepts discussed.

To create a sample table in DynamoDB, complete the following steps:

  1. Download the fictional ecommerce_customer_behavior.csv dataset. This dataset captures customer behavior and interactions on an ecommerce platform.
  2. On the Amazon S3 console, open the S3 bucket used by the SageMaker Unified Studio project.
  3. Upload the CSV file you downloaded.


  4. Choose the uploaded file to view its details page.
  5. Copy the value for S3 URI and make a note of it; you will use this path in the following DynamoDB table creation step.
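If you prefer to script the upload, the following sketch shows the same two steps with boto3. The bucket and key names are hypothetical placeholders; substitute the S3 bucket that your SageMaker Unified Studio project actually uses.

```python
def s3_uri(bucket: str, key: str) -> str:
    """Build the s3:// URI that the DynamoDB import step expects."""
    return f"s3://{bucket}/{key}"

def upload_dataset(local_path: str, bucket: str, key: str) -> str:
    """Upload the CSV to S3 and return its URI (requires AWS credentials)."""
    import boto3  # imported lazily so s3_uri stays dependency-free
    boto3.client("s3").upload_file(local_path, bucket, key)
    return s3_uri(bucket, key)

# Example (bucket name is a placeholder):
# uri = upload_dataset("ecommerce_customer_behavior.csv",
#                      "amazon-sagemaker-xxxx-bucket",
#                      "datasets/ecommerce_customer_behavior.csv")
```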

Create a DynamoDB table

Complete the following steps to create a DynamoDB table from a file in Amazon S3, using the import from Amazon S3 functionality. Then you can enable the settings on the DynamoDB table required for zero-ETL integration.

  1. On the DynamoDB console, choose Imports from S3 in the navigation pane.
  2. Choose Import from S3.
  3. Enter the S3 URI from the earlier step for Source S3 URL, select CSV for Import file format, and choose Next.
  4. Provide the table name as ecommerce_customer_behavior, the partition key as customer_id, and the sort key as product_id, then choose Next.
  5. Use the default table settings, then choose Next to review the details.
  6. Review the settings and choose Import.

It will take a few minutes for the import status to change from Importing to Completed.

When the import is complete, you should be able to see the table created on the Tables page.
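The console import above can also be driven through the DynamoDB ImportTable API. The sketch below only builds the request parameters as a plain dictionary (bucket, key, and attribute types are assumptions based on the CSV in this post); the commented usage line shows where boto3's `import_table` call would take them.

```python
def build_import_request(bucket: str, key: str, table_name: str,
                         partition_key: str, sort_key: str) -> dict:
    """Build keyword arguments for boto3's dynamodb.import_table call."""
    return {
        "S3BucketSource": {"S3Bucket": bucket, "S3KeyPrefix": key},
        "InputFormat": "CSV",
        "TableCreationParameters": {
            "TableName": table_name,
            # String attribute types assumed for the sample CSV's keys
            "AttributeDefinitions": [
                {"AttributeName": partition_key, "AttributeType": "S"},
                {"AttributeName": sort_key, "AttributeType": "S"},
            ],
            "KeySchema": [
                {"AttributeName": partition_key, "KeyType": "HASH"},
                {"AttributeName": sort_key, "KeyType": "RANGE"},
            ],
            "BillingMode": "PAY_PER_REQUEST",
        },
    }

# Usage (requires AWS credentials; bucket/key are placeholders):
# import boto3
# boto3.client("dynamodb").import_table(**build_import_request(
#     "amazon-sagemaker-xxxx-bucket",
#     "datasets/ecommerce_customer_behavior.csv",
#     "ecommerce_customer_behavior", "customer_id", "product_id"))
```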

  7. Choose the ecommerce_customer_behavior table and choose Edit PITR.
  8. Select Turn on point-in-time recovery and choose Save changes.

This is required for setting up zero-ETL using DynamoDB as the source.
On the Backups tab, you should see the status for PITR as On.

  9. Additionally, you need to apply a table policy to enable access for zero-ETL integration. On the Permissions tab, copy the following code under Resource-based policy for table:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "TablePolicy01",
            "Effect": "Allow",
            "Principal": {
                "Service": "glue.amazonaws.com"
            },
            "Action": [
                "dynamodb:ExportTableToPointInTime",
                "dynamodb:DescribeExport",
                "dynamodb:DescribeTable"
            ],
            "Resource": "*"
        }
    ]
}

This policy allows access to all resources, which should not be used in production workloads. To deploy this setup in production, restrict it to only the specific zero-ETL integration resources by adding a condition to the resource-based policy.

Now that you have used the Amazon S3 import method to load a CSV file and create a DynamoDB table, you can enable zero-ETL integration on the table.
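The two table settings above (PITR and the resource-based policy) can also be applied programmatically. This is a minimal sketch: it builds the request parameters for `update_continuous_backups` and `put_resource_policy`, and shows one way to narrow the policy with an `aws:SourceAccount` condition as suggested above. The table ARN and account ID in the usage comments are hypothetical.

```python
import json

ZETL_ACTIONS = [
    "dynamodb:ExportTableToPointInTime",
    "dynamodb:DescribeExport",
    "dynamodb:DescribeTable",
]

def build_pitr_request(table_name: str) -> dict:
    """Keyword arguments for dynamodb.update_continuous_backups to turn on PITR."""
    return {
        "TableName": table_name,
        "PointInTimeRecoverySpecification": {"PointInTimeRecoveryEnabled": True},
    }

def build_table_policy(source_account: str = "") -> str:
    """Build the resource-based policy JSON from this post's example.

    When source_account is given, an aws:SourceAccount condition narrows the
    policy instead of trusting the Glue service principal unconditionally.
    """
    statement = {
        "Sid": "TablePolicy01",
        "Effect": "Allow",
        "Principal": {"Service": "glue.amazonaws.com"},
        "Action": ZETL_ACTIONS,
        "Resource": "*",
    }
    if source_account:
        statement["Condition"] = {"StringEquals": {"aws:SourceAccount": source_account}}
    return json.dumps({"Version": "2012-10-17", "Statement": [statement]})

# Usage (requires AWS credentials; ARN and account ID are placeholders):
# ddb = boto3.client("dynamodb")
# ddb.update_continuous_backups(**build_pitr_request("ecommerce_customer_behavior"))
# ddb.put_resource_policy(
#     ResourceArn="arn:aws:dynamodb:<region>:<account-id>:table/ecommerce_customer_behavior",
#     Policy=build_table_policy(source_account="<account-id>"))
```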

Validate permission settings

To validate that the catalog permission settings are appropriate, complete the following steps:

  1. On the AWS Glue console, choose Databases in the navigation pane.
  2. Check for the database salesmarketing_XXX.
  3. Choose Catalog settings in the navigation pane, and save the permissions.

The following code is an example of permissions for catalog settings:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam:::root"
            },
            "Action": "glue:CreateInboundIntegration",
            "Resource": "arn:aws:glue:::database/salesmarketing_XXX"
        },
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "glue.amazonaws.com"
            },
            "Action": "glue:AuthorizeInboundIntegration",
            "Resource": "arn:aws:glue:::database/salesmarketing_XXX"
        }
    ]
}

Now you are ready to create your zero-ETL integration.

Create a zero-ETL integration

Complete the following steps to create a zero-ETL integration:

  1. On the AWS Glue console, choose Zero-ETL integrations in the navigation pane.
  2. Choose Create zero-ETL integration to create a new configuration.
  3. Choose Amazon DynamoDB as the source type.
  4. Under Source details, choose ecommerce_customer_behavior for DynamoDB table.
  5. Under Target details, provide the following information:
    1. For AWS account, select Use the current account.
    2. For Data warehouse or catalog, enter the account ID of your default catalog.
    3. For Target database, enter salesmarketing_XXX.
    4. For Target IAM role, enter datazone_usr_role_XXX.
  6. Under Output settings, select Unnest all fields and Use primary keys from DynamoDB tables, leave Configure target table name as the default value (ecommerce_customer_behavior), then choose Next.
  7. Enter zetl-ecommerce-customer-behavior for Name under Integration details, then choose Next.
  8. Choose Create and launch integration to launch the integration.

The status should be Creating after the integration is successfully initiated.
The status will change to Active in approximately a minute.

Verify that the SageMaker Lakehouse table exists. This process might take up to 15 minutes to complete, because the default refresh interval from DynamoDB is set to 15 minutes.
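For automation, the same integration can be created through the AWS Glue CreateIntegration API. This sketch only assembles the request parameters; the parameter names reflect my understanding of that API, and the ARNs in the usage comment are illustrative placeholders, so verify both against the current Glue documentation.

```python
def build_integration_request(name: str, source_table_arn: str,
                              target_db_arn: str) -> dict:
    """Keyword arguments for glue.create_integration (zero-ETL)."""
    return {
        "IntegrationName": name,
        "SourceArn": source_table_arn,   # the DynamoDB table
        "TargetArn": target_db_arn,      # the Lakehouse catalog database
    }

# Usage (requires AWS credentials; ARNs are placeholders):
# import boto3
# boto3.client("glue").create_integration(**build_integration_request(
#     "zetl-ecommerce-customer-behavior",
#     "arn:aws:dynamodb:<region>:<account-id>:table/ecommerce_customer_behavior",
#     "arn:aws:glue:<region>:<account-id>:database/salesmarketing_XXX"))
```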

Validate the SageMaker Lakehouse desk

You can now query your SageMaker Lakehouse table, created through the zero-ETL integration, using various query engines. Complete the following steps to verify you can see the table in SageMaker Unified Studio:

  1. Log in to the SageMaker Unified Studio portal using the single sign-on (SSO) option.
  2. Choose your project to view its details page.
  3. Choose Data in the navigation pane.
  4. Verify that you can see the Iceberg table in the SageMaker Lakehouse catalog.

Query with Athena

In this section, we show how to use Athena to query the SageMaker Lakehouse table from SageMaker Unified Studio. On the project page, locate the ecommerce_customer_behavior table in the catalog, and on the options menu (three dots), choose Query with Athena.

This creates a SELECT query against the SageMaker Lakehouse table in a new window, and you should see the query results as shown in the following screenshot.
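Outside of SageMaker Unified Studio, the same table can be queried through the Athena StartQueryExecution API. The sketch below builds the request parameters for a simple SELECT; the results bucket in the usage comment is a placeholder you must supply.

```python
def build_athena_query(database: str, table: str, limit: int = 10) -> dict:
    """Keyword arguments for athena.start_query_execution against the table."""
    return {
        "QueryString": f'SELECT * FROM "{table}" LIMIT {limit}',
        "QueryExecutionContext": {"Database": database},
    }

# Usage (requires AWS credentials; output location is a placeholder):
# import boto3
# boto3.client("athena").start_query_execution(
#     **build_athena_query("salesmarketing_XXX", "ecommerce_customer_behavior"),
#     ResultConfiguration={"OutputLocation": "s3://<your-results-bucket>/athena/"})
```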

Query with Amazon Redshift

You can also query the SageMaker Lakehouse table from SageMaker Unified Studio using Amazon Redshift. Complete the following steps:

  1. Choose the connection at the top right.
  2. Choose Redshift (Lakehouse) from the list of connections.
  3. Choose the awsdatacatalog database.
  4. Choose the salesmarketing schema.
  5. Choose the Choose button.

The results will be shown in the Amazon Redshift Query Editor.

Query with Amazon EMR Serverless

You can query the Lakehouse table using Amazon EMR Serverless, which uses Apache Spark's processing capabilities. Complete the following steps:

  1. On the project page, choose Compute in the navigation pane.
  2. Choose Add compute on the Data processing tab to create an EMR Serverless compute attached to the project.
  3. You can create new compute resources or connect to existing resources. For this example, choose Create new compute resources.
  4. Choose EMR Serverless.
  5. Enter a compute name (for example, Sales-Marketing), choose the latest release of EMR Serverless, and choose Add compute.

It will take some time to create the compute.

You should see the status as Started for the compute. Now it is ready to be used as your compute option for querying through a Jupyter notebook.

  6. Choose the Build menu and choose JupyterLab.

It will take some time to set up the workspace for running JupyterLab.

After the JupyterLab space is set up, you should see a page similar to the following screenshot.

  7. Choose the new folder icon to create a new folder.
  8. Name the folder lakehouse_zetl_lab.
  9. Navigate to the folder you just created and create a notebook under this folder.
  10. Choose the notebook Python 3 (ipykernel) on the Launcher tab, and rename the notebook to query_lakehouse_table.

You can observe that the notebook shows local Python as the default language and compute. The two drop-down menus just above the first cell within the Jupyter notebook show the connection type and the compute for the selected connection type.

  11. Choose PySpark as the connection, and choose the EMR Serverless application as the compute.

  12. Enter the following sample code to query the table using Spark SQL:
import sys
from pyspark.sql import SparkSession
from pyspark.sql.functions import *

# Set the current database
spark.catalog.setCurrentDatabase("salesmarketing_XXX")

# Execute the SQL query and store the results in a DataFrame
df = spark.sql("select * from ecommerce_customer_behavior limit 10")

# Display the results
df.show()

You can see the Spark DataFrame results.

Clean up

To avoid incurring future costs, delete the SageMaker domain, DynamoDB table, AWS Glue resources, and other objects created in this post.

Conclusion

This post demonstrated how you can establish a zero-ETL connection from DynamoDB to SageMaker Lakehouse, making your data accessible in Iceberg format without building custom data pipelines. We showed how you can analyze this DynamoDB data through various compute engines within SageMaker Unified Studio. This streamlined approach removes traditional data movement complexities and enables more efficient data analysis workflows directly from your DynamoDB tables.

Try out this solution for your own use case, and share your feedback in the comments.


About the authors

Narayani Ambashta is an Analytics Specialist Solutions Architect at AWS, specializing in the automotive and manufacturing sector, where she guides strategic customers in developing modern data and AI strategies. With over 15 years of cross-industry experience, she specializes in big data architecture, real-time analytics, and AI/ML technologies, helping organizations implement modern data architectures. Her expertise spans lakehouse, generative AI, and IoT platforms, enabling customers to drive digital transformation initiatives. When not architecting modern solutions, she enjoys staying active through sports and yoga.

Raj Ramasubbu is a Senior Analytics Specialist Solutions Architect focused on big data and analytics and AI/ML with AWS. He helps customers architect and build highly scalable, performant, and secure cloud-based solutions on AWS. Raj provided technical expertise and leadership in building data engineering, big data analytics, business intelligence, and data science solutions for over 18 years prior to joining AWS. He helped customers in various industry verticals like healthcare, medical devices, life sciences, retail, asset management, auto insurance, residential REIT, agriculture, title insurance, supply chain, document management, and real estate.

Yadgiri Pottabhathini is a Senior Analytics Specialist Solutions Architect in the media and entertainment sector. He specializes in assisting enterprise customers with their data and analytics cloud transformation initiatives, while providing guidance on accelerating their generative AI adoption through the development of data foundations and modern data strategies that use open source frameworks and technologies.

Junpei Ozono is a Sr. Go-to-market (GTM) Data & AI Solutions Architect at AWS in Japan. He drives technical market creation for data and AI solutions while collaborating with global teams to develop scalable GTM motions. He guides organizations in designing and implementing innovative data-driven architectures powered by AWS services, helping customers accelerate their cloud transformation journey through modern data and AI solutions. His expertise spans modern data architectures including data mesh, data lakehouse, and generative AI, enabling customers to build scalable and innovative solutions on AWS.
