
Analyzing your data catalog: Query SageMaker Catalog metadata with SQL


As your data and machine learning (ML) assets grow, tracking which assets lack documentation or monitoring asset registration trends becomes difficult without custom reporting infrastructure. You need visibility into your catalog's health without the overhead of managing ETL jobs. The metadata export feature of Amazon SageMaker provides this capability. Converting catalog asset metadata into Apache Iceberg tables stored in Amazon S3 Tables removes the need to build and maintain custom ETL pipelines. Your team can then query asset metadata directly using standard SQL tools. You can now answer governance questions about asset registration trends, classification status, and metadata completeness using standard SQL queries through tools like Amazon Athena, Amazon SageMaker Unified Studio notebooks, and BI systems.

This automated approach reduces ETL development time and gives your team visibility into catalog health, compliance gaps, and asset lifecycle patterns. The exported tables include technical metadata, business metadata, project ownership details, and timestamps, partitioned by snapshot date to enable time travel queries and historical analysis. Teams can use this capability to proactively monitor catalog health, identify gaps in documentation, track asset lifecycle patterns, and confirm that governance policies are consistently applied.

How metadata export works

After you enable the metadata export feature, it runs automatically on a daily schedule:

  1. SageMaker Catalog creates the infrastructure — An Amazon Simple Storage Service (Amazon S3) table bucket named aws-sagemaker-catalog is created with an asset_metadata namespace and an empty asset table.
  2. Daily snapshots are captured — A scheduled job runs once per day around midnight (local time per AWS Region) to export updated asset metadata.
  3. Metadata is structured and partitioned — The export captures technical metadata (resource_id, resource_type), business metadata (asset_name, business_description), project ownership details, and timestamps, partitioned by snapshot_date for query performance.
  4. Data becomes queryable — Within 24 hours, the asset table appears in Amazon SageMaker Unified Studio under the aws-sagemaker-catalog bucket and becomes accessible through Amazon Athena, Studio notebooks, or external BI tools.
  5. Teams query using standard SQL — Data teams can now answer questions like "How many assets were registered last month?" or "Which assets lack business descriptions?" without building custom ETL pipelines.

The export evaluates catalog assets and their metadata properties in the Region, converting them into Apache Iceberg table format. The data flows into downstream analytics operations directly, with no separate ETL or batch processes to maintain. The exported metadata becomes part of a queryable data lake that supports time travel queries and historical analysis.

In this post, we demonstrate how to use the metadata export capability in Amazon SageMaker Catalog and perform analytics on these tables. We explore the following specific use cases:

  • Audit historical changes to investigate what an asset looked like at a specific point in time.
  • Track asset growth to view how the data catalog has grown over the last 30 days.
  • Monitor metadata improvements to see which assets gained descriptions or ownership over time.

Solution overview

AWS Cloud architecture diagram showing data pipeline from Amazon SageMaker Catalog to Amazon S3 Tables with daily export, connecting to query engines including Amazon Athena, Amazon Redshift, and Apache Spark

Figure 1 – SageMaker Catalog export to S3 Tables

The architecture consists of three key components:

  1. Amazon SageMaker Catalog exports asset metadata daily to Amazon S3.
  2. S3 Tables stores metadata as Apache Iceberg tables in the aws-sagemaker-catalog bucket with ACID compliance and time travel.
  3. Query engines (Amazon Athena, Amazon Redshift, and Apache Spark) access metadata using standard SQL from the asset_metadata.asset table.

What metadata is exported?

SageMaker Catalog exports the following metadata in the asset_metadata.asset table:

| Metadata Type | Fields | Description |
| --- | --- | --- |
| Technical metadata | resource_id, resource_type_enum, account_id, region | Resource identifiers (ARN), types (GlueTable, RedshiftTable, S3Collection), and location |
| Namespace hierarchy | catalog, namespace, resource_name | Organizational structure for assets |
| Business metadata | asset_name, business_description | Human-readable names and descriptions |
| Ownership | extended_metadata['owningEntityId'] | Asset ownership information |
| Timestamps | asset_created_time, asset_updated_time, snapshot_time | Creation, update, and snapshot times |
| Custom metadata | extended_metadata['form-name.field-name'] | User-defined metadata forms as key-value pairs |

The snapshot_time column supports point-in-time analysis and querying of historical catalog states.

Prerequisites

To follow along with this post, you need an Amazon SageMaker Unified Studio domain. For domain setup instructions, refer to the SageMaker Unified Studio Getting started guide.

After you complete the prerequisites, complete the following steps.

  1. Add this policy to your IAM user or role to enable metadata export. If you use SageMaker Unified Studio to query the catalog, add this policy to the AmazonSageMakerAdminIAMExecutionRole managed role.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "datazone:GetDataExportConfiguration",
        "datazone:PutDataExportConfiguration"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3tables:CreateTableBucket",
        "s3tables:PutTableBucketPolicy"
      ],
      "Resource": "arn:aws:s3tables:*:*:bucket/aws-sagemaker-catalog"
    }
  ]
}
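Before attaching the policy, you can sanity-check it locally: confirm the document parses as JSON and grants both DataZone export actions. The snippet below is an illustrative, stdlib-only check (not an AWS API call); the embedded policy mirrors the one above.

```python
import json

# IAM policy from this post, embedded verbatim for offline validation.
policy = json.loads("""
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "datazone:GetDataExportConfiguration",
        "datazone:PutDataExportConfiguration"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3tables:CreateTableBucket",
        "s3tables:PutTableBucketPolicy"
      ],
      "Resource": "arn:aws:s3tables:*:*:bucket/aws-sagemaker-catalog"
    }
  ]
}
""")

# Collect every action granted by an Allow statement.
allowed = set()
for stmt in policy["Statement"]:
    if stmt["Effect"] == "Allow":
        actions = stmt["Action"]
        allowed.update([actions] if isinstance(actions, str) else actions)

# The export feature needs both DataZone configuration actions.
required = {"datazone:GetDataExportConfiguration",
            "datazone:PutDataExportConfiguration"}
print("missing actions:", sorted(required - allowed))  # prints: missing actions: []
```

If the printed list is non-empty, the policy was edited or pasted incorrectly and the export calls in the next section will be denied.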
  2. Grant describe and select permissions for SageMaker Catalog with AWS Lake Formation. You can perform this step in the AWS Lake Formation console.
    1. Choose Permissions -> Data permissions and choose Grant.

      AWS Lake Formation Grant Permissions interface showing principal type selection with IAM users and roles option selected and AmazonSageMakerAdminIAMExecutionRole assigned

Figure 2 – AWS Lake Formation grant permissions

    2. Under Principal type, select Principals, IAM users and roles, and choose the AWS managed AmazonSageMakerAdminIAMExecutionRole execution role.
    3. Choose Named Data Catalog resources.
    4. Under Catalogs, search for and select :s3tablescatalog/aws-sagemaker-catalog.
    5. Under Databases, select the asset_metadata database.
      AWS Lake Formation Grant Permissions page showing Named Data Catalog resources method with s3tablescatalog/aws-sagemaker-catalog selected, asset_metadata database, and asset table configured

Figure 3 – AWS Lake Formation catalog, database, and table

      AWS Lake Formation Grant Permissions interface showing table permissions with Select and Describe checked, grantable permissions section, and All data access radio button selected

Figure 4 – AWS Lake Formation grant permissions

    6. For Table, select asset.
    7. Under Table permissions, check Select and Describe.
    8. Choose Grant to save the permissions.

Enable data export using the AWS CLI

Configure metadata export using the PutDataExportConfiguration API. The Amazon DataZone service automatically creates an S3 table bucket named aws-sagemaker-catalog with an asset_metadata namespace and schedules a daily export job. Asset metadata is exported once daily around midnight local time per AWS Region.

The SageMaker domain identifier is available on the domain detail page in the AWS Management Console. It can take up to 24 hours for the asset table to become accessible through the S3 Tables console or the Data tab in SageMaker Unified Studio.

Run this AWS CLI command to enable SageMaker Catalog export:

aws datazone put-data-export-configuration --domain-identifier <domain-identifier> --region <region> --enable-export

Use this AWS CLI command to validate that the configuration is enabled:

aws datazone get-data-export-configuration --domain-identifier <domain-identifier> --region <region>
{
    "isExportEnabled": true,
    "status": "COMPLETED",
    "s3TableBucketArn": "arn:aws:s3tables:::bucket/aws-sagemaker-catalog",
    "createdAt": "2025-11-26T18:24:02.150000+00:00",
    "updatedAt": "2026-02-23T19:33:40.987000+00:00"
}
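If you automate this validation, a small helper can parse the response and gate downstream work on it. This is a hypothetical sketch (export_ready is not an AWS API); the sample text is the response shown above.

```python
import json

def export_ready(response_text: str) -> bool:
    """Return True when export is enabled and the setup job has completed."""
    cfg = json.loads(response_text)
    return cfg.get("isExportEnabled") is True and cfg.get("status") == "COMPLETED"

# Sample response from `aws datazone get-data-export-configuration`.
sample = """
{
    "isExportEnabled": true,
    "status": "COMPLETED",
    "s3TableBucketArn": "arn:aws:s3tables:::bucket/aws-sagemaker-catalog",
    "createdAt": "2025-11-26T18:24:02.150000+00:00",
    "updatedAt": "2026-02-23T19:33:40.987000+00:00"
}
"""
print(export_ready(sample))  # prints True
```

A script could poll this check until it passes before attempting to query the asset table, since the first export can take up to 24 hours.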

Access the exported asset table

  1. Navigate to Amazon SageMaker domains in the AWS Management Console.
  2. Select your domain and choose Open.
    Amazon SageMaker Domains management page showing an Identity Center based domain with Available status, created February 26, 2026, with Open unified studio button highlighted

Figure 5 – Open Amazon SageMaker Unified Studio

  3. In SageMaker Unified Studio, choose a project from the Select a project dropdown list.
  4. To query SageMaker Catalog data, select Build in the menu bar and then choose Query Editor. To create a new project, follow the instructions in the Amazon SageMaker Unified Studio User Guide.
    SageMaker Unified Studio project overview dashboard showing IDE and Applications, Data Analysis and Integration with Query Editor highlighted, Orchestration, and Machine Learning and Generative AI categories

Figure 6 – Open SageMaker Unified Studio Query Editor

The asset_metadata.asset table is available in the Data explorer. Use the Data explorer to view the schema and query the data for analytics.

  1. Expand Catalogs in the Data explorer. Then select and expand s3tablescatalog, aws-sagemaker-catalog, asset_metadata, and asset.
  2. Test querying the catalog with SELECT * FROM asset_metadata.asset LIMIT 10;.
SageMaker Unified Studio Query Editor with Data Explorer showing Lakehouse hierarchy including s3tablescatalog, aws-sagemaker-catalog, asset_metadata database, and asset table schema with SQL SELECT query

Figure 7 – Query the SageMaker Catalog

Queries for observability and analytics

With setup complete, run queries to gain insights into catalog usage and changes. To track asset growth and view how the data catalog has grown over the last 5 days:

SELECT
    DATE(snapshot_time) AS date,
    COUNT(*) AS total_assets
FROM asset_metadata.asset
WHERE DATE(snapshot_time) >= CURRENT_DATE - INTERVAL '5' DAY
GROUP BY DATE(snapshot_time)
ORDER BY date DESC;

SageMaker Unified Studio Query Editor showing SQL aggregation query on asset_metadata.asset table with results displaying date and total_assets columns, returning 42 assets for March 7-8, 2026

Figure 8 – Query asset growth

Use the catalog to track metadata changes and determine which assets gained descriptions or ownership over time. Use this query to identify assets that gained business descriptions over the past 5 days by comparing today's snapshot with the earlier snapshot.

SELECT
    t.asset_id,
    t.resource_name,
    p.business_description as description_before,
    t.business_description as description_now
FROM asset_metadata.asset t
JOIN asset_metadata.asset p ON t.asset_id = p.asset_id
WHERE DATE(t.snapshot_time) = CURRENT_DATE
    AND DATE(p.snapshot_time) = CURRENT_DATE - INTERVAL '5' DAY
    AND p.business_description IS NULL
    AND t.business_description IS NOT NULL;
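The self-join can likewise be rehearsed on mock snapshots. In this SQLite sketch (illustrative only; the column subset, asset names, and snapshot dates are made up), asset a2 lacks a description in the earlier snapshot and gains one in the latest, so it is the only row returned.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE asset (
    asset_id TEXT, resource_name TEXT,
    business_description TEXT, snapshot_time TEXT)""")

conn.executemany("INSERT INTO asset VALUES (?, ?, ?, ?)", [
    # Earlier snapshot: a2 is undocumented.
    ("a1", "orders",    "Order facts", "2026-03-03 00:00:00"),
    ("a2", "customers", None,          "2026-03-03 00:00:00"),
    # Latest snapshot: a2 gained a description.
    ("a1", "orders",    "Order facts",     "2026-03-08 00:00:00"),
    ("a2", "customers", "Customer master", "2026-03-08 00:00:00"),
])

# Same shape as the Athena self-join, with fixed snapshot dates
# in place of CURRENT_DATE arithmetic.
gained = conn.execute("""
    SELECT t.asset_id, t.resource_name,
           p.business_description AS description_before,
           t.business_description AS description_now
    FROM asset t
    JOIN asset p ON t.asset_id = p.asset_id
    WHERE DATE(t.snapshot_time) = '2026-03-08'
      AND DATE(p.snapshot_time) = '2026-03-03'
      AND p.business_description IS NULL
      AND t.business_description IS NOT NULL
""").fetchall()
print(gained)  # [('a2', 'customers', None, 'Customer master')]
```

Flipping the NULL conditions would surface the opposite trend: assets whose descriptions were removed between the two snapshots.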

Inspect asset values at a specific point in time using this query to retrieve metadata from any snapshot date.

SELECT
    asset_id,
    resource_name,
    business_description,
    extended_metadata['owningEntityId'] AS owner,
    snapshot_time
FROM asset_metadata.asset
WHERE asset_id = 'your-asset-id'
    AND DATE(snapshot_time) = DATE('2025-11-26');

Clean up resources

To avoid ongoing charges, clean up the resources created in this walkthrough:

  1. Disable metadata export:

Disable the daily metadata export to stop new snapshots:

aws datazone put-data-export-configuration --domain-identifier <domain-identifier> --no-enable-export

  2. Delete S3 Tables resources:

Optionally, delete the S3 Tables namespace containing the exported metadata to remove historical snapshots and stop storage charges. For instructions on how to delete S3 tables, see Deleting an Amazon S3 table in the Amazon Simple Storage Service User Guide.

Conclusion

In this post, you enabled the metadata export feature of SageMaker Catalog and used SQL queries to gain visibility into your asset inventory. The feature converts asset metadata into Apache Iceberg tables partitioned by snapshot date, so you can perform time travel queries, monitor catalog growth, track metadata completeness, and audit historical asset states. This provides a repeatable, low-overhead way to maintain catalog health and meet governance requirements over time.

To learn more about Amazon SageMaker Catalog, see the Amazon SageMaker Catalog documentation. To explore Apache Iceberg table formats and time travel queries, see the Amazon S3 Tables documentation.


About the Authors

Photo of Author Ramesh Singh

Ramesh is a Senior Product Manager Technical (External Services) at AWS in Seattle, Washington, currently with the Amazon SageMaker team. He is passionate about building high-performance ML/AI and analytics products that help enterprise customers achieve their critical goals using cutting-edge technology.

Photo of Author Pradeep Misra

Pradeep is a Principal Analytics and Applied AI Solutions Architect at AWS. He is passionate about solving customer challenges using data, analytics, and applied AI. Outside of work, he likes exploring new places and playing badminton with his family. He also likes doing science experiments, building LEGOs, and watching anime with his daughters.

Photo of Author Rohith Kayathi

Rohith is a Senior Software Engineer at Amazon Web Services (AWS) working with the Amazon SageMaker team. He leads business data catalog, generative AI–powered metadata curation, and lineage features. He is passionate about building large-scale distributed systems, solving complex problems, and setting the bar for engineering excellence for his team.

Photo of Author Steve Phillips

Steve is a Principal Technical Account Manager and Analytics specialist at AWS in the North America region. Steve currently focuses on data warehouse architectural design, data lakes, data ingestion pipelines, and cloud distributed architectures.
