Effective data governance has long been a critical priority for organizations seeking to maximize the value of their data assets. It encompasses the processes, policies, and practices an organization uses to manage its data resources. The key goals of data governance are to make data discoverable and usable by those who need it, accurate and consistent, secure and protected from unauthorized access or misuse, and compliant with relevant regulations and standards. Data governance involves establishing clear ownership and accountability for data, including defining roles, responsibilities, and decision-making authority related to data management.
Traditionally, data governance frameworks have been designed to manage data at rest: the structured and unstructured information stored in databases, data warehouses, and data lakes. Amazon DataZone is a data governance and catalog service from Amazon Web Services (AWS) that allows organizations to centrally discover, control, and evolve schemas for data at rest, including AWS Glue tables on Amazon Simple Storage Service (Amazon S3), Amazon Redshift tables, and Amazon SageMaker models.
However, the rise of real-time data streams and streaming data applications impacts data governance, necessitating changes to existing frameworks and practices to effectively manage the new data dynamics. Governing these rapid, decentralized data streams presents a new set of challenges that extend beyond the capabilities of many conventional data governance approaches. Factors such as the ephemeral nature of streaming data, the need for real-time responsiveness, and the technical complexity of distributed data sources require a reimagining of how we think about data oversight and control.
In this post, we explore how AWS customers can extend Amazon DataZone to support streaming data such as Amazon Managed Streaming for Apache Kafka (Amazon MSK) topics. Developers and DevOps managers can use Amazon MSK, a popular streaming data service, to run Kafka applications and Kafka Connect connectors on AWS without becoming experts in operating it. We explain how they can use Amazon DataZone custom asset types and custom authorizers to: 1) catalog Amazon MSK topics, 2) provide useful metadata such as schema and lineage, and 3) securely share Amazon MSK topics across the organization. To accelerate the implementation of Amazon MSK governance in Amazon DataZone, we use the Data Solutions Framework on AWS (DSF), an opinionated open source framework that we announced earlier this year. DSF relies on the AWS Cloud Development Kit (AWS CDK) and provides several AWS CDK L3 constructs that accelerate building data solutions on AWS, including streaming governance.
High-level approach for governing streaming data in Amazon DataZone
To anchor the discussion on supporting streaming data in Amazon DataZone, we use Amazon MSK as an integration example, but the approach and the architectural patterns remain the same for other streaming services (such as Amazon Kinesis Data Streams). At a high level, to integrate streaming data, you need the following capabilities:
- A mechanism for the Kafka topic to be represented in the Amazon DataZone catalog for discoverability (including the schema of the data flowing within the topic), tracking of lineage and other metadata, and for consumers to request access against.
- A mechanism to handle the custom authorization flow when a consumer triggers the subscription grant to an environment. This flow consists of the following high-level steps:
- Collect metadata of the target Amazon MSK cluster or topic that is being subscribed to by the consumer
- Update the producer Amazon MSK cluster's resource policy to allow access from the consumer role
- Provide Kafka topic-level AWS Identity and Access Management (IAM) permissions to the consumer roles (more on this later) so that they have access to the target Amazon MSK cluster
- Finally, update the internal metadata of Amazon DataZone so that it is aware of the existing subscription between producer and consumer
Amazon DataZone catalog
Before you can represent the Kafka topic as an asset in the Amazon DataZone catalog, you need to define:
- A custom asset type that describes the metadata needed to represent a Kafka topic. To describe the schema as part of the metadata, use the built-in form type `amazon.datazone.RelationalTableFormType` and create two additional custom form types:
  - `MskSourceReferenceFormType`, which contains the `cluster_ARN` and the `cluster_type`. The type is used to determine whether the Amazon MSK cluster is provisioned or serverless, given that there is a distinct process for granting consume permissions.
  - `KafkaSchemaFormType`, which contains various metadata on the schema, including the `kafka_topic`, the `schema_version`, `schema_arn`, `registry_arn`, `compatibility_mode` (for example, backward-compatible or forward-compatible), and `data_format` (for example, Avro or JSON), which is helpful if you plan to integrate with the AWS Glue Schema Registry.
- After the custom asset type has been defined, you can create an asset based on it. The asset describes the schema, the Amazon MSK cluster, and the topic that you want to make discoverable and accessible to consumers.
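As a rough sketch of what this setup might look like with the AWS SDK for Python (Boto3), the following helper assembles the `CreateAssetType` request payload from the form types described above. The domain and project identifiers, the asset type name, and the form type revisions are hypothetical placeholders, not values from the solution:

```python
# Sketch: building a CreateAssetType request for the MSK custom asset type.
# All identifiers below are illustrative placeholders.

def build_msk_asset_type_request(domain_id: str, project_id: str) -> dict:
    """Assemble the payload for datazone.create_asset_type()."""
    return {
        "domainIdentifier": domain_id,
        "owningProjectIdentifier": project_id,
        "name": "MskTopicAssetType",
        "description": "Custom asset type representing an Amazon MSK topic",
        "formsInput": {
            # Built-in form type carrying the tabular schema of the topic
            "RelationalTableForm": {
                "typeIdentifier": "amazon.datazone.RelationalTableFormType",
                "typeRevision": "1",
                "required": True,
            },
            # Custom form type with the cluster ARN and cluster type
            "MskSourceReferenceForm": {
                "typeIdentifier": "MskSourceReferenceFormType",
                "typeRevision": "1",
                "required": True,
            },
            # Custom form type with schema registry metadata
            "KafkaSchemaForm": {
                "typeIdentifier": "KafkaSchemaFormType",
                "typeRevision": "1",
                "required": True,
            },
        },
    }


request = build_msk_asset_type_request("dzd_example123", "prj_example456")
# A Boto3 client would consume it as:
#   boto3.client("datazone").create_asset_type(**request)
```

The custom form types themselves would be registered beforehand with the `CreateFormType` API, following the same payload-building pattern.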
Data source for Amazon MSK clusters with AWS Glue Schema Registry
In Amazon DataZone, you can create data sources for the AWS Glue Data Catalog to import technical metadata of database tables from AWS Glue and have the assets registered in the Amazon DataZone project. For importing metadata related to Amazon MSK, you need to use a custom data source, which can be an AWS Lambda function using the Amazon DataZone APIs.
We provide as part of the solution a custom Amazon MSK data source with the AWS Glue Schema Registry, for automating the creation, update, and deletion of custom Amazon MSK assets. It uses AWS Lambda to extract schema definitions from a Schema Registry and metadata from the Amazon MSK clusters, and then creates or updates the corresponding assets in Amazon DataZone.
Before explaining how the data source works, you need to know that every custom asset in Amazon DataZone has a unique identifier. When the data source creates an asset, it stores the asset's unique identifier in Parameter Store, a capability of AWS Systems Manager.
The data source works as follows:
- The Amazon MSK AWS Glue Schema Registry data source can be scheduled to be triggered on a given interval or by listening to AWS Glue Schema events such as Create, Update, or Delete Schema. It can also be invoked manually through the AWS Lambda console.
- When triggered, it retrieves all the existing unique identifiers from Parameter Store. These parameters serve as a reference to identify whether an Amazon MSK asset already exists in Amazon DataZone.
- The function lists the Amazon MSK clusters and retrieves the Amazon Resource Name (ARN) for the given Amazon MSK name, along with additional metadata related to the Amazon MSK cluster type (serverless or provisioned). This metadata is used later by the custom authorization flow.
- The function then lists all the schemas in the Schema Registry for a given registry name. For each schema, it retrieves the latest version and schema definition. The schema definition is what allows you to add schema information when creating the asset in Amazon DataZone.
- For each schema retrieved from the Schema Registry, the Lambda function checks whether the asset already exists by looking into the Systems Manager parameters retrieved in the second step.
- If the asset exists, the Lambda function updates the asset in Amazon DataZone, creating a new revision with the updated schema or forms.
- If the asset doesn't exist, the Lambda function creates the asset in Amazon DataZone and stores its unique identifier in Systems Manager for future reference.
- If there are schemas registered in Parameter Store that are no longer in the Schema Registry, the data source deletes the corresponding Amazon DataZone assets and removes the associated parameters from Systems Manager.
The Amazon MSK AWS Glue Schema Registry data source for Amazon DataZone enables seamless registration of Kafka topics as custom assets in Amazon DataZone. It does require that the topics in the Amazon MSK cluster use the Schema Registry for schema management.
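The create/update/delete reconciliation the data source performs can be sketched as set logic over the asset identifiers stored in Parameter Store versus the schemas currently in the registry. The function and names below are illustrative, not the solution's actual code:

```python
def reconcile_assets(
    registered: dict[str, str], current_schemas: set[str]
) -> tuple[set[str], set[str], set[str]]:
    """Decide which Amazon DataZone assets to create, update, or delete.

    registered: schema name -> DataZone asset ID, as stored in Parameter Store.
    current_schemas: schema names currently present in the Glue Schema Registry.
    """
    to_create = current_schemas - registered.keys()  # new schema, no asset yet
    to_update = current_schemas & registered.keys()  # existing asset, new revision
    to_delete = registered.keys() - current_schemas  # schema removed from registry
    return to_create, to_update, to_delete


creates, updates, deletes = reconcile_assets(
    {"orders": "asset-1", "payments": "asset-2"},
    {"orders", "shipments"},
)
# creates == {"shipments"}, updates == {"orders"}, deletes == {"payments"}
```

Each resulting set then maps to the corresponding Amazon DataZone API call (`CreateAsset`, `CreateAssetRevision`, `DeleteAsset`) plus the matching Parameter Store write or delete.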
Custom authorization flow
For managed assets such as AWS Glue Data Catalog and Amazon Redshift assets, the process to grant access to the consumer is managed by Amazon DataZone. Custom asset types are considered unmanaged assets, and the process to grant access needs to be implemented outside of Amazon DataZone.
The high-level steps for the end-to-end flow are as follows:
- (Conditional) If the consumer environment doesn't have a subscription target, create it through the `CreateSubscriptionTarget` API call. The subscription target tells Amazon DataZone which environments are compatible with an asset type.
- The consumer triggers a subscription request by subscribing to the relevant streaming data asset through the Amazon DataZone portal.
- The producer receives the subscription request and approves (or denies) it.
- After the subscription request has been approved by the producer, the consumer can see the streaming data asset in their project under the Subscribed data section.
- The consumer can choose to trigger a subscription grant to a target environment directly from the Amazon DataZone portal, and this action triggers the custom authorization flow.
For steps 2–4, you rely on the default behavior of Amazon DataZone and no change is required. The focus of this section is therefore step 1 (subscription target) and step 5 (subscription grant process).
Subscription target
Amazon DataZone has a concept called environments within a project, which indicates where the resources are located and the related access configuration (for example, the IAM role) used to access those resources. To allow an environment to have access to the custom asset type, consumers need to use the Amazon DataZone `CreateSubscriptionTarget` API prior to the subscription grants. The creation of the subscription target is a one-time operation per custom asset type per environment. In addition, the `authorizedPrincipals` parameter inside the `CreateSubscriptionTarget` API lists the various IAM principals given access to the Amazon MSK topic as part of the grant authorization flow. Finally, when calling `CreateSubscriptionTarget`, the underlying principal used to call the API must belong to the target environment's AWS account ID.
After the subscription target has been created for a custom asset type and environment, the environment is eligible as a target for subscription grants.
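A sketch of the one-time `CreateSubscriptionTarget` call follows, again as a Boto3 payload builder. The asset type name, environment and domain identifiers, and role ARNs are placeholders; the exact configuration (for example, `manageAccessRole`) depends on your environment setup:

```python
# Sketch: payload for the one-time CreateSubscriptionTarget call that makes an
# environment eligible for MSK subscription grants. Identifiers are placeholders.

def build_subscription_target_request(
    domain_id: str, environment_id: str, consumer_role_arns: list[str]
) -> dict:
    """Assemble the payload for datazone.create_subscription_target()."""
    return {
        "domainIdentifier": domain_id,
        "environmentIdentifier": environment_id,
        "name": "MskTopicSubscriptionTarget",
        "type": "MskTopicAssetType",
        "provider": "custom",
        # IAM principals the grant authorization flow gives access to
        "authorizedPrincipals": consumer_role_arns,
        "applicableAssetTypes": ["MskTopicAssetType"],
        # Role assumed to manage access; placeholder choice for illustration
        "manageAccessRole": consumer_role_arns[0],
        "subscriptionTargetConfig": [],
    }


request = build_subscription_target_request(
    "dzd_example123",
    "env_example789",
    ["arn:aws:iam::111122223333:role/consumer-environment-role"],
)
# boto3.client("datazone").create_subscription_target(**request)
```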
Subscription grant process
Amazon DataZone emits events based on user actions, and you use this mechanism to trigger the custom authorization process when a subscription grant has been triggered for Amazon MSK topics. Specifically, you use the Subscription grant requested event. These are the steps of the authorization flow:
- A Lambda function collects metadata on the following:
  - The producer Amazon MSK cluster or Kinesis data stream that the consumer is requesting access to. Metadata is collected using the `GetListing` API.
  - The target environment, using a call to the `GetEnvironment` API.
  - The subscription target, using a call to the `GetSubscriptionTarget` API to collect the consumer roles to grant.
- In parallel, the Amazon DataZone internal metadata about the status of the subscription grant needs to be updated, and this happens in this step. Depending on the type of action being performed (such as `GRANT` or `REVOKE`), the status of the subscription grant is updated accordingly (for example, `GRANT_IN_PROGRESS` or `REVOKE_IN_PROGRESS`).
- After the metadata has been collected, it's passed downstream as part of the AWS Step Functions state.
- Update the resource policy of the target resource (for example, the Amazon MSK cluster or Kinesis data stream) in the producer account. The update allows authorized principals from the consumer to access or read the target resource. An example of such a policy follows:
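The following is a hedged sketch of what such a cluster resource policy might look like for an Amazon MSK cluster with IAM authentication; the account IDs, Region, cluster, and topic names are illustrative placeholders:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowDataZoneConsumerRead",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::111122223333:role/consumer-environment-role"
      },
      "Action": [
        "kafka-cluster:Connect",
        "kafka-cluster:DescribeTopic",
        "kafka-cluster:DescribeGroup",
        "kafka-cluster:AlterGroup",
        "kafka-cluster:ReadData"
      ],
      "Resource": [
        "arn:aws:kafka:us-east-1:444455556666:cluster/producer-cluster/*",
        "arn:aws:kafka:us-east-1:444455556666:topic/producer-cluster/*/producer-data-product",
        "arn:aws:kafka:us-east-1:444455556666:group/producer-cluster/*"
      ]
    }
  ]
}
```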
- Update the configured authorized principals by attaching additional IAM permissions, depending on specific conditions. The following examples illustrate what's being added.
The base access or read permissions are as follows:
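A sketch of the base read permissions attached to the consumer role, using the `kafka-cluster` IAM actions for MSK IAM authentication (ARNs are placeholders):

```json
{
  "Effect": "Allow",
  "Action": [
    "kafka-cluster:Connect",
    "kafka-cluster:DescribeTopic",
    "kafka-cluster:DescribeGroup",
    "kafka-cluster:AlterGroup",
    "kafka-cluster:ReadData"
  ],
  "Resource": [
    "arn:aws:kafka:us-east-1:444455556666:cluster/producer-cluster/*",
    "arn:aws:kafka:us-east-1:444455556666:topic/producer-cluster/*/producer-data-product",
    "arn:aws:kafka:us-east-1:444455556666:group/producer-cluster/*"
  ]
}
```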
If an AWS Glue Schema Registry ARN is provided as part of the AWS CDK construct parameter, additional permissions are added to allow access to both the registry and the specific schema:
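A sketch of the Schema Registry permissions, limited to read-only actions on the registry and schema (the registry and schema ARNs are placeholders):

```json
{
  "Effect": "Allow",
  "Action": [
    "glue:GetRegistry",
    "glue:GetSchema",
    "glue:GetSchemaVersion",
    "glue:GetSchemaByDefinition"
  ],
  "Resource": [
    "arn:aws:glue:us-east-1:444455556666:registry/producer-registry",
    "arn:aws:glue:us-east-1:444455556666:schema/producer-registry/producer-data-product"
  ]
}
```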
If the grant is for a consumer in a different account, the following permissions are also added to allow managed VPC connections to be created by the consumer:
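A sketch of the cross-account permissions that let the consumer create a managed VPC connection to the producer cluster (action list and ARN are illustrative):

```json
{
  "Effect": "Allow",
  "Action": [
    "kafka:CreateVpcConnection",
    "kafka:GetBootstrapBrokers",
    "kafka:DescribeCluster",
    "kafka:DescribeClusterV2"
  ],
  "Resource": "arn:aws:kafka:us-east-1:444455556666:cluster/producer-cluster/*"
}
```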
- Update the Amazon DataZone internal metadata on the progress of the subscription grant (for example, `GRANTED` or `REVOKED`). If there's an exception in a step, it's handled within Step Functions and the subscription grant metadata is updated with a failed state (for example, `GRANT_FAILED` or `REVOKE_FAILED`).
Because Amazon DataZone supports multi-account architectures, the subscription grant process is a distributed workflow that needs to perform actions across different accounts, and it's orchestrated from the Amazon DataZone domain account where all the events are received.
Implement streaming governance in Amazon DataZone with DSF
In this section, we deploy an example to illustrate the solution using DSF on AWS, which provides all the required components to accelerate its implementation. We use the following CDK L3 constructs from DSF:
- `DataZoneMskAssetType` creates the custom asset type representing an Amazon MSK topic in Amazon DataZone
- `DataZoneGsrMskDataSource` automatically creates Amazon MSK topic assets in Amazon DataZone based on the schema definitions in the Schema Registry
- `DataZoneMskCentralAuthorizer` and `DataZoneMskEnvironmentAuthorizer` implement the subscription grant process for Amazon MSK topics with IAM authentication
The following diagram shows the architecture of the solution.

In this example, we use Python for the example code. DSF also supports TypeScript.
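As an infrastructure-wiring sketch (not the exact stack from the repository), a Python CDK app using these constructs might look like the following. The package import path, construct parameters, and all identifiers are assumptions; check the DSF documentation for the actual API:

```python
from aws_cdk import App, Stack
from constructs import Construct
# Assumed import path for the DSF Python package; verify against DSF docs
import cdklabs.aws_data_solutions_framework as dsf


class StreamingGovernanceStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # Custom asset type representing an Amazon MSK topic
        dsf.governance.DataZoneMskAssetType(
            self, "MskAssetType",
            domain_id="dzd_example123",  # placeholder DataZone domain ID
        )

        # Data source creating MSK topic assets from Glue Schema Registry schemas
        dsf.governance.DataZoneGsrMskDataSource(
            self, "GsrMskDataSource",
            domain_id="dzd_example123",
            project_id="prj_example456",       # placeholder producer project ID
            registry_name="producer-registry",  # placeholder registry name
            cluster_name="producer-cluster",    # placeholder MSK cluster name
        )

        # Central and environment authorizers implementing the grant workflow
        dsf.governance.DataZoneMskCentralAuthorizer(
            self, "MskCentralAuthorizer", domain_id="dzd_example123",
        )
        dsf.governance.DataZoneMskEnvironmentAuthorizer(
            self, "MskEnvironmentAuthorizer", domain_id="dzd_example123",
        )


app = App()
StreamingGovernanceStack(app, "StreamingGovernanceStack")
app.synth()
```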
Deployment steps
Follow the steps in the data-solutions-framework-on-aws README to deploy the solution. You need to deploy the CDK stack first, then create the custom environment, and redeploy the stack with additional information.
Verify the example is working
To verify the example is working, produce sample data using the Lambda function StreamingGovernanceStack-ProducerLambda. Follow these steps:
- Use the AWS Lambda console to test the Lambda function by running a sample test event. The event JSON should be empty. Save your test event and choose Test.

- Producing test events generates a new schema producer-data-product in the Schema Registry. Check that the schema was created from the AWS Glue console by opening the Data Catalog menu on the left and selecting Stream schema registries.

- New data assets should appear in the Amazon DataZone portal, under the PRODUCER project
- On the DATA tab, in the left navigation pane, select Inventory data, as shown in the following screenshot
- Select producer-data-product

- Select the BUSINESS METADATA tab to view the business metadata, as shown in the following screenshot.

- To view the schema, select the SCHEMA tab, as shown in the following screenshot

- To view the lineage, select the LINEAGE tab
- To publish the asset, select PUBLISH ASSET, as shown in the following screenshot
Subscribe
To subscribe, follow these steps:
- Switch to the consumer project by selecting CONSUMER in the top left of the screen
- Select Browse Catalog
- Choose producer-data-product and choose SUBSCRIBE, as shown in the following screenshot

- Return to the PRODUCER project and choose producer-data-product, as shown in the following screenshot

- Choose APPROVE, as shown in the following screenshot

- Go to the AWS Identity and Access Management (IAM) console and search for the consumer role. In the role definition, you should see an IAM inline policy with permissions on the Amazon MSK cluster, the Kafka topic, the Kafka consumer group, the AWS Glue Schema Registry, and the schema from the producer.

- Now switch to the consumer's environment in the Amazon Managed Service for Apache Flink console and run the Flink application called flink-consumer using the Run button.

- Return to the Amazon DataZone portal and confirm that the lineage under the CONSUMER project was updated and the new Flink job run node was added to the lineage graph, as shown in the following screenshot

Clean up
To clean up the resources you created as part of this walkthrough, follow these steps:
- Stop the Amazon Managed Service for Apache Flink job.
- Revoke the subscription grant from the Amazon DataZone console.
- Run `cdk destroy` in your local terminal to delete the stack. Because you marked the constructs with `RemovalPolicy.DESTROY` and configured DSF to remove data on destroy, running `cdk destroy` or deleting the stack from the AWS CloudFormation console will clean up the provisioned resources.
Conclusion
In this post, we shared how you can integrate streaming data from Amazon MSK with Amazon DataZone to create a unified data governance framework that spans the entire data lifecycle, from the ingestion of streaming data to its storage and eventual consumption by various producers and consumers.
We also demonstrated how to use the AWS CDK and DSF on AWS to quickly implement this solution using built-in best practices. In addition to Amazon DataZone streaming governance, DSF supports other patterns, such as Spark data processing and Amazon Redshift data warehousing. Our roadmap is publicly available, and we look forward to your feature requests, contributions, and feedback. You can get started using DSF by following our Quick start guide.
About the Authors
Vincent Gromakowski is a Principal Analytics Solutions Architect at AWS, where he enjoys solving customers' data challenges. He uses his strong expertise in analytics, distributed systems, and resource orchestration platforms to be a trusted technical advisor for AWS customers.
Francisco Morillo is a Sr. Streaming Solutions Architect at AWS, specializing in real-time analytics architectures. With over five years in the streaming data space, Francisco has worked as a data analyst for startups and as a big data engineer for consultancies, building streaming data pipelines. He has deep expertise in Amazon Managed Streaming for Apache Kafka (Amazon MSK) and Amazon Managed Service for Apache Flink. Francisco collaborates closely with AWS customers to build scalable streaming data solutions and advanced streaming data lakes, ensuring seamless data processing and real-time insights.
Jan Michael Go Tan is a Principal Solutions Architect for Amazon Web Services. He helps customers design scalable and innovative solutions with the AWS Cloud.
Sofia Zilberman is a Sr. Analytics Specialist Solutions Architect at Amazon Web Services. She has a track record of 15 years of creating large-scale, distributed processing systems. She remains passionate about big data technologies and architecture trends, and is constantly on the lookout for functional and technological innovations.
