Jumia builds a next-generation information platform with metadata-driven specification frameworks

January 2, 2025

32

Jumia is a expertise firm born in 2012, current in 14 African nations, with its predominant headquarters in Lagos, Nigeria. Jumia is constructed round a market, a logistics service, and a cost service. The logistics service permits the supply of packages by means of a community of native companions, and the cost service facilitates the funds of on-line transactions inside Jumia’s ecosystem. Jumia is current in NYSE and has a market cap of $554 million.

On this put up, we share a part of the journey that Jumia took with AWS Skilled Companies to modernize its information platform that ran below a Hadoop distribution to AWS serverless based mostly options. A number of the challenges that motivated the modernization had been the excessive value of upkeep, lack of agility to scale computing at particular occasions, job queuing, lack of innovation when it got here to buying extra trendy applied sciences, advanced automation of the infrastructure and purposes, and the lack to develop domestically.

Answer overview

The fundamental idea of the modernization venture is to create metadata-driven frameworks, that are reusable, scalable, and in a position to answer the totally different phases of the modernization course of. These phases are: information orchestration, information migration, information ingestion, information processing, and information upkeep.

This standardization for every part was thought-about as a approach to streamline the event workflows and decrease the danger of errors that may come up from utilizing disparate strategies. This additionally enabled migration of various sorts of information following an identical strategy whatever the use case. By adopting this strategy, the information dealing with is constant, extra environment friendly, and extra simple to handle throughout totally different tasks and groups. As well as, though the use instances have autonomy of their area from a governance perspective, on high of them is a centralized governance mannequin that defines the entry management within the shared architectural parts. Importantly, this implementation emphasizes information safety by imposing encryption throughout all companies, together with Amazon Easy Storage Service (Amazon S3) and Amazon DynamoDB. Moreover, it adheres to the precept of least privilege, thereby enhancing general system safety and decreasing potential vulnerabilities.

The next diagram describes the frameworks that had been created. On this design, the workloads within the new information platform are divided by use case. Every use case requires the creation of a set of YAML recordsdata for every part, from information migration to information move orchestration, and they’re mainly the enter of the system. The output is a set of DAGs that run the particular duties.

Overview

Within the following sections, we talk about the goals, implementation, and learnings of every part in additional element.

Knowledge orchestration

The target of this part is to construct a metadata-driven framework to orchestrate the information flows alongside the entire modernization course of. The orchestration framework gives a sturdy and scalable answer that has the next capacities: dynamically create DAGs, combine natively with non-AWS companies, permit the creation of dependencies based mostly on previous executions, and add an accessible metadata technology per every execution. Due to this fact, it was determined to make use of Amazon Managed Workflows for Apache Airflow (Amazon MWAA), which, by means of the Apache Airflow engine, gives these functionalities whereas abstracting customers from the administration operation.

The next is the outline of the metadata recordsdata which are supplied as a part of the information orchestration part for a given use case that performs the information processing utilizing Spark on Amazon EMR Serverless:

proprietor: # Use case proprietor
dags: # Checklist of DAGs to be created for this use case
  - title: # Use case title
    kind: # Kind of DAG (might be migration, ingestion, transformation or upkeep)
    tags: # Checklist of TAGs
    notification: # Defines notificacions for this DAGs
      on_success_callback: true
      on_failure_callback: true
    spark: # Spark job data 
      entrypoint: # Spark script 
      arguments: # Arguments required by the Spark script
      spark_submit_parameters: # Spark submit parameters.

The thought behind all of the frameworks is to construct reusable artifacts that allow the event groups to speed up their work whereas offering reliability. On this case, the framework gives the capabilities to create DAG objects inside Amazon MWAA based mostly on configuration recordsdata (YAML recordsdata).

This specific framework is constructed on layers that add totally different functionalities to the ultimate DAG:

DAGs – The DAGs are constructed based mostly on the metadata data supplied to the framework. The information engineers don’t have to write down Python code so as to create the DAGs, they’re mechanically created and this module is answerable for performing this dynamic creation of DAGs.
Validations – This layer handles YAML file validation so as to stop corrupted recordsdata from affecting the creation of different DAGs.
Dependencies – This layer handles dependencies amongst totally different DAGs so as to deal with advanced interconnections.
Notifications – This layer handles the kind of notifications and alerts which are a part of the workflows.

Orchestration

One facet to contemplate when utilizing Amazon MWAA is that, being a managed service, it requires some upkeep from the customers, and it’s necessary to have a very good understanding of the variety of DAGs and processes that you just’re anticipated to have so as to fine-tune the occasion and procure the specified efficiency. A number of the parameters that had been fine-tuned in the course of the engagement had been core.dagbag_import_timeout, core.dag_file_processor_timeout, core.min_serialized_dag_update_interval, core.min_serialized_dag_fetch_interval, scheduler.min_file_process_interval, scheduler.max_dagruns_to_create_per_loop, scheduler.processor_poll_interval, scheduler.dag_dir_list_interval, and celery.worker_autoscale.

One of many layers described within the previous diagram corresponds to validation. This was an necessary part for the creation of dynamic DAGs. As a result of the enter to the framework consists of YML recordsdata, it was determined to filter out corrupted recordsdata earlier than trying to create the DAG objects. Following this strategy, Jumia may keep away from undesired interruptions of the entire course of. The module that really builds DAGs solely receives configuration recordsdata that comply with the required specs to efficiently create them. In case of corrupted recordsdata, data concerning the precise points is logged into Amazon CloudWatch in order that builders can repair them.

Knowledge migration

The target of this part is to construct a metadata-driven framework for migrating information from HDFS to Amazon S3 with Apache Iceberg storage format, which entails the least operational overhead, gives scalability capability throughout peak hours, and ensures information integrity and confidentiality.

The next diagram illustrates the structure.

Migration

Throughout this part, a metadata-driven framework inbuilt PySpark receives a configuration file as enter in order that some migration duties can run in an Amazon EMR Serverless job. This job makes use of the PySpark framework because the script location. Then the orchestration framework described beforehand is used to create a migration DAG that runs the next duties:

The primary process creates the DDLs in Iceberg format within the AWS Glue Knowledge Catalog utilizing the migration framework inside an Amazon EMR Serverless job.
After the tables are created, the second process transfers HDFS information to a touchdown bucket in Amazon S3 utilizing AWS DataSync to sync buyer information. This course of brings information from all of the totally different layers of the information lake.
When this course of is full, a 3rd process converts information to Iceberg format from the touchdown bucket to the vacation spot bucket (uncooked, course of, or analytics) utilizing once more another choice of the migration framework embedded in an Amazon EMR Serverless job.

Knowledge switch efficiency is healthier when the scale of the recordsdata to be transferred is round 128–256 MB, so it’s beneficial to compress the recordsdata on the supply. By decreasing the variety of recordsdata, metadata evaluation and integrity phases are decreased, dashing up the migration part.

Knowledge ingestion

The target of this part is to implement one other framework based mostly on metadata that responds to the 2 information ingestion fashions. A batch mode is liable for extracting information from totally different information sources (similar to Oracle or PostgreSQL) and a micro-batch-based mode extracts information from a Kafka cluster that, based mostly on configuration parameters, has the capability to run native streams in streaming.

The next diagram illustrates the structure for the batch and micro-batch and streaming strategy.

Ingestion

Throughout this part, a metadata-driven framework builds the logic to convey information from Kafka, databases, or exterior companies, that can be run utilizing an ingestion DAG deployed in Amazon MWAA.

Spark Structured Streaming was used to ingest information from Kafka subjects. The framework receives configuration recordsdata in YAML format that point out which subjects to learn, what extraction processes needs to be carried out, whether or not it needs to be learn in streaming or micro-batch, and wherein vacation spot desk the knowledge needs to be saved, amongst different configurations.

For batch ingestion, a metadata-driven framework written in Pyspark was applied. In the identical means because the earlier one, the framework acquired a configuration in YAML format with the tables to be migrated and their vacation spot.

One of many features to contemplate in this kind of migration is the synchronization of information from the ingestion part and the migration part, in order that there isn’t a lack of information and that information shouldn’t be reprocessed unnecessarily. To this finish, an answer has been applied that saves the timestamps of the final historic information (per desk) migrated in a DynamoDB desk. Each sorts of frameworks are programmed to make use of this information the primary time they’re run. For micro-batching use instances, which use Spark Structured Streaming, Kafka information is learn by assigning the worth saved in DynamoDB to the startingTimeStamp parameter. For all different executions, precedence can be given to the metadata within the checkpoint folder. This manner, you may make certain ingestion is synchronized with the information migration.

Knowledge processing

The target on this part was to have the ability to deal with updates and deletions of information in an object-oriented file system, so Iceberg is a key answer that was adopted all through the venture as delta lake recordsdata due to its ACID capabilities. Though all phases use Iceberg as delta recordsdata, the processing part makes intensive use of Iceberg’s capabilities to do incremental processing of information, creating the processing layer utilizing UPSERT utilizing Iceberg’s capability to run MERGE INTO instructions.

The next diagram illustrates the structure.

Processing

The structure is just like the ingestion part, with simply modifications to the information supply to be Amazon S3. This strategy hastens the supply part and maintains high quality with a production-ready answer.

By default, Amazon EMR Serverless has the spark.dynamicAllocation.enabled parameter set to True. This feature scales up or down the variety of executors registered throughout the utility, based mostly on the workload. This brings quite a lot of benefits when coping with several types of workloads, but it surely additionally brings concerns when utilizing Iceberg tables. For example, whereas writing information into an Iceberg desk, the Amazon EMR Serverless utility can use a lot of executors so as to velocity up the duty. This can lead to reaching Amazon S3 limits, particularly the variety of requests per second per prefix. For that reason, it’s necessary to use good information partitioning practices.

One other necessary facet to contemplate in these instances is the item storage file structure. By default, Iceberg makes use of the Hive storage structure, however it may be set to make use of ObjectStoreLocationProvider. By setting this property, a deterministic hash is generated for every file, with a hash appended instantly after write.information.path. This will significantly decrease throttle requests based mostly on object prefix, in addition to maximize throughput for Amazon S3 associated I/O operations, as a result of the recordsdata written are equally distributed throughout a number of prefixes.

Knowledge upkeep

When working with information lake desk codecs similar to Iceberg, it’s important to interact in routine upkeep duties to optimize desk metadata file administration, stopping a lot of pointless recordsdata from accumulating and promptly eradicating any unused recordsdata. The target of this part was to construct one other framework that may carry out these kinds of duties on the tables throughout the information lake.

The next diagram illustrates the structure.

Maintenance

The framework, in addition to the opposite ones, receives a configuration file (YAML recordsdata) indicating the tables and the record of upkeep duties with their respective parameters. It was constructed on PySpark in order that it may run as an Amazon EMR Serverless job and might be orchestrated utilizing the orchestration framework identical to the opposite frameworks constructed as a part of this answer.

The next upkeep duties are supported by the framework:

Expire snapshots – Snapshots can be utilized for rollback operations in addition to time touring queries. Nevertheless, they will accumulate over time and may result in efficiency degradation. It’s extremely beneficial to often expire snapshots which are not wanted.
Take away outdated metadata recordsdata – Metadata recordsdata can accumulate over time identical to snapshots. Eradicating them often can also be beneficial, particularly when coping with streaming or micro-batching operations, which was one of many instances of the general answer.
Compact recordsdata – Because the variety of information recordsdata will increase, the variety of metadata saved within the manifest recordsdata additionally will increase, and small information recordsdata can result in much less environment friendly queries. As a result of this answer makes use of a streaming and micro-batching utility writing into Iceberg tables, the scale of the recordsdata tends to be small. For that reason, a technique to compact recordsdata was crucial to boost the general efficiency.
Arduous delete information – One of many necessities was to have the ability to carry out laborious deletes within the information older than a sure time frame. This means eradicating expiring snapshots and eradicating metadata recordsdata.

The upkeep duties had been scheduled with totally different frequencies relying on the use case and the particular process. For that reason, the schedule data for this duties is outlined in every of the YAML recordsdata of the particular use case.

On the time this framework was applied, there was no any computerized upkeep answer on high of Iceberg tables. At AWS re:Invent 2024, Amazon S3 Tables performance has been launched to automatize the upkeep of Iceberg Tables . This performance automates file compaction, snapshot administration, and unreferenced file elimination.

Conclusion

Constructing a knowledge platform on high of standarized frameworks that use metadata for various features of the information dealing with course of, from information migration and ingestion to orchestration, enhances the visibility and management over every of the phases and considerably hastens implementation and improvement processes. Moreover, by utilizing companies similar to Amazon EMR Serverless and DynamoDB, you’ll be able to convey all the advantages of serverless architectures, together with scalability, simplicity, versatile integration, improved reliability, and cost-efficiency.

With this structure, Jumia was capable of cut back their information lake value by 50%. Moreover, with this strategy, information and DevOps groups had been capable of deploy full infrastructures and information processing capabilities by creating metadata recordsdata together with Spark SQL recordsdata. This strategy has decreased turnaround time to manufacturing and decreased failure charges. Moreover, AWS Lake Formation supplied the capabilities to collaborate and govern datasets on numerous storage layers on the AWS platform and externally.

Leveraging AWS for our information platform has not solely optimized and decreased our infrastructure prices but additionally standardized our workflows and methods of working throughout information groups and established a extra reliable single supply of fact for our information property. This transformation has boosted our effectivity and agility, enabling sooner insights and enhancing the general worth of our information platform.

– Hélder Russa, Head of Knowledge Engineering at Jumia Group.

Take step one in direction of streamlining the information migration course of now, with AWS.

In regards to the Authors

Ramón Díez is a Senior Buyer Supply Architect at Amazon Net Companies. He led the venture with the agency conviction of utilizing expertise in service of the enterprise.

Paula Marenco is a Knowledge Architect at Amazon Net Companies, she enjoys designing analytical options that convey mild into complexity, turning intricate information processes into clear and actionable insights. Her work focuses on making information extra accessible and impactful for decision-making.

Hélder Russa is the Head of Knowledge Engineering at Jumia Group, contributing to the technique definition, design, and implementation of a number of Jumia information platforms that assist the general decision-making course of, in addition to operational options, information science tasks, and real-time analytics.

Pedro Gonçalves is a Principal Knowledge Engineer at Jumia Group, liable for designing and overseeing the information structure, emphasizing on AWS Platform and datalakehouse applied sciences to make sure strong and agile information options and analytics capabilities.

Jumia builds a next-generation information platform with metadata-driven specification frameworks

Answer overview

Knowledge orchestration

Knowledge migration

Knowledge ingestion

Knowledge processing

Knowledge upkeep

Conclusion

In regards to the Authors

Related Articles

Researchers establish strength-limiting defects in 3D printed ceramics

10 Python One-Liners for Producing Time Collection Options

Is ChatGPT Atlas Higher than Perplexity Comet?

LEAVE A REPLY Cancel reply

Latest Articles

Researchers establish strength-limiting defects in 3D printed ceramics

10 Python One-Liners for Producing Time Collection Options

Is ChatGPT Atlas Higher than Perplexity Comet?

Google disputes false claims of huge Gmail information breach

Extremely-Skinny Filters Might Assist Enhance Manufacturing of Medicines and Dyes

ABOUT US