
How to Replicate Uber’s GenAI Invoice Processing System?


Manual data entry from invoices is a slow, error-prone task that companies have battled for decades. Recently, Uber Engineering revealed how they tackled this problem with their “TextSense” platform, a sophisticated system for GenAI invoice processing. The system showcases the power of intelligent document processing, combining Optical Character Recognition (OCR) with Large Language Models (LLMs) for highly accurate, automated data extraction. Such an advanced approach might sound out of reach for smaller projects, but the core ideas are now accessible to everyone. This guide shows you how to replicate the fundamental workflow of Uber’s system, using simple, powerful tools to build a setup that automates invoice data extraction.

Understanding Uber’s “TextSense” System

Before we build our version, it is helpful to understand what inspired it. Uber’s goal was to automate the processing of millions of documents, from invoices to receipts. Their “TextSense” platform, detailed in their engineering blog, is a robust, multi-stage pipeline designed for this purpose.

The figure shows the end-to-end document processing pipeline. For any document, some pre-processing is usually performed before calling an LLM.

At its core, the system works in three main stages:

  1. Digitization (via OCR): First, the system takes a document, such as a PDF or an image of an invoice. It uses an advanced OCR engine to “read” the document and convert all of the visible text into machine-readable text. This raw text is the foundation for the next step.
  2. Intelligent Extraction (via LLM): The raw text from the OCR step is often messy and unstructured. This is where the GenAI magic happens. Uber feeds this text to a large language model. The LLM acts like an expert who understands the context of an invoice: it can identify and extract specific pieces of information, such as the “Invoice Number,” “Total Amount,” and “Supplier Name,” and organize them into a structured format like JSON.
  3. Verification (Human-in-the-Loop): No AI is perfect. To ensure 100% accuracy, Uber implemented a human-in-the-loop AI system. This verification step presents the original document alongside the AI-extracted data to a human operator. The operator can quickly confirm that the data is correct or make minor adjustments if needed. This feedback loop also helps improve the model over time.

This combination of OCR, AI, and human oversight makes the system both efficient and reliable. The following figure explains the TextSense workflow in detail, as described in the points above; a minimal code skeleton of the same three stages is sketched below.
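To make that flow concrete, here is a minimal Python skeleton of the three stages. The function names are illustrative placeholders, not Uber’s actual APIs; the rest of this guide fills in each stage with real code.

def digitize(pdf_path: str) -> str:
    """Stage 1: OCR. Convert each page of the document into raw machine-readable text."""
    ...

def extract_entities(raw_text: str) -> dict:
    """Stage 2: LLM. Turn raw OCR text into structured fields, e.g. {"Invoice Number": ..., "Total Amount": ...}."""
    ...

def human_review(pdf_path: str, fields: dict) -> dict:
    """Stage 3: Human-in-the-loop. Show the document and the extracted fields side by side, return corrections."""
    ...

def process_invoice(pdf_path: str) -> dict:
    # The whole pipeline is just these three stages chained together.
    return human_review(pdf_path, extract_entities(digitize(pdf_path)))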

Our Game Plan: Replicating the Core Workflow

Our goal is not to rebuild Uber’s entire production-grade platform. Instead, we will replicate its core intelligence in a simplified, accessible way, building our GenAI invoice processing POC in a single Google Colab notebook.

Our plan follows the same logical steps:

  1. Ingest Document: We will create a simple way to upload a PDF invoice directly into our notebook.
  2. Perform OCR: We will use Tesseract, a powerful open-source OCR engine, to extract all of the text from the uploaded invoice.
  3. Extract Entities with AI: We will use the Google Gemini API to perform the automated data extraction, crafting a specific prompt that instructs the model to pull out the key fields we need.
  4. Create a Verification UI: We will build a simple interactive interface using ipywidgets to serve as our human-in-the-loop AI system, allowing quick validation of the extracted data.
Flowchart of Application

This approach gives us a powerful and inexpensive way to achieve intelligent document processing without needing complex infrastructure.

Hands-On Implementation: Building the POC Step-by-Step

Let’s begin building our system. You can follow these steps in a new Google Colab notebook.

Step 1: Setting Up the Environment

First, we need to install the necessary Python libraries. This command installs packages for handling PDFs (PyMuPDF), running OCR (pytesseract), interacting with the Gemini API, and building the UI (ipywidgets). It also installs the Tesseract OCR engine itself.

!pip install -q -U google-generativeai PyMuPDF pytesseract pandas ipywidgets
!apt-get -qq install tesseract-ocr
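If you want to confirm that the Tesseract binary was installed correctly before continuing, a quick version check (optional) is enough; both lines below should print a Tesseract version number.

!tesseract --version
import pytesseract
print(pytesseract.get_tesseract_version())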

Step 2: Configuring the Google Gemini API

Next, you need to configure your Gemini API key. To keep the key safe, we will use Colab’s built-in secrets manager.

  1. Get your API key from Google AI Studio.
  2. In your Colab notebook, click the key icon in the left sidebar.
  3. Create a new secret named GEMINI_API_KEY and paste your key as the value.

The following code securely accesses your key and configures the API.

import google.generativeai as genai
from google.colab import userdata
import fitz  # PyMuPDF
import pytesseract
from PIL import Image
import pandas as pd
import ipywidgets as widgets
from ipywidgets import Layout
from IPython.display import display, clear_output
import json
import io

# Configure the Gemini API
try:
    api_key = userdata.get("GEMINI_API_KEY")
    genai.configure(api_key=api_key)
    print("Gemini API configured successfully.")
except userdata.SecretNotFoundError:
    print("ERROR: Secret 'GEMINI_API_KEY' not found. Please follow the instructions to set it up.")

Step 3: Uploading and Pre-processing the PDF

This code loads an invoice PDF (update the path to point at your uploaded file). It converts each page into a high-resolution image, which is the ideal format for OCR.

import fitz  # PyMuPDF
from PIL import Image
import io
import os

invoice_images = []
uploaded_file_name = "/content/sample-invoice.pdf"  # Replace with the actual path to your PDF file

# Make sure the file exists (optional but recommended)
if not os.path.exists(uploaded_file_name):
    print(f"ERROR: File not found at '{uploaded_file_name}'. Please update the file path.")
else:
    print(f"Processing '{uploaded_file_name}'...")

    # Convert PDF pages to images
    doc = fitz.open(uploaded_file_name)
    for page_num in range(len(doc)):
        page = doc.load_page(page_num)
        pix = page.get_pixmap(dpi=300)  # Higher DPI for better OCR
        img = Image.open(io.BytesIO(pix.tobytes()))
        invoice_images.append(img)
    doc.close()
    print(f"Successfully converted {len(invoice_images)} page(s) to images.")

    # Display the first page as a preview
    if invoice_images:
        print("\n--- Invoice Preview (First Page) ---")
        display(invoice_images[0].resize((600, 800)))

Output:

Preprocessed PDF
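If you would rather pick the PDF interactively than hard-code a path, Colab’s files.upload() helper can replace the uploaded_file_name assignment above (a small sketch; the rest of the cell stays the same).

from google.colab import files

# Opens a file picker in the notebook; the chosen file is saved to the working directory.
uploaded = files.upload()
uploaded_file_name = list(uploaded.keys())[0]
print(f"Will process '{uploaded_file_name}'")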

Step 4: Extracting Text with OCR

Now, we run the OCR process on the images we just created. The text from all pages is combined into a single string; this is the context we will send to the Gemini model. This step is a crucial part of the OCR with AI workflow.

full_invoice_text = ""

if not invoice_images:

   print("Please add a PDF bill within the step above first.")

else:

   print("Extracting textual content with OCR...")

   for i, img in enumerate(invoice_images):

       textual content = pytesseract.image_to_string(img)

       full_invoice_text += f"n--- Web page {i+1} ---n{textual content}"

   print("OCR extraction full.")

   print("n--- Extracted Textual content (first 500 characters) ---")

   print(full_invoice_text[:500] + "...")

Output:

Extracting text using OCR

Step 5: Extracting Entities with Gemini

This is where the GenAI invoice processing happens. We create a detailed prompt that tells the Gemini model its role, instructing it to extract specific fields and return the result in clean JSON. Asking for JSON is a powerful technique that makes the model’s output structured and easy to work with.

extracted_data = {}

if not full_invoice_text.strip():
    print("Cannot proceed. The extracted text is empty. Please check the PDF quality.")
else:
    # Instantiate the Gemini Pro model
    model = genai.GenerativeModel('gemini-2.5-pro')

    # Define the fields you want to extract
    fields_to_extract = "Invoice Number, Invoice Date, Due Date, Supplier Name, Supplier Address, Customer Name, Customer Address, Total Amount, Tax Amount"

    # Create the detailed prompt
    prompt = f"""
    You are an expert in invoice data extraction.
    Your task is to analyze the provided OCR text from an invoice and extract the following fields: {fields_to_extract}.

    Follow these rules strictly:
    1.  Return the output as a single, clean JSON object.
    2.  The keys of the JSON object must be exactly the field names provided.
    3.  If a field cannot be found in the text, its value in the JSON should be `null`.
    4.  Do not include any explanatory text, comments, or markdown formatting (like ```json) in your response. Only the JSON object is allowed.

    Here is the invoice text:
    ---
    {full_invoice_text}
    ---
    """

    print("Sending request to Gemini API...")
    try:
        # Call the API
        response = model.generate_content(prompt)

        # Robustly parse the JSON response
        response_text = response.text.strip()

        # Clean potential markdown formatting
        if response_text.startswith('```json'):
            response_text = response_text[7:-3].strip()

        extracted_data = json.loads(response_text)
        print("\n--- AI Extracted Data (JSON) ---")
        print(json.dumps(extracted_data, indent=2))
    except json.JSONDecodeError:
        print("\n--- ERROR ---")
        print("Failed to decode the model's response into JSON.")
        print("Model's Raw Response:", response.text)
    except Exception as e:
        print(f"\nAn unexpected error occurred: {e}")
        print("Model's Raw Response (if available):", getattr(response, 'text', 'N/A'))

Output:

Augmenting OCR using Gemini API
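As a side note, recent versions of the google-generativeai SDK can be asked for JSON output directly via generation_config, which reduces the need for the markdown-cleanup step above. A hedged sketch (verify that your SDK version and chosen model support the response_mime_type option):

# Optional alternative: request JSON directly instead of relying only on the prompt.
model = genai.GenerativeModel(
    'gemini-2.5-pro',
    generation_config={"response_mime_type": "application/json"}
)
response = model.generate_content(prompt)
extracted_data = json.loads(response.text)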

Step 6: Building the Human-in-the-Loop (HITL) UI

Finally, we build the verification interface. This code displays the invoice image on the left and creates an editable form on the right, pre-filled with the data from Gemini. The user can quickly review the information, make any necessary edits, and confirm.

# UI Widgets
text_widgets = {}

if not extracted_data:
    print("No data was extracted by the AI. Cannot build the verification UI.")
else:
    form_items = []

    # Create a text widget for each extracted field
    for key, value in extracted_data.items():
        text_widgets[key] = widgets.Text(
            value=str(value) if value is not None else "",
            description=key.replace('_', ' ').title() + ':',
            style={'description_width': 'initial'},
            layout=Layout(width="95%")
        )
        form_items.append(text_widgets[key])

    # The form container
    form = widgets.VBox(form_items, layout=Layout(padding='10px'))

    # Image container
    if invoice_images:
        img_byte_arr = io.BytesIO()
        invoice_images[0].save(img_byte_arr, format="PNG")
        image_widget = widgets.Image(
            value=img_byte_arr.getvalue(),
            format="png",
            width=500
        )
        image_box = widgets.HBox([image_widget], layout=Layout(justify_content="center"))
    else:
        image_box = widgets.HTML("No image to display.")

    # Confirmation button
    confirm_button = widgets.Button(description="Confirm and Save", button_style="success")
    output_area = widgets.Output()

    def on_confirm_button_clicked(b):
        with output_area:
            clear_output()
            final_data = {key: widget.value for key, widget in text_widgets.items()}

            # Create a pandas DataFrame
            df = pd.DataFrame([final_data])
            df['Source File'] = uploaded_file_name
            print("--- Verified and Finalized Data ---")
            display(df)

            # You can now save this DataFrame to CSV, etc.
            df.to_csv('verified_invoice.csv', index=False)
            print("\nData saved to 'verified_invoice.csv'")

    confirm_button.on_click(on_confirm_button_clicked)

    # Final UI layout
    ui = widgets.HBox([
        widgets.VBox([widgets.HTML("Invoice Image"), image_box]),
        widgets.VBox([
            widgets.HTML("Verify Extracted Data"),
            form,
            widgets.HBox([confirm_button], layout=Layout(justify_content="flex-end")),
            output_area
        ], layout=Layout(flex='1'))
    ])

    print("--- Human-in-the-Loop (HITL) Verification ---")
    print("Review the data on the right. Make any corrections and click 'Confirm and Save'.")
    display(ui)

Output:

Human-in-the-loop UI

Modify some values and then save.

Output:

HITL UI with some values

You can access the full code here: GitHub, Colab

Conclusion

This POC demonstrates that the core logic behind a sophisticated system like Uber’s “TextSense” is replicable. By combining open-source OCR with a powerful LLM like Google’s Gemini, you can build an effective system for GenAI invoice processing. This approach to intelligent document processing dramatically reduces manual effort and improves accuracy, and the simple human-in-the-loop AI interface ensures that the final data is trustworthy.

Feel free to expand on this foundation by adding more fields, improving validation, and integrating it into larger workflows.
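As one example of a larger workflow, the same steps can be wrapped in a loop over a folder of PDFs and written to a single CSV. The sketch below assumes you have first factored Steps 3–5 into helper functions; pdf_to_images, ocr_images, and extract_fields are hypothetical names, not defined in this article.

import glob
import pandas as pd

rows = []
for pdf_path in glob.glob("/content/invoices/*.pdf"):
    # pdf_to_images, ocr_images and extract_fields are hypothetical wrappers
    # around the notebook cells from Steps 3-5 above.
    images = pdf_to_images(pdf_path)
    text = ocr_images(images)
    fields = extract_fields(text)
    fields["Source File"] = pdf_path
    rows.append(fields)

pd.DataFrame(rows).to_csv("all_invoices.csv", index=False)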

Frequently Asked Questions

Q1. How accurate is the AI data extraction?

A. Accuracy is very high, especially with clean invoices. The Gemini model is excellent at understanding context, but quality can drop if the OCR text is poor due to a low-quality scan.

Q2. Can this system handle different invoice layouts?

A. Yes. Unlike template-based systems, the LLM understands language and context, which lets it find fields like “Invoice Number” or “Total” regardless of their position on the page.

Q3. What is the cost of running this POC?

A. The cost is minimal. Tesseract and the other libraries are free; you only pay for your usage of the Google Gemini API, which is very affordable for this kind of task.

Q4. Can I extract other fields from the invoice?

A. Absolutely. Simply add the new field names to the fields_to_extract string in Step 5, and the Gemini model will attempt to find them for you.
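For example, to also capture a purchase order number and currency (field names here are just examples), the only change needed in Step 5 is:

fields_to_extract = (
    "Invoice Number, Invoice Date, Due Date, Supplier Name, Supplier Address, "
    "Customer Name, Customer Address, Total Amount, Tax Amount, "
    "Purchase Order Number, Currency"
)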

Q5. How can I improve the OCR quality?

A. Ensure your source PDFs are high-resolution. In the code, we set dpi=300 when converting the PDF to an image, which is a good standard for OCR. A higher DPI can sometimes yield better results for blurry documents.
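If a scan is still noisy at 300 DPI, simple image clean-up before calling Tesseract often helps. Below is a minimal sketch using Pillow: grayscale conversion plus a fixed binarization threshold (the threshold value is an assumption you may need to tune per document).

from PIL import Image

def preprocess_for_ocr(img: Image.Image, threshold: int = 180) -> Image.Image:
    # Convert to grayscale, then binarize: pixels brighter than the threshold become white.
    gray = img.convert("L")
    return gray.point(lambda p: 255 if p > threshold else 0)

# Usage: text = pytesseract.image_to_string(preprocess_for_ocr(invoice_images[0]))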

Harsh Mishra is an AI/ML Engineer who spends more time talking to Large Language Models than actual humans. Passionate about GenAI, NLP, and making machines smarter (so they don’t replace him just yet). When not optimizing models, he’s probably optimizing his coffee consumption. 🚀☕
