Google AI Unveils 601 Real-World Generative AI Use Cases Across Industries

Google Cloud has just released an extraordinary compendium of 601 real-world generative AI (GenAI) use cases from some of the world’s top organizations — a major leap from the 101 use cases it shared just a year ago at Google Cloud Next 2024. This sixfold expansion showcases the explosive pace at which GenAI technologies are moving from prototypes to production, powering transformations across virtually every sector.

Announced during Google Cloud Next 2025, the comprehensive list covers companies ranging from Uber, Samsung, and Citi to Mercedes-Benz, Deutsche Bank, and Alaska Airlines. The breadth of applications highlights GenAI’s growing importance as an operational, creative, and strategic lever across automotive, finance, healthcare, manufacturing, media, retail, and public sector industries.

The Structure: Agents, Industries, and Applications

Google structured the showcase across 11 major industry groups and six AI agent types:

Customer Agents: Enhance user experiences via chatbots, predictive services, and personalization

Employee Agents: Boost internal productivity through content generation, summarization, and knowledge discovery

Creative Agents: Accelerate campaign design, media production, and product innovation

Code Agents: Streamline software engineering and IT workflows

Data Agents: Leverage data for analysis, optimization, and decision support

Security Agents: Fortify organizations with AI-driven threat detection and fraud prevention.

This agent-based taxonomy makes it clear: AI is no longer a separate tool — it’s becoming embedded into the organizational fabric.

Industry Snapshots: Real-World Impact

Automotive & Logistics

The automotive industry is rapidly adopting conversational and predictive AI. Volkswagen of America built a multimodal virtual assistant inside the myVW app using Google’s Gemini models, letting users point their phones at dashboard indicators for instant explanations. Mercedes-Benz launched an automotive AI agent offering natural language navigation and e-commerce sales capabilities directly within its vehicles.

Even logistics giants are innovating: UPS is constructing a digital twin of its global package network for real-time package tracking and optimization.

Financial Services

Banks and fintech companies are particularly aggressive in AI adoption. Citi is using Vertex AI to power developer toolkits and document digitization. Deutsche Bank’s “DB Lumina” research tool, powered by Gemini, slashes research report creation times from hours to minutes.

Meanwhile, Discover Financial Services deployed AI assistants that aid both customers and contact center representatives, significantly improving service efficiency.

Healthcare & Life Sciences

In healthcare, the impact of AI extends from diagnostics to operational efficiency. Freenome is building early-detection cancer tests combining AI and blood samples. Mayo Clinic unlocked 50 petabytes of clinical data with Vertex AI Search, accelerating research access.

Apollo Hospitals in India scaled tuberculosis and breast cancer screening to 3 million people by applying AI to radiology workflows.

Manufacturing & Electronics

Manufacturers like Samsung are embedding Google’s Gemini AI directly into their devices — the Galaxy S24 now offers AI-driven text summarization and image editing features. Trimble and Honeywell have incorporated Gemini for Workspace to enhance engineering productivity and document automation.

Media, Retail, and Hospitality

AI is dramatically altering customer engagement. Papa John’s, Wendy’s, and Uber are using AI-powered predictive ordering systems. Radisson Hotel Group reported a 50% gain in marketing productivity and over 20% revenue lift by personalizing ads with Vertex AI.

Even creative industries are leveraging AI: Adobe has integrated Imagen 3 and Veo 2 into Adobe Express, dramatically accelerating campaign creation.

Technology Highlights: Google’s Evolving Stack

Many of these applications were made possible through core Google Cloud AI technologies, notably:

Vertex AI: Model training, deployment, RAG (retrieval-augmented generation) pipelines

Gemini Models: Multimodal LLMs powering text, code, vision, and conversational capabilities

Imagen & Veo: High-fidelity generative image and video models

BigQuery ML: Data warehousing with embedded machine learning

Security AI: AI-first threat detection with Google SecOps.

An emerging trend is the heavy use of enterprise-tuned AI agents, such as Gemini Code Assist for developer productivity or Gemini in Security for threat intelligence.

Emerging Patterns Across Use Cases

Several clear trends emerge from Google’s compilation:

Generative AI is moving from experiments to mission-critical systems: Whether automating underwriting in finance, driving drug discovery, or powering multimodal search in automotive apps, GenAI is now operational at scale.

Hybrid Multimodal Models are increasingly vital: Many solutions integrate text, vision, and structured data — not just plain language models.

Verticalized AI Agents are accelerating: Google’s partners aren’t just fine-tuning LLMs — they’re building domain-specific, industry-tuned AI agents tightly integrated into their workflows.

Democratization of AI: Solutions like Vertex AI’s search and data agents are putting sophisticated AI tools into the hands of business users, scientists, and even drivers — not just engineers.

Final Thoughts

The 601 use cases shared by Google paint an exhilarating picture: AI transformation is no longer theoretical — it is happening today, at massive scale, in nearly every sector.

Google’s strategy of aligning its AI offerings with real-world operational needs — from customer engagement and logistics to employee productivity and cybersecurity — is accelerating this adoption curve.

As Google’s President of Global Revenue, Matt Renner, said in the announcement, “This is just scratching the surface of what’s becoming possible with AI across the enterprise”.

If these use cases are any indication, the next year promises even more staggering innovation.

Check out the Report.


This AI Paper from China Proposes a Novel Training-Free Approach DEER that Allows Large Reasoning Language Models to Achieve Dynamic Early Exit in Reasoning

Recent progress in large reasoning language models (LRLMs), such as DeepSeek-R1 and GPT-O1, has greatly improved complex problem-solving abilities by extending the length of chain-of-thought (CoT) generation during inference. These models benefit from test-time scaling laws, allowing richer and more diverse reasoning paths. However, generating overly long CoT sequences leads to computational inefficiency and increased latency, making real-world deployment challenging. Moreover, excessive reasoning often introduces redundant or irrelevant steps, which can cause models to deviate from correct answers, ultimately reducing accuracy. This overthinking problem stems from traditional supervised fine-tuning and reinforcement learning approaches that do not prioritize dynamic control over reasoning length. Research has shown that in many cases, reasoning could be halted earlier, at what the authors call “pearl reasoning” points, without sacrificing correctness. Identifying and stopping at these critical points could significantly improve efficiency while maintaining model performance.

Existing approaches to improve inference efficiency generally fall into three categories: post-training, prompt-based, and output-based methods. Post-training techniques involve retraining models with variable-length CoT examples or length rewards, but they are often computationally intensive and risk overfitting. Prompt-based methods adjust CoT length by modifying the input prompts based on task difficulty, achieving more concise reasoning without sacrificing much accuracy. Output-based methods typically focus on sampling techniques, such as early stopping when multiple outputs converge on the same answer. However, with newer models like R1, reliance on best-of-N sampling has decreased. Recent works have explored early exiting strategies, but they often require separate verification models or are only effective in limited settings. In contrast, the discussed approach aims to empower models to recognize optimal stopping points during their reasoning process, providing a more seamless and generalizable solution.

Researchers from the Institute of Information Engineering, the University of Chinese Academy of Sciences, and Huawei Technologies have proposed DEER, a simple, training-free method to enable LRLMs to dynamically exit early during reasoning. DEER monitors key transition points, such as the generation of “Wait” tokens, and prompts the model to produce trial answers at these moments. If the model shows high confidence, reasoning is halted; otherwise, it continues. This approach integrates seamlessly with existing models, such as DeepSeek, and reduces CoT length by 31–43%, while improving accuracy by 1.7–5.7% across benchmarks including MATH-500, AIME 2024, and GPQA Diamond.

The DEER (Dynamic Early Exit in Reasoning) method enables large reasoning language models to exit reasoning early by evaluating their confidence in trial answers at key transition points. It uses three modules: a reasoning transition monitor to detect “thought switch” signals, an answer inducer to prompt a trial conclusion, and a confidence evaluator to assess if the reasoning is sufficient. If confidence exceeds a threshold, reasoning stops; otherwise, it continues. To reduce latency from trial answer generation, DEER also employs branch-parallel decoding with dynamic cache management, thereby improving efficiency without sacrificing accuracy, particularly for tasks such as code generation.
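
To make the control flow concrete, the following is a minimal sketch of a confidence-gated early-exit loop in the spirit of DEER. The helper functions (generate_until_marker, induce_answer, answer_token_logprobs), the mean-token-probability confidence measure, and the fixed threshold are illustrative assumptions rather than the paper’s exact implementation, which additionally uses branch-parallel decoding and dynamic cache management.

import math

# Hypothetical helpers (not part of the paper's released code):
# - generate_until_marker() streams reasoning up to the next "thought switch" token (e.g., "Wait")
# - induce_answer() prompts the model to produce a trial conclusion
# - answer_token_logprobs() returns log-probabilities for the trial answer's tokens
def deer_generate(model, prompt, conf_threshold=0.95, max_segments=32):
    reasoning = ""
    for _ in range(max_segments):
        segment, hit_switch = generate_until_marker(model, prompt + reasoning)
        reasoning += segment
        if not hit_switch:  # the model ended its reasoning on its own
            break
        trial = induce_answer(model, prompt + reasoning)                    # answer inducer
        logprobs = answer_token_logprobs(model, prompt + reasoning, trial)  # confidence evaluator
        confidence = math.exp(sum(logprobs) / max(len(logprobs), 1))        # mean token probability (assumed measure)
        if confidence >= conf_threshold:
            return trial  # exit early at a "pearl reasoning" point
        # otherwise keep reasoning past the "Wait" token
    return induce_answer(model, prompt + reasoning)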

The experiments evaluated models on four major reasoning benchmarks: MATH-500, AMC 2023, AIME 2024, and GPQA Diamond, as well as programming benchmarks HumanEval and BigCodeBench. Tests were conducted using DeepSeek-R1-Distill-Qwen models of varying sizes (1.5B to 32B parameters) under a Zero-shot Chain-of-Thought setup. DEER significantly improved performance by reducing reasoning length by 31–43% while increasing accuracy by 1.7–5.7% compared to standard CoT. A detailed analysis revealed that DEER corrected more responses through early exits, particularly for smaller models and simpler tasks. On programming benchmarks, DEER also reduced reasoning length by over 60% with minimal or no loss in accuracy, demonstrating its robustness across various tasks.

In conclusion, the study validates the idea of using early exits during CoT generation through pilot studies. Based on these findings, it introduces a training-free dynamic early exit method that enables models to stop reasoning once enough information is gathered. Tested across various model sizes and six benchmarks spanning reasoning and programming, the method achieves better accuracy with fewer tokens, effectively balancing efficiency and performance. Unlike traditional approaches that rely on long CoT for complex tasks, this method dynamically monitors model confidence to determine when to stop reasoning, thereby avoiding unnecessary steps. Experiments show significant reductions in reasoning length while boosting overall accuracy.

Check out the Paper.


Skywork AI Advances Multimodal Reasoning: Introducing Skywork R1V2 with Hybrid Reinforcement Learning

Recent advancements in multimodal AI have highlighted a persistent challenge: achieving strong specialized reasoning capabilities while preserving generalization across diverse tasks. “Slow-thinking” models such as OpenAI-o1 and Gemini-Thinking have made strides in deliberate analytical reasoning but often exhibit compromised performance on general visual understanding tasks, with increased tendencies toward visual hallucinations. As the field progresses toward building general-purpose AI systems, reconciling this tradeoff remains a critical research problem.

Skywork AI Introduces Skywork R1V2

Skywork AI has released Skywork R1V2, a next-generation multimodal reasoning model designed to address the reasoning-generalization tradeoff systematically. Building upon the foundation of Skywork R1V, R1V2 introduces a hybrid reinforcement learning framework, combining reward-model guidance with structured rule-based signals. The model bypasses the conventional reliance on teacher-student distillation by learning directly from multimodal interactions, offering an open and reproducible advancement through its release on Hugging Face.

Technical Approach and Innovations

Skywork R1V2 incorporates Group Relative Policy Optimization (GRPO) alongside a Selective Sample Buffer (SSB) to enhance training stability and efficiency. GRPO enables relative evaluation among candidate responses within the same query group, but convergence issues can diminish effective learning signals. The SSB mechanism addresses this by maintaining a cache of informative samples, ensuring continuous access to high-value gradients.
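
For intuition, a rough sketch of group-relative advantages combined with a buffer that retains only informative sample groups might look as follows. The reward shaping, spread test, and buffer policy here are assumptions for illustration, not Skywork’s implementation.

from collections import deque

def group_relative_advantages(rewards):
    # GRPO-style advantage: score each candidate relative to its own query group
    mean_r = sum(rewards) / len(rewards)
    std_r = (sum((r - mean_r) ** 2 for r in rewards) / len(rewards)) ** 0.5 or 1.0
    return [(r - mean_r) / std_r for r in rewards]

class SelectiveSampleBuffer:
    """Keep sample groups whose rewards are non-degenerate so later batches
    still carry useful gradient signal (illustrative policy)."""

    def __init__(self, capacity=1024, min_spread=1e-3):
        self.buffer = deque(maxlen=capacity)
        self.min_spread = min_spread

    def maybe_add(self, group):
        # group: list of dicts, each with a "reward" key (assumed structure)
        rewards = [sample["reward"] for sample in group]
        if max(rewards) - min(rewards) > self.min_spread:  # group is informative
            self.buffer.extend(group)

    def sample(self, k):
        return list(self.buffer)[-k:]  # simplistic: reuse the most recent k samples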

Additionally, the model adopts a Mixed Preference Optimization (MPO) strategy, integrating reward-model-based preferences with rule-based constraints. This hybrid optimization allows Skywork R1V2 to strengthen step-by-step reasoning quality while maintaining consistency in general perception tasks. A modular training approach, utilizing lightweight adapters between a frozen InternViT-6B vision encoder and a pretrained language model, preserves the language model’s reasoning capabilities while optimizing cross-modal alignment efficiently.

Empirical Results and Analysis

Skywork R1V2 demonstrates robust performance across a range of reasoning and multimodal benchmarks. On text reasoning tasks, the model achieves 78.9% on AIME2024, 63.6% on LiveCodeBench, 73.2% on LiveBench, 82.9% on IFEVAL, and 66.3% on BFCL. These results represent significant improvements over Skywork R1V1 and are competitive with substantially larger models, such as Deepseek R1 (671B parameters).

In multimodal evaluation, R1V2 achieves 73.6% on MMMU, 74.0% on MathVista, 62.6% on OlympiadBench, 49.0% on MathVision, and 52.0% on MMMU-Pro. The model consistently outperforms open-source baselines of comparable or larger size, including Qwen2.5-VL-72B and QvQ-Preview-72B, particularly excelling in tasks that require structured problem-solving across visual and textual inputs.

When compared against proprietary models, R1V2 demonstrates narrowing performance gaps. It surpasses Claude 3.5 Sonnet and Gemini 2 Flash on critical multimodal benchmarks such as MMMU and MathVista. Importantly, hallucination rates were substantially reduced to 8.7% through calibrated reinforcement strategies, maintaining factual integrity alongside complex reasoning.

Qualitative assessments further illustrate R1V2’s systematic problem-solving approach, with the model demonstrating methodical decomposition and verification behaviors in complex scientific and mathematical tasks, reinforcing its alignment with reflective cognitive patterns.

Conclusion

Skywork R1V2 advances the state of multimodal reasoning through a carefully designed hybrid reinforcement learning framework. By addressing the vanishing advantages problem with the Selective Sample Buffer and balancing optimization signals through Mixed Preference Optimization, the model achieves notable improvements in both specialized reasoning tasks and general multimodal understanding.

With benchmark-leading performances such as 62.6% on OlympiadBench and 73.6% on MMMU, Skywork R1V2 establishes a strong open-source baseline. Its design principles and training methodology offer a pragmatic approach toward developing robust, efficient multimodal AI systems. Future directions for Skywork AI include enhancing general visual understanding capabilities while preserving the sophisticated reasoning foundations laid by R1V2.

Check out the Paper and Model on HuggingFace.


From GenAI Demos to Production: Why Structured Workflows Are Essential

At technology conferences worldwide and on social media, generative AI applications demonstrate impressive capabilities: composing marketing emails, creating data visualizations, or writing functioning code. Yet behind these polished demonstrations lies a stark reality. What works in controlled environments often fails when confronted with the demands of production systems.

Industry surveys reveal the scale of this challenge: 68% of organizations have moved 30% or fewer of their generative AI experiments into production, while only 53% of AI projects overall progress from prototype to production – with a mere 10% achieving measurable ROI (Wallaroo). Why does this gap persist? The controlled environment of a demonstration bears little resemblance to the unpredictable demands of real-world deployment.

Most current GenAI applications rely on what some have called ‘vibes-based’ assessments rather than rigorous validation. A developer reviews the output, determines it looks reasonable, and the system advances to the next stage of development. While this approach might sometimes identify obvious flaws, it fails to detect subtle inconsistencies that emerge only at scale or with edge-case inputs.

These reliability concerns become critical when AI systems influence business decisions with tangible consequences. 70% of organizations estimate needing at least 12 months to resolve challenges in achieving expected ROI from GenAI, highlighting the high stakes of production failures. Each misstep carries measurable costs: an incorrect product recommendation affects not just immediate sales but customer retention; an inaccurate financial summary might lead to misallocation of resources; a flawed legal interpretation could create significant liability exposure.

The transition from promising demonstrations to dependable production systems requires more than incremental improvements. It demands a fundamental shift in how we architect and evaluate GenAI applications. Structured workflows and systematic evaluation offer a methodical path forward—one that transforms unpredictable prototypes into systems worthy of trust with consequential decisions.

The Limitations of Monolithic GenAI Applications

Most first-generation GenAI applications employ a deceptively simple architecture: user input enters the system, a language model processes it with some contextual information, and the system produces a response. This end-to-end approach, while straightforward to implement, introduces significant limitations when deployed beyond controlled environments.

The most pressing challenge involves identifying the source of errors. When a monolithic system produces incorrect, biased, or nonsensical output, determining the cause becomes an exercise in speculation. Did the retrieval mechanism provide irrelevant context? Was the prompt construction flawed? Does the base model lack necessary capabilities? Without visibility into these components, improvement efforts resemble guesswork rather than engineering. Choco, a food distribution platform, discovered this when their single “catch-all” prompt worked in a hackathon but proved “not scalable or maintainable” in production.

Language models introduce another complication through their probabilistic nature. Even with identical inputs, these models may generate different outputs across successive executions. This variability creates a fundamental tension: creative applications benefit from diverse outputs, but business processes require consistency. The legal field saw an infamous example when an attorney unknowingly submitted hallucinated court cases from ChatGPT, leading to sanctions. The lack of internal measurement points further hampers improvement efforts. Without defined evaluation boundaries, teams struggle to isolate performance issues or quantify improvements.

Many current frameworks exacerbate these problems through premature abstraction. They encapsulate functionality behind interfaces that obscure necessary details, creating convenience at the expense of visibility and control. A team at Prosus found that off-the-shelf agent frameworks were fine for prototyping but too inflexible for production at scale.

These limitations become most apparent as organizations scale from prototype to production. Approaches that function adequately in limited tests falter when confronted with the volume, variety, and velocity of real-world data. Production deployment requires architectures that support not just initial development but ongoing operation, monitoring, and improvement—needs that monolithic systems struggle to satisfy. Successful teams have responded by breaking monolithic designs into modular pipelines, taming randomness with deterministic components, building comprehensive evaluation infrastructure, and favoring transparent architectures over premature abstractions.

Component-Driven GenAI: Breaking Down the Black Box

The transition to component-driven architecture represents more than a technical preference—it applies fundamental software engineering principles to generative AI development. By decomposing monolithic systems into discrete functional units, this approach transforms opaque black boxes into transparent, manageable workflows.

Component-based architecture divides complex systems into units with specific responsibilities, connected through well-defined interfaces. In GenAI applications, these components might include:

Data Retrieval Component: A vector database with embedding search that finds relevant documents or knowledge snippets based on user queries (e.g., Pinecone or Weaviate storing product information).

Prompt Construction Component: A template engine that formats retrieved information and user input into optimized prompts (e.g., a system that assembles query context).

Model Interaction Component: An API wrapper that handles communication with language models, manages retries, and standardizes input/output formats (e.g., a service that routes requests to Azure OpenAI or local Ollama endpoints).

Output Validation Component: A rule-based or LLM-based validator that checks outputs for accuracy, harmful content, or hallucinations (e.g., a fact-checking module that compares generated statements with retrieved knowledge).

Response Processing Component: A formatter that restructures raw model output into application-appropriate formats (e.g., a JSON parser that extracts structured data from text responses).

Each component addresses a specific function, creating natural boundaries for both execution and evaluation.
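
A minimal sketch of how these five components might be wired into a single evaluable pipeline is shown below; the object interfaces are hypothetical and stand in for whatever concrete implementations a team chooses.

def run_genai_workflow(user_query, retriever, prompt_builder, llm_client, validator, formatter):
    """Wire the five components into one evaluable pipeline (illustrative sketch)."""
    chunks = retriever.retrieve(user_query)              # Data Retrieval
    prompt = prompt_builder.build(user_query, chunks)    # Prompt Construction
    raw_output = llm_client.complete(prompt)             # Model Interaction
    issues = validator.check(raw_output, chunks)         # Output Validation
    if issues:
        raise ValueError(f"Output failed validation: {issues}")
    return formatter.parse(raw_output)                   # Response Processing

Because each call site has explicit inputs and outputs, every stage in this pipeline is also a natural place to attach logging and evaluation.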

This decomposition yields several practical advantages that directly address the limitations of monolithic approaches. First, it establishes separation of concerns, allowing developers to focus on specific functionality without addressing the entire system simultaneously. Second, it creates discrete evaluation points where inputs and outputs can be validated against defined criteria. Third, it simplifies reasoning about system behavior by reducing complex interactions to manageable units that can be understood and modified independently.

Leading organizations have demonstrated these benefits in production. Uber’s DragonCrawl, a system for automated mobile app testing, uses LLMs to execute tests with human-like intuition. While not explicitly described as component-driven in Uber’s blog, its architecture effectively separates concerns into functional areas working together:

A representation component that converts app UI screens into text for the model to process

A decision-making component using a fine-tuned MPNet model (110M parameters) that determines what actions to take based on context and goals

An execution component that implements these decisions as interactions with the app

This structured approach achieved “99%+ stability” in November-December 2023 and successfully executed end-to-end trips in 85 out of 89 top cities without any city-specific tweaks. Most importantly, the system required no maintenance—when app changes occurred, DragonCrawl figured out how to navigate new flows on its own, unlike traditional tests that required hundreds of maintenance hours in 2023. The deliberate model selection process (evaluating multiple options against precision metrics) further demonstrates how systematic evaluation leads to reliable production systems.

Well-designed interfaces between components further enhance system maintainability. By establishing explicit contracts for data exchange, these interfaces create natural boundaries for testing and make components interchangeable. For example, a data retrieval component might specify that it accepts natural language queries and returns relevant document chunks with source metadata and relevance scores. This clear contract allows teams to swap between different retrieval implementations (keyword-based, embedding-based, or hybrid) without changing downstream components as long as the interface remains consistent.
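
A sketch of such a contract in Python, using a structural Protocol with assumed field names, could look like this; an embedding-based retriever could then replace the keyword implementation without touching downstream components, as long as it honors the same signature.

from dataclasses import dataclass
from typing import List, Protocol

@dataclass
class RetrievedChunk:
    text: str
    source: str       # document or URL the chunk came from
    relevance: float  # score in [0, 1]

class Retriever(Protocol):
    def retrieve(self, query: str, top_k: int = 5) -> List[RetrievedChunk]:
        """Accepts a natural language query, returns ranked chunks with metadata."""
        ...

class KeywordRetriever:
    """Toy keyword-matching implementation that satisfies the Retriever contract."""

    def __init__(self, documents: dict):
        self.documents = documents  # {source: text}

    def retrieve(self, query: str, top_k: int = 5) -> List[RetrievedChunk]:
        terms = query.lower().split()
        scored = []
        for source, text in self.documents.items():
            hits = sum(text.lower().count(t) for t in terms)
            if hits:
                scored.append(RetrievedChunk(text=text, source=source,
                                             relevance=min(1.0, hits / 10)))
        return sorted(scored, key=lambda c: c.relevance, reverse=True)[:top_k]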

The Component-Evaluation Pair: A Fundamental Pattern

At the heart of reliable GenAI systems lies a simple but powerful pattern: each component should have a corresponding evaluation mechanism that verifies its behavior. This component-evaluation pair creates a foundation for both initial validation and ongoing quality assurance.

This approach parallels unit testing in software engineering but extends beyond simple pass/fail validation. Component evaluations should verify basic functionality, identify performance boundaries, detect drift from expected behavior, and provide diagnostic information when issues arise. These evaluations serve as both quality gates during development and monitoring tools during operation.

Real-world implementations demonstrate this pattern’s effectiveness. Aimpoint Digital built a travel itinerary generator with separate evaluations for its retrieval component (measuring relevance of fetched results) and generation component (using an LLM-as-judge to grade output quality). This allowed them to quickly identify whether issues stemmed from poor information retrieval or flawed generation.

Payment processing company Stripe implemented a component-evaluation pair for their customer support AI by tracking “match rate” – how often the LLM’s suggested responses aligned with human agent final answers. This simple metric served as both quality gate and production monitor for their generation component.
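
A metric of this kind takes only a few lines to compute; the normalization and exact-match rule below are a simplification, since Stripe has not published the precise comparison it uses.

def normalize(text: str) -> str:
    return " ".join(text.lower().split())

def match_rate(llm_suggestions, agent_final_answers) -> float:
    """Fraction of cases where the LLM's suggestion matched the human agent's final answer."""
    assert len(llm_suggestions) == len(agent_final_answers)
    matches = sum(
        normalize(s) == normalize(a)
        for s, a in zip(llm_suggestions, agent_final_answers)
    )
    return matches / len(llm_suggestions)

# Used as a quality gate: block deployment if the metric regresses below a threshold.
if __name__ == "__main__":
    suggestions = ["Please update your card on file.", "Refund issued within 5 days."]
    finals = ["Please update your card on file.", "Your refund will arrive in 5-7 days."]
    print(f"Match rate: {match_rate(suggestions, finals):.2f}")  # 0.50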

The one-to-one relationship between components and evaluations enables targeted improvement when issues emerge. Rather than making broad changes to address vague performance concerns, teams can identify specific components that require attention. This precision reduces both development effort and the risk of unintended consequences from system-wide modifications.

The metrics from component evaluations form a comprehensive dashboard of system health. Engineers can monitor these indicators to identify performance degradation before it affects end users—a significant advantage over systems where problems become apparent only after they impact customers. This proactive approach supports maintenance activities and helps prevent production incidents.

When implemented systematically, component evaluations build confidence in system composition. If each component demonstrates acceptable performance against defined metrics, engineers can combine them with greater assurance that the resulting system will behave as expected. This compositional reliability becomes particularly important as systems grow in complexity.

Eval-First Development: Starting With Measurement

Conventional development processes often treat evaluation as an afterthought—something to be addressed after implementation is complete. Eval-first development inverts this sequence, establishing evaluation criteria before building components. This approach ensures that success metrics guide development from the outset rather than being retrofitted to match existing behavior.

The eval-first methodology creates a multi-tiered framework that operates at increasing levels of abstraction:

At the component level, evaluations function like unit tests in software development. These assessments verify that individual functional units perform their specific tasks correctly under various conditions. A retrieval component might be evaluated on the relevance of returned information across different query types, while a summarization component could be assessed on factual consistency between source text and generated summaries. These targeted evaluations provide immediate feedback during development and ongoing monitoring in production.

Step-level evaluations examine how components interact in sequence, similar to integration testing in software development. These assessments verify that outputs from one component serve as appropriate inputs for subsequent components and that the combined functionality meets intermediate requirements. For example, step-level evaluation might confirm that a classification component correctly routes queries to appropriate retrieval components, which then provide relevant context to a generation component.

Workflow-level evaluations assess whether the entire pipeline satisfies business requirements. These system-level tests validate end-to-end performance against defined success criteria. For a customer support system, workflow evaluation might measure resolution rate, customer satisfaction, escalation frequency, and handling time. These metrics connect technical implementation to business outcomes, providing a framework for prioritizing improvements.

This layered approach offers significant advantages over end-to-end evaluation alone. First, it provides a comprehensive view of system performance, identifying issues at multiple levels of granularity. Second, it establishes traceability between business metrics and component behavior, connecting technical performance to business outcomes. Third, it supports incremental improvement by highlighting specific areas that require attention.

Organizations that implement eval-first development often discover requirements and constraints earlier in the development process. By defining how components will be evaluated before implementation begins, teams identify potential issues when they’re least expensive to address. This proactive approach reduces both development costs and time-to-market for reliable systems.

Implementing Component-Based GenAI Workflows

Practical implementation of component-based GenAI workflows requires methodical decomposition of applications into steps that can be evaluated. This process begins with identifying core functions, then establishing clear responsibilities and interfaces for each component.

Effective breakdown balances granularity with practicality. Each component should have a single responsibility without creating excessive interaction overhead. Uber’s GenAI Gateway demonstrates this through a unified service layer handling 60+ LLM use cases. By mirroring OpenAI’s API interface, they created standardized endpoints that separate integration logic from application business logic.

Well-designed interfaces specify both data formats and semantic requirements. Microsoft’s Azure Copilot uses RESTful APIs between components like its Knowledge Service (document chunking) and LLM processors. This enables independent development while ensuring components exchange properly structured, semantically valid data.

Components and evaluations should be versioned together for traceable evolution. Uber’s approach allows centralized model upgrades – adding GPT-4V required only gateway adjustments rather than client changes. This containment of version impacts prevents system-wide disruptions.

Agentic components require constrained decision boundaries. Microsoft implements extensible plugins where each Azure service team builds domain-specific “chat handlers.” These predefined operations maintain control while enabling specialized functionality.

Sophisticated fallback mechanisms become possible with component isolation. Uber’s gateway implements automated model fallbacks, switching to internal models when external providers fail. This graceful degradation maintains service continuity without compromising entire workflows.
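
In component terms, a fallback can be a thin wrapper that tries providers in order; the client objects and retry policy below are illustrative, not Uber’s actual gateway logic.

class ModelWithFallback:
    """Try each model client in order; degrade gracefully instead of failing the whole workflow."""

    def __init__(self, clients):
        # e.g., [external_provider_client, internal_model_client] (hypothetical clients)
        self.clients = clients

    def complete(self, prompt):
        last_error = None
        for client in self.clients:
            try:
                return client.complete(prompt)
            except Exception as exc:  # timeouts, rate limits, provider outages
                last_error = exc
        raise RuntimeError("All model providers failed") from last_error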

Microsoft’s golden dataset approach provides versioned benchmarking against 500+ validated question/answer pairs. Component updates are tested against this dataset before deployment, creating a closed feedback loop between evaluation and improvement.

Key challenges persist:

Initial Investment – Designing interfaces and evaluation frameworks requires upfront resources

Skill Gaps – Teams need both software engineering and AI expertise

Coordination Overhead – Inter-component communication adds complexity

Organizations must balance these against the benefits of maintainability and incremental improvement. As demonstrated by Uber’s gateway – now handling authentication, PII redaction, and monitoring across all LLM interactions – centralized components with clear contracts enable scalability while maintaining governance.

Practical Considerations

Implementing component-based GenAI workflows involves several practical considerations that influence their effectiveness in production environments.

Parcha discovered users preferred reliable “agent-on-rails” designs over fully autonomous systems after their initial agent approach proved too unpredictable. RealChar implemented a deterministic event-driven pipeline for AI phone calls, achieving low latency through fixed processing cycles rather than free-form agent architectures.

The organizational implications of component-based architecture extend beyond technical considerations. PagerDuty formed a centralized LLM service team that enabled four new AI features in two months by standardizing infrastructure across product teams. This mirrors how companies established dedicated data platform teams during earlier tech waves.

Organizations with established machine learning infrastructure have a significant advantage when implementing component-based GenAI systems. Many foundational MLOps capabilities transfer directly to LLMOps with minimal adaptation. For example, existing model registry systems can be extended to track LLM versions and their performance metrics. Data pipeline orchestrators that manage traditional ML workflows can be repurposed to coordinate GenAI component execution. Monitoring systems already watching for ML model drift can be adapted to detect LLM performance degradation.

Leading organizations have found that reusing these battle-tested MLOps components accelerates GenAI adoption while maintaining consistent governance and operational standards. Rather than building parallel infrastructure, enterprise companies have extended their ML platforms to accommodate the unique needs of LLMs, preserving the investment in tooling while adapting to new requirements.

Resource allocation represents another practical consideration. Component-based architectures require investment in infrastructure for component orchestration, interface management, and comprehensive evaluation. These investments compete with feature development and other organizational priorities. Successful implementation requires executive support based on understanding the long-term benefits of maintainable, evaluatable systems over short-term feature delivery.

Building for the Future

Component-based, evaluated workflows provide a foundation for sustainable GenAI development that extends beyond current capabilities. This approach positions organizations to incorporate emerging technologies without wholesale system replacement.

The field of generative AI continues to evolve rapidly, with new model architectures, specialized models, and improved techniques emerging regularly. Component-based systems can integrate these advances incrementally, replacing individual components as better alternatives become available. This adaptability provides significant advantage in a rapidly evolving field, allowing organizations to benefit from technological progress without disruptive rebuilding.

The reliability advantage of evaluated components becomes increasingly important as GenAI applications address critical business functions. Organizations that implement systematic evaluation establish quantitative evidence of system performance, supporting both internal confidence and external trust. This evidence-based approach helps organizations navigate regulatory requirements, customer expectations, and internal governance. As regulatory scrutiny of AI systems increases, the ability to demonstrate systematic evaluation and quality assurance will become a competitive differentiator.

Component evaluation enables continuous, data-driven improvement by providing detailed performance insights. Rather than relying on broad assessments or anecdotal feedback, teams can analyze component-level metrics to identify specific improvement opportunities. This targeted approach supports efficient resource allocation, directing effort toward areas with measurable impact.

Organizations should assess their current GenAI implementations through the lens of componentization and systematic evaluation. This assessment might examine several questions: Are system responsibilities clearly divided into evaluable components? Do explicit interfaces exist between these components? Are evaluation metrics defined at component, step, and workflow levels? Does the architecture support incremental improvement?

The transition from impressive demonstrations to reliable production systems ultimately requires both technical architecture and organizational commitment. Component-based workflows with systematic evaluation provide the technical foundation, while organizational priorities determine whether this foundation supports sustainable development or merely adds complexity. Organizations that commit to this approach—investing in component design, interface definition, and comprehensive evaluation—position themselves to deliver not just impressive demonstrations but dependable systems worthy of trust with consequential decisions.

A Comprehensive Tutorial on the Five Levels of Agentic AI Architecture …

In this tutorial, we explore five levels of Agentic Architectures, from the simplest language model calls to a fully autonomous code-generating system. This tutorial is designed to run seamlessly on Google Colab. Starting with a basic “simple processor” that simply echoes the model’s output, you will progressively build routing logic, integrate external tools, orchestrate multi-step workflows, and ultimately empower the model to plan, validate, refine, and execute its own Python code. Throughout each section, you’ll find detailed explanations, self-contained demo functions, and clear prompts that illustrate how to balance human control and machine autonomy in real-world AI applications.

import os
import torch
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM
import re
import json
import time
import random
from IPython.display import clear_output

We import core Python and third-party libraries: os and time for environment and execution control, and torch together with Hugging Face’s transformers (pipeline, AutoTokenizer, AutoModelForCausalLM) for model loading and inference. We also use re and json to parse LLM outputs, random for seeds and mock data, and clear_output to keep the Colab interface tidy.

MODEL_NAME = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

def get_model_and_tokenizer():
    # Load the model and tokenizer once and cache them on the function object
    if not hasattr(get_model_and_tokenizer, "model"):
        print(f"Loading model {MODEL_NAME}...")
        tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
        model = AutoModelForCausalLM.from_pretrained(
            MODEL_NAME,
            torch_dtype=torch.float16,
            device_map="auto",
            low_cpu_mem_usage=True
        )
        get_model_and_tokenizer.model = model
        get_model_and_tokenizer.tokenizer = tokenizer
        print("Model loaded successfully!")

    return get_model_and_tokenizer.model, get_model_and_tokenizer.tokenizer

Here, we define MODEL_NAME to point at the TinyLlama 1.1B chat model and implement a lazy‐loading helper get_model_and_tokenizer() that downloads and initializes the tokenizer and model only once, caching them on first call to minimize overhead, and then returns the cached instances for all subsequent inference calls.

def generate_text(prompt, max_length=512):
    model, tokenizer = get_model_and_tokenizer()

    # Format the prompt with the model's chat template
    messages = [{"role": "user", "content": prompt}]
    formatted_prompt = tokenizer.apply_chat_template(messages, tokenize=False)

    inputs = tokenizer(formatted_prompt, return_tensors="pt").to(model.device)

    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=max_length,
            do_sample=True,
            temperature=0.7,
            top_p=0.9,
        )

    generated_text = tokenizer.decode(output[0], skip_special_tokens=True)

    # Keep only the assistant's reply
    response = generated_text.split("ASSISTANT: ")[-1].strip()
    return response

The generate_text function wraps the TinyLlama inference workflow: it retrieves the cached model and tokenizer, formats the user prompt into the chat template, tokenizes and moves inputs to the model’s device, then samples a response with temperature and top-p settings. After generation, it decodes the output and extracts just the assistant’s reply by splitting on the “ASSISTANT: ” marker.

Level 1: Simple Processor

At the simplest level, the code defines a straightforward text‐generation pipeline that treats the model purely as a language processor. When the user provides a prompt, the `simple_processor` function invokes the `generate_text` helper, which is built on the TinyLlama 1.1B chat model, to produce a free-form response. It then displays that response directly. Under the hood, `generate_text` ensures the model and tokenizer are loaded just once by caching them inside the `get_model_and_tokenizer` function, formats the prompt for the chat model, runs generation with sampling parameters for diversity, and extracts the assistant’s reply by splitting on the “ASSISTANT:” marker. This level demonstrates the most basic interaction pattern: input is received, output is generated, and program flow remains entirely under human control.

def simple_processor(prompt):
    """Level 1: Simple Processor - Model has no impact on program flow"""
    response = generate_text(prompt)
    return response

def demo_level1():
    print("\n" + "="*50)
    print("LEVEL 1: SIMPLE PROCESSOR DEMO")
    print("="*50)
    print("At this level, the AI has no control over program flow.")
    print("It simply takes input and produces output.\n")

    user_input = input("Enter your question or prompt: ") or "Write a short poem about artificial intelligence."
    print("\nProcessing your request...\n")

    output = simple_processor(user_input)
    print("OUTPUT:")
    print("-"*50)
    print(output)
    print("-"*50)

The simple_processor function embodies the Simple Processor of our agent hierarchy by treating the model purely as a text generator; it accepts a user-provided prompt and delegates to generate_text. It returns whatever the model produces without any branching or decision logic. The accompanying demo_level1 routine provides a minimal interactive loop, printing a clear header, soliciting user input (with a sensible default), invoking simple_processor, and then displaying the raw output, showcasing the most basic prompt-to-response workflow in which the AI exerts no influence over the program’s flow.

Level 2: Router

The second level introduces conditional routing based on the model’s classification of the user’s query. The `router_agent` function first asks the model to classify a query into “technical,” “creative,” or “factual,” then normalizes the model’s response into one of those categories. Depending on which category is detected, the query is dispatched to a specialized handler, either `handle_technical_query`, `handle_creative_query`, or `handle_factual_query`, each of which wraps the user’s query in a system-style prompt tailored to the chosen tone and purpose. This routing mechanism provides the model with partial control over program flow, enabling it to guide the subsequent interaction path while still relying on human-defined handlers to generate the final output.

def router_agent(user_query):
    """Level 2: Router - Model determines basic program flow"""

    category_prompt = f"""Classify the following query into one of these categories:
'technical', 'creative', or 'factual'.

Query: {user_query}

Return ONLY the category name and nothing else."""

    category_response = generate_text(category_prompt)

    # Normalize the model's answer into one of the three categories
    category = category_response.lower()
    if "technical" in category:
        category = "technical"
    elif "creative" in category:
        category = "creative"
    else:
        category = "factual"

    print(f"Query classified as: {category}")

    # Dispatch to the matching handler
    if category == "technical":
        return handle_technical_query(user_query)
    elif category == "creative":
        return handle_creative_query(user_query)
    else:
        return handle_factual_query(user_query)

def handle_technical_query(query):
    system_prompt = f"""You are a technical assistant. Provide detailed technical explanations.

User query: {query}"""

    response = generate_text(system_prompt)
    return f"[Technical Response]\n{response}"

def handle_creative_query(query):
    system_prompt = f"""You are a creative assistant. Be imaginative and inspiring.

User query: {query}"""

    response = generate_text(system_prompt)
    return f"[Creative Response]\n{response}"

def handle_factual_query(query):
    system_prompt = f"""You are a factual assistant. Provide accurate information concisely.

User query: {query}"""

    response = generate_text(system_prompt)
    return f"[Factual Response]\n{response}"

def demo_level2():
    print("\n" + "="*50)
    print("LEVEL 2: ROUTER DEMO")
    print("="*50)
    print("At this level, the AI determines basic program flow.")
    print("It decides which processing path to take.\n")

    user_query = input("Enter your question or prompt: ") or "How do neural networks work?"
    print("\nProcessing your request...\n")

    result = router_agent(user_query)
    print("OUTPUT:")
    print("-"*50)
    print(result)
    print("-"*50)

The router_agent function implements Router behavior by first asking the model to classify the user’s query as “technical,” “creative,” or “factual,” then normalizing that classification and dispatching the query to the corresponding handler (handle_technical_query, handle_creative_query, or handle_factual_query), each of which wraps the original query in an appropriate system‐style prompt before calling generate_text. The demo_level2 routine provides a clear CLI-style interface, printing headers, accepting input (with a default), invoking router_agent, and displaying the categorized response, showcasing how the model can take basic control over program flow by choosing which processing path to follow.

Level 3: Tool Calling

At the third level, the code empowers the model to decide which of several external tools to invoke by embedding a JSON-based function selection protocol into the prompt. The `tool_calling_agent` presents the user’s question alongside a menu of potential tools, including weather lookup, web search simulation, current date and time retrieval, or direct response, and instructs the model to respond with a valid JSON message specifying the chosen tool and its parameters. A regex then extracts the first JSON object from the model’s output, and the code safely falls back to a direct response if parsing fails. Once the tool and arguments are identified, the corresponding Python function is executed, its result is captured, and a final model call integrates that result into a coherent answer. This pattern bridges LLM reasoning with concrete code execution by letting the model orchestrate which APIs or utilities to call.

def tool_calling_agent(user_query):
    """Level 3: Tool Calling - Model determines how functions are executed"""

    tool_selection_prompt = f"""Based on the user query, select the most appropriate tool from the following list:
1. get_weather: Get the current weather for a location
2. search_information: Search for specific information on a topic
3. get_date_time: Get current date and time
4. direct_response: Provide a direct response without using tools

USER QUERY: {user_query}

INSTRUCTIONS:
- Return your response in valid JSON format
- Include the tool name and any required parameters
- For get_weather, include location parameter
- For search_information, include query and depth parameter (basic or detailed)
- For get_date_time, include timezone parameter (optional)
- For direct_response, no parameters needed

Example output format: {{"tool": "get_weather", "parameters": {{"location": "New York"}}}}"""

    tool_selection_response = generate_text(tool_selection_prompt)

    # Extract the first JSON object from the model output; fall back to a direct response on failure
    try:
        json_match = re.search(r'({.*})', tool_selection_response, re.DOTALL)
        if json_match:
            tool_selection = json.loads(json_match.group(1))
        else:
            print("Could not parse tool selection. Defaulting to direct response.")
            tool_selection = {"tool": "direct_response", "parameters": {}}
    except json.JSONDecodeError:
        print("Invalid JSON in tool selection. Defaulting to direct response.")
        tool_selection = {"tool": "direct_response", "parameters": {}}

    tool_name = tool_selection.get("tool", "direct_response")
    parameters = tool_selection.get("parameters", {})

    print(f"Selected tool: {tool_name}")

    # Execute the selected tool with the extracted parameters
    if tool_name == "get_weather":
        location = parameters.get("location", "Unknown")
        tool_result = get_weather(location)
    elif tool_name == "search_information":
        query = parameters.get("query", user_query)
        depth = parameters.get("depth", "basic")
        tool_result = search_information(query, depth)
    elif tool_name == "get_date_time":
        timezone = parameters.get("timezone", "UTC")
        tool_result = get_date_time(timezone)
    else:
        return generate_text(f"Please provide a helpful response to: {user_query}")

    final_prompt = f"""User Query: {user_query}
Tool Used: {tool_name}
Tool Result: {json.dumps(tool_result)}

Based on the user's query and the tool result above, provide a helpful response."""

    final_response = generate_text(final_prompt)
    return final_response

def get_weather(location):
    # Mock weather data derived deterministically from the location string
    weather_conditions = ["Sunny", "Partly cloudy", "Overcast", "Light rain", "Heavy rain", "Thunderstorms", "Snowy", "Foggy"]
    temperatures = {
        "cold": list(range(-10, 10)),
        "mild": list(range(10, 25)),
        "hot": list(range(25, 40))
    }

    location_hash = sum(ord(c) for c in location)
    condition_index = location_hash % len(weather_conditions)
    season = ["winter", "spring", "summer", "fall"][location_hash % 4]

    temp_range = temperatures["cold"] if season in ["winter", "fall"] else temperatures["hot"] if season == "summer" else temperatures["mild"]
    temperature = random.choice(temp_range)

    return {
        "location": location,
        "temperature": f"{temperature}°C",
        "conditions": weather_conditions[condition_index],
        "humidity": f"{random.randint(30, 90)}%"
    }

def search_information(query, depth="basic"):
    # Return mock search results; "detailed" depth adds extra entries
    mock_results = [
        f"First result about {query}",
        f"Second result discussing {query}",
        f"Third result analyzing {query}"
    ]

    if depth == "detailed":
        mock_results.extend([
            f"Fourth detailed analysis of {query}",
            f"Fifth comprehensive overview of {query}",
            f"Sixth academic paper on {query}"
        ])

    return {
        "query": query,
        "results": mock_results,
        "depth": depth,
        "sources": [f"source{i}.com" for i in range(1, len(mock_results) + 1)]
    }

def get_date_time(timezone="UTC"):
    current_time = time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime())
    return {
        "current_datetime": current_time,
        "timezone": timezone
    }

def demo_level3():
    print("\n" + "="*50)
    print("LEVEL 3: TOOL CALLING DEMO")
    print("="*50)
    print("At this level, the AI selects which tools to use and with what parameters.")
    print("It can process the results from tools to create a final response.\n")

    user_query = input("Enter your question or prompt: ") or "What's the weather like in San Francisco?"
    print("\nProcessing your request...\n")

    result = tool_calling_agent(user_query)
    print("OUTPUT:")
    print("-"*50)
    print(result)
    print("-"*50)

In the Level 3 implementation, the tool_calling_agent function prompts the model to choose among a predefined set of utilities, such as weather lookup, mock web search, or date/time retrieval, by returning a JSON object with the selected tool name and its parameters. It then safely parses that JSON, invokes the corresponding Python function to obtain structured data, and makes a follow-up model call to integrate the tool’s output into a coherent, user-facing response.

Level 4: Multi-Step Agent

The fourth level extends the tool-calling pattern into a full multi-step agent that manages its workflow and state. The `MultiStepAgent` class maintains an internal memory of user inputs, tool outputs, and agent actions. Each iteration generates a planning prompt that summarizes the entire memory, asking the model to choose one of several tools, such as web search simulation, information extraction, text summarization, or report creation, or to conclude the task with a final output. After executing the selected tool and appending its results back to memory, the process repeats until either the model issues a “complete” action or the maximum number of steps is reached. Finally, the agent collates the memory into a cohesive final response. This structure shows how an LLM can orchestrate complex, multi-stage processes while consulting external functions and refining its plan based on previous results.

class MultiStepAgent:
    """Level 4: Multi-Step Agent - Model controls iteration and program continuation"""

    def __init__(self):
        self.tools = {
            "search_web": self.search_web,
            "extract_info": self.extract_info,
            "summarize_text": self.summarize_text,
            "create_report": self.create_report
        }
        self.memory = []
        self.max_steps = 5

    def run(self, user_task):
        self.memory.append({"role": "user", "content": user_task})

        steps_taken = 0
        while steps_taken < self.max_steps:
            next_action = self.determine_next_action()

            if next_action["action"] == "complete":
                return next_action["output"]

            tool_name = next_action["tool"]
            tool_args = next_action["args"]

            print(f"\n Step {steps_taken + 1}: Using tool '{tool_name}' with arguments: {tool_args}")

            tool_result = self.tools[tool_name](**tool_args)

            self.memory.append({
                "role": "tool",
                "content": json.dumps(tool_result)
            })

            steps_taken += 1

        return self.generate_final_response("Maximum steps reached. Here's what I've found so far.")

    def determine_next_action(self):
        # Summarize the memory so the model can plan its next step
        context = "Current memory state:\n"
        for item in self.memory:
            if item["role"] == "user":
                context += f"USER INPUT: {item['content']}\n\n"
            elif item["role"] == "tool":
                context += f"TOOL RESULT: {item['content']}\n\n"

        prompt = f"""{context}

Based on the above information, determine the next action to take.
Choose one of the following options:
1. search_web: Search for information (args: query)
2. extract_info: Extract specific information from a text (args: text, target_info)
3. summarize_text: Create a summary of text (args: text)
4. create_report: Create a structured report (args: title, content)
5. complete: Task is complete (include final output)

Respond with a JSON object with the following structure:
For tools: {{"action": "tool", "tool": "tool_name", "args": {{tool-specific arguments}}}}
For completion: {{"action": "complete", "output": "final output text"}}

Only return the JSON object and nothing else."""

        next_action_response = generate_text(prompt)

        try:
            json_match = re.search(r'({.*})', next_action_response, re.DOTALL)
            if json_match:
                next_action = json.loads(json_match.group(1))
            else:
                return {"action": "complete", "output": "I encountered an error in planning. Here's what I know so far: " + self.generate_final_response("Error in planning")}
        except json.JSONDecodeError:
            return {"action": "complete", "output": "I encountered an error in planning. Here's what I know so far: " + self.generate_final_response("Error in planning")}

        self.memory.append({"role": "assistant", "content": next_action_response})
        return next_action

    def generate_final_response(self, prefix=""):
        context = "Task history:\n"
        for item in self.memory:
            if item["role"] == "user":
                context += f"USER INPUT: {item['content']}\n\n"
            elif item["role"] == "tool":
                context += f"TOOL RESULT: {item['content']}\n\n"
            elif item["role"] == "assistant":
                context += f"AGENT ACTION: {item['content']}\n\n"

        prompt = f"""{context}

{prefix} Generate a comprehensive final response that addresses the original user task."""

        final_response = generate_text(prompt)
        return final_response

    def search_web(self, query):
        # Simulated web search with deterministic mock results
        time.sleep(1)

        query_hash = sum(ord(c) for c in query)
        num_results = (query_hash % 3) + 2

        results = []
        for i in range(num_results):
            results.append(f"Result {i+1}: Information about '{query}' related to aspect {chr(97 + i)}.")

        return {
            "query": query,
            "results": results
        }

    def extract_info(self, text, target_info):
        time.sleep(0.5)

        return {
            "extracted_info": f"Extracted information about '{target_info}' from the text: The text indicates that {target_info} is related to several key aspects mentioned in the content.",
            "confidence": round(random.uniform(0.7, 0.95), 2)
        }

    def summarize_text(self, text):
        time.sleep(0.5)

        word_count = len(text.split())

        return {
            "summary": f"Summary of the provided text ({word_count} words): The text discusses key points related to the subject matter, highlighting important aspects and providing context.",
            "original_length": word_count,
            "summary_length": round(word_count * 0.3)
        }

    def create_report(self, title, content):
        time.sleep(0.7)

        report_sections = [
            "## Introduction",
            f"This report provides an overview of {title}.",
            "",
            "## Key Findings",
            content,
            "",
            "## Conclusion",
            f"This analysis of {title} highlights several important aspects that warrant consideration."
        ]

        return {
            "report": "\n".join(report_sections),
            "word_count": len(content.split()),
            "section_count": 3
        }

def demo_level4():
print(“n” + “=”*50)
print(“LEVEL 4: MULTI-STEP AGENT DEMO”)
print(“=”*50)
print(“At this level, the AI manages the entire workflow, deciding which tools”)
print(“to use, when to use them, and determining when the task is complete.n”)

user_task = input(“Enter a research or analysis task: “) or “Research quantum computing recent developments and create a brief report”
print(“nProcessing your request… (this may take a minute)n”)

agent = MultiStepAgent()
result = agent.run(user_task)
print(“nFINAL OUTPUT:”)
print(“-“*50)
print(result)
print(“-“*50)

The MultiStepAgent class maintains an evolving memory of user inputs and tool outputs, then repeatedly prompts the LLM to decide its next action, whether to search the web, extract information, summarize text, create a report, or finish, executing the chosen tool and appending the result until the task is complete or a step limit is reached. In doing so, it showcases a Level 4 agent that orchestrates multi-step workflows by letting the model control iteration and program continuation.

Level 5: Fully Autonomous Agent

At the most advanced level, the `AutonomousAgent` class demonstrates a closed-loop system in which the model not only plans and executes but also generates, validates, refines, and runs new Python code. After the user task is recorded, the agent asks the model to produce a detailed plan, then prompts it to generate self-contained solution code, which is automatically cleaned of markdown formatting. A subsequent validation step queries the model for any syntax or logic issues; if issues are found, the agent asks the model to refine the code. The validated code is then wrapped with sandboxing utilities, such as safe printing, captured output buffers, and result-capture logic, and executed in a restricted local environment. Finally, the agent synthesizes a professional report explaining what was done, how it was accomplished, and the final results. This level exemplifies a truly autonomous AI system that can extend its capabilities through dynamic code creation and execution.

class AutonomousAgent:
    """Level 5: Fully Autonomous Agent - Model creates & executes new code"""

    def __init__(self):
        self.memory = []

    def run(self, user_task):
        self.memory.append({"role": "user", "content": user_task})

        print(" Planning solution approach...")
        planning_message = self.plan_solution(user_task)
        self.memory.append({"role": "assistant", "content": planning_message})

        print(" Generating solution code...")
        generated_code = self.generate_solution_code()
        self.memory.append({"role": "assistant", "content": f"Generated code: ```python\n{generated_code}\n```"})

        print(" Validating code...")
        validation_result = self.validate_code(generated_code)
        if not validation_result["valid"]:
            print(" Code validation found issues - refining...")
            refined_code = self.refine_code(generated_code, validation_result["issues"])
            self.memory.append({"role": "assistant", "content": f"Refined code: ```python\n{refined_code}\n```"})
            generated_code = refined_code
        else:
            print(" Code validation passed")

        try:
            print(" Executing solution...")
            execution_result = self.safe_execute_code(generated_code, user_task)
            self.memory.append({"role": "system", "content": f"Execution result: {execution_result}"})

            # Generate a final report
            print(" Creating final report...")
            final_report = self.create_final_report(execution_result)
            return final_report

        except Exception as e:
            return f"Error executing the solution: {str(e)}\n\nGenerated code was:\n```python\n{generated_code}\n```"

    def plan_solution(self, task):
        prompt = f"""Task: {task}

You are an autonomous problem-solving agent. Create a detailed plan to solve this task.
Include:
1. Breaking down the task into subtasks
2. What algorithms or approaches you'll use
3. What data structures are needed
4. Any external resources or libraries required
5. Expected challenges and how to address them

Provide a step-by-step plan.
"""

        return generate_text(prompt)

    def generate_solution_code(self):
        context = "Task and planning information:\n"
        for item in self.memory:
            if item["role"] == "user":
                context += f"USER TASK: {item['content']}\n\n"
            elif item["role"] == "assistant":
                context += f"PLANNING: {item['content']}\n\n"

        prompt = f"""{context}

Generate clean, efficient Python code that solves this task. Include comments to explain the code.
The code should be self-contained and able to run inside a Python script or notebook.
Only include the Python code itself without any markdown formatting.
"""

        code = generate_text(prompt)

        code = re.sub(r'^```python\n|```$', '', code, flags=re.MULTILINE)

        return code

    def validate_code(self, code):
        prompt = f"""Code to validate:
```python
{code}
```

Examine the code for the following issues:
1. Syntax errors
2. Logic errors
3. Inefficient implementations
4. Security concerns
5. Missing error handling
6. Import statements for unavailable libraries

If the code has any issues, describe them in detail. If the code looks good, state "No issues found."
"""

        validation_response = generate_text(prompt)

        if "no issues" in validation_response.lower() or "code looks good" in validation_response.lower():
            return {"valid": True, "issues": None}
        else:
            return {"valid": False, "issues": validation_response}

    def refine_code(self, original_code, issues):
        prompt = f"""Original code:
```python
{original_code}
```

Issues identified:
{issues}

Please provide a corrected version of the code that addresses these issues.
Only include the Python code itself without any markdown formatting.
"""

        refined_code = generate_text(prompt)

        refined_code = re.sub(r'^```python\n|```$', '', refined_code, flags=re.MULTILINE)

        return refined_code

    def safe_execute_code(self, code, user_task):

        safe_imports = """
# Standard library imports
import math
import random
import re
import time
import json
from datetime import datetime

# Define a function to capture printed output
captured_output = []
original_print = print

def safe_print(*args, **kwargs):
    output = " ".join(str(arg) for arg in args)
    captured_output.append(output)
    original_print(output)

print = safe_print

# Define a result variable to store the final output
result = None

# Function to store the final result
def store_result(value):
    global result
    result = value
    return value
"""

        result_capture = """
# Store the final result if not already done
if 'result' not in locals() or result is None:
    try:
        # Look for variables that might contain the final result
        potential_results = [var for var in locals() if not var.startswith('_') and var not in
                             ['math', 'random', 're', 'time', 'json', 'datetime',
                              'captured_output', 'original_print', 'safe_print',
                              'result', 'store_result']]
        if potential_results:
            # Use the last defined variable as the result
            store_result(locals()[potential_results[-1]])
    except:
        pass
"""

        full_code = safe_imports + "\n# User code starts here\n" + code + "\n\n" + result_capture

        code_lines = code.split('\n')
        first_lines = code_lines[:3]
        print(f"\nExecuting (first 3 lines):\n{first_lines}")

        local_env = {}

        try:
            # Use a single environment dict for globals and locals so that helper
            # functions (safe_print, store_result) and 'global result' resolve correctly.
            exec(full_code, local_env, local_env)

            return {
                "output": local_env.get('captured_output', []),
                "result": local_env.get('result', "No explicit result returned")
            }
        except Exception as e:
            return {"error": str(e)}

    def create_final_report(self, execution_result):
        if isinstance(execution_result.get('output'), list):
            output_text = "\n".join(execution_result.get('output', []))
        else:
            output_text = str(execution_result.get('output', ''))

        result_text = str(execution_result.get('result', ''))
        error_text = execution_result.get('error', '')

        context = "Task history:\n"
        for item in self.memory:
            if item["role"] == "user":
                context += f"USER TASK: {item['content']}\n\n"

        prompt = f"""{context}

EXECUTION OUTPUT:
{output_text}

EXECUTION RESULT:
{result_text}

{f"ERROR: {error_text}" if error_text else ""}

Create a final report that explains the solution to the original task. Include:
1. What was done
2. How it was accomplished
3. The final results
4. Any insights or conclusions drawn from the analysis

Format the report in a professional, easy to read manner.
"""

        return generate_text(prompt)

def demo_level5():
    print("\n" + "="*50)
    print("LEVEL 5: FULLY AUTONOMOUS AGENT DEMO")
    print("="*50)
    print("At this level, the AI generates and executes code to solve complex problems.")
    print("It can create, validate, refine, and run custom code solutions.\n")

    user_task = input("Enter a data analysis or computational task: ") or "Analyze a dataset of numbers [10, 45, 65, 23, 76, 12, 89, 32, 50] and create visualizations of the distribution"
    print("\nProcessing your request... (this may take a minute or two)\n")

    agent = AutonomousAgent()
    result = agent.run(user_task)
    print("\nFINAL REPORT:")
    print("-"*50)
    print(result)
    print("-"*50)

The AutonomousAgent class embodies the autonomy of a Fully Autonomous Agent by maintaining a running memory of the user’s task and systematically orchestrating five core phases: planning, code generation, validation, safe execution, and reporting. When the run is initiated, the agent prompts the model to generate a detailed plan for solving the task and stores this plan in memory. Next, it asks the model to create self-contained Python code based on that plan, strips away any markdown formatting, and then validates the code by querying the model for syntax, logic, performance, and security issues. If validation uncovers problems, the agent instructs the model to refine the code until it passes inspection. The finalized code is then wrapped in a sandboxed execution harness, complete with captured output buffers and automatic result extraction, and executed in an isolated local environment. Finally, the agent synthesizes a polished, professional report by feeding the execution results back into the model, producing a narrative that explains what was done, how it was accomplished, and what insights were gained. The accompanying demo_level5 function provides a straightforward, interactive loop that accepts a user task, runs the agent, and presents a comprehensive final report.

Main Function: All Above Steps

def main():
    while True:
        clear_output(wait=True)
        print("\n" + "="*50)
        print("AI AGENT LEVELS DEMO")
        print("="*50)
        print("\nThis notebook demonstrates the 5 levels of AI agents:")
        print("1. Simple Processor - Model has no impact on program flow")
        print("2. Router - Model determines basic program flow")
        print("3. Tool Calling - Model determines how functions are executed")
        print("4. Multi-Step Agent - Model controls iteration and program continuation")
        print("5. Fully Autonomous Agent - Model creates & executes new code")
        print("6. Quit")

        choice = input("\nSelect a level to demo (1-6): ")

        if choice == "1":
            demo_level1()
        elif choice == "2":
            demo_level2()
        elif choice == "3":
            demo_level3()
        elif choice == "4":
            demo_level4()
        elif choice == "5":
            demo_level5()
        elif choice == "6":
            print("\nThank you for exploring the AI Agent levels!")
            break
        else:
            print("\nInvalid choice. Please select 1-6.")

        input("\nPress Enter to return to the main menu...")

if __name__ == "__main__":
    main()

Finally, the main function presents a simple, interactive menu loop that clears the Colab output for readability, displays all five agent levels alongside a quit option, and then dispatches the user’s choice to the corresponding demo function before waiting for input to return to the menu. This structure provides a cohesive, CLI-style interface enabling you to explore each agent level in sequence without manual cell execution.

In conclusion, by working through these five levels, we have gained practical insight into the principles of agentic AI and the trade-offs between control, flexibility, and autonomy. We have seen how a system can evolve from straightforward prompt-response behavior to complex decision-making pipelines and even self-modifying code execution. Whether you aim to prototype intelligent assistants, build data pipelines, or experiment with emerging AI capabilities, this progression framework provides a roadmap for designing robust and scalable agents.

Here is the Colab Notebook.


Meta AI Releases Web-SSL: A Scalable and Language-Free Approach to Visual Representation Learning

In recent years, contrastive language-image models such as CLIP have established themselves as a default choice for learning vision representations, particularly in multimodal applications like Visual Question Answering (VQA) and document understanding. These models leverage large-scale image-text pairs to incorporate semantic grounding via language supervision. However, this reliance on text introduces both conceptual and practical challenges: the assumption that language is essential for multimodal performance, the complexity of acquiring aligned datasets, and the scalability limits imposed by data availability. In contrast, visual self-supervised learning (SSL)—which operates without language—has historically demonstrated competitive results on classification and segmentation tasks, yet has been underutilized for multimodal reasoning due to performance gaps, especially in OCR and chart-based tasks.

Meta Releases WebSSL Models on Hugging Face (300M–7B Parameters)

To explore the capabilities of language-free visual learning at scale, Meta has released the Web-SSL family of DINO and Vision Transformer (ViT) models, ranging from 300 million to 7 billion parameters, now publicly available via Hugging Face. These models are trained exclusively on the image subset of the MetaCLIP dataset (MC-2B)—a web-scale dataset comprising two billion images. This controlled setup enables a direct comparison between WebSSL and CLIP, both trained on identical data, isolating the effect of language supervision.

The objective is not to replace CLIP, but to rigorously evaluate how far pure visual self-supervision can go when model and data scale are no longer limiting factors. This release represents a significant step toward understanding whether language supervision is necessary—or merely beneficial—for training high-capacity vision encoders.

Technical Architecture and Training Methodology

WebSSL encompasses two visual SSL paradigms: joint-embedding learning (via DINOv2) and masked modeling (via MAE). Each model follows a standardized training protocol using 224×224 resolution images and maintains a frozen vision encoder during downstream evaluation to ensure that observed differences are attributable solely to pretraining.

Models are trained across five capacity tiers (ViT-1B to ViT-7B), using only unlabeled image data from MC-2B. Evaluation is conducted using Cambrian-1, a comprehensive 16-task VQA benchmark suite encompassing general vision understanding, knowledge-based reasoning, OCR, and chart-based interpretation.

In addition, the models are natively supported in Hugging Face’s transformers library, providing accessible checkpoints and seamless integration into research workflows.
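For a rough sense of what that integration looks like in practice, the snippet below loads a checkpoint with the standard transformers Auto classes and extracts frozen image features. The checkpoint identifier is a placeholder rather than a confirmed model name; check the Web-SSL collection on Hugging Face for the exact IDs and any model-specific output conventions.

import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

# Placeholder checkpoint ID: substitute an actual Web-SSL model name from the Hub.
checkpoint = "facebook/webssl-dino1b-full2b-224"

processor = AutoImageProcessor.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint)
model.eval()  # downstream evaluation in the study keeps the vision encoder frozen

image = Image.open("example.jpg")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Token-level features from a ViT-style encoder; mean-pool for a single image embedding.
features = outputs.last_hidden_state
image_embedding = features.mean(dim=1)
print(image_embedding.shape)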

Performance Insights and Scaling Behavior

Experimental results reveal several key findings:

Scaling Model Size: WebSSL models demonstrate near log-linear improvements in VQA performance with increasing parameter count. In contrast, CLIP’s performance plateaus beyond 3B parameters. WebSSL maintains competitive results across all VQA categories and shows pronounced gains in Vision-Centric and OCR & Chart tasks at larger scales.

Data Composition Matters: By filtering the training data to include only 1.3% of text-rich images, WebSSL outperforms CLIP on OCR & Chart tasks—achieving up to +13.6% gains in OCRBench and ChartQA. This suggests that the presence of visual text alone, not language labels, significantly enhances task-specific performance.

High-Resolution Training: WebSSL models fine-tuned at 518px resolution further close the performance gap with high-resolution models like SigLIP, particularly for document-heavy tasks.

LLM Alignment: Without any language supervision, WebSSL shows improved alignment with pretrained language models (e.g., LLaMA-3) as model size and training exposure increase. This emergent behavior implies that larger vision models implicitly learn features that correlate well with textual semantics.

Importantly, WebSSL maintains strong performance on traditional benchmarks (ImageNet-1k classification, ADE20K segmentation, NYUv2 depth estimation), and often outperforms MetaCLIP and even DINOv2 under equivalent settings.

Concluding Observations

Meta’s Web-SSL study provides strong evidence that visual self-supervised learning, when scaled appropriately, is a viable alternative to language-supervised pretraining. These findings challenge the prevailing assumption that language supervision is essential for multimodal understanding. Instead, they highlight the importance of dataset composition, model scale, and careful evaluation across diverse benchmarks.

The release of models ranging from 300M to 7B parameters enables broader research and downstream experimentation without the constraints of paired data or proprietary pipelines. As open-source foundations for future multimodal systems, WebSSL models represent a meaningful advancement in scalable, language-free vision learning.

Check out the Models on Hugging Face, GitHub Page and Paper.


Meet Rowboat: An Open-Source IDE for Building Complex Multi-Agent Systems

As multi-agent systems gain traction in real-world applications—from customer support automation to AI-native infrastructure—the need for a streamlined development interface has never been greater. Meet Rowboat, an open-source IDE designed to accelerate the construction, debugging, and deployment of multi-agent AI workflows. It is powered by the OpenAI Agents SDK, connects to MCP servers, and can integrate into your apps over HTTP or via its SDK. Backed by Y Combinator and tightly integrated with OpenAI’s Agents SDK, Rowboat offers a unique combination of visual development, tool modularity, and real-time testing—making it a compelling platform for engineering agentic AI systems at scale.

Rethinking Multi-Agent Development

Developing multi-agent systems typically requires orchestrating interactions between multiple specialized agents, each responsible for a distinct task or capability. This often involves stitching together prompts, toolchains, and APIs—an effort that is not only tedious but error-prone. Rowboat abstracts away much of this complexity by introducing a visual, AI-assisted development environment that allows teams to define agent behavior using natural language, integrate modular toolsets, and evaluate systems through interactive testing.

The IDE is built with developers and applied AI teams in mind, especially those working on domain-specific use cases in customer experience (CX), enterprise automation, and backend infrastructure.

Key Features and Architecture

1. Copilot: Natural Language-Based Agent Design

At the heart of Rowboat lies its AI-powered Copilot—a system that transforms natural language specifications into runnable multi-agent workflows. For example, users can describe, “Build an assistant for a telecom company to handle data plan upgrades and billing inquiries,” and the Copilot scaffolds the entire system accordingly. This dramatically reduces the ramp-up time for teams new to multi-agent architectures.

2. Tool Integration via MCP Compatibility

Rowboat supports Model Context Protocol (MCP) servers, enabling seamless tool injection into agents. Developers can import tools defined in an external MCP server, assign them to individual agents within Rowboat, and trigger tool invocations through agent reasoning steps. This modular design ensures clear separation of responsibilities, enabling scalable and maintainable agent workflows.

3. Interactive Testing in the Playground

The built-in Playground offers a live testing environment where users can interact with their agents, observe system behavior, and debug tool calls. It supports step-by-step inspection of conversation history, function execution, and context propagation—critical capabilities when validating agent coordination or investigating unexpected behaviors.

4. Flexible Deployment via HTTP API and Python SDK

Rowboat isn’t just a visual IDE—it ships with an HTTP API and a Python SDK, giving teams the flexibility to embed Rowboat agents into broader infrastructure. Whether you’re running agents in a cloud-native microservice or embedding them in internal developer tools, the SDK provides both stateless and session-aware configurations.

Practical Use Cases

Rowboat is well-suited for teams building production-grade assistant systems. Some real-world applications include:

Financial Services: Automate credit card support, loan updates, and payment reminders using a team of domain-specific agents.

Insurance: Assist users with claims processing, policy inquiries, and premium calculations.

Travel & Hospitality: Handle flight updates, hotel bookings, itinerary changes, and multilingual support.

Telecom: Support billing resolution, plan changes, SIM management, and device troubleshooting.

These scenarios benefit from decomposing tasks into specialized agents with focused tool access—exactly the design pattern that Rowboat enables.

Conclusion

Rowboat fills an important gap in the AI development ecosystem: a purpose-built environment for prototyping and managing multi-agent systems. Its intuitive design, natural language integration, and modular architecture make it more than just an IDE—it’s a full development suite for agentic systems. Whether you’re building a customer service assistant, a backend orchestration tool, or a custom LLM agent pipeline, Rowboat provides the foundation.

Check out the GitHub Page.


OpenAI Launches gpt-image-1 API: Bringing High-Quality Image Generation to Developers

OpenAI has officially announced the release of its image generation API, powered by the gpt-image-1 model. This launch brings the multimodal capabilities of ChatGPT into the hands of developers, enabling programmatic access to image generation—an essential step for building intelligent design tools, creative applications, and multimodal agent systems.

The new API supports high-quality image synthesis from natural language prompts, marking a significant integration point for generative AI workflows in production environments. Starting today, developers can interact directly with the same image generation model that powers ChatGPT’s image creation capabilities.

Expanding the Capabilities of ChatGPT to Developers

The gpt-image-1 model is now available through the OpenAI platform, allowing developers to generate photorealistic, artistic, or highly stylized images using plain text. This follows a phased rollout of image generation features in the ChatGPT product interface and marks a critical transition toward API-first deployment.

The image generation endpoint supports parameters such as:

Prompt: Natural language description of the desired image.

Size: Standard resolution settings (e.g., 1024×1024).

n: Number of images to generate per prompt.

Response format: Choose between base64-encoded images or URLs.

Style: Optionally specify image aesthetics (e.g., “vivid” or “natural”).

The API follows a synchronous usage model, which means developers receive the generated image(s) in the same response—ideal for real-time interfaces like chatbots or design platforms.
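As a quick illustration of how these parameters map onto a request, the sketch below passes a size and image count alongside the prompt; exact parameter support for gpt-image-1 (for example, response format and style handling) should be confirmed against the Images API reference.

from openai import OpenAI

client = OpenAI()

response = client.images.generate(
    model="gpt-image-1",
    prompt="An isometric illustration of a solar-powered delivery drone",
    size="1024x1024",  # standard resolution setting
    n=1,               # number of images to generate per prompt
)

# gpt-image-1 returns base64-encoded image data (see the fuller example below).
print(response.data[0].b64_json[:80])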

Technical Overview of the API and gpt-image-1 Model

OpenAI has not yet released full architectural details about gpt-image-1, but based on public documentation, the model supports robust prompt adherence, detailed composition, and stylistic coherence across diverse image types. While it is distinct from DALL·E 3 in naming, the image quality and alignment suggest continuity in OpenAI’s image generation research lineage.

The API is designed to be stateless and easy to integrate:

from openai import OpenAI
import base64

client = OpenAI()

prompt = """
A children's book drawing of a veterinarian using a stethoscope to
listen to the heartbeat of a baby otter.
"""

result = client.images.generate(
    model="gpt-image-1",
    prompt=prompt
)

image_base64 = result.data[0].b64_json
image_bytes = base64.b64decode(image_base64)

# Save the image to a file
with open("otter.png", "wb") as f:
    f.write(image_bytes)

Unlocking Developer Use Cases

By making this API available, OpenAI positions gpt-image-1 as a fundamental building block for multimodal AI development. Some key applications include:

Generative Design Tools: Seamlessly integrate prompt-based image creation into design software for artists, marketers, and product teams.

AI Assistants and Agents: Extend LLMs with visual generation capabilities to support richer user interaction and content composition.

Prototyping for Games and XR: Rapidly generate environments, textures, or concept art for iterative development pipelines.

Educational Visualizations: Generate scientific diagrams, historical reconstructions, or data illustrations on demand.

With image generation now programmable, these use cases can be scaled, personalized, and embedded directly into user-facing platforms.

Content Moderation and Responsible Use

Safety remains a core consideration. OpenAI has implemented content filtering layers and safety classifiers around the gpt-image-1 model to mitigate risks of generating harmful, misleading, or policy-violating images. The model is subject to the same usage policies as OpenAI’s text-based models, with automated moderation for prompts and generated content.

Developers are encouraged to follow best practices for end-user input validation and maintain transparency in applications that include generative visual content.

Conclusion

The release of gpt-image-1 to the API marks a pivotal step in making generative vision models accessible, controllable, and production-ready. It’s not just a model—it’s an interface to imagination, grounded in structured, repeatable, and scalable computation.

For developers building the next generation of creative software, autonomous agents, or visual storytelling tools, gpt-image-1 offers a robust foundation to bring language and imagery together in code.

Check out the Technical Details.


Enterprise-grade natural language to SQL generation using LLMs: Balanc …

This blog post is co-written with Renuka Kumar and Thomas Matthew from Cisco.
Enterprise data by its very nature spans diverse data domains, such as security, finance, product, and HR. Data across these domains is often maintained across disparate data environments (such as Amazon Aurora, Oracle, and Teradata), with each managing hundreds or perhaps thousands of tables to represent and persist business data. These tables house complex domain-specific schemas, with instances of nested tables and multi-dimensional data that require complex database queries and domain-specific knowledge for data retrieval.
Recent advances in generative AI have led to the rapid evolution of natural language to SQL (NL2SQL) technology, which uses pre-trained large language models (LLMs) and natural language to generate database queries in the moment. Although this technology promises simplicity and ease of use for data access, converting natural language queries to complex database queries with accuracy and at enterprise scale has remained a significant challenge. For enterprise data, a major difficulty stems from the common case of database tables having embedded structures that require specific knowledge or highly nuanced processing (for example, an embedded XML formatted string). As a result, NL2SQL solutions for enterprise data are often incomplete or inaccurate.
This post describes a pattern that AWS and Cisco teams have developed and deployed that is viable at scale and addresses a broad set of challenging enterprise use cases. The methodology allows for the use of simpler, and therefore more cost-effective and lower latency, generative models by reducing the processing required for SQL generation.
Specific challenges for enterprise-scale NL2SQL
Generative accuracy is paramount for NL2SQL use cases; inaccurate SQL queries might result in a sensitive enterprise data leak, or lead to inaccurate results impacting critical business decisions. Enterprise-scale data presents specific challenges for NL2SQL, including the following:

Complex schemas optimized for storage (and not retrieval) – Enterprise databases are often distributed in nature and optimized for storage and not for retrieval. As a result, the table schemas are complex, involving nested tables and multi-dimensional data structures (for example, a cell containing an array of data). As a further result, creating queries for retrieval from these data stores requires specific expertise and involves complex filtering and joins.
Diverse and complex natural language queries – The user’s natural language input might also be complex because they might refer to a list of entities of interest or date ranges. Converting the logical meaning of these user queries into a database query can lead to overly long and complex SQL queries due to the original design of the data schema.
LLM knowledge gap – NL2SQL language models are typically trained on data schemas that are publicly available for education purposes and might not have the necessary knowledge complexity required of large, distributed databases in production environments. Consequently, when faced with complex enterprise table schemas or complex user queries, LLMs have difficulty generating correct query statements because they have difficulty understanding interrelationships between the values and entities of the schema.
LLM attention burden and latency – Queries containing multi-dimensional data often involve multi-level filtering over each cell of the data. To generate queries for cases such as these, the generative model requires more attention to support attending to the increase in relevant tables, columns, and values; analyzing the patterns; and generating more tokens. This increases the LLM’s query generation latency, and the likelihood of query generation errors, because of the LLM misunderstanding data relationships and generating incorrect filter statements.
Fine-tuning challenge – One common approach to achieve higher accuracy with query generation is to fine-tune the model with more SQL query samples. However, it is non-trivial to craft training data for generating SQL for embedded structures within columns (for example, JSON, or XML), to handle sets of identifiers, and so on, to get baseline performance (which is the problem we are trying to solve in the first place). This also introduces a slowdown in the development cycle.

Solution design and methodology
The solution described in this post provides a set of optimizations that solve the aforementioned challenges while reducing the amount of work that has to be performed by an LLM for generating accurate output. This work extends upon the post Generating value from enterprise data: Best practices for Text2SQL and generative AI. That post has many useful recommendations for generating high-quality SQL, and the guidelines outlined might be sufficient for your needs, depending on the inherent complexity of the database schemas.
To achieve generative accuracy for complex scenarios, the solution breaks down NL2SQL generation into a sequence of focused steps and sub-problems, narrowing the generative focus to the appropriate data domain. Using data abstractions for complex joins and data structure, this approach enables the use of smaller and more affordable LLMs for the task. This approach results in reduced prompt size and complexity for inference, reduced response latency, and improved accuracy, while enabling the use of off-the-shelf pre-trained models.
Narrowing scope to specific data domains
The solution workflow narrows down the overall schema space into the data domain targeted by the user’s query. Each data domain corresponds to the set of database data structures (tables, views, and so on) that are commonly used together to answer a set of related user queries, for an application or business domain. The solution uses the data domain to construct prompt inputs for the generative LLM.
This pattern consists of the following elements:

Mapping input queries to domains – This involves mapping each user query to the data domain that is appropriate for generating the response for NL2SQL at runtime. This mapping is similar in nature to intent classification, and enables the construction of an LLM prompt that is scoped for each input query (described next). A minimal sketch of this mapping step appears after this list.
Scoping data domain for focused prompt construction – This is a divide-and-conquer pattern. By focusing on the data domain of the input query, redundant information, such as schemas for other data domains in the enterprise data store, can be excluded. This might be considered as a form of prompt pruning; however, it offers more than prompt reduction alone. Reducing the prompt context to the in-focus data domain enables greater scope for few-shot learning examples, declaration of specific business rules, and more.
Augmenting SQL DDL definitions with metadata to enhance LLM inference – This involves enhancing the LLM prompt context by augmenting the SQL DDL for the data domain with descriptions of tables, columns, and rules to be used by the LLM as guidance on its generation. This is described in more detail later in this post.
Determine query dialect and connection information – For each data domain, the database server metadata (such as the SQL dialect and connection URI) is captured during use case onboarding and made available at runtime to be automatically included in the prompt for SQL generation and subsequent query execution. This enables scalability through decoupling the natural language query from the specific queried data source. Together, the SQL dialect and connectivity abstractions allow for the solution to be data source agnostic; data sources might be distributed within or across different clouds, or provided by different vendors. This modularity enables scalable addition of new data sources and data domains, because each is independent.
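To make the query-to-domain mapping step concrete, the following sketch shows one way a query could be classified into a known data domain with a lightweight LLM call. It reuses the llm_service_facade helper that appears in the execution-flow snippets later in this post; the prompt wording and domain list are illustrative assumptions rather than the exact implementation.

KNOWN_DOMAINS = ["olympics", "employee_vacation"]  # illustrative domain names

def classify_domain(user_query):
    # Constrain the model to a closed set of domains so it acts as an intent classifier.
    prompt = (
        "Classify the user query into exactly one of these data domains: "
        + ", ".join(KNOWN_DOMAINS)
        + ". Respond with the domain name only.\n\n"
        + f"User query: {user_query}"
    )
    response = llm_service_facade.invoke(prompt)
    domain = response["llm_output"].strip().lower()
    return domain if domain in KNOWN_DOMAINS else None

# Example: classify_domain("In what games did Allyson Felix compete?") -> "olympics"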

Managing identifiers for SQL generation (resource IDs)
Resolving identifiers involves extracting the named resources, as named entities, from the user’s query and mapping the values to unique IDs appropriate for the target data source prior to NL2SQL generation. This can be implemented using natural language processing (NLP) or LLMs to apply named entity recognition (NER) capabilities to drive the resolution process. This optional step has the most value when there are many named resources and the lookup process is complex. For instance, in a user query such as “In what games did Isabelle Werth, Nedo Nadi, and Allyson Felix compete?” there are named resources: ‘allyson felix’, ‘isabelle werth’, and ‘nedo nadi’. This step allows for rapid and precise feedback to the user when a resource can’t be resolved to an identifier (for example, due to ambiguity).
This optional process of handling many or paired identifiers is included to offload the burden on LLMs for user queries with challenging sets of identifiers to be incorporated, such as those that might come in pairs (such as ID-type, ID-value), or where there are many identifiers. Rather than having the generative LLM insert each unique ID into the SQL directly, the identifiers are made available by defining a temporary data structure (such as a temporary table) and a set of corresponding insert statements. The LLM is prompted with few-shot learning examples to generate SQL for the user query by joining with the temporary data structure, rather than attempt identity injection. This results in a simpler and more consistent query pattern for cases when there are one, many, or pairs of identifiers.
Handling complex data structures: Abstracting domain data structures
This step is aimed at simplifying complex data structures into a form that can be understood by the language model without having to decipher complex inter-data relationships. Complex data structures might appear as nested tables or lists within a table column, for instance.
We can define temporary data structures (such as views and tables) that abstract complex multi-table joins, nested structures, and more. These higher-level abstractions provide simplified data structures for query generation and execution. The top-level definitions of these abstractions are included as part of the prompt context for query generation, and the full definitions are provided to the SQL execution engine, along with the generated query. The resulting queries from this process can use simple set operations (such as IN, as opposed to complex joins) that LLMs are well trained on, thereby alleviating the need for nested joins and filters over complex data structures.
Augmenting data with data definitions for prompt construction
Several of the optimizations noted earlier require making some of the specifics of the data domain explicit. Fortunately, this only has to be done when schemas and use cases are onboarded or updated. The benefit is higher generative accuracy, reduced generative latency and cost, and the ability to support arbitrarily complex query requirements.
To capture the semantics of a data domain, the following elements are defined:

The standard tables and views in data schema, along with comments to describe the tables and columns.
Join hints for the tables and views, such as when to use outer joins.
Data domain-specific rules, such as which columns might not appear in a final select statement.
The set of few-shot examples of user queries and corresponding SQL statements. A good set of examples would include a wide variety of user queries for that domain.
Definitions of the data schemas for any temporary tables and views used in the solution.
A domain-specific system prompt that specifies the role and expertise that the LLM has, the SQL dialect, and the scope of its operation.
A domain-specific user prompt.
Additionally, if temporary tables or views are used for the data domain, a SQL script must be defined that, when executed, creates the desired temporary data structures. Depending on the use case, this can be a static or dynamically generated script.

Accordingly, the prompt for generating the SQL is dynamic and constructed based on the data domain of the input question, with a set of specific definitions of data structure and rules appropriate for the input query. We refer to this set of elements as the data domain context. The purpose of the data domain context is to provide the necessary prompt metadata for the generative LLM. Examples of this, and the methods described in the previous sections, are included in the GitHub repository. There is one context for each data domain, as illustrated in the following figure.
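To give a sense of its shape, a data domain context might be represented as a structure like the sketch below; the field names are illustrative, and the GitHub repository shows the exact format used by the example implementation.

olympics_domain_context = {
    "system_prompt": "You are a SQL expert for the Olympics statistics database ...",
    "user_prompt_template": "Generate a SQL query that answers: {user_query}",
    "sql_dialect": "sqlite",                    # captured at use case onboarding
    "connection_uri": "sqlite:///olympics.db",  # captured at use case onboarding
    "schema_ddl": [
        "CREATE TABLE games (id INTEGER PRIMARY KEY, games_year INTEGER, ...);",
        # ... plus comments describing tables/columns and temporary-table definitions
    ],
    "join_hints": ["Join games_competitor to games on games_id."],
    "rules": ["Do not include internal ID columns in the final SELECT."],
    "few_shot_examples": [
        {"question": "How many gold medals has Yukio Endo won?", "sql": "SELECT ..."},
    ],
    "preamble_script": "CREATE temp TABLE athletes_in_focus (...);",
}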

Bringing it all together: The execution flow
This section describes the execution flow of the solution. An example implementation of this pattern is available in the GitHub repository. Access the repository to follow along with the code.
To illustrate the execution flow, we use an example database with data about Olympics statistics and another with the company’s employee vacation schedule. We follow the execution flow for the domain regarding Olympics statistics using the user query “In what games did Isabelle Werth, Nedo Nadi, and Allyson Felix compete?” to show the inputs and outputs of the steps in the execution flow, as illustrated in the following figure.

Preprocess the request
The first step of the NL2SQL flow is to preprocess the request. The main objective of this step is to classify the user query into a domain. As explained earlier, this narrows down the scope of the problem to the appropriate data domain for SQL generation. Additionally, this step identifies and extracts the referenced named resources in the user query. These are then used to call the identity service in the next step to get the database identifiers for these named resources.
Using the earlier mentioned example, the inputs and outputs of this step are as follows:

user_query = "In what games did Isabelle Werth, Nedo Nadi and Allyson Felix compete?"
pre_processed_request = request_pre_processor.run(user_query)
domain = pre_processed_request[app_consts.DOMAIN]

# Output pre_processed_request:
  {'user_query': 'In what games did Isabelle Werth, Nedo Nadi and Allyson Felix compete?',
   'domain': 'olympics',
   'named_resources': {'allyson felix', 'isabelle werth', 'nedo nadi'} }

Resolve identifiers (to database IDs)
This step processes the named resources’ strings extracted in the previous step and resolves them to identifiers that can be used in database queries. As mentioned earlier, the named resources (for example, “group22”, “user123”, and “I”) are looked up using solution-specific means, such as database lookups or an ID service.
The following code shows the execution of this step in our running example:

named_resources = pre_processed_request[app_consts.NAMED_RESOURCES]
if len(named_resources) > 0:
    identifiers = id_service_facade.resolve(named_resources)
    # add identifiers to the pre_processed_request object
    pre_processed_request[app_consts.IDENTIFIERS] = identifiers
else:
    pre_processed_request[app_consts.IDENTIFIERS] = []

# Output pre_processed_request:
  {'user_query': 'In what games did Isabelle Werth, Nedo Nadi and Allyson Felix compete?',
   'domain': 'olympics',
   'named_resources': {'allyson felix', 'isabelle werth', 'nedo nadi'},
   'identifiers': [ {'id': 34551, 'role': 32, 'name': 'allyson felix'},
                    {'id': 129726, 'role': 32, 'name': 'isabelle werth'},
                    {'id': 84026, 'role': 32, 'name': 'nedo nadi'} ] }

Prepare the request
This step is pivotal in this pattern. Having obtained the domain and the named resources along with their looked-up IDs, we use the corresponding context for that domain to generate the following:

A prompt for the LLM to generate a SQL query corresponding to the user query
A SQL script to create the domain-specific schema

To create the prompt for the LLM, this step assembles the system prompt, the user prompt, and the received user query from the input, along with the domain-specific schema definition, including new temporary tables created as well as any join hints, and finally the few-shot examples for the domain. Other than the user query that is received as in input, other components are based on the values provided in the context for that domain.
A SQL script for creating required domain-specific temporary structures (such as views and tables) is constructed from the information in the context. The domain-specific schema in the LLM prompt, join hints, and the few-shot examples are aligned with the schema that gets generated by running this script. In our example, this step is shown in the following code. The output is a dictionary with two keys, llm_prompt and sql_preamble. The value strings for these have been clipped here; the full output can be seen in the Jupyter notebook.

prepared_request = request_preparer.run(pre_processed_request)

# Output prepared_request:
{'llm_prompt': 'You are a SQL expert. Given the following SQL tables definitions, ...
CREATE TABLE games (id INTEGER PRIMARY KEY, games_year INTEGER, ...);

<example>
question: How many gold medals has Yukio Endo won? answer: ```{"sql":
"SELECT a.id, count(m.medal_name) as "count"
FROM athletes_in_focus a INNER JOIN games_competitor gc ...
WHERE m.medal_name = 'Gold' GROUP BY a.id;" }```
</example>

 'sql_preamble': [ 'CREATE temp TABLE athletes_in_focus (row_id INTEGER
PRIMARY KEY, id INTEGER, full_name TEXT DEFAULT NULL);',
'INSERT INTO athletes_in_focus VALUES
(1,84026,'nedo nadi'), (2,34551,'allyson felix'), (3,129726,'isabelle werth');']}

Generate SQL
Now that the prompt has been prepared along with any information necessary to provide the proper context to the LLM, we provide that information to the SQL-generating LLM in this step. The goal is to have the LLM output SQL with the correct join structure, filters, and columns. See the following code:

llm_response = llm_service_facade.invoke(prepared_request['llm_prompt'])
generated_sql = llm_response['llm_output']

# Output generated_sql:
{'sql': 'SELECT g.games_name, g.games_year FROM athletes_in_focus a
         JOIN games_competitor gc ON gc.person_id = a.id
         JOIN games g ON gc.games_id = g.id;'}

Execute the SQL
After the SQL query is generated by the LLM, we can send it off to the next step. At this step, the SQL preamble and the generated SQL are merged to create a complete SQL script for execution. The complete SQL script is then executed against the data store, a response is fetched, and then the response is passed back to the client or end-user. See the following code:

sql_script = prepared_request['sql_preamble'] + [generated_sql['sql']]
database = app_consts.get_database_for_domain(domain)
results = rdbms_service_facade.execute_sql(database, sql_script)

# Output results:
{'rdbms_output': [
    ('games_name', 'games_year'),
    ('2004 Summer', 2004),
    ...
    ('2016 Summer', 2016)],
 'processing_status': 'success'}

Solution benefits
Overall, our tests have shown several benefits, such as:

High accuracy – This is measured by a string matching of the generated query with the target SQL query for each test case. In our tests, we observed over 95% accuracy for 100 queries, spanning three data domains.
High consistency – This is measured in terms of the same SQL being generated across multiple runs. We observed over 95% consistency for 100 queries, spanning three data domains. With the test configuration, the queries were accurate most of the time; a small number occasionally produced inconsistent results.
Low cost and latency – The approach supports the use of small, low-cost, low-latency LLMs. We observed SQL generation in the 1–3 second range using Meta’s Code Llama 13B and Anthropic’s Claude 3 Haiku models.
Scalability – The methods that we employed in terms of data abstractions facilitate scaling independent of the number of entities or identifiers in the data for a given use case. For instance, in our tests consisting of a list of 200 different named resources per row of a table, and over 10,000 such rows, we measured a latency range of 2–5 seconds for SQL generation and 3.5–4.0 seconds for SQL execution.
Solving complexity – Using the data abstractions for simplifying complexity enabled the accurate generation of arbitrarily complex enterprise queries, which almost certainly would not be possible otherwise.

We attribute the success of the solution with these excellent but lightweight models (compared to a Meta Llama 70B variant or Anthropic’s Claude Sonnet) to the points noted earlier, with the reduced LLM task complexity being the driving force. The implementation code demonstrates how this is achieved. Overall, by using the optimizations outlined in this post, natural language SQL generation for enterprise data is much more feasible than would be otherwise.
AWS solution architecture
In this section, we illustrate how you might implement the architecture on AWS. The end-user sends their natural language queries to the NL2SQL solution using a REST API. Amazon API Gateway is used to provision the REST API, which can be secured by Amazon Cognito. The API is linked to an AWS Lambda function, which implements and orchestrates the processing steps described earlier using a programming language of the user’s choice (such as Python) in a serverless manner. In this example implementation, where Amazon Bedrock is noted, the solution uses Anthropic’s Claude 3 Haiku.
Briefly, the processing steps are as follows:

Determine the domain by invoking an LLM on Amazon Bedrock for classification.
Invoke Amazon Bedrock to extract relevant named resources from the request.
After the named resources are determined, this step calls a service (the Identity Service) that returns identifier specifics relevant to the named resources for the task at hand. The Identity Service is logically a key/value lookup service, which might support multiple domains.
This step runs on Lambda to create the LLM prompt to generate the SQL, and to define temporary SQL structures that will be executed by the SQL engine along with the SQL generated by the LLM (in the next step).
Given the prepared prompt, this step invokes an LLM running on Amazon Bedrock to generate the SQL statements that correspond to the input natural language query.
This step executes the generated SQL query against the target database. In our example implementation, we used an SQLite database for illustration purposes, but you could use another database server.

The final result is obtained by running the preceding pipeline on Lambda. When the workflow is complete, the result is provided as a response to the REST API request.
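A minimal sketch of how the Lambda handler might chain these steps together, using the same helper facades shown in the execution-flow snippets above, is shown below; the actual orchestration code in the repository may be structured differently.

import json

def lambda_handler(event, context):
    # Steps 1-2: classify the domain and extract named resources (Bedrock calls inside).
    user_query = json.loads(event["body"])["user_query"]
    pre_processed_request = request_pre_processor.run(user_query)

    # Step 3: resolve named resources to database identifiers via the Identity Service.
    named_resources = pre_processed_request[app_consts.NAMED_RESOURCES]
    pre_processed_request[app_consts.IDENTIFIERS] = (
        id_service_facade.resolve(named_resources) if named_resources else []
    )

    # Step 4: build the domain-scoped LLM prompt and the temporary-structure SQL preamble.
    prepared_request = request_preparer.run(pre_processed_request)

    # Step 5: generate SQL with an LLM on Amazon Bedrock.
    generated_sql = llm_service_facade.invoke(prepared_request["llm_prompt"])["llm_output"]

    # Step 6: execute the preamble plus the generated SQL against the target database.
    domain = pre_processed_request[app_consts.DOMAIN]
    database = app_consts.get_database_for_domain(domain)
    sql_script = prepared_request["sql_preamble"] + [generated_sql["sql"]]
    results = rdbms_service_facade.execute_sql(database, sql_script)

    return {"statusCode": 200, "body": json.dumps(results["rdbms_output"])}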
The following diagram illustrates the solution architecture.

Conclusion
In this post, the AWS and Cisco teams unveiled a new methodical approach that addresses the challenges of enterprise-grade SQL generation. The teams were able to reduce the complexity of the NL2SQL process while delivering higher accuracy and better overall performance.
Though we’ve walked you through an example use case focused on answering questions about Olympic athletes, this versatile pattern can be seamlessly adapted to a wide range of business applications and use cases. The demo code is available in the GitHub repository. We invite you to leave any questions and feedback in the comments.

About the authors

Renuka Kumar is a Senior Engineering Technical Lead at Cisco, where she has architected and led the development of Cisco’s Cloud Security BU’s AI/ML capabilities in the last 2 years, including launching first-to-market innovations in this space. She has over 20 years of experience in several cutting-edge domains, with over a decade in security and privacy. She holds a PhD from the University of Michigan in Computer Science and Engineering.

Toby Fotherby is a Senior AI and ML Specialist Solutions Architect at AWS, helping customers use the latest advances in AI/ML and generative AI to scale their innovations. He has over a decade of cross-industry expertise leading strategic initiatives and master’s degrees in AI and Data Science. Toby also leads a program training the next generation of AI Solutions Architects.

Shweta Keshavanarayana is a Senior Customer Solutions Manager at AWS. She works with AWS Strategic Customers and helps them in their cloud migration and modernization journey. Shweta is passionate about solving complex customer challenges using creative solutions. She holds an undergraduate degree in Computer Science & Engineering. Beyond her professional life, she volunteers as a team manager for her sons’ U9 cricket team, while also mentoring women in tech and serving the local community.
Thomas Matthew is an AL/ML Engineer at Cisco. Over the past decade, he has worked on applying methods from graph theory and time series analysis to solve detection and exfiltration problems found in Network security. He has presented his research and work at Blackhat and DevCon. Currently, he helps integrate generative AI technology into Cisco’s Cloud Security product offerings.
Daniel Vaquero is a Senior AI/ML Specialist Solutions Architect at AWS. He helps customers solve business challenges using artificial intelligence and machine learning, creating solutions ranging from traditional ML approaches to generative AI. Daniel has more than 12 years of industry experience working on computer vision, computational photography, machine learning, and data science, and he holds a PhD in Computer Science from UCSB.
Atul Varshneya is a former Principal AI/ML Specialist Solutions Architect with AWS. He currently focuses on developing solutions in the areas of AI/ML, particularly in generative AI. In his career of 4 decades, Atul has worked as the technology R&D leader in multiple large companies and startups.
Jessica Wu is an Associate Solutions Architect at AWS. She helps customers build highly performant, resilient, fault-tolerant, cost-optimized, and sustainable architectures.

AWS Field Experience reduced cost and delivered low latency and high p …

AWS Field Experience (AFX) empowers Amazon Web Services (AWS) sales teams with generative AI solutions built on Amazon Bedrock, improving how AWS sellers and customers interact. The AFX team uses AI to automate tasks and provide intelligent insights and recommendations, streamlining workflows for both customer-facing roles and internal support functions. Their approach emphasizes operational efficiency and practical enhancements to daily processes.
Last year, AFX introduced Account Summaries as the first in a forthcoming lineup of tools designed to support and streamline sales workflows. By integrating structured and unstructured data—from sales collateral and customer engagements to external insights and machine learning (ML) outputs—the tool delivers summarized insights that offer a comprehensive view of customer accounts. These summaries provide concise overviews and timely updates, enabling teams to make informed decisions during customer interactions.
The following screenshot shows an example of an Account Summary for a customer account, including an executive summary, company overview, and recent account changes.

Migration to the Amazon Nova Lite foundation model
Initially, AFX selected a range of models available on Amazon Bedrock, each chosen for its specific capabilities tailored to the diverse requirements of various summary sections. This was done to optimize accuracy, response time, and cost efficiency. However, following the introduction of state-of-the-art Amazon Nova foundation models in December 2024, the AFX team consolidated all of its generative AI workloads onto the Nova Lite model to capitalize on its industry-leading price performance and optimized latency.
Since moving to the Nova Lite model, the AFX team has achieved a remarkable 90% reduction in inference costs. This has empowered them to scale operations and deliver greater business value that directly supports their mission of creating efficient, high-performing sales processes.
Because Account Summaries are often used by sellers during on-the-go customer engagements, response speed is critical for maintaining seller efficiency. The Nova Lite model’s ultra-low latency helps ensure that sellers receive fast, reliable responses, without compromising on the quality of the insights.
The AFX team also highlighted the seamless migration experience, noting that their existing prompting, reasoning, and evaluation criteria transferred smoothly to the Amazon Nova Lite model without requiring significant modifications. The combination of tailored prompt controls and authorized reference content creates a bounded response framework, minimizing hallucinations and inaccuracies.
Overall impact
Since the move to the Nova Lite model, more than 15,600 summaries have been generated by 3,600 sellers, with 1,500 of those sellers producing more than four summaries each. Impressively, the generative AI Account Summaries have achieved a 72% favorability rate, underscoring strong seller confidence and widespread approval.
AWS sellers report saving an average of 35 minutes per summary, a benefit that significantly boosts productivity and allocates more time for customer engagements. Additionally, about one-third of surveyed sellers noted that the summaries positively influenced their customer interactions, and those using generative AI Account Summaries experienced a 4.9% increase in the value of opportunities created.
A member of the AFX team explained, “The Amazon Nova Lite model has significantly reduced our costs without compromising performance. It allowed us to get fast, reliable account summaries, making customer interaction more productive and impactful.”
Conclusion
The AFX team’s migration to the Nova Lite model has delivered tangible enterprise value by enhancing sales workflows. By migrating to the Amazon Nova Lite model, the team has not only achieved significant cost savings and reduced latency, but has also empowered sellers with an intelligent, reliable solution. This process has translated into real-world benefits—saving time, simplifying research, and bolstering customer engagement—laying a solid foundation for ongoing business goals and sustained success.
Get started with Amazon Nova on the Amazon Bedrock console. Learn more at the Amazon Nova product page.

About the Authors
Anuj Jauhari is a Senior Product Marketing Manager at Amazon Web Services, where he helps customers realize value from innovations in generative AI.
Ashwin Nadagoudar is a Software Development Manager at Amazon Web Services, leading go-to-market (GTM) strategies and user journey initiatives with generative AI.
Sonciary Perez is a Principal Product Manager at Amazon Web Services, supporting the transformation of AWS Sales through AI-powered solutions that drive seller productivity and accelerate revenue growth.

Combine keyword and semantic search for text and images using Amazon B …

Customers today expect to find products quickly and efficiently through intuitive search functionality. A seamless search journey not only enhances the overall user experience, but also directly impacts key business metrics such as conversion rates, average order value, and customer loyalty. According to a McKinsey study, 78% of consumers are more likely to make repeat purchases from companies that provide personalized experiences. As a result, delivering exceptional search functionality has become a strategic differentiator for modern ecommerce services. With ever expanding product catalogs and increasing diversity of brands, harnessing advanced search technologies is essential for success.
Semantic search enables digital commerce providers to deliver more relevant search results by going beyond keyword matching. It uses an embeddings model to create vector embeddings that capture the meaning of the input query. This helps the search be more resilient to phrasing variations and to accept multimodal inputs such as text, image, audio, and video. For example, a user inputs a query containing text and an image of a product they like, and the search engine translates both into vector embeddings using a multimodal embeddings model and retrieves related items from the catalog using embeddings similarities. To learn more about semantic search and how Amazon Prime Video uses it to help customers find their favorite content, see Amazon Prime Video advances search for sports using Amazon OpenSearch Service.
While semantic search provides contextual understanding and flexibility, keyword search remains a crucial component for a comprehensive ecommerce search solution. At its core, keyword search provides the essential baseline functionality of accurately matching user queries to product data and metadata, making sure explicit product names, brands, or attributes can be reliably retrieved. This matching capability is vital, because users often have specific items in mind when initiating a search, and meeting these explicit needs with precision is important to deliver a satisfactory experience.
Hybrid search combines the strengths of keyword search and semantic search, enabling retailers to deliver more accurate and relevant results to their customers. According to an OpenSearch blog post, hybrid search improves result quality by 8–12% compared to keyword search and by 15% compared to natural language search. However, combining keyword search and semantic search presents significant complexity because different query types provide scores on different scales. Using Amazon OpenSearch Service hybrid search, customers can seamlessly integrate these approaches by combining relevance scores from multiple search types into one unified score.
OpenSearch Service is the AWS recommended vector database for Amazon Bedrock. It’s a fully managed service that you can use to deploy, operate, and scale OpenSearch on AWS. OpenSearch is a distributed open-source search and analytics engine composed of a search engine and vector database. OpenSearch Service can help you deploy and operate your search infrastructure with native vector database capabilities delivering as low as single-digit millisecond latencies for searches across billions of vectors, making it ideal for real-time AI applications. To learn more, see Improve search results for AI using Amazon OpenSearch Service as a vector database with Amazon Bedrock.
Multimodal embedding models like Amazon Titan Multimodal Embeddings G1, available through Amazon Bedrock, play a critical role in enabling hybrid search functionality. These models generate embeddings for both text and images by representing them in a shared semantic space. This allows systems to retrieve relevant results across modalities such as finding images using text queries or combining text with image inputs.
In this post, we walk you through how to build a hybrid search solution using OpenSearch Service powered by multimodal embeddings from the Amazon Titan Multimodal Embeddings G1 model through Amazon Bedrock. This solution demonstrates how you can enable users to submit both text and images as queries to retrieve relevant results from a sample retail image dataset.
Overview of solution
In this post, you will build a solution that you can use to search through a sample image dataset in the retail space, using a multimodal hybrid search system powered by OpenSearch Service. This solution has two key workflows: a data ingestion workflow and a query workflow.
Data ingestion workflow
The data ingestion workflow generates vector embeddings for text, images, and metadata using Amazon Bedrock and the Amazon Titan Multimodal Embeddings G1 model. Then, it stores the vector embeddings, text, and metadata in an OpenSearch Service domain.
In this workflow, shown in the following figure, we use a SageMaker JupyterLab notebook to perform the following actions:

Read text, images, and metadata from an Amazon Simple Storage Service (Amazon S3) bucket, and encode images in Base64 format.
Send the text, images, and metadata to Amazon Bedrock using its API to generate embeddings with the Amazon Titan Multimodal Embeddings G1 model (a minimal sketch of this call follows the list).
The Amazon Bedrock API returns the embeddings to the Jupyter notebook.
Store both the embeddings and metadata in an OpenSearch Service domain.
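The following is a minimal sketch of the embedding-generation step, assuming a boto3 client and a Base64-encoded image. It calls the Amazon Titan Multimodal Embeddings G1 model through the Amazon Bedrock InvokeModel API; the field names follow the Titan Multimodal Embeddings request format, and the helper name is ours.

import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

def titan_multimodal_embedding(text: str = None, image_base64: str = None) -> list[float]:
    # Build the request body; either field may be omitted, but at least one is required
    body = {}
    if text:
        body["inputText"] = text
    if image_base64:
        body["inputImage"] = image_base64

    response = bedrock_runtime.invoke_model(
        modelId="amazon.titan-embed-image-v1",
        body=json.dumps(body),
        contentType="application/json",
        accept="application/json",
    )
    # The model returns a 1,024-dimension vector by default
    return json.loads(response["body"].read())["embedding"]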

Query workflow
In the query workflow, an OpenSearch search pipeline is used to convert the query input to embeddings using the embeddings model registered with OpenSearch. Then, within the search pipeline’s results processor, the results of semantic search and keyword search are combined using the normalization processor to provide relevant search results to users. Search pipelines remove the heavy lifting of building score normalization and combination logic outside your OpenSearch Service domain.
The workflow consists of the following steps shown in the following figure:

The client submits a query input containing text, a Base64 encoded image, or both to OpenSearch Service. Text submitted is used for both semantic and keyword search, and the image is used for semantic search.
The OpenSearch search pipeline performs the keyword search using the textual input and a neural search using the vector embeddings generated by Amazon Bedrock with the Titan Multimodal Embeddings G1 model.
The normalization processor within the pipeline scales search results using techniques like min_max and combines keyword and semantic scores using arithmetic_mean (a small worked example follows this list).
Ranked search results are returned to the client.
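To make the normalization and combination concrete, here is a small, self-contained Python illustration of what min_max scaling followed by an arithmetic_mean combination does to a set of keyword and semantic scores. The scores and weights are invented for the example; in the actual solution, OpenSearch performs this inside the search pipeline.

def min_max(scores):
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) if hi != lo else 1.0 for s in scores]

# Hypothetical raw scores for the same three documents from each sub-query
keyword_scores = [8.4, 6.1, 2.3]      # BM25-style keyword scores
semantic_scores = [0.71, 0.69, 0.64]  # cosine-similarity-style semantic scores

kw_norm = min_max(keyword_scores)
sem_norm = min_max(semantic_scores)

# arithmetic_mean combination with weights [0.3, 0.7]
weights = [0.3, 0.7]
hybrid = [weights[0] * k + weights[1] * s for k, s in zip(kw_norm, sem_norm)]
print(hybrid)  # documents are then re-ranked by this combined score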

Walkthrough overview
To deploy the solution, complete the following high-level steps:

Create a connector for Amazon Bedrock in OpenSearch Service.
Create an OpenSearch search pipeline and enable hybrid search.
Create an OpenSearch Service index for storing the multimodal embeddings and metadata.
Ingest sample data to the OpenSearch Service index.
Create OpenSearch Service query functions to test search functionality.

Prerequisites
For this walkthrough, you should have the following prerequisites:

An AWS account.
Amazon Bedrock with Amazon Titan Multimodal Embeddings G1 enabled. For more information, see Access Amazon Bedrock foundation models.
An OpenSearch Service domain. For instructions, see Getting started with Amazon OpenSearch Service.
An Amazon SageMaker notebook. For instructions, see Quick setup for Amazon SageMaker.
Familiarity with AWS Identity and Access Management (IAM), Amazon Elastic Compute Cloud (Amazon EC2), OpenSearch Service, and SageMaker.
Familiarity with the Python programming language.

The code is open source and hosted on GitHub.
Create a connector for Amazon Bedrock in OpenSearch Service
To use OpenSearch Service machine learning (ML) connectors with other AWS services, you need to set up an IAM role allowing access to that service. In this section, we demonstrate the steps to create an IAM role and then create the connector.
Create an IAM role
Complete the following steps to set up an IAM role to delegate Amazon Bedrock permissions to OpenSearch Service:

Add the following policy to the new role to allow OpenSearch Service to invoke the Amazon Titan Multimodal Embeddings G1 model:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "bedrock:InvokeModel",
            "Resource": "arn:aws:bedrock:region:account-id:foundation-model/amazon.titan-embed-image-v1"
        }
    ]
}

Modify the role trust policy as follows. You can follow the instructions in IAM role management to edit the trust relationship of the role.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "opensearchservice.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}

Connect an Amazon Bedrock model to OpenSearch
After you create the role, use the role’s Amazon Resource Name (ARN) to define a constant in the SageMaker notebook, along with the OpenSearch domain endpoint. Complete the following steps:

Register a model group. Note the model group ID returned in the response to register a model in a later step.
Create a connector, which facilitates registering and deploying external models in OpenSearch. The response will contain the connector ID.
Register the external model to the model group and deploy the model. In this step, you register and deploy the model at the same time: by setting deploy=true, the registered model is deployed as well. A condensed sketch of these three calls follows this list.
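A condensed sketch of the three calls, using the same requests-based style as the rest of the notebook, might look like the following. The connector body is abbreviated; the full version (request templates, credentials, and the Amazon Bedrock endpoint) is in the notebook, and variable names such as BEDROCK_CONNECTOR_ROLE_ARN are placeholders.

import requests

# 1. Register a model group and note its ID
r = requests.post(
    f"{OPENSEARCH_ENDPOINT}/_plugins/_ml/model_groups/_register",
    json={"name": "bedrock-multimodal-group", "description": "Titan multimodal embeddings"},
    auth=open_search_auth,
)
model_group_id = r.json()["model_group_id"]

# 2. Create a connector to the Amazon Bedrock embeddings model (body abbreviated)
r = requests.post(
    f"{OPENSEARCH_ENDPOINT}/_plugins/_ml/connectors/_create",
    json={
        "name": "Amazon Bedrock connector",
        "protocol": "aws_sigv4",
        "credential": {"roleArn": BEDROCK_CONNECTOR_ROLE_ARN},
        # ... parameters and actions for amazon.titan-embed-image-v1 go here ...
    },
    auth=open_search_auth,
)
connector_id = r.json()["connector_id"]

# 3. Register the external model to the group and deploy it in one call
r = requests.post(
    f"{OPENSEARCH_ENDPOINT}/_plugins/_ml/models/_register?deploy=true",
    json={
        "name": "titan-multimodal-embeddings",
        "function_name": "remote",
        "model_group_id": model_group_id,
        "connector_id": connector_id,
    },
    auth=open_search_auth,
)
model_id = r.json().get("model_id")  # for remote models; otherwise resolve it from the returned task_id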

Create an OpenSearch search pipeline and enable hybrid search
A search pipeline runs inside the OpenSearch Service domain and can have three types of processors: search request processor, search response processor, and search phase results processor. For our search pipeline, we use the search phase results processor, which runs between the search phases at the coordinating node level and applies the normalization processor to normalize the scores from keyword and semantic search. For hybrid search, min_max normalization and the arithmetic_mean combination technique are preferred, but you can also try L2 normalization with the geometric_mean or harmonic_mean combination techniques, depending on your data and use case.
payload = {
    "phase_results_processors": [
        {
            "normalization-processor": {
                "normalization": {
                    "technique": "min_max"
                },
                "combination": {
                    "technique": "arithmetic_mean",
                    "parameters": {
                        "weights": [
                            OPENSEARCH_KEYWORD_WEIGHT,
                            1 - OPENSEARCH_KEYWORD_WEIGHT
                        ]
                    }
                }
            }
        }
    ]
}

response = requests.put(
    url=f"{OPENSEARCH_ENDPOINT}/_search/pipeline/" + OPENSEARCH_SEARCH_PIPELINE_NAME,
    json=payload,
    headers={"Content-Type": "application/json"},
    auth=open_search_auth
)
Create an OpenSearch Service index for storing the multimodal embeddings and metadata
For this post, we use the Amazon Berkeley Objects Dataset, which is a collection of 147,702 product listings with multilingual metadata and 398,212 unique catalog images. In this example, we use only shoe listings in en_US, as shown in the Prepare listings dataset for Amazon OpenSearch ingestion section of the notebook.
Use the following code to create an OpenSearch index to ingest the sample data:
response = opensearch_client.indices.create(
    index=OPENSEARCH_INDEX_NAME,
    body={
        "settings": {
            "index.knn": True,
            "number_of_shards": 2
        },
        "mappings": {
            "properties": {
                "amazon_titan_multimodal_embeddings": {
                    "type": "knn_vector",
                    "dimension": 1024,
                    "method": {
                        "name": "hnsw",
                        "engine": "lucene",
                        "parameters": {}
                    }
                }
            }
        }
    }
)
Ingest sample data to the OpenSearch Service index
In this step, you select the relevant features used for generating embeddings. The images are converted to Base64. The combination of a selected feature and a Base64 image is used to generate multimodal embeddings, which are stored in the OpenSearch Service index along with the metadata using an OpenSearch bulk operation; listings are ingested in batches (a minimal ingestion sketch follows).
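A minimal version of this ingestion loop, reusing the hypothetical titan_multimodal_embedding helper sketched earlier and the opensearch-py client, could look like the following. The field names and batch size are illustrative.

import base64

from opensearchpy import helpers

def ingest_listings(listings, batch_size=100):
    # listings: iterable of dicts with "item_id", "item_name", "image_path", and other metadata
    actions = []
    for item in listings:
        with open(item["image_path"], "rb") as f:
            image_b64 = base64.b64encode(f.read()).decode("utf-8")

        # Combine the selected text feature and the Base64 image into one embedding
        embedding = titan_multimodal_embedding(text=item["item_name"], image_base64=image_b64)

        actions.append({
            "_index": OPENSEARCH_INDEX_NAME,
            "_id": item["item_id"],
            "amazon_titan_multimodal_embeddings": embedding,
            **{k: v for k, v in item.items() if k != "image_path"},
        })

        # Flush to OpenSearch in batches using the bulk helper
        if len(actions) >= batch_size:
            helpers.bulk(opensearch_client, actions)
            actions = []

    if actions:
        helpers.bulk(opensearch_client, actions)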
Create OpenSearch Service query functions to test search functionality
With the sample data ingested, you can run queries against this data to test the hybrid search functionality. To facilitate this process, we created helper functions to perform the queries in the query workflow section of the notebook. In this section, you explore specific parts of the functions that differentiate the search methods.
Keyword search
For keyword search, send the following payload to the OpenSearch domain search endpoint:
payload = {
    "query": {
        "multi_match": {
            "query": query_text,
        }
    },
}
Semantic search
For semantic search, you can send the text and image as part of the payload. The model_id in the request refers to the external embeddings model that you registered earlier. OpenSearch invokes the model to convert the text and image into embeddings.
payload = {
    "query": {
        "neural": {
            "vector_embedding": {
                "query_text": query_text,
                "query_image": query_jpg_image,
                "model_id": model_id,
                "k": 5
            }
        }
    }
}
Hybrid search
This method uses the OpenSearch search pipeline you created. The payload contains both the keyword (multi_match) and neural search queries.
payload = {
    "query": {
        "hybrid": {
            "queries": [
                {
                    "multi_match": {
                        "query": query_text,
                    }
                },
                {
                    "neural": {
                        "vector_embedding": {
                            "query_text": query_text,
                            "query_image": query_jpg_image,
                            "model_id": model_id,
                            "k": 5
                        }
                    }
                }
            ]
        }
    }
}
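To execute the hybrid query, the request must go through the search pipeline you created, which is typically done by passing the pipeline name as the search_pipeline query parameter. The sketch below uses the same requests-based style as earlier; the item_name field is an assumed metadata field from the ingested listings.

response = requests.post(
    url=f"{OPENSEARCH_ENDPOINT}/{OPENSEARCH_INDEX_NAME}/_search",
    params={"search_pipeline": OPENSEARCH_SEARCH_PIPELINE_NAME},
    json=payload,
    headers={"Content-Type": "application/json"},
    auth=open_search_auth,
)

# Print the combined (normalized) score and an assumed metadata field for each hit
for hit in response.json()["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["item_name"])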
Test search methods
To compare the multiple search methods, you can query the index using query_text, which provides specific information about the desired item, and query_jpg_image, which conveys the overall style of the desired output.
query_text = "leather sandals in Petal Blush"
search_image_path = '16/16e48774.jpg'

Keyword search
The following output lists the top three keyword search results. The keyword search successfully located leather sandals in the color Petal Blush, but it didn’t take the desired style into consideration.

——————————————————————————————————————————–
Score: 8.4351 Item ID: B01MYDNG7C
Item Name: Amazon Brand – The Fix Women’s Cantu Ruffle Ankle Wrap Dress Sandal, Petal Blush, 9.5 B US
Fabric Type: Leather Material: None Color: Petal Blush Style: Cantu Ruffle Ankle Wrap Sandal
——————————————————————————————————————————–
Score: 8.4351 Item ID: B06XH8M37Q
Item Name: Amazon Brand – The Fix Women’s Farah Single Buckle Platform Dress Sandal, Petal Blush, 6.5 B US
Fabric Type: 100% Leather Material: None Color: Petal Blush Style: Farah Single Buckle Platform Sandal
——————————————————————————————————————————–
Score: 8.4351 Item ID: B01MSCV2YB
Item Name: Amazon Brand – The Fix Women’s Conley Lucite Heel Dress Sandal,Petal Blush,7.5 B US
Fabric Type: Leather Material: Suede Color: Petal Blush Style: Conley Lucite Heel Sandal
——————————————————————————————————————————–

 
Semantic search
Semantic search successfully located leather sandals and considered the desired style. However, similarity to the provided image took priority over the specific color provided in query_text.

——————————————————————————————————————————–
Score: 0.7072 Item ID: B01MZF96N7
Item Name: Amazon Brand – The Fix Women’s Bonilla Block Heel Cutout Tribal Dress Sandal, Havana Tan, 7 B US
Fabric Type: Leather Material: Suede Color: Havana Tan Style: Bonilla Block Heel Cutout Tribal Sandal
——————————————————————————————————————————–
Score: 0.7018 Item ID: B01MUG3C0Q
Item Name: Amazon Brand – The Fix Women’s Farrell Triangle-Cutout Square Toe Flat Dress Sandal, Light Rose/Gold, 7.5 B US
Fabric Type: Synthetic Material: Leather Color: Light Rose/Gold Style: Farrell Cutout Tribal Square Toe Flat Sandal
——————————————————————————————————————————–
Score: 0.6858 Item ID: B01MYDNG7C
Item Name: Amazon Brand – The Fix Women’s Cantu Ruffle Ankle Wrap Dress Sandal, Petal Blush, 9.5 B US
Fabric Type: Leather Material: None Color: Petal Blush Style: Cantu Ruffle Ankle Wrap Sandal
——————————————————————————————————————————–

 
Hybrid search
Hybrid search returned results similar to the semantic search because they use the same embeddings model. However, by combining the output of the keyword and semantic searches, the ranking of the Petal Blush sandal that most closely matches query_jpg_image increases, moving it to the top of the results list.

——————————————————————————————————————————–
Score: 0.6838 Item ID: B01MYDNG7C
Item Name: Amazon Brand – The Fix Women’s Cantu Ruffle Ankle Wrap Dress Sandal, Petal Blush, 9.5 B US
Fabric Type: Leather Material: None Color: Petal Blush Style: Cantu Ruffle Ankle Wrap Sandal
——————————————————————————————————————————–
Score: 0.6 Item ID: B01MZF96N7
Item Name: Amazon Brand – The Fix Women’s Bonilla Block Heel Cutout Tribal Dress Sandal, Havana Tan, 7 B US
Fabric Type: Leather Material: Suede Color: Havana Tan Style: Bonilla Block Heel Cutout Tribal Sandal
——————————————————————————————————————————–
Score: 0.5198 Item ID: B01MUG3C0Q
Item Name: Amazon Brand – The Fix Women’s Farrell Triangle-Cutout Square Toe Flat Dress Sandal, Light Rose/Gold, 7.5 B US
Fabric Type: Synthetic Material: Leather Color: Light Rose/Gold Style: Farrell Cutout Tribal Square Toe Flat Sandal
——————————————————————————————————————————–

 
Clean up
After you complete this walkthrough, clean up all the resources you created as part of this post. This is an important step to make sure you don’t incur any unexpected charges. If you used an existing OpenSearch Service domain, the Cleanup section of the notebook provides suggested cleanup actions, including deleting the index, undeploying the model, deleting the model, deleting the model group, and deleting the Amazon Bedrock connector. If you created an OpenSearch Service domain exclusively for this exercise, you can bypass these actions and delete the domain.
Conclusion
In this post, we explained how to implement multimodal hybrid search by combining keyword and semantic search capabilities using Amazon Bedrock and Amazon OpenSearch Service. We showcased a solution that uses Amazon Titan Multimodal Embeddings G1 to generate embeddings for text and images, enabling users to search using both modalities. The hybrid approach combines the strengths of keyword search and semantic search, delivering accurate and relevant results to customers.
We encourage you to test the notebook in your own account and get firsthand experience with hybrid search variations. In addition to the outputs shown in this post, we provide a few variations in the notebook. If you’re interested in using custom embeddings models in Amazon SageMaker AI instead, see Hybrid Search with Amazon OpenSearch Service. If you want a solution that offers semantic search only, see Build a contextual text and image search engine for product recommendations using Amazon Bedrock and Amazon OpenSearch Serverless and Build multimodal search with Amazon OpenSearch Service.

About the Authors
Renan Bertolazzi is an Enterprise Solutions Architect helping customers realize the potential of cloud computing on AWS. In this role, Renan is a technical leader advising executives and engineers on cloud solutions and strategies designed to innovate, simplify, and deliver results.
Birender Pal is a Senior Solutions Architect at AWS, where he works with strategic enterprise customers to design scalable, secure and resilient cloud architectures. He supports digital transformation initiatives with a focus on cloud-native modernization, machine learning, and Generative AI. Outside of work, Birender enjoys experimenting with recipes from around the world.
Sarath Krishnan is a Senior Solutions Architect with Amazon Web Services. He is passionate about enabling enterprise customers on their digital transformation journey. Sarath has extensive experience in architecting highly available, scalable, cost-effective, and resilient applications on the cloud. His area of focus includes DevOps, machine learning, MLOps, and generative AI.

Protect sensitive data in RAG applications with Amazon Bedrock

Retrieval Augmented Generation (RAG) applications have become increasingly popular due to their ability to enhance generative AI tasks with contextually relevant information. Implementing RAG-based applications requires careful attention to security, particularly when handling sensitive data. The protection of personally identifiable information (PII), protected health information (PHI), and confidential business data is crucial because this information flows through RAG systems. Failing to address these security considerations can lead to significant risks and potential data breaches. For healthcare organizations, financial institutions, and enterprises handling confidential information, these risks can result in regulatory compliance violations and breach of customer trust. See the OWASP Top 10 for Large Language Model Applications to learn more about the unique security risks associated with generative AI applications.
Developing a comprehensive threat model for your generative AI applications can help you identify potential vulnerabilities related to sensitive data leakage, prompt injections, unauthorized data access, and more. To assist in this effort, AWS provides a range of generative AI security strategies that you can use to create appropriate threat models.
Amazon Bedrock Knowledge Bases is a fully managed capability that simplifies the management of the entire RAG workflow, empowering organizations to give foundation models (FMs) and agents contextual information from your private data sources to deliver more relevant and accurate responses tailored to your specific needs. Additionally, with Amazon Bedrock Guardrails, you can implement safeguards in your generative AI applications that are customized to your use cases and responsible AI policies. You can redact sensitive information such as PII to protect privacy using Amazon Bedrock Guardrails.
RAG workflow: Converting data to actionable knowledge
RAG consists of two major steps:

Ingestion – Preprocessing unstructured data, which includes converting the data into text documents and splitting the documents into chunks. Document chunks are then encoded with an embedding model to convert them to document embeddings. These encoded document embeddings along with the original document chunks in the text are then stored to a vector store, such as Amazon OpenSearch Service.
Augmented retrieval – At query time, the user’s query is first encoded with the same embedding model to convert the query into a query embedding. The generated query embedding is then used to perform a similarity search on the stored document embeddings to find and retrieve semantically similar document chunks to the query. After the document chunks are retrieved, the user prompt is augmented by passing the retrieved chunks as additional context, so that the text generation model can answer the user query using the retrieved context. If sensitive data isn’t sanitized before ingestion, this might lead to retrieving sensitive data from the vector store and inadvertently leak the sensitive data to unauthorized users as part of the model response.

The following diagram shows the architectural workflow of a RAG system, illustrating how a user’s query is processed through multiple stages to generate an informed response.

Solution overview
In this post, we present two architecture patterns for protecting sensitive data when building RAG-based applications with Amazon Bedrock Knowledge Bases: data redaction at the storage level and role-based access to sensitive data.
Data redaction at storage level – Identifying and redacting (or masking) sensitive data before storing it in the vector store (ingestion) using Amazon Bedrock Knowledge Bases. This zero-trust approach to data sensitivity reduces the risk of sensitive information being inadvertently disclosed to unauthorized users.
Role-based access to sensitive data – Controlling selective access to sensitive information based on user roles and permissions during retrieval. This approach is best in situations where sensitive data needs to be stored in the vector store, such as in healthcare settings with distinct user roles like administrators (doctors) and non-administrators (nurses or support personnel).
For all data stored in Amazon Bedrock, the AWS shared responsibility model applies.
Let’s dive in to understand how to implement the data redaction at storage level and role-based access architecture patterns effectively.
Scenario 1: Identify and redact sensitive data before ingesting into the vector store
The ingestion flow implements a four-step process to help protect sensitive data when building RAG applications with Amazon Bedrock:

Source document processing – An AWS Lambda function monitors incoming text documents landing in a source Amazon Simple Storage Service (Amazon S3) bucket and triggers an Amazon Comprehend PII redaction job to identify and redact (or mask) sensitive data in the documents. An Amazon EventBridge rule triggers the Lambda function every 5 minutes. The document processing pipeline described here processes text documents only. To handle documents containing embedded images, you should implement additional preprocessing steps to extract and analyze images separately before ingestion.
PII identification and redaction – The Amazon Comprehend PII redaction job analyzes the text content to identify and redact PII entities. For example, the job identifies and redacts sensitive data entities like name, email, address, and other financial PII entities.
Deep security scanning – After redaction, documents move to another folder where Amazon Macie verifies redaction effectiveness and identifies any remaining sensitive data objects. Documents flagged by Macie go to a quarantine bucket for manual review, while cleared documents move to a redacted bucket ready for ingestion. For more details on data ingestion, see Sync your data with your Amazon Bedrock knowledge base.
Secure knowledge base integration – Redacted documents are ingested into the knowledge base through a data ingestion job. In case of multi-modal content, for enhanced security, consider implementing:

A dedicated image extraction and processing pipeline.
Image analysis to detect and redact sensitive visual information.
Amazon Bedrock Guardrails to filter inappropriate image content during retrieval.

This multi-layered approach focuses on securing text content while highlighting the importance of implementing additional safeguards for image processing. Organizations should evaluate their multi-modal document requirements and extend the security framework accordingly.
Ingestion flow
The following illustration demonstrates a secure document processing pipeline for handling sensitive data before ingestion into Amazon Bedrock Knowledge Bases.

The high-level steps are as follows:

The document ingestion flow begins when documents containing sensitive data are uploaded to a monitored inputs folder in the source bucket. An EventBridge rule triggers a Lambda function (ComprehendLambda).
The ComprehendLambda function monitors for new files in the inputs folder of the source bucket and moves landed files to a processing folder. It then launches an asynchronous Amazon Comprehend PII redaction analysis job and records the job ID and status in an Amazon DynamoDB JobTracking table for monitoring job completion (see the sketch after this list). The Amazon Comprehend PII redaction job automatically redacts sensitive elements such as names, addresses, phone numbers, Social Security numbers, driver’s license IDs, and banking information, replacing each identified PII entity with a placeholder token for its entity type, such as [NAME] or [SSN]. The entities to mask can be configured using RedactionConfig. For more information, see Redacting PII entities with asynchronous jobs (API). The MaskMode in RedactionConfig is set to REPLACE_WITH_PII_ENTITY_TYPE instead of MASK; redacting with a MaskCharacter would affect the quality of retrieved documents, because many documents could contain the same MaskCharacter, thereby affecting retrieval quality. After completion, the redacted files move to the for_macie_scan folder for secondary scanning.
The secondary verification phase employs Macie for additional sensitive data detection on the redacted files. Another Lambda function (MacieLambda) monitors the completion of the Amazon Comprehend PII redaction job. When the job is complete, the function triggers a Macie one-time sensitive data detection job with files in the for_macie_scan folder.
The final stage integrates with the Amazon Bedrock knowledge base. The findings from Macie determine the next steps: files with high severity ratings (3 or higher) are moved to a quarantine folder for human review by authorized personnel with appropriate permissions and access controls, whereas files with low severity ratings are moved to a designated redacted bucket, which then triggers a data ingestion job to the Amazon Bedrock knowledge base.
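A trimmed-down sketch of how the ComprehendLambda function might start the asynchronous redaction job is shown below. The bucket names, prefixes, and role ARN are placeholders; the call itself is the standard Comprehend StartPiiEntitiesDetectionJob API.

import boto3

comprehend = boto3.client("comprehend")

response = comprehend.start_pii_entities_detection_job(
    JobName="redact-incoming-documents",
    Mode="ONLY_REDACTION",
    RedactionConfig={
        "PiiEntityTypes": ["ALL"],
        # Replace each entity with its type, for example [NAME], rather than a mask character
        "MaskMode": "REPLACE_WITH_PII_ENTITY_TYPE",
    },
    InputDataConfig={
        "S3Uri": "s3://source-bucket/processing/",
        "InputFormat": "ONE_DOC_PER_FILE",
    },
    OutputDataConfig={"S3Uri": "s3://source-bucket/for_macie_scan/"},
    DataAccessRoleArn="arn:aws:iam::123456789012:role/ComprehendDataAccessRole",
    LanguageCode="en",
)

# Persist the job ID so the JobTracking table can poll for completion
job_id = response["JobId"]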

This process helps prevent sensitive details from being exposed when the model generates responses based on retrieved data.
Augmented retrieval flow
The augmented retrieval flow diagram shows how user queries are processed securely. It illustrates the complete workflow from user authentication through Amazon Cognito to response generation with Amazon Bedrock, including guardrail interventions that help prevent policy violations in both inputs and outputs.

The high-level steps are as follows:

For our demo, we use a web application UI built using Streamlit. The web application launches with a login form with user name and password fields.
The user enters the credentials and logs in. User credentials are authenticated using Amazon Cognito user pools. Amazon Cognito acts as our OpenID Connect (OIDC) identity provider (IdP) to provide authentication and authorization services for this application. After authentication, Amazon Cognito generates and returns identity, access, and refresh tokens in JSON Web Token (JWT) format to the web application. Refer to Understanding user pool JSON web tokens (JWTs) for more information.
After the user is authenticated, they are logged in to the web application, where an AI assistant UI is presented to the user. The user enters their query (prompt) in the assistant’s text box. The query is then forwarded using a REST API call to an Amazon API Gateway endpoint along with the access tokens in the header.
API Gateway forwards the payload along with the claims included in the header to a conversation orchestrator Lambda function.
The conversation orchestrator Lambda function processes the user prompt and model parameters received from the UI and calls the RetrieveAndGenerate API on the Amazon Bedrock knowledge base. Input guardrails are first applied to this request to perform input validation on the user query.

The guardrail evaluates and applies predefined responsible AI policies using content filters, denied topic filters and word filters on user input. For more information on creating guardrail filters, see Create a guardrail.
If the predefined input guardrail policies are triggered on the user input, the guardrails intervene and return a preconfigured message like, “Sorry, your query violates our usage policy.”
Requests that don’t trigger a guardrail policy retrieve documents from the knowledge base and generate a response using the RetrieveAndGenerate API. Optionally, if you choose to run Retrieve separately, guardrails can also be applied at that stage. Guardrails during document retrieval can help block sensitive data returned from the vector store.

During retrieval, Amazon Bedrock Knowledge Bases encodes the user query using the Amazon Titan Text v2 embeddings model to generate a query embedding.
Amazon Bedrock Knowledge Bases performs a similarity search with the query embedding against the document embeddings in the OpenSearch Service vector store and retrieves top-k chunks. Optionally, post-retrieval, you can incorporate a reranking model to improve the retrieved results quality from the OpenSearch vector store. Refer to Improve the relevance of query responses with a reranker model in Amazon Bedrock for more details.
Finally, the user prompt is augmented with the retrieved document chunks from the vector store as context, and the final prompt is sent to an Amazon Bedrock foundation model (FM) for inference. Output guardrail policies are applied again after response generation. If the predefined output guardrail policies are triggered, the model returns a predefined response like “Sorry, your query violates our usage policy.” If no policies are triggered, the large language model (LLM)-generated response is sent to the user (a minimal sketch of this call follows this list).
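A compact sketch of the orchestrator’s call, using the boto3 bedrock-agent-runtime client’s RetrieveAndGenerate API with a guardrail attached to generation, could look like the following. The knowledge base ID, model ARN, and guardrail identifiers are placeholders.

import boto3

bedrock_agent_runtime = boto3.client("bedrock-agent-runtime")

# user_query comes from the API Gateway payload forwarded by the orchestrator
response = bedrock_agent_runtime.retrieve_and_generate(
    input={"text": user_query},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "KB_ID_PLACEHOLDER",
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-haiku-20240307-v1:0",
            "generationConfiguration": {
                "guardrailConfiguration": {
                    "guardrailId": "GUARDRAIL_ID_PLACEHOLDER",
                    "guardrailVersion": "1",
                }
            },
        },
    },
)

# The generated answer; citations to the retrieved chunks are also available in the response
answer = response["output"]["text"]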

To deploy Scenario 1, find the instructions here on GitHub.
Scenario 2: Implement role-based access to PII data during retrieval
In this scenario, we demonstrate a comprehensive security approach that combines role-based access control (RBAC) with intelligent PII guardrails for RAG applications. It integrates Amazon Bedrock with AWS identity services to automatically enforce security through different guardrail configurations for admin and non-admin users.
The solution uses the metadata filtering capabilities of Amazon Bedrock Knowledge Bases to dynamically filter documents during similarity searches using metadata attributes assigned before ingestion. For example, admin and non-admin metadata attributes are created and attached to relevant documents before the ingestion process. During retrieval, the system returns only the documents with metadata matching the user’s security role and permissions and applies the relevant guardrail policies to either mask or block sensitive data detected on the LLM output.
This metadata-driven approach, combined with features like custom guardrails, real-time PII detection, masking, and comprehensive access logging creates a robust framework that maintains the security and utility of the RAG application while enforcing RBAC.
The following diagram illustrates how RBAC works with metadata filtering in the vector database.

For a detailed understanding of how metadata filtering works, see Amazon Bedrock Knowledge Bases now supports metadata filtering to improve retrieval accuracy.
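As a rough illustration of how the role claim can drive both the guardrail choice and the metadata filter, the snippet below builds the retrieval configuration passed to the Amazon Bedrock Knowledge Bases Retrieve or RetrieveAndGenerate API. The access_level metadata key, the guardrail ID variables, and the user_claims dictionary are assumptions for this example.

# Choose the guardrail and metadata filter based on the caller's role claim
is_admin = "admin" in user_claims.get("cognito:groups", [])

retrieval_configuration = {
    "vectorSearchConfiguration": {
        "numberOfResults": 5,
        # Only return chunks whose metadata matches the caller's access level
        "filter": {
            "equals": {"key": "access_level", "value": "admin" if is_admin else "non-admin"}
        },
    }
}

guardrail_id = ADMIN_GUARDRAIL_ID if is_admin else NON_ADMIN_GUARDRAIL_ID

The retrieval_configuration is then passed as the retrievalConfiguration under knowledgeBaseConfiguration in the RetrieveAndGenerate request, alongside the selected guardrail.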
Augmented retrieval flow
The augmented retrieval flow diagram shows how user queries are processed securely based on role-based access.

The workflow consists of the following steps:

The user is authenticated using an Amazon Cognito user pool. It generates a validation token after successful authentication.
The user query is sent using an API call along with the authentication token through Amazon API Gateway.
Amazon API Gateway forwards the payload and claims to an integration Lambda function.
The Lambda function extracts the claims from the header, checks the user role, and determines whether to use an admin guardrail or a non-admin guardrail based on the access level.
Next, the Amazon Bedrock Knowledge Bases RetrieveAndGenerate API is invoked along with the guardrail applied on the user input.
Amazon Bedrock Knowledge Bases embeds the query using the Amazon Titan Text v2 embeddings model.
Amazon Bedrock Knowledge Bases performs similarity searches on the OpenSearch Service vector database and retrieves relevant chunks (optionally, you can improve the relevance of query responses using a reranker model in the knowledge base).
The user prompt is augmented with the retrieved context from the previous step and sent to the Amazon Bedrock FM for inference.
Based on the user role, the LLM output is evaluated against defined Responsible AI policies using either admin or non-admin guardrails.
Based on guardrail evaluation, the system either returns a “Sorry! Cannot Respond” message if the guardrail intervenes, or delivers an appropriate response with no masking on the output for admin users or sensitive data masked for non-admin users.

To deploy Scenario 2, find the instructions here on GitHub.
This security architecture combines Amazon Bedrock guardrails with granular access controls to automatically manage sensitive information exposure based on user permissions. The multi-layered approach makes sure organizations maintain security compliance while fully utilizing their knowledge base, proving security and functionality can coexist.
Customizing the solution
The solution offers several customization points to enhance its flexibility and adaptability:

Integration with external APIs – You can integrate existing PII detection and redaction solutions with this system. The Lambda function can be modified to use custom APIs for PHI or PII handling before calling the Amazon Bedrock Knowledge Bases API.
Multi-modal processing – Although the current solution focuses on text, it can be extended to handle images containing PII by incorporating image-to-text conversion and caption generation. For more information about using Amazon Bedrock for processing multi-modal content during ingestion, see Parsing options for your data source.
Custom guardrails – Organizations can implement additional specialized security measures tailored to their specific use cases.
Structured data handling – For queries involving structured data, the solution can be customized to include Amazon Redshift as a structured data store as opposed to OpenSearch Service. Data masking and redaction on Amazon Redshift can be achieved by applying dynamic data masking (DDM) policies, including fine-grained DDM policies like role-based access control and column-level policies using conditional dynamic data masking.
Agentic workflow integration – When incorporating an Amazon Bedrock knowledge base with an agentic workflow, additional safeguards can be implemented to protect sensitive data from external sources, such as API calls, tool use, agent action groups, session state, and long-term agentic memory.
Response streaming support – The current solution uses a REST API Gateway endpoint that doesn’t support streaming. For streaming capabilities, consider WebSocket APIs in API Gateway, Application Load Balancer (ALB), or custom solutions with chunked responses using client-side reassembly or long-polling techniques.

With these customization options, you can tailor the solution to your specific needs, providing a robust and flexible security framework for your RAG applications. This approach not only protects sensitive data but also maintains the utility and efficiency of the knowledge base, allowing users to interact with the system while automatically enforcing role-appropriate information access and PII handling.
Shared security responsibility: The customer’s role
At AWS, security is our top priority and security in the cloud is a shared responsibility between AWS and our customers. With AWS, you control your data by using AWS services and tools to determine where your data is stored, how it is secured, and who has access to it. Services such as AWS Identity and Access Management (IAM) provide robust mechanisms for securely controlling access to AWS services and resources.
To enhance your security posture further, services like AWS CloudTrail and Amazon Macie offer advanced compliance, detection, and auditing capabilities. When it comes to encryption, AWS CloudHSM and AWS Key Management Service (KMS) enable you to generate and manage encryption keys with confidence.
For organizations seeking to establish governance and maintain data residency controls, AWS Control Tower offers a comprehensive solution. For more information on Data protection and Privacy, refer to Data Protection and Privacy at AWS.
While our solution demonstrates the use of PII detection and redaction techniques, it does not provide an exhaustive list of all PII types or detection methods. As a customer, you bear the responsibility for implementing the appropriate PII detection types and redaction methods using AWS services, including Amazon Bedrock Guardrails and other open-source libraries. The regular expressions configured in Amazon Bedrock Guardrails within this solution serve as a reference example only and do not cover all possible variations for detecting PII types. For instance, date of birth (DOB) formats can vary widely. Therefore, it falls on you to configure Amazon Bedrock Guardrails and policies to accurately detect the PII types relevant to your use case.
Amazon Bedrock maintains strict data privacy standards. The service does not store or log your prompts and completions, nor does it use them to train AWS models or share them with third parties. We implement this through our Model Deployment Account architecture – each AWS Region where Amazon Bedrock is available has a dedicated deployment account per model provider, managed exclusively by the Amazon Bedrock service team. Model providers have no access to these accounts. When a model is delivered to AWS, Amazon Bedrock performs a deep copy of the provider’s inference and training software into these controlled accounts for deployment, making sure that model providers cannot access Amazon Bedrock logs or customer prompts and completions.
Ultimately, while we provide the tools and infrastructure, the responsibility for securing your data using AWS services rests with you, the customer. This shared responsibility model makes sure that you have the flexibility and control to implement security measures that align with your unique requirements and compliance needs, while we maintain the security of the underlying cloud infrastructure. For comprehensive information about Amazon Bedrock security, please refer to the Amazon Bedrock Security documentation.
Conclusion
In this post, we explored two approaches for securing sensitive data in RAG applications using Amazon Bedrock. The first approach focused on identifying and redacting sensitive data before ingestion into an Amazon Bedrock knowledge base, and the second demonstrated a fine-grained RBAC pattern for managing access to sensitive information during retrieval. These solutions represent just two possible approaches among many for securing sensitive data in generative AI applications.
Security is a multi-layered concern that requires careful consideration across all aspects of your application architecture. Looking ahead, we plan to dive deeper into RBAC for sensitive data within structured data stores when used with Amazon Bedrock Knowledge Bases. This can provide additional granularity and control over data access patterns while maintaining security and compliance requirements. Securing sensitive data in RAG applications requires ongoing attention to evolving security best practices, regular auditing of access patterns, and continuous refinement of your security controls as your applications and requirements grow.
To enhance your understanding of Amazon Bedrock security implementation, explore these additional resources:

Implementing least privilege access for Amazon Bedrock
Safeguard your generative AI workloads from prompt injections

The complete source code and deployment instructions for these solutions are available in our GitHub repository.
We encourage you to explore the repository for detailed implementation guidance and customize the solutions based on your specific requirements using the customization points discussed earlier.

About the authors
Praveen Chamarthi brings exceptional expertise to his role as a Senior AI/ML Specialist at Amazon Web Services, with over two decades in the industry. His passion for Machine Learning and Generative AI, coupled with his specialization in ML inference on Amazon SageMaker and Amazon Bedrock, enables him to empower organizations across the Americas to scale and optimize their ML operations. When he’s not advancing ML workloads, Praveen can be found immersed in books or enjoying science fiction films. Connect with him on LinkedIn to follow his insights.
Srikanth Reddy is a Senior AI/ML Specialist with Amazon Web Services. He is responsible for providing deep, domain-specific expertise to enterprise customers, helping them use AWS AI and ML capabilities to their fullest potential. You can find him on LinkedIn.
Dhawal Patel is a Principal Machine Learning Architect at AWS. He has worked with organizations ranging from large enterprises to mid-sized startups on problems related to distributed computing and artificial intelligence. He focuses on deep learning, including NLP and computer vision domains. He helps customers achieve high-performance model inference on Amazon SageMaker.
Vivek Bhadauria is a Principal Engineer at Amazon Bedrock with almost a decade of experience in building AI/ML services. He now focuses on building generative AI services such as Amazon Bedrock Agents and Amazon Bedrock Guardrails. In his free time, he enjoys biking and hiking.
Brandon Rooks Sr. is a Cloud Security Professional with 20+ years of experience in the IT and Cybersecurity field. Brandon joined AWS in 2019, where he dedicates himself to helping customers proactively enhance the security of their cloud applications and workloads. Brandon is a lifelong learner, and holds the CISSP, AWS Security Specialty, and AWS Solutions Architect Professional certifications. Outside of work, he cherishes moments with his family, engaging in various activities such as sports, gaming, music, volunteering, and traveling.
Vikash Garg is a Principal Engineer at Amazon Bedrock with almost 4 years of experience in building AI/ML services. He has a decade of experience in building large-scale systems. He now focuses on building the generative AI service Amazon Bedrock Guardrails. In his free time, he enjoys hiking and traveling.

Meet VoltAgent: A TypeScript AI Framework for Building and Orchestrati …

VoltAgent is an open-source TypeScript framework designed to streamline the creation of AI‑driven applications by offering modular building blocks and abstractions for autonomous agents. It addresses the complexity of directly working with large language models (LLMs), tool integrations, and state management by providing a core engine that handles these concerns out-of-the-box. Developers can define agents with specific roles, equip them with memory, and tie them to external tools without having to reinvent foundational code for each new project.

Unlike DIY solutions that require extensive boilerplate and custom infrastructure, or no-code platforms that often impose vendor lock-in and limited extensibility, VoltAgent strikes a middle ground by giving developers full control over provider choice, prompt design, and workflow orchestration. It integrates seamlessly into existing Node.js environments, enabling teams to start small, build single assistants, and scale up to complex multi‑agent systems coordinated by supervisor agents.

The Challenge of Building AI Agents

Creating intelligent assistants typically involves three major pain points:  

Model Interaction Complexity: Managing calls to LLM APIs, handling retries, latency, and error states.  

Stateful Conversations: Persisting user context across sessions to achieve natural, coherent dialogues.  

External System Integration: Connecting to databases, APIs, and third‑party services to perform real‑world tasks.

Traditional approaches either require you to write custom code for each of these layers, resulting in fragmented and hard-to-maintain repositories, or lock you into proprietary platforms that sacrifice flexibility. VoltAgent abstracts these layers into reusable packages, so developers can focus on crafting agent logic rather than plumbing.

Core Architecture and Modular Packages

At its core, VoltAgent consists of a Core Engine package (‘@voltagent/core’) responsible for agent lifecycle, message routing, and tool invocation. Around this core, a suite of extensible packages provides specialized features:

Multi‑Agent Systems: Supervisor agents coordinate sub‑agents, delegating tasks based on custom logic and maintaining shared memory channels.  

Tooling & Integrations: ‘createTool’ utilities and type-safe tool definitions (via Zod schemas) enable agents to invoke HTTP APIs, database queries, or local scripts as if they were native LLM functions.  

Voice Interaction: The ‘@voltagent/voice’ package provides speech-to-text and text-to-speech support, enabling agents to speak and listen in real-time.  

Model Control Protocol (MCP): Standardized protocol support for inter‑process or HTTP‑based tool servers, facilitating vendor‑agnostic tool orchestration.  

Retrieval‑Augmented Generation (RAG): Integrate vector stores and retriever agents to fetch relevant context before generating responses.  

Memory Management: Pluggable memory providers (in-memory, LibSQL/Turso, Supabase) enable agents to retain past interactions, ensuring continuity of context.  

Observability & Debugging: A separate VoltAgent Console provides a visual interface for inspecting agent states, logs, and conversation flows in real-time.

Getting Started: Automatic Setup

VoltAgent includes a CLI tool, ‘create-voltagent-app’, to scaffold a fully configured project in seconds. This automatic setup prompts for your project name and preferred package manager, installs dependencies, and generates starter code, including a simple agent definition so that you can run your first AI assistant with a single command.

# Using npm
npm create voltagent-app@latest my-voltagent-app

# Or with pnpm
pnpm create voltagent-app my-voltagent-app

cd my-voltagent-app
npm run dev

Code Source

At this point, you can open the VoltAgent Console in your browser, locate your new agent, and start chatting directly in the built‑in UI. The CLI’s built‑in ‘tsx watch’ support means any code changes in ‘src/’ automatically restart the server.

Manual Setup and Configuration

For teams that prefer fine‑grained control over their project configuration, VoltAgent provides a manual setup path. After creating a new npm project and adding TypeScript support, developers install the core framework and any desired packages:

// tsconfig.json
{
  "compilerOptions": {
    "target": "ES2020",
    "module": "NodeNext",
    "outDir": "dist",
    "strict": true,
    "esModuleInterop": true
  },
  "include": ["src"]
}

Code Source

# Development deps
npm install --save-dev typescript tsx @types/node @voltagent/cli

# Framework deps
npm install @voltagent/core @voltagent/vercel-ai @ai-sdk/openai zod

Code Source

A minimal ‘src/index.ts’ might look like this:

import { VoltAgent, Agent } from "@voltagent/core";
import { VercelAIProvider } from "@voltagent/vercel-ai";
import { openai } from "@ai-sdk/openai";

// Define a simple agent
const agent = new Agent({
  name: "my-agent",
  description: "A helpful assistant that answers questions without using tools",
  llm: new VercelAIProvider(),
  model: openai("gpt-4o-mini"),
});

// Initialize VoltAgent
new VoltAgent({
  agents: { agent },
});

Code Source

Adding an ‘.env’ file with your ‘OPENAI_API_KEY’ and updating ‘package.json’ scripts to include ‘"dev": "tsx watch --env-file=.env ./src"’ completes the local development setup. Running ‘npm run dev’ launches the server and automatically connects to the developer console.

Building Multi‑Agent Workflows

Beyond single agents, VoltAgent truly shines when orchestrating complex workflows via Supervisor Agents. In this paradigm, specialized sub‑agents handle discrete tasks, such as fetching GitHub stars or contributors, while a supervisor orchestrates the sequence and aggregates results:

import { Agent, VoltAgent } from "@voltagent/core";
import { VercelAIProvider } from "@voltagent/vercel-ai";
import { openai } from "@ai-sdk/openai";

// fetchRepoStarsTool and fetchRepoContributorsTool are tool definitions
// created elsewhere (for example, with createTool and Zod schemas)

const starsFetcher = new Agent({
  name: "Stars Fetcher",
  description: "Fetches star count for a GitHub repo",
  llm: new VercelAIProvider(),
  model: openai("gpt-4o-mini"),
  tools: [fetchRepoStarsTool],
});

const contributorsFetcher = new Agent({
  name: "Contributors Fetcher",
  description: "Fetches contributors for a GitHub repo",
  llm: new VercelAIProvider(),
  model: openai("gpt-4o-mini"),
  tools: [fetchRepoContributorsTool],
});

const supervisor = new Agent({
  name: "Supervisor",
  description: "Coordinates data gathering and analysis",
  llm: new VercelAIProvider(),
  model: openai("gpt-4o-mini"),
  subAgents: [starsFetcher, contributorsFetcher],
});

new VoltAgent({ agents: { supervisor } });


In this setup, when a user inputs a repository URL, the supervisor routes the request to each sub-agent in turn, gathers their outputs, and synthesizes a final report, demonstrating VoltAgent’s ability to structure multi-step AI pipelines with minimal boilerplate.
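The snippet above assumes that fetchRepoStarsTool and fetchRepoContributorsTool are already defined elsewhere in the project. As a rough sketch of what such a tool could look like, here is one possible definition using the createTool utility and Zod schema support mentioned earlier; the exact option names (name, description, parameters, execute) and the GitHub endpoint used here are our illustrative assumptions, not code taken from the VoltAgent docs:

import { createTool } from "@voltagent/core";
import { z } from "zod";

// Illustrative only: a typed tool that calls the public GitHub REST API.
// The option shape ({ name, description, parameters, execute }) is assumed
// from the docs' "type-safe tool definitions via Zod schemas".
const fetchRepoStarsTool = createTool({
  name: "fetch_repo_stars",
  description: "Return the star count for a GitHub repository",
  parameters: z.object({
    owner: z.string().describe("Repository owner, e.g. VoltAgent"),
    repo: z.string().describe("Repository name, e.g. voltagent"),
  }),
  execute: async ({ owner, repo }) => {
    // Uses the global fetch available in Node 18+; unauthenticated calls
    // are subject to GitHub rate limits.
    const res = await fetch(`https://api.github.com/repos/${owner}/${repo}`);
    if (!res.ok) throw new Error(`GitHub API error: ${res.status}`);
    const data = await res.json();
    return { stars: data.stargazers_count };
  },
});

A fetchRepoContributorsTool would follow the same pattern against the repository's /contributors endpoint, and both tools are then passed to their respective sub-agents via the tools array shown above.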

Observability and Telemetry Integration

Production‑grade AI systems require more than code; they demand visibility into runtime behavior, performance metrics, and error conditions. VoltAgent’s observability suite includes integrations with popular platforms like Langfuse, enabling automated export of telemetry data:

import { VoltAgent } from "@voltagent/core";
import { LangfuseExporter } from "langfuse-vercel";

export const volt = new VoltAgent({
  telemetry: {
    serviceName: "ai",
    enabled: true,
    export: {
      type: "custom",
      exporter: new LangfuseExporter({
        publicKey: process.env.LANGFUSE_PUBLIC_KEY,
        secretKey: process.env.LANGFUSE_SECRET_KEY,
        baseUrl: process.env.LANGFUSE_BASEURL,
      }),
    },
  },
});


This configuration wraps all agent interactions with metrics and traces, which are sent to Langfuse for real-time dashboards, alerting, and historical analysis, equipping teams to maintain service-level agreements (SLAs) and quickly diagnose issues in AI-driven workflows.

VoltAgent’s versatility empowers a broad spectrum of applications:

Customer Support Automation: Agents that retrieve order status, process returns, and escalate complex issues to human reps, all while maintaining conversational context.  

Intelligent Data Pipelines: Agents orchestrate data extraction from APIs, transform records, and push results to business intelligence dashboards, fully automated and monitored.  

DevOps Assistants: Agents that analyze CI/CD logs, suggest optimizations, and even trigger remediation scripts via secure tool calls.  

Voice‑Enabled Interfaces: Deploy agents in kiosks or mobile apps that listen to user queries and respond with synthesized speech, enhanced by memory for personalized experiences.  

RAG Systems: Agents that first retrieve domain‑specific documents (e.g., legal contracts, technical manuals) and then generate precise answers, blending vector search with LLM generation.  

Enterprise Integration: Workflow agents that coordinate across Slack, Salesforce, and internal databases, automating cross‑departmental processes with full audit trails.

By abstracting common patterns such as tool invocation, memory, multi‑agent coordination, and observability, VoltAgent reduces integration time from weeks to days, making it a powerful choice for teams seeking to infuse AI across products and services.

In conclusion, VoltAgent reimagines AI agent development by offering a structured yet flexible framework that scales from single-agent prototypes to enterprise-level multi-agent systems. Its modular architecture, with a robust core, rich ecosystem packages, and observability tooling, allows developers to focus on domain logic rather than plumbing. Whether you’re building a chat assistant, automating complex workflows, or integrating AI into existing applications, VoltAgent provides the speed, maintainability, and control you need to bring sophisticated AI solutions to production quickly. By combining easy onboarding via ‘create-voltagent-app’, manual configuration options for power users, and deep extensibility through tools and memory providers, VoltAgent positions itself as the definitive TypeScript framework for AI agent orchestration, helping teams deliver intelligent applications with confidence and speed.

Sources

https://voltagent.dev/docs/ 

https://github.com/VoltAgent/voltagent?tab=readme-ov-file

The post Meet VoltAgent: A TypeScript AI Framework for Building and Orchestrating Scalable AI Agents appeared first on MarkTechPost.

Decoupled Diffusion Transformers: Accelerating High-Fidelity Image Generation via Semantic-Detail Separation and Encoder Sharing

Diffusion Transformers have demonstrated outstanding performance in image generation tasks, surpassing traditional models, including GANs and autoregressive architectures. They operate by gradually adding noise to images during a forward diffusion process and then learning to reverse this process through denoising, which helps the model approximate the underlying data distribution. Unlike the commonly used UNet-based diffusion models, Diffusion Transformers apply the transformer architecture, which has proven effective after sufficient training. However, their training process is slow and computationally intensive. A key limitation lies in their architecture: during each denoising step, the same modules must both encode low-frequency semantic information and decode high-frequency details, which creates an optimization conflict between the two tasks.

To address the slow training and performance bottlenecks, recent work has focused on improving the efficiency of Diffusion Transformers through various strategies. These include utilizing optimized attention mechanisms, such as linear and sparse attention, to reduce computational costs, and introducing more effective sampling techniques, including log-normal resampling and loss reweighting, to stabilize the learning process. Additionally, methods like REPA, RCG, and DoD incorporate domain-specific inductive biases, while masked modeling enforces structured feature learning, boosting the model’s reasoning capabilities. Models like DiT, SiT, SD3, Lumina, and PixArt have extended the diffusion transformer framework to advanced areas such as text-to-image and text-to-video generation. 

Researchers from Nanjing University and ByteDance Seed Vision introduce the Decoupled Diffusion Transformer (DDT), which separates the model into a dedicated condition encoder for semantic extraction and a velocity decoder for detailed generation. This decoupled design enables faster convergence and improved sample quality. On the ImageNet 256×256 and 512×512 benchmarks, their DDT-XL/2 model achieves state-of-the-art FID scores of 1.31 and 1.28, respectively, with up to 4× faster training. To further accelerate inference, they propose a statistical dynamic programming method that optimally shares encoder outputs across denoising steps with minimal impact on performance.

The DDT introduces a condition encoder and a velocity decoder to handle low- and high-frequency components in image generation separately. The encoder extracts semantic features (zt) from noisy inputs, timesteps, and class labels, which are then used by the decoder to estimate the velocity field. To ensure consistency of zt across steps, representation alignment and decoder supervision are applied. During inference, a shared self-condition mechanism reduces computation by reusing zt at certain timesteps. A dynamic programming approach identifies the optimal timesteps for recomputing zt, minimizing performance loss while accelerating the sampling process.

The researchers trained their models on 256×256 ImageNet using a batch size of 256 without gradient clipping or warm-up. Using VAE-ft-EMA and Euler sampling, they evaluated performance using FID, sFID, IS, Precision, and Recall. They built improved baselines with SwiGLU, RoPE, RMSNorm, and lognorm sampling. Their DDT models consistently outperformed prior baselines, particularly at larger sizes, and converged significantly faster than REPA. Further gains were achieved through encoder sharing strategies and careful tuning of the encoder-decoder ratio, resulting in state-of-the-art FID scores on both 256×256 and 512×512 ImageNet.
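Schematically, and in our own shorthand rather than the paper's exact notation, the decoupled computation at each denoising step described above can be summarized as:

z_t = Encoder(x_t, t, y)   (low-frequency semantic features from the noisy input x_t, timestep t, and class label y)
v = Decoder(x_t, z_t, t)   (high-frequency detail decoding: the estimated velocity field used to denoise x_t)

Because training encourages z_t to stay consistent across steps, the encoder output can be reused at selected timesteps during sampling, which is exactly what the encoder-sharing and dynamic programming strategy exploits to cut inference cost.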

In conclusion, the study presents the DDT, which addresses the optimization challenge in traditional diffusion transformers by separating semantic encoding and high-frequency decoding into distinct modules. By scaling encoder capacity relative to the decoder, DDT achieves notable performance gains, especially in larger models. The DDT-XL/2 model sets new benchmarks on ImageNet, achieving faster training convergence and lower FID scores for both 256×256 and 512×512 resolutions. Additionally, the decoupled design enables encoder sharing across denoising steps, significantly improving inference efficiency. A dynamic programming strategy further enhances this by determining optimal sharing points, maintaining image quality while reducing computational load.

Check out the Paper.

The post Decoupled Diffusion Transformers: Accelerating High-Fidelity Image Generation via Semantic-Detail Separation and Encoder Sharing appeared first on MarkTechPost.

A Coding Guide to Build an Agentic AI‑Powered Asynchronous Ticketing Assistant Using PydanticAI Agents, Pydantic v2, and SQLite Database

In this tutorial, we’ll build an end‑to‑end ticketing assistant powered by Agentic AI using the PydanticAI library. We’ll define our data rules with Pydantic v2 models, store tickets in an in‑memory SQLite database, and generate unique identifiers with Python’s uuid module. Behind the scenes, two agents, one for creating tickets and one for checking status, leverage Google Gemini (via PydanticAI’s google-gla provider) to interpret your natural‑language prompts and call our custom database functions. The result is a clean, type‑safe workflow you can run immediately in Colab.

!pip install --upgrade pip
!pip install pydantic-ai

First, these two commands update your pip installer to the latest version, bringing in new features and security patches, and then install PydanticAI. This library enables the definition of type-safe AI agents and the integration of Pydantic models with LLMs.

import os
from getpass import getpass

if "GEMINI_API_KEY" not in os.environ:
    os.environ["GEMINI_API_KEY"] = getpass("Enter your Google Gemini API key: ")

We check whether the GEMINI_API_KEY environment variable is already set. If not, we securely prompt you (without echoing) to enter your Google Gemini API key at runtime, then store it in os.environ so that your Agentic AI calls can authenticate automatically.

!pip install nest_asyncio

We install the nest_asyncio package, which lets you patch the existing asyncio event loop so that you can call async functions (or use .run_sync()) inside environments like Colab without running into “event loop already running” errors.

import sqlite3
import uuid
from dataclasses import dataclass
from typing import Literal

from pydantic import BaseModel, Field
from pydantic_ai import Agent, RunContext

We bring in Python’s sqlite3 for our in‑memory database and uuid to generate unique ticket IDs, use dataclass and Literal for clear dependency and type definitions, and load Pydantic’s BaseModel/Field for enforcing data schemas alongside Agent and RunContext from PydanticAI to wire up and run our conversational agents.

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE tickets (
    ticket_id TEXT PRIMARY KEY,
    summary TEXT NOT NULL,
    severity TEXT NOT NULL,
    department TEXT NOT NULL,
    status TEXT NOT NULL
)
""")
conn.commit()

We set up an in‑memory SQLite database and define a tickets table with columns for ticket_id, summary, severity, department, and status, then commit the schema so you have a lightweight, transient store for managing your ticket records.

@dataclass
class TicketingDependencies:
    """Carries our DB connection into system prompts and tools."""
    db: sqlite3.Connection

class CreateTicketOutput(BaseModel):
    ticket_id: str = Field(..., description="Unique ticket identifier")
    summary: str = Field(..., description="Text summary of the issue")
    severity: Literal["low", "medium", "high"] = Field(..., description="Urgency level")
    department: str = Field(..., description="Responsible department")
    status: Literal["open"] = Field("open", description="Initial ticket status")

class TicketStatusOutput(BaseModel):
    ticket_id: str = Field(..., description="Unique ticket identifier")
    status: Literal["open", "in_progress", "resolved"] = Field(..., description="Current ticket status")

Here, we define a simple TicketingDependencies dataclass to pass our SQLite connection into each agent call, and then declare two Pydantic models: CreateTicketOutput (with fields for ticket ID, summary, severity, department, and default status “open”) and TicketStatusOutput (with ticket ID and its current status). These models enforce a clear, validated structure on everything our agents return, ensuring you always receive well-formed data.

create_agent = Agent(
    "google-gla:gemini-2.0-flash",
    deps_type=TicketingDependencies,
    output_type=CreateTicketOutput,
    system_prompt="You are a ticketing assistant. Use the `create_ticket` tool to log new issues."
)

@create_agent.tool
async def create_ticket(
    ctx: RunContext[TicketingDependencies],
    summary: str,
    severity: Literal["low", "medium", "high"],
    department: str
) -> CreateTicketOutput:
    """
    Logs a new ticket in the database.
    """
    tid = str(uuid.uuid4())
    ctx.deps.db.execute(
        "INSERT INTO tickets VALUES (?,?,?,?,?)",
        (tid, summary, severity, department, "open")
    )
    ctx.deps.db.commit()
    return CreateTicketOutput(
        ticket_id=tid,
        summary=summary,
        severity=severity,
        department=department,
        status="open"
    )

We create a PydanticAI Agent named ‘create_agent’ that’s wired to Google Gemini and is aware of our SQLite connection (deps_type=TicketingDependencies) and output schema (CreateTicketOutput). The @create_agent.tool decorator then registers an async create_ticket function, which generates a UUID, inserts a new row into the tickets table, and returns a validated CreateTicketOutput object.

status_agent = Agent(
    "google-gla:gemini-2.0-flash",
    deps_type=TicketingDependencies,
    output_type=TicketStatusOutput,
    system_prompt="You are a ticketing assistant. Use the `get_ticket_status` tool to retrieve current status."
)

@status_agent.tool
async def get_ticket_status(
    ctx: RunContext[TicketingDependencies],
    ticket_id: str
) -> TicketStatusOutput:
    """
    Fetches the ticket status from the database.
    """
    cur = ctx.deps.db.execute(
        "SELECT status FROM tickets WHERE ticket_id = ?", (ticket_id,)
    )
    row = cur.fetchone()
    if not row:
        raise ValueError(f"No ticket found for ID {ticket_id!r}")
    return TicketStatusOutput(ticket_id=ticket_id, status=row[0])

We set up a second PydanticAI Agent, status_agent, also using the Google Gemini provider and our shared TicketingDependencies. It registers an async get_ticket_status tool that looks up a given ticket_id in the SQLite database and returns a validated TicketStatusOutput, or raises an error if the ticket isn’t found.

deps = TicketingDependencies(db=conn)

create_result = await create_agent.run(
    "My printer on 3rd floor shows a paper jam error.", deps=deps
)

print("Created Ticket →")
print(create_result.output.model_dump_json(indent=2))

tid = create_result.output.ticket_id
status_result = await status_agent.run(
    f"What's the status of ticket {tid}?", deps=deps
)

print("Ticket Status →")
print(status_result.output.model_dump_json(indent=2))

Finally, we package the SQLite connection into deps, ask the create_agent to log a new ticket via a natural-language prompt, and print the validated ticket data as JSON. We then take the returned ticket_id, query the status_agent for that ticket’s current state, and print the status in JSON form.

In conclusion, you have seen how Agentic AI and PydanticAI work together to automate a complete service process, from logging a new issue to retrieving its live status, all managed through conversational prompts. Our use of Pydantic v2 ensures every ticket matches the schema you define, while SQLite provides a lightweight backend that’s easy to replace with any database. With these tools in place, you can expand the assistant, adding new agent functions, integrating other AI models like openai:gpt-4o, or connecting real‑world APIs, confident that your data remains structured and reliable throughout.

Here is the Colab Notebook.

The post A Coding Guide to Build an Agentic AI‑Powered Asynchronous Ticketing Assistant Using PydanticAI Agents, Pydantic v2, and SQLite Database appeared first on MarkTechPost.