Interview with Hamza Tahir: Co-founder and CTO of ZenML

Bio: Hamza Tahir is a software developer turned ML engineer. An indie hacker by heart, he loves ideating, implementing, and launching data-driven products. His previous projects include PicHance, Scrilys, BudgetML, and you-tldr. Based on his learnings from deploying ML in production for predictive maintenance use-cases in his previous startup, he co-created ZenML, an open-source MLOps framework for creating production-grade ML pipelines on any infrastructure stack.

Question: From Early Projects to ZenML: Given your rich background in software development and ML engineering—from pioneering projects like BudgetML to co-founding ZenML and building production pipelines at maiot.io—how has your personal journey influenced your approach to creating an open-source ecosystem for production-ready AI?

My journey from early software development to co-founding ZenML has deeply shaped how I approach building open-source tools for AI production. Working on BudgetML taught me that accessibility in ML infrastructure is critical – not everyone has enterprise-level resources, yet everyone deserves access to robust tooling. 

At my first startup maiot.io, I witnessed firsthand how fragmented the MLOps landscape was, with teams cobbling together solutions that often broke in production. This fragmentation creates real business pain points – for example, many enterprises struggle with lengthy time-to-market cycles for their ML models due to these exact challenges.

These experiences drove me to create ZenML with a focus on being production-first, not production-eventual. We built an ecosystem that brings structure to the chaos of managing models, ensuring that what works in your experimental environment transitions smoothly to production. Our approach has consistently helped organizations reduce deployment times and increase efficiency in their ML workflows.

The open-source approach wasn’t just a distribution strategy—it was foundational to our belief that MLOps should be democratized, allowing teams of all sizes to benefit from best practices developed across the industry. We’ve seen organizations of all sizes—from startups to enterprises—accelerate their ML development cycles by 50-80% by adopting these standardized, production-first practices.

Question: From Lab to Launch: Could you share a pivotal moment or technical challenge that underscored the need for a robust MLOps framework in your transition from experimental models to production systems?

ZenML grew out of our experience working in predictive maintenance. We were essentially functioning as consultants, implementing solutions for various clients. A little over four years ago, when we started, there were far fewer tools available, and those that existed lacked the maturity of today’s options.

We quickly discovered that different customers had vastly different needs—some wanted AWS, others preferred GCP. While Kubeflow was emerging as a solution that operated on top of Kubernetes, it wasn’t yet the robust MLOps framework that ZenML offers now.

The pivotal challenge was finding ourselves repeatedly writing custom glue code for each client implementation. This pattern of constantly developing similar but platform-specific solutions highlighted the clear need for a more unified approach. We initially built ZenML on top of TensorFlow’s TFX, but eventually removed that dependency to develop our own implementation that could better serve diverse production environments.

Question: Open-Source vs. Closed-Source in MLOps: While open-source solutions are celebrated for innovation, how do they compare with proprietary options in production AI workflows? Can you share how community contributions have enhanced ZenML’s capabilities in solving real MLOps challenges?

Proprietary MLOps solutions offer polished experiences but often lack adaptability. Their biggest drawback is the “black box” problem—when something breaks in production, teams are left waiting for vendor support. With open-source tools like ZenML, teams can inspect, debug, and extend the tooling themselves.

This transparency enables agility. Open-source frameworks incorporate innovations faster than quarterly releases from proprietary vendors. For LLMs, where best practices evolve weekly, this speed is invaluable.

The power of community-driven innovation is exemplified by one of our most transformative contributions—a developer who built the “Vertex” orchestrator integration for Google Cloud Platform. This wasn’t just another integration—it represented a completely new approach to orchestrating pipelines on GCP that opened up an entirely new market for us.

Prior to this contribution, our GCP users had limited options. The community member developed a comprehensive Vertex AI integration that enabled seamless pipeline orchestration in GCP environments.

Question: Integrating LLMs into Production: With the surge in generative AI and large language models, what are the key obstacles you’ve encountered in LLMOps, and how does ZenML help mitigate these challenges?

LLMOps presents unique challenges including prompt engineering management, complex evaluation metrics, escalating costs, and pipeline complexity.

ZenML helps by providing:

Structured pipelines for LLM workflows, tracking all components from prompts to post-processing logic

Integration with LLM-specific evaluation frameworks

Caching mechanisms to control costs

Lineage tracking for debugging complex LLM chains

Our approach bridges traditional MLOps and LLMOps, allowing teams to leverage established practices while addressing LLM-specific challenges. ZenML’s extensible architecture lets teams incorporate emerging LLMOps tools while maintaining reliability and governance.
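To make this concrete, here is a minimal sketch of such an LLM workflow expressed with ZenML's Python API. The `call_llm` helper, the prompt, and the scoring logic are illustrative placeholders, not ZenML or provider APIs:

```python
from zenml import pipeline, step


def call_llm(prompt: str) -> str:
    # Placeholder for a real provider call (OpenAI, Bedrock, a local model, ...).
    return f"stubbed answer to: {prompt}"


@step(enable_cache=True)  # unchanged inputs reuse cached outputs, controlling LLM costs
def build_prompt(question: str) -> str:
    return f"Answer concisely: {question}"


@step(enable_cache=True)
def generate(prompt: str) -> str:
    return call_llm(prompt)


@step  # evaluation runs on every execution so regressions surface immediately
def evaluate(answer: str) -> float:
    return float(bool(answer.strip()))


@pipeline
def llm_workflow(question: str):
    # Every artifact (prompt, answer, score) is versioned and lineage-tracked.
    prompt = build_prompt(question)
    answer = generate(prompt)
    evaluate(answer)


if __name__ == "__main__":
    llm_workflow(question="What does ZenML do?")
```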

Question: Streamlining MLOps Workflows: What best practices would you recommend for teams aiming to build secure, scalable ML pipelines using open-source tools, and how does ZenML facilitate this process?

For teams building ML pipelines with open-source tools, I recommend:

Start with reproducibility through strict versioning

Design for observability from day one

Embrace modularity with interchangeable components

Automate testing for data, models, and security

Standardize environments through containerization

ZenML facilitates these practices with a Pythonic framework that enforces reproducibility, integrates with popular MLOps tools, supports modular pipeline steps, provides testing hooks, and enables seamless containerization.
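As an illustration of the versioning and containerization points, here is a sketch of how pinned dependencies might be declared for a containerized ZenML pipeline. The package versions are arbitrary examples; on a remote orchestrator ZenML builds the image from these settings, while local runs ignore them:

```python
from zenml import pipeline, step
from zenml.config import DockerSettings

# Pinned requirements make every containerized run reproducible.
docker_settings = DockerSettings(requirements=["scikit-learn==1.4.2", "pandas==2.2.2"])


@step
def train() -> str:
    return "model-artifact"  # placeholder for a real training step


@pipeline(settings={"docker": docker_settings})
def training_pipeline():
    train()


if __name__ == "__main__":
    training_pipeline()
```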

We’ve seen these principles transform organizations like Adeo Leroy Merlin. After implementing these best practices through ZenML, they reduced their ML development cycle by 80%, with their small team of data scientists now deploying new ML use cases from research to production in days rather than months, delivering tangible business value across multiple production models.

The key insight: MLOps isn’t a product you adopt, but a practice you implement. Our framework makes following best practices the path of least resistance while maintaining flexibility.

Question: Engineering Meets Data Science: Your career spans both software engineering and ML engineering—how has this dual expertise influenced your design of MLOps tools that cater to real-world production challenges?

My dual background has revealed a fundamental disconnect between data science and software engineering cultures. Data scientists prioritize experimentation and model performance, while software engineers focus on reliability and maintainability. This divide creates significant friction when deploying ML systems to production.

ZenML was designed specifically to bridge this gap by creating a unified framework where both disciplines can thrive. Our Python-first APIs provide the flexibility data scientists need while enforcing software engineering best practices like version control, modularity, and reproducibility. We’ve embedded these principles into the framework itself, making the right way the easy way.

This approach has proven particularly valuable for LLM projects, where the technical debt accumulated during prototyping can become crippling in production. By providing a common language and workflow for both researchers and engineers, we’ve helped organizations reduce their time-to-production while simultaneously improving system reliability and governance.

Question: MLOps vs. LLMOps: In your view, what distinct challenges do traditional MLOps face compared to LLMOps, and how should open-source frameworks evolve to address these differences?

Traditional MLOps focuses on feature engineering, model drift, and custom model training, while LLMOps deals with prompt engineering, context management, retrieval-augmented generation, subjective evaluation, and significantly higher inference costs.

Open-source frameworks need to evolve by providing:

Consistent interfaces across both paradigms

LLM-specific cost optimizations like caching and dynamic routing

Support for both traditional and LLM-specific evaluation

First-class prompt versioning and governance

ZenML addresses these needs by extending our pipeline framework for LLM workflows while maintaining compatibility with traditional infrastructure. The most successful teams don’t see MLOps and LLMOps as separate disciplines, but as points on a spectrum, using common infrastructure for both.

Question: Security and Compliance in Production: With data privacy and security being critical, what measures does ZenML implement to ensure that production AI models are secure, especially when dealing with dynamic, data-intensive LLM operations?

ZenML implements robust security measures at every level:

Granular pipeline-level access controls with role-based permissions

Comprehensive artifact provenance tracking for complete auditability

Secure handling of API keys and credentials through encrypted storage

Data governance integrations for validation, compliance, and PII detection

Containerization for deployment isolation and attack surface reduction

These measures enable teams to implement security by design, not as an afterthought. Our experience shows that embedding security into the workflow from the beginning dramatically reduces vulnerabilities compared to retrofitting security later. This proactive approach is particularly crucial for LLM applications, where complex data flows and potential prompt injection attacks create unique security challenges that traditional ML systems don’t face.

Question: Future Trends in AI: What emerging trends for MLOps and LLMOps do you believe will redefine production workflows over the next few years, and how is ZenML positioning itself to lead these changes?

Agents and workflows represent a critical emerging trend in AI. Anthropic notably differentiated between these approaches in their blog about Claude agents, and ZenML is strategically focusing on workflows primarily for reliability considerations.

While we may eventually reach a point where we can trust LLMs to autonomously generate plans and iteratively work toward goals, current production systems demand the deterministic reliability that well-defined workflows provide. We envision a future where workflows remain the backbone of production AI systems, with agents serving as carefully constrained components within a larger, more controlled process—combining the creativity of agents with the predictability of structured workflows.

The industry is witnessing unprecedented investment in LLMOps and LLM-driven projects, with organizations actively experimenting to establish best practices as models rapidly evolve. The definitive trend is the urgent need for systems that deliver both innovation and enterprise-grade reliability—precisely the intersection where ZenML is leveraging its years of battle-tested MLOps experience to create transformative solutions for our customers.

Question: Fostering Community Engagement: Open source thrives on collaboration—what initiatives or strategies have you found most effective in engaging the community around ZenML and encouraging contributions in MLOps and LLMOps?

We’ve implemented several high-impact community engagement initiatives that have yielded measurable results. Beyond actively soliciting and integrating open-source contributions for components and features, we hosted one of the first large-scale MLOps competitions in 2023, which attracted over 200 participants and generated dozens of innovative solutions to real-world MLOps challenges.

We’ve established multiple channels for technical collaboration, including an active Slack community, regular contributor meetings, and comprehensive documentation with clear contribution guidelines. Our community members regularly discuss implementation challenges, share production-tested solutions, and contribute to expanding the ecosystem through integrations and extensions. These strategic community initiatives have been instrumental in not only growing our user base substantially but also advancing the collective knowledge around MLOps and LLMOps best practices across the industry.

Question: Advice for Aspiring AI Engineers: Finally, what advice would you give to students and early-career professionals who are eager to dive into the world of open-source AI, MLOps and LLMOps, and what key skills should they focus on developing?

For those entering MLOps and LLMOps: 

Build complete systems, not just models—the challenges of production offer the most valuable learning

Develop strong software engineering fundamentals

Contribute to open-source projects to gain exposure to real-world problems

Focus on data engineering—data quality issues cause more production failures than model problems

Learn cloud infrastructure basics

Key skills to develop include Python proficiency, containerization, distributed systems concepts, and monitoring tools. For bridging roles, focus on communication skills and product thinking. Cultivate “systems thinking”—understanding component interactions is often more valuable than deep expertise in any single area. Remember that the field is evolving rapidly. Being adaptable and committed to continuous learning is more important than mastering any particular tool or framework.

Question: How does ZenML’s approach to workflow orchestration differ from traditional ML pipelines when handling LLMs, and what specific challenges does it solve for teams implementing RAG or agent-based systems?

At ZenML, we believe workflow orchestration must be paired with robust evaluation systems—otherwise, teams are essentially flying blind. This is especially crucial for LLM workflows, where behavior can be much less predictable than that of traditional ML models.

Our approach emphasizes “eval-first development” as the cornerstone of effective LLM orchestration. This means evaluation runs as quality gates or as part of the outer development loop, incorporating user feedback and annotations to continually improve the system.

For RAG or agent-based systems specifically, this eval-first approach helps teams identify whether issues are coming from retrieval components, prompt engineering, or the foundation models themselves. ZenML’s orchestration framework makes it straightforward to implement these evaluation checkpoints throughout your workflow, giving teams confidence that their systems are performing as expected before reaching production.
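As a hedged sketch of what such an evaluation checkpoint could look like as a ZenML step, consider the following; the thresholds and scoring functions are illustrative assumptions, not ZenML built-ins:

```python
from zenml import pipeline, step


@step
def eval_retrieval() -> float:
    # Stand-in for a real retrieval metric (e.g., hit rate on a labeled set).
    return 0.91


@step
def eval_generation() -> float:
    # Stand-in for an answer-quality metric (e.g., an LLM-judge score).
    return 0.76


@step
def quality_gate(retrieval_score: float, generation_score: float) -> None:
    # Failing this step halts the run, so a regression never reaches deployment.
    if retrieval_score < 0.8:
        raise RuntimeError("Retrieval below threshold: inspect embeddings or index.")
    if generation_score < 0.7:
        raise RuntimeError("Generation below threshold: inspect prompts or model.")


@pipeline
def rag_eval_pipeline():
    quality_gate(eval_retrieval(), eval_generation())


if __name__ == "__main__":
    rag_eval_pipeline()
```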

Question: What patterns are you seeing emerge for successful hybrid systems that combine traditional ML models with LLMs, and how does ZenML support these architectures?

ZenML takes a deliberately unopinionated approach to architecture, allowing teams to implement patterns that work best for their specific use cases. Common hybrid patterns include RAG systems with custom-tuned embedding models and specialized language models for structured data extraction.

This hybrid approach—combining custom-trained models with foundation models—delivers superior results for domain-specific applications. ZenML supports these architectures by providing a consistent framework for orchestrating both traditional ML components and LLM components within a unified workflow.

Our platform enables teams to experiment with different hybrid architectures while maintaining governance and reproducibility across both paradigms, making the implementation and evaluation of these systems more manageable.

Question: As organizations rush to implement LLM solutions, how does ZenML help teams maintain the right balance between experimentation speed and production governance?

ZenML handles best practices out of the box—tracking metadata, evaluations, and the code used to produce them without teams having to build this infrastructure themselves. This means governance doesn’t come at the expense of experimentation speed.

As your needs grow, ZenML grows with you. You might start with local orchestration during early experimentation phases, then seamlessly transition to cloud-based orchestrators and scheduled workflows as you move toward production—all without changing your core code.

Lineage tracking is a key feature that’s especially relevant given emerging regulations like the EU AI Act. ZenML captures the relationships between data, models, and outputs, creating an audit trail that satisfies governance requirements while still allowing teams to move quickly. This balance between flexibility and governance helps prevent organizations from ending up with “shadow AI” systems built outside official channels.
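For illustration, lineage can also be pulled programmatically as raw material for an audit trail. The exact accessors vary across ZenML versions, and the pipeline name here is hypothetical, so treat this as indicative rather than definitive:

```python
from zenml.client import Client

# Fetch the most recent run of a (hypothetical) pipeline and walk its steps.
run = Client().get_pipeline("rag_pipeline").last_run
for step_name, step_run in run.steps.items():
    print(step_name, step_run.status)
```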

Question: What are the key integration challenges enterprises face when incorporating foundation models into existing systems, and how does ZenML’s workflow approach address these?

A key integration challenge for enterprises is tracking which foundation model (and which version) was used for specific evaluations or production outputs. This lineage and governance tracking is critical both for regulatory compliance and for debugging issues that arise in production.

ZenML addresses this by maintaining a clear lineage between model versions, prompts, inputs, and outputs across your entire workflow. This provides both technical and non-technical stakeholders with visibility into how foundation models are being used within enterprise systems.

Our workflow approach also helps teams manage environment consistency and version control as they move LLM applications from development to production. By containerizing workflows and tracking dependencies, ZenML reduces the “it works on my machine” problems that often plague complex integrations, ensuring that LLM applications behave consistently across environments.

OpenAI Open Sources BrowseComp: A New Benchmark for Measuring the Ability for AI Agents to Browse the Web

Despite advances in large language models (LLMs), AI agents still face notable limitations when navigating the open web to retrieve complex information. While many models excel on static knowledge benchmarks, they often underperform when tasked with locating nuanced, context-dependent facts across multiple sources. Most existing benchmarks evaluate a model’s recall of easily accessible knowledge, which does not reflect the intricacy of real-world browsing tasks. In contrast, agents operating in applied settings—whether assisting with research, summarizing policy, or fact-checking claims—require persistence, structured reasoning, and the ability to dynamically adapt their search strategies. These capabilities remain underdeveloped in current AI systems.

OpenAI Open Sources BrowseComp: A Benchmark of 1,266 Information-Seeking Tasks

To better evaluate these capabilities, OpenAI has released BrowseComp, a benchmark designed to assess agents’ ability to persistently browse the web and retrieve hard-to-find information. The benchmark includes 1,266 fact-seeking problems, each with a short, unambiguous answer. Solving these tasks often requires navigating through multiple webpages, reconciling diverse information, and filtering relevant signals from noise.

The benchmark is inspired by the notion that just as programming competitions serve as focused tests for coding agents, BrowseComp offers a similarly constrained yet revealing evaluation of web-browsing agents. It deliberately avoids tasks with ambiguous user goals or long-form outputs, focusing instead on the core competencies of precision, reasoning, and endurance.

BrowseComp was created using a reverse-question design methodology: beginning with a specific, verifiable fact, trainers constructed a question designed to obscure the answer through complexity and constraint. Human trainers ensured that questions could not be solved via superficial search and would challenge both retrieval and reasoning capabilities. Additionally, questions were vetted to ensure they would not be easily solvable by GPT-4, OpenAI o1, or earlier browsing-enabled models.

The dataset spans a broad range of domains—including science, history, arts, sports, and entertainment—and is balanced to promote topic diversity. Each task is formulated so that the correct answer is a short string, which simplifies evaluation and reduces ambiguity. Human performance was also assessed, with human trainers given two hours per task; most failed to solve the majority of tasks, reflecting their difficulty.
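Because each answer is a short string, grading can be reduced to roughly a normalized string comparison. The sketch below is a simplified stand-in; the benchmark's actual grading procedure may differ:

```python
def normalize(s: str) -> str:
    # Lowercase and collapse whitespace so trivial formatting differences don't count.
    return " ".join(s.lower().split())


def grade(candidate: str, reference: str) -> bool:
    return normalize(candidate) == normalize(reference)


print(grade("The  Eiffel Tower", "the eiffel tower"))  # True
```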

Model Evaluation and Findings

OpenAI evaluated several models on BrowseComp, including GPT-4o (with and without browsing), GPT-4.5, OpenAI o1, and Deep Research—a model specifically trained to handle persistent browsing tasks. The results indicate that models without advanced search or reasoning strategies perform poorly: GPT-4o without browsing achieved 0.6% accuracy, and with browsing enabled, only 1.9%. GPT-4.5 scored similarly low. OpenAI o1, with improved reasoning but no browsing, performed moderately better at 9.9%.

Deep Research outperformed all other models, achieving 51.5% accuracy. Its architecture and training emphasize iterative searching, evidence synthesis, and adaptive navigation. Performance improved further with multiple trials per question and aggregation strategies such as best-of-N selection and confidence-based voting. While Deep Research exhibited higher calibration error—frequently being overconfident in incorrect answers—it often identified its own correct outputs with internal consistency, suggesting a usable confidence signal.
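The aggregation code itself isn't published, but confidence-based voting over N attempts can be sketched as follows, where each browsing run returns an answer plus a self-reported confidence:

```python
from collections import defaultdict


def confidence_vote(attempts: list[tuple[str, float]]) -> str:
    """Pick the answer with the highest total confidence across N browsing runs."""
    totals: dict[str, float] = defaultdict(float)
    for answer, confidence in attempts:
        totals[answer.strip().lower()] += confidence
    return max(totals, key=totals.get)


# Three trials of the same question: the repeated answer wins despite one dissent.
print(confidence_vote([("Paris", 0.9), ("Lyon", 0.4), ("paris", 0.7)]))  # "paris"
```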

Human Performance and Task Difficulty

Human trainers attempted to solve the benchmark problems without the assistance of AI tools. Of the 1,255 attempted tasks, 71% were marked as unsolvable within the two-hour window, and only 29% were successfully completed. Among those, the agreement rate with the reference answer was 86.4%. These outcomes underscore the complexity of the benchmark and suggest that current AI models still fall short of the adaptability and background reasoning skills needed for such tasks.

Conclusion

BrowseComp introduces a focused, verifiable, and technically demanding benchmark for evaluating the core capabilities of web-browsing agents. By shifting emphasis from static recall to dynamic retrieval and multi-hop reasoning, it presents a realistic challenge that aligns closely with emerging real-world applications. Although current models, including those with browsing capabilities, perform unevenly, the Deep Research agent illustrates the potential of dedicated architectures to bridge this gap.

BrowseComp is publicly available via GitHub and detailed on OpenAI’s official blog.

Reduce ML training costs with Amazon SageMaker HyperPod

Training a frontier model is highly compute-intensive, requiring a distributed system of hundreds, or thousands, of accelerated instances running for several weeks or months to complete a single job. For example, pre-training the Llama 3 70B model with 15 trillion training tokens took 6.5 million H100 GPU hours. On 256 Amazon EC2 P5 instances (p5.48xlarge, each with 8 NVIDIA H100 GPUs), this would take approximately 132 days.
Distributed training workloads run in a synchronous manner because each training step requires all participating instances to complete their calculations before the model can advance to the next step. This means that if a single instance fails, the entire job stops. As cluster sizes grow, the likelihood of failure increases due to the number of hardware components involved. Each hardware failure can result in wasted GPU hours and requires valuable engineering time to identify and resolve the issue, making the system prone to downtime that can disrupt progress and delay completion. To assess system reliability, engineering teams often rely on key metrics such as mean time between failures (MTBF), which measures the average operational time between hardware failures and serves as a valuable indicator of system robustness.
In this post, we explore the challenges of large-scale frontier model training, focusing on hardware failures and the benefits of Amazon SageMaker HyperPod—a resilient solution that minimizes disruptions, enhances efficiency, and reduces training costs.
Instance failure rate
To understand the typical MTBF for large-scale frontier model training, it helps to first understand instance failure rates by reviewing three noteworthy examples:

When training OPT-175B on 992 A100 GPUs, Meta AI encountered significant hardware reliability challenges. Across 2 months, the team managed 35 manual restarts and cycled over 100 hosts due to hardware issues, and automated systems triggered more than 70 restarts. Operating 124 instances (each with 8 GPUs) continuously over 1,440 hours, Meta accumulated a total of 178,560 instance-hours. The observed failure rate during this period was around 0.0588% per instance-hour, underscoring the reliability hurdles in training large frontier models at this scale.
During the training of Llama 3.1 405B on 16,000 H100 GPUs, a total of 417 unscheduled hardware failures occurred during a 54-day period. This translates to an effective failure rate of about 0.0161% per instance-hour.
MPT-7B was trained on 1 trillion tokens over the course of 9.5 days on 440 x A100-40GB. During this period, the training job experienced four hardware failures, resulting in an effective failure rate of approximately 0.0319% per instance-hour.

Based on these examples, it’s realistic to expect that in a single hour of large-scale distributed training, an instance will fail about 0.02%–0.06% of the time.
Larger clusters, more failures, smaller MTBF
As cluster size increases, so does the number of components that can fail, resulting in a lower MTBF. The following table illustrates how the MTBF (in hours) changes with the number of instances in a cluster and the estimated failure rate for each instance. For example, with a 0.04% per-hour failure rate per instance, a 512-instance system is expected to experience a failure approximately every 5 hours.

Columns: size of cluster (instances).

| Failure rate (per instance per hour) | 4 | 8 | 16 | 32 | 64 | 128 | 256 | 512 |
|---|---|---|---|---|---|---|---|---|
| 0.01% | 2500 | 1250 | 625 | 313 | 157 | 79 | 40 | 20 |
| 0.02% | 1250 | 625 | 313 | 157 | 79 | 40 | 20 | 10 |
| 0.04% | 625 | 313 | 157 | 79 | 40 | 20 | 10 | 5 |
| 0.08% | 313 | 157 | 79 | 40 | 20 | 10 | 5 | 3 |

Table 1: The change in MTBF (in hours) with the number of instances in a training cluster (columns are cluster sizes, rows are assumed per-instance failure rates)
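The values in Table 1 follow directly from treating failures as independent: expected failures per cluster-hour are the per-instance rate times the instance count, and MTBF is the reciprocal. A quick check in Python:

```python
def mtbf_hours(instances: int, hourly_failure_rate: float) -> float:
    # Expected failures per cluster-hour = instances * per-instance rate.
    return 1.0 / (instances * hourly_failure_rate)


print(round(mtbf_hours(512, 0.0004), 1))  # ~4.9 hours, i.e., the ~5 h in Table 1
```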
What happens after a failure?
In a perfect world, without failures, the training job proceeds as shown in the following graph, which illustrates the total training time without failures, demonstrating a linear progression.

Figure 1: Training is linear in a perfect world without failures, since there are no interruptions to completion.

However, as previously noted, hardware failures are inevitable. Troubleshooting these failures typically involves several steps:

Root cause analysis (mean time to detect) – Identifying hardware failures as the root cause of training interruptions can be time-consuming, especially in complex systems with multiple potential failure points. The time taken to determine the root cause is referred to as mean time to detect (MTTD).
Hardware repair or replacement (mean time to replace) – Sometimes, a simple instance restart resolves the issue. At other times, the instance must be replaced, which can involve logistical delays, especially if specialized components aren’t readily available. If a replacement instance isn’t on hand when a GPU fails, the system must wait for one to become available. Common sharded training approaches, such as PyTorch FSDP, don’t permit redistributing the workload among the remaining instances.
System recovery and resumption (mean time to restart) – After resolving hardware issues and replacing the instance, additional time is needed to restore it to its previous state. The new instance must match the original configuration, and the entire cluster must load the model weights from the latest saved checkpoint.

Each failure incurs engineering effort to identify its root cause. When hardware issues arise, diagnostics confirm the problem and isolate the faulty instance, pausing the training job and increasing downtime. The impact of these failures is illustrated in the following figure and can be empirically measured for large distributed training jobs. The figure outlines the troubleshooting steps that follow a failure.

Figure 2: Impact of failures on a distributed training run. Once a failure occurs, time (idle GPUs) is spent on detecting (MTTD), replacing (MTTReplace), and restarting (MTTRestart) a training run, often wasting time and expensive resources.

In a scenario where a distributed training job is running on an Amazon Elastic Kubernetes Service (Amazon EKS) cluster with n reserved instances and an Auto Scaling group set to maintain a minimum of n instances, a hardware issue such as a GPU failure can cause the job to fail. The affected instance will be marked as Unhealthy by a Kubernetes health monitor such as Node Problem Detector, and Amazon EKS will attempt to reschedule the training pods to healthy instances. If no instances have sufficient resources, the pods remain in a Pending state, and because the instance count is limited to n, no new instance will be automatically provisioned.
In such cases, the failed job must be manually identified through pod logs or the Kubernetes API and deleted. The failed instance also needs to be isolated and terminated manually, either through the AWS Management Console, AWS Command Line Interface (AWS CLI), or tools like kubectl or eksctl. To restore cluster capacity, the user must increase the cluster size by modifying the Auto Scaling group or updating the instance group. After the new instance is provisioned, bootstrapped, and added to the cluster, the training job must be restarted manually. If checkpointing is enabled, the job can resume from the last saved state. The overall downtime depends on the time required to provision a new instance and restart the job by rescheduling the pods.
Faster failure detection (shorter MTTD), shorter replacement times (shorter MTTR), and rapid resumption will all contribute to reducing total training time. Automating these processes with minimal user intervention is a key advantage of Amazon SageMaker HyperPod. 
Amazon SageMaker HyperPod resilient training infrastructure
SageMaker HyperPod is a compute environment optimized for large-scale frontier model training. This means users can build resilient clusters for machine learning (ML) workloads and develop or fine-tune state-of-the-art frontier models, as demonstrated by organizations such as Luma Labs and Perplexity AI. SageMaker HyperPod runs health monitoring agents in the background for each instance. When it detects a hardware failure, SageMaker HyperPod automatically repairs or replaces the faulty instance and resumes training from the last saved checkpoint. This automation alleviates the need for manual management, which means customers can train in distributed settings for weeks or months with minimal disruption. The benefits are particularly significant for customers deploying many instances (greater than 16) in a cluster.
Frontier model builders can further enhance model performance using built-in ML tools within SageMaker HyperPod. They can use Amazon SageMaker AI with MLflow to create, manage, and track ML experiments, or use Amazon SageMaker AI with TensorBoard to visualize model architecture and address convergence issues. Additionally, integrating with observability tools such as Amazon CloudWatch Container Insights, Amazon Managed Service for Prometheus, and Amazon Managed Grafana provides deeper insights into cluster performance, health, and utilization, ultimately saving valuable development time. The following figure compares the downtime of an infrastructure system using SageMaker HyperPod versus one without SageMaker HyperPod.

Figure 3: Comparing downtime chart from figure 1 with downtime on SageMaker HyperPod. When a failure occurs, it is detected automatically by HyperPod agents, and the instance is replaced in the background. Training is also resumed from the latest checkpoint

SageMaker HyperPod reduces the downtime per hardware failure by automatically detecting hardware issues. When these issues are detected, SageMaker HyperPod automatically replaces the faulty node(s) and resumes your training job from the latest checkpoint, assuming that checkpoints are written.
To evaluate this, we conducted experiments on SageMaker HyperPod using different cluster sizes of p5.48xlarge instances. The following table shows empirical measurements of time to resume by cluster size, reported as the 90th percentile (P90), the value that will be met or exceeded 90% of the time.

| Cluster size (number of instances) | P90 time to detect (seconds) | P90 time to replace (seconds) | P90 time to resume (seconds) | Total downtime per failure (seconds) | Total downtime per failure (minutes) |
|---|---|---|---|---|---|
| 16 | 83 | 912 | 1212 | 2207 | 36.8 |
| 64 | 90 | 963 | 1320 | 2373 | 39.6 |
| 256 | 89 | 903 | 1398 | 2390 | 39.8 |
| 1024 | 80 | 981 | 1440 | 2501 | 41.7 |

Table 2: MTTResume (in seconds) on clusters with different sizes
As shown, the mean time to replace an instance is independent of cluster size. For a cluster of 256 x p5.48xlarge instances training Meta Llama 3.1 70B parameter model with batch size = 8, replacing an instance takes about 940 seconds (or 15.7 minutes). After replacement, the new instance must install additional packages using lifecycle scripts and run deep health checks before reading from the latest saved checkpoint. When it’s operational, the training job resumes from the most recent checkpoint, minimizing progress loss despite the interruption. For a 256-instance cluster, it took us about 2,390 seconds (about 40 minutes) to automatically resume the training job after each failure.
Without SageMaker HyperPod, when a GPU failure occurs during a training job, the time it takes to resume the training can vary widely depending on the infrastructure and processes in place. With proper checkpointing, automated job orchestration, and efficient hardware provisioning, the resume time can be reduced. However, without these optimizations, the impact can be much more severe. Empirical evidence from customer experiences—including a leading open source frontier model provider, a top large language model (LLM) startup, an AI company specializing in enterprise frontier models, and a cutting-edge scientific research institute—indicates that without SageMaker HyperPod, the total downtime per GPU failure can average approximately 280 minutes per failure. Thus, Amazon SageMaker HyperPod saves about 240 minutes (or about 4 hours) of downtime per failure:

| Metric | Without SageMaker HyperPod (minutes) | With SageMaker HyperPod (minutes) |
|---|---|---|
| Mean time to root-cause | 10 | 1.5 |
| Mean time to replace | 240 | 15 |
| Mean time to resume | 30 | 23.5 |
| Total downtime per failure | 280 | 40 |

Table 3: Typical failure numbers, in minutes (as described in section “What happens after a failure?” with and without SageMaker HyperPod)
Quantifying the downtime savings
Depending on the frequency of failures, we can calculate the time to train and the cost savings of using SageMaker HyperPod. To illustrate this calculation, we assume it takes 40 minutes to replace an instance with SageMaker HyperPod compared to 280 minutes without it (as previously explained). Additionally, for this calculation, let’s assume a training job requiring 10 million GPU hours on H100 instances, running on a 256-instance P5 cluster.
Although the actual overhead (in hours) depends on the size of the training job, the relative overhead remains constant. The benefits of SageMaker HyperPod in reducing total training time are demonstrated in the following chart. For example, in a 256-instance cluster with a failure rate of 0.05%, SageMaker HyperPod reduces total training time by 32%.

Columns: size of cluster (instances).

| Failure rate (per instance per hour) | 4 | 8 | 16 | 32 | 64 | 128 | 256 | 512 |
|---|---|---|---|---|---|---|---|---|
| 0.01% | 0% | 0% | 1% | 1% | 2% | 5% | 9% | 17% |
| 0.02% | 0% | 1% | 1% | 2% | 5% | 9% | 17% | 28% |
| 0.05% | 1% | 2% | 3% | 6% | 11% | 20% | 32% | 48% |
| 0.07% | 1% | 2% | 4% | 8% | 15% | 25% | 40% | 55% |

Table 4: Total % of training time reduced by SageMaker HyperPod compared to a P5 cluster of comparable size
To translate this into actual savings, for a training job requiring 10 million GPU hours on a 256-instance cluster, SageMaker HyperPod saves 104 days of training time. As a result, customers can reduce time-to-market by 3.5 months. Without SageMaker HyperPod, the total time to train would be approximately 325 days, 121 of which are just spent on isolating and mitigating hardware issues. The following table shows the time to train benefits.

| Metric | Value |
|---|---|
| H100 GPU hours for training | 10,000,000 |
| Number of instances | 256 |
| Failure rate (per instance per hour) | 0.05% |
| Additional time to fix per failure (hours) | 4 |
| Days lost due to hardware issues (with SageMaker HyperPod) | 17 |
| Days lost due to hardware issues (without SageMaker HyperPod) | 121 |
| Time to train with SageMaker HyperPod (days) | 221 |
| Time to train without SageMaker HyperPod (days) | 325 |
| SageMaker HyperPod improvement | 32% |
| Time saved with SageMaker HyperPod (days) | 104 |

Table 5: Benefits presented by SageMaker HyperPod for a training run requiring 10 million GPU hours and a 256 instance cluster. SageMaker HyperPod saves 104 days of training time overall, resulting in a faster time to market (by 3.5 months!)
For the same example, we can estimate the total cost savings using:
Expected failures = (Number of instances) × (Failure rate per instance per hour) × (24 hours per day) × (Total training days)
Days lost due to hardware issues = Expected failures × (Downtime per failure in hours) ÷ 24
The following shows cost to train benefits.

| Metric | Value |
|---|---|
| H100 GPU hours for training | 10,000,000 |
| Number of instances | 256 |
| Failure rate (per instance per hour) | 0.05% |
| Time saved with SageMaker HyperPod (days) | 104 |
| Cost per GPU per hour | $5 |
| Total cost saving with SageMaker HyperPod | $25,559,040 |

Table 6: Using the calculation described above, the cost to train benefits laid out for a training run requiring 10 million GPU hours, 256 GPU based instances, and an assumed failure rate of 0.05% per instance per hour
A training job requiring 10 million GPU hours and 104 additional days of resolving hardware issues results in significant idle cluster time. Assuming a GPU cost of $5 per hour (equivalent to the price of P5 instances on Capacity Blocks for ML), the total cost savings with SageMaker HyperPod amounts to $25,559,040.
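These figures can be reproduced with a few lines of arithmetic (small rounding differences aside):

```python
INSTANCES = 256
GPUS_PER_INSTANCE = 8
GPU_HOURS = 10_000_000
FAILURE_RATE = 0.0005        # 0.05% per instance per hour
EXTRA_HOURS_PER_FAILURE = 4  # ~240 extra minutes of downtime without HyperPod
GPU_COST_PER_HOUR = 5        # USD (P5 Capacity Blocks price point)

training_hours = GPU_HOURS / (INSTANCES * GPUS_PER_INSTANCE)   # ~4,883 h of training
expected_failures = INSTANCES * FAILURE_RATE * training_hours  # ~625 failures
hours_saved = expected_failures * EXTRA_HOURS_PER_FAILURE      # ~2,500 h
days_saved = hours_saved / 24                                  # ~104 days
cost_saved = hours_saved * INSTANCES * GPUS_PER_INSTANCE * GPU_COST_PER_HOUR

print(round(days_saved), f"${cost_saved:,.0f}")  # ~104 days, ~$25.6M
```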
Summary
Training frontier models is a complex, resource-intensive process that is particularly vulnerable to hardware failures. In this post, we explored the instance failure rate, which can range from about 0.02% to 0.06% per hour during large-scale distributed training. As cluster size grows, the likelihood of failures increases, and the MTBF decreases. We also examined what happens after failure, including root cause analysis, hardware repair or replacement, and system recovery and resumption.
Next, we examined Amazon SageMaker HyperPod—a purpose-built, fully resilient cluster for frontier model training. By incorporating robust fault-tolerance mechanisms and automated health monitoring, SageMaker HyperPod minimizes disruptions caused by hardware issues. This not only streamlines the training process but also enhances the reliability and efficiency of model development, enabling faster and more effective innovation delivery. The benefits are measurable and correlate with both cluster size and failure rate. For a 256-instance cluster with a 0.05% per-instance-per-hour failure rate, SageMaker HyperPod reduces total training time by 32%, resulting in an approximate savings of $25.6 million in total training costs.
By addressing the reliability challenges of frontier model training, SageMaker HyperPod allows ML teams to focus on model innovation rather than infrastructure management. Organizations can now conduct long training runs with confidence, knowing that hardware failures will be automatically detected and resolved with minimal disruption to their ML workloads. Get started with Amazon SageMaker HyperPod.
Special thanks to Roy Allela, Senior AI/ML Specialist Solutions Architect for his support on the launch of this post.

About the Authors
Anoop Saha is a Sr GTM Specialist at Amazon Web Services (AWS) focusing on generative AI model training and inference. He partners with top frontier model builders, strategic customers, and AWS service teams to enable distributed training and inference at scale on AWS and lead joint GTM motions. Before AWS, Anoop held several leadership roles at startups and large corporations, primarily focusing on silicon and system architecture of AI infrastructure.
Trevor Harvey is a Principal Specialist in generative AI at Amazon Web Services (AWS) and an AWS Certified Solutions Architect – Professional. Trevor works with customers to design and implement machine learning solutions and leads go-to-market strategies for generative AI services.
Aman Shanbhag is a Specialist Solutions Architect on the ML Frameworks team at Amazon Web Services (AWS), where he helps customers and partners with deploying ML training and inference solutions at scale. Before joining AWS, Aman graduated from Rice University with degrees in computer science, mathematics, and entrepreneurship.

Model customization, RAG, or both: A case study with Amazon Nova

As businesses and developers increasingly seek to optimize their language models for specific tasks, the decision between model customization and Retrieval Augmented Generation (RAG) becomes critical. In this post, we seek to address this growing need by offering clear, actionable guidelines and best practices on when to use each approach, helping you make informed decisions that align with your unique requirements and objectives.
The introduction of Amazon Nova models represents a significant advancement in the field of AI, offering new opportunities for large language model (LLM) optimization. In this post, we demonstrate how to effectively perform model customization and RAG with Amazon Nova models as a baseline. We conducted a comprehensive comparison study between model customization and RAG using the latest Amazon Nova models, and share these valuable insights.
Approach and base model overview
In this section, we discuss the differences between a fine-tuning and RAG approach, present common use cases for each approach, and provide an overview of the base model used for experiments.
Demystifying RAG and model customization
RAG is a technique to enhance the capability of pre-trained models by allowing the model access to external domain-specific data sources. It combines two components: retrieval of external knowledge and generation of responses. It allows pre-trained language models to dynamically incorporate external data during the response-generation process, enabling more contextually accurate and updated outputs. Unlike fine-tuning, in RAG, the model doesn’t undergo any training and the model weights aren’t updated to learn the domain knowledge. Although fine-tuning implicitly uses domain-specific information by embedding the required knowledge directly into the model, RAG explicitly uses the domain-specific information through external retrieval.
Model customization refers to adapting a pre-trained language model to better fit specific tasks, domains, or datasets. Fine-tuning is one such technique, which helps in injecting task-specific or domain-specific knowledge for improving model performance. It adjusts the model’s parameters to better align with the nuances of the target task while using its general knowledge.
Common use cases for each approach
RAG is optimal for use cases requiring dynamic or frequently updated data (such as customer support FAQs and ecommerce catalogs), domain-specific insights (such as legal or medical Q&A), scalable solutions for broad applications (such as software as a service (SaaS) platforms), multimodal data retrieval (such as document summarization), and strict compliance with secure or sensitive data (such as financial and regulatory systems).
Conversely, fine-tuning thrives in scenarios demanding precise customization (such as personalized chatbots or creative writing), high accuracy for narrow tasks (such as code generation or specialized summarization), ultra-low latency (such as real-time customer interactions), stability with static datasets (such as domain-specific glossaries), and cost-efficient scaling for high-volume tasks (such as call center automation).
Although RAG excels at real-time grounding in external data and fine-tuning specializes in static, structured, and personalized workflows, choosing between them often depends on nuanced factors. This post offers a comprehensive comparison of RAG and fine-tuning, clarifying their strengths, limitations, and contexts where each approach delivers the best performance.
Introduction to Amazon Nova models
Amazon Nova is a new generation of foundation models (FMs) offering frontier intelligence and industry-leading price-performance. Amazon Nova Pro and Amazon Nova Lite are multimodal models excelling in accuracy and speed, with Amazon Nova Lite optimized for low-cost, fast processing. Amazon Nova Micro focuses on text tasks with ultra-low latency. They offer fast inference, support agentic workflows with Amazon Bedrock Knowledge Bases and RAG, and allow fine-tuning for text and multimodal data. Optimized for cost-effective performance, they are trained on data in over 200 languages.
Solution overview
To evaluate the effectiveness of RAG compared to model customization, we designed a comprehensive testing framework using a set of AWS-specific questions. Our study used Amazon Nova Micro and Amazon Nova Lite as baseline FMs and tested their performance across different configurations.
We structured our evaluation as follows:

1. Base model:
   - Used out-of-box Amazon Nova Micro and Amazon Nova Lite
   - Generated responses to AWS-specific questions without additional context
2. Base model with RAG:
   - Connected the base models to Amazon Bedrock Knowledge Bases
   - Provided access to relevant AWS documentation and blogs
3. Model customization:
   - Fine-tuned both Amazon Nova models using 1,000 AWS-specific question-answer pairs generated from the same set of AWS articles
   - Deployed the customized models through provisioned throughput
   - Generated responses to AWS-specific questions with fine-tuned models
4. Model customization and RAG combined approach:
   - Connected the fine-tuned models to Amazon Bedrock Knowledge Bases
   - Provided fine-tuned models access to relevant AWS articles at inference time

In the following sections, we walk through how to set up the second and third approaches (base model with RAG and model customization with fine-tuning) in Amazon Bedrock.
Prerequisites
To follow along with this post, you need the following prerequisites:

An AWS account and appropriate permissions
An Amazon Simple Storage Service (Amazon S3) bucket with two folders: one containing your training data, and one for your model output and training metrics

Implement RAG with the baseline Amazon Nova model
In this section, we walk through the steps to implement RAG with the baseline model. To do so, we create a knowledge base. Complete the following steps:

1. On the Amazon Bedrock console, choose Knowledge Bases in the navigation pane.
2. Under Knowledge Bases, choose Create.
3. On the Configure data source page, provide the following information:
   - Specify the Amazon S3 location of the documents.
   - Specify a chunking strategy.
4. Choose Next.
5. On the Select embeddings model and configure vector store page, provide the following information:
   - In the Embeddings model section, choose your embeddings model, which is used for embedding the chunks.
   - In the Vector database section, create a new vector store or use an existing one where the embeddings will be stored for retrieval.
6. Choose Next.
7. On the Review and create page, review the settings and choose Create Knowledge Base.
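Once the knowledge base is ready, RAG responses can also be generated programmatically. The following is a minimal boto3 sketch; the knowledge base ID is a hypothetical placeholder, and the model ARN assumes Amazon Nova Lite in us-east-1:

```python
import boto3

client = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

response = client.retrieve_and_generate(
    input={"text": "What is AWS Lambda typically used for?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "KB1234567890",  # hypothetical knowledge base ID
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/amazon.nova-lite-v1:0",
        },
    },
)
print(response["output"]["text"])  # answer grounded in the retrieved context
```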

Fine-tune an Amazon Nova model using the Amazon Bedrock API
In this section, we provide detailed walkthroughs on fine-tuning and hosting customized Amazon Nova models using Amazon Bedrock. The following diagram illustrates the solution architecture.
 
Create a fine-tuning job
Fine-tuning Amazon Nova models through the Amazon Bedrock API is a streamlined process:

1. On the Amazon Bedrock console, choose us-east-1 as your AWS Region. (At the time of writing, Amazon Nova model fine-tuning is exclusively available in us-east-1.)
2. Choose Custom models under Foundation models in the navigation pane.
3. Under Customization methods, choose Create Fine-tuning job.
4. For Source model, choose Select model.
5. Choose Amazon as the provider and the Amazon Nova model of your choice.
6. Choose Apply.
7. For Fine-tuned model name, enter a unique name for the fine-tuned model.
8. For Job name, enter a name for the fine-tuning job.
9. Under Input data, enter the location of the source S3 bucket (training data) and target S3 bucket (model outputs and training metrics), and optionally the location of your validation dataset.

Configure hyperparameters
For Amazon Nova models, the following hyperparameters can be customized:

| Parameter | Range/Constraints |
|---|---|
| Epochs | 1–5 |
| Batch size | Fixed at 1 |
| Learning rate | 0.000001–0.0001 |
| Learning rate warmup steps | 0–100 |

Prepare the dataset for compatibility with Amazon Nova models
Similar to other LLMs, Amazon Nova requires prompt-completion pairs, also known as question and answer (Q&A) pairs, for supervised fine-tuning (SFT). This dataset should contain the ideal outputs you want the language model to produce for specific tasks or prompts. Refer to Guidelines for preparing your data for Amazon Nova on best practices and example formats when preparing datasets for fine-tuning Amazon Nova models.
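For illustration only, one record of such a JSON Lines dataset might look like the following (shown wrapped for readability; each record occupies a single line in the actual file). This sketch assumes the Bedrock conversation schema, so defer to the linked guidelines for the authoritative format:

```json
{"schemaVersion": "bedrock-conversation-2024",
 "system": [{"text": "You answer AWS questions accurately and concisely."}],
 "messages": [
   {"role": "user", "content": [{"text": "What is Amazon S3 used for?"}]},
   {"role": "assistant", "content": [{"text": "Amazon S3 is an object storage service for storing and retrieving any amount of data."}]}]}
```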
Examine fine-tuning job status and training artifacts
After you create your fine-tuning job, choose Custom models under Foundation models in the navigation pane. You will find the current fine-tuning job listed under Jobs. You can use this page to monitor your fine-tuning job status.

When your fine-tuning job status changes to Complete, you can choose the job name and navigate to the Training job overview page. You will find the following information:

Training job specifications
Amazon S3 location for input data used for fine-tuning
Hyperparameters used during fine-tuning
Amazon S3 location for training output

Host the fine-tuned model with provisioned throughput
After your fine-tuning job completes successfully, you can access your customized model through the following steps:

1. On the Amazon Bedrock console, choose Custom models under Foundation models in the navigation pane.
2. Under Models, choose your custom model. The model details page shows the following information:
   - Fine-tuned model details
   - Amazon S3 location for input data used for fine-tuning
   - Hyperparameters used during fine-tuning
   - Amazon S3 location for training output
3. To make your fine-tuned model available for inference, choose Purchase provisioned throughput.
4. Choose a commitment term (no commitment, 1 month, or 6 months) and review the associated cost for hosting the fine-tuned models.

After the customized model is hosted through provisioned throughput, a model ID will be assigned and can be used for inference.
The aforementioned fine-tuning and inference steps can also be done programmatically. For more information, refer to the following GitHub repo, which contains sample code.
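As a hedged sketch of the programmatic path, a fine-tuning job can be submitted with boto3 roughly as follows. The names, ARNs, and hyperparameter keys are illustrative assumptions; consult the sample code in the repo for exact values:

```python
import boto3

# Nova fine-tuning is available in us-east-1 at the time of writing.
bedrock = boto3.client("bedrock", region_name="us-east-1")

job = bedrock.create_model_customization_job(
    jobName="nova-micro-aws-qa",             # hypothetical job name
    customModelName="nova-micro-aws-qa-v1",  # hypothetical model name
    roleArn="arn:aws:iam::111122223333:role/BedrockFineTuneRole",
    baseModelIdentifier="amazon.nova-micro-v1:0",
    customizationType="FINE_TUNING",
    trainingDataConfig={"s3Uri": "s3://my-bucket/training/train.jsonl"},
    outputDataConfig={"s3Uri": "s3://my-bucket/output/"},
    hyperParameters={"epochCount": "2", "learningRate": "0.00001"},
)
print(job["jobArn"])  # poll get_model_customization_job(jobIdentifier=...) for status
```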
Evaluation framework and results
In this section, we first introduce our multi-LLM-judge evaluation framework, which is set up to mitigate an individual LLM judge’s bias. We then compare RAG vs. fine-tuning results in terms of response quality as well as latency and token implications.
Multiple LLMs as judges to mitigate bias
The following diagram illustrates our workflow using multiple LLMs as judges.

Using LLMs as judges has become an increasingly popular approach to evaluate tasks that are challenging to assess through traditional methods or human evaluation. For our evaluation framework, we constructed 10 domain-specific test questions covering key aspects of AWS services and features, designed to test both factual accuracy and depth of understanding. Each model-generated response was evaluated using a standardized scoring system on a scale of 0–10, where 0–3 indicates incorrect or misleading information, 4–6 represents partially correct but incomplete answers, 7–8 signifies mostly correct with minor inaccuracies, and 9–10 denotes completely accurate with comprehensive explanation.
We use the following LLM judge evaluation prompt:

```json
{
  "system_prompt": "You are a helpful assistant.",
  "prompt_template": "[Instruction] Please act as an impartial judge and evaluate the quality of the response provided by an AI assistant to the user question displayed below. Your evaluation should consider factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of detail of the response. Begin your evaluation by providing a short explanation. Be as objective as possible. After providing your explanation, you must rate the response on a scale of 1 to 10 by strictly following this format: \"[[rating]]\", for example: \"Rating: [[5]]\".\n\n[Question]\n{question}\n\n[The Start of Assistant's Answer]\n{answer}\n[The End of Assistant's Answer]",
  "description": "Prompt for general questions",
  "category": "general",
  "output_format": "[[rating]]"
}
```

We use the following sample evaluation question and ground truth:

```json
{
  "question_id": 9161,
  "category": "AWS",
  "turns": [
    "What specific details are collected and sent to AWS when anonymous operational metrics are enabled for an Amazon EFS file system?",
    "What's required for a successful AWS CloudFormation launch?"
  ],
  "reference": [
    "When anonymous operational metrics are enabled for an Amazon EFS file system, the following specific details are collected and sent to AWS: Solution ID, Unique ID, Timestamp, Backup ID, Backup Start Time, Backup Stop Time, Backup Window, Source EFS Size, Destination EFS Size, Instance Type, Retain, S3 Bucket Size, Source Burst Credit Balance, Source Burst Credit Balance Post Backup, Source Performance Mode, Destination Performance Mode, Number of Files, Number of Files Transferred, Total File Size, Total Transferred File Size, Region, Create Hard Links Start Time, Create Hard Links Stop Time, Remove Snapshot Start Time, Remove Snapshot Stop Time, Rsync Delete Start Time, Rsync Delete Stop Time.",
    "For a successful AWS CloudFormation launch, you need to sign in to the AWS Management Console, choose the correct AWS Region, use the button to launch the template, verify the correct template URL, assign a name to your solution stack, review and modify the parameters as necessary, review and confirm the settings, check the boxes acknowledging that the template creates AWS Identity and Access Management resources and may require an AWS CloudFormation capability, and choose Create stack to deploy the stack. You should receive a CREATE_COMPLETE status in approximately 15 minutes."
  ]
}
```

To mitigate potential intrinsic biases among different LLM judges, we adopted two LLM judges to evaluate the model-generated responses: Anthropic’s Claude 3.5 Sonnet and Meta’s Llama 3.1 70B. Each judge was provided with the original test question, the model-generated response, and specific scoring criteria focusing on factual accuracy, completeness, relevance, and clarity. Overall, we observed a high level of rank correlation among LLM judges in assessing different approaches, with consistent evaluation patterns across all test cases.
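In practice, each judge's free-text verdict has to be reduced to a number. Here is a small sketch of parsing the "[[rating]]" format mandated by the prompt above and averaging across judges; the example replies are invented:

```python
import re
import statistics


def parse_rating(judge_reply: str) -> int | None:
    # The judge prompt mandates "Rating: [[n]]"; pull out the bracketed score.
    match = re.search(r"\[\[(\d+)\]\]", judge_reply)
    return int(match.group(1)) if match else None


def aggregate(replies: list[str]) -> float:
    ratings = [r for r in map(parse_rating, replies) if r is not None]
    return statistics.mean(ratings)


print(aggregate([
    "Accurate and complete. Rating: [[8]]",       # e.g., one judge's reply
    "Mostly correct, minor gaps. Rating: [[7]]",  # e.g., the other judge's reply
]))  # -> 7.5
```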
Response quality comparison
Both fine-tuning and RAG significantly improve the quality of generated responses on AWS-specific questions over the base model. Using Amazon Nova Lite as the base model, we observed that both fine-tuning and RAG improved the average LLM judge score on response quality by 30%, whereas combining fine-tuning with RAG enhanced the response quality by a total of 83%, as shown in the following figure.

Notably, our evaluation revealed an interesting finding (as shown in the following figure): when combining fine-tuning and RAG approaches, smaller models like Amazon Nova Micro showed significant performance improvements in domain-specific tasks, nearly matching the performance of bigger models. This suggests that for specialized use cases with well-defined scope, using smaller models with both fine-tuning and RAG could be a more cost-effective solution compared to deploying larger models.

Latency and token implications
In addition to enhancing the response quality, both fine-tuning and RAG help reduce the response generation latency compared to the base model. For both Amazon Nova Micro and Amazon Nova Lite, fine-tuning reduced the base model latency by approximately 50%, whereas RAG reduced it by about 30%, as shown in the following figure.

Fine-tuning also presented the unique advantage of improving the tone and style of the generated answers to align more closely with the training data. In our experiments, the average total tokens (input and output tokens) dropped by more than 60% with both fine-tuned models. However, the average total tokens more than doubled with the RAG approach due to passing of context, as shown in the following figure. This finding suggests that for latency-sensitive use cases or when the objective is to align the model’s responses to a specific tone, style, or brand voice, model customization might offer more business value.

Conclusion
In this post, we compared model customization (fine-tuning) and RAG for domain-specific tasks with Amazon Nova. We first provided a detailed walkthrough on how to fine-tune, host, and conduct inference with customized Amazon Nova through the Amazon Bedrock API. We then adopted an LLM-as-a-judge approach to evaluate response quality from different approaches. In addition, we examined the latency and token implications of different setups.
Both fine-tuning and RAG improved the model performance. Depending on the task and evaluation criteria, model customization showed similar, or sometimes better, performance compared to RAG. Model customization can also help improve the style and tone of a generated answer. In this experiment, the customized model's response followed the succinct answer style of the given training data, which resulted in lower latency compared to the baseline counterpart. Additionally, model customization can be used for many use cases where RAG isn't as straightforward to use, such as tool calling, sentiment analysis, entity extraction, and more. Overall, we recommend combining model customization and RAG for question answering or similar tasks to maximize performance.
For more information on Amazon Bedrock and the latest Amazon Nova models, refer to the Amazon Bedrock User Guide and Amazon Nova User Guide. The AWS Generative AI Innovation Center has a group of AWS science and strategy experts with comprehensive expertise spanning the generative AI journey, helping customers prioritize use cases, build a roadmap, and move solutions into production. Check out the Generative AI Innovation Center for our latest work and customer success stories.

About the Authors
Mengdie (Flora) Wang is a Data Scientist at AWS Generative AI Innovation Center, where she works with customers to architect and implement scalable Generative AI solutions that address their unique business challenges. She specializes in model customization techniques and agent-based AI systems, helping organizations harness the full potential of generative AI technology. Prior to AWS, Flora earned her Master’s degree in Computer Science from the University of Minnesota, where she developed her expertise in machine learning and artificial intelligence.
Sungmin Hong is a Senior Applied Scientist at the Amazon Generative AI Innovation Center, where he helps expedite a variety of use cases for AWS customers. Before joining Amazon, Sungmin was a postdoctoral research fellow at Harvard Medical School. He holds a Ph.D. in Computer Science from New York University. Outside of work, he prides himself on keeping his indoor plants alive for 3+ years.
Jae Oh Woo is a Senior Applied Scientist at the AWS Generative AI Innovation Center, where he specializes in developing custom solutions and model customization for a diverse range of use cases. He has a strong passion for interdisciplinary research that connects theoretical foundations with practical applications in the rapidly evolving field of generative AI. Prior to joining Amazon, Jae Oh was a Simons Postdoctoral Fellow at the University of Texas at Austin, where he conducted research across the Mathematics and Electrical and Computer Engineering departments. He holds a Ph.D. in Applied Mathematics from Yale University.
Rahul Ghosh is an Applied Scientist at Amazon’s Generative AI Innovation Center, where he works with AWS customers across different verticals to expedite their use of Generative AI. Rahul holds a Ph.D. in Computer Science from the University of Minnesota.
Baishali Chaudhury is an Applied Scientist at the Generative AI Innovation Center at AWS, where she focuses on advancing generative AI solutions for real-world applications. She has a strong background in computer vision, machine learning, and AI for healthcare. Baishali holds a PhD in Computer Science from the University of South Florida and completed a postdoc at Moffitt Cancer Center.
Anila Joshi has more than a decade of experience building AI solutions. As an AWSI Geo Leader at the AWS Generative AI Innovation Center, Anila pioneers innovative applications of AI that push the boundaries of possibility and accelerates the adoption of AWS services by helping customers ideate, identify, and implement secure generative AI solutions.

Generate user-personalized communication with Amazon Personalize and Amazon Bedrock

Today, businesses are using AI and generative models to improve productivity in their teams and provide better experiences to their customers. Personalized outbound communication can be a powerful tool to increase user engagement and conversion.
For instance, as a marketing manager for a video-on-demand company, you might want to send personalized email messages tailored to each individual user—taking into account their demographic information, such as gender and age, and their viewing preferences. You want the messaging and movie recommendations to be both engaging and applicable to the customer. To achieve this, you can use Amazon Personalize to generate user-personalized recommendations and Amazon Bedrock to generate the text of the email.
Amazon Personalize enables your business to improve customer engagement by creating personalized product and content recommendations in websites, applications, and targeted marketing campaigns. You can get started without any prior machine learning (ML) experience, and Amazon Personalize allows you to use APIs to build sophisticated personalization capabilities. With this service, all your data is encrypted to be private and secure, and is only used to create recommendations for your users.
Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI. Using Amazon Bedrock, you can experiment with and evaluate top FMs for your use case, customize the model using fine-tuning, restrict the model output using Retrieval Augmented Generation (RAG), and build agents that execute tasks using your enterprise systems and data sources.
In this post, we demonstrate how to use Amazon Personalize and Amazon Bedrock to generate personalized outreach emails for individual users using a video-on-demand use case. This concept can be applied to other domains, such as compelling customer experiences for ecommerce and digital marketing use cases.
Solution overview
The following diagram shows how you can use Amazon Personalize and Amazon Bedrock to generate user-personalized outreach messages for each user.

The workflow consists of the following steps:

Import your user, item, and interaction data into Amazon Personalize. The user and item datasets are not required for Amazon Personalize to generate recommendations, but good item and user metadata helps your trained models produce the best results.
Train an Amazon Personalize “Top picks for you” recommender. Amazon Personalize recommenders are domain-specific resources that generate recommendations. When you create an Amazon Personalize recommender, Amazon Personalize trains the models backing the recommender with the best configurations for the use case. In our example, we use the “Top picks for you” recommender. This recommender generates personalized content recommendations for a user that you specify. With this use case, Amazon Personalize automatically filters videos the user watched.
After the model is trained, you can get the top recommended movies for each user by querying the recommender with each user ID through the Amazon Personalize Runtime API.
Combine a predefined prompt template with the top recommendations and user demographic information to generate an enhanced prompt.
Use the enhanced prompt in Amazon Bedrock through its API to generate your personalized outbound communication.
Amazon Bedrock returns the personalized outbound communication that you can email to your users.

We go deeper into each of these steps in the following sections. A code sample for this use case is available on AWS Samples on GitHub.
Prerequisites
To generate personalized recommendations, you must first set up Amazon Personalize resources. You start by creating your dataset group, loading your data, and then training a recommender. For full instructions, see Getting started tutorials.

Create a dataset group.
Create an Interactions dataset using the following schema:

{
  "type": "record",
  "name": "Interactions",
  "namespace": "com.amazonaws.personalize.schema",
  "fields": [
    {
      "name": "USER_ID",
      "type": "string"
    },
    {
      "name": "ITEM_ID",
      "type": "string"
    },
    {
      "name": "TIMESTAMP",
      "type": "long"
    },
    {
      "name": "EVENT_TYPE",
      "type": "string"
    }
  ],
  "version": "1.0"
}
Interaction data consists of information about the user interactions with the content in your application. This usually comes from analytics tools or a customer data platform (CDP). The best interaction data to use in Amazon Personalize includes the sequential order of user behavior and the content the user watched or clicked on. For this example, we use the ml-latest-small dataset from the MovieLens dataset to simulate user-item interactions.
Import the interaction data to Amazon Personalize from Amazon Simple Storage Service (Amazon S3). For this example, we convert the data to the appropriate format following the steps in the notebook 01_Introduction_and_Data_Preparation.
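The conversion itself is mostly a column mapping. The following pandas sketch shows one plausible way to reshape the MovieLens ratings file into the Interactions schema above; the exact preparation steps live in the referenced notebook, so treat the column choices here as an assumption:

import pandas as pd

# ml-latest-small ratings.csv columns: userId, movieId, rating, timestamp
ratings = pd.read_csv("ratings.csv")

interactions = pd.DataFrame({
    "USER_ID": ratings["userId"].astype(str),
    "ITEM_ID": ratings["movieId"].astype(str),
    "TIMESTAMP": ratings["timestamp"].astype("int64"),
    "EVENT_TYPE": "Watch",  # assumed event type for simulated watch events
})
interactions.to_csv("interactions.csv", index=False)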
Item data consists of information about the content that is being interacted with, which generally comes from a content management system (CMS) in video-on-demand use cases. This can be information like the title, description, or movie genre. To provide additional metadata, and also provide a consistent experience for our users, we use a subset of the IMDb Essential Metadata for Movies/TV/OTT dataset. IMDb has multiple datasets available in AWS Data Exchange. For this post, we extracted and prepared a subset of the IMDb Essential Metadata for Movies/TV/OTT (Bulk data) dataset. With this data, create an Items dataset using the following schema:

items_schema = {
    "type": "record",
    "name": "Items",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {
            "name": "ITEM_ID",
            "type": "string"
        },
        {
            "name": "TITLE",
            "type": "string"
        },
        {
            "name": "YEAR",
            "type": "int"
        },
        {
            "name": "IMDB_RATING",
            "type": "int"
        },
        {
            "name": "IMDB_NUMBEROFVOTES",
            "type": "int"
        },
        {
            "name": "PLOT",
            "type": "string",
            "textual": True
        },
        {
            "name": "US_MATURITY_RATING_STRING",
            "type": "string"
        },
        {
            "name": "US_MATURITY_RATING",
            "type": "int"
        },
        {
            "name": "GENRES",
            "type": "string",
            "categorical": True
        },
        {
            "name": "CREATION_TIMESTAMP",
            "type": "long"
        },
        {
            "name": "PROMOTION",
            "type": "string"
        }
    ],
    "version": "1.0"
}

Import the item data to Amazon Personalize from Amazon S3. For this example, we convert the data to the appropriate format following the steps in the notebook 01_Introduction_and_Data_Preparation. For more information on formatting and importing your interactions and items data from Amazon S3, see Importing bulk records.
Create a recommender. In this example, we create a “Top picks for you” recommender.
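Creating the recommender can also be done programmatically. The following boto3 sketch assumes a dataset group ARN from the earlier steps; the recipe ARN shown is the one associated with the VIDEO_ON_DEMAND "Top picks for you" use case, but verify it against the current Amazon Personalize documentation:

import boto3

personalize = boto3.client("personalize")

dataset_group_arn = "arn:aws:personalize:us-east-1:111122223333:dataset-group/workshop"  # placeholder

create_recommender_response = personalize.create_recommender(
    name="workshop-recommender-top-picks",
    recipeArn="arn:aws:personalize:::recipe/aws-vod-top-picks",
    datasetGroupArn=dataset_group_arn,
)
workshop_recommender_top_picks_arn = create_recommender_response["recommenderArn"]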

Get personalized recommendations using Amazon Personalize
Now that we have trained the "Top picks for you" recommender, we can generate recommendations for our users. For more details and ways to use Amazon Personalize to get recommendations, see Getting recommendations from Amazon Personalize. We include the item metadata in the response so we can use this information in our outbound communication in the next step. You can use the following code to get recommended movies for each user:

get_recommendations_response = personalize_runtime.get_recommendations(
    recommenderArn=workshop_recommender_top_picks_arn,
    userId=str(user_id),
    numResults=number_of_movies_to_recommend,
    metadataColumns={
        "ITEMS": ["TITLE", "PLOT", "GENRES"]
    }
)

In the items dataset, we can specify the metadata columns we want the recommender to return. In this case, we request the Title, Plot, and Genres of the recommended movie. You can request metadata columns only if this feature has been enabled when the recommender was created.
For an example user_id, the following movies are recommended:

Title: There’s Something About Mary
Genres: Comedy and Romance
Plot: A man gets a chance to meet up with his dream girl from high school, even though his date with her back then was a complete disaster.

Title: Shakespeare in Love
Genres: Comedy and Drama and History and Romance
Plot: The world’s greatest ever playwright, William Shakespeare, is young, out of ideas and short of cash, but meets his ideal woman and is inspired to write one of his most famous plays.

Title: The Birdcage
Genres: Comedy
Plot: A gay cabaret owner and his drag queen companion agree to put up a false straight front so that their son can introduce them to his fiancée’s right-wing moralistic parents.

Get the user’s favorite movie genre
To provide a better personalized outbound communication experience, we determine the user’s favorite movie genre based on the genres of all the movies they have interacted with in the past. There are a number of ways to do this, such as counting the number of interactions per genre for our user. In this example, our sample user’s favorite genre is Comedy.
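A minimal sketch of the counting approach, assuming we already have the genre strings for the items in the user's interaction history (the variable names and sample values here are hypothetical):

from collections import Counter

# Hypothetical genre strings for the items the user interacted with
user_item_genres = ["Comedy|Romance", "Comedy", "Drama|Comedy", "Action"]

genre_counts = Counter(
    genre for genres in user_item_genres for genre in genres.split("|")
)
favorite_genre = genre_counts.most_common(1)[0][0]
print(favorite_genre)  # "Comedy" for this sample history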
Generate personalized marketing emails with recommended movies
To generate personalized marketing emails, we use Amazon Bedrock. Amazon Bedrock is a fully managed service that makes base models from Amazon and third-party model providers accessible through an API. Amazon Bedrock users must request access to models before they are available for use.
To request access, choose Model access in the navigation pane on the Amazon Bedrock console. For more information, see Access Amazon Bedrock foundation models.
In this example, we use Anthropic’s Claude 3.7 on Amazon Bedrock and have defined the following configuration parameters:

# The LLM we will be using
model_id = 'us.anthropic.claude-3-7-sonnet-20250219-v1:0'

# The maximum number of tokens to use in the generated response
max_tokens_to_sample = 1000

Let’s generate a simple outreach email using the recommended movies and the following prompt template:

prompt_template = f'''Write a marketing email advertising several movies available in a video-on-demand streaming platform next week, given the movie and user information below. The movies to recommend and their information is contained in the <movie> tag. Put the email between <email> tags.

<movie>
{movie_list}
</movie>

Assistant: Email body:
<email>
'''

Using the recommended movies, the full prompt is as follows:

"Write a marketing email advertising several movies available in a video-on-demand streaming platform next week, given the movie and user information below. The movies to recommend and their information is contained in the <movie> tag. Put the email between <email> tags.

<movie>
[
    {
        'title': "There's Something About Mary",
        'genres': 'Comedy and Romance',
        'plot': 'A man gets a chance to meet up with his dream girl from high school, even though his date with her back then was a complete disaster.'
    },
    {
        'title': 'Shakespeare in Love',
        'genres': 'Comedy and Drama and History and Romance',
        'plot': "The world's greatest ever playwright, William Shakespeare, is young, out of ideas and short of cash, but meets his ideal woman and is inspired to write one of his most famous plays."
    },
    {
        'title': 'The Birdcage',
        'genres': 'Comedy',
        'plot': "A gay cabaret owner and his drag queen companion agree to put up a false straight front so that their son can introduce them to his fiancée's right-wing moralistic parents."
    }
]
</movie>

Assistant: Email body:
<email>"

We then use an Amazon Bedrock API call to generate the personalized email. For more information, see Amazon Bedrock API Reference.

request_body = json.dumps({
    "max_tokens": max_tokens_to_sample,
    "messages": [{"role": "user", "content": prompt}],
    "anthropic_version": "bedrock-2023-05-31"
})

personalized_email_response = bedrock_client.invoke_model(
    body=request_body,
    modelId=model_id
)
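The generated text can then be read from the streaming body of the response. This is a short sketch of decoding the Anthropic messages response shape returned by Amazon Bedrock:

import json

response_body = json.loads(personalized_email_response["body"].read())
personalized_email = response_body["content"][0]["text"]  # the generated email
print(personalized_email)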

Amazon Bedrock returns a personalized email for the user:

Subject: Your Weekend Movie Escape Awaits! Three Laugh-Out-Loud Comedies Coming Next Week
Hi there,
Need a break from reality? We’ve got you covered with three fantastic comedies hitting our streaming platform next week!
## This Week’s Spotlight: Comedy Gems That Will Make Your Day
**There’s Something About Mary** This hilarious romantic comedy follows a man who finally gets a second chance with his high school dream girl—after their first date went hilariously wrong. With unforgettable laughs and heartwarming moments, it’s the perfect weekend watch!
**Shakespeare in Love** When the greatest playwright of all time faces writer’s block and money troubles, an unexpected romance changes everything! This award-winning comedy-drama blends history, romance, and witty humor as Shakespeare finds his muse and creates one of his most beloved plays. A delightful escape for literature lovers and romantics alike!
**The Birdcage** Prepare for non-stop laughter in this comedy classic! When a gay cabaret owner and his drag queen partner pretend to be straight to impress their future in-laws (who happen to be ultra-conservative), chaos and hilarity ensue. A perfect blend of humor and heart that still resonates today.
So grab your popcorn, get comfortable on the couch, and enjoy these comedy classics starting next week!
Happy streaming!
The Movies-On-Demand Team
P.S. Don’t forget to check out our complete catalog for more great films in every genre!

Although this is already a good outreach email because the recommendations are personalized to the user, we can personalize it further by adding more information about the user.
Generate personalized communication with recommended movies, user demographic information, and favorite genre
We will generate emails by assuming two different demographics for the users as well as their favorite genre.
The version of the ml-latest-small dataset from the MovieLens dataset we used in this example doesn’t contain demographic data; therefore, we will try out multiple options. In a real-world scenario, you might know the demographics of your audience.
To experiment, let’s use the following example demographic:

# Sample user demographics
user_demographic_1 = f'The user is a 50 year old adult called Otto.'

We also add the user’s favorite genre to the prompt as follows:

prompt_template = f'''You are a skilled publicist. Write a high-converting marketing email advertising several movies available in a video-on-demand streaming platform next week,
given the movie and user information below. Do not add additional information. Your email will leverage the power of storytelling and persuasive language.
You want the email to impress the user, so make it appealing to them based on the information contained in the <user> tags,
and take into account the user's favorite genre in the <genre> tags.
The movies to recommend and their information is contained in the <movie> tag.
All movies in the <movie> tag must be recommended. Give a summary of the movies and why the human should watch them.
Put the email between <email> tags.

<user>
{user_demographic}
</user>

<genre>
{favorite_genre}
</genre>

<movie>
{movie_list}
</movie>

Assistant:

<email>
'''

After adding the information, the new prompt is as follows:

"You are a skilled publicist. Write a high-converting marketing email advertising several movies available in a video-on-demand streaming platform next week, given the movie and user information below. Do not add additional information. Your email will leverage the power of storytelling and persuasive language. You want the email to impress the user, so make it appealing to them based on the information contained in the <user> tags, and take into account the user's favorite genre in the <genre> tags. The movies to recommend and their information is contained in the <movie> tag. All movies in the <movie> tag must be recommended. Give a summary of the movies and why the human should watch them. Put the email between <email> tags.

<user>
The user is a 50 year old adult called Otto.
</user>

<genre>
Comedy
</genre>

<movie>
[
    {
        'title': "There's Something About Mary",
        'genres': 'Comedy and Romance',
        'plot': 'A man gets a chance to meet up with his dream girl from high school, even though his date with her back then was a complete disaster.'
    },
    {
        'title': 'Shakespeare in Love',
        'genres': 'Comedy and Drama and History and Romance',
        'plot': "The world's greatest ever playwright, William Shakespeare, is young, out of ideas and short of cash, but meets his ideal woman and is inspired to write one of his most famous plays."
    },
    {
        'title': 'The Birdcage',
        'genres': 'Comedy',
        'plot': "A gay cabaret owner and his drag queen companion agree to put up a false straight front so that their son can introduce them to his fiancée's right-wing moralistic parents."
    }
]
</movie>

Assistant:

<email>"

Amazon Bedrock returns a personalized email for the user:

Subject: Otto, Get Ready for a Comedy Extravaganza on Your Screen Next Week!
Dear Otto,
We’re thrilled to bring you an exclusive lineup of comedy classics hitting our streaming platform next week! As someone who appreciates a good laugh, you’re in for a treat with these award-winning comedies that will brighten your evenings.
## “There’s Something About Mary” This hilarious romantic comedy follows the misadventures of a man who finally gets a second chance with his high school dream girl. After a first date that was nothing short of catastrophic, he’s determined to make things right years later. With its perfect blend of outrageous humor and heartwarming moments, this comedy classic delivers laughs that have stood the test of time.
## “Shakespeare in Love” Experience the witty and charming story of a young, broke William Shakespeare who finds his muse in the most unexpected place. This brilliant comedy-drama offers a fictional account of how the greatest playwright found inspiration through love. With its clever dialogue, historical setting, and romantic storyline, this Academy Award-winning film combines your love of comedy with rich storytelling that will keep you engaged from beginning to end.
## “The Birdcage” A comedy masterpiece that delivers non-stop laughs! When a gay cabaret owner and his flamboyant partner must pretend to be straight to impress their future in-laws (who happen to be ultra-conservative), chaos ensues. The brilliant performances and hilarious situations make this one of the most beloved comedies of its era. It’s the perfect film for when you need genuine belly laughs and brilliant comedic timing.
Otto, these comedies are among the best in their genre and will be available for your enjoyment starting next week. Whether you’re in the mood for slapstick humor, clever wit, or situational comedy, this collection has something perfect for your evening entertainment.
Grab your favorite snack, get comfortable on the couch, and prepare for an unforgettable comedy marathon!
Happy streaming!
The VOD Team

The email now incorporates the user's favorite genre and is personalized using their name and the recommended movies they are most likely to be interested in.
Clean up
Make sure you clean up any unused resources you created in your account while following the steps outlined in this post. You can delete filters, recommenders, datasets, and dataset groups using the AWS Management Console or the Python SDK.
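With the Python SDK, cleanup might look like the following sketch. The variables are placeholders for the ARNs of the resources you created, and each deletion must wait for its dependents to finish deleting (recommenders before datasets, datasets before the dataset group):

import boto3

personalize = boto3.client("personalize")

# Placeholders for the ARNs of the resources created earlier
personalize.delete_recommender(recommenderArn=workshop_recommender_top_picks_arn)
personalize.delete_dataset(datasetArn=interactions_dataset_arn)
personalize.delete_dataset(datasetArn=items_dataset_arn)
personalize.delete_dataset_group(datasetGroupArn=dataset_group_arn)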
Conclusion
Traditional AI and generative AI allow you to build hyper-personalized experiences for your users. In this post, we showed how to generate personalized outbound communication by getting personalized recommendations for each user using Amazon Personalize and then using user preferences and demographic information to write a personalized email communication using Amazon Bedrock. By using AWS managed services, such as Amazon Personalize and Amazon Bedrock, you can create this content with only a few API calls—no ML experience required.
For more information about Amazon Personalize, see the Amazon Personalize Developer Guide. For more information on working with generative AI on AWS, see Announcing New Tools for Building with Generative AI on AWS.

About the Author
Anna Grüebler Clark is a Specialist Solutions Architect at AWS focusing on Artificial Intelligence. She has more than 16 years of experience helping customers develop and deploy machine learning applications. Her passion is taking new technologies and putting them in the hands of everyone, and solving difficult problems by leveraging the advantages of traditional and generative AI in the cloud.

Google Introduces Agent2Agent (A2A): A New Open Protocol that Allows AI Agents to Securely Collaborate Across Ecosystems Regardless of Framework or Vendor

Google AI recently announced Agent2Agent (A2A), an open protocol designed to facilitate secure, interoperable communication among AI agents built on different platforms and frameworks. By offering a standardized approach to agent interaction, A2A aims to streamline complex workflows that involve specialized AI agents collaborating to complete tasks of varying complexity and duration.

A2A addresses a key challenge in the AI domain: the lack of a common mechanism for agents to discover, communicate, and coordinate across vendor ecosystems. In many industries, organizations often deploy multiple AI systems for specific functions, but these systems do not always integrate smoothly. A2A is intended to close this gap by providing a universal set of rules for agent interoperability, such that agents created by different teams or companies can work in tandem without custom integrations.

A prominent feature of A2A is its enterprise-grade focus. The protocol supports long-running tasks that extend over days, weeks, or even months—such as supply chain planning or multi-stage hiring. It also accommodates multimodal collaboration, so AI agents can share and process text, audio, and video in a unified workflow. By using Agent Cards in JSON format, agents can advertise their capabilities, security permissions, and any relevant information required to handle tasks. This approach allows each agent to quickly assess whether it can perform a given task, request additional resources, or delegate responsibilities to other capable agents.
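For illustration, an Agent Card might look like the following JSON. The field names follow the general shape of the published A2A examples, but treat the exact schema as an assumption to verify against the protocol documentation:

{
  "name": "CandidateScreeningAgent",
  "description": "Screens job candidates against role-specific criteria",
  "url": "https://agents.example.com/screening",
  "version": "1.0.0",
  "capabilities": {
    "streaming": true,
    "pushNotifications": false
  },
  "authentication": {
    "schemes": ["bearer"]
  },
  "skills": [
    {
      "id": "screen-candidates",
      "name": "Screen candidates",
      "description": "Evaluates resumes against a job description"
    }
  ]
}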

Security is another central aspect of A2A. AI systems frequently deal with sensitive data, such as personal information in hiring or customer records in finance. To address these requirements, A2A aligns with OpenAPI-level authentication standards, enforcing role-based access control and encrypted data exchanges. This approach aims to ensure that only authorized agents, possessing the correct credentials and permissions, can participate in critical workflows or access protected data streams.

How A2A Works

To guide its development, A2A is built around five core design principles:

Agentic-first: Agents do not share memory or tools by default. They operate independently and communicate explicitly to exchange information.

Standards-compliant: The protocol uses widely adopted web technologies, such as HTTP, JSON-RPC, and Server-Sent Events (SSE), to minimize friction for developers (see the JSON-RPC sketch after this list).

Secure by default: Built-in authentication and authorization measures are intended to safeguard sensitive transactions and data.

Handles short and long tasks: A2A supports both brief interactions (like a quick information request) and extended processes that require ongoing collaboration.

Modality-agnostic: Agents can handle text, video, audio, or other data types by sharing structured task updates in real time.
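As a concrete illustration of the standards-compliant principle, sending a task to another agent is a plain JSON-RPC 2.0 call over HTTP. The method and parameter names below follow the shape of early A2A examples; treat them as an assumption to check against the current spec:

{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tasks/send",
  "params": {
    "id": "task-123",
    "message": {
      "role": "user",
      "parts": [
        {"type": "text", "text": "Schedule an interview with the shortlisted candidate"}
      ]
    }
  }
}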

From a technical standpoint, A2A can be seen as complementary to other emerging standards for AI multi-agent systems. For instance, Anthropic’s Model Context Protocol (MCP) focuses on how different language models handle shared context during multi-agent reasoning. A2A’s emphasis lies in the interoperability layer, ensuring agents can securely discover one another and collaborate once models are ready to exchange data or coordinate tasks. This combination of context sharing (MCP) and inter-agent communication (A2A) can form a more comprehensive foundation for multi-agent applications.

An example of a real-world application for A2A is the hiring process. One agent might screen candidates based on specific criteria, another could schedule interviews, while a third might manage background checks. These specialized agents can communicate through a unified interface, synchronizing the status of each step and ensuring that relevant information is passed along securely.

Google has open-sourced A2A to encourage community involvement and standardization across the AI industry. Key consulting and technology firms—including BCG, Deloitte, Cognizant, and Wipro—are contributing to its development, with the goal of refining interoperability and security features. By taking this collaborative approach, Google aims to lay the groundwork for a more flexible and efficient multi-agent ecosystem.

Overall, A2A offers a structured way for organizations to integrate specialized AI agents, enabling them to exchange data securely, manage tasks more effectively, and support a broad range of enterprise requirements. As AI continues to expand into various facets of business operations, protocols like A2A may help unify disparate systems, fostering more dynamic and reliable workflows at scale.

Check out the Technical details and Google Blog.


Google Releases Agent Development Kit (ADK): An Open-Source AI Framework Integrated with Gemini to Build, Manage, Evaluate and Deploy Multi Agents

Google has released the Agent Development Kit (ADK), an open-source framework aimed at making it easier for developers to build, manage, and deploy multi-agent systems. ADK is written in Python and focuses on modularity and flexibility, making it suitable for both simple and more complex use cases involving multiple interacting agents.

Summary

Set up a basic multi-agent system with under 100 lines of Python.

Customize agents and tools using a flexible API.

Currently Python-based, with plans to support other languages in the future.

What is ADK?

ADK is a developer-oriented framework for creating multi-agent systems. It provides a set of components like agents, tools, orchestrators, and memory modules, all of which can be extended or replaced. The idea is to give developers control over how agents interact and manage their internal state, while also providing a structure that’s easy to understand and work with.

Core Features

Code-first approach: You write plain Python to define behavior.

Multi-agent support: Run and coordinate multiple agents.

Custom tools and memory: Extend with your own logic and state management.

Streaming support: Agents can exchange information in real time.

Example: A Basic Multi-Agent Setup

Here’s a short script that shows how to define and run a multi-agent system using ADK:

from adk import Agent, Orchestrator, Tool

class EchoTool(Tool):
    def run(self, input: str) -> str:
        return f"Echo: {input}"

echo_agent = Agent(name="EchoAgent", tools=[EchoTool()])
relay_agent = Agent(name="RelayAgent")

orchestrator = Orchestrator(agents=[echo_agent, relay_agent])

if __name__ == "__main__":
    input_text = "Hello from ADK!"
    result = orchestrator.run(input_text)
    print(result)

This script creates two agents and a simple custom tool. One agent uses the tool to process input, and the orchestrator manages the interaction between them.

Development Workflow

ADK is designed to fit into standard development workflows. You can:

Log and debug agent behavior.

Manage short- and long-term memory.

Extend agents with custom tools and APIs.

Adding a Custom Tool

You can define your own tools to let agents call APIs or execute logic. For example:

class SearchTool(Tool):
    def run(self, query: str) -> str:
        # Placeholder for API logic
        return f"Results for '{query}'"

Attach the tool to an agent and include it in the orchestrator to let your system perform searches or external tasks.
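Continuing the article's illustrative API (the Agent and Orchestrator shapes from the earlier example, not a verbatim ADK signature), that wiring might look like:

search_agent = Agent(name="SearchAgent", tools=[SearchTool()])
orchestrator = Orchestrator(agents=[search_agent, relay_agent])
print(orchestrator.run("latest ADK release notes"))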

Integrations and Tooling

ADK integrates well with Google’s broader AI ecosystem. It supports Gemini models and connects to Vertex AI, allowing access to models from providers like Anthropic, Meta, Mistral, and others. Developers can choose the best models for their application needs.

Google also introduced Agent Engine, a managed runtime for deploying agents into production. It handles context management, scaling, security, evaluation, and monitoring. Though it complements ADK, Agent Engine is also compatible with other agent frameworks such as LangGraph and CrewAI.

To help developers get started, Google provides Agent Garden, a collection of pre-built agents and tools. This library allows teams to prototype faster by reusing existing components rather than starting from scratch.

Security and Governance

For enterprise-grade applications, ADK and its supporting tools offer several built-in safeguards:

Output control to moderate agent responses.

Identity permissions to restrict what agents can access or perform.

Input screening to catch problematic inputs.

Behavior monitoring to log and audit agent actions.

These features help teams deploy AI agents with more confidence in secure or sensitive environments.

What’s Next

Right now, ADK supports Python, and the team behind it has shared plans to support other languages over time. Since the project is open-source, contributions and extensions are encouraged, and the framework may evolve based on how developers use it in real-world settings.

Conclusion

ADK offers a structured but flexible way to build multi-agent systems. It’s especially useful if you want to experiment with agent workflows without having to build everything from scratch. With integration options, prebuilt libraries, and production-grade tooling, ADK can be a practical starting point for teams developing AI-driven applications.

Whether you’re experimenting with small agent workflows or exploring more involved systems, ADK is a practical tool to consider.

Check out the GitHub Page and Documentation.


Unveiling Attention Sinks: The Functional Role of First-Token Focus in Stabilizing Large Language Models

LLMs often show a peculiar behavior where the first token in a sequence draws unusually high attention—known as an “attention sink.” Despite seemingly unimportant, this token frequently dominates attention across many heads in Transformer models. While prior research has explored when and how attention sinks occur, the reasons behind their emergence and functional role remain unclear. These attention patterns are linked to challenges and optimization in LLMs, such as quantization, key-value caching, streaming attention, and even security vulnerabilities, highlighting their significance and the need for deeper understanding.

Researchers from the University of Oxford, NUS, and Google DeepMind explored why attention sinks—where models focus heavily on the first token—emerge in LLMs. Contrary to past efforts to reduce them, they argue that these sinks serve a functional role by preventing over-mixing of token representations, which can lead to collapse or instability in deep Transformers. The ⟨bos⟩ token often attracts the majority of attention, limiting the spread of perturbations and stabilizing the model. Experiments on models like Gemma 7B and LLaMa 3.1 405B confirm that attention sinks become more prominent in deeper models and longer contexts, supporting their theory.

The study explores how decoder-only Transformers, the architecture behind most modern language models, use attention mechanisms to process sequences token by token. In such models, each token can only attend to past tokens due to causal masking. A recurring phenomenon in these models is the emergence of “attention sinks”—tokens like the beginning-of-sequence (⟨bos⟩) that disproportionately attract attention across multiple heads and layers. While these sinks were previously seen as artifacts of large key and query activations, this work argues that they are vital in maintaining stable representations, especially in long sequences. By concentrating attention, sinks prevent excessive mixing of information across layers, helping to preserve the uniqueness of token representations.
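This concentration is easy to observe directly. The following is a minimal sketch that measures how much attention all positions place on the first token, using GPT-2 from Hugging Face purely for illustration (the paper's experiments used models such as Gemma 7B and LLaMa 3.1 405B):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", attn_implementation="eager")

inputs = tokenizer("The quick brown fox jumps over the lazy dog", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

# outputs.attentions is a tuple of (batch, heads, query, key) tensors, one per layer.
# Average the attention mass that every query position assigns to the first token.
sink_mass = torch.stack([attn[0, :, :, 0].mean() for attn in outputs.attentions]).mean()
print(f"Average attention on the first token: {sink_mass:.2%}")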

The study connects attention sinks to problems like rank collapse and over-squashing, which degrade model performance by compressing diverse inputs into indistinct representations. It uses mathematical tools like Jacobian norms to show how attention sinks reduce sensitivity to perturbations, effectively acting as stabilizers that prevent representational collapse. Experiments on models like Gemma 7B confirm that removing attention sinks increases information diffusion, while their presence maintains sharper, more localized attention patterns. Thus, attention sinks are not just a side effect but a structural feature that supports the Transformer’s ability to handle deep and long-range dependencies.

The study investigates whether the beginning-of-sequence (⟨bos⟩) token holds any special role in forming attention sinks in language models. Through a series of experiments using different data packing and masking strategies, the researchers find that attention sinks consistently form at the first token of the input, whether or not it is explicitly marked as ⟨bos⟩. However, when ⟨bos⟩ is fixed at the start of every sequence during pretraining, the model learns to rely on it more heavily to stabilize attention and prevent over-mixing of token representations. Removing ⟨bos⟩ during inference in such models leads to a collapse in sink formation and a significant drop in performance. This highlights that although the first token always plays a role in anchoring attention, the training setup—especially the consistent presence of ⟨bos⟩—greatly strengthens this effect.

In conclusion, the study argues that attention sinks are a structural solution to challenges like over-squashing and excessive mixing in deep Transformers. Directing attention toward the initial token—typically ⟨bos⟩—helps the model reduce its sensitivity to input noise and retain distinct token representations over long contexts. The findings also show that context length, model depth, and training configurations significantly affect how and where sinks form. By offering theoretical insights and empirical validation, the work presents attention sinks not as quirks but as components contributing to large language models’ stability and efficiency.

Check out the Paper.


Implement human-in-the-loop confirmation with Amazon Bedrock Agents

Agents are revolutionizing how businesses automate complex workflows and decision-making processes. Amazon Bedrock Agents helps you accelerate generative AI application development by orchestrating multi-step tasks. Agents use the reasoning capability of foundation models (FMs) to break down user-requested tasks into multiple steps. In addition, they use the developer-provided instruction to create an orchestration plan and then carry out the plan by invoking company APIs and accessing knowledge bases using Retrieval Augmented Generation (RAG) to provide an answer to the user’s request.
Building intelligent autonomous agents that effectively handle user queries requires careful planning and robust safeguards. Although FMs continue to improve, they can still produce incorrect outputs, and because agents are complex systems, errors can occur at multiple stages. For example, an agent might select the wrong tool or use correct tools with incorrect parameters. Although Amazon Bedrock agents can self-correct through their reasoning and action (ReAct) strategy, repeated tool execution might be acceptable for non-critical tasks but risky for business-critical operations, such as database modifications.
In these sensitive scenarios, human-in-the-loop (HITL) interaction is essential for successful AI agent deployments, encompassing multiple critical touchpoints between humans and automated systems. HITL can take many forms, from end-users approving actions and providing feedback, to subject matter experts reviewing responses offline and agents working alongside customer service representatives. The common thread is maintaining human oversight and using human intelligence to improve agent performance. This human involvement helps establish ground truth, validates agent responses before they go live, and enables continuous learning through feedback loops.
In this post, we focus on enabling end-users to approve actions and provide feedback using built-in Amazon Bedrock Agents features, namely HITL patterns that provide safe and effective agent operations. We explore the patterns available using a Human Resources (HR) agent example that helps employees request time off. You can recreate the example manually or using the AWS Cloud Development Kit (AWS CDK) by following our GitHub repository. We show you what these methods look like from an application developer's perspective while providing you with the overall idea behind the concepts. In this post, we use the user confirmation and return of control features of Amazon Bedrock Agents to implement human confirmation.
Amazon Bedrock Agents frameworks for human-in-the-loop confirmation
When implementing human validation in Amazon Bedrock Agents, developers have two primary frameworks at their disposal: user confirmation and return of control (ROC). These mechanisms, though serving similar oversight purposes, address different validation needs and operate at different levels of the agent’s workflow.
User confirmation provides a straightforward way to pause and validate specific actions before execution. With user confirmation, the developer receives information about the function (or API) and parameters values that an agent wants to use to complete a certain task. The developer can then expose this information to the user in the agentic application to collect a confirmation that the function should be executed before continuing the agent’s orchestration process.
With ROC, the agent provides the developer with information about the task that it wants to execute and relies entirely on the developer to execute the task. In this approach, the developer can not only validate the agent's decision, but also contribute additional context and modify parameters during the agent's execution process. ROC is configured at the action group level, covering multiple actions.
Let’s explore how each framework can be implemented and their specific use cases.
Autonomous agent execution: No human-in-the-loop
First, let’s demonstrate what a user experience might look like if your application doesn’t have a HITL. For that, let’s consider the following architecture.

In the preceding diagram, the employee interacts with the HR Assistant agent, which then invokes actions that can change important details about the employee’s paid time off (PTO). In this scenario, when an employee requests time off, the agent will automatically request the leave after confirming that enough PTO days are still available for the requesting employee.
The following screenshot shows a sample frontend UI for an Amazon Bedrock agent with functions to retrieve PTOs and request new ones.

In this interaction, the PTO request was submitted with no confirmation from the end-user. What if the user didn’t want to actually submit a request, but only check that it could be done? What if the date they provided was incorrect and had a typo? For any action that changes the state of a user’s PTO, it would provide a better user experience if the system asked for confirmation before actually making those changes.
Simple human validation: User confirmation
When requesting PTO, employees expect to be able to confirm their actions. This minimizes the execution of accidental requests and helps confirm that the agent understood the request and its parameters correctly.
For such scenarios, a Boolean confirmation is already sufficient to continue the execution of the agentic flow. Amazon Bedrock Agents offers an out-of-the-box user confirmation feature that enables developers to incorporate an extra layer of safety and control into their AI-driven workflows. This mechanism strikes a balance between automation and human oversight by making sure that critical actions are validated by users before execution. With user confirmation, developers can decide which tools can be executed automatically and which ones should be confirmed first.
For our example, reading the values for available PTO hours and listing the past PTO requests taken by an employee are non-critical operations that can be executed automatically. However, booking, updating, or canceling a PTO request requires changes on a database and are actions that should be confirmed before execution. Let’s change our agent architecture to include user confirmation, as shown in the following updated diagram.

In the updated architecture, when the employee interacts with the HR Assistant agent and the create_pto_request() action needs to be invoked, the agent will first request user confirmation before execution.
To enable user confirmation, agent developers can use the AWS Management Console, an SDK such as Boto3, or infrastructure as code (IaC) with AWS CloudFormation (see AWS::Bedrock::Agent Function). The user experience with user confirmation will look like the following screenshot.

In this interaction, the agent requests a confirmation from the end-user in order to execute. The user can then choose if they want to proceed with the time off request or not. Choosing Confirm will let the agent execute the action based on the parameter displayed.

The following diagram illustrates the workflow for confirming the action.

In this scenario, the developer maps the way the confirmation is displayed to the user in the client-side UI and the agent validates the confirmation state before executing the action.
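From the application developer's side, the confirmation round trip happens through the InvokeAgent API. The following boto3 sketch is our own illustration of sending the user's decision back to the agent; the agent IDs, action group, and function names are placeholders, and the sessionState fields should be checked against the Amazon Bedrock Agents documentation:

import boto3

client = boto3.client("bedrock-agent-runtime")

# After the agent pauses and surfaces the function and parameters for
# confirmation, return the user's decision so orchestration can continue.
response = client.invoke_agent(
    agentId="AGENT_ID",
    agentAliasId="AGENT_ALIAS_ID",
    sessionId="session-1",
    sessionState={
        "invocationId": invocation_id,  # from the agent's confirmation request
        "returnControlInvocationResults": [{
            "functionResult": {
                "actionGroup": "RequestPTOActionGroup",
                "function": "create_pto_request",
                "confirmationState": "CONFIRM",  # or "DENY"
                "responseBody": {"TEXT": {"body": "confirmed"}},
            }
        }],
    },
)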
Customized human input: Return of control
User confirmation provides a simple yes/no validation, but some scenarios require a more nuanced human input. This is where ROC comes into play. ROC allows for a deeper level of human intervention, enabling users to modify parameters or provide additional context before an action is executed.
Let’s consider our HR agent example. When requesting PTO, a common business requirement is for employees to review and potentially edit their requests before submission. This expands upon the simple confirmation use case by allowing users to alter their original input before sending a request to the backend. Amazon Bedrock Agents offers an out-of-the-box solution to effectively parse user input and send it back in a structured format using ROC.
To implement ROC, we need to modify our agent architecture slightly, as shown in the following diagram.

In this architecture, ROC is implemented at the action group level. When an employee interacts with the HR Assistant agent, the system requires explicit confirmation of all function parameters under the “Request PTO Action Group” before executing actions within the action group.
With ROC, the user experience becomes more interactive and flexible. The following screenshot shows an example with our HR agent application.

Instead of executing the action automatically or just having a confirm/deny option, users are presented with a form to edit their intentions directly before processing. In this case, our user can realize they accidentally started their time off request on a Sunday and can edit this information before submission.
After the user reviews and potentially modifies the request, they can approve the parameters.

When implementing ROC, it’s crucial to understand that parameter validation occurs at two distinct points. The agent performs initial validation before returning control to the user (for example, checking available PTO balance), and the final execution relies on the application’s API validation layer.
For instance, if a user initially requests 3 days of PTO, the agent validates against their 5-day balance and returns control. However, if the user modifies the request to 100 days during ROC, the final validation and enforcement happen at the API level, not through the agent. This differs from confirmation flows where the agent directly executes API calls. In ROC, the agent’s role is to facilitate the interaction and return API responses, and the application maintains ultimate control over parameter validation and execution.
The core difference in the ROC approach is that the responsibility of processing the time off request is now handled by the application itself instead of being automatically handled by the agent. This allows for more complex workflows and greater human oversight.
To better understand the flow of information in a ROC scenario, let’s examine the following sequence diagram.

In this workflow, the agent prepares the action but doesn’t execute it. Instead, it returns control to the application, which then presents the editable information to the user. After the user reviews and potentially modifies the request, the application is responsible for executing the action with the final, user-approved parameters.
This approach provides several benefits:

Enhanced accuracy – Users can correct misunderstandings or errors in the agent’s interpretation of their request
Flexibility – It allows for last-minute changes or additions to the request
User empowerment – It gives users more control over the final action, increasing trust in the system
Compliance – In regulated industries, this level of human oversight can be crucial for adhering to legal or policy requirements

Implementing ROC requires more development effort compared to user confirmation, because it involves creating UIs for editing and handling the execution of actions within the application. However, for scenarios where precision and user control are paramount, the additional complexity is often justified.
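In code, the ROC loop is similar to the confirmation flow, except the application itself executes the action with the user-approved parameters before handing the result back. The following is a hedged sketch with placeholder names; the event and sessionState shapes should be verified against the InvokeAgent documentation:

import boto3

client = boto3.client("bedrock-agent-runtime")

def create_pto_in_backend(parameters):
    """Application-owned call to the real PTO API; a placeholder here."""
    return "PTO request submitted"

# 1. Initial invocation: the agent returns control instead of executing.
completion = client.invoke_agent(
    agentId="AGENT_ID",
    agentAliasId="AGENT_ALIAS_ID",
    sessionId="session-1",
    inputText="Book PTO from 2025-06-02 to 2025-06-06",
)["completion"]

for event in completion:
    if "returnControl" in event:
        control = event["returnControl"]
        invocation_id = control["invocationId"]
        func_input = control["invocationInputs"][0]["functionInvocationInput"]
        parameters = func_input["parameters"]  # present these to the user to edit

# 2. After the user edits and approves, execute in the application and
# return the result to the agent.
result = create_pto_in_backend(parameters)
client.invoke_agent(
    agentId="AGENT_ID",
    agentAliasId="AGENT_ALIAS_ID",
    sessionId="session-1",
    sessionState={
        "invocationId": invocation_id,
        "returnControlInvocationResults": [{
            "functionResult": {
                "actionGroup": func_input["actionGroup"],
                "function": func_input["function"],
                "responseBody": {"TEXT": {"body": result}},
            }
        }],
    },
)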
Conclusion
In this post, we explored two primary frameworks for implementing human validation in Amazon Bedrock Agents: user confirmation and return of control. Although these mechanisms serve similar oversight purposes, they address different validation needs and operate at distinct levels of the agent’s workflow. User confirmation provides a straightforward Boolean validation, allowing users to approve or reject specific actions before execution. This method is ideal for scenarios where a simple yes/no decision is sufficient to promote safety and accuracy.
ROC offers a more nuanced approach, enabling users to modify parameters and provide additional context before action execution. This framework is particularly useful in complex scenarios, where changing of the agent’s decisions is necessary.
Both methods contribute to a robust HITL approach, providing an essential layer of human validation to the agentic application.
User confirmation and ROC are just two aspects of the broader HITL paradigm in AI agent deployments. In future posts, we will address other crucial use cases for HITL interactions with agents.
To get started creating your own agentic application with HITL validation, we encourage you to explore the HR example discussed in this post. You can find the complete code and implementation details in our GitHub repository.

About the Authors
Clement Perrot is a Senior Solutions Architect and AI/ML Specialist at AWS, where he helps early-stage startups build and implement AI solutions on the AWS platform. In his role, he architects large-scale GenAI solutions, guides startups in implementing LLM-based applications, and drives the technical adoption of AWS GenAI services globally. He collaborates with field teams on complex customer implementations and authors technical content to enable AWS GenAI adoption. Prior to AWS, Clement founded two successful startups that were acquired, and was recognized with an Inc 30 under 30 award.
Ryan Sachs is a Solutions Architect at AWS, specializing in GenAI application development. Ryan has a background in developing web/mobile applications at companies large and small through REST APIs. Ryan helps early-stage companies solve their business problems by integrating Generative AI technologies into their existing architectures.
Maira Ladeira Tanke is a Tech Lead for Agentic workloads in Amazon Bedrock at AWS, where she enables customers on their journey to develop autonomous AI systems. With over 10 years of experience in AI/ML, Maira partners with enterprise customers to accelerate the adoption of agentic applications using Amazon Bedrock, helping organizations harness the power of foundation models to drive innovation and business transformation. In her free time, Maira enjoys traveling, playing with her cat, and spending time with her family someplace warm.
Mark Roy is a Principal Machine Learning Architect for AWS, helping customers design and build AI/ML solutions. Mark’s work covers a wide range of ML use cases, with a primary interest in computer vision, deep learning, and scaling ML across the enterprise. He has helped companies in many industries, including insurance, financial services, media and entertainment, healthcare, utilities, and manufacturing. Mark holds six AWS Certifications, including the ML Specialty Certification. Prior to joining AWS, Mark was an architect, developer, and technology leader for over 25 years, including 19 years in financial services.

Boost team productivity with Amazon Q Business Insights

Employee productivity is a critical factor in maintaining a competitive advantage. Amazon Q Business offers a unique opportunity to enhance workforce efficiency by providing AI-powered assistance that can significantly reduce the time spent searching for information, generating content, and completing routine tasks. Amazon Q Business is a fully managed, generative AI-powered assistant that lets you build interactive chat applications using your enterprise data, generating answers based on your data or large language model (LLM) knowledge. At the core of this capability are native data source connectors that seamlessly integrate and index content from multiple data sources like Salesforce, Jira, and SharePoint into a unified index.
Key benefits for organizations include:

Simplified deployment and management – Provides a ready-to-use web experience with no machine learning (ML) infrastructure to maintain or manage
Access controls – Makes sure users only access content they have permission to view
Accurate query responses – Delivers precise answers with source citations, analyzing enterprise data
Privacy and control – Offers comprehensive guardrails and fine-grained access controls
Broad connectivity – Supports over 45 native data source connectors (at the time of writing), and provides the ability to create custom connectors

Data privacy and the protection of intellectual property are paramount concerns for most organizations. At Amazon, “Security is Job Zero,” which is why Amazon Q Business is designed with these critical considerations in mind. Your data is not used for training purposes, and the answers provided by Amazon Q Business are based solely on the data users have access to. This makes sure that enterprises can quickly find answers to questions, provide summaries, generate content, and complete tasks across various use cases with complete confidence in data security. Amazon Q Business supports encryption in transit and at rest, allowing end-users to use their own encryption keys for added security. This robust security framework enables end-users to receive immediate, permissions-aware responses from enterprise data sources with citations, helping streamline workplace tasks while maintaining the highest standards of data privacy and protection.
Amazon Q Business Insights provides administrators with details about the utilization and effectiveness of their AI-powered applications. By monitoring utilization metrics, organizations can quantify the actual productivity gains achieved with Amazon Q Business. Understanding how employees interact with and use Amazon Q Business becomes crucial for measuring its return on investment and identifying potential areas for further optimization. Tracking metrics such as time saved and number of queries resolved can provide tangible evidence of the service’s impact on overall workplace productivity. It’s essential for admins to periodically review these metrics to understand how users are engaging with Amazon Q Business and identify potential areas of improvement.
The dashboard enables administrators to track user interactions, including the helpfulness of generated answers through user ratings. By visualizing this feedback, admins can pinpoint instances where users aren’t receiving satisfactory responses. With Amazon Q Business Insights, administrators can diagnose potential issues such as unclear user prompts, misconfigured topics and guardrails, insufficient metadata boosters, or inadequate data source configurations. This comprehensive analytics approach empowers organizations to continuously refine their Amazon Q Business implementation, making sure users receive the most relevant and helpful AI-assisted support.
In this post, we explore Amazon Q Business Insights capabilities and its importance for organizations. We begin with an overview of the available metrics and how they can be used for measuring user engagement and system effectiveness. Then we provide instructions for accessing and navigating this dashboard. Finally, we demonstrate how to integrate Amazon Q Business logs with Amazon CloudWatch, enabling deeper insights into user interaction patterns and identifying areas for improvement. This integration can empower administrators to make data-driven decisions for optimizing their Amazon Q Business implementations and maximizing return on investment (ROI).
Amazon Q Business and Amazon Q Apps analytics dashboards
In this section, we discuss the Amazon Q Business and Amazon Q Apps analytics dashboards.
Overview of key metrics
Amazon Q Business Insights (see the following screenshot) offers a comprehensive set of metrics that provide valuable insights into user engagement and system performance. Key metrics include Total queries and Total conversations, which give an overall picture of system usage. More specific metrics such as Queries per conversation and Queries per user offer deeper insights into user interaction patterns and the complexity of inquiries. The Number of conversations and Number of queries metrics help administrators track adoption and usage trends over time.

The dashboard also provides critical information on system effectiveness through metrics like Unsuccessful query responses and Thumbs down reasons (see the following screenshot), which highlight areas where the AI assistant might be struggling to provide adequate answers. This is complemented by the end-user feedback metric, which includes user ratings and response effectiveness reasons. These metrics are particularly valuable for identifying specific issues users are encountering and areas where the system needs improvement.

Complementing the main dashboard, Amazon Q Business provides a dedicated analytics dashboard for Amazon Q Apps that offers detailed insights into application creation, usage, and adoption patterns. The dashboard tracks user engagement through metrics like:

Active users (average unique daily users interacting with Amazon Q Apps)
Active creators (average unique daily users creating or updating Amazon Q Apps)

Application metrics include:

Total Q Apps (average daily total)
Active Q Apps (average number of applications run or updated daily)

These metrics help provide a clear picture of application utilization.
The dashboard also features several trend analyses that help administrators understand usage patterns over time:

Q App participants trend shows the relationship between daily active users and creators
Q App trend displays the correlation between total applications created and active applications
Total Q App runs trend and Published Q App trend track daily execution rates and publication patterns, respectively

These metrics enable administrators to evaluate the performance and adoption of Amazon Q Apps within their organization, helping identify successful implementation patterns and areas needing attention.

These comprehensive metrics are crucial for organizations to optimize their Amazon Q Business implementation and maximize ROI. By analyzing trends in Total queries, Total conversations, and user-specific metrics, administrators can gauge adoption rates and identify potential areas for user training or system improvements. The Unsuccessful query responses and Customer feedback metrics help pinpoint gaps in the knowledge base or areas where the system struggles to provide satisfactory answers. By using these metrics, organizations can make data-driven decisions to enhance the effectiveness of their AI-powered assistant, ultimately leading to improved productivity and user experience across various use cases within the enterprise.
How to access Amazon Q Business Insights dashboards
As an Amazon Q admin, you can view the dashboards on the Amazon Q Business console. You can view the metrics in these dashboards over different pre-selected time intervals. They are available at no additional charge in AWS Regions where the Amazon Q Business service is offered.
To view these dashboards on the Amazon Q Business console, you choose your application environment and navigate to the Insights page. For more details, see Viewing the analytics dashboards.
The following screenshot illustrates how to access the dashboards for Amazon Q Business applications and Amazon Q Apps Insights.

Monitor Amazon Q Business user conversations
In addition to the Amazon Q Business and Amazon Q Apps dashboards, you can configure Amazon Q Business to deliver logs of user conversations and response feedback for you to analyze. These logs can be delivered to multiple destinations, such as Amazon CloudWatch Logs, Amazon Simple Storage Service (Amazon S3), or Amazon Data Firehose.
The following diagram depicts the flow of user conversation and feedback responses from Amazon Q Business to Amazon S3. These logs are then queryable using Amazon Athena.

Prerequisites
To set up CloudWatch Logs for Amazon Q Business, make sure you have the appropriate permissions for the intended destination. Refer to Monitoring Amazon Q Business and Q Apps for more details.
Set up log delivery with CloudWatch as a destination
Complete the following steps to set up log delivery with CloudWatch as the destination:

Open the Amazon Q Business console and sign in to your account.
In Applications, choose the name of your application environment.
In the navigation pane, choose Enhancements, then choose Admin Controls and Guardrails.
In Log delivery, choose Add and select the option To Amazon CloudWatch Logs.
For Destination log group, enter the log group where the logs will be stored.

Log groups prefixed with /aws/vendedlogs/ will be created automatically. Other log groups must be created prior to setting up log delivery.

To filter out sensitive or personally identifiable information (PII), choose Additional settings – optional and specify the fields to be logged, output format, and field delimiter.

If you want the users’ email recorded in your logs, it must be added explicitly as a field in Additional settings.

Choose Add.
Choose Enable logging to start streaming conversation and feedback data to your logging destination.

Set up log delivery with Amazon S3 as a destination
To use Amazon S3 as a log destination, you need an S3 bucket, and you must grant Amazon Q Business the appropriate permissions to write your logs to it.

Open the Amazon Q Business console and sign in to your account.
In Applications, choose the name of your application environment.
In the navigation pane, choose Enhancements, then choose Admin Controls and Guardrails.
In Log delivery, choose Add and select the option To Amazon S3.
For Destination S3 bucket, enter your bucket.
To filter out sensitive or PII data, choose Additional settings – optional and specify the fields to be logged, output format, and field delimiter.

If you want the users’ email recorded in your logs, it must be added explicitly as a field in Additional settings.

Choose Add.
Choose Enable logging to start streaming conversation and feedback data to your logging destination.

The logs are delivered to your S3 bucket with the following prefix:

AWSLogs/<your-aws-account-id>/AmazonQBusinessLogs/<your-aws-region>/<your-q-business-application-id>/year/month/day/hour/

The placeholders will be replaced with your AWS account ID, Region, and Amazon Q Business application ID, respectively.
Set up Data Firehose as a log destination
Amazon Q Business application event logs can also be streamed to Data Firehose as a destination. This can be used for real-time observability. We have excluded setup instructions for brevity.
To use Data Firehose as a log destination, you need to create a Firehose delivery stream (with Direct PUT enabled) and grant Amazon Q Business the appropriate permissions to write your logs to Data Firehose. For example AWS Identity and Access Management (IAM) policies with the required permissions for your specific logging destination, see Enable logging from AWS services.
Protecting sensitive data
You can prevent an AWS console user or group of users from viewing specific CloudWatch log groups, S3 buckets, or Firehose streams by applying specific deny statements in their IAM policies. AWS follows an explicit deny overrides allow model, meaning that if you explicitly deny an action, it will take precedence over allow statements. For more information, see Policy evaluation logic.
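As an illustration, the following sketch applies such a deny statement with boto3. This is not from the original post; the group name, policy name, and log group ARN are hypothetical placeholders to replace with your own resources.

import json
import boto3

iam = boto3.client("iam")

# Hypothetical inline policy that denies read access to the Q Business vended log groups.
deny_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyQBusinessConversationLogAccess",
            "Effect": "Deny",
            "Action": [
                "logs:GetLogEvents",
                "logs:FilterLogEvents",
                "logs:StartQuery",
            ],
            "Resource": "arn:aws:logs:us-east-1:123456789012:log-group:/aws/vendedlogs/qbusiness/*",
        }
    ],
}

# Attach the policy to a group of console users; the explicit deny
# overrides any allow statements those users may otherwise have.
iam.put_group_policy(
    GroupName="qbusiness-restricted-users",
    PolicyName="DenyQBusinessConversationLogs",
    PolicyDocument=json.dumps(deny_policy),
)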
Real-world use cases
This section outlines several key use cases for Amazon Q Business Insights, demonstrating how you can use Amazon Q Business operational data to improve your operational posture and make sure Amazon Q Business meets your needs.
Measure ROI using Amazon Q Business Insights
The dashboards offered by Amazon Q Business Insights provide powerful metrics that help organizations quantify their ROI. Consider this common scenario: traditionally, employees spend countless hours searching through siloed documents, knowledge bases, and various repositories to find answers to their questions. This time-consuming process not only impacts productivity but also leads to significant operational costs. With the dashboards provided by Amazon Q Business Insights, administrators can now measure the actual impact of their investment by tracking key metrics such as total questions answered, total conversations, active users, and positive feedback rates. For instance, if an organization knows that it previously took employees an average of 3 minutes to find an answer in their documentation, and with Amazon Q Business this time is reduced to 20 seconds, they can calculate the time savings per query (2 minutes and 40 seconds). When the dashboard shows 1,000 successful queries per week, this translates to approximately 44 hours of productivity gained—time that employees can now dedicate to higher-value tasks. Organizations can then translate these productivity gains into tangible cost savings based on their specific business metrics.
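To make the arithmetic concrete, here is a minimal sketch of that time-savings calculation; the numbers mirror the scenario above and should be replaced with your organization’s own measurements.

# Worked example of the time-savings estimate described above.
baseline_seconds = 3 * 60      # previous average time to find an answer
assisted_seconds = 20          # average time with Amazon Q Business
queries_per_week = 1000        # successful queries reported by the dashboard

saved_per_query = baseline_seconds - assisted_seconds            # 160 seconds
hours_saved_per_week = saved_per_query * queries_per_week / 3600

print(f"Hours saved per week: {hours_saved_per_week:.1f}")      # ~44.4 hours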
Furthermore, the dashboard’s positive feedback rate metric helps validate the quality and accuracy of responses, making sure employees aren’t just getting answers, but reliable ones that help them do their jobs effectively. By analyzing these metrics over time—whether it’s over 24 hours, 7 days, or 30 days—organizations can demonstrate how Amazon Q Business is transforming their knowledge management approach from a fragmented, time-intensive process to an efficient, centralized system. This data-driven approach to measuring ROI not only justifies the investment but also helps identify areas where the service can be optimized for even greater returns.
Organizations looking to quantify financial benefits can develop their own ROI calculators tailored to their specific needs. By combining Amazon Q Business Insights metrics with their internal business variables, teams can create customized ROI models that reflect their unique operational context. Several reference calculators are publicly available online, ranging from basic templates to more sophisticated models, which can serve as a starting point for organizations to build their own ROI analysis tools. This approach enables leadership teams to demonstrate the tangible financial benefits of their Amazon Q Business investment and make data-driven decisions about scaling their implementation, based on their organization’s specific metrics and success criteria.
Enforce financial services compliance with Amazon Q Business analytics
Maintaining regulatory compliance while enabling productivity is a delicate balance. As organizations adopt AI-powered tools like Amazon Q Business, it’s crucial to implement proper controls and monitoring. Let’s explore how a financial services organization can use Amazon Q Business Insights capabilities and logging features to maintain compliance and protect against policy violations.
Consider this scenario: A large investment firm has adopted Amazon Q Business to help their financial advisors quickly access client information, investment policies, and regulatory documentation. However, the compliance team needs to make sure the system isn’t being used to circumvent trading restrictions, particularly around day trading activities that could violate SEC regulations and company policies.
Identify policy violations through Amazon Q Business logs
When the compliance team enables log delivery to CloudWatch with the user_email field selected, Amazon Q Business begins sending detailed event logs to CloudWatch. These logs are separated into two CloudWatch log streams:

QBusiness/Chat/Message – Contains user interactions
QBusiness/Chat/Feedback – Contains user feedback on responses

For example, the compliance team monitoring the logs might spot this concerning chat from Amazon Q Business:

{
  "application_id": "881486e0-c027-40ae-96c2-8bfcf8b99c2a",
  "event_timestamp": "2025-01-30T19:19:23Z",
  "log_type": "Message",
  "conversation_id": "ffd1116d-5a6d-4db0-a00e-331e4eea172f",
  "user_message": "What are the best strategies for day trading client accounts?",
  "user_email": "janedoe@example.com"
}

The compliance team can automate this search by creating an alarm on CloudWatch Metrics Insights queries.
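One way to set up such automation, sketched below under the assumption that logs are delivered to a vended log group, is to create a CloudWatch metric filter that counts matching user messages and an alarm on the resulting metric. The log group name and filter pattern are illustrative placeholders.

import boto3

logs = boto3.client("logs")
cloudwatch = boto3.client("cloudwatch")

LOG_GROUP = "/aws/vendedlogs/qbusiness/application/EVENT_LOGS/<your-application-id>"

# Count every chat message that mentions day trading.
logs.put_metric_filter(
    logGroupName=LOG_GROUP,
    filterName="DayTradingMentions",
    filterPattern='{ $.user_message = "*day trading*" }',
    metricTransformations=[
        {
            "metricName": "DayTradingMentions",
            "metricNamespace": "QBusiness/Compliance",
            "metricValue": "1",
            "defaultValue": 0,
        }
    ],
)

# Alarm as soon as a single mention is logged in a 5-minute window.
cloudwatch.put_metric_alarm(
    AlarmName="QBusinessDayTradingMentions",
    Namespace="QBusiness/Compliance",
    MetricName="DayTradingMentions",
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=1,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    TreatMissingData="notBreaching",
)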
Implement preventative controls
Upon identifying these attempts, the Amazon Q Business admin can implement several immediate controls within Amazon Q Business:

Configure blocked phrases to make sure chat responses don’t include these words
Configure topic-level controls to define rules that customize how Amazon Q Business should respond when a chat message matches a special topic

The following screenshot depicts configuring topic-level controls for the phrase “day trading.”

With these topic-level controls in place, different variations of the phrase “day trading” will be blocked. The following screenshot shows a user entering variations of the phrase and Amazon Q Business blocking them due to the topic-level control.

By implementing monitoring and configuring guardrails, the investment firm can maintain its regulatory compliance while still allowing legitimate use of Amazon Q Business for approved activities. The combination of real-time monitoring through logs and preventive guardrails creates a robust defense against potential violations while maintaining detailed audit trails for regulatory requirements.
Analyze user feedback through the Amazon Q Business Insights dashboard
After log delivery has been set up, administrators can use the Amazon Q Business Insights dashboard to get a comprehensive view of user feedback. This dashboard provides valuable data about user experience and areas needing improvement through two key metric cards: Unsuccessful query responses and Thumbs down reasons. The Thumbs down reasons chart offers a detailed breakdown of user feedback, displaying the distribution and frequency of specific reasons why users found responses unhelpful. This granular feedback helps administrators identify patterns in user feedback, whether it’s due to incomplete information, inaccurate responses, or other factors.
Similarly, the Unsuccessful query responses chart distinguishes between queries that failed because answers weren’t found in the knowledge base vs. those blocked by guardrail settings. Both metrics allow administrators to drill down into specific queries through filtering options and detailed views, enabling them to investigate and address issues systematically. This feedback loop is crucial for continuous improvement, helping organizations refine their content, adjust guardrails, and enhance the overall effectiveness of their Amazon Q Business implementation.
To view a breakdown of unsuccessful query responses, follow these steps:

Select your application on the Amazon Q Business console.
Select Amazon Q Business insights under Insights.
Go to the Unsuccessful query responses metrics card and choose View details to resolve issues.

A new page will open with two tabs: No answers found and Blocked queries.

You can use these tabs to filter by response type. You can also filter by date using the date filter at the top.

Choose any of the queries to view the Query chain.

This will give you more details and context on the conversation the user had when providing their feedback.

Analyze user feedback through CloudWatch logs
This use case focuses on identifying and analyzing unsatisfactory feedback from specific users in Amazon Q Business. After log delivery is enabled with the user_email field selected, the Amazon Q Business application sends event logs to the previously created CloudWatch log group. User chat interactions and feedback submissions generate events in the QBusiness/Chat/Message and QBusiness/Chat/Feedback log streams, respectively.
For example, consider if a user asks about their vacation policy and no answer is returned. The user can then choose the thumbs down icon and send feedback to the administrator.

The Send your feedback form provides the user the option to categorize the feedback and provide additional details for the administrator to review.

This feedback will be sent to the QBusiness/Chat/Feedback log stream for the administrator to later analyze. See the following example log entry:

{
  "application_id": "881486e0-c027-40ae-96c2-8bfcf8b99c2a",
  "event_timestamp": "2025-02-25T18:50:41Z",
  "log_type": "Feedback",
  "account_id": "123456789012",
  "conversation_id": "da2d22bf-86a1-4cc4-a7e2-96663aa05cc2",
  "system_message_id": "3410aa16-5824-40cf-9d3d-1718cbe5b6bd",
  "user_message_id": "221f85aa-494b-41e5-940a-034a3d22fba8",
  "user_message": "Can you tell me about my vacation policy?",
  "system_message": "No answer is found.",
  "comment": "There is no response when asking about vacation policies.",
  "usefulness_reason": "NOT_HELPFUL",
  "usefulness": "NOT_USEFUL",
  "timestamp": "1740509448782",
  "user_email": "jane.doe@example.com"
}

By analyzing queries that result in unsatisfactory responses (thumbs down), administrators can take actions to improve answer quality, accuracy, and security. This feedback can help identify gaps in data sources. Patterns in feedback can indicate topics where users might benefit from extra training or guidance on effectively using Amazon Q Business.
To address issues identified through feedback analysis, administrators can take several actions:

Configure metadata boosting to prioritize more accurate content in responses for queries that consistently receive negative feedback
Refine guardrails and chat controls to better align with user expectations and organizational policies
Develop targeted training or documentation to help users formulate more effective prompts, including prompt engineering techniques
Analyze user prompts to identify potential risks and reinforce proper data handling practices

By monitoring the chat messages and which users are giving “thumbs up” or “thumbs down” responses for the associated prompts, administrators can gain insights into areas where the system might be underperforming, not meeting user expectations, or not complying with your organization’s security policies.
This use case is applicable to the other log delivery options, such as Amazon S3 and Data Firehose.
Group users getting the most unhelpful answers
For administrators seeking more granular insights beyond the standard dashboard, CloudWatch Logs Insights offers a powerful tool for deep-dive analysis of Amazon Q Business usage metrics. By using CloudWatch Logs Insights, administrators can create custom queries to extract and analyze detailed performance data. For instance, you can generate a sorted list of users experiencing the most unhelpful interactions, such as identifying which employees are consistently receiving unsatisfactory responses. A typical query might reveal patterns like “User A received 9 unhelpful answers in the last 4 weeks, User B received 5 unhelpful answers, and User C received 3 unhelpful answers.” This level of detailed analysis enables organizations to pinpoint specific user groups or departments that might require additional training, data source configuration, or targeted support to improve their Amazon Q Business experience.
To get these kinds of insights, complete the following steps:

To obtain the Amazon Q Business application ID, open the Amazon Q Business console, open the specific application, and note the application ID on the Application settings page.

This unique identifier will be used to filter log groups in CloudWatch Logs Insights.

On the CloudWatch console, choose Logs Insights under Logs in the navigation pane.

Under Selection criteria, enter the application ID you previously copied. Choose the log group that follows the pattern /aws/vendedlogs/qbusiness/application/EVENT_LOGS/<your application id>.

For the date/time range, select the range you want to use. In our case, we want the last 4 weeks, so we choose Custom and specify 4 Weeks.
Replace the default query in the editor with this one:

filter usefulness = "NOT_USEFUL" and ispresent(user_email)
| stats count(*) as total_unhelpful_answers by user_email
| sort total_unhelpful_answers desc

We use the condition NOT_USEFUL because we want to list users getting unhelpful answers. To get a list of users who received helpful answers, change the condition to USEFUL.

Choose Run query.

With this information, particularly user_email, you can write a new query to analyze the conversation logs where users got unhelpful answers. For example, to list messages where user john_doe gave a thumbs down, replace your query with the following:
filter usefulness = "NOT_USEFUL" and user_email = "john_doe@anycompany.com"
Alternatively, to filter unhelpful answers, you could use the following query:
filter usefulness = "NOT_USEFUL"
The results of these queries can help you better understand the context of the feedback users are providing. As mentioned earlier, it might be that your guardrails are too restrictive, your application is missing a data source, or your users’ prompts are not clear enough.
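The same analysis can be automated outside the console. The following sketch runs the grouping query with boto3; the log group name is a placeholder that follows the vended log group pattern mentioned earlier.

import time
import boto3

logs = boto3.client("logs")

LOG_GROUP = "/aws/vendedlogs/qbusiness/application/EVENT_LOGS/<your-application-id>"

QUERY = (
    'filter usefulness = "NOT_USEFUL" and ispresent(user_email)\n'
    "| stats count(*) as total_unhelpful_answers by user_email\n"
    "| sort total_unhelpful_answers desc"
)

now = int(time.time())
query_id = logs.start_query(
    logGroupName=LOG_GROUP,
    startTime=now - 28 * 24 * 3600,  # last 4 weeks
    endTime=now,
    queryString=QUERY,
)["queryId"]

# Poll until the query finishes, then print one row per user.
while True:
    response = logs.get_query_results(queryId=query_id)
    if response["status"] in ("Complete", "Failed", "Cancelled"):
        break
    time.sleep(1)

for row in response.get("results", []):
    print({field["field"]: field["value"] for field in row})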
Clean up
To avoid ongoing costs after you’re done experimenting with this functionality, clean up resources by removing log delivery configurations, deleting CloudWatch resources, removing the Amazon Q Business application, and deleting any additional AWS resources you created.
Conclusion
In this post, we explored several ways to improve your operational posture with Amazon Q Business Insights dashboards, the Amazon Q Apps analytics dashboard, and logging with CloudWatch Logs. By using these tools, organizations can gain valuable insights into user engagement patterns, identify areas for improvement, and make sure their Amazon Q Business implementation aligns with security and compliance requirements.
To learn more about Amazon Q Business key usage metrics, refer to Viewing Amazon Q Business and Q App metrics in analytics dashboards. For a comprehensive review of Amazon Q Business CloudWatch logs, including log query examples, refer to Monitoring Amazon Q Business user conversations with Amazon CloudWatch Logs.

About the Authors
Guillermo Mansilla is a Senior Solutions Architect based in Orlando, Florida. Guillermo has developed a keen interest in serverless architectures and generative AI applications. Prior to his current role, he gained over a decade of experience working as a software developer. Away from work, Guillermo enjoys participating in chess tournaments at his local chess club, a pursuit that allows him to exercise his analytical skills in a different context.
Amit Gupta is a Senior Amazon Q Business Solutions Architect at AWS. He is passionate about enabling customers with well-architected generative AI solutions at scale.
Jed Lechner is a Specialist Solutions Architect at Amazon Web Services specializing in generative AI solutions with Amazon Q Business and Amazon Q Apps. Prior to his current role, he worked as a Software Engineer at AWS and other companies, focusing on sustainability technology, big data analytics, and cloud computing. Outside of work, he enjoys hiking and photography, and capturing nature’s moments through his lens.
Leo Mentis Raj Selvaraj is a Sr. Specialist Solutions Architect – GenAI at AWS with 4.5 years of experience, currently guiding customers through their GenAI implementation journeys. Previously, he architected data platform and analytics solutions for strategic customers using a comprehensive range of AWS services including storage, compute, databases, serverless, analytics, and ML technologies. Leo also collaborates with internal AWS teams to drive product feature development based on customer feedback, contributing to the evolution of AWS offerings.

Multi-LLM routing strategies for generative AI applications on AWS

Organizations are increasingly using multiple large language models (LLMs) when building generative AI applications. Although an individual LLM can be highly capable, it might not optimally address a wide range of use cases or meet diverse performance requirements. The multi-LLM approach enables organizations to effectively choose the right model for each task, adapt to different domains, and optimize for specific cost, latency, or quality needs. This strategy results in more robust, versatile, and efficient applications that better serve diverse user needs and business objectives.
Deploying a multi-LLM application comes with the challenge of routing each user prompt to an appropriate LLM for the intended task. The routing logic must accurately interpret and map the prompt into one of the pre-defined tasks, and then direct it to the assigned LLM for that task. In this post, we provide an overview of common multi-LLM applications. We then explore strategies for implementing effective multi-LLM routing in these applications, discussing the key factors that influence the selection and implementation of such strategies. Finally, we provide sample implementations that you can use as a starting point for your own multi-LLM routing deployments.
Overview of common multi-LLM applications
The following are some of the common scenarios where you might choose to use a multi-LLM approach in your applications:

Multiple task types – Many use cases need to handle different task types within the same application. For example, a marketing content creation application might need to perform task types such as text generation, text summarization, sentiment analysis, and information extraction as part of producing high-quality, personalized content. Each distinct task type will likely require a separate LLM, which might also be fine-tuned with custom data.
Multiple task complexity levels – Some applications are designed to handle a single task type, such as text summarization or question answering. However, they must be able to respond to user queries with varying levels of complexity within the same task type. For example, consider a text summarization AI assistant intended for academic research and literature review. Some user queries might be relatively straightforward, simply asking the application to summarize the core ideas and conclusions from a short article. Such queries could be effectively handled by a simple, lower-cost model. In contrast, more complex questions might require the application to summarize a lengthy dissertation by performing deeper analysis, comparison, and evaluation of the research results. These types of queries would be better addressed by more advanced models with greater reasoning capabilities.
Multiple task domains – Certain applications need to serve users across multiple domains of expertise. An example is a virtual assistant for enterprise business operations. Such a virtual assistant should support users across various business functions, such as finance, legal, human resources, and operations. To handle this breadth of expertise, the virtual assistant needs to use different LLMs that have been fine-tuned on datasets specific to each respective domain.
Software-as-a-service (SaaS) applications with tenant tiering – SaaS applications are often architected to provide different pricing and experiences to a spectrum of customer profiles, referred to as tiers. Through the use of different LLMs tailored to each tier, SaaS applications can offer capabilities that align with the varying needs and budgets of their diverse customer base. For instance, consider an AI-driven legal document analysis system designed for businesses of varying sizes, offering two primary subscription tiers: Basic and Pro. The Basic tier would use a smaller, more lightweight LLM well-suited for straightforward tasks, such as performing simple document searches or generating summaries of uncomplicated legal documents. The Pro tier, however, would require a highly customized LLM that has been trained on specific data and terminology, enabling it to assist with intricate tasks like drafting complex legal documents.

Multi-LLM routing strategies
In this section, we explore two main approaches to routing requests to different LLMs: static routing and dynamic routing.
Static routing
One effective strategy for directing user prompts to appropriate LLMs is to implement distinct UI components within the same interface or separate interfaces tailored to specific tasks. For example, an AI-powered productivity tool for an ecommerce company might feature dedicated interfaces for different roles, such as content marketers and business analysts. The content marketing interface incorporates two main UI components: a text generation module for creating social media posts, emails, and blogs, and an insight extraction module that identifies the most relevant keywords and phrases from customer reviews to improve content strategy. Meanwhile, the business analysis interface would focus on text summarization for analyzing various business documents. This is illustrated in the following figure.

This approach works well for applications where the user experience supports having a distinct UI component for each task. It also allows for a flexible and modular design, where new LLMs can be quickly plugged into or swapped out from a UI component without disrupting the overall system. However, the static nature of this approach implies that the application might not be easily adaptable to evolving user requirements. Adding a new task would necessitate the development of a new UI component in addition to the selection and integration of a new model.
Dynamic routing
In some use cases, such as virtual assistants and multi-purpose chatbots, user prompts usually enter the application through a single UI component. For instance, consider a customer service AI assistant that handles three types of tasks: technical support, billing support, and pre-sale support. Each of these tasks requires its own custom LLM to provide appropriate responses. In this scenario, you need to implement a dynamic routing layer to intercept each incoming request and direct it to the downstream LLM, which is best suited to handle the intended task within that prompt. This is illustrated in the following figure.

In this section, we discuss common approaches for implementing this dynamic routing layer: LLM-assisted routing, semantic routing, and a hybrid approach.
LLM-assisted routing
This approach employs a classifier LLM at the application’s entry point to make routing decisions. The LLM’s ability to comprehend complex patterns and contextual subtleties makes this approach well-suited for applications requiring fine-grained classifications across task types, complexity levels, or domains. However, this method presents trade-offs. Although it offers sophisticated routing capabilities, it introduces additional costs and latency. Furthermore, maintaining the classifier LLM’s relevance as the application evolves can be demanding. Careful model selection, fine-tuning, configuration, and testing might be necessary to balance the impact of latency and cost with the desired classification accuracy.
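The following is a minimal sketch of LLM-assisted routing using the Amazon Bedrock Converse API. The categories, classifier prompt, and model choices are illustrative assumptions, not a prescription from this post.

import boto3

bedrock = boto3.client("bedrock-runtime")

CLASSIFIER_PROMPT = (
    "Classify the user request into exactly one category: "
    "technical_support, billing_support, or presale_support. "
    "Respond with the category name only.\n\nRequest: {prompt}"
)

# Map each task category to the downstream model assigned to it.
ROUTES = {
    "technical_support": "anthropic.claude-3-5-sonnet-20240620-v1:0",
    "billing_support": "anthropic.claude-3-haiku-20240307-v1:0",
    "presale_support": "anthropic.claude-3-haiku-20240307-v1:0",
}

def route(prompt: str) -> str:
    # Ask a small, fast classifier model for the task category.
    classification = bedrock.converse(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",
        messages=[{"role": "user",
                   "content": [{"text": CLASSIFIER_PROMPT.format(prompt=prompt)}]}],
        inferenceConfig={"maxTokens": 10, "temperature": 0},
    )
    category = classification["output"]["message"]["content"][0]["text"].strip()
    # Fall back to a default route if the classifier output is unexpected.
    return ROUTES.get(category, ROUTES["technical_support"])

The returned model ID can then be used for the downstream call with the same Converse API.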
Semantic routing
This approach uses semantic search as an alternative to using a classifier LLM for prompt classification and routing in multi-LLM systems. Semantic search uses embeddings to represent prompts as numerical vectors. The system then makes routing decisions by measuring the similarity between the user’s prompt embedding and the embeddings for a set of reference prompts, each representing a different task category. The user prompt is then routed to the LLM associated with the task category of the reference prompt that has the closest match.
Although semantic search doesn’t provide explicit classifications like a classifier LLM, it succeeds at identifying broad similarities and can effectively handle variations in a prompt’s wording. This makes it particularly well-suited for applications where routing can be based on coarse-grained classification of prompts, such as task domain classification. It also excels in scenarios with a large number of task categories or when new domains are frequently introduced, because it can quickly accommodate updates by simply adding new prompts to the reference prompt set.
Semantic routing offers several advantages, such as efficiency gained through fast similarity search in vector databases, and scalability to accommodate a large number of task categories and downstream LLMs. However, it also presents some trade-offs. Having adequate coverage for all possible task categories in your reference prompt set is crucial for accurate routing. Additionally, the increased system complexity due to the additional components, such as the vector database and embedding LLM, might impact overall performance and maintainability. Careful design and ongoing maintenance are necessary to address these challenges and fully realize the benefits of the semantic routing approach.
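To make the mechanism concrete, here is a minimal sketch of semantic routing with Amazon Titan Text Embeddings V2 and cosine similarity; the reference prompts and category names are illustrative assumptions.

import json
import boto3
import numpy as np

bedrock = boto3.client("bedrock-runtime")

# Hypothetical reference prompts, a small set per task category.
REFERENCE_PROMPTS = {
    "billing": ["Why was I charged twice this month?",
                "How do I update my payment method?"],
    "technical": ["The app crashes when I upload a file.",
                  "I cannot log in to my account."],
}

def embed(text: str) -> np.ndarray:
    # Titan Text Embeddings V2 returns the vector under the "embedding" key.
    response = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": text}),
    )
    return np.array(json.loads(response["body"].read())["embedding"])

def route(prompt: str) -> str:
    # Route to the category whose reference prompt is most similar to the user prompt.
    query = embed(prompt)
    best_category, best_score = None, -1.0
    for category, prompts in REFERENCE_PROMPTS.items():
        for reference in prompts:
            vector = embed(reference)
            score = float(np.dot(query, vector) /
                          (np.linalg.norm(query) * np.linalg.norm(vector)))
            if score > best_score:
                best_category, best_score = category, score
    return best_category

In practice, you would embed the reference prompts once and store them in a vector database, as discussed above, rather than re-embedding them on every request.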
Hybrid approach
In certain scenarios, a hybrid approach combining both techniques might also prove highly effective. For instance, in applications with a large number of task categories or domains, you can use semantic search for initial broad categorization or domain matching, followed by classifier LLMs for more fine-grained classification within those broad categories. This initial filtering allows you to use a simpler, more focused classifier LLM for the final routing decision.
Consider, for instance, a customer service AI assistant handling a diverse range of inquiries. In this context, semantic routing could initially route the user’s prompt to the appropriate department—be it billing, technical support, or sales. After the broad category is established, a dedicated classifier LLM for that specific department takes over. This specialized LLM, which can be trained on nuanced distinctions within its domain, can then determine crucial factors such as task complexity or urgency. Based on this fine-grained analysis, the prompt is then routed to the most appropriate LLM or, when necessary, escalated to a human agent.
This hybrid approach combines the scalability and flexibility of semantic search with the precision and context-awareness of classifier LLMs. The result is a robust, efficient, and highly accurate routing mechanism capable of adapting to the complexities and diverse needs of modern multi-LLM applications.
Implementation of dynamic routing
In this section, we explore different approaches to implementing dynamic routing on AWS, covering both built-in routing features and custom solutions that you can use as a starting point to build your own.
Intelligent prompt routing with Amazon Bedrock
Amazon Bedrock is a fully managed service that makes high-performing LLMs and other foundation models (FMs) from leading AI startups and Amazon available through an API, so you can choose from a wide range of FMs to find the model that is best suited for your use case. With the Amazon Bedrock serverless experience, you can get started quickly, privately customize FMs with your own data, and integrate and deploy them into your applications using AWS tools without having to manage infrastructure.
If you’re building applications with Amazon Bedrock LLMs and need a fully managed solution with straightforward routing capabilities, Amazon Bedrock Intelligent Prompt Routing offers an efficient way to implement dynamic routing. This feature of Amazon Bedrock provides a single serverless endpoint for efficiently routing requests between different LLMs within the same model family. It uses advanced prompt matching and model understanding techniques to predict the performance of each model for every request. Amazon Bedrock then dynamically routes each request to the model that it predicts is most likely to give the desired response at the lowest cost. Intelligent Prompt Routing can reduce costs by up to 30% without compromising on accuracy. As of this writing, Amazon Bedrock supports routing within Anthropic’s Claude and Meta’s Llama model families. For example, Amazon Bedrock can intelligently route requests between Anthropic’s Claude 3.5 Sonnet and Claude 3 Haiku depending on the complexity of the prompt, as illustrated in the following figure. Similarly, Amazon Bedrock can route requests between Meta’s Llama 3.1 70B and 8B.

This architecture workflow includes the following steps:

A user submits a question through a web or mobile application.
The Amazon Bedrock prompt router predicts the performance of each downstream LLM, selecting the model that it predicts will offer the best combination of response quality and cost.
Amazon Bedrock routes the request to the selected LLM, and returns the response along with information about the model.

For detailed implementation guidelines and examples of Intelligent Prompt Routing on Amazon Bedrock, see Reduce costs and latency with Amazon Bedrock Intelligent Prompt Routing and prompt caching.
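At the API level, invoking a prompt router looks like a regular Converse call in which the router’s ARN takes the place of a model ID. The following is a minimal sketch; the ARN is a hypothetical placeholder, and you should retrieve the actual router ARN for your account and Region from the Amazon Bedrock console or API.

import boto3

bedrock = boto3.client("bedrock-runtime")

# Hypothetical prompt router ARN; substitute the one from your account.
PROMPT_ROUTER_ARN = (
    "arn:aws:bedrock:us-east-1:123456789012:"
    "default-prompt-router/anthropic.claude:1"
)

response = bedrock.converse(
    modelId=PROMPT_ROUTER_ARN,  # the router ARN is passed where a model ID would go
    messages=[{"role": "user",
               "content": [{"text": "Summarize these meeting notes in two sentences."}]}],
)

print(response["output"]["message"]["content"][0]["text"])
# For prompt routers, the response also carries trace metadata indicating
# which underlying model the router selected for this request.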
Custom prompt routing
If your LLMs are hosted outside Amazon Bedrock, such as on Amazon SageMaker or Amazon Elastic Kubernetes Service (Amazon EKS), or you require routing customization, you will need to develop a custom routing solution.
This section provides sample implementations for both LLM-assisted and semantic routing. We discuss the solution’s mechanics, key design decisions, and how to use it as a foundation for developing your own custom routing solutions. For detailed deployment instructions for each routing solution, refer to the GitHub repo. The provided code in this repo is meant to be used in a development environment. Before migrating any of the provided solutions to production, we recommend following the AWS Well-Architected Framework.
LLM-assisted routing
In this solution, we demonstrate an educational tutor assistant that helps students in two domains: history and math. To implement the routing layer, the application uses the Amazon Titan Text G1 – Express model on Amazon Bedrock to classify each question by topic as either history or math. History questions are routed to a more cost-effective and faster LLM such as Anthropic’s Claude 3 Haiku on Amazon Bedrock. Math questions are handled by a more powerful LLM, such as Anthropic’s Claude 3.5 Sonnet on Amazon Bedrock, which is better suited for complex problem-solving, in-depth explanations, and multi-step reasoning. If the classifier LLM is unsure whether a question belongs to the history or math category, it defaults to classifying it as math.
The architecture of this system is illustrated in the following figure. The use of Amazon Titan and Anthropic models on Amazon Bedrock in this demonstration is optional. You can substitute them with other models deployed on or outside of Amazon Bedrock.

This architecture workflow includes the following steps:

A user submits a question through a web or mobile application, which forwards the query to Amazon API Gateway.
When API Gateway receives the request, it triggers an AWS Lambda function.
The Lambda function sends the question to the classifier LLM to determine whether it is a history or math question.
Based on the classifier LLM’s decision, the Lambda function routes the question to the appropriate downstream LLM, which will generate an answer and return it to the user.

Follow the deployment steps in the GitHub repo to create the necessary infrastructure for LLM-assisted routing and run tests to generate responses. The following output shows the response to the question “What year did World War II end?”

{
  "answer": "World War II ended in 1945.",
  "question_classification": "history",
  "classifier_LLM": "amazon.titan-text-express-v1",
  "classification_time": 0.5374360084533691,
  "answerer_LLM": "anthropic.claude-3-haiku-20240307-v1:0",
  "answer_generation_time": 0.2473313808441162,
  "total_response_time": 0.7847845554351807
}

The question was correctly classified as a history question, with the classification process taking approximately 0.53 seconds. The question was then routed to and answered by Anthropic’s Claude 3 Haiku, which took around 0.25 seconds. In total, it took about 0.78 seconds to receive the response.
Next, we will ask a math question. The following output shows the response to the question “Solve the quadratic equation: 2x^2 – 5x + 3 = 0.”

{
  "answer": "To solve this quadratic equation, we'll use the quadratic formula: x = [-b ± √(b² - 4ac)] / 2a\n\nWhere a = 2, b = -5, and c = 3\n\nSteps:\n1. Substitute values into the formula\n2. Simplify under the square root\n3. Calculate the two solutions\n\nx = [5 ± √(25 - 24)] / 4\nx = (5 ± √1) / 4\nx = (5 ± 1) / 4",
  "question_classification": "math",
  "classifier_LLM": "amazon.titan-text-express-v1",
  "classification_time": 0.5975513458251953,
  "answerer_LLM": "anthropic.claude-3-5-sonnet-20240620-v1:0",
  "answer_generation_time": 2.3191726207733154,
  "total_response_time": 2.9167449474334717
}

The question was correctly classified as a math question, with the classification process taking approximately 0.59 seconds. The question was then correctly routed to and answered by Anthropic’s Claude 3.5 Sonnet, which took around 2.3 seconds. In total, it took about 2.9 seconds to receive the response.
Semantic routing
In this solution, we focus on the same educational tutor assistant use case as in LLM-assisted routing. To implement the routing layer, you first need to create a set of reference prompts that represents the full spectrum of history and math topics you intend to cover. This reference set serves as the foundation for the semantic matching process, enabling the application to correctly categorize incoming queries. As an illustrative example, we’ve provided a sample reference set with five questions for each of the history and math topics. In a real-world implementation, you would likely need a much larger and more diverse set of reference questions to achieve robust routing performance.

History:
    – What were the main causes of World War I?
    – What region of the United States saw the largest economic growth as a result of the Industrial Revolution?
    – Who was the first man on the moon?
    – What country gifted the United States with the Statue of Liberty?
    – What major event sparked the beginning of the Great Depression in 1929?
Math:
    – Solve the quadratic equation: 2x^2 + 5x – 12 = 0.
    – Find the derivative of f(x) = 3x^4 – 2x^3 + 5x – 7.
    – In a right triangle, if one angle is 30° and the hypotenuse is 10 cm, find the lengths of the other two sides.
    – Determine the area of the region bounded by y = x^2, y = 4, and the y-axis.
    – If log_2(x) + log_2(y) = 5 and xy = 64, find the values of x and y.

You can use the Amazon Titan Text Embeddings V2 model on Amazon Bedrock to convert the questions in the reference set into embeddings. You can find the code for this conversion in the GitHub repo. These embeddings are then saved as a reference index inside an in-memory FAISS vector store, which is deployed as a Lambda layer.
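The following sketch mirrors, in simplified form, how such an index can be built and queried; the truncated question list and helper names are illustrative, and the full implementation lives in the GitHub repo.

import json
import boto3
import faiss
import numpy as np

bedrock = boto3.client("bedrock-runtime")

reference_questions = [
    ("What were the main causes of World War I?", "history"),
    ("Solve the quadratic equation: 2x^2 + 5x - 12 = 0.", "math"),
    # ... add the remaining reference questions from the set above
]

def embed(text: str) -> np.ndarray:
    response = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": text}),
    )
    return np.array(json.loads(response["body"].read())["embedding"], dtype="float32")

# Build the reference index; the labels stay aligned with the row order.
vectors = np.stack([embed(question) for question, _ in reference_questions])
labels = [label for _, label in reference_questions]
index = faiss.IndexFlatL2(vectors.shape[1])
index.add(vectors)

def classify(question: str) -> str:
    # Return the category of the closest reference question.
    _, nearest = index.search(embed(question).reshape(1, -1), 1)
    return labels[int(nearest[0][0])]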
The architecture of this system is illustrated in the following figure. The use of Amazon Titan and Anthropic models on Amazon Bedrock in this demonstration is optional. You can substitute them with other models deployed on or outside of Amazon Bedrock.

This architecture workflow includes the following steps:

A user submits a question through a web or mobile application, which forwards the query to API Gateway.
When API Gateway receives the request, it triggers a Lambda function.
The Lambda function sends the question to the Amazon Titan Text Embeddings V2 model to convert it to an embedding. It then performs a similarity search on the FAISS index to find the closest matching question in the reference index, and returns the corresponding category label.
Based on the retrieved category, the Lambda function routes the question to the appropriate downstream LLM, which will generate an answer and return it to the user.

Follow the deployment steps in the GitHub repo to create the necessary infrastructure for semantic routing and run tests to generate responses. The following output shows the response to the question “What year did World War II end?”

{
  "answer": "World War II ended in 1945.",
  "question_classification": "history",
  "embedding_LLM": "amazon.titan-embed-text-v2:0",
  "classification_time": 0.1058051586151123,
  "answerer_LLM": "anthropic.claude-3-haiku-20240307-v1:0",
  "answer_generation_time": 0.25673604011535645,
  "total_response_time": 0.36255788803100586
}

The question was correctly classified as a history question and the classification took about 0.1 seconds. The question was then routed and answered by Anthropic’s Claude 3 Haiku, which took about 0.25 seconds, resulting in a total of about 0.36 seconds to get the response back.
Next, we ask a math question. The following output shows the response to the question “Solve the quadratic equation: 2x^2 – 5x + 3 = 0.”

{
  "answer": "To solve this quadratic equation, we'll use the quadratic formula: x = [-b ± √(b² - 4ac)] / 2a\n\nWhere a = 2, b = -5, and c = 3\n\nSteps:\n1. Substitute the values into the formula\n2. Simplify inside the square root\n3. Calculate the two solutions\n\nx = [5 ± √(25 - 24)] / 4\nx = (5 ± √1) / 4\nx = (5 ± 1) / 4",
  "question_classification": "math",
  "embedding_LLM": "amazon.titan-embed-text-v2:0",
  "classification_time": 0.09248232841491699,
  "answerer_LLM": "anthropic.claude-3-5-sonnet-20240620-v1:0",
  "answer_generation_time": 2.6957757472991943,
  "total_response_time": 2.7882847785949707
}

The question was correctly classified as a math question and the classification took about 0.1 seconds. Moreover, the question was correctly routed to and answered by Anthropic’s Claude 3.5 Sonnet, which took about 2.7 seconds, resulting in a total of about 2.8 seconds to get the response back.
Additional considerations for custom prompt routing
The provided solutions use exemplary LLMs for classification in LLM-assisted routing and for text embedding in semantic routing. However, you will likely need to evaluate multiple LLMs to select the one best suited for your specific use case. Using these LLMs does incur additional cost and latency, so it’s critical that the benefits of dynamically routing queries to the appropriate LLM justify the overhead introduced by the custom prompt routing system.
For some use cases, especially those that require specialized domain knowledge, consider fine-tuning the classifier LLM in LLM-assisted routing and the embedding LLM in semantic routing with your own proprietary data. This can increase the quality and accuracy of the classification, leading to better routing decisions.
Additionally, the semantic routing solution used FAISS as an in-memory vector database for similarity search. However, you might need to evaluate alternative vector databases on AWS that better fit your use case in terms of scale, latency, and cost requirements. It will also be important to continuously gather prompts from your users and iterate on the reference prompt set. This will help make sure that it reflects the actual types of questions your users are asking, thereby increasing the accuracy of your similarity search classification over time.
Clean up
To avoid incurring additional costs, clean up the resources you created for LLM-assisted routing and semantic routing by running the following command for each of the respective created stacks:

cdk destroy

Cost analysis for custom prompt routing
This section analyzes the implementation cost and potential savings for the two custom prompt routing solutions, using an exemplary traffic scenario for our educational tutor assistant application.
Our calculations are based on the following assumptions:

The application is deployed in the US East (N. Virginia) AWS Region and receives 50,000 history questions and 50,000 math questions per day.
For LLM-assisted routing, the classifier LLM processes 150 input tokens and generates 1 output token per question.
For semantic routing, the embedding LLM processes 150 input tokens per question.
The answerer LLM processes 150 input tokens and generates 300 output tokens per question.
Amazon Titan Text G1 – Express model performs question classification in LLM-assisted routing at $0.0002 per 1,000 input tokens, with negligible output costs (1 token per question).
Amazon Titan Text Embeddings V2 model generates question embedding in semantic routing at $0.00002 per 1,000 input tokens.
Anthropic’s Claude 3 Haiku handles history questions at $0.00025 per 1,000 input tokens and $0.00125 per 1,000 output tokens.
Anthropic’s Claude 3.5 Sonnet answers math questions at $0.003 per 1,000 input tokens and $0.015 per 1,000 output tokens.
The Lambda runtime is 3 seconds per math question and 1 second per history question.
Lambda uses 1024 MB of memory and 512 MB of ephemeral storage, with API Gateway configured as a REST API.
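Under these assumptions, the answer generation costs shown in the following table can be reproduced with a few lines of arithmetic; this sketch is only a sanity check of the published numbers.

# Monthly answer generation cost per question type under the stated assumptions.
QUESTIONS_PER_DAY = 50000
DAYS_PER_MONTH = 30
INPUT_TOKENS, OUTPUT_TOKENS = 150, 300

def monthly_cost(input_price_per_1k, output_price_per_1k):
    questions = QUESTIONS_PER_DAY * DAYS_PER_MONTH
    input_cost = questions * INPUT_TOKENS / 1000 * input_price_per_1k
    output_cost = questions * OUTPUT_TOKENS / 1000 * output_price_per_1k
    return input_cost + output_cost

print(f"History (Claude 3 Haiku): ${monthly_cost(0.00025, 0.00125):,.2f}")  # $618.75
print(f"Math (Claude 3.5 Sonnet): ${monthly_cost(0.003, 0.015):,.2f}")      # $7,425.00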

The following table summarizes the cost of answer generation by LLM for both routing strategies.

Question Type | Total Input Tokens/Month | Total Output Tokens/Month | Answer Generation Cost/Month
History | 225,000,000 | 450,000,000 | $618.75
Math | 225,000,000 | 450,000,000 | $7,425

The following table summarizes the cost of dynamic routing implementation for both routing strategies.

 
Question Type | Total Input Tokens/Month | Classifier LLM Cost/Month (LLM-Assisted Routing) | Lambda + API Gateway Cost/Month (LLM-Assisted Routing) | Embedding LLM Cost/Month (Semantic Routing) | Lambda + API Gateway Cost/Month (Semantic Routing)
History + Math | 450,000,000 | $90 | $98.90 | $9 | $98.90

The first table shows that using Anthropic’s Claude 3 Haiku for history questions costs $618.75 per month, whereas using Anthropic’s Claude 3.5 Sonnet for math questions costs $7,425 per month. This demonstrates that routing questions to the appropriate LLM can achieve significant cost savings compared to using the more expensive model for all of the questions. The second table shows that these savings come with an implementation cost of $188.90 per month for LLM-assisted routing and $107.90 per month for semantic routing, which is relatively small compared to the potential savings in answer generation costs.
Selecting the right dynamic routing implementation
The decision on which dynamic routing implementation is best suited for your use case largely depends on three key factors: model hosting requirements, cost and operational overhead, and desired level of control over routing logic. The following table outlines these dimensions for Amazon Bedrock Intelligent Prompt Routing and custom prompt routing.

Design Criteria | Amazon Bedrock Intelligent Prompt Routing | Custom Prompt Routing
Model Hosting | Limited to Amazon Bedrock hosted models within the same model family | Flexible: can work with models hosted outside of Amazon Bedrock
Operational Management | Fully managed service with built-in optimization | Requires custom implementation and optimization
Routing Logic Control | Limited customization, predefined optimization for cost and performance | Full control over routing logic and optimization criteria

These approaches aren’t mutually exclusive. You can implement hybrid solutions, using Amazon Bedrock Intelligent Prompt Routing for certain workloads while maintaining custom prompt routing for others with LLMs hosted outside Amazon Bedrock or where more control on routing logic is needed.
Conclusion
This post explored multi-LLM strategies in modern AI applications, demonstrating how using multiple LLMs can enhance organizational capabilities across diverse tasks and domains. We examined two primary routing strategies: static routing using dedicated interfaces and dynamic routing using prompt classification at the application’s point of entry.
For dynamic routing, we covered two custom prompt routing strategies, LLM-assisted and semantic routing, and discussed exemplary implementations for each. These techniques enable customized routing logic for LLMs, regardless of their hosting platform. We also discussed Amazon Bedrock Intelligent Prompt Routing as an alternative implementation for dynamic routing, which optimizes response quality and cost by routing prompts across different LLMs within Amazon Bedrock.
Although these dynamic routing approaches offer powerful capabilities, they require careful consideration of engineering trade-offs, including latency, cost optimization, and system maintenance complexity. By understanding these trade-offs, along with implementation best practices like model evaluation, cost analysis, and domain fine-tuning, you can architect a multi-LLM routing solution optimized for your application’s needs.

About the Authors
Nima Seifi is a Senior Solutions Architect at AWS, based in Southern California, where he specializes in SaaS and GenAIOps. He serves as a technical advisor to startups building on AWS. Prior to AWS, he worked as a DevOps architect in the ecommerce industry for over 5 years, following a decade of R&D work in mobile internet technologies. Nima has authored over 20 technical publications and holds 7 US patents. Outside of work, he enjoys reading, watching documentaries, and taking beach walks.
Manish Chugh is a Principal Solutions Architect at AWS based in San Francisco, CA. He specializes in machine learning and is a generative AI lead for NAMER startups team. His role involves helping AWS customers build scalable, secure, and cost-effective machine learning and generative AI workloads on AWS. He regularly presents at AWS conferences and partner events. Outside of work, he enjoys hiking on East SF Bay trails, road biking, and watching (and playing) cricket.

Sensor-Invariant Tactile Representation for Zero-Shot Transfer Across Vision-Based Tactile Sensors

Tactile sensing is a crucial modality for intelligent systems to perceive and interact with the physical world. The GelSight sensor and its variants have emerged as influential tactile technologies, providing detailed information about contact surfaces by transforming tactile data into visual images. However, vision-based tactile sensing lacks transferability between sensors due to design and manufacturing variations, which result in significant differences in tactile signals. Minor differences in optical design or manufacturing processes can create substantial discrepancies in sensor output, causing machine learning models trained on one sensor to perform poorly when applied to others.

Computer vision models have been widely applied to vision-based tactile images due to their inherently visual nature. Researchers have adapted representation learning methods from the vision community, with contrastive learning being popular for developing tactile and visual-tactile representations for specific tasks. Auto-encoding representation approaches are also explored, with some researchers utilizing Masked Auto-Encoder (MAE) to learn tactile representations. Methods like general-purpose multimodal representations utilize multiple tactile datasets in LLM frameworks, encoding sensor types as tokens. Despite these efforts, current methods often require large datasets, treat sensor types as fixed categories, and lack the flexibility to generalize to unseen sensors.

Researchers from the University of Illinois Urbana-Champaign proposed Sensor-Invariant Tactile Representations (SITR), a tactile representation that transfers across various vision-based tactile sensors in a zero-shot manner. It is based on the premise that achieving sensor transferability requires learning effective sensor-invariant representations through exposure to diverse sensor variations. It relies on three core innovations: easy-to-acquire calibration images that characterize individual sensors with a transformer encoder, supervised contrastive learning that emphasizes the geometric aspects of tactile data across multiple sensors, and a large-scale synthetic dataset containing 1M examples across 100 sensor configurations.

Researchers used the tactile image and a set of calibration images for the sensor as inputs to the network. The sensor background is subtracted from all input images to isolate the pixel-wise color changes. Following Vision Transformer (ViT), these images are linearly projected into tokens, with calibration images requiring tokenization only once per sensor. Two supervision signals then guide the training process: a pixel-wise normal map reconstruction loss on the output patch tokens and a contrastive loss on the class token. During pre-training, a lightweight decoder reconstructs the contact surface as a normal map from the encoder’s output. Moreover, SITR employs Supervised Contrastive Learning (SCL), extending traditional contrastive approaches by using label information to define similarity.
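The description above maps onto a two-headed training objective. The following PyTorch-style sketch only illustrates that structure and is not the authors' code: the encoder and decoder interfaces, the label-based positives, and the unweighted loss sum are all placeholder assumptions.

import torch
import torch.nn.functional as F

def sitr_losses(encoder, decoder, tactile_img, calib_imgs, normal_map_gt, labels, temperature=0.07):
    """Illustrative SITR-style objective: normal-map reconstruction + supervised contrastive loss.

    encoder, decoder: hypothetical modules standing in for the paper's transformer encoder
    and lightweight decoder; labels define positives for the supervised contrastive term.
    """
    cls_tok, patch_toks = encoder(tactile_img, calib_imgs)   # hypothetical interface

    # 1) Pixel-wise normal map reconstruction loss on the patch tokens.
    normal_pred = decoder(patch_toks)
    recon_loss = F.mse_loss(normal_pred, normal_map_gt)

    # 2) Supervised contrastive loss on the class token: samples sharing a label are positives.
    z = F.normalize(cls_tok, dim=1)                           # (B, D) unit-norm embeddings
    sim = z @ z.T / temperature                               # (B, B) scaled similarities
    mask = labels.unsqueeze(0).eq(labels.unsqueeze(1)).float()
    mask.fill_diagonal_(0)                                    # positives, excluding self-pairs
    logits_mask = 1.0 - torch.eye(len(labels), device=z.device)
    exp_sim = torch.exp(sim) * logits_mask                    # denominator excludes self
    log_prob = sim - torch.log(exp_sim.sum(dim=1, keepdim=True))
    contrast_loss = (-(mask * log_prob).sum(dim=1) / mask.sum(dim=1).clamp(min=1)).mean()

    return recon_loss + contrast_loss                         # unweighted sum; weighting is a design choice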

In object classification tests using the researchers’ real-world dataset, SITR outperforms all baseline models when transferred across different sensors. While most models perform well in no-transfer settings, they fail to generalize when tested on distinct sensors. It shows SITR’s ability to capture meaningful, sensor-invariant features that remain robust despite changes in the sensor domain. In pose estimation tasks, where the goal is to estimate 3-DoF position changes using initial and final tactile images, SITR reduces the Root Mean Square Error by approximately 50% compared to baselines. Unlike classification results, ImageNet pre-training only marginally improves pose estimation performance, showing that features learned from natural images may not transfer effectively to tactile domains for precise regression tasks.

In this paper, researchers introduced SITR, a tactile representation framework that transfers across various vision-based tactile sensors in a zero-shot manner. They constructed large-scale, sensor-aligned datasets using synthetic and real-world data and developed a method to train SITR to capture dense, sensor-invariant features. The SITR represents a step toward a unified approach to tactile sensing, where models can generalize seamlessly across different sensor types without retraining or fine-tuning. This breakthrough has the potential to accelerate advancements in robotic manipulation and tactile research by removing a key barrier to the adoption and implementation of these promising sensor technologies.

Check out the Paper and Code. All credit for this research goes to the researchers of this project.

The post Sensor-Invariant Tactile Representation for Zero-Shot Transfer Across Vision-Based Tactile Sensors appeared first on MarkTechPost.

This AI Paper Introduces an LLM+FOON Framework: A Graph-Validated Approach for Robotic Cooking Task Planning from Video Instructions

Robots are increasingly being developed for home environments, specifically to enable them to perform daily activities like cooking. These tasks involve a combination of visual interpretation, manipulation, and decision-making across a series of actions. Cooking, in particular, is complex for robots due to the diversity in utensils, varying visual perspectives, and frequent omissions of intermediate steps in instructional materials like videos. For a robot to succeed in such tasks, a method is needed that ensures logical planning, flexible understanding, and adaptability to different environmental constraints.

One major problem in translating cooking demonstrations into robotic tasks is the lack of standardization in online content. Videos might skip steps, include irrelevant segments like introductions, or show arrangements that do not align with the robot’s operational layout. Robots must interpret visual data and textual cues, infer omitted steps, and translate this into a sequence of physical actions. However, when relying purely on generative models to produce these sequences, there is a high chance of logic failures or hallucinated outputs that render the plan infeasible for robotic execution.

Current tools supporting robotic planning often focus on logic-based models like PDDL or more recent data-driven approaches using Large Language Models (LLMs) or multimodal architectures. While LLMs are adept at reasoning from diverse inputs, they often cannot validate whether the generated plan makes sense in a robotic setting. Prompt-based feedback mechanisms have been tested, but they still fail to confirm the logical correctness of individual actions, especially for complex, multi-step tasks like those in cooking scenarios.

Researchers from the University of Osaka and the National Institute of Advanced Industrial Science and Technology (AIST), Japan, introduced a new framework integrating an LLM with a Functional Object-Oriented Network (FOON) to develop cooking task plans from subtitle-enhanced videos. This hybrid system uses an LLM to interpret a video and generate task sequences. These sequences are then converted into FOON-based graphs, where each action is checked for feasibility against the robot’s current environment. If a step is deemed infeasible, feedback is generated so that the LLM can revise the plan accordingly, ensuring that only logically sound steps are retained.

This method involves several layers of processing. First, the cooking video is split into segments based on subtitles extracted using Optical Character Recognition. Key video frames are selected from each segment and arranged into a 3×3 grid to serve as input images. The LLM is prompted with structured details, including task descriptions, known constraints, and environment layouts. Using this data, it infers the target object states for each segment. These are cross-verified by FOON, a graph system where actions are represented as functional units containing input and output object states. If an inconsistency is found—for instance, if a hand is already holding an item when it’s supposed to pick something else—the task is flagged and revised. This loop continues until a complete and executable task graph is formed.
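To make the plan-validate-replan loop concrete, here is a schematic sketch of how such a loop might be wired together. Every function name (segment_video_by_subtitles, llm_generate_plan, foon_check, and so on) is a placeholder for a component described above, not an API from the paper's code.

def plan_cooking_task(video, environment, max_replans=5):
    """Schematic LLM + FOON planning loop (illustrative only; helpers are placeholders)."""
    segments = segment_video_by_subtitles(video)           # OCR-extracted subtitles define segments
    frames = [make_3x3_grid(select_key_frames(seg)) for seg in segments]

    feedback = None
    for attempt in range(max_replans):
        # The LLM infers target object states and task steps from the frames,
        # the task description, known constraints, and the environment layout.
        plan = llm_generate_plan(frames, environment, feedback=feedback)

        # Each step is converted into a FOON functional unit and checked for
        # feasibility against the current object states in the environment.
        task_graph, errors = foon_check(plan, environment)
        if not errors:
            return task_graph                               # complete, executable task graph
        feedback = errors                                   # infeasible steps go back to the LLM for revision

    raise RuntimeError("No feasible task graph found within the re-planning budget")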

The researchers tested their method using five full cooking recipes from ten videos. Their experiments successfully generated complete and feasible task plans for four of the five recipes. In contrast, a baseline approach that used only the LLM without FOON validation succeeded in just one case. Specifically, the FOON-enhanced method had a success rate of 80% (4/5), while the baseline achieved only 20% (1/5). Moreover, in the component evaluation of target object node estimation, the system achieved an 86% success rate in accurately predicting object states. During the video preprocessing stage, the OCR process extracted 270 subtitle words compared to the ground truth of 230, resulting in a 17% error rate, which the LLM could still manage by filtering redundant instructions.

In a real-world trial using a dual-arm UR3e robot system, the team demonstrated their method on a gyudon (beef bowl) recipe. The robot could infer and insert a missing “cut” action that was absent in the video, showing the system’s ability to identify and compensate for incomplete instructions. The task graph for the recipe was generated after three re-planning attempts, and the robot completed the cooking sequence successfully. The LLM also correctly ignored non-essential scenes like the video introduction, identifying only 8 of 13 necessary segments for task execution.

This research clearly outlines the problem of hallucination and logical inconsistency in LLM-based robotic task planning. The proposed method offers a robust solution to generate actionable plans from unstructured cooking videos by incorporating FOON as a validation and correction mechanism. The methodology bridges reasoning and logical verification, enabling robots to execute complex tasks by adapting to environmental conditions while maintaining task accuracy.

Check out the Paper. All credit for this research goes to the researchers of this project.

The post This AI Paper Introduces an LLM+FOON Framework: A Graph-Validated Approach for Robotic Cooking Task Planning from Video Instructions appeared first on MarkTechPost.

A Code Implementation to Use Ollama through Google Colab and Building a Local RAG Pipeline on Using DeepSeek-R1 1.5B through Ollama, LangChain, FAISS, and ChromaDB for Q&A

In this tutorial, we’ll build a fully functional Retrieval-Augmented Generation (RAG) pipeline using open-source tools that run seamlessly on Google Colab. First, we will look into how to set up Ollama and use models through Colab. Integrating the DeepSeek-R1 1.5B large language model served through Ollama, the modular orchestration of LangChain, and the high-performance ChromaDB vector store allows users to query real-time information extracted from uploaded PDFs. With a combination of local language model reasoning and retrieval of factual data from PDF documents, the pipeline demonstrates a powerful, private, and cost-effective alternative.

!pip install colab-xterm
%load_ext colabxterm

We use the colab-xterm extension to enable terminal access directly within the Colab environment. By installing it with !pip install colab-xterm and loading it via %load_ext colabxterm, users can open an interactive terminal window inside Colab, making it easier to run commands like ollama serve or monitor local processes.

%xterm

The %xterm magic command is used after loading the colab-xterm extension to launch an interactive terminal window within the Colab notebook interface. This allows users to execute shell commands in real time, just like a regular terminal, making it especially useful for running background services like ollama serve, managing files, or debugging system-level operations without leaving the notebook.

Here, we install Ollama in the terminal using curl https://ollama.ai/install.sh | sh.

Then, we start the Ollama server with ollama serve.

Finally, we download the DeepSeek-R1 1.5B model locally through Ollama (for example, with ollama pull deepseek-r1:1.5b) so it can be used to build the RAG pipeline.

!pip install langchain langchain-community sentence-transformers chromadb faiss-cpu

To set up the core components of the RAG pipeline, we install essential libraries, including langchain, langchain-community, sentence-transformers, chromadb, and faiss-cpu. These packages enable document processing, embedding, vector storage, and retrieval functionalities required to build an efficient and modular local RAG system.

from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.llms import Ollama
from langchain.chains import RetrievalQA
from google.colab import files
import os
from langchain_core.prompts import ChatPromptTemplate
from langchain_ollama.llms import OllamaLLM

We import key modules from the langchain-community and langchain-ollama libraries to handle PDF loading, text splitting, embedding generation, vector storage with Chroma, and LLM integration via Ollama. It also includes Colab’s file upload utility and prompt templates, enabling a seamless flow from document ingestion to query answering using a locally hosted model.

print("Please upload your PDF file...")
uploaded = files.upload()

file_path = list(uploaded.keys())[0]
print(f"File '{file_path}' successfully uploaded.")

if not file_path.lower().endswith('.pdf'):
    print("Warning: Uploaded file is not a PDF. This may cause issues.")

To allow users to add their knowledge sources, we prompt for a PDF upload using google.colab.files.upload(). It verifies the uploaded file type and provides feedback, ensuring that only PDFs are processed for further embedding and retrieval.

!pip install pypdf
import pypdf
loader = PyPDFLoader(file_path)
documents = loader.load()
print(f"Successfully loaded {len(documents)} pages from PDF")

To extract content from the uploaded PDF, we install the pypdf library and use PyPDFLoader from LangChain to load the document. This process converts each page of the PDF into a structured format, enabling downstream tasks like text splitting and embedding.

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
chunks = text_splitter.split_documents(documents)
print(f"Split documents into {len(chunks)} chunks")

The loaded PDF is split into manageable chunks using RecursiveCharacterTextSplitter, with each chunk sized at 1000 characters and a 200-character overlap. This ensures better context retention across chunks, which improves the relevance of retrieved passages during question answering.

embeddings = HuggingFaceEmbeddings(
    model_name="all-MiniLM-L6-v2",
    model_kwargs={'device': 'cpu'}
)

persist_directory = "./chroma_db"

vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory=persist_directory
)

vectorstore.persist()
print(f"Vector store created and persisted to {persist_directory}")

The text chunks are embedded using the all-MiniLM-L6-v2 model from sentence-transformers, running on CPU to enable semantic search. These embeddings are then stored in a persistent ChromaDB vector store, allowing efficient similarity-based retrieval across sessions.

llm = OllamaLLM(model="deepseek-r1:1.5b")
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 3}
)

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True
)

print("RAG pipeline created successfully!")

The RAG pipeline is finalized by connecting the local DeepSeek-R1 model (via OllamaLLM) with the Chroma-based retriever. Using LangChain’s RetrievalQA chain with a “stuff” strategy, the model retrieves the top 3 most relevant chunks to a query and generates context-aware answers, completing the local RAG setup.

def query_rag(question):
    result = qa_chain({"query": question})

    print("\nQuestion:", question)
    print("\nAnswer:", result["result"])

    print("\nSources:")
    for i, doc in enumerate(result["source_documents"]):
        print(f"Source {i+1}:\n{doc.page_content[:200]}...\n")

    return result

question = "What is the main topic of this document?"
result = query_rag(question)

To test the RAG pipeline, a query_rag function takes a user question, retrieves relevant context using the retriever, and generates an answer using the LLM. It also displays the top source documents, providing transparency and traceability for the model’s response.

In conclusion, this tutorial combines the retrieval power of ChromaDB, the orchestration capabilities of LangChain, and the reasoning abilities of DeepSeek-R1 served locally through Ollama. It showcases building a lightweight yet powerful RAG system that runs efficiently on Google Colab’s free tier. The solution enables users to ask questions grounded in up-to-date content from uploaded documents, with answers generated through a local LLM. This architecture provides a foundation for building scalable, customizable, and privacy-friendly AI assistants without incurring cloud costs or compromising performance.

Here is the Colab Notebook.

The post A Code Implementation to Use Ollama through Google Colab and Building a Local RAG Pipeline on Using DeepSeek-R1 1.5B through Ollama, LangChain, FAISS, and ChromaDB for Q&A appeared first on MarkTechPost.

How iFood built a platform to run hundreds of machine learning models …

Headquartered in São Paulo, Brazil, iFood is a national private company and the leader in food-tech in Latin America, processing millions of orders monthly. iFood has stood out for its strategy of incorporating cutting-edge technology into its operations. With the support of AWS, iFood has developed a robust machine learning (ML) inference infrastructure, using services such as Amazon SageMaker to efficiently create and deploy ML models. This partnership has allowed iFood not only to optimize its internal processes, but also to offer innovative solutions to its delivery partners and restaurants.
iFood’s ML platform comprises a set of tools, processes, and workflows developed with the following objectives:

Accelerate the development and training of AI/ML models, making them more reliable and reproducible
Make sure that deploying these models to production is reliable, scalable, and traceable
Facilitate the testing, monitoring, and evaluation of models in production in a transparent, accessible, and standardized manner

To achieve these objectives, iFood uses SageMaker, which simplifies the training and deployment of models. Additionally, the integration of SageMaker features in iFood’s infrastructure automates critical processes, such as generating training datasets, training models, deploying models to production, and continuously monitoring their performance.
In this post, we show how iFood uses SageMaker to revolutionize its ML operations. By harnessing the power of SageMaker, iFood streamlines the entire ML lifecycle, from model training to deployment. This integration not only simplifies complex processes but also automates critical tasks.
AI inference at iFood
iFood has harnessed the power of a robust AI/ML platform to elevate the customer experience across its diverse touchpoints. Using the cutting edge of AI/ML capabilities, the company has developed a suite of transformative solutions to address a multitude of customer use cases:

Personalized recommendations – At iFood, AI-powered recommendation models analyze a customer’s past order history, preferences, and contextual factors to suggest the most relevant restaurants and menu items. This personalized approach makes sure customers discover new cuisines and dishes tailored to their tastes, improving satisfaction and driving increased order volumes.
Intelligent order tracking – iFood’s AI systems track orders in real time, predicting delivery times with a high degree of accuracy. By understanding factors like traffic patterns, restaurant preparation times, and courier locations, the AI can proactively notify customers of their order status and expected arrival, reducing uncertainty and anxiety during the delivery process.
Automated customer service – To handle the thousands of daily customer inquiries, iFood has developed an AI-powered chatbot that can quickly resolve common issues and questions. This intelligent virtual agent understands natural language, accesses relevant data, and provides personalized responses, delivering fast and consistent support without overburdening the human customer service team.
Grocery shopping assistance – Integrating advanced language models, iFood’s app allows customers to simply speak or type their recipe needs or grocery list, and the AI will automatically generate a detailed shopping list. This voice-enabled grocery planning feature saves customers time and effort, enhancing their overall shopping experience.

Through these diverse AI-powered initiatives, iFood is able to anticipate customer needs, streamline key processes, and deliver a consistently exceptional experience—further strengthening its position as the leading food-tech platform in Latin America.
Solution overview
The following diagram illustrates iFood’s legacy architecture, which had separate workflows for data science and engineering teams, creating challenges in efficiently deploying accurate, real-time machine learning models into production systems.

In the past, the data science and engineering teams at iFood operated independently. Data scientists would build models using notebooks, adjust weights, and publish them onto services. Engineering teams would then struggle to integrate these models into production systems. This disconnection between the two teams made it challenging to deploy accurate real-time ML models.
To overcome this challenge, iFood built an internal ML platform that helped bridge this gap. This platform has streamlined the workflow, providing a seamless experience for creating, training, and delivering models for inference. It provides a centralized environment where data scientists can build, train, and deploy models in a way that fits the teams’ development workflow. Engineering teams can then consume these models and integrate them into applications for both online and offline use, enabling a more efficient and streamlined workflow.
By breaking down the barriers between data science and engineering, AWS AI platforms empowered iFood to use the full potential of their data and accelerate the development of AI applications. The automated deployment and scalable inference capabilities provided by SageMaker made sure that models were readily available to power intelligent applications and provide accurate predictions on demand. This centralization of ML services as a product has been a game changer for iFood, allowing them to focus on building high-performing models rather than the intricate details of inference.
One of the core capabilities of iFood’s ML platform is the ability to provide the infrastructure to serve predictions. Several use cases are supported by the inference made available through ML Go!, which is responsible for deploying SageMaker pipelines and endpoints. The former are used to schedule offline prediction jobs, and the latter are employed to create model services to be consumed by the application services. The following diagram illustrates iFood’s updated architecture, which incorporates an internal ML platform built to streamline workflows between data science and engineering teams, enabling efficient deployment of machine learning models into production systems.

Integrating model deployment into the service development process was a key initiative to enable data scientists and ML engineers to deploy and maintain those models. The ML platform empowers the building and evolution of ML systems. Several other integrations with important platforms, like the feature platform and data platform, were delivered to improve the experience for users as a whole. The process of consuming ML-based decisions was streamlined, but it doesn’t end there. iFood’s ML platform, ML Go!, is now focusing on new inference capabilities, supported by recent SageMaker features whose ideation and development the iFood team helped support. The following diagram illustrates the final architecture of iFood’s ML platform, showcasing how model deployment is integrated into the service development process, the platform’s connections with feature and data platforms, and its focus on new inference capabilities.

One of the biggest changes is the creation of a single abstraction for connecting with SageMaker endpoints and jobs, called ML Go! Gateway, along with a separation of concerns within the endpoints through the Inference Components feature, which makes serving faster and more efficient. In this new inference structure, the endpoints are also managed by the ML Go! CI/CD, leaving the pipelines to deal only with model promotions rather than the infrastructure itself. This reduces both the lead time for changes and the change failure rate across deployments.
Using SageMaker inference model serving containers
One of the key features of modern machine learning platforms is the standardization of machine learning and AI services. By encapsulating models and dependencies as Docker containers, these platforms ensure consistency and portability across different environments and stages of ML. Using SageMaker, data scientists and developers can use pre-built Docker containers, making it straightforward to deploy and manage ML services. As a project progresses, they can spin up new instances and configure them according to their specific requirements. SageMaker provides Docker containers that are designed to work seamlessly with SageMaker. These containers provide a standardized and scalable environment for running ML workloads on SageMaker.
SageMaker provides a set of pre-built containers for popular ML frameworks and algorithms, such as TensorFlow, PyTorch, XGBoost, and many others. These containers are optimized for performance and include all the necessary dependencies and libraries pre-installed, making it straightforward to get started with your ML projects. In addition to the pre-built containers, it provides options to bring your own custom containers to SageMaker, which include your specific ML code, dependencies, and libraries. This can be particularly useful if you’re using a less common framework or have specific requirements that aren’t met by the pre-built containers.
iFood was highly focused on using custom containers for the training and deployment of ML workloads, providing a consistent and reproducible environment for ML experiments and making it effortless to track and replicate results. The first step in this journey was to standardize the custom ML code, which is the piece the data scientists should actually focus on. Moving away from notebooks and toward BruceML, the way training and serving code is created has changed: it is encapsulated from the start as container images. BruceML was responsible for creating the scaffolding required to seamlessly integrate with the SageMaker platform, allowing the teams to take advantage of its various features, such as hyperparameter tuning, model deployment, and monitoring. By standardizing ML services and using containerization, modern platforms democratize ML, enabling iFood to rapidly build, deploy, and scale intelligent applications.
Automating model deployment and ML system retraining
When running ML models in production, it’s critical to have a robust and automated process for deploying and recalibrating those models across different use cases. This helps make sure the models remain accurate and performant over time. The team at iFood understood this challenge well: deploying the model once is not enough. Instead, they rely on another concept to keep things running well: ML pipelines.
Using Amazon SageMaker Pipelines, they were able to build a CI/CD system for ML, to deliver automated retraining and model deployment. They also integrated this entire system with the company’s existing CI/CD pipeline, making it efficient and also maintaining good DevOps practices used at iFood. It starts with the ML Go! CI/CD pipeline pushing the latest code artifacts containing the model training and deployment logic. It includes the training process, which uses different containers for implementing the entire pipeline. When training is complete, the inference pipeline can be executed to begin the model deployment. It can be an entirely new model, or the promotion of a new version to increase the performance of an existing one. Every model available for deployment is also secured and registered automatically by ML Go! in Amazon SageMaker Model Registry, providing versioning and tracking capabilities.
The final step depends on the intended inference requirements. For batch prediction use cases, the pipeline creates a SageMaker batch transform job to run large-scale predictions. For real-time inference, the pipeline deploys the model to a SageMaker endpoint, carefully selecting the appropriate container variant and instance type to handle the expected production traffic and latency needs. This end-to-end automation has been a game changer for iFood, allowing them to rapidly iterate on their ML models and deploy updates and recalibrations quickly and confidently across their various use cases. SageMaker Pipelines has provided a streamlined way to orchestrate these complex workflows, making sure model operationalization is efficient and reliable.
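As a rough illustration of the kind of pipeline described here, the following sketch uses the SageMaker Python SDK to chain a training step with model registration in the SageMaker Model Registry. It is a minimal, generic example, not iFood's ML Go! code; the image URI, S3 paths, role ARN, and names are placeholders.

from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import TrainingStep
from sagemaker.workflow.step_collections import RegisterModel

role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"   # placeholder

# Custom training container (for example, an image produced by internal scaffolding).
estimator = Estimator(
    image_uri="<custom-training-container-uri>",                 # placeholder
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/model-artifacts/",
)

train_step = TrainingStep(
    name="TrainModel",
    estimator=estimator,
    inputs={"train": TrainingInput(s3_data="s3://my-bucket/training-data/")},
)

# Register the trained model in the Model Registry for versioning and tracking.
register_step = RegisterModel(
    name="RegisterModel",
    estimator=estimator,
    model_data=train_step.properties.ModelArtifacts.S3ModelArtifacts,
    content_types=["application/json"],
    response_types=["application/json"],
    inference_instances=["ml.m5.large"],
    transform_instances=["ml.m5.large"],
    model_package_group_name="my-model-group",
)

pipeline = Pipeline(name="retrain-and-register", steps=[train_step, register_step])
pipeline.upsert(role_arn=role)   # create or update the pipeline, typically from CI/CD
pipeline.start()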
Running inference in different SLA formats
iFood uses the inference capabilities of SageMaker to power its intelligent applications and deliver accurate predictions to its customers. By integrating the robust inference options available in SageMaker, iFood has been able to seamlessly deploy ML models and make them available for real-time and batch predictions. For iFood’s online, real-time prediction use cases, the company uses SageMaker hosted endpoints to deploy their models. These endpoints are integrated into iFood’s customer-facing applications, allowing for immediate inference on incoming data from users. SageMaker handles the scaling and management of these endpoints, making sure that iFood’s models are readily available to provide accurate predictions and enhance the user experience.
In addition to real-time predictions, iFood also uses SageMaker batch transform to perform large-scale, asynchronous inference on datasets. This is particularly useful for iFood’s data preprocessing and batch prediction requirements, such as generating recommendations or insights for their restaurant partners. SageMaker batch transform jobs enable iFood to efficiently process vast amounts of data, further enhancing their data-driven decision-making.
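These two serving modes correspond to two calls in the SageMaker Python SDK. Here is a minimal sketch with placeholder container image, model artifacts, role, and instance types; it is illustrative only, not iFood's configuration.

from sagemaker.model import Model

model = Model(
    image_uri="<inference-container-uri>",                              # placeholder
    model_data="s3://my-bucket/model-artifacts/model.tar.gz",           # placeholder
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",       # placeholder
)

# Real-time inference: a persistent, managed HTTPS endpoint for online predictions.
model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large",
    endpoint_name="my-realtime-endpoint",
)

# Batch inference: an asynchronous transform job over a dataset stored in S3.
transformer = model.transformer(
    instance_count=1,
    instance_type="ml.m5.large",
    output_path="s3://my-bucket/batch-predictions/",
)
transformer.transform(data="s3://my-bucket/batch-input/", content_type="application/json")
transformer.wait()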
Building upon the success of standardization to SageMaker Inference, iFood has been instrumental in partnering with the SageMaker Inference team to build and enhance key AI inference capabilities within the SageMaker platform. Since the early days of ML, iFood has provided the SageMaker Inference team with valuable inputs and expertise, enabling the introduction of several new features and optimizations:

Cost and performance optimizations for generative AI inference – iFood helped the SageMaker Inference team develop innovative techniques to optimize the use of accelerators, enabling SageMaker Inference to reduce foundation model (FM) deployment costs by 50% on average and latency by 20% on average with inference components. This breakthrough delivers significant cost savings and performance improvements for customers running generative AI workloads on SageMaker.
Scaling improvements for AI inference – iFood’s expertise in distributed systems and auto scaling has also helped the SageMaker team develop advanced capabilities to better handle the scaling requirements of generative AI models. These improvements reduce auto scaling times by up to 40% and make auto scaling detection six times faster, making sure that customers can rapidly scale their inference workloads on SageMaker to meet spikes in demand without compromising performance.
Streamlined generative AI model deployment for inference – Recognizing the need for simplified model deployment, iFood collaborated with AWS to introduce the ability to deploy open source large language models (LLMs) and FMs with just a few clicks. This user-friendly functionality removes the complexity traditionally associated with deploying these advanced models, empowering more customers to harness the power of AI.
Scale-to-zero for inference endpoints – iFood played a crucial role in collaborating with SageMaker Inference to develop and launch the scale-to-zero feature for SageMaker inference endpoints. This innovative capability allows inference endpoints to automatically shut down when not in use and rapidly spin up on demand when new requests arrive. This feature is particularly beneficial for dev/test environments, low-traffic applications, and inference use cases with varying inference demands, because it eliminates idle resource costs while maintaining the ability to quickly serve requests when needed. The scale-to-zero functionality represents a major advancement in cost-efficiency for AI inference, making it more accessible and economically viable for a wider range of use cases.
Packaging AI model inference more efficiently – To further simplify the AI model lifecycle, iFood worked with AWS to enhance SageMaker’s capabilities for packaging LLMs and models for deployment. These improvements make it straightforward to prepare and deploy these AI models, accelerating their adoption and integration.
Multi-model endpoints for GPU – iFood collaborated with the SageMaker Inference team to launch multi-model endpoints for GPU-based instances. This enhancement allows you to deploy multiple AI models on a single GPU-enabled endpoint, significantly improving resource utilization and cost-efficiency. By taking advantage of iFood’s expertise in GPU optimization and model serving, SageMaker now offers a solution that can dynamically load and unload models on GPUs, reducing infrastructure costs by up to 75% for customers with multiple models and varying traffic patterns.
Asynchronous inference – Recognizing the need for handling long-running inference requests, the team at iFood worked closely with the SageMaker Inference team to develop and launch Asynchronous Inference in SageMaker. This feature enables you to process large payloads or time-consuming inference requests without the constraints of real-time API calls. iFood’s experience with large-scale distributed systems helped shape this solution, which now allows for better management of resource-intensive inference tasks, and the ability to handle inference requests that might take several minutes to complete. This capability has opened up new use cases for AI inference, particularly in industries dealing with complex data processing tasks such as genomics, video analysis, and financial modeling.

By closely partnering with the SageMaker Inference team, iFood has played a pivotal role in driving the rapid evolution of AI inference and generative AI inference capabilities in SageMaker. The features and optimizations introduced through this collaboration are empowering AWS customers to unlock the transformative potential of inference with greater ease, cost-effectiveness, and performance.

“At iFood, we were at the forefront of adopting transformative machine learning and AI technologies, and our partnership with the SageMaker Inference product team has been instrumental in shaping the future of AI applications. Together, we’ve developed strategies to efficiently manage inference workloads, allowing us to run models with speed and price-performance. The lessons we’ve learned supported us in the creation of our internal platform, which can serve as a blueprint for other organizations looking to harness the power of AI inference. We believe the features we have built in collaboration will broadly help other enterprises who run inference workloads on SageMaker, unlocking new frontiers of innovation and business transformation, by solving recurring and important problems in the universe of machine learning engineering.”
– Daniel Vieira, ML Platform Manager at iFood

Conclusion
Using the capabilities of SageMaker, iFood transformed its approach to ML and AI, unleashing new possibilities for enhancing the customer experience. By building a robust and centralized ML platform, iFood has bridged the gap between its data science and engineering teams, streamlining the model lifecycle from development to deployment. The integration of SageMaker features has enabled iFood to deploy ML models for both real-time and batch-oriented use cases. For real-time, customer-facing applications, iFood uses SageMaker hosted endpoints to provide immediate predictions and enhance the user experience. Additionally, the company uses SageMaker batch transform to efficiently process large datasets and generate insights for its restaurant partners. This flexibility in inference options has been key to iFood’s ability to power a diverse range of intelligent applications.
The automation of deployment and retraining through ML Go!, supported by SageMaker Pipelines and SageMaker Inference, has been a game changer for iFood. This has enabled the company to rapidly iterate on its ML models, deploy updates with confidence, and maintain the ongoing performance and reliability of its intelligent applications. Moreover, iFood’s strategic partnership with the SageMaker Inference team has been instrumental in driving the evolution of AI inference capabilities within the platform. Through this collaboration, iFood has helped shape cost and performance optimizations, scale improvements, and simplify model deployment features—all of which are now benefiting a wider range of AWS customers.
By taking advantage of the capabilities SageMaker offers, iFood has been able to unlock the transformative potential of AI and ML, delivering innovative solutions that enhance the customer experience and strengthen its position as the leading food-tech platform in Latin America. This journey serves as a testament to the power of cloud-based AI infrastructure and the value of strategic partnerships in driving technology-driven business transformation.
By following iFood’s example, you can unlock the full potential of SageMaker for your business, driving innovation and staying ahead in your industry.

About the Authors
Daniel Vieira is a seasoned Machine Learning Engineering Manager at iFood, with a strong academic background in computer science, holding both a bachelor’s and a master’s degree from the Federal University of Minas Gerais (UFMG). With over a decade of experience in software engineering and platform development, Daniel leads iFood’s ML platform, building a robust, scalable ecosystem that drives impactful ML solutions across the company. In his spare time, Daniel Vieira enjoys music, philosophy, and learning about new things while drinking a good cup of coffee.
Debora Fanin serves as a Senior Customer Solutions Manager AWS for the Digital Native Business segment in Brazil. In this role, Debora manages customer transformations, creating cloud adoption strategies to support cost-effective, timely deployments. Her responsibilities include designing change management plans, guiding solution-focused decisions, and addressing potential risks to align with customer objectives. Debora’s academic path includes a Master’s degree in Administration at FEI and certifications such as Amazon Solutions Architect Associate and Agile credentials. Her professional history spans IT and project management roles across diverse sectors, where she developed expertise in cloud technologies, data science, and customer relations.
Saurabh Trikande is a Senior Product Manager for Amazon Bedrock and Amazon SageMaker Inference. He is passionate about working with customers and partners, motivated by the goal of democratizing AI. He focuses on core challenges related to deploying complex AI applications, inference with multi-tenant models, cost optimizations, and making the deployment of generative AI models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch, and spending time with his family.
Gopi Mudiyala is a Senior Technical Account Manager at AWS. He helps customers in the financial services industry with their operations in AWS. As a machine learning enthusiast, Gopi works to help customers succeed in their ML journey. In his spare time, he likes to play badminton, spend time with family, and travel.