How to Build a Proactive Pre-Emptive Churn Prevention Agent with Intelligent Observation and Strategy Formation

In this tutorial, we build a fully functional Pre-Emptive Churn Agent that proactively identifies at-risk users and drafts personalized re-engagement emails before they cancel. Rather than waiting for churn to occur, we design an agentic loop in which we observe user inactivity, analyze behavioral patterns, strategize incentives, and generate human-ready email drafts using Gemini. We orchestrate the entire process step by step, ensuring each component, from data simulation to manager approval, works seamlessly together. Check out the FULL CODES here.

import os
import time
import json
import random
from datetime import datetime, timedelta
from typing import List, Dict, Any
import textwrap

try:
    import google.generativeai as genai
except ImportError:
    !pip install -q -U google-generativeai
    import google.generativeai as genai

from google.colab import userdata
import getpass

We set up our environment, import all required libraries, and ensure Gemini is available for use. We keep the initialization minimal so the rest of the system loads cleanly. As we run it, we prepare the foundation for the agent-driven workflow that follows. Check out the FULL CODES here.

def setup_gemini():
    print("--- Security Check ---")
    try:
        api_key = userdata.get('GEMINI_API_KEY')
    except Exception:
        print("Please enter your Google Gemini API Key:")
        api_key = getpass.getpass("API Key: ")
    if not api_key:
        raise ValueError("API Key is required to run the agent.")
    genai.configure(api_key=api_key)
    return genai.GenerativeModel('gemini-2.5-flash')

class MockCustomerDB:
    def __init__(self):
        self.today = datetime.now()
        self.users = self._generate_mock_users()

    def _generate_mock_users(self) -> List[Dict]:
        profiles = [
            {"id": "U001", "name": "Sarah Connor", "plan": "Enterprise",
             "last_login_days_ago": 2, "top_features": ["Reports", "Admin Panel"], "total_spend": 5000},
            {"id": "U002", "name": "John Smith", "plan": "Basic",
             "last_login_days_ago": 25, "top_features": ["Image Editor"], "total_spend": 50},
            {"id": "U003", "name": "Emily Chen", "plan": "Pro",
             "last_login_days_ago": 16, "top_features": ["API Access", "Data Export"], "total_spend": 1200},
            {"id": "U004", "name": "Marcus Aurelius", "plan": "Enterprise",
             "last_login_days_ago": 45, "top_features": ["Team Management"], "total_spend": 8000}
        ]
        return profiles

    def fetch_at_risk_users(self, threshold_days=14) -> List[Dict]:
        return [u for u in self.users if u['last_login_days_ago'] >= threshold_days]

We configure authentication for Gemini and construct a mock customer database that behaves like a real system. We simulate users with varying levels of inactivity to generate realistic churn scenarios. Check out the FULL CODES here.
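
A quick sanity check (a minimal sketch, assuming the classes above have already been run in the same notebook) confirms which users cross the 14-day inactivity threshold before we hand them to the agent:

db_check = MockCustomerDB()
for u in db_check.fetch_at_risk_users(threshold_days=14):
    print(u["id"], u["name"], f"{u['last_login_days_ago']} days inactive")
# Expected: U002 (25 days), U003 (16 days), U004 (45 days);
# U001 is excluded because that user logged in 2 days ago.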

class ChurnPreventionAgent:
    def __init__(self, model):
        self.model = model

    def analyze_and_strategize(self, user: Dict) -> Dict:
        print(f" ... Analyzing strategy for {user['name']}...")
        prompt = f"""
        You are a Customer Success AI Specialist.
        Analyze this user profile and determine the best 'Win-Back Strategy'.
        USER PROFILE:
        - Name: {user['name']}
        - Plan: {user['plan']}
        - Days Inactive: {user['last_login_days_ago']}
        - Favorite Features: {', '.join(user['top_features'])}
        - Total Spend: ${user['total_spend']}
        TASK:
        1. Determine the 'Churn Probability' (Medium/High/Critical).
        2. Select a specific INCENTIVE.
        3. Explain your reasoning briefly.
        OUTPUT FORMAT:
        {{
          "risk_level": "High",
          "incentive_type": "Specific Incentive",
          "reasoning": "One sentence explanation."
        }}
        """
        try:
            response = self.model.generate_content(prompt)
            clean_json = response.text.replace("```json", "").replace("```", "").strip()
            return json.loads(clean_json)
        except Exception as e:
            return {
                "risk_level": "Unknown",
                "incentive_type": "General Check-in",
                "reasoning": f"Analysis failed: {str(e)}"
            }

We build the analytical core of our churn agent to evaluate user behavior and select win-back strategies. We let Gemini interpret signals, such as inactivity and usage patterns, to determine risk and incentives. Check out the FULL CODES here.

    # Continues the ChurnPreventionAgent class defined above.
    def draft_engagement_email(self, user: Dict, strategy: Dict) -> str:
        print(f" ... Drafting email for {user['name']} using '{strategy['incentive_type']}'...")
        prompt = f"""
        Write a short, empathetic, professional re-engagement email.
        TO: {user['name']}
        CONTEXT: They haven't logged in for {user['last_login_days_ago']} days.
        STRATEGY: {strategy['incentive_type']}
        REASONING: {strategy['reasoning']}
        USER HISTORY: They love {', '.join(user['top_features'])}.
        TONE: Helpful and concise.
        """
        response = self.model.generate_content(prompt)
        return response.text

We generate personalized re-engagement emails based on the strategy output from the previous step. We use Gemini to craft concise, empathetic messaging that aligns with each user’s history. Check out the FULL CODES here.

class ManagerDashboard:
    def review_draft(self, user_name, strategy, draft_text):
        print("\n" + "=" * 60)
        print(f" REVIEW REQUIRED: Re-engagement for {user_name}")
        print(f" Strategy: {strategy['incentive_type']}")
        print(f" Risk Level: {strategy['risk_level']}")
        print("-" * 60)
        print(" DRAFT EMAIL:\n")
        print(textwrap.indent(draft_text, '  '))
        print("-" * 60)
        print("\n[Auto-Simulation] Manager reviewing...")
        time.sleep(1.5)
        if strategy['risk_level'] == "Critical":
            print(" MANAGER DECISION: Approved (Priority Send)")
            return True
        else:
            print(" MANAGER DECISION: Approved")
            return True

We simulate a manager dashboard where human oversight approves or rejects the drafted email. We keep the flow simple but realistic, ensuring the agent’s actions remain aligned with human judgment. Check out the FULL CODES here.

def main():
    print("Initializing Agentic System...")
    try:
        model = setup_gemini()
        db = MockCustomerDB()
        agent = ChurnPreventionAgent(model)
        manager = ManagerDashboard()
    except Exception as e:
        print(f"Setup failed: {e}")
        return

    print("\n AGENT STATUS: Scanning Database for inactive users (>14 days)...")
    at_risk_users = db.fetch_at_risk_users(threshold_days=14)
    print(f"Found {len(at_risk_users)} at-risk users.\n")

    for user in at_risk_users:
        print(f"--- Processing Case: {user['id']} ({user['name']}) ---")
        strategy = agent.analyze_and_strategize(user)
        email_draft = agent.draft_engagement_email(user, strategy)
        approved = manager.review_draft(user['name'], strategy, email_draft)
        if approved:
            print(f" ACTION: Email queued for sending to {user['name']}.")
        else:
            print(" ACTION: Email rejected.")
        print("\n")
        time.sleep(1)

if __name__ == "__main__":
    main()

We orchestrate the full system: scanning for at-risk users, analyzing them, drafting messages, and routing everything for approval. We bring all components together into one continuous loop. 

In conclusion, we have completed a churn-prevention pipeline that observes, reasons, drafts, and involves a human reviewer before action. We watch the agent detect risk patterns, craft tailored strategies, and generate professional emails, all while maintaining human oversight for final decisions. This implementation demonstrates how agentic workflows can transform customer success operations by enabling timely, personalized, and scalable interventions. We now have a modular foundation we can expand further, connecting it to real databases, CRMs, web dashboards, or automation systems, to build a truly production-ready churn prevention engine.

Check out the FULL CODES here.

Google DeepMind Researchers Release Gemma Scope 2 as a Full Stack Interpretability Suite for Gemma 3 Models

Google DeepMind Researchers introduce Gemma Scope 2, an open suite of interpretability tools that exposes how Gemma 3 language models process and represent information across all layers, from 270M to 27B parameters.

Its core goal is simple: give AI safety and alignment teams a practical way to trace model behavior back to internal features instead of relying only on input-output analysis. When a Gemma 3 model jailbreaks, hallucinates, or shows sycophantic behavior, Gemma Scope 2 lets researchers inspect which internal features fired and how those activations flowed through the network.

What is Gemma Scope 2?

Gemma Scope 2 is a comprehensive, open suite of sparse autoencoders and related tools trained on internal activations of the Gemma 3 model family. Sparse autoencoders (SAEs) act as a microscope on the model. They decompose high-dimensional activations into a sparse set of human-inspectable features that correspond to concepts or behaviors.
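
To make the microscope analogy concrete, the toy sketch below shows the basic shape of a sparse autoencoder applied to a model activation: an encoder expands the activation into a much wider, mostly zero feature vector, and a decoder reconstructs the activation from the few active features. This is an illustrative PyTorch example only; the dimensions, activation function, and sparsity penalty are placeholder assumptions, not the Gemma Scope 2 training recipe.

import torch
import torch.nn as nn

class ToySparseAutoencoder(nn.Module):
    # Illustrative generic SAE, not the Gemma Scope 2 architecture or recipe.
    def __init__(self, d_model=1024, d_features=8192):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)    # activation -> wide feature space
        self.decoder = nn.Linear(d_features, d_model)    # sparse features -> reconstruction

    def forward(self, activation):
        features = torch.relu(self.encoder(activation))  # most entries are driven toward zero
        return self.decoder(features), features

sae = ToySparseAutoencoder()
activation = torch.randn(4, 1024)                        # stand-in for residual-stream activations
reconstruction, features = sae(activation)
loss = ((reconstruction - activation) ** 2).mean() + 1e-3 * features.abs().mean()  # reconstruction + sparsity penalty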

Training Gemma Scope 2 required storing around 110 Petabytes of activation data and fitting over 1 trillion total parameters across all interpretability models.

The suite targets every Gemma 3 variant, including 270M, 1B, 4B, 12B and 27B parameter models, and covers the full depth of the network. This is important because many safety relevant behaviors only appear at larger scales.

What is new compared to the original Gemma Scope?

The first Gemma Scope release focused on Gemma 2 and already enabled research on model hallucination, identifying secrets known by a model and training safer models.

Gemma Scope 2 extends that work in four main ways:

The tools now span the entire Gemma 3 family up to 27B parameters, which is needed to study emergent behaviors observed only in larger models, such as the behavior previously analyzed in the 27B size C2S Scale model for scientific discovery tasks.

Gemma Scope 2 includes SAEs and transcoders trained on every layer of Gemma 3. Skip transcoders and cross layer transcoders help trace multi step computations that are distributed across layers.

The suite applies the Matryoshka training technique so that SAEs learn more useful and stable features and mitigate some flaws identified in the earlier Gemma Scope release.

There are dedicated interpretability tools for Gemma 3 models tuned for chat, which make it possible to analyze multi step behaviors such as jailbreaks, refusal mechanisms and chain of thought faithfulness.

Key Takeaways

Gemma Scope 2 is an open interpretability suite for all Gemma 3 models, from 270M to 27B parameters, with SAEs and transcoders on every layer of both pretrained and instruction tuned variants.

The suite uses sparse autoencoders as a microscope that decomposes internal activations into sparse, concept like features, plus transcoders that track how these features propagate across layers.

Gemma Scope 2 is explicitly positioned for AI safety work to study jailbreaks, hallucinations, sycophancy, refusal mechanisms and discrepancies between internal state and communicated reasoning in Gemma 3.

Check out the Paper, Technical details and Model Weights.

Exploring the zero operator access design of Mantle

At Amazon, our culture, built on honest and transparent discussion of our growth opportunities, enables us to focus on investing and innovating to continually raise the standard on our ability to deliver value for our customers. Earlier this month, we had the opportunity to share an example of this process at work in Mantle, our next-generation inference engine for Amazon Bedrock. As generative AI inferencing and fine-tuning workloads continue to evolve, we need to evolve how we serve inference to our customers in an optimized way, which led to the development of Mantle.
As we set out to reimagine the architecture of our next generation inferencing engine, we made raising the bar on security our top priority. AWS shares our customers’ unwavering focus on security and data privacy. This has been central to our business from the start, and it was particularly in focus from the earliest days of Amazon Bedrock. We’ve understood from the start that generative AI inference workloads present an unprecedented opportunity for customers to harness the latent value of their data, but with that opportunity comes the need to ensure the highest standards in security, privacy, and compliance as our customers build generative AI systems that process their most sensitive data and interact with their most critical systems.
As a baseline, Amazon Bedrock is designed with the same operational security standards that you see across AWS. AWS has always used a least privilege model for operations, where each AWS operator has access to only the minimum set of systems required to do their assigned task, limited to the time when that privilege is needed. Any access to systems that store or process customer data or metadata is logged, monitored for anomalies, and audited. AWS guards against any actions that would disable or bypass these controls. Additionally, on Amazon Bedrock your data is never used to train any models. Model providers have no mechanism to access customer data, because inferencing is done only within the Amazon Bedrock-owned account that model providers don’t have access to. This strong security posture has been a key enabler for our customers to unlock the potential of generative AI applications for their sensitive data.
With Mantle, we raised the bar even further. Following the approach of the AWS Nitro System, we have designed Mantle from the ground up to be zero operator access (ZOA), where we have intentionally excluded any technical means for AWS operators to access customer data. Instead, systems and services are administered using automation and secure APIs that protect customer data. With Mantle, there is no mechanism for any AWS operator to sign in to underlying compute systems or access any customer data, such as inference prompts or completions. Interactive communication tools like Secure Shell (SSH), AWS Systems Manager Session Manager, and serial consoles aren’t installed anywhere in Mantle. Additionally, all inference software updates need to be signed and verified before they can be deployed into the service, ensuring that only approved code runs on Mantle.
Mantle uses the recently released EC2 instance attestation capability to configure a hardened, constrained, and immutable compute environment for customer data processing. The services in Mantle that are responsible for handling model weights and conducting inference operations on customer prompts are further backed by the high assurance of cryptographically signed attestation measurements from the Nitro Trusted Platform Module (NitroTPM).
When a customer calls a Mantle endpoint (for example, bedrock-mantle.[regions].api.aws), such as those that serve the Responses API on Amazon Bedrock, customer data (prompts) leaves the customer's environment over TLS and is encrypted all the way to the Mantle service, which operates with ZOA. Throughout the entire flow and in Mantle, no operator, whether from AWS, the customer, or a model provider, can access the customer data.

Looking forward
Mantle’s ZOA design exemplifies the long-term commitment of AWS to the security and privacy of our customers’ data. It’s this focus that has enabled teams across AWS to invest in further raising the bar for security. At the same time, we’ve made the foundational confidential computing capabilities that we internally use at Amazon, such as NitroTPM Attestation, available to all customers to use on Amazon Elastic Compute Cloud (Amazon EC2).
We’re not stopping here; we’re committed to continuing to invest in enhancing the security of your data and to providing you with more transparency and assurance on how we achieve this.

About the authors
Anthony Liguori is an AWS VP and Distinguished Engineer for Amazon Bedrock, and the lead engineer for Mantle.

AWS AI League: Model customization and agentic showdown

Building intelligent agents to handle complex, real-world tasks can be daunting. Additionally, rather than relying solely on large, pre-trained foundation models, organizations often need to fine-tune and customize smaller, more specialized models to outperform them for their specific use cases. The AWS AI League provides an innovative program to help enterprises overcome the challenges of building advanced AI capabilities through exciting competitions that drive innovation in agentic AI and model customization.
In 2025, the first AWS AI League competition captured the attention of developers, data scientists, and business leaders globally. They came together to solve pressing problems using the latest AI tools and techniques. The grand finale at AWS re:Invent 2025 was an exciting showcase of their ingenuity and skills. Cross-functional teams from leading organizations competed head-to-head, demonstrating their ability to craft effective prompts, fine-tune models, and build powerful AI agents.
Congratulations to our 2025 AWS AI League Champions! After intense competition, these three exceptional builders emerged victorious, sharing a $25,000 prize pool:

1st Place: Hemanth Vediyera from Cisco
2nd Place: Ross Williams from Aqfer
3rd Place: Deepesh Khanna from Capital One

Figure 1: Left to right: Ross, Hemanth, Deepesh

This post explores how the AWS AI League program can be used to host AI competitions that can help participants experience model customization and agent building concepts, apply these to tackle real-world business challenges, and showcase their innovative solutions through engaging, game-style formats. We highlight the new agentic AI and model customization challenges, where enterprises can apply to host internal tournaments using AWS credits, and developers can compete at AWS events.
To get started, visit the AWS AI League product page.
What is the AWS AI League Championship?
The AWS AI League experience begins with a hands-on, 2-hour workshop led by AWS experts, followed by self-paced experimentation. The journey culminates in a captivating, gameshow-style grand finale, where you showcase your AI creations and solutions to address pressing business challenges. The following figure shows these three steps.

Figure 2: AWS AI League Championship steps

Building on the success of the 2025 program, we are excited to announce the launch of the AWS AI League 2026 Championship. This year, the competition features two new challenges that allow participants to really put their AI skills to the test:

The agentic AI Challenge allows you to build intelligent agents using Amazon Bedrock AgentCore. Competitors craft customized agent architectures to tackle real-world business problems.
Complementing the agentic AI Challenge, the model customization Challenge now uses the latest fine-tuning recipes in SageMaker Studio. Here you customize models for specific use cases.

For the 2026 AI League championship, the prize pool doubles to $50,000, with tracks catering to developers at different skill levels – from beginners to advanced practitioners.
Build intelligent agents with the agentic AI challenge
The AWS AI League now features an exciting agentic AI challenge, where you build intelligent agents using Amazon Bedrock AgentCore to solve complex problems in a dynamic, game-style competition. In this challenge, agents navigate through a maze-like grid environment, encountering various challenges while seeking a treasure chest. These challenges map to real-world use cases, testing the agents’ ability to handle inappropriate content, execute code, use a browser, and more.
Agents have a time limit to traverse the map, collect points, and overcome the obstacles before reaching the treasure chest. The more points they earn, the higher they rank on the leaderboard. You can fully customize your agents using Amazon Bedrock AgentCore primitives, which enables you to more securely scale and manage production-grade agents. You can also select specific models for supervisor and sub-agents, as well as create custom tools such as Bedrock Guardrails, AgentCore Memory, and AWS Lambda functions to help your agents navigate the challenges. The following figure depicts the obstacles the agent must overcome while traveling to reach the treasure chest.

Figure 3: AWS AI League Agentic Challenge

AWS AI League provides a full user interface (UI) for users to build their intelligent agent solutions. You can use this no-code UI to construct multi-agent architectures and tools, integrating various components such as Amazon SageMaker Studio CodeEditor for interactive coding of custom Lambda functions and tools. This allows you to fully develop and customize your agent-based solutions within the AWS AI League website, without needing to leave the environment.
The following screenshots showcase the agent building experience all within the AWS AI League website.

Figure 4: AWS AI League agent tools

Figure 5: AWS AI League multi agent architecture

Throughout the competition, users receive real-time agent performance feedback, with a large language model (LLM) evaluator providing assessment to help with iteration. The following image showcases how the agent is evaluated during challenges.

Figure 6: AWS AI League agent challenge evaluation

At the grand finale, the top finalists take the stage to showcase their agents’ capabilities in a live, game-show format, demonstrating the power and versatility of agentic AI in solving complex, multi-step problems. The evaluation criteria include time efficiency, accuracy in solving challenges, agent planning, and token consumption efficiency. The following snapshot shows the final round of the Grand Finale at re:Invent 2025.

Figure 7: AWS AI League re:Invent 2025 Grand Finale

Customize models to outperform larger models
AWS AI League is expanding the scope of its model customization challenge, allowing you to use the latest advancements in fine-tuning techniques.
You can access the new model customization experience within Amazon SageMaker Studio, where you can use powerful new training recipes. The goal is to develop highly effective, domain-specific models that can outperform larger reference models.
The challenge begins with you honing your model customization skills. Using the tools and techniques you have learned, you apply advanced fine-tuning methods to help enhance your model's performance. After your models are customized, the true test begins. The models are submitted to a leaderboard for performance assessment against a reference model. Your model earns points each time the automated judge deems your customized model's response to be more accurate and comprehensive than the reference model's output. You can showcase your advanced skills, rise to the top of the leaderboard, and potentially unlock new opportunities for your organization.
During the challenge, you receive real-time feedback on your model’s performance from an automated evaluator when you submit to the leaderboard. The leaderboard evaluates submissions against a reference dataset throughout the competition, providing immediate feedback on accuracy to help you iterate and improve your solutions. The following image showcases how an AI critique is used to evaluate the customized model.

Figure 8: AWS AI League model customization evaluation

At the grand finale, the top finalists demonstrate their models’ capabilities in a live, game-show format, showcasing their prompt engineering abilities. During the gameshow, the scoring includes expert evaluation where domain experts and a live audience participate in real-time voting to determine which AI solutions best solve real business challenges. The following image showcases the participant prompt engineering view during a Grand Finale.

Figure 9: AWS AI League model customization Grand Finale participant view

Conclusion
In this post, we explored the new AWS AI League challenges and how they are transforming how organizations approach AI development. At AWS, we’ve learned that the fastest way to spark innovation is through competition. With AWS AI League, builders can now showcase their AI skills, compete and unlock innovation.
To learn more about hosting an AWS AI League within your organization, visit the AWS AI League page, and to dive deeper into building intelligent agents and customizing AI models, explore the AWS AI training catalog on AWS Skill Builder.

About the authors
Marc Karp is an ML Architect with the Amazon SageMaker Service team. He focuses on helping customers design, deploy, and manage ML workloads at scale. In his spare time, he enjoys traveling and exploring new places.
Natasya K. Idries is the Product Marketing Manager for AWS AI/ML Gamified Learning Programs. She is passionate about democratizing AI/ML skills through engaging and hands-on educational initiatives that bridge the gap between advanced technology and practical business implementation. Her expertise in building learning communities and driving digital innovation continues to shape her approach to creating impactful AI education programs. Outside of work, Natasya enjoys traveling, cooking Southeast Asian cuisines and exploring nature trails.

Accelerate Enterprise AI Development using Weights & Biases and Amazon Bedrock

This post is co-written by Thomas Capelle and Ray Strickland from Weights & Biases (W&B).
Generative artificial intelligence (AI) adoption is accelerating across enterprises, evolving from simple foundation model interactions to sophisticated agentic workflows. As organizations transition from proof-of-concepts to production deployments, they require robust tools for development, evaluation, and monitoring of AI applications at scale.
In this post, we demonstrate how to use Foundation Models (FMs) from Amazon Bedrock and the newly launched Amazon Bedrock AgentCore alongside W&B Weave to help build, evaluate, and monitor enterprise AI solutions. We cover the complete development lifecycle from tracking individual FM calls to monitoring complex agent workflows in production.
Overview of W&B Weave
Weights & Biases (W&B) is an AI developer system that provides comprehensive tools for training models, fine-tuning, and leveraging foundation models for enterprises of all sizes across various industries.
W&B Weave offers a unified suite of developer tools to support every stage of your agentic AI workflows. It enables:

Tracing & monitoring: Track large language model (LLM) calls and application logic to debug and analyze production systems.
Systematic iteration: Refine and iterate on prompts, datasets and models.
Experimentation: Experiment with different models and prompts in the LLM Playground.
Evaluation: Use custom or pre-built scorers alongside our comparison tools to systematically assess and enhance application performance. Collect user and expert feedback for real-life testing and evaluation.
Guardrails: Help protect your application with safeguards for content moderation, prompt safety, and more. Use custom or third-party guardrails (including Amazon Bedrock Guardrails) or W&B Weave’s native guardrails.

W&B Weave can be fully managed by Weights & Biases in a multi-tenant or single-tenant environment or can be deployed in a customer’s Amazon Virtual Private Cloud (VPC) directly. In addition, W&B Weave’s integration into the W&B Development Platform provides organizations a seamlessly integrated experience between the model training/fine-tuning workflow and the agentic AI workflow.
To get started, subscribe to the Weights & Biases AI Development Platform through AWS Marketplace. Individuals and academic teams can subscribe to W&B at no additional cost.
Tracking Amazon Bedrock FMs with W&B Weave SDK
W&B Weave integrates seamlessly with Amazon Bedrock through Python and TypeScript SDKs. After installing the library and patching your Bedrock client, W&B Weave automatically tracks the LLM calls:

!pip install weave
import weave
import boto3
import json
from weave.integrations.bedrock.bedrock_sdk import patch_client

weave.init("my_bedrock_app")

# Create and patch the Bedrock client
client = boto3.client("bedrock-runtime")
patch_client(client)

# Use the client as usual
response = client.invoke_model(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
    body=json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 100,
        "messages": [
            {"role": "user", "content": "What is the capital of France?"}
        ]
    }),
    contentType='application/json',
    accept='application/json'
)
response_dict = json.loads(response.get('body').read())
print(response_dict["content"][0]["text"])

This integration automatically versions experiments and tracks configurations, providing complete visibility into your Amazon Bedrock applications without modifying core logic.
Experimenting with Amazon Bedrock FMs in W&B Weave Playground
The W&B Weave Playground accelerates prompt engineering with an intuitive interface for testing and comparing Bedrock models. Key features include:

Direct prompt editing and message retrying
Side-by-side model comparison
Access from trace views for rapid iteration

To begin, add your AWS credentials in the Playground settings, select your preferred Amazon Bedrock FMs, and start experimenting. The interface enables rapid iteration on prompts while maintaining full traceability of experiments.

Evaluating Amazon Bedrock FMs with W&B Weave Evaluations
W&B Weave Evaluations provides dedicated tools for evaluating generative AI models effectively. By leveraging W&B Weave Evaluations alongside Amazon Bedrock, users can efficiently evaluate Amazon Bedrock FMs, analyze outputs, and visualize performance across key metrics. Users can use built-in scorers from W&B Weave, third-party or custom scorers, and human or expert feedback as well. This combination allows for a deeper understanding of the tradeoffs between models, such as differences in cost, accuracy, speed, and output quality.
W&B Weave has a first-class way to track evaluations with Model & Evaluation classes. To set up an evaluation job, customers can:

Define a dataset or list of dictionaries with a collection of examples to be evaluated
Create a list of scoring functions. Each function should have a model_output and optionally, other inputs from your examples, and return a dictionary with the scores
Define an Amazon Bedrock model by using Model class
Evaluate this model by calling Evaluation

Here’s an example of setting up an evaluation job:

import weave
from weave import Evaluation
import asyncio

# Collect your examples
examples = [
    {"question": "What is the capital of France?", "expected": "Paris"},
    {"question": "Who wrote 'To Kill a Mockingbird'?", "expected": "Harper Lee"},
    {"question": "What is the square root of 64?", "expected": "8"},
]

# Define any custom scoring function
@weave.op()
def match_score1(expected: str, output: dict) -> dict:
    # Here is where you'd define the logic to score the model output
    return {'match': expected == output['generated_text']}

@weave.op()
def function_to_evaluate(question: str):
    # here's where you would add your LLM call and return the output
    return {'generated_text': 'Paris'}

# Score your examples using scoring functions
evaluation = Evaluation(
    dataset=examples, scorers=[match_score1]
)

# Start tracking the evaluation
weave.init('intro-example')
# Run the evaluation
asyncio.run(evaluation.evaluate(function_to_evaluate))

The evaluation dashboard visualizes performance metrics, enabling informed decisions about model selection and configuration. For detailed guidance, see our previous post on evaluating LLM summarization with Amazon Bedrock and Weave.
Enhancing Amazon Bedrock AgentCore Observability with W&B Weave
Amazon Bedrock AgentCore is a complete set of services for deploying and operating highly capable agents more securely at enterprise scale. It provides more secure runtime environments, workflow execution tools, and operational controls that work with popular frameworks like Strands Agents, CrewAI, LangGraph, and LlamaIndex, as well as many LLM models – whether from Amazon Bedrock or external sources.
AgentCore includes built-in observability through Amazon CloudWatch dashboards that track key metrics like token usage, latency, session duration, and error rates. It also traces workflow steps, showing which tools were invoked and how the model responded, providing essential visibility for debugging and quality assurance in production.
When working with AgentCore and W&B Weave together, teams can use AgentCore’s built-in operational monitoring and security foundations while also using W&B Weave if it aligns with their existing development workflows. Organizations already invested in the W&B environment may choose to incorporate W&B Weave’s visualization tools alongside AgentCore’s native capabilities. This approach gives teams flexibility to use the observability solution that best fits their established processes and preferences when developing complex agents that chain multiple tools and reasoning steps.

There are two main approaches to add W&B Weave observability to your AgentCore agents: using the native W&B Weave SDK or integrating through OpenTelemetry.
Native W&B Weave SDK
The simplest approach is to use W&B Weave’s @weave.op decorator to automatically track function calls. Initialize W&B Weave with your project name and wrap the functions you want to monitor:

import os
from typing import Any, Dict

import weave
from strands import Agent  # assuming a Strands Agents agent; adjust the import for your framework

os.environ["WANDB_API_KEY"] = "your_api_key"
weave.init("your_project_name")

@weave.op()
def word_count_op(text: str) -> int:
    return len(text.split())

@weave.op()
def run_agent(agent: Agent, user_message: str) -> Dict[str, Any]:
    result = agent(user_message)
    return {"message": result.message, "model": agent.model.config["model_id"]}

Since AgentCore runs as a docker container, add W&B weave to your dependencies (for example, uv add weave) to include it in your container image.
OpenTelemetry Integration
For teams already using OpenTelemetry or wanting vendor-neutral instrumentation, W&B Weave supports OTLP (OpenTelemetry Protocol) directly:

import base64
import json

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

# WANDB_API_KEY and WEAVE_PROJECT are assumed to be defined earlier in your script
auth_b64 = base64.b64encode(f"api:{WANDB_API_KEY}".encode()).decode()
exporter = OTLPSpanExporter(
    endpoint="https://trace.wandb.ai/otel/v1/traces",
    headers={"Authorization": f"Basic {auth_b64}", "project_id": WEAVE_PROJECT}
)

# Register the exporter so spans are sent to W&B Weave, then obtain a tracer
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

# Create spans to track execution
with tracer.start_as_current_span("invoke_agent") as span:
    span.set_attribute("input.value", json.dumps({"prompt": user_message}))
    result = agent(user_message)
    span.set_attribute("output.value", json.dumps({"message": result.message}))

This approach maintains compatibility with AgentCore's existing OpenTelemetry infrastructure while routing traces to W&B Weave for visualization.

When using both AgentCore and W&B Weave together, teams have multiple options for observability. AgentCore's CloudWatch integration monitors system health, resource utilization, and error rates while providing tracing for agent reasoning and tool selection. W&B Weave offers visualization capabilities that present execution data in formats familiar to teams already using the W&B environment. Both solutions provide visibility into how agents process information and make decisions, allowing organizations to choose the observability approach that best aligns with their existing workflows and preferences.

This dual-layer approach means users can:

Monitor production service level agreements (SLAs) through CloudWatch alerts
Debug complex agent behaviors in W&B Weave’s trace explorer
Optimize token usage and latency with detailed execution breakdowns
Compare agent performance across different prompts and configurations

The integration requires minimal code changes, preserves your existing AgentCore deployment, and scales with your agent complexity. Whether you’re building simple tool-calling agents or orchestrating multi-step workflows, this observability stack provides the insights needed to iterate quickly and deploy confidently.
For implementation details and complete code examples, refer to our previous post.
Conclusion
In this post, we demonstrated how to build and optimize enterprise-grade agentic AI solutions by combining Amazon Bedrock’s FMs and AgentCore with W&B Weave’s comprehensive observability toolkit. We explored how W&B Weave can enhance every stage of the LLM development lifecycle—from initial experimentation in the Playground to systematic evaluation of model performance, and finally to production monitoring of complex agent workflows.
The integration between Amazon Bedrock and W&B Weave provides several key capabilities:

Automatic tracking of Amazon Bedrock FM calls with minimal code changes using the W&B Weave SDK
Rapid experimentation through the W&B Weave Playground’s intuitive interface for testing prompts and comparing models
Systematic evaluation with custom scoring functions to evaluate different Amazon Bedrock models
Comprehensive observability for AgentCore deployments, with CloudWatch metrics providing more robust operational monitoring supplemented by detailed execution traces

To get started:

Request a free trial or subscribe to the Weights & Biases AI Development Platform through AWS Marketplace
Install the W&B Weave SDK and follow our code examples to begin tracking your Bedrock FM calls
Experiment with different models in the W&B Weave Playground by adding your AWS credentials and testing various Amazon Bedrock FMs
Set up evaluations using the W&B Weave Evaluation framework to systematically compare model performance for your use cases
Enhance your AgentCore agents by adding W&B Weave observability using either the native SDK or OpenTelemetry integration

Start with a simple integration to track your Amazon Bedrock calls, then progressively adopt more advanced features as your AI applications grow in complexity. The combination of Amazon Bedrock and W&B Weave’s comprehensive development tools provides the foundation needed to build, evaluate, and maintain production-ready AI solutions at scale.

About the authors
James Yi is a Senior AI/ML Partner Solutions Architect at AWS. He spearheads AWS’s strategic partnerships in Emerging Technologies, guiding engineering teams to design and develop cutting-edge joint solutions in generative AI. He enables field and technical teams to seamlessly deploy, operate, secure, and integrate partner solutions on AWS. James collaborates closely with business leaders to define and execute joint Go-To-Market strategies, driving cloud-based business growth. Outside of work, he enjoys playing soccer, traveling, and spending time with his family.
Ray Strickland is a Senior Partner Solutions Architect at AWS specializing in AI/ML, Agentic AI and Intelligent Document Processing. He enables partners to deploy scalable generative AI solutions using AWS best practices and drives innovation through strategic partner enablement programs. Ray collaborates across multiple AWS teams to accelerate AI adoption and has extensive experience in partner evaluation and enablement.
Thomas Capelle is a Machine Learning Engineer at Weights & Biases. He is responsible for keeping the www.github.com/wandb/examples repository live and up to date. He also builds content on MLOPS, applications of W&B to industries, and fun deep learning in general. Previously he was using deep learning to solve short-term forecasting for solar energy. He has a background in Urban Planning, Combinatorial Optimization, Transportation Economics, and Applied Math.
Scott Juang is the Director of Alliances at Weights & Biases. Prior to W&B, he led a number of strategic alliances at AWS and Cloudera. Scott studied Materials Engineering and has a passion for renewable energy.

Meta AI Open-Sourced Perception Encoder Audiovisual (PE-AV): The Audiovisual Encoder Powering SAM Audio And Large Scale Multimodal Retrieval

Meta researchers have introduced Perception Encoder Audiovisual, PEAV, as a new family of encoders for joint audio and video understanding. The model learns aligned audio, video, and text representations in a single embedding space using large scale contrastive training on about 100M audio video pairs with text captions.

From Perception Encoder to PEAV

Perception Encoder, PE, is the core vision stack in Meta’s Perception Models project. It is a family of encoders for images, video, and audio that reaches state of the art on many vision and audio benchmarks using a unified contrastive pretraining recipe. PE core surpasses SigLIP2 on image tasks and InternVideo2 on video tasks. PE lang powers Perception Language Model for multimodal reasoning. PE spatial is tuned for dense prediction tasks such as detection and depth estimation.

PEAV builds on this backbone and extends it to full audio video text alignment. In the Perception Models repository, PE audio visual is listed as the branch that embeds audio, video, audio video, and text into a single joint embedding space for cross modal understanding.

https://ai.meta.com/research/publications/pushing-the-frontier-of-audiovisual-perception-with-large-scale-multimodal-correspondence-learning/

Architecture, Separate Towers and Fusion

The PEAV architecture is composed of a frame encoder, a video encoder, an audio encoder, an audio video fusion encoder, and a text encoder.

The video path uses the existing PE frame encoder on RGB frames, then applies a temporal video encoder on top of frame level features.

The audio path uses DAC VAE as a codec to convert raw waveforms into discrete audio tokens at fixed frame rate, about one embedding every 40 milliseconds.

These towers feed an audio video fusion encoder that learns a shared representation for both streams. The text encoder projects text queries into several specialized spaces. In practice this gives you a single backbone that can be queried in many ways. You can retrieve video from text, audio from text, audio from video, or retrieve text descriptions conditioned on any combination of modalities without retraining task specific heads.

https://ai.meta.com/research/publications/pushing-the-frontier-of-audiovisual-perception-with-large-scale-multimodal-correspondence-learning/

Data Engine, Synthetic Audiovisual Captions At Scale

The research team proposed a two stage audiovisual data engine that generates high quality synthetic captions for unlabeled clips. The team describes a pipeline that first uses several weak audio caption models, their confidence scores, and separate video captioners as input to a large language model. This LLM produces three caption types per clip, one for audio content, one for visual content, and one for joint audio visual content. An initial PE AV model is trained on this synthetic supervision.

In the second stage, this initial PEAV is paired with a Perception Language Model decoder. Together they refine the captions to better exploit audiovisual correspondences. The two stage engine yields reliable captions for about 100M audio video pairs and uses about 92M unique clips for stage 1 pretraining and 32M additional unique clips for stage 2 fine tuning.

Compared to prior work that often focuses on speech or narrow sound domains, this corpus is designed to be balanced across speech, general sounds, music, and diverse video domains, which is important for general audio visual retrieval and understanding.

Contrastive Objective Across Ten Modality Pairs

PEAV uses a sigmoid based contrastive loss across audio, video, text, and fused representations. The research team explains that the model uses eight contrastive loss pairs during pretraining. These cover combinations such as audio text, video text, audio video text, and fusion related pairs. During fine tuning, two extra pairs are added, which brings the total to ten loss pairs among the different modality and caption types.

This objective is similar in form to contrastive objectives used in recent vision language encoders but generalized to audio video text tri modal training. By aligning all these views in one space, the same encoder can support classification, retrieval, and correspondence tasks with simple dot product similarities.
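
As a rough illustration of the objective's form (a sketch in the spirit of SigLIP-style sigmoid contrastive losses, not Meta's training code), the function below scores every audio-text pair in a batch with a dot product on normalized embeddings and pushes matched pairs toward positive logits and mismatched pairs toward negative ones; PEAV applies terms of this kind across its eight to ten modality pairs.

import torch
import torch.nn.functional as F

def sigmoid_contrastive_loss(audio_emb, text_emb, scale=10.0, bias=-10.0):
    # Pairwise sigmoid loss for one modality pair; scale and bias are illustrative values.
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = scale * audio_emb @ text_emb.T + bias   # (batch, batch) similarity matrix
    labels = 2 * torch.eye(logits.size(0)) - 1       # +1 on the diagonal (matched), -1 elsewhere
    return -F.logsigmoid(labels * logits).mean()

loss = sigmoid_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))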

Performance Across Audio, Speech, Music And Video

On benchmarks, PEAV targets zero shot retrieval and classification for multiple domains. PE AV achieves state of the art performance on several audio and video benchmarks compared to recent audio text and audio video text models from works such as CLAP, Audio Flamingo, ImageBind, and LanguageBind.

Concrete gains include:

On AudioCaps, text to audio retrieval improves from 35.4 R@1 to 45.8 R@1.

On VGGSound, clip level classification accuracy improves from 36.0 to 47.1.

For speech retrieval on VCTK style tasks, PE AV reaches 85.6 accuracy while earlier models are near 0.

On ActivityNet, text to video retrieval improves from 60.4 R@1 to 66.5 R@1.

On Kinetics 400, zero shot video classification improves from 76.9 to 78.9, beating models 2 to 4 times larger.

https://ai.meta.com/research/publications/pushing-the-frontier-of-audiovisual-perception-with-large-scale-multimodal-correspondence-learning/

PEA-Frame, Frame Level Audio Text Alignment

Alongside PEAV, Meta releases Perception Encoder Audio Frame (PEA-Frame) for sound event localization. PEA-Frame is an audio-text embedding model that outputs one audio embedding per 40-millisecond frame and a single text embedding per query. The model can return temporal spans that mark where in the audio each described event occurs.

PEA-Frame uses frame level contrastive learning to align audio frames with text. This enables precise localization of events such as specific speakers, instruments, or transient sounds in long audio sequences.
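
Conceptually, frame-level alignment makes localization a matter of comparing each 40-millisecond frame embedding against the text query embedding and keeping the frames that score highest. The sketch below is illustrative only and assumes you already have per-frame audio embeddings and a query embedding from a PEA-Frame-like model; the thresholding logic is not the released API.

import torch
import torch.nn.functional as F

def locate_event(frame_embs, text_emb, threshold=0.3, frame_ms=40):
    # frame_embs: (num_frames, dim), one embedding per 40 ms frame; text_emb: (dim,)
    sims = F.normalize(frame_embs, dim=-1) @ F.normalize(text_emb, dim=0)
    active = (sims > threshold).nonzero().flatten().tolist()
    spans, start = [], None
    for i, f in enumerate(active):                    # group consecutive active frames into spans
        if start is None:
            start = f
        if i + 1 == len(active) or active[i + 1] != f + 1:
            spans.append((start * frame_ms, (f + 1) * frame_ms))
            start = None
    return spans                                      # list of (start_ms, end_ms) spans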

Role In The Perception Models And SAM Audio Ecosystem

PEAV and PEA-Frame sit inside the broader Perception Models stack, which combines PE encoders with Perception Language Model for multimodal generation and reasoning.

PEAV is also the core perception engine behind Meta’s new SAM Audio model and its Judge evaluator. SAM Audio uses PEAV embeddings to connect visual prompts and text prompts to sound sources in complex mixtures and to score the quality of separated audio tracks.

Key Takeaways

PEAV is a unified encoder for audio, video, and text, trained with contrastive learning on over 100M videos, and embeds audio, video, audio video, and text into a single joint space for cross modal retrieval and understanding.

The architecture uses separate video and audio towers, with PE based visual encoding and DAC VAE audio tokenization, followed by an audio visual fusion encoder and specialized text heads aligned to different modality pairs.

A 2 stage data engine generates synthetic audio, visual, and audio visual captions using weaker captioners plus an LLM in stage 1 and PEAV plus Perception Language Model in stage 2, enabling large scale multimodal supervision without manual labels.

PEAV establishes new state of the art on a wide range of audio and video benchmarks through a sigmoid contrastive objective over multiple modality pairs, with six public checkpoints from small 16 frame to large all frame variants, where average retrieval improves from about 45 to 51.6.

PEAV, together with the frame level PEA-Frame variant, forms the perception backbone for Meta’s SAM Audio system, providing the embeddings used for prompt based audio separation and fine grained sound event localization across speech, music, and general sounds.

Check out the Paper, Repo and Model Weights.

How to Build a Fully Autonomous Local Fleet-Maintenance Analysis Agent Using SmolAgents and Qwen Model

In this tutorial, we walk through the process of creating a fully autonomous fleet-analysis agent using SmolAgents and a local Qwen model. We generate telemetry data, load it through a custom tool, and let our agent reason, analyze, and visualize maintenance risks without any external API calls. At each step of implementation, we see how the agent interprets structured logs, applies logical filters, detects anomalies, and finally produces a clear visual warning for fleet managers. Check out the FULL CODES here.

print(" Installing libraries... (approx 30-60s)")
!pip install smolagents transformers accelerate bitsandbytes ddgs matplotlib pandas -q

import os
import pandas as pd
import matplotlib.pyplot as plt
from smolagents import CodeAgent, Tool, TransformersModel

We install all required libraries and import the core modules we rely on for building our agent. We set up SmolAgents, Transformers, and basic data-handling tools to process telemetry and run the local model smoothly. At this stage, we prepare our environment and ensure everything loads correctly before moving ahead. Check out the FULL CODES here.

fleet_data = {
    "truck_id": ["T-101", "T-102", "T-103", "T-104", "T-105"],
    "driver": ["Ali", "Sara", "Mike", "Omar", "Jen"],
    "avg_speed_kmh": [65, 70, 62, 85, 60],
    "fuel_efficiency_kml": [3.2, 3.1, 3.3, 1.8, 3.4],
    "engine_temp_c": [85, 88, 86, 105, 84],
    "last_maintenance_days": [30, 45, 120, 200, 15]
}
df = pd.DataFrame(fleet_data)
df.to_csv("fleet_logs.csv", index=False)
print(" 'fleet_logs.csv' created.")

We generate the dummy fleet dataset that our agent will later analyze. We create a small but realistic set of telemetry fields, convert it into a DataFrame, and save it as a CSV file. Here, we establish the core data source that drives the agent’s reasoning and predictions. Check out the FULL CODES here.
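
Before handing the file to the agent, a quick manual check (a small sketch assuming the CSV created above) shows what a correct run should conclude: T-104 has the worst fuel efficiency and is well past the 90-day maintenance threshold.

check = pd.read_csv("fleet_logs.csv")
worst = check.loc[check["fuel_efficiency_kml"].idxmin()]
print(worst["truck_id"], worst["fuel_efficiency_kml"], worst["last_maintenance_days"])
# Expected: T-104  1.8  200  -> overdue, since 200 days > 90-day threshold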

class FleetDataTool(Tool):
    name = "load_fleet_logs"
    description = "Loads vehicle telemetry logs from 'fleet_logs.csv'. Returns the data summary."
    inputs = {}
    output_type = "string"

    def forward(self):
        try:
            df = pd.read_csv("fleet_logs.csv")
            return f"Columns: {list(df.columns)}\nData Sample:\n{df.to_string()}"
        except Exception as e:
            return f"Error loading logs: {e}"

We define the FleetDataTool, which acts as the bridge between the agent and the underlying telemetry file. We give the agent the ability to load and inspect the CSV file to understand its structure. This tool becomes the foundation for every subsequent analysis the model performs. Check out the FULL CODES here.

print(" Downloading & Loading Local Model (approx 60-90s)...")
model = TransformersModel(
    model_id="Qwen/Qwen2.5-Coder-1.5B-Instruct",
    device_map="auto",
    max_new_tokens=2048
)
print(" Model loaded on GPU.")

agent = CodeAgent(
    tools=[FleetDataTool()],
    model=model,
    add_base_tools=True
)

print("\n Agent is analyzing fleet data... (Check the 'Agent' output below)\n")

query = """
1. Load the fleet logs.
2. Find the truck with the worst fuel efficiency (lowest 'fuel_efficiency_kml').
3. For that truck, check if it is overdue for maintenance (threshold is 90 days).
4. Create a bar chart comparing the 'fuel_efficiency_kml' of ALL trucks.
5. Highlight the worst truck in RED and others in GRAY on the chart.
6. Save the chart as 'maintenance_alert.png'.
"""
response = agent.run(query)

print(f"\n FINAL REPORT: {response}")

We load the Qwen2.5 local model and initialize our CodeAgent with the custom tool. We then craft a detailed query outlining the reasoning steps we want the agent to follow and execute it end-to-end. This is where we watch the agent think, analyze, compute, and even plot, fully autonomously. Check out the FULL CODES here.

if os.path.exists("maintenance_alert.png"):
    print("\n Displaying Generated Chart:")
    img = plt.imread("maintenance_alert.png")
    plt.figure(figsize=(10, 5))
    plt.imshow(img)
    plt.axis('off')
    plt.show()
else:
    print(" No chart image found. Check the agent logs above.")

We check whether the agent successfully saved the generated maintenance chart and display it if available. We visualize the output directly in the notebook, allowing us to confirm that the agent correctly performed data analysis and plotting. This gives us a clean, interpretable result from the entire workflow.

In conclusion, we built an intelligent end-to-end pipeline that enables a local model to autonomously load data, evaluate fleet health, identify the highest-risk vehicle, and generate a diagnostic chart for actionable insights. We witness how easily we can extend this framework to real-world datasets, integrate more complex tools, or add multi-step reasoning capabilities for safety, efficiency, or predictive maintenance use cases. At last, we appreciate how SmolAgents empowers us to create practical agentic systems that execute real code, reason over real telemetry, and deliver insights immediately.

Check out the FULL CODES here.

Google Introduces A2UI (Agent-to-User Interface): An Open Source Protocol for Agent Driven Interfaces

Google has open sourced A2UI, an Agent to User Interface specification and set of libraries that lets agents describe rich native interfaces in a declarative JSON format while client applications render them with their own components. The project targets a clear problem: how to let remote agents present secure, interactive interfaces across trust boundaries without sending executable code.

What is A2UI?

A2UI is an open standard and implementation that allows agents to speak UI. An agent does not output HTML or JavaScript. It outputs an A2UI response, which is a JSON payload that describes a set of components, their properties and a data model. The client application reads this description and maps each component to its own native widgets, for example Angular components, Flutter widgets, web components, React components or SwiftUI views.
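
To give a feel for what such a payload might look like, here is a purely illustrative example sketched as a Python dictionary for a simple booking form. The field names are hypothetical and do not reproduce the actual A2UI schema; the point is only that the agent sends data describing components, their properties and a data model, which the client then maps onto its own trusted widgets.

# Hypothetical shape for illustration only; consult the A2UI spec for the real schema.
booking_form = {
    "components": [
        {"id": "root",   "type": "Card",       "children": ["date", "time", "submit"]},
        {"id": "date",   "type": "DatePicker", "bind": "reservation.date"},
        {"id": "time",   "type": "TextField",  "label": "Time", "bind": "reservation.time"},
        {"id": "submit", "type": "Button",     "label": "Book table", "action": "submit_reservation"},
    ],
    "data_model": {"reservation": {"date": None, "time": None}},
}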

The Problem, Agents Need to Speak UI

Most chat based agents respond with long text. For tasks such as restaurant booking or data entry, this produces many turns and dense answers. The A2UI launch post shows a restaurant example where a user asks for a table, then the agent asks several follow up questions in text, which is slow. A better experience is a small form with a date picker, time selector and submit button. A2UI lets the agent request that form as a structured UI description instead of narrating it in natural language.

The problem becomes harder in a multi agent mesh. In that setting, an orchestrator in one organization may delegate work to a remote A2A agent in another organization. The remote agent cannot touch the Document Object Model of the host application. It can only send messages. Historically that meant HTML or script inside an iframe. That approach is heavy, often visually inconsistent with the host and risky from a security point of view. A2UI defines a data format that is safe like data but expressive enough to describe complex layouts.

Core Design, Security and LLM Friendly Structure

A2UI focuses on security, LLM friendliness and portability.

Security first. A2UI is a declarative data format, not executable code. The client maintains a catalog of trusted components such as Card, Button or TextField. The agent can only reference types in this catalog. This reduces the risk of UI injection and avoids arbitrary script execution from model output.

LLM friendly representation. The UI is represented as a flat list of components with identifier references. This makes it easier for language models to generate or update interfaces incrementally and supports streaming updates. The agent can adjust a view as the conversation progresses without regenerating a full nested JSON tree.

Framework agnostic. A single A2UI payload can be rendered on multiple clients. The agent describes a component tree and associated data model. The client maps that structure to native widgets in frameworks such as Angular, Flutter, React or SwiftUI. This allows reuse of the same agent logic across web, mobile and desktop surfaces.

Progressive rendering. Because the format is designed for streaming, clients can show partial interfaces while the agent continues computing. Users see the interface assemble in real time rather than waiting for a complete response.

Architecture and Data Flow

A2UI is a pipeline that separates generation, transport and rendering.

A user sends a message to an agent through a chat or another surface.

The agent, often backed by Gemini or another model that can generate JSON, produces an A2UI response. This response describes components, layout and data bindings.

The A2UI messages stream to the client over a transport such as the Agent to Agent protocol or the AG UI protocol.

The client uses an A2UI renderer library. The renderer parses the payload and resolves each component type into a concrete widget in the host codebase.

User actions, for example button clicks or form submissions, are sent back as events to the agent. The agent may respond with new A2UI messages that update the existing interface.

Key Takeaways

A2UI is an open standard and library set from Google that lets agents ‘speak UI’ by sending a declarative JSON specification for interfaces, while clients render them using native components such as Angular, Flutter or Lit.

The specification focuses on security by treating UI as data, not code, so agents only reference a client controlled catalog of components, which reduces UI injection risk and avoids executing arbitrary scripts from model output.

The internal format uses an updateable, flat representation of components that is optimized for LLMs, which supports streaming and incremental updates, so agents can progressively refine the interface during a session.

A2UI is transport agnostic and is already used with the A2A protocol and AG UI, which allows orchestrator agents and remote sub agents to send UI payloads across trust boundaries while host applications keep control of branding, layout and accessibility.

The project is in early stage public preview at version v0.8, released under Apache 2.0, with reference renderers, quickstart samples and production integrations in projects such as Opal, Gemini Enterprise and Flutter GenUI, making it directly usable by engineers building agentic applications now.

Check out the Github Repo and Technical Details.

Move Beyond Chain-of-Thought with Chain-of-Draft on Amazon Bedrock

As organizations scale their generative AI implementations, the critical challenge of balancing quality, cost, and latency becomes increasingly complex. With inference costs dominating 70–90% of large language model (LLM) operational expenses, and verbose prompting strategies inflating token volume by 3–5x, organizations are actively seeking more efficient approaches to model interaction. Traditional prompting methods, while effective, often create unnecessary overhead that impacts both cost efficiency and response times.
This post explores Chain-of-Draft (CoD), an innovative prompting technique introduced in a Zoom AI Research paper Chain of Draft: Thinking Faster by Writing Less, that revolutionizes how models approach reasoning tasks. While Chain-of-Thought (CoT) prompting has been the go-to method for enhancing model reasoning, CoD offers a more efficient alternative that mirrors human problem-solving patterns—using concise, high-signal thinking steps rather than verbose explanations.
Using Amazon Bedrock and AWS Lambda, we demonstrate a practical implementation of CoD that can achieve remarkable efficiency gains: up to 75% reduction in token usage and over 78% decrease in latency, all while maintaining the accuracy levels of traditional CoT approaches. Through detailed examples, code samples, and performance metrics, we walk through deploying CoD in an AWS environment and measuring its impact on AI implementations. This approach not only optimizes costs but also enhances the overall user experience through faster response times.

Understanding Chain-of-Thought prompting
Chain-of-Thought (CoT) prompting is a technique that guides large language models to reason through problems step by step, rather than jumping directly to an answer. This method has proven particularly effective for complex tasks such as logical puzzles, mathematical problems, and common-sense reasoning scenarios. By mimicking human problem-solving patterns, CoT helps models break down complex problems into manageable steps, improving both accuracy and transparency.
Example of CoT prompting:
Question: If there are 5 apples and you eat 2 apples, how many apples remain?
CoT response: Start with 5 apples. I eat 2 apples. Subtract 2 from 5. 5 – 2 = 3 apples remaining.
However, as the example above shows, this approach comes with some drawbacks in production environments. The verbose nature of CoT responses leads to increased token usage and higher costs. The extended processing time required for generating detailed explanations results in higher latency, making it in some cases less suitable for real-time applications. Additionally, the detailed outputs can complicate downstream processing and integration with other systems.
Introducing Chain-of-Draft prompting
Chain-of-Draft (CoD) is a novel prompting technique that aims to reduce verbosity by limiting the number of words used in each reasoning step, focusing only on the essential calculations or transformations needed to progress, while significantly reducing token usage and inference latency. CoD draws inspiration from how humans solve problems with brief mental notes rather than verbose explanations—encouraging LLMs to generate compact, high-signal reasoning steps.
The key innovation of CoD lies in its constraint: each reasoning step is limited to five words or less. This limitation forces the model to focus on essential logical components while minimizing unnecessary verbosity. For instance, when solving a mathematical word problem, instead of generating full sentences explaining each step, CoD produces concise numerical operations and key logical markers.
Consider this example:
Question: Jason had 20 lollipops. He gave Denny some lollipops. Now Jason has 12 lollipops. How many lollipops did Jason give to Denny?
A CoT response might include several sentences explaining the reasoning process like, “Jason had 20 lollipops. He gave some to Denny and now has 12 left. So he gave away 8.”
In contrast, a CoD response would simply state “Start: 20, End: 12, 20 – 12 = 8.”
This minimalist approach achieves the same logical reasoning while using significantly fewer tokens.
Why CoD works
The key idea behind CoD is that most reasoning chains contain high redundancy. By distilling steps to their semantic core, CoD helps the model focus on the logical structure of the task rather than language fluency. This results in lower inference latency due to shorter outputs, reduced token cost from minimized generation and cleaner output for downstream parsing or automation.
This minimalism is achieved without sacrificing accuracy. In fact, according to the original Zoom AI paper, CoD “achieved 91.4% accuracy on GSM8K (vs. 95.3% for CoT), while reducing output tokens by up to 92.1%, and cutting latency nearly in half in several models tested.”
Under the hood, the CoD technique uses natural language prompts that instruct the model to “think step by step” while explicitly limiting the length of each reasoning step: “Only keep a minimum draft for each thinking step, with 5 words at most.”
The researchers found that models like GPT-4, Claude, and Cohere Command R+ performed especially well under these constraints, particularly when using few-shot examples to demonstrate the concise reasoning pattern.
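
As a concrete illustration, the following small Python helper contrasts how a CoT prompt and a CoD prompt can be assembled from the same question. The CoD instruction matches the draft-limiting phrase quoted above; the helper itself is only a sketch, not part of the paper's code:

# Sketch: build CoT vs. CoD prompts from the same question.
COT_SUFFIX = (
    "Think step by step to answer the question. "
    "Return the answer at the end of the response after separator ###."
)
COD_SUFFIX = (
    "Think step by step to answer the question, but only keep a minimum draft "
    "for each thinking step, with 5 words at most. "
    "Return the answer at the end of the response after separator ###."
)

def build_prompt(question: str, technique: str = "cod") -> str:
    """Append the CoT or CoD reasoning instruction to a question."""
    suffix = COD_SUFFIX if technique == "cod" else COT_SUFFIX
    return f"{question} {suffix}"

question = (
    "Jason had 20 lollipops. He gave Denny some lollipops. "
    "Now Jason has 12 lollipops. How many lollipops did Jason give to Denny?"
)
print(build_prompt(question, technique="cod"))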

Beyond arithmetic tasks, CoD has demonstrated strong performance in commonsense reasoning tasks. In the original Zoom AI paper, the authors evaluated CoD using big-bench benchmarks, specifically focused on date understanding and sports understanding tasks. The same system prompts were used as in arithmetic evaluations, maintaining consistency across experiments. The results revealed that CoD not only significantly reduces token generation and latency, but in several cases, outperforms CoT in accuracy—especially when verbose output isn’t necessary.
One notable finding was with a large language model on the sports understanding task: CoT produced long, verbose responses with an average of 172.5 output tokens, while CoD reduced this to 31.3 tokens, achieving an ~82% reduction. Interestingly, accuracy improved slightly, demonstrating that CoD can be more effective with fewer words.
Here’s a snapshot from the original paper showing the evaluation across two LLMs:

| Model | Prompt | Accuracy | Token | Latency |
| --- | --- | --- | --- | --- |
| LLM-1 | Standard | 72.60% | 5.2 | 0.6s |
| LLM-1 | Chain-of-Thought | 90.20% | 75.7 | 1.7s |
| LLM-1 | Chain-of-Draft | 88.10% | 30.2 | 1.3s |
| LLM-2 | Standard | 84.30% | 5.2 | 1.0s |
| LLM-2 | Chain-of-Thought | 87.00% | 172.5 | 3.2s |
| LLM-2 | Chain-of-Draft | 89.70% | 31.3 | 1.4s |

Table 1. Date understanding evaluation results. (Chain of Draft: Thinking Faster by Writing Less)
These results further validate CoD’s value in real-world reasoning scenarios, showing that models can reason effectively with fewer, smarter tokens. The implication for production use is clear: faster responses and lower cost, without a trade-off in quality.

In the next section, we show how we implemented this prompting strategy using Amazon Bedrock and AWS Lambda, and how CoD compares to CoT across foundation models in real-world conditions.
Implementation and evaluation on AWS
To evaluate the efficiency of CoD prompting techniques, we run a test in Amazon Bedrock and solve the “Red, Blue, and Green Balls” puzzle using an LLM.
The Puzzle: You have three boxes. Each box contains three balls, but the balls can be red, blue, or green. Box 1 is labelled “Red Balls Only.” Box 2 is labelled “Blue Balls Only.” Box 3 is labelled “Red and Blue Balls Only.” The labels on the boxes are all incorrect. The task is that you must determine the contents of each box, knowing that all labels are incorrect. You can only take a single ball from one box and observe its color. Then you must deduce the contents of all three boxes.
We chose this puzzle because solving it requires a measurable number of tokens, as the problem needs to be broken down into several logical steps, each requiring the LLM to process and retain information. The LLM needs to handle “if-then” statements and consider different possibilities leading to logical reasoning. The LLM also needs to maintain the context of the puzzle throughout the reasoning process, and lastly, the LLM needs to understand the symbols and relationships between the colors, labels, and balls.
Prerequisites
To test and compare the prompting techniques in Amazon Bedrock, verify you have the following prerequisites:

AWS account with permission to create and execute Lambda functions
Amazon Bedrock access enabled in your AWS Region (for example, us-east-1), along with model access for the models you plan to test (for example, Model-1 and Model-2); select any models of your choice
AWS IAM role for the Lambda function execution
Permissions to invoke Amazon Bedrock models (bedrock:Converse)
Permissions to put custom metrics in Amazon CloudWatch (cloudwatch:PutMetricData)
(Optional) CloudWatch Logs permissions for logging
Necessary Python libraries (boto3), included in the AWS Lambda runtime environment for Python 3.9 or later

Evaluation with Amazon Bedrock Converse API
We start by creating a Python Lambda function designed to interact with models using Amazon Bedrock to solve the puzzle. This AWS Lambda function uses the Amazon Bedrock Converse API, which provides a unified, consistent interface to interact with various foundation models. The Converse API simplifies sending conversational messages to models and receiving their replies, supporting multi-turn dialogue and advanced features while managing AWS authentication and infrastructure. The Lambda function initializes clients for Amazon Bedrock Runtime and CloudWatch, sends a static puzzle prompt as a user message to the Converse API, retrieves the response text, and calculates latency and token usage for both input and output. These metrics are published to CloudWatch, and relevant logs are recorded. Finally, the function returns the model's answer along with input/output token counts. Errors are logged and returned with a proper HTTP error code.
The Lambda function
import json
import boto3
import time
import logging
from botocore.exceptions import ClientError

logger = logging.getLogger()
logger.setLevel(logging.INFO)

bedrock = boto3.client('bedrock-runtime', region_name='us-east-1')
cloudwatch = boto3.client('cloudwatch')
MODEL_ID = "model1-id"  # Replace with actual Model 1 ID
PROMPT = (
    "You have three boxes. Each box contains three balls, but the balls can be red, blue, or green. "
    "Box 1 is labeled as 'Red Balls Only'. Box 2 is labeled 'Blue Balls Only'. "
    "Box 3 is labeled 'Red and Blue Balls Only'. The labels on the boxes are all incorrect. "
    "The Task: You must determine the contents of each box, knowing that all labels are incorrect. "
    "You can only take a single ball from one box and observe its color. "
    "Then you must deduce the contents of all three boxes. "
    "Think step by step to answer the question, but only keep a minimum draft for each thinking step, with 5 words at most. "
    "Return the answer at the end of the response after separator ###."
)

def lambda_handler(event, context):
    conversation = [{"role": "user", "content": [{"text": PROMPT}]}]
    start_time = time.time()
    try:
        response = bedrock.converse(
            modelId=MODEL_ID,
            messages=conversation,
            inferenceConfig={"maxTokens": 2000, "temperature": 0.7}
        )
        response_text = response["output"]["message"]["content"][0]["text"]
        latency = time.time() - start_time
        # Approximate token counts using whitespace-separated word counts
        input_tokens = len(PROMPT.split())
        output_tokens = len(response_text.split())

        cloudwatch.put_metric_data(
            Namespace='ChainOfDraft',
            MetricData=[
                {"MetricName": "Latency", "Value": latency, "Unit": "Seconds"},
                {"MetricName": "TokensUsed", "Value": input_tokens + output_tokens, "Unit": "Count"},
            ]
        )

        logger.info({
            "request_id": context.aws_request_id,
            "latency_seconds": round(latency, 2),
            "total_tokens": input_tokens + output_tokens
        })

        return {
            "statusCode": 200,
            "body": json.dumps({
                "response": response_text,
                "input_tokens": input_tokens,
                "output_tokens": output_tokens,
                "metrics": {
                    "latency_seconds": round(latency, 2),
                    "total_tokens": input_tokens + output_tokens,
                },
            }),
        }

    except ClientError as e:
        logger.error(f"AWS service error: {e}")
        return {"statusCode": 500, "body": json.dumps("Service error occurred")}

    except Exception as e:
        logger.error(f"Unexpected error: {e}")
        return {"statusCode": 500, "body": json.dumps(f"Internal error occurred: {e}")}
If you're using Model 2, change the MODEL_ID in the preceding code to the Model 2 ID. The rest of the code remains the same.
Testing
Here are the three prompts used with the models to test the Lambda function. Change the PROMPT in the Lambda function to test out the prompting techniques.
Standard prompt:
“You have three boxes. Each box contains three balls, but the balls can be red, blue, or green. Box 1 is labelled as ‘Red Balls Only’. Box 2 is labelled ‘Blue Balls Only’. Box 3 is labelled ‘Red and Blue Balls Only’. The labels on the boxes are all incorrect. The Task: You must determine the contents of each box, knowing that all labels are incorrect. You can only take a single ball from one box and observe its color. Then you must deduce the contents of all three boxes. Answer the question directly. Do not return any preamble explanation or reasoning.”
Chain-of-Thought prompt:
“You have three boxes. Each box contains three balls, but the balls can be red, blue, or green. Box 1 is labelled as ‘Red Balls Only’. Box 2 is labelled ‘Blue Balls Only’. Box 3 is labelled ‘Red and Blue Balls Only’. The labels on the boxes are all incorrect. The Task: You must determine the contents of each box, knowing that all labels are incorrect. You can only take a single ball from one box and observe its color. Then you must deduce the contents of all three boxes. Think step by step to answer the question. Return the answer at the end of the response after separator.”
Chain-of-Draft prompt:
“You have three boxes. Each box contains three balls, but the balls can be red, blue, or green. Box 1 is labelled as ‘Red Balls Only’. Box 2 is labelled ‘Blue Balls Only’. Box 3 is labelled ‘Red and Blue Balls Only’. The labels on the boxes are all incorrect. The Task: You must determine the contents of each box, knowing that all labels are incorrect. You can only take a single ball from one box and observe its color. Then you must deduce the contents of all three boxes. Think step by step to answer the question but only keep a minimum draft for each thinking step, with 5 words at most. Return the answer at the end of the response after separator.”
Results
Testing the Lambda function with the preceding prompts against the two models produced the following results:

| Model | Prompt technique | Input tokens | Output tokens | Total tokens | Token reduction, CoD vs. CoT | Latency in seconds | Latency reduction, CoD vs. CoT |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Model-1 | Standard Prompt | 102 | 23 | 125 | | 0.8 | |
| Model-1 | Chain of Thought | 109 | 241 | 350 | | 3.28 | |
| Model-1 | Chain of Draft | 123 | 93 | 216 | ((350-216)/350) × 100 = 39% reduction | 1.58 | ((3.28-1.58)/3.28) × 100 = 52% reduction |
| Model-2 | Standard Prompt | 102 | 17 | 119 | | 0.6 | |
| Model-2 | Chain of Thought | 109 | 492 | 601 | | 3.81 | |
| Model-2 | Chain of Draft | 123 | 19 | 142 | ((601-142)/601) × 100 = 76% reduction | 0.79 | ((3.81-0.79)/3.81) × 100 = 79% reduction |
Table 2: Results of Testing with Standard prompt, CoD prompt and CoT prompt across the models
The comparison shows that Chain-of-Draft (CoD) is far more efficient than Chain-of-Thought (CoT) across both models. For Model-1, CoD reduces total token usage from 350 to 216 (a 39% reduction) and cuts latency from 3.28 to 1.58 seconds (a 52% reduction). The gains are even greater for Model-2, where CoD lowers tokens from 601 to 142 (a 76% reduction) and latency from 3.81 to 0.79 seconds (a 79% reduction). Overall, CoD delivers significant improvements in speed and token efficiency compared to CoT, with especially strong results on Model-2.
When to avoid using CoD
While CoD prompting offers compelling benefits in terms of efficiency and performance, it’s not universally applicable. There are scenarios where traditional CoT or even more verbose reasoning may be more effective or appropriate. Based on our experimentation and findings from the original research, here are some key considerations:

Zero-shot or prompt-only use cases: CoD performs best when paired with strong few-shot examples. In zero-shot scenarios—where no reasoning patterns are provided—models often struggle to adopt the minimalist drafting style on their own. This can lead to lower accuracy or incomplete reasoning steps.
Tasks requiring high interpretability: For use cases like legal or medical document review, audit trails, or regulated environments, verbose reasoning may be essential. In such cases, CoT’s more transparent, step-by-step explanations provide better traceability and trust.
Small language models: CoD underperformed on models with fewer than 3 billion parameters. These models lack the instruction-following fidelity and reasoning power needed to execute CoD-style prompts effectively. CoT may yield better results in these cases.
Creative or open-ended tasks: Tasks that benefit from elaboration—like writing, ideation, or user-facing conversations—may lose value if too condensed. CoD is best suited for structured reasoning, logic, and deterministic tasks where brevity improves performance.

In short, CoD shines when the goal is efficient reasoning with minimal overhead—but careful prompt design, model selection, and task fit are key to success.
Conclusion and key takeaways
CoD prompting emerges as an efficient technique for organizations seeking to optimize their generative AI implementations. By encouraging language models to reason in concise, focused steps, CoD achieves remarkable improvements in both performance and resource utilization. Our implementation using Amazon Bedrock and AWS Lambda demonstrated significant benefits in token usage and improvement in latency compared to traditional CoT prompting, while maintaining comparable accuracy across various foundation models and complex reasoning tasks. As AI continues to evolve, CoD represents a significant step towards more efficient and performant language models. It’s particularly valuable for structured reasoning tasks where speed and token efficiency are critical, though it’s not a one-size-fits-all solution. We encourage practitioners to explore CoD in their own AI workflows, leveraging its potential to reduce costs, improve response times, and enhance scalability. The future of AI lies in smarter, more efficient reasoning approaches, and CoD prompting is at the forefront of this transformation.
To learn more about prompt engineering and CoD technique, refer to the following resources:

What is prompt engineering?
Chain of Draft: Thinking Faster by Writing Less
Prompt Engineering concepts

About the authors
Ahmed Raafat is a Senior Manager at AWS leading the AI/ML Specialist team in the UK & Ireland, with over 20 years of technology experience helping major companies transform through AI and cloud technologies. As a trusted C-suite advisor and thought leader, he guides organizations in AI strategy and adoption, helping them use emerging technologies for innovation and growth.
Kiranpreet Chawla is a Solutions Architect at Amazon Web Services, leveraging over 15 years of diverse technology experience to drive cloud and AI transformations. Kiranpreet’s expertise spans from cloud modernization to AI/ML implementations, enabling her to provide comprehensive guidance to customers across various industries.

Deploy Mistral AI’s Voxtral on Amazon SageMaker AI

Mistral AI's Voxtral models combine text and audio processing capabilities in a single framework. The Voxtral family includes two distinct variants designed for different use cases and resource requirements. The Voxtral-Mini-3B-2507 is a compact 3-billion-parameter model optimized for efficient audio transcription and basic multimodal understanding, making it ideal for applications where speed and resource efficiency are priorities. The Voxtral-Small-24B-2507 is a 24-billion-parameter model built on the Mistral Small 3 backbone that supports advanced chat capabilities, function calling directly from voice input, and complex audio-text intelligence, perfect for enterprise applications requiring nuanced understanding and multilingual audio processing. Both models support long-form audio context of up to 30–40 minutes, feature automatic language detection, and maintain a 32,000-token context length. They are released under the Apache 2.0 license, making them readily available for both commercial and research applications.
Voxtral models feature multimodal intelligence that processes spoken and written communication within a unified pipeline, alleviating the need for separate transcription and processing stages. The models demonstrate advanced audio understanding by extracting context and sentiment directly from audio inputs and can handle multiple audio files within single conversation threads. Voxtral Small includes function calling capabilities that convert audio inputs into executable tool calls. These capabilities enable applications such as contextual voice assistants, automated meeting transcription with insight extraction, intelligent call processing for customer service, accessibility tools, and multilingual communication systems for global organizations.
In this post, we demonstrate hosting Voxtral models on Amazon SageMaker AI endpoints using vLLM and the Bring Your Own Container (BYOC) approach. vLLM is a high-performance library for serving large language models (LLMs) that features paged attention for improved memory management and tensor parallelism for distributing models across multiple GPUs. The BYOC capability of SageMaker supports deployment with custom container images, providing precise version control for vLLM 0.10.0+ compatibility, optimization flexibility for Voxtral’s multimodal processing requirements (including specialized audio libraries and custom memory management), and support for both Voxtral-Mini and Voxtral-Small models through simple configuration updates.
Solution overview
In this solution, the SageMaker notebook environment serves as the central orchestration point for the entire deployment process. It manages the building and pushing of custom Docker images to Amazon Elastic Container Registry (Amazon ECR), handles model configuration and deployment workflows, and provides testing and validation capabilities to facilitate successful model deployment.
A key part of this solution is a custom Docker container that builds on the official vLLM server by adding specialized audio processing libraries (librosa, soundfile, pydub) and mistral_common for Voxtral tokenization, with everything set up to work seamlessly with the SageMaker BYOC approach. Amazon ECR provides secure storage and scalable distribution of this container image, integrating seamlessly with the SageMaker deployment mechanisms. The SageMaker inference endpoint serves as the production runtime where the Voxtral model is hosted, offering automatic scaling and load balancing with recommended instance types of ml.g6.4xlarge for Voxtral-Mini and ml.g6.12xlarge for Voxtral-Small deployments. Amazon Simple Storage Service (Amazon S3) completes the architecture by storing three critical files from our vLLM-BYOC implementation: the custom inference handler (model.py), model configuration (serving.properties), and dependencies (requirements.txt), creating a modular approach that separates configuration from container images to enable flexible model updates and configuration changes without container rebuilds, so teams can seamlessly switch between Voxtral-Mini and Voxtral-Small deployments by simply updating the serving.properties file.
The following diagram illustrates the solution architecture.

A three-step workflow diagram showing how to deploy Voxtral models on Amazon SageMaker using custom Docker containers, S3 storage, and multi-GPU endpoints.

The solution supports multiple use case patterns for different organizational needs. Text-only processing uses the standard chat completion API for traditional conversational AI where audio processing isn’t required. Transcription-only mode provides accurate audio file transcription, ideal for meeting notes or searchable audio archives. More sophisticated applications combine audio and text intelligence, where audio provides context while text delivers specific instructions, enabling voice-controlled applications with written clarifications. The advanced pattern involves function calling from audio inputs, where spoken commands directly trigger automated actions. For example, saying “Calculate the square root of 144” automatically executes the calculator tool and returns results, creating hands-free workflows.
This post also demonstrates integrating the Voxtral model deployed on SageMaker with Strands Agents to build agentic applications with minimal code.
The following sections provide a complete implementation guide to get your Voxtral model running on SageMaker endpoints.
Prerequisites
To get started, you must have the following prerequisites:

The following software requirements:

vLLM >= 0.10.0
mistral_common >= 1.8.1

AWS account setup, including:

A SageMaker notebook using ml.m5.4xlarge with 100 GB storage.
AWS Identity and Access Management (IAM) permissions. Add the EC2InstanceProfileForImageBuilderECRContainerBuilds policy to the SageMaker execution role.
Service quotas: ml.g6.4xlarge (Voxtral-Mini) and ml.g6.12xlarge (Voxtral-Small) instances available. Refer to Requesting a quota increase to request a service quota increase in your account.

Deploy Voxtral models
Complete the following steps to quickly deploy and test Voxtral models:

Download the code from the GitHub repo:

git clone https://github.com/aws-samples/mistral-on-aws.git
cd "mistral-on-aws/Mistral Voxtral/Voxtral-vllm-byoc"

Build your container:

chmod +x build_and_push.sh
./build_and_push.sh

Configure your model in code/serving.properties:

To deploy Voxtral-Mini, use the following code:

option.model_id=mistralai/Voxtral-Mini-3B-2507
option.tensor_parallel_degree=1

To deploy Voxtral-Small, use the following code:

option.model_id=mistralai/Voxtral-Small-24B-2507
option.tensor_parallel_degree=4

Open and run Voxtral-vLLM-BYOC-SageMaker.ipynb to deploy your endpoint and test with text, audio, and function calling capabilities.
Docker container configuration
The GitHub repo contains the full Dockerfile. The following code snippet highlights the key parts:

# Custom vLLM Container for Voxtral Model Deployment on SageMaker
FROM --platform=linux/amd64 vllm/vllm-openai:latest

# Set environment variables for SageMaker
ENV MODEL_CACHE_DIR=/opt/ml/model
ENV TRANSFORMERS_CACHE=/tmp/transformers_cache
ENV HF_HOME=/tmp/hf_home
ENV VLLM_WORKER_MULTIPROC_METHOD=spawn

# Install audio processing dependencies
RUN pip install --no-cache-dir \
    "mistral_common>=1.8.1" \
    "librosa>=0.10.2" \
    "soundfile>=0.12.1" \
    "pydub>=0.25.1"
This Dockerfile creates a specialized container that extends the official vLLM server with Voxtral-specific capabilities by adding essential audio processing libraries (mistral_common for tokenization, librosa/soundfile/pydub for audio handling) while configuring the proper SageMaker environment variables for model loading and caching. The approach separates infrastructure from business logic by keeping the container generic and allowing SageMaker to dynamically inject model-specific code (model.py and serving.properties) from Amazon S3 at runtime, enabling flexible deployment of different Voxtral variants without requiring container rebuilds.

Model configurations
The full model configurations are in the serving.properties file located in the code folder. The following code snippet highlights the key configurations:

# Model configuration
option.model_id=mistralai/Voxtral-Small-24B-2507
option.tensor_parallel_degree=4
option.dtype=bfloat16
# Voxtral-specific settings (as per official documentation)
option.tokenizer_mode=mistral
option.config_format=mistral
option.load_format=mistral
option.trust_remote_code=true
# Audio processing (Voxtral specifications)
option.limit_mm_per_prompt=audio:8
option.mm_processor_kwargs={"audio_sampling_rate": 16000, "audio_max_length": 1800.0}
# Performance optimizations (vLLM v0.10.0+ features)
option.enable_chunked_prefill=true
option.enable_prefix_caching=true
option.use_v2_block_manager=true
This configuration file provides Voxtral-specific optimizations that follow Mistral's official recommendations for vLLM server deployment, setting up proper tokenization modes, audio processing parameters (supporting up to eight audio files per prompt with 30-minute transcription capability), and using the latest vLLM v0.10.0+ performance features like chunked prefill and prefix caching. The modular design supports seamless switching between Voxtral-Mini and Voxtral-Small by simply changing the model_id and tensor_parallel_degree parameters, while maintaining optimal memory utilization and enabling advanced caching mechanisms for improved inference performance.

Custom inference handler
The full custom inference code is in the model.py file located in the code folder. The following code snippet highlights the key functions:

# FastAPI app for SageMaker compatibility
app = FastAPI(title="Voxtral vLLM Inference Server", version="1.1.0")
model_engine = None

# vLLM Server Initialization for Voxtral
def start_vllm_server():
    """Start vLLM server with Voxtral-specific configuration"""
    config = load_serving_properties()

    cmd = [
        "vllm", "serve", config.get("option.model_id"),
        "--tokenizer-mode", "mistral",
        "--config-format", "mistral",
        "--tensor-parallel-size", config.get("option.tensor_parallel_degree"),
        "--host", "127.0.0.1",
        "--port", "8000"
    ]

    vllm_server_process = subprocess.Popen(cmd, env=vllm_env)
    server_ready = wait_for_server()
    return server_ready

@app.post("/invocations")
async def invoke_model(request: Request):
    """Handle chat, transcription, and function calling"""
    # Transcription requests
    if "transcription" in request_data:
        audio_source = request_data["transcription"]["audio"]
        return transcribe_audio(audio_source)

    # Chat requests with multimodal support
    messages = format_messages_for_openai(request_data["messages"])
    tools = request_data.get("tools")

    # Generate via vLLM OpenAI client
    response = openai_client.chat.completions.create(
        model=model_config["model_id"],
        messages=messages,
        tools=tools if supports_function_calling() else None
    )
    return response
This custom inference handler creates a FastAPI-based server that directly integrates with the vLLM server for optimal Voxtral performance. The handler processes multimodal content including base64-encoded audio and audio URLs, dynamically loads model configurations from the serving.properties file, and supports advanced features like function calling for Voxtral-Small deployments.

SageMaker deployment code
The Voxtral-vLLM-BYOC-SageMaker.ipynb notebook included in the Voxtral-vllm-byoc folder orchestrates the entire deployment process for both Voxtral models:

import boto3
import sagemaker
from sagemaker.model import Model

# Initialize SageMaker session
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()
bucket = "your-s3-bucket"

# Upload model artifacts to S3
byoc_config_uri = sagemaker_session.upload_data(
    path="./code",
    bucket=bucket,
    key_prefix="voxtral-vllm-byoc/code"
)

# Configure custom container image
account_id = boto3.client('sts').get_caller_identity()['Account']
region = boto3.Session().region_name
image_uri = f"{account_id}.dkr.ecr.{region}.amazonaws.com/voxtral-vllm-byoc:latest"

# Create SageMaker model
voxtral_model = Model(
    image_uri=image_uri,
    model_data={
        "S3DataSource": {
            "S3Uri": f"{byoc_config_uri}/",
            "S3DataType": "S3Prefix",
            "CompressionType": "None"
        }
    },
    role=role,
    env={
        'MODEL_CACHE_DIR': '/opt/ml/model',
        'TRANSFORMERS_CACHE': '/tmp/transformers_cache',
        'SAGEMAKER_BIND_TO_PORT': '8080'
    }
)

# Deploy to endpoint
predictor = voxtral_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g6.12xlarge",  # For Voxtral-Small
    container_startup_health_check_timeout=1200,
    wait=True
)
Model use cases
The Voxtral models support various text and speech-to-text use cases, and the Voxtral-Small model supports tool use with voice input. Refer to the GitHub repository for the complete code. In this section, we provide code snippets for different use cases that the model supports.

Text-only
The following code shows a basic text-based conversation with the model. The user sends a text query and receives a structured response:

payload = {
    "messages": [
        {
            "role": "user",
            "content": "Hello! Can you tell me about the advantages of using vLLM for model inference?"
        }
    ],
    "max_tokens": 200,
    "temperature": 0.2,
    "top_p": 0.95
}
response = predictor.predict(payload)
Transcription-only
The following example focuses on speech-to-text transcription by setting the temperature to 0 for deterministic output. The model processes an audio file URL or an audio file encoded as base64, then returns the transcribed text without additional interpretation:

payload = {
    "transcription": {
        "audio": "https://audiocdn.frenchtoday.com/file/ft-public-files/audiobook-samples/AMPFE/AMP%20FE%20Ch%2002%20Story%20Slower.mp3",
        "language": "fr",
        "temperature": 0.0
    }
}
response = predictor.predict(payload)
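The example above passes an audio URL; if your audio lives on disk, a base64 string can be sent instead. The short sketch below shows one way to encode a local file. It assumes the handler accepts a base64 string in the same audio field it uses for URLs, which you should verify against model.py in the repository; the file name is a placeholder:

import base64

def audio_file_to_base64(path):
    """Read a local audio file and return it as a base64-encoded string."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

# Placeholder local file; the payload shape mirrors the URL-based example above.
payload = {
    "transcription": {
        "audio": audio_file_to_base64("meeting_recording.mp3"),
        "language": "en",
        "temperature": 0.0
    }
}
response = predictor.predict(payload)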
Text and audio understanding
The following code combines both text instructions and audio input for multimodal processing. The model can follow specific text commands while analyzing the provided audio file in one inference pass, enabling more complex interactions like guided transcription or audio analysis tasks:

payload = {
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Can you summarise this audio file"
                },
                {
                    "type": "audio",
                    "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/obama.mp3"
                }
            ]
        }
    ],
    "max_tokens": 300,
    "temperature": 0.2,
    "top_p": 0.95
}
response = predictor.predict(payload)
Tool use
The following code showcases function calling capabilities, where the model can interpret voice commands and execute predefined tools. The example demonstrates weather queries through voice input, with the model automatically calling the appropriate function and returning structured results:

# Define weather tool configuration
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather for a specific location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "The city and state, e.g. San Francisco, CA"
                },
                "format": {
                    "type": "string",
                    "enum": ["celsius", "fahrenheit"],
                    "description": "The temperature unit to use."
                }
            },
            "required": ["location", "format"]
        }
    }
}

# Mock weather function
def mock_weather(location, format="celsius"):
    """Always returns sunny weather at 25°C/77°F"""
    temp = 77 if format.lower() == "fahrenheit" else 25
    unit = "°F" if format.lower() == "fahrenheit" else "°C"
    return f"It's sunny in {location} with {temp}{unit}"

# Test payload with audio
payload = {
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "audio",
                    "path": "https://huggingface.co/datasets/patrickvonplaten/audio_samples/resolve/main/fn_calling.wav"
                }
            ]
        }
    ],
    "temperature": 0.2,
    "top_p": 0.95,
    "tools": [WEATHER_TOOL]
}
response = predictor.predict(payload)
Strands Agents integration
The following example shows how to integrate Voxtral with the Strands framework to create intelligent agents capable of using multiple tools. The agent can automatically select and execute appropriate tools (such as calculator, file operations, or shell commands from Strands prebuilt tools) based on user queries, enabling complex multi-step workflows through natural language interaction:

# SageMaker integration with Strands agents
from strands import Agent
from strands.models.sagemaker import SageMakerAIModel
from strands_tools import calculator, current_time, file_read, shell

model = SageMakerAIModel(
    endpoint_config={
        "endpoint_name": endpoint_name,
        "region_name": "us-west-2",
    },
    payload_config={
        "max_tokens": 1000,
        "temperature": 0.7,
        "stream": False,
    }
)

agent = Agent(model=model, tools=[calculator, current_time, file_read, shell])
response = agent("What is the square root of 12?")
Clean up
When you finish experimenting with this example, delete the SageMaker endpoints that you created in the notebook to avoid unnecessary costs:

# Delete SageMaker endpoint
print(f"Deleting endpoint: {endpoint_name}")
predictor.delete_endpoint(delete_endpoint_config=True)
print("Endpoint deleted successfully")
Conclusion
In this post, we demonstrated how to successfully self-host Mistral's open source Voxtral models on SageMaker using the BYOC approach. We've created a production-ready system that uses the latest vLLM framework and official Voxtral optimizations for both Mini and Small model variants. The solution supports the full spectrum of Voxtral capabilities, including text-only conversations, audio transcription, sophisticated multimodal understanding, and function calling directly from voice input. With this flexible architecture, you can switch between Voxtral-Mini and Voxtral-Small models through simple configuration updates without requiring container rebuilds.

Take your multimodal AI applications to the next level by trying out the complete code from the GitHub repository to host the Voxtral model on SageMaker and start building your own voice-enabled applications. Explore Voxtral's full potential by visiting Mistral's official website to discover detailed capabilities, performance benchmarks, and technical specifications. Finally, explore the Strands Agents framework to seamlessly create agentic applications that can execute complex workflows.
About the authors
Ying Hou, PhD, is a Sr. Specialist Solution Architect for GenAI at AWS, where she collaborates with model providers to onboard the latest and most intelligent AI models onto AWS platforms. With deep expertise in Gen AI, ASR, computer vision, NLP, and time-series forecasting models, she works closely with customers to design and build cutting-edge ML and GenAI applications.

Enhance document analytics with Strands AI Agents for the GenAI IDP Ac …

Extracting structured information from unstructured data is a critical first step to unlocking business value. Our Generative AI Intelligent Document Processing (GenAI IDP) Accelerator has been at the forefront of this transformation, already having processed tens of millions of documents for hundreds of customers.
Although organizations can use intelligent document processing (IDP) solutions to digitize their documents by extracting structured data, the methods to efficiently analyze this processed data remain elusive. After documents are processed and structured, a new challenge emerges: how can businesses quickly analyze this wealth of information and unlock actionable insights?
To address this need, we are announcing Analytics Agent, a new feature that is seamlessly integrated into the GenAI IDP Accelerator. With this feature, users can perform advanced searches and complex analyses using natural language queries without SQL or data analysis expertise.
In this post, we discuss how non-technical users can use this tool to analyze and understand the documents they have processed at scale with natural language.
GenAI IDP Accelerator
The GenAI IDP Accelerator, an open source solution, helps organizations use generative AI to automatically extract information from various document types. The accelerator combines Amazon Bedrock and other AWS services, including AWS Lambda, AWS Step Functions, Amazon Simple Queue Service (Amazon SQS), and Amazon DynamoDB, to create a serverless system. The GenAI IDP Accelerator is designed to work at scale and can handle thousands of documents daily. It offers three processing patterns for users to build custom solutions for complex document processing workflows. The accelerator can be deployed using AWS CloudFormation templates, and users can start processing documents immediately through either the web interface or by uploading files directly to Amazon Simple Storage Service (Amazon S3). The accelerator consists of multiple modules like document classification, data extraction, assessment, summarization, and evaluation. To learn more about the GenAI IDP Accelerator, see Accelerate intelligent document processing with generative AI on AWS.
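As a quick illustration of the S3 ingestion path mentioned above, the following minimal boto3 snippet uploads a document to the accelerator's input bucket. The bucket name and key prefix are placeholders, so substitute the values created by your CloudFormation stack:

import boto3

s3 = boto3.client("s3")

# Placeholder bucket and key; use the input bucket provisioned by your IDP stack.
s3.upload_file(
    Filename="invoice-0001.pdf",
    Bucket="my-idp-input-bucket",
    Key="uploads/invoice-0001.pdf",
)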
Now, using natural language queries through the Analytics Agent feature, you can extract valuable information to understand the performance of the solution. To access this feature, simply deploy the latest version of the GenAI IDP Accelerator and choose Agent Companion Chat in the navigation pane, as shown in the following screenshot (from accelerator version 0.4.7). Queries related to analytics automatically get routed to the Analytics Agent.

The Analytics Agent acts as an intelligent interface between business users and their processed document data. It can handle intricate queries that would typically require a skilled data scientist, making advanced analytics accessible to the average business user. For example, a healthcare provider could ask, “What percentage of insurance claims were denied last month? Of those, how many were due to incomplete documentation? Show me a trend of denial reasons over the past six months.” Or a tax accounting firm could ask, “Which of my clients are paying state tax in more than one state on their W2 forms?”
The following screenshot is an example of an analysis using the Analytics Agent feature through the Agent Companion Chat interface. A user in the accounting vertical queried “Make a histogram of gross earnings from all uploaded W2s in the last 180 days with 25 bins between $0 and $300,000,” and the agent analyzed data extracted from over 1,000 W2 forms in under a minute.

Analytics Agent
The Analytics Agent is built using Strands Agents, an open source SDK with a model-driven approach for building AI agents. The agent, using several tools, is designed to make working with enterprise data more intuitive by providing natural language to data and visualization conversion. The Analytics Agent workflow consists of the following steps:

The agent uses a database exploration tool if needed to understand data structures stored in Amazon Athena tables within the IDP solution. This is required because the tables within the IDP solution can have different schemas based on how users have configured the processing pipeline.
The agent converts natural language queries into optimized SQL queries compatible with the available databases and tables. These queries can scale to tables of arbitrary size.
The agent runs the SQL against Athena and stores query results in Amazon S3. These results can be thousands of rows long. The agent automatically fixes and reruns failed queries based on the error message generated by Athena (see the sketch after this list).
The agent securely transfers query results from Amazon S3 into an Amazon Bedrock AgentCore Code Interpreter sandbox.
The agent writes Python code designed to analyze the query results and generate charts or tables in a structured output compatible with the UI. The code is copied into the sandbox and is executed securely there.
Lastly, final visualizations are presented in the web interface for straightforward interpretation.
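
To ground step 3, the following boto3 sketch shows the generic pattern of running a generated SQL statement in Athena and landing results in Amazon S3. The database name, output bucket, and retry behavior are illustrative assumptions, not the accelerator's exact implementation:

import time
import boto3

athena = boto3.client("athena")

def run_athena_query(sql):
    """Run a SQL statement in Athena and wait for it to finish (illustrative only)."""
    execution = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": "idp_documents"},  # placeholder database
        ResultConfiguration={"OutputLocation": "s3://my-idp-results/athena/"},  # placeholder bucket
    )
    query_id = execution["QueryExecutionId"]
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]
        if status["State"] in ("SUCCEEDED", "FAILED", "CANCELLED"):
            # On failure, the agent would read status["StateChangeReason"] and retry with corrected SQL.
            return query_id, status["State"]
        time.sleep(1)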

The following diagram illustrates the workflow of the Analytics Agent.

Solution overview
The following architecture diagram illustrates the serverless Analytics Agent deployment and its integration with the existing IDP solution through the AWS AppSync API.

The Analytics Agent is deployed primarily within Lambda functions. When a user query is provided to the AppSync API from the IDP frontend, an ephemeral request handler Lambda function creates and stores a unique job ID in DynamoDB to track the asynchronous processing flow, and launches a long-running agent request processor Lambda function that instantiates a Strands agent and launches it. The frontend polls the job status and retrieves final results (including from prior jobs) from DynamoDB. The agent request processor Lambda function has AWS Identity and Access Management (IAM) permissions to access the IDP tables in Athena as well as to launch and execute an AgentCore Code Interpreter sandbox for more secure Python code execution.
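The asynchronous job-tracking pattern described above can be sketched with a few boto3 calls. The table and attribute names here are assumptions for illustration, not the accelerator's actual schema:

import uuid
import boto3

dynamodb = boto3.resource("dynamodb")
jobs_table = dynamodb.Table("analytics-agent-jobs")  # placeholder table name

def create_job(user_query):
    """Record a new analytics job so the frontend can poll its status."""
    job_id = str(uuid.uuid4())
    jobs_table.put_item(Item={"job_id": job_id, "status": "PENDING", "query": user_query})
    return job_id

def get_job(job_id):
    """Return the current job record, including results once the agent finishes."""
    return jobs_table.get_item(Key={"job_id": job_id}).get("Item")
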
The architecture follows a security-first design:

Sandboxed execution – The Python code runs in AgentCore Code Interpreter, completely isolated from the rest of the AWS environment and the internet
Secure data transfer – Query results are transferred through Amazon S3 and AgentCore APIs, not through the context window of an LLM
Session management – AgentCore Code Interpreter sessions are properly managed and cleaned up after use
Minimal permissions – Each component requests only the necessary AWS permissions
Audit trail – The solution offers comprehensive logging and monitoring for security reviews

Intelligent document insights with the Analytics Agent
To demonstrate the capabilities of the Analytics Agent, we processed 10,000 documents from the RVL-CDIP dataset using the GenAI IDP Accelerator. The dataset, containing diverse document types including memos, letters, forms, and reports, was processed using Pattern 2 configuration to extract structured information including document type, sender, recipient, and department details. In the following sections, we walk through the details of a single sample user query.
Real-world query: Departmental memo analysis
A business user posed a straightforward question in natural language: “Which departments generate the most memos?” This seemingly simple query would traditionally require a data analyst to complete the following steps:

Obtain credentials and connect to an internal database
Understand the database schema by executing exploratory queries or reading internal documentation
Write complex SQL with proper Athena syntax
Execute and validate the query
Process results and create visualizations
Format findings for presentation

The Analytics Agent handled this entire workflow autonomously in under 60 seconds.
Generated visualization using the Analytics Agent
The following figure shows the visualization the agent generated based on a single natural language query.

The analysis revealed that Lorillard generated the most memos (11 documents), followed by INBIFO, Corporate Affairs, and Philip Morris departments (10 documents each). The visualization showed the distribution across major organizational units, with tobacco research and corporate departments dominating memo generation. If the user wants a different visualization style, they can quickly toggle through various options like pie charts, line charts, and bar charts. They can also display the results as a table. We toggled the original bar chart it created to a doughnut chart for aesthetic purposes in this blog post.
Agent thought process
The agent’s transparent reasoning process reveals the comprehensive orchestration happening behind the scenes.

The agent first explored the database structure, identifying the document_sections_memo table and discovering the inference_result.department column containing the needed information.
The agent crafted an optimized Athena query with proper column quoting and null handling, which can be displayed by clicking “View Details” in the chat window:

After retrieving unique departments from the query results, the agent automatically performed the following actions:

Generated Python code to analyze and visualize the data
Copied the Python code and SQL query results into a secure AgentCore Code Interpreter sandbox
Executed the Python code within the sandbox, returning a JSON dictionary with chart data
Identified and fixed an issue with a NaN value in the data
Created a horizontal bar chart highlighting the top 15 departments
Formatted the output for seamless web display

The Python code it wrote to load the query results into sandbox memory and generate a plot to display in the frontend can be displayed by clicking “View Details” in the chat window (screenshot cropped for brevity):

Agent capabilities
This example showcases three transformative capabilities:

Autonomous problem-solving – The agent independently discovered the database schema, identified the correct table and columns, and handled data quality issues (null values) without human intervention. This means that the agent can work on different documents analyzed by the IDP solution, regardless of document type or IDP processing configurations.
Adaptive reasoning – When the agent detected null values in the initial visualization, it automatically corrected the issue by filtering the data and regenerating the chart, demonstrating self-correction capabilities.
End-to-end interpretability – The entire workflow, from natural language query to polished visualization, executed in 90 seconds with complete transparency. Users can review each decision the agent made through the detailed thought process log.

The Analytics Agent transforms processed document data into actionable intelligence, helping business users explore their document corpus with the same ease as asking a colleague a question. This democratization of data analysis makes sure valuable insights aren’t locked away behind technical barriers, and are immediately accessible to decision-makers across the organization.
How customers can use this feature
The power of this feature lies in its ability to democratize data analysis, turning business users into data analysts through the simple power of conversation. Customers can use this feature in the following use cases:

Instant business insights:

Ask complex questions in plain English, like “What percentage of invoices exceeded $50,000 last quarter?”
Get immediate visualizations of trends and patterns with queries like “How has the average value of invoices trended over the past 12 months?”
Make data-driven decisions without waiting for IT or data science teams with queries like “Show me which employees based out of the Seattle office submitted the most invoices.”

Risk and compliance monitoring:

Detect anomalies in real time with queries like “Show me all contracts missing mandatory clauses.”
Track compliance rates across document types.
Identify high-risk documents requiring immediate attention.

Operational excellence:

Monitor processing bottlenecks with queries like “Which document types have the longest processing times?”
Track accuracy rates across different document categories.
Optimize resource allocation based on volume patterns.

Customer experience enhancement:

Analyze customer-specific processing metrics with queries like “How close are we to using up our monthly processing allocation budget of $100 this month?”
Identify opportunities for process automation.
Track SLA compliance in real time with queries like “Which processed invoices don’t have an associated processed pay slip yet?”

Strategic planning:

Forecast processing volumes based on historical patterns with queries like “We are expecting our number of uploaded documents to increase 20% year over year. How many documents will we expect to process in the next five years?”
Identify seasonal trends and plan accordingly.
Track ROI metrics for document processing investments.
Make data-backed decisions for system scaling.

Best practices
Consider the following best practices when using the Analytics Agent:

Start broad – Begin with general questions before diving into specifics.
Be specific – Clearly state what information you’re looking for. Don’t be afraid to provide an entire paragraph describing what you need if necessary.
Use follow-up queries – Build on what you learned in previous questions to explore topics in depth. Chat messages sent in the Agent Companion Chat are stateful, enabling you to ask follow-up questions.
Check results – Verify visualizations make sense for your data, and read through the displayed agent thought process to validate the decisions it made.

Integration with external agentic AI systems
The Analytics Agent can be easily integrated into other agentic AI systems, such as Amazon Quick Suite, through the IDP Accelerator’s new Model Context Protocol (MCP) Server. Organizations can incorporate document analytics capabilities into their broader AI workflows and automation platforms using this integration. For implementation guidance and technical details, see the MCP integration documentation.
Clean up
When you’re finished experimenting with the Agent Analysis feature, you have two cleanup options depending on your needs:

Remove individual analytics queries – Navigate to the Agent Analysis section in the web UI and use the “load previous chat” pane to delete specific queries. Alternatively, you can remove query entries directly from the DynamoDB analytics jobs table associated with your stack.
Delete the entire IDP deployment – Use the CloudFormation console to delete the IDP stack. For automated cleanup with S3 bucket emptying, you can use the IDP CLI:

idp-cli delete –stack-name my-idp-stack –empty-buckets –force
For more detailed cleanup procedures and options, see the IDP CLI documentation.
Conclusion
In this post, we discussed the new Analytics Agent feature for the GenAI IDP Accelerator, an autonomous agent built on Strands that helps non-technical users analyze and understand the documents they have processed at scale with natural language. With this agent, users no longer need SQL expertise or knowledge of underlying database structures to retrieve data or generate visualizations.
Visit the GenAI IDP Accelerator GitHub repository for detailed guides and examples and choose Watch to stay informed on new releases and features. AWS Professional Services and AWS Partners are available to help with implementation. You can also join the GitHub community to contribute improvements and share your experiences.

About the authors
David Kaleko is a Senior Applied Scientist at the AWS Generative AI Innovation Center, where he leads applied research efforts into cutting-edge generative AI implementation strategies for AWS customers. He holds a PhD in particle physics from Columbia University.
Tryambak Gangopadhyay is a Senior Applied Scientist at the AWS Generative AI Innovation Center, where he collaborates with organizations across a diverse spectrum of industries. His role involves researching and developing generative AI solutions to address crucial business challenges and accelerate AI adoption. Prior to joining AWS, Tryambak completed his PhD at Iowa State University.
Mofijul Islam is an Applied Scientist II and Tech Lead at the AWS Generative AI Innovation Center, where he helps customers tackle customer-centric research and business challenges using generative AI, large language models, multi-agent learning, code generation, and multimodal learning. He holds a PhD in machine learning from the University of Virginia, where his work focused on multimodal machine learning, multilingual NLP, and multitask learning. His research has been published in top-tier conferences like NeurIPS, ICLR, EMNLP, AISTATS, and AAAI, as well as IEEE and ACM Transactions.
Jordan Ratner is a Senior Generative AI Strategist at Amazon Web Services, where he helps companies of different sizes design, deploy, and scale AI solutions. He previously co-founded Deloitte’s global AI practice and led OneReach.ai as Managing Partner, scaling conversational and generative AI deployments worldwide. Jordan now focuses on turning fast-moving AI trends into reusable products and frameworks, driving real adoption across industries.
Bob Strahan is a Principal Solutions Architect in the AWS Generative AI Innovation Center.

Anthropic AI Releases Bloom: An Open-Source Agentic Framework for Auto …

Anthropic has released Bloom, an open source agentic framework that automates behavioral evaluations for frontier AI models. The system takes a researcher specified behavior and builds targeted evaluations that measure how often and how strongly that behavior appears in realistic scenarios.

Why Bloom?

Behavioral evaluations for safety and alignment are expensive to design and maintain. Teams must design creative scenarios by hand, run many interactions, read long transcripts, and aggregate scores. As models evolve, old benchmarks can become obsolete or leak into training data. Anthropic's research team frames this as a scalability problem: they need a way to generate fresh evaluations for misaligned behaviors faster while keeping the metrics meaningful.

Bloom targets this gap. Instead of a fixed benchmark with a small set of prompts, Bloom grows an evaluation suite from a seed configuration. The seed anchors what behavior to study, how many scenarios to generate and what interaction style to use. The framework then produces new but behavior consistent scenarios on each run, while still allowing reproducibility through the recorded seed.

https://www.anthropic.com/research/bloom

Seed configuration and system design

Bloom is implemented as a Python pipeline and is released under the MIT license on GitHub. The core input is the evaluation “seed”, defined in seed.yaml. This file references a behavior key in behaviors/behaviors.json, optional example transcripts and global parameters that shape the whole run.

Key configuration elements include the following (a minimal configuration sketch follows the list):

behavior, a unique identifier defined in behaviors.json for the target behavior, for example sycophancy or self preservation

examples, zero or more few shot transcripts stored under behaviors/examples/

total_evals, the number of rollouts to generate in the suite

rollout.target, the model under evaluation such as claude-sonnet-4

controls such as diversity, max_turns, modality, reasoning effort and additional judgment qualities
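
To make the shape of this configuration concrete, here is a minimal Python sketch that assembles the documented fields into a dictionary and writes them out as a seed.yaml. The exact schema, the example file path, and the specific values are assumptions for illustration, not Bloom's canonical format.

# Minimal sketch of a Bloom-style seed configuration. Field values and the
# example transcript path are assumptions; only the field names listed above
# are taken from the documentation. Requires PyYAML (pip install pyyaml).
import yaml

seed = {
    "behavior": "sycophancy",                                # key defined in behaviors/behaviors.json
    "examples": ["behaviors/examples/sycophancy_01.json"],   # optional few-shot transcripts (hypothetical path)
    "total_evals": 100,                                      # number of rollouts in the suite
    "rollout": {"target": "claude-sonnet-4"},                # model under evaluation
    "diversity": 0.7,                                        # more distinct scenarios vs. more variations per scenario
    "max_turns": 10,
    "modality": "conversation",
}

with open("seed.yaml", "w") as f:
    yaml.safe_dump(seed, f, sort_keys=False)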

Bloom uses LiteLLM as a backend for model API calls and can talk to Anthropic and OpenAI models through a single interface. It integrates with Weights and Biases for large sweeps and exports Inspect compatible transcripts.

Four stage agentic pipeline

Bloom’s evaluation process is organized into four agent stages that run in sequence:

Understanding agent: This agent reads the behavior description and example conversations. It builds a structured summary of what counts as a positive instance of the behavior and why this behavior matters. It attributes specific spans in the examples to successful behavior demonstrations so that later stages know what to look for.

Ideation agent: The ideation stage generates candidate evaluation scenarios. Each scenario describes a situation, the user persona, the tools that the target model can access and what a successful rollout looks like. Bloom batches scenario generation to use token budgets efficiently and uses the diversity parameter to trade off between more distinct scenarios and more variations per scenario.

Rollout agent: The rollout agent instantiates these scenarios with the target model. It can run multi turn conversations or simulated environments, and it records all messages and tool calls. Configuration parameters such as max_turns, modality and no_user_mode control how autonomous the target model is during this phase.

Judgment and meta judgment agents: A judge model scores each transcript for behavior presence on a numerical scale and can also rate additional qualities like realism or evaluator forcefulness. A meta judge then reads summaries of all rollouts and produces a suite level report that highlights the most important cases and patterns. The main metric is an elicitation rate, the share of rollouts that score at least 7 out of 10 for behavior presence.
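
As a quick illustration of the headline metric, the sketch below computes an elicitation rate from a list of judge scores using the at-least-7-out-of-10 threshold described above; the score values themselves are invented for the example.

# Illustrative only: elicitation rate = share of rollouts scoring >= 7/10
# for behavior presence. The scores below are made up for this example.
judge_scores = [2, 9, 7, 4, 10, 6, 8, 3, 7, 5]

elicitation_rate = sum(score >= 7 for score in judge_scores) / len(judge_scores)
print(f"Elicitation rate: {elicitation_rate:.2f}")  # 0.50 for this toy list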

Validation on frontier models

Anthropic used Bloom to build four alignment relevant evaluation suites, for delusional sycophancy, instructed long horizon sabotage, self preservation and self preferential bias. Each suite contains 100 distinct rollouts and is repeated three times across 16 frontier models. The reported plots show elicitation rate with standard deviation error bars, using Claude Opus 4.1 as the evaluator across all stages.

Bloom is also tested on intentionally misaligned ‘model organisms’ from earlier alignment work. Across 10 quirky behaviors, Bloom separates the organism from the baseline production model in 9 cases. In the remaining self promotion quirk, manual inspection shows that the baseline model exhibits similar behavior frequency, which explains the overlap in scores. A separate validation exercise compares human labels on 40 transcripts against 11 candidate judge models. Claude Opus 4.1 reaches a Spearman correlation of 0.86 with human scores, and Claude Sonnet 4.5 reaches 0.75, with especially strong agreement at high and low scores where thresholds matter.

https://alignment.anthropic.com/2025/bloom-auto-evals/

Relationship to Petri and Positioning

Anthropic positions Bloom as complementary to Petri. Petri is a broad coverage auditing tool that takes seed instructions describing many scenarios and behaviors, then uses automated agents to probe models through multi turn interactions and summarize diverse safety relevant dimensions. Bloom instead starts from one behavior definition and automates the engineering needed to turn that into a large, targeted evaluation suite with quantitative metrics like elicitation rate.

Key Takeaways

Bloom is an open source agentic framework that turns a single behavior specification into a complete behavioral evaluation suite for large models, using a four stage pipeline of understanding, ideation, rollout and judgment.

The system is driven by a seed configuration in seed.yaml and behaviors/behaviors.json, where researchers specify the target behavior, example transcripts, total evaluations, rollout model and controls such as diversity, max turns and modality.

Bloom relies on LiteLLM for unified access to Anthropic and OpenAI models, integrates with Weights and Biases for experiment tracking and exports Inspect compatible JSON plus an interactive viewer for inspecting transcripts and scores.

Anthropic validates Bloom on 4 alignment focused behaviors across 16 frontier models with 100 rollouts repeated 3 times, and on 10 model organism quirks, where Bloom separates intentionally misaligned organisms from baseline models in 9 cases and judge models match human labels with Spearman correlation up to 0.86.

Check out the Github Repo, Technical report and Blog. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter. Wait! Are you on Telegram? Now you can join us on Telegram as well.
The post Anthropic AI Releases Bloom: An Open-Source Agentic Framework for Automated Behavioral Evaluations of Frontier AI Models appeared first on MarkTechPost.

AI Interview Series #4: Explain KV Caching

Question:

You’re deploying an LLM in production. Generating the first few tokens is fast, but as the sequence grows, each additional token takes progressively longer to generate—even though the model architecture and hardware remain the same.

If compute isn’t the primary bottleneck, what inefficiency is causing this slowdown, and how would you redesign the inference process to make token generation significantly faster?

What is KV Caching and how does it make token generation faster?

KV caching is an optimization technique used during text generation in large language models to avoid redundant computation. In autoregressive generation, the model produces text one token at a time, and at each step it normally recomputes attention over all previous tokens. However, the keys (K) and values (V) computed for earlier tokens never change.

With KV caching, the model stores these keys and values the first time they are computed. When generating the next token, it reuses the cached K and V instead of recomputing them from scratch, and only computes the query (Q), key, and value for the new token. Attention is then calculated using the cached information plus the new token.

This reuse of past computations significantly reduces redundant work, making inference faster and more efficient—especially for long sequences—at the cost of additional memory to store the cache. Check out the Practice Notebook here
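
To make the mechanics concrete, here is a minimal, framework-agnostic sketch of a single decoding step with a KV cache (single attention head, toy dimensions). The tensor names and shapes are assumptions for illustration and do not correspond to any specific library API.

import torch

# One decoding step of single-head attention with a KV cache (illustrative only).
d = 64                                   # head dimension
k_cache = torch.randn(10, d)             # keys for the 10 tokens generated so far
v_cache = torch.randn(10, d)             # values for the 10 tokens generated so far

x_new = torch.randn(1, d)                # hidden state of the newest token
W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))

q_new = x_new @ W_q                                  # only the new token needs fresh Q, K, V
k_cache = torch.cat([k_cache, x_new @ W_k], dim=0)   # append the new key to the cache
v_cache = torch.cat([v_cache, x_new @ W_v], dim=0)   # append the new value to the cache

attn = torch.softmax(q_new @ k_cache.T / d**0.5, dim=-1)  # attend over all cached tokens
out = attn @ v_cache                                 # output for the new token; old K/V never recomputed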

Evaluating the Impact of KV Caching on Inference Speed

In this code, we benchmark the impact of KV caching during autoregressive text generation. We run the same prompt through the model multiple times, once with KV caching enabled and once without it, and measure the average generation time. By keeping the model, prompt, and generation length constant, this experiment isolates how reusing cached keys and values significantly reduces redundant attention computation and speeds up inference. Check out the Practice Notebook here

import numpy as np
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"

model_name = "gpt2-medium"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)

prompt = "Explain KV caching in transformers."

inputs = tokenizer(prompt, return_tensors="pt").to(device)

# Benchmark generation with and without the KV cache, averaged over 5 runs.
for use_cache in (True, False):
    times = []
    for _ in range(5):
        start = time.time()
        model.generate(
            **inputs,
            use_cache=use_cache,
            max_new_tokens=1000
        )
        times.append(time.time() - start)

    print(
        f"{'with' if use_cache else 'without'} KV caching: "
        f"{round(np.mean(times), 3)} ± {round(np.std(times), 3)} seconds"
    )

The results clearly demonstrate the impact of KV caching on inference speed. With KV caching enabled, generating 1000 tokens takes around 21.7 seconds, whereas disabling KV caching increases the generation time to over 107 seconds—nearly a 5× slowdown. This sharp difference occurs because, without KV caching, the model recomputes attention over all previously generated tokens at every step, leading to quadratic growth in computation. Check out the Practice Notebook here

With KV caching, past keys and values are reused, eliminating redundant work and keeping generation time nearly linear as the sequence grows. This experiment highlights why KV caching is essential for efficient, real-world deployment of autoregressive language models.
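
A rough back-of-the-envelope count makes the quadratic-versus-linear difference visible; this ignores constant factors and the attention score computation itself, and only tallies how many per-token key/value computations each approach needs.

# Illustrative only: per-token K/V computations needed to generate n tokens.
n = 1000

without_cache = sum(t for t in range(1, n + 1))  # step t recomputes K/V for all t tokens -> O(n^2)
with_cache = n                                   # each token's K/V is computed exactly once -> O(n)

print(without_cache)                 # 500500
print(with_cache)                    # 1000
print(without_cache / with_cache)    # ~500x more K/V computations without caching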

Check out the Practice Notebook here

AI Interview Series #3: Explain Federated Learning

The post AI Interview Series #4: Explain KV Caching appeared first on MarkTechPost.

NVIDIA AI Releases Nemotron 3: A Hybrid Mamba Transformer MoE Stack fo …

NVIDIA has released the Nemotron 3 family of open models as part of a full stack for agentic AI, including model weights, datasets and reinforcement learning tools. The family has three sizes, Nano, Super and Ultra, and targets multi agent systems that need long context reasoning with tight control over inference cost. Nano has about 30 billion parameters with about 3 billion active per token, Super has about 100 billion parameters with up to 10 billion active per token, and Ultra has about 500 billion parameters with up to 50 billion active per token.

https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Nano-Technical-Report.pdf

Model family and target workloads

Nemotron 3 is presented as an efficient open model family for agentic applications. The line consists of Nano, Super and Ultra models, each tuned for different workload profiles.

Nemotron 3 Nano is a Mixture of Experts hybrid Mamba Transformer language model with about 31.6 billion parameters. Only about 3.2 billion parameters are active per forward pass, or 3.6 billion including embeddings. This sparse activation allows the model to keep high representational capacity while keeping compute low.

Nemotron 3 Super has about 100 billion parameters with up to 10 billion active per token. Nemotron 3 Ultra scales this design to about 500 billion parameters with up to 50 billion active per token. Super targets high accuracy reasoning for large multi agent applications, while Ultra is intended for complex research and planning workflows.

Nemotron 3 Nano is available now with open weights and recipes, on Hugging Face and as an NVIDIA NIM microservice. Super and Ultra are scheduled for the first half of 2026.

NVIDIA Nemotron 3 Nano delivers about 4 times higher token throughput than Nemotron 2 Nano and reduces reasoning token usage significantly, while supporting a native context length of up to 1 million tokens. This combination is intended for multi agent systems that operate on large workspaces such as long documents and large code bases.

https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Nano-Technical-Report.pdf

Hybrid Mamba Transformer MoE architecture

The core design of Nemotron 3 is a Mixture of Experts hybrid Mamba Transformer architecture. The models mix Mamba sequence blocks, attention blocks and sparse expert blocks inside a single stack.

For Nemotron 3 Nano, the research team describes a pattern that interleaves Mamba 2 blocks, attention blocks and MoE blocks. Standard feedforward layers from earlier Nemotron generations are replaced by MoE layers. A learned router selects a small subset of experts per token, for example 6 out of 128 routable experts for Nano, which keeps the active parameter count close to 3.2 billion while the full model holds 31.6 billion parameters.

https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Nano-Technical-Report.pdf

Mamba 2 handles long range sequence modeling with state space style updates, attention layers provide direct token to token interactions for structure sensitive tasks, and MoE provides parameter scaling without proportional compute scaling. The important point is that most layers are either fast sequence or sparse expert computations, and full attention is used only where it matters most for reasoning.
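
The routing idea can be illustrated with a tiny, self-contained sketch: a learned router scores all experts for each token, keeps only the top-k (6 of 128 in Nano's case), and combines their outputs. The dimensions, expert definition, and normalization below are assumptions for illustration, not Nemotron 3's actual implementation.

import torch
import torch.nn as nn

# Toy top-k Mixture of Experts layer (illustrative only, not Nemotron 3 code).
d_model, n_experts, top_k = 64, 128, 6

router = nn.Linear(d_model, n_experts)                              # learned router
experts = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_experts))

def moe_forward(x):                                                 # x: (tokens, d_model)
    logits = router(x)                                              # (tokens, n_experts)
    weights, idx = torch.topk(logits.softmax(dim=-1), top_k, dim=-1)
    weights = weights / weights.sum(dim=-1, keepdim=True)           # renormalize over the chosen experts
    rows = []
    for t in range(x.shape[0]):                                     # only top_k experts run per token
        row = sum(w * experts[e](x[t])
                  for w, e in zip(weights[t], idx[t].tolist()))
        rows.append(row)
    return torch.stack(rows)

tokens = torch.randn(4, d_model)
print(moe_forward(tokens).shape)                                    # torch.Size([4, 64])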

For Nemotron 3 Super and Ultra, NVIDIA adds LatentMoE. Tokens are projected into a lower dimensional latent space, experts operate in that latent space, then outputs are projected back. This design allows several times more experts at similar communication and compute cost, which supports more specialization across tasks and languages.

Super and Ultra also include multi token prediction. Multiple output heads share a common trunk and predict several future tokens in a single pass. During training this improves optimization, and at inference it enables speculative decoding like execution with fewer full forward passes.
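
A minimal sketch of the multi-token prediction idea follows, assuming a shared trunk feeding several small output heads, one per future offset; the sizes and head count are illustrative and do not reflect Nemotron 3's actual configuration.

import torch
import torch.nn as nn

# Toy multi-token prediction: one shared trunk, several heads, each predicting
# a different future offset (t+1, t+2, t+3). Illustrative only.
d_model, vocab, n_heads = 64, 1000, 3

trunk = nn.Sequential(nn.Linear(d_model, d_model), nn.GELU())
heads = nn.ModuleList(nn.Linear(d_model, vocab) for _ in range(n_heads))

hidden = torch.randn(8, d_model)              # hidden states for 8 positions
shared = trunk(hidden)                        # one forward pass through the shared trunk
logits_per_offset = [head(shared) for head in heads]   # predictions for t+1, t+2, t+3

print([l.shape for l in logits_per_offset])   # 3 x torch.Size([8, 1000])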

Training data, precision format and context window

Nemotron 3 is trained on large scale text and code data. The research team reports pretraining on about 25 trillion tokens, with more than 3 trillion new unique tokens over the Nemotron 2 generation. Nemotron 3 Nano uses Nemotron Common Crawl v2.1, Nemotron CC Code and Nemotron Pretraining Code v2, plus specialized datasets for scientific and reasoning content.

Super and Ultra are trained mostly in NVFP4, a 4 bit floating point format optimized for NVIDIA accelerators. Matrix multiply operations run in NVFP4 while accumulations use higher precision. This reduces memory pressure and improves throughput while keeping accuracy close to standard formats.

All Nemotron 3 models support context windows up to 1 million tokens. The architecture and training pipeline are tuned for long horizon reasoning across this length, which is essential for multi agent environments that pass large traces and shared working memory between agents.

Key Takeaways

Nemotron 3 is a three tier open model family for agentic AI: Nemotron 3 comes in Nano, Super and Ultra variants. Nano has about 30 billion parameters with about 3 billion active per token, Super has about 100 billion parameters with up to 10 billion active per token, and Ultra has about 500 billion parameters with up to 50 billion active per token. The family targets multi agent applications that need efficient long context reasoning.

Hybrid Mamba Transformer MoE with 1 million token context: Nemotron 3 models use a hybrid Mamba 2 plus Transformer architecture with sparse Mixture of Experts and support a 1 million token context window. This design gives long context handling with high throughput, where only a small subset of experts is active per token and attention is used where it is most useful for reasoning.

Latent MoE and multi token prediction in Super and Ultra: The Super and Ultra variants add latent MoE where expert computation happens in a reduced latent space, which lowers communication cost and allows more experts, and multi token prediction heads that generate several future tokens per forward pass. These changes improve quality and enable speculative style speedups for long text and chain of thought workloads.

Large scale training data and NVFP4 precision for efficiency: Nemotron 3 is pretrained on about 25 trillion tokens, with more than 3 trillion new tokens over the previous generation, and Super and Ultra are trained mainly in NVFP4, a 4 bit floating point format for NVIDIA GPUs. This combination improves throughput and reduces memory use while keeping accuracy close to standard precision.

Check out the Paper, Technical blog and Model Weights on HF. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.
The post NVIDIA AI Releases Nemotron 3: A Hybrid Mamba Transformer MoE Stack for Long Context Agentic AI appeared first on MarkTechPost.

A Coding Guide to Design a Complete Agentic Workflow in Gemini for Aut …

In this tutorial, we show how to orchestrate a fully functional, tool-using medical prior-authorization agent powered by Gemini. We walk through each component step by step, from securely configuring the model to building realistic external tools and finally constructing an intelligent agent loop that reasons, acts, and responds entirely through structured JSON. As we progress, we see how the system thinks, retrieves evidence, and interacts with simulated medical systems to complete a complex workflow. Check out the FULL CODES here.

!pip install -q -U google-generativeai

import google.generativeai as genai
from google.colab import userdata
import os
import getpass
import json
import time

# Read the API key from Colab secrets, falling back to a manual prompt.
try:
    GOOGLE_API_KEY = userdata.get('GOOGLE_API_KEY')
except:
    print("Please enter your Google API Key:")
    GOOGLE_API_KEY = getpass.getpass("API Key: ")

genai.configure(api_key=GOOGLE_API_KEY)

print("\nScanning for available models...")
available_models = [m.name for m in genai.list_models()]
target_model = ""

# Prefer known fast models, then fall back to any model that supports generateContent.
if 'models/gemini-1.5-flash' in available_models:
    target_model = 'gemini-1.5-flash'
elif 'models/gemini-1.5-flash-001' in available_models:
    target_model = 'gemini-1.5-flash-001'
elif 'models/gemini-pro' in available_models:
    target_model = 'gemini-pro'
else:
    for m in available_models:
        if 'generateContent' in genai.get_model(m).supported_generation_methods:
            target_model = m
            break

if not target_model:
    raise ValueError("No text generation models found for this API key.")

print(f"Selected Model: {target_model}")
model = genai.GenerativeModel(target_model)

We set up our environment and automatically detect the best available Gemini model. We configure the API key securely and let the system choose the most capable model without hardcoding anything. This ensures that we start the tutorial with a clean, flexible, and reliable foundation. Check out the FULL CODES here.

class MedicalTools:
    def __init__(self):
        self.ehr_docs = [
            "Patient: John Doe | DOB: 1980-05-12",
            "Visit 2023-01-10: Diagnosed with Type 2 Diabetes. Prescribed Metformin.",
            "Visit 2023-04-15: Patient reports severe GI distress with Metformin. Discontinued.",
            "Visit 2023-04-20: BMI recorded at 32.5. A1C is 8.4%.",
            "Visit 2023-05-01: Doctor recommends starting Ozempic (Semaglutide)."
        ]

    def search_ehr(self, query):
        # Return every record that contains any keyword from the query.
        print(f"[Tool] Searching EHR for: '{query}'...")
        results = [doc for doc in self.ehr_docs if any(q.lower() in doc.lower() for q in query.split())]
        if not results:
            return "No records found."
        return "\n".join(results)

    def submit_prior_auth(self, drug_name, justification):
        # Approve only if the justification documents Metformin failure AND BMI > 30.
        print(f"[Tool] Submitting claim for {drug_name}...")
        justification_lower = justification.lower()
        if "metformin" in justification_lower and ("discontinued" in justification_lower or "intolerance" in justification_lower):
            if "bmi" in justification_lower and "32" in justification_lower:
                return "SUCCESS: Authorization Approved. Auth ID: #998877"
        return "DENIED: Policy requires proof of (1) Metformin failure and (2) BMI > 30."

We define the medical tools that our agent can use during the workflow. We simulate an EHR search and a prior-authorization submission system so the agent has real actions to perform. By doing this, we ground the agent’s reasoning in tool-enabled interactions rather than plain text generation. Check out the FULL CODES here.

class AgenticSystem:
    def __init__(self, model, tools):
        self.model = model
        self.tools = tools
        self.history = []
        self.max_steps = 6

        self.system_prompt = """
You are an expert Medical Prior Authorization Agent.
Your goal is to get approval for a medical procedure/drug.

You have access to these tools:
1. search_ehr(query)
2. submit_prior_auth(drug_name, justification)

RULES:
1. ALWAYS think before you act.
2. You MUST output your response in STRICT JSON format:
{
  "thought": "Your reasoning here",
  "action": "tool_name_or_finish",
  "action_input": "argument_string_or_dict"
}
3. Do not guess patient data. Use 'search_ehr'.
4. If you have the evidence, use 'submit_prior_auth'.
5. If the task is done, use action "finish".
"""

We initialize the agent and provide its full system prompt. We define the rules, the JSON response format, and the expectation that the agent must think before acting. This gives us a controlled, deterministic structure for building a safe and traceable agent loop. Check out the FULL CODES here.

    # These methods continue the AgenticSystem class defined in the previous cell.
    def execute_tool(self, action_name, action_input):
        if action_name == "search_ehr":
            return self.tools.search_ehr(action_input)
        elif action_name == "submit_prior_auth":
            if isinstance(action_input, str):
                return "Error: submit_prior_auth requires a dictionary."
            return self.tools.submit_prior_auth(**action_input)
        else:
            return "Error: Unknown tool."

    def run(self, objective):
        print(f"AGENT STARTING. Objective: {objective}\n" + "-" * 50)
        self.history.append(f"User: {objective}")

        for i in range(self.max_steps):
            print(f"\nSTEP {i+1}")
            prompt = self.system_prompt + "\n\nHistory:\n" + "\n".join(self.history) + "\n\nNext JSON:"

            try:
                response = self.model.generate_content(prompt)
                # Strip any markdown code fences before parsing the JSON decision.
                text_response = response.text.strip().replace("```json", "").replace("```", "")
                agent_decision = json.loads(text_response)
            except Exception as e:
                print(f"Error parsing AI response. Retrying... ({e})")
                continue

            print(f"THOUGHT: {agent_decision['thought']}")
            print(f"ACTION: {agent_decision['action']}")

            if agent_decision['action'] == "finish":
                print(f"\nTASK COMPLETED: {agent_decision['action_input']}")
                break

            tool_result = self.execute_tool(agent_decision['action'], agent_decision['action_input'])
            print(f"OBSERVATION: {tool_result}")

            self.history.append(f"Assistant: {text_response}")
            self.history.append(f"System: {tool_result}")

            if "SUCCESS" in str(tool_result):
                print("\nSUCCESS! The Agent successfully navigated the insurance portal.")
                break

We implement the core agent loop where reasoning, tool execution, and observations happen step by step. We watch the agent decide its next action, execute tools, update history, and evaluate success conditions. This is where the agent truly comes alive and performs iterative reasoning. Check out the FULL CODES here.

tools_instance = MedicalTools()
agent = AgenticSystem(model, tools_instance)
agent.run("Please get prior authorization for Ozempic for patient John Doe.")

We instantiate the tools and agent, then run the entire system end-to-end with a real objective. We see the full workflow unfold as the agent navigates through medical history, validates evidence, and attempts prior authorization. This final snippet demonstrates the complete pipeline working seamlessly.

In conclusion, we reflect on how this compact yet powerful framework enables us to design real-world agentic behaviors that go beyond simple text responses. We watch our agent plan, consult tools, gather evidence, and ultimately complete a structured insurance authorization task, entirely through autonomous reasoning. This gives us confidence that we can now expand the system with additional tools, stronger policies, domain-specific logic, or even multi-agent collaboration.

Check out the FULL CODES here. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.
The post A Coding Guide to Design a Complete Agentic Workflow in Gemini for Automated Medical Evidence Gathering and Prior Authorization Submission appeared first on MarkTechPost.