MiniMax Releases M2.1: An Enhanced M2 Version with Features like Multi-Coding Language Support, API Integration, and Improved Tools for Structured Coding

Just months after releasing M2—a fast, low-cost model designed for agents and code—MiniMax has introduced an enhanced version: MiniMax M2.1.

M2 already stood out for its efficiency, running at roughly 8% of the cost of Claude Sonnet while delivering significantly higher speed. More importantly, it introduced a different computational and reasoning pattern, particularly in how the model structures and executes its thinking during complex code and tool-driven workflows.

M2.1 builds on this foundation, bringing tangible improvements across key areas: better code quality, smarter instruction following, cleaner reasoning, and stronger performance across multiple programming languages. These upgrades extend the original strengths of M2 while staying true to MiniMax’s vision of “Intelligence with Everyone.”

Strengthening the core capabilities of M2, M2.1 is no longer just about better coding—it also produces clearer, more structured outputs across conversations, documentation, and writing.

Core Capabilities and Benchmark Results

Built for real-world coding and AI-native teams: Designed to support everything from rapid “vibe builds” to complex, production-grade workflows.

Goes beyond coding: Produces clearer, more structured, and higher-quality outputs across everyday conversations, technical documentation, and writing tasks.

State-of-the-art multilingual coding performance: Achieves 72.5% on SWE-Multilingual, outperforming Claude Sonnet 4.5 and Gemini 3 Pro across multiple programming languages.

Strong AppDev & WebDev capabilities: Scores 88.6% on VIBE-Bench, exceeding Claude Sonnet 4.5 and Gemini 3 Pro, with major improvements in native Android, iOS, and modern web development.

Excellent agent and tool compatibility: Delivers consistent and stable performance across leading coding tools and agent frameworks, including Claude Code, Droid (Factory AI), Cline, Kilo Code, Roo Code, BlackBox, and more.

Robust context management support: Works reliably with advanced context mechanisms such as Skill.md, Claude.md / agent.md / cursorrule, and Slash Commands, enabling scalable agent workflows.

Automatic caching, zero configuration: Built-in caching works out of the box to reduce latency, lower costs, and deliver a smoother overall experience.

Getting Started with MiniMax M2.1

To get started with MiniMax M2.1, you’ll need an API key from the MiniMax platform. You can generate one from the MiniMax user console.

Once issued, store the API key securely and avoid exposing it in code repositories or public environments.

Installing & Setting up the dependencies

MiniMax supports both the Anthropic and OpenAI API formats, making it easy to integrate MiniMax models into existing workflows with minimal configuration changes—whether you’re using Anthropic-style message APIs or OpenAI-compatible setups.

pip install anthropic

import os
from getpass import getpass

os.environ['ANTHROPIC_BASE_URL'] = 'https://api.minimax.io/anthropic'
os.environ['ANTHROPIC_API_KEY'] = getpass('Enter MiniMax API Key: ')

With just this minimal setup, you’re ready to start using the model.
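Since MiniMax also supports the OpenAI API format (as noted above), an equivalent setup would look like the sketch below. Treat it as a hedged example: the OpenAI-compatible base URL is not given in this article, so the placeholder must be replaced with the value from the MiniMax documentation, and the model name is assumed to match the Anthropic-format examples.

# Hedged sketch: using MiniMax through its OpenAI-compatible API format.
# The base URL below is a placeholder assumption; substitute the endpoint from the MiniMax docs.
import os
from getpass import getpass
from openai import OpenAI

client = OpenAI(
    base_url=os.environ.get("MINIMAX_OPENAI_BASE_URL", "<MINIMAX_OPENAI_COMPATIBLE_URL>"),
    api_key=getpass("Enter MiniMax API Key: "),
)

# Model name assumed to be the same as in the Anthropic-format examples in this article.
response = client.chat.completions.create(
    model="MiniMax-M2.1",
    messages=[{"role": "user", "content": "Hi, how are you?"}],
)
print(response.choices[0].message.content)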

Sending Requests to the Model

MiniMax M2.1 returns structured outputs that separate internal reasoning (thinking) from the final response (text). This allows you to observe how the model interprets intent and plans its answer before producing the user-facing output.

import anthropic

client = anthropic.Anthropic()

message = client.messages.create(
    model="MiniMax-M2.1",
    max_tokens=1000,
    system="You are a helpful assistant.",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Hi, how are you?"
                }
            ]
        }
    ]
)

for block in message.content:
    if block.type == "thinking":
        print(f"Thinking:\n{block.thinking}\n")
    elif block.type == "text":
        print(f"Text:\n{block.text}\n")

Thinking:
The user is just asking how I am doing. This is a friendly greeting, so I should respond in a warm, conversational way. I’ll keep it simple and friendly.

Text:
Hi! I’m doing well, thanks for asking!

I’m ready to help you with whatever you need today. Whether it’s coding, answering questions, brainstorming ideas, or just chatting, I’m here for you.

What can I help you with?

What makes MiniMax stand out is the visibility into its reasoning process. Before producing the final response, the model explicitly reasons about the user’s intent, tone, and expected style—ensuring the answer is appropriate and context-aware. 

By cleanly separating reasoning from responses, the model becomes easier to interpret, debug, and trust, especially in complex agent-based or multi-step workflows. With M2.1, this clarity is paired with faster responses, more concise reasoning, and substantially reduced token consumption compared to M2.

Testing the Model’s Coding Capabilities

MiniMax M2 stands out for its native mastery of Interleaved Thinking, which lets it dynamically plan and adapt within complex coding and tool-based workflows. M2.1 extends this capability with improved code quality, more precise instruction following, clearer reasoning, and stronger performance across programming languages, particularly in handling composite instruction constraints as measured by OctoCodingBench, which also makes it well suited for office automation.

To evaluate these capabilities in practice, let’s test the model using a structured coding prompt that includes multiple constraints and real-world engineering requirements.

import anthropic

client = anthropic.Anthropic()

def run_test(prompt: str, title: str):
    print(f"\n{'='*80}")
    print(f"TEST: {title}")
    print(f"{'='*80}\n")

    message = client.messages.create(
        model="MiniMax-M2.1",
        max_tokens=10000,
        system=(
            "You are a senior software engineer. "
            "Write production-quality code with clear structure, "
            "explicit assumptions, and minimal but sufficient reasoning. "
            "Avoid unnecessary verbosity."
        ),
        messages=[
            {
                "role": "user",
                "content": [{"type": "text", "text": prompt}]
            }
        ]
    )

    for block in message.content:
        if block.type == "thinking":
            print("Thinking:\n", block.thinking, "\n")
        elif block.type == "text":
            print("Output:\n", block.text, "\n")

PROMPT = """
Design a small Python service that processes user events.

Requirements:
1. Events arrive as dictionaries with keys: user_id, event_type, timestamp.
2. Validate input strictly (types + required keys).
3. Aggregate events per user in memory.
4. Expose two functions:
   - ingest_event(event: dict) -> None
   - get_user_summary(user_id: str) -> dict
5. Code must be:
   - Testable
   - Thread-safe
   - Easily extensible for new event types
6. Do NOT use external libraries.

Provide:
- Code only
- Brief inline comments where needed
"""

run_test(prompt=PROMPT, title="Instruction Following + Architecture")

This test uses a deliberately structured and constraint-heavy prompt designed to evaluate more than just code generation. The prompt requires strict input validation, in-memory state management, thread safety, testability, and extensibility—all without relying on external libraries.

By combining architectural decisions with multiple non-trivial constraints, the prompt operates at a medium-to-high complexity level, making it well-suited for assessing how effectively MiniMax M2.1 follows instructions, reasons through design trade-offs, and produces production-quality code rather than isolated snippets.

Model Reasoning & Output

The model reasons through key architectural trade-offs before coding, carefully balancing flexibility, memory usage, and extensibility. It evaluates multiple approaches to event aggregation and deliberately chooses raw event storage to enable future extensions without modifying ingestion logic. 

Thread safety is explicitly handled through locking, and strict input validation is planned to ensure data correctness, reflecting a real-world, engineering-first mindset rather than jumping straight to implementation.

This thoughtful reasoning is reflected in the final output, which demonstrates strong code quality through clear structure, meaningful naming, type hints, and a thread-safe design aligned with production standards. The solution follows all prompt constraints accurately—covering validation, in-memory aggregation, extensibility, and the absence of external dependencies—while maintaining clean, focused logic that avoids unnecessary complexity and remains easy to maintain.

import threading
from typing import Dict, List, Any

class EventProcessor:
    """
    Thread-safe event processor that aggregates user events in memory.
    Validates input strictly and supports easy extension for new event types.
    """

    def __init__(self) -> None:
        # Stores events per user: user_id -> list of event dictionaries
        self._user_events: Dict[str, List[Dict[str, Any]]] = {}
        # Lock for thread-safe access
        self._lock = threading.Lock()

    def ingest_event(self, event: dict) -> None:
        """
        Validate and ingest a single event.
        Strictly validates types and presence of required keys.
        """
        # Validate event is a dictionary
        if not isinstance(event, dict):
            raise ValueError("Event must be a dictionary")

        # Validate required keys and their types
        required_keys = {
            'user_id': str,
            'event_type': str,
            'timestamp': (str, int, float)  # Accept string or numeric timestamp
        }

        for key, expected_type in required_keys.items():
            if key not in event:
                raise ValueError(f"Missing required key: '{key}'")
            if not isinstance(event[key], expected_type):
                raise ValueError(f"Key '{key}' must be of type {expected_type}")

        # Thread-safe event storage
        with self._lock:
            user_id = event['user_id']
            if user_id not in self._user_events:
                self._user_events[user_id] = []
            self._user_events[user_id].append(event)

    def get_user_summary(self, user_id: str) -> dict:
        """
        Generate summary for a specific user.
        Returns aggregated event counts and timestamps.
        """
        if not isinstance(user_id, str):
            raise ValueError("user_id must be a string")

        with self._lock:
            user_events = self._user_events.get(user_id, [])

            # Aggregate event counts
            event_counts = {}
            timestamps = []

            for event in user_events:
                event_type = event['event_type']
                event_counts[event_type] = event_counts.get(event_type, 0) + 1
                timestamps.append(event['timestamp'])

            return {
                'user_id': user_id,
                'total_events': len(user_events),
                'event_counts': event_counts,
                'timestamps': sorted(timestamps) if timestamps else []
            }

**Key Design Choices:**

1. **Thread Safety**: Uses `threading.Lock` to protect shared state during concurrent access
2. **Input Validation**: Strict type checking for required keys with clear error messages
3. **Extensibility**:
– New event types automatically handled by dynamic counting
– Easy to add new aggregations in `get_user_summary`
4. **Testability**: Clear separation of concerns, easy to mock for unit tests
5. **Memory Efficiency**: Stores only essential data (event dictionaries)

**Usage Example:**

processor = EventProcessor()

# Ingest events
processor.ingest_event({
    'user_id': 'user123',
    'event_type': 'login',
    'timestamp': '2023-01-01T10:00:00Z'
})

# Get user summary
summary = processor.get_user_summary('user123')
print(summary)

Model’s Interleaved Thinking in Action

Let’s now see MiniMax M2.1’s interleaved thinking in action. We ask the model to compare two organizations based on P/E ratio and sentiment, using two dummy tools to clearly observe how the workflow operates. 

This example demonstrates how M2.1 interacts with external tools in a controlled, agent-style setup. One tool simulates fetching stock metrics, while the other provides sentiment analysis, with both returning locally generated responses. As the model receives these tool outputs, it incorporates them into its reasoning and adjusts its final comparison accordingly.

Defining the tools

import anthropic
import json

client = anthropic.Anthropic()

def get_stock_metrics(ticker):
    data = {
        "NVDA": {"price": 130, "pe": 75.2},
        "AMD": {"price": 150, "pe": 40.5}
    }
    return json.dumps(data.get(ticker, "Ticker not found"))

def get_sentiment_analysis(company_name):
    sentiments = {"NVIDIA": 0.85, "AMD": 0.42}
    return f"Sentiment score for {company_name}: {sentiments.get(company_name, 0.0)}"

tools = [
    {
        "name": "get_stock_metrics",
        "description": "Get price and P/E ratio.",
        "input_schema": {
            "type": "object",
            "properties": {"ticker": {"type": "string"}},
            "required": ["ticker"]
        }
    },
    {
        "name": "get_sentiment_analysis",
        "description": "Get news sentiment score.",
        "input_schema": {
            "type": "object",
            "properties": {"company_name": {"type": "string"}},
            "required": ["company_name"]
        }
    }
]

Model Execution with Tool Interaction

messages = [{"role": "user", "content": "Compare NVDA and AMD value based on P/E and sentiment."}]
running = True

print(f"[USER]: {messages[0]['content']}")

while running:
    # Get model response
    response = client.messages.create(
        model="MiniMax-M2.1",
        max_tokens=4096,
        messages=messages,
        tools=tools,
    )

    messages.append({"role": "assistant", "content": response.content})

    tool_results = []
    has_tool_use = False

    for block in response.content:
        if block.type == "thinking":
            print(f"\n[THINKING]:\n{block.thinking}")

        elif block.type == "text":
            print(f"\n[MODEL]: {block.text}")
            if not any(b.type == "tool_use" for b in response.content):
                running = False

        elif block.type == "tool_use":
            has_tool_use = True
            print(f"[TOOL CALL]: {block.name}({block.input})")

            # Execute the correct mock function
            if block.name == "get_stock_metrics":
                result = get_stock_metrics(block.input['ticker'])
            elif block.name == "get_sentiment_analysis":
                result = get_sentiment_analysis(block.input['company_name'])

            # Add to the results list for this turn
            tool_results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": result
            })

    if has_tool_use:
        messages.append({"role": "user", "content": tool_results})
    else:
        running = False

print("\nConversation Complete.")

During execution, the model decides when and which tool to call, receives the corresponding tool results, and then updates its reasoning and final response based on that data. This showcases M2.1’s ability to interleave reasoning, tool usage, and response generation—adapting its output dynamically as new information becomes available.

Comparison with OpenAI’s GPT-5.2

Finally, we compare MiniMax M2.1 with GPT-5.2 using a compact multilingual instruction-following prompt. The task requires the model to identify coffee-related terms from a Spanish passage, translate only those terms into English, remove duplicates, and return the result in a strictly formatted numbered list.

To run this code block, you’ll need an OpenAI API key, which can be generated from the OpenAI developer dashboard.

import os
from getpass import getpass

os.environ['OPENAI_API_KEY'] = getpass('Enter OpenAI API Key: ')

input_text = """
¡Preparar café Cold Brew es un proceso sencillo y refrescante!
Todo lo que necesitas son granos de café molido grueso y agua fría.
Comienza añadiendo el café molido a un recipiente o jarra grande.
Luego, vierte agua fría, asegurándote de que todos los granos de café
estén completamente sumergidos.
Remueve la mezcla suavemente para garantizar una saturación uniforme.
Cubre el recipiente y déjalo en remojo en el refrigerador durante al
menos 12 a 24 horas, dependiendo de la fuerza deseada.
"""

prompt = f"""
The following text is written in Spanish.

Task:
1. Identify all words in the text that are related to coffee or coffee preparation.
2. Translate ONLY those words into English.
3. Remove duplicates (each word should appear only once).
4. Present the result as a numbered list.

Rules:
- Do NOT include explanations.
- Do NOT include non-coffee-related words.
- Do NOT include Spanish words in the final output.

Text:
<{input_text}>
"""

from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="gpt-5.2",
    input=prompt
)

print(response.output_text)

import anthropic

client = anthropic.Anthropic()

message = client.messages.create(
    model="MiniMax-M2.1",
    max_tokens=10000,
    system="You are a helpful assistant.",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": prompt
                }
            ]
        }
    ]
)

for block in message.content:
    if block.type == "thinking":
        print(f"Thinking:\n{block.thinking}\n")
    elif block.type == "text":
        print(f"Text:\n{block.text}\n")

When comparing the outputs, MiniMax M2.1 produces a noticeably broader and more granular set of coffee-related terms than GPT-5.2. M2.1 identifies not only core nouns like coffee, beans, and water, but also preparation actions (pour, stir, cover), process-related states (submerged, soak), and contextual attributes (cold, coarse, strength, hours). 

This indicates a deeper semantic pass over the text, where the model reasons through the entire preparation workflow rather than extracting only the most obvious keywords.

This difference is also reflected in the reasoning process. M2.1 explicitly analyzes context, resolves edge cases (such as borrowed English terms like Cold Brew), considers duplicates, and deliberates on whether certain adjectives or verbs qualify as coffee-related before finalizing the list. GPT-5.2, by contrast, delivers a shorter and more conservative output focused on high-confidence terms, with less visible reasoning depth. 

Together, this highlights M2.1’s stronger instruction adherence and semantic coverage, especially for tasks that require careful filtering, translation, and strict output control.

The post MiniMax Releases M2.1: An Enhanced M2 Version with Features like Multi-Coding Language Support, API Integration, and Improved Tools for Structured Coding appeared first on MarkTechPost.

This AI Paper from Stanford and Harvard Explains Why Most 'Agentic AI' Systems Feel Impressive in Demos and then Completely Fall Apart in Real Use

Agentic AI systems sit on top of large language models and connect to tools, memory, and external environments. They already support scientific discovery, software development, and clinical research, yet they still struggle with unreliable tool use, weak long horizon planning, and poor generalization. The latest research paper 'Adaptation of Agentic AI' from Stanford, Harvard, UC Berkeley, and Caltech proposes a unified view of how these systems should adapt and maps existing methods into a compact, mathematically defined framework.

How the research paper models an agentic AI system

The research survey models an agentic AI system as a foundation model agent along with 3 key components. A planning module decomposes goals into sequences of actions, using static procedures such as Chain-of-Thought and Tree-of-Thought, or dynamic procedures such as ReAct and Reflexion that react to feedback. A tool use module connects the agent to web search engines, APIs, code execution environments, Model Context Protocols, and browser automation. A memory module stores short term context and long term knowledge, accessed through retrieval augmented generation. Adaptation changes prompts or parameters for these components using supervised fine tuning, preference based methods such as Direct Preference Optimization, reinforcement learning methods such as Proximal Policy Optimization and Group Relative Policy Optimization, and parameter efficient techniques such as low rank adaptation.

https://arxiv.org/pdf/2512.16301

Four adaptation paradigms

The framework defines 4 adaptation paradigms by combining 2 binary choices. The first dimension is the target, agent adaptation versus tool adaptation. The second dimension is the supervision signal, tool execution versus agent output. This yields A1 and A2 for adapting the agent, and T1 and T2 for adapting tools.

A1, Tool Execution Signaled Agent Adaptation, optimizes the agent using feedback derived from tool execution. A2, Agent Output Signaled Agent Adaptation, optimizes the agent using a signal defined only on its final outputs. T1, Agent-Agnostic Tool Adaptation, optimizes tools without referring to a particular agent. T2, Agent-Supervised Tool Adaptation, optimizes tools under supervision from a fixed agent.

https://arxiv.org/pdf/2512.16301

A1, learning from verifiable tool feedback

In A1, the agent receives an input x, produces a structured tool call a, the tools return a result y, and the learning objective O_tool measures tool success, for example execution correctness or retrieval quality. The paper covers both supervised imitation of successful tool trajectories and reinforcement learning that uses verifiable tool outcomes as reward.

Toolformer, ToolAlpaca, and Gorilla illustrate supervised A1 methods, since each uses execution results of real tools to construct or filter training traces before imitation. All of them keep the supervision signal defined at the tool behavior level, not at the final answer level.

DeepRetrieval is a central A1 reinforcement learning example. It frames query reformulation as a Markov decision process where the state is the user query, the action is a rewritten query, and the reward combines retrieval metrics such as Recall and nDCG, a format term, and, for text to SQL, SQL execution accuracy. The policy is trained with KL regularized Proximal Policy Optimization and the same objective covers literature search, corpus question answering, and text to SQL.
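To make the A1 signal concrete, here is a toy sketch of a DeepRetrieval-style reward that combines a verifiable retrieval metric with a format term. The helper names and weights are hypothetical illustrations for this article, not the paper's actual implementation.

# Toy illustration of an A1-style reward (hypothetical weights and helpers, not the paper's code).
def a1_reward(retrieved_docs, relevant_docs, query_is_well_formed, w_recall=1.0, w_format=0.2):
    # Verifiable tool outcome: recall of the relevant documents among the retrieved ones.
    recall = len(set(retrieved_docs) & set(relevant_docs)) / max(len(relevant_docs), 1)
    # Format term: small bonus if the rewritten query respects the expected output format.
    format_bonus = 1.0 if query_is_well_formed else 0.0
    return w_recall * recall + w_format * format_bonus

# Example: 2 of 3 relevant documents retrieved and a well-formed query -> reward of about 0.87.
print(a1_reward(["d1", "d2", "d5"], ["d1", "d2", "d3"], True))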

A2, learning from final agent outputs

A2 covers cases where the optimization objective O_agent depends only on the final output o produced by the agent, even when the agent uses tools internally. The survey shows that supervising only o is not enough to teach tools, because the agent can ignore tools and still improve likelihood. Effective A2 systems therefore combine supervision on tool calls with supervision on final answers, or assign sparse rewards such as exact match accuracy to o and propagate them back through the full trajectory.

T1, agent agnostic tool training

T1 freezes the main agent and optimizes tools so that they are broadly reusable. The objective O_tool depends only on tool outputs and is measured by metrics such as retrieval accuracy, ranking quality, simulation fidelity, or downstream task success. A1 trained search policies, such as DeepRetrieval, can later be reused as T1 tools inside new agentic systems without modifying the main agent.

T2, tools optimized under a frozen agent

T2 assumes a powerful but fixed agent A, which is common when the agent is a closed source foundation model. The tool executes calls and returns results that the agent then uses to produce o. The optimization objective again lives on O_agent, but the trainable parameters belong to the tool. The paper describes quality weighted training, target based training, and reinforcement learning variants that all derive learning signals for the tool from the final agent outputs.

The survey treats long term memory as a special case of T2. Memory is an external store written and read through learned functions, and the agent remains frozen. Recent T2 systems include s3, which trains a 7 billion parameter searcher that maximizes a Gain Beyond RAG reward defined by a frozen generator, and AgentFlow, which trains a planner to orchestrate mostly frozen Qwen2.5 based modules using Flow GRPO.

https://arxiv.org/pdf/2512.16301

Key Takeaways

The research defines a precise 4 paradigm framework for adapting agentic AI by crossing 2 dimensions, whether adaptation targets the agent or tools, and whether the supervision signal comes from tool execution or from final agent outputs.

A1 methods such as Toolformer, ToolAlpaca, Gorilla, and DeepRetrieval adapt the agent directly from verifiable tool feedback, including retrieval metrics, SQL execution accuracy, and code execution results, often optimized with KL regularized Proximal Policy Optimization.

A2 methods optimize the agent from signals on final outputs, for example answer accuracy, and the paper shows that systems must still supervise tool calls or propagate sparse rewards through full trajectories, otherwise the agent can ignore tools while still improving likelihood.

T1 and T2 shift learning to tools and memory, T1 trains generally useful retrievers, searchers, and simulators without a specific agent in mind, while T2 adapts tools under a frozen agent, as in s3 and AgentFlow where a fixed generator supervises a learned searcher and planner.

The research team introduces an adaptation landscape that relates monolithic versus modular and local versus systemic control, and argues that practical systems will combine rare A1 or A2 updates on a strong base model with frequent T1 and T2 adaptation of retrievers, search policies, simulators, and memory for robustness and scalability.

Check out the Paper and GitHub Repo.
The post This AI Paper from Stanford and Harvard Explains Why Most ‘Agentic AI’ Systems Feel Impressive in Demos and then Completely Fall Apart in Real Use appeared first on MarkTechPost.

Google Health AI Releases MedASR: a Conformer Based Medical Speech to Text Model for Clinical Dictation

The Google Health AI team has released MedASR, an open weights medical speech to text model that targets clinical dictation and physician patient conversations and is designed to plug directly into modern AI workflows.

What MedASR is and where it fits

MedASR is a speech to text model based on the Conformer architecture and is pre trained for medical dictation and transcription. It is positioned as a starting point for developers who want to build healthcare based voice applications such as radiology dictation tools or visit note capture systems.

The model has 105 million parameters and accepts mono channel audio at 16000 hertz with 16 bit integer waveforms. It produces text only output, so it drops directly into downstream natural language processing or generative models such as MedGemma.

MedASR sits inside the Health AI Developer Foundations portfolio, alongside MedGemma, MedSigLIP and other domain specific medical models that share common terms of use and a consistent governance story.

Training data and domain specialization

MedASR is trained on a diverse corpus of de identified medical speech. The dataset includes about 5000 hours of physician dictations and clinical conversations across radiology, internal medicine and family medicine.

The training pairs audio segments with transcripts and metadata. Subsets of the conversational data are annotated with medical named entities including symptoms, medications and conditions. This gives the model strong coverage of clinical vocabulary and phrasing patterns that appear in routine documentation.

The model is English only, and most training audio comes from speakers for whom English is a first language and who were raised in the United States. The documentation notes that performance may be lower for other speaker profiles or noisy microphones and recommends fine tuning for such settings.

Architecture and decoding

MedASR follows the Conformer encoder design. Conformer combines convolution blocks with self attention layers so it can capture local acoustic patterns and longer range temporal dependencies in the same stack.

The model is exposed as an automatic speech recognition model with a CTC style interface. In the reference implementation, developers use AutoProcessor to create input features from waveform audio and AutoModelForCTC to produce token sequences. Decoding is greedy by default. The model can also be paired with an external six gram language model and beam search of size 8 to improve word error rate.

MedASR training uses JAX and ML Pathways on TPUv4p, TPUv5p and TPUv5e hardware. These systems provide the scale needed for large speech models and align with Google’s broader foundation model training stack.

Performance on medical speech tasks

Key results, with greedy decoding and with a six gram language model, are:

RAD DICT, radiologist dictation: MedASR greedy 6.6 percent, MedASR plus language model 4.6 percent, Gemini 2.5 Pro 10.0 percent, Gemini 2.5 Flash 24.4 percent, Whisper v3 Large 25.3 percent.

GENERAL DICT, general and internal medicine: MedASR greedy 9.3 percent, MedASR plus language model 6.9 percent, Gemini 2.5 Pro 16.4 percent, Gemini 2.5 Flash 27.1 percent, Whisper v3 Large 33.1 percent.

FM DICT, family medicine: MedASR greedy 8.1 percent, MedASR plus language model 5.8 percent, Gemini 2.5 Pro 14.6 percent, Gemini 2.5 Flash 19.9 percent, Whisper v3 Large 32.5 percent.

Eye Gaze, dictation on 998 MIMIC chest X ray cases: MedASR greedy 6.6 percent, MedASR plus language model 5.2 percent, Gemini 2.5 Pro 5.9 percent, Gemini 2.5 Flash 9.3 percent, Whisper v3 Large 12.5 percent.

Developer workflow and deployment options

A minimal pipeline example is:

from transformers import pipeline
import huggingface_hub

audio = huggingface_hub.hf_hub_download(“google/medasr”, “test_audio.wav”)
pipe = pipeline(“automatic-speech-recognition”, model=”google/medasr”)
result = pipe(audio, chunk_length_s=20, stride_length_s=2)
print(result)

For more control, developers load AutoProcessor and AutoModelForCTC, resample audio to 16000 hertz with librosa, move tensors to CUDA if available and call model.generate followed by processor.batch_decode.
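The snippet below sketches that lower-level flow using the generic Transformers CTC interface (AutoProcessor plus AutoModelForCTC with argmax decoding). It is a hedged illustration: the argmax pattern is the common CTC decoding path, and details such as the exact decoding call may differ from the MedASR reference implementation, so check the model card before relying on it.

# Hedged sketch of the lower-level flow with the generic Transformers CTC interface.
# The argmax decoding shown here is the common CTC pattern; the MedASR reference
# implementation may expose a slightly different call, so verify against the model card.
import torch
import librosa
from transformers import AutoProcessor, AutoModelForCTC

model_id = "google/medasr"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForCTC.from_pretrained(model_id)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# Resample the recording to the 16 kHz mono input the model expects.
waveform, _ = librosa.load("dictation.wav", sr=16000, mono=True)
inputs = processor(waveform, sampling_rate=16000, return_tensors="pt").to(device)

with torch.no_grad():
    logits = model(**inputs).logits

# Greedy (argmax) CTC decoding to text.
predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids)[0])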

Key Takeaways

MedASR is a lightweight, open weights Conformer based medical ASR model: It has 105M parameters, is trained specifically for medical dictation and transcription, and is released under the Health AI Developer Foundations program as an English only model for healthcare developers.

Domain specific training on about 5000 hours of de identified medical audio: MedASR is pre trained on physician dictations and clinical conversations across specialties like radiology, internal medicine and family medicine, which gives it strong coverage of clinical terminology compared to general purpose ASR systems.

Competitive or better word error rates on medical dictation benchmarks: On internal radiology, general medicine, family medicine and Eye Gaze datasets, MedASR with greedy or language model decoding matches or outperforms large general models such as Gemini 2.5 Pro, Gemini 2.5 Flash and Whisper v3 Large on word error rate for English medical speech.

Check out the Repo, Model on HF and Technical details.
The post Google Health AI Releases MedASR: a Conformer Based Medical Speech to Text Model for Clinical Dictation appeared first on MarkTechPost.

InstaDeep Introduces Nucleotide Transformer v3 (NTv3): A New Multi-Species Genomics Foundation Model, Designed for 1 Mb Context Lengths at Single-Nucleotide Resolution

Genomic prediction and design now require models that connect local motifs with megabase scale regulatory context and that operate across many organisms. Nucleotide Transformer v3, or NTv3, is InstaDeep’s new multi species genomics foundation model for this setting. It unifies representation learning, functional track and genome annotation prediction, and controllable sequence generation in a single backbone that runs on 1 Mb contexts at single nucleotide resolution.

Earlier Nucleotide Transformer models already showed that self supervised pretraining on thousands of genomes yields strong features for molecular phenotype prediction. The original series included models from 50M to 2.5B parameters trained on 3,200 human genomes and 850 additional genomes from diverse species. NTv3 keeps this sequence only pretraining idea but extends it to longer contexts and adds explicit functional supervision and a generative mode.

https://huggingface.co/spaces/InstaDeepAI/ntv3

Architecture for 1 Mb genomic windows

NTv3 uses a U-Net style architecture that targets very long genomic windows. A convolutional downsampling tower compresses the input sequence, a transformer stack models long range dependencies in that compressed space, and a deconvolution tower restores base level resolution for prediction and generation. Inputs are tokenized at the character level over A, T, C, G, N with special tokens such as <unk>, <pad>, <mask>, <cls>, <eos>, and <bos>. Sequence length must be a multiple of 128 tokens, and the reference implementation uses padding to enforce this constraint. All public checkpoints use single base tokenization with a vocabulary size of 11 tokens.
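As a toy illustration of that input constraint (not the InstaDeep tokenizer itself), the snippet below maps a DNA string to single-base token IDs and pads it to a multiple of 128; the vocabulary ordering and pad ID are assumptions made for the example.

# Toy illustration of NTv3-style single-base tokenization and padding to a multiple of 128.
# The vocabulary ordering and pad ID here are hypothetical, not the released tokenizer's layout.
VOCAB = {"<pad>": 0, "<unk>": 1, "<mask>": 2, "<cls>": 3, "<eos>": 4, "<bos>": 5,
         "A": 6, "T": 7, "C": 8, "G": 9, "N": 10}  # 11 tokens, as described above

def tokenize_dna(sequence: str, block: int = 128) -> list[int]:
    ids = [VOCAB.get(base.upper(), VOCAB["<unk>"]) for base in sequence]
    # Pad so the total length is a multiple of the 128-token block size.
    remainder = len(ids) % block
    if remainder:
        ids.extend([VOCAB["<pad>"]] * (block - remainder))
    return ids

tokens = tokenize_dna("ATGCGTANNCCG")
print(len(tokens))  # 128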

The smallest public model, NTv3 8M pre, has about 7.69M parameters with hidden dimension 256, FFN dimension 1,024, 2 transformer layers, 8 attention heads, and 7 downsample stages. At the high end, NTv3 650M uses hidden dimension 1,536, FFN dimension 6,144, 12 transformer layers, 24 attention heads, and 7 downsample stages, and adds conditioning layers for species specific prediction heads.

Training data

The NTv3 model is pretrained on 9 trillion base pairs from the OpenGenome2 resource using base resolution masked language modeling. After this stage, the model is post trained with a joint objective that integrates continued self supervision with supervised learning on approximately 16,000 functional tracks and annotation labels from 24 animal and plant species.

Performance and Ntv3 Benchmark

After post training NTv3 achieves state of the art accuracy for functional track prediction and genome annotation across species. It outperforms strong sequence to function models and previous genomic foundation models on existing public benchmarks and on the new Ntv3 Benchmark, which is defined as a controlled downstream fine tuning suite with standardized 32 kb input windows and base resolution outputs.

The Ntv3 Benchmark currently consists of 106 long range, single nucleotide, cross assay, cross species tasks. Because NTv3 sees thousands of tracks across 24 species during post training, the model learns a shared regulatory grammar that transfers between organisms and assays and supports coherent long range genome to function inference.

From prediction to controllable sequence generation

Beyond prediction, NTv3 can be fine tuned into a controllable generative model via masked diffusion language modeling. In this mode the model receives conditioning signals that encode desired enhancer activity levels and promoter selectivity, and it fills masked spans in the DNA sequence in a way that is consistent with those conditions.

In experiments described in the launch materials, the team designs 1,000 enhancer sequences with specified activity and promoter specificity and validates them in vitro using STARR seq assays in collaboration with the Stark Lab. The results show that these generated enhancers recover the intended ordering of activity levels and reach more than 2 times improved promoter specificity compared with baselines.

Comparison Table

NTv3 (Nucleotide Transformer v3) versus GENA-LM, by dimension:

Primary goal
NTv3: Unified multi species genomics foundation model for representation learning, sequence to function prediction and controllable sequence generation.
GENA-LM: Family of DNA language models for long sequences focused on transfer learning for many supervised genomic prediction tasks.

Architecture
NTv3: U-Net style convolutional tower, transformer stack, deconvolutional tower, single base resolution language model; post trained versions add multi species conditioning and task specific heads.
GENA-LM: BERT based encoder models with 12 or 24 layers and BigBird variants with sparse attention, extended further with a recurrent memory transformer for long contexts.

Parameter scale
NTv3: Family spans 8M, 100M and 650M parameters.
GENA-LM: Base models have 110M parameters and large models have 336M parameters, including BigBird variants at 110M.

Native context length
NTv3: Up to 1 Mb input at single nucleotide resolution for both pre trained and post trained models.
GENA-LM: Up to about 4500 bp with 512 BPE tokens for BERT models and up to 36000 bp with 4096 tokens for BigBird models.

Extended context mechanism
NTv3: Uses a U-Net style convolutional tower to aggregate long range context before transformer layers while keeping single base resolution; context length is fixed at 1 Mb in the released checkpoints.
GENA-LM: Uses sparse attention in BigBird variants plus a recurrent memory transformer to extend effective context to hundreds of thousands of base pairs.

Tokenization
NTv3: Character level tokenizer over A, T, C, G, N and special tokens; each nucleotide is a token.
GENA-LM: BPE tokenizer on DNA that maps to about 4500 bp for 512 tokens; two tokenizers are used, one on T2T only and one on T2T plus 1000G SNPs plus multispecies data.

Pretraining corpus size
NTv3: First stage pre training on OpenGenome2 with about 9 trillion base pairs from more than 128000 species.
GENA-LM: Human only models trained on pre processed human T2T v2 plus 1000 Genomes SNPs, about 480 × 10^9 base pairs; multispecies models trained on combined human and multispecies data, about 1072 × 10^9 base pairs.

Species coverage
NTv3: More than 128000 species in OpenGenome2 pretraining and post training supervision from 24 animal and plant species.
GENA-LM: Human focused models plus taxon specific models for yeast, Arabidopsis and Drosophila, and multispecies models from ENSEMBL genomes.

Supervised post training signals
NTv3: About 16000 functional tracks across about 10 assay types and about 2700 tissues in 24 species, used to condition the backbone with discrete labels and to train functional heads.
GENA-LM: Fine tuned on multiple supervised tasks, including promoters, splice sites, Drosophila enhancers, chromatin profiles and polyadenylation sites, with task specific heads on top of the LM.

Generative capabilities
NTv3: Can be fine tuned into a controllable generative model using masked diffusion language modeling; used to design 1000 promoter specific enhancers that achieved more than 2× increased specificity in STARR seq assays.
GENA-LM: Primarily used as a masked language model and feature extractor; supports sequence completion through MLM, but the main publication focuses on predictive tasks rather than explicit controllable sequence design.

Key Takeaways

NTv3 is a long range, multi species genomics foundation model: It unifies representation learning, functional track prediction, genome annotation, and controllable sequence generation in a single U Net style architecture that supports 1 Mb nucleotide resolution context across 24 animal and plant species.

The model is trained on 9 trillion base pairs with joint self supervised and supervised objectives: NTv3 is pretrained on 9 trillion base pairs from OpenGenome2 with base resolution masked language modeling, then post trained on more than 16,000 functional tracks and annotation labels from 24 species using a joint objective that mixes continued self supervision with supervised learning.

NTv3 achieves state of the art performance on the Ntv3 Benchmark: After post training, NTv3 reaches state of the art accuracy for functional track prediction and genome annotation across species and outperforms previous sequence to function models and genomics foundation models on public benchmarks and on the Ntv3 Benchmark, which contains 106 standardized long range downstream tasks with 32 kb input and base resolution outputs.

The same backbone supports controllable enhancer design validated with STARR seq: NTv3 can be fine tuned as a controllable generative model using masked diffusion language modeling to design enhancer sequences with specified activity levels and promoter selectivity, and these designs are validated experimentally with STARR seq assays that confirm the intended activity ordering and improved promoter specificity.

Check out the Repo, Model on HF and Technical details.
The post InstaDeep Introduces Nucleotide Transformer v3 (NTv3): A New Multi-Species Genomics Foundation Model, Designed for 1 Mb Context Lengths at Single-Nucleotide Resolution appeared first on MarkTechPost.

Programmatically creating an IDP solution with Amazon Bedrock Data Automation

Intelligent Document Processing (IDP) transforms how organizations handle unstructured document data, enabling automatic extraction of valuable information from invoices, contracts, and reports. Today, we explore how to programmatically create an IDP solution that uses the Strands SDK, Amazon Bedrock AgentCore, Amazon Bedrock Knowledge Bases, and Amazon Bedrock Data Automation (BDA). This solution is provided through a Jupyter notebook that enables users to upload multi-modal business documents and extract insights using BDA as a parser to retrieve relevant chunks and augment a prompt to a foundation model (FM). In this use case, our solution retrieves relevant context about public school districts from the Nation's Report Card published by the U.S. Department of Education.
Amazon Bedrock Data Automation can be used as a standalone feature or as a parser when setting up a knowledge base for Retrieval-Augmented Generation (RAG) workflows. BDA can be used to generate valuable insights from unstructured, multi-modal content such as documents, images, video, and audio. With BDA, you can build automated IDP and RAG workflows, quickly and cost-effectively. In building your RAG workflow, you can use Amazon OpenSearch Service to store the vector embeddings of necessary documents. In this post, Bedrock AgentCore utilizes BDA via tools to perform multi-modal RAG for the IDP solution.
Amazon Bedrock AgentCore is a fully managed service that allows you to build and configure autonomous agents. Developers can build and deploy agents using popular frameworks and a suite of models including those from Amazon Bedrock, Anthropic, Google, and OpenAI all without managing the underlying infrastructure or writing custom code.
Strands Agents SDK is a sophisticated open-source toolkit that revolutionizes artificial intelligence (AI) agent development through a model-driven approach. Developers can create a Strands Agent with a prompt (defining agent behavior) and a list of tools. A large language model (LLM) performs the reasoning, autonomously deciding the optimal actions and when to use tools based on the context and task. This workflow supports complex systems, minimizing the code typically needed to orchestrate multi-agent collaboration. Strands SDK is used for creating the agent and defining the tools needed to perform intelligent document processing.
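As a minimal sketch of that pattern (assuming the strands-agents package's Agent class and tool decorator, as shown in the project's public examples), an agent with a single custom tool can look like the snippet below; the tool body is a hypothetical placeholder rather than the notebook's actual retrieval tool.

# Minimal Strands Agents sketch: a prompt plus a list of tools.
# Assumes the strands-agents package; the tool below is a hypothetical placeholder.
from strands import Agent, tool

@tool
def lookup_document(query: str) -> str:
    """Return document text relevant to the query (placeholder implementation)."""
    return f"No retrieval backend wired up yet for: {query}"

agent = Agent(
    system_prompt="You are an intelligent document processing assistant.",
    tools=[lookup_document],
)

# The model decides when to call the tool based on the request.
agent("Summarize the latest report on public school districts.")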
Follow the prerequisites and step-by-step implementation below to deploy the solution in your own AWS environment.
Prerequisites
To follow along with the example use cases, set up the following prerequisites:

AWS credentials with appropriate permissions
Access to GitHub
Git installed locally; for instructions, see Getting Started – Installing Git

Architecture
The solution uses the following AWS services:

Amazon S3 for document storage and upload capabilities
Bedrock Knowledge Bases to convert objects stored in S3 into a RAG-ready workflow
Amazon OpenSearch for vector embeddings
Amazon Bedrock AgentCore for the IDP workflow
Strands Agent SDK for the open source framework of defining tools to perform IDP
Bedrock Data Automation (BDA) to extract structured insights from your documents

Follow these steps to get started:

Upload relevant documents to Amazon S3
Create Amazon Bedrock Knowledge Base and parse S3 data source using Amazon Bedrock Data Automation.
Document chunks stored as vector embeddings in Amazon OpenSearch
Strands Agent deployed on Amazon Bedrock AgentCore Runtime performs RAG to answer user questions (see the retrieval sketch after this list).
End user receives response
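The sketch below shows the retrieval step in isolation, using the boto3 bedrock-agent-runtime client's retrieve_and_generate call against a knowledge base. The knowledge base ID and model ARN are placeholders from your own deployment, and the notebook's agent wraps this behind a Strands tool rather than calling it directly.

# Hedged sketch: query a Bedrock knowledge base directly with boto3.
# Placeholders: KNOWLEDGE_BASE_ID and MODEL_ARN must come from your own deployment.
import boto3

client = boto3.client("bedrock-agent-runtime")

response = client.retrieve_and_generate(
    input={"text": "Which public school districts improved reading scores?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "KNOWLEDGE_BASE_ID",
            "modelArn": "MODEL_ARN",
        },
    },
)
print(response["output"]["text"])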

Configure the AWS CLI
Use the following command to configure the AWS Command Line Interface (AWS CLI) with the credentials for your AWS account and AWS Region. Before you begin, check Amazon Bedrock Data Automation for Region availability and pricing:

aws configure

Clone and build the GitHub repository locally

git clone https://github.com/aws-samples/sample-for-amazon-bda-agents
cd sample-for-amazon-bda-agents

Open Jupyter notebook called:

bedrock-data-automation-with-agents.ipynb

Bedrock Data Automation with AgentCore Notebook instructions:
This notebook demonstrates how to create an IDP solution using BDA with Amazon Bedrock AgentCore Runtime. Instead of traditional Bedrock Agents, we'll deploy a Strands Agent through AgentCore, providing enterprise-grade capabilities with framework flexibility. More specific instructions are included in the Jupyter notebook. Here's an overview of how you can set up Bedrock Knowledge Bases with Bedrock Data Automation as a parser alongside Bedrock AgentCore.
Steps:

Import libraries and setup AgentCore capabilities
Create the Knowledge Base for Amazon Bedrock with BDA
Upload the academic reports dataset to Amazon S3
Deploy the Strands Agent using AgentCore Runtime
Test the AgentCore-hosted agent
Clean-up all resources

Security considerations
The implementation uses several security guardrails like:

Secure file upload handling
Identity and Access Management (IAM) role-based access control
Input validation and error handling

Note: This implementation is for demonstration purposes. Additional security controls, testing, and architectural reviews are required before deploying in a production environment.
Benefits and use cases
This solution is particularly valuable for:

Automated document processing workflows
Intelligent document analysis on large-scale datasets
Question-answering systems based on document content
Multi-modal content processing

Conclusion
This solution demonstrates how to use the capabilities of Amazon Bedrock AgentCore to build intelligent document processing applications. By building Strands Agents that work with Amazon Bedrock Data Automation, we can create powerful applications that understand and interact with multi-modal document content using tools. With Amazon Bedrock Data Automation, we can enhance the RAG experience for more complex data formats, including visually rich documents, images, audio, and video.
Additional resources
For more information, visit Amazon Bedrock.
Service User Guides:

Amazon Bedrock Knowledge Bases User Guide
Amazon Bedrock AgentCore User Guide
Strands Agents: Open Source AI Agents SDK
Amazon Bedrock Data Automation User Guide

Relevant Samples:

Amazon Bedrock Data Automation AWS Samples
Amazon Bedrock AgentCore AWS Samples
Strands Agents Samples

About the authors
Raian Osman is a Technical Account Manager at AWS and works closely with Education technology customers based out of North America. He has been with AWS for over 3 years and began his journey working as a Solutions Architect. Raian works closely with organizations to optimize and secure workloads on AWS, while exploring innovative use cases for generative AI.
Andy Orlosky is a Strategic Pursuit Solutions Architect at Amazon Web Services (AWS) based out of Austin, Texas. He has been with AWS for about 2 years but has worked closely with Education customers across public sector. As a leader in the AI/ML Technical Field Community, Andy continues to dive deep with his customers to design and scale generative AI solutions. He holds 7 AWS certifications and enjoys spending time with his family, playing sports with friends, and cheering for his favorite sports teams in his free time.
Spencer Harrison is a partner solutions architect at Amazon Web Services (AWS), where he helps public sector organizations use cloud technology to focus on business outcomes. He is passionate about using technology to improve processes and workflows. Spencer’s interests outside of work include reading, pickleball, and personal finance.

AI agent-driven browser automation for enterprise workflow management

Enterprise organizations increasingly rely on web-based applications for critical business processes, yet many workflows remain manually intensive, creating operational inefficiencies and compliance risks. Despite significant technology investments, knowledge workers routinely navigate between eight and twelve different web applications during standard workflows, constantly switching contexts and manually transferring information between systems. Data entry and validation tasks consume approximately 25-30% of worker time, while manual processes create compliance bottlenecks and cross-system data consistency challenges that require continuous human verification.

Traditional automation approaches have significant limitations. While robotic process automation (RPA) works for structured, rule-based processes, it becomes brittle when applications update and requires ongoing maintenance. API-based integration remains optimal, but many legacy systems lack modern capabilities. Business process management platforms provide orchestration but struggle with complex decision points and direct web interaction. As a result, most enterprises operate with mixed approaches where only 30% of workflow tasks are fully automated, 50% require human oversight, and 20% remain entirely manual.
These challenges manifest across common enterprise workflows. For example, purchase order validation requires intelligent navigation through multiple systems to perform three-way matching between purchase orders (POs), receipts, and invoices while maintaining audit trails. Employee on-boarding demands coordinated access provisioning across identity management, customer relationship management (CRM), enterprise resource planning (ERP), and collaboration platforms with role-based decision-making. Finally, e-commerce order processing must intelligently process orders across multiple retailer websites lacking native API access. Artificial intelligence (AI) agents represent a significant advancement beyond these traditional solutions, offering capabilities that can intelligently navigate complexity, adapt to dynamic environments, and dramatically reduce manual intervention across enterprise workflows.
In this post, we demonstrate how an e-commerce order management platform can automate order processing workflows across multiple retail websites at scale, using AI agents such as Amazon Nova Act and Strands Agents together with Amazon Bedrock AgentCore Browser.
E-commerce order automation workflow
This workflow demonstrates how AI agents can intelligently automate complex, multi-step order processing across diverse retailer websites that lack native API integration, combining adaptive browser navigation with human oversight for exception handling.

The following components work together to enable scalable, AI-powered order processing:

ECS Fargate tasks run a containerized Python FastAPI backend with a React frontend, providing WebSocket connections for real-time order automation. Tasks automatically scale based on demand.
The application integrates with Amazon Bedrock and Amazon Nova Act for AI-powered order automation. The AgentCore Browser tool provides a secure, isolated browser environment for web automation. A main agent orchestrates the Nova Act agent and the Strands + Playwright agent for intelligent browser control.

The e-commerce order automation workflow represents a common enterprise challenge where businesses need to process orders across multiple retailer websites without native API access. This workflow demonstrates the full capabilities of AI-powered browser automation, from initial navigation through complex decision-making to human-in-the-loop intervention. We have open sourced a sample agentic e-commerce automation in the aws-samples repository on GitHub.
Workflow process
Users of the e-commerce order management system submit customer orders through a web interface or batch CSV upload, including product details (URL, size, color), customer information, and shipping address. The system assigns priority levels and queues orders for processing. When an order starts, Amazon Bedrock AgentCore Browser creates an isolated browser session with Chrome DevTools Protocol (CDP) connectivity. Amazon Bedrock AgentCore Browser provides a secure, cloud-based browser that enables the AI agent (Amazon Nova Act and Strands agent in this case) to interact with websites. It includes security features such as session isolation, built-in observability through live viewing, AWS CloudTrail logging, and session replay capabilities. The system retrieves retailer credentials from AWS Secrets Manager and generates a live view URL using Amazon DCV streaming for real-time monitoring. The following diagram illustrates the entire order workflow process.
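As a small hedged sketch of the credential step, retailer credentials can be read from AWS Secrets Manager with boto3 as shown below; the secret name and JSON field names are placeholders for illustration, not the sample application's actual values.

# Hedged sketch: fetch retailer credentials from AWS Secrets Manager.
# The secret name and field names are placeholders, not the sample app's actual values.
import json
import boto3

secrets = boto3.client("secretsmanager")
secret_value = secrets.get_secret_value(SecretId="retailer/example-store/credentials")
credentials = json.loads(secret_value["SecretString"])

username = credentials["username"]
password = credentials["password"]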

Browser automation with form-filling and order submission
Form-filling represents a critical capability where the agent intelligently detects and populates various field types across different retailer checkout layouts. The AI agent visits the product page, handles authentication if needed, and analyzes the page to identify size selectors, color options, and cart buttons. It selects specified options, adds items to cart, and proceeds to checkout, filling shipping information with intelligent field detection across different retailer layouts. If products are out of stock or unavailable, the agent escalates to human review with context about alternatives.

The sample application employs two distinct approaches depending on the automation method. Amazon Nova Act uses visual understanding and DOM structure of the webpage, allowing the Nova Act agent to receive natural language instructions like “fill shipping address” and automatically identify form fields from the screenshot, adapting to different layouts without predefined selectors. In contrast, the Strands + Playwright Model Context Protocol (MCP) combination uses Bedrock models to analyze the page’s Document Object Model (DOM) structure, determine appropriate form field selectors, and then Playwright MCP executes the low-level browser interactions to populate the fields with customer data. Both approaches automatically adapt to diverse retailer checkout interfaces, eliminating the brittleness of traditional selector-based automation.
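A minimal Nova Act sketch of that natural-language style is shown below, assuming the nova_act package's NovaAct class and act() method as in its public examples; the product URL and shipping details are placeholders for illustration.

# Hedged sketch of Nova Act-style browser automation with natural-language steps.
# Assumes the nova_act SDK's NovaAct class and act() method; the URL and data are placeholders.
from nova_act import NovaAct

shipping = {"name": "Jane Doe", "address": "123 Main St, Austin, TX 78701"}

with NovaAct(starting_page="https://example-retailer.test/product/123") as nova:
    nova.act("select size M and color blue, then add the item to the cart")
    nova.act("proceed to checkout")
    nova.act(f"fill the shipping form with name {shipping['name']} and address {shipping['address']}")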
Human-in-the-loop
When encountering CAPTCHAs or complex challenges, the agent pauses automation and notifies operators via WebSocket. Operators access the live view to see the exact browser state, resolve the issue manually, and trigger resumption. AgentCore Browser allows for human browser takeover and passing control back to the agent. The agent continues from the current state without restarting the entire process.
Observability and scale
Throughout execution, the system captures session recordings stored in S3, screenshots at critical steps, and detailed execution logs with timestamps. Operators monitor progress through a real-time dashboard showing order status, current step, and progress percentage. For high-volume scenarios, batch processing supports parallel execution of multiple orders with configurable workers (1-10), priority-based queuing, and automatic retry logic for transient failures.
Conclusion
AI agent-driven browser automation represents a fundamental shift in how enterprises approach workflow management. By combining intelligent decision-making, adaptive navigation, and human-in-the-loop capabilities, organizations can move beyond the 30-50-20 split of traditional automation toward significantly higher automation rates across complex, multi-system workflows. The e-commerce order automation example demonstrates that AI agents don’t replace traditional RPA—they enable automation of workflows previously considered too dynamic or complex for automation, handling diverse user interfaces, making contextual decisions, and maintaining full compliance and auditability.
As enterprises face mounting pressure to improve operational efficiency while managing legacy systems and complex integrations, AI agents offer a practical path forward. Rather than investing in expensive system overhauls or accepting the inefficiencies of manual processes, organizations can deploy intelligent browser automation that adapts to their existing technology landscape. The result is reduced operational costs, faster processing times, improved compliance, and most importantly, liberation of knowledge workers from repetitive data entry and system navigation tasks—allowing them to focus on higher-value activities that drive business impact.

About the authors
Kosti Vasilakakis is a Principal PM at AWS on the Agentic AI team, where he has led the design and development of several Bedrock AgentCore services from the ground up, including Runtime, Browser, Code Interpreter, and Identity. He previously worked on Amazon SageMaker since its early days, launching AI/ML capabilities now used by thousands of companies worldwide. Earlier in his career, Kosti was a data scientist. Outside of work, he builds personal productivity automations, plays tennis, and enjoys life with his wife and kids.
Veda Raman is a Sr Solutions Architect for Generative AI for Amazon Nova and Agentic AI at AWS. She helps customers design and build Agentic AI solutions using Amazon Nova models and Bedrock AgentCore. She previously worked with customers building ML solutions using Amazon SageMaker and also as a serverless solutions architect at AWS.
Sanghwa Na is a Generative AI Specialist Solutions Architect at Amazon Web Services. Based in San Francisco, he works with customers to design and build generative AI solutions using large language models and foundation models on AWS. He focuses on helping organizations adopt AI technologies that drive real business value.

Agentic QA automation using Amazon Bedrock AgentCore Browser and Amazo …

Quality assurance (QA) testing has long been the backbone of software development, but traditional QA approaches haven’t kept pace with modern development cycles and complex UIs. Most organizations still rely on a hybrid approach combining manual testing with script-based automation frameworks like Selenium, Cypress, and Playwright—yet teams spend a significant amount of their time maintaining existing test automation rather than creating new tests. The problem is that traditional automation is brittle. Test scripts break with UI changes, require specialized programming knowledge, and often provide incomplete coverage across browsers and devices. Even as many organizations actively explore AI-driven testing workflows, current approaches remain insufficient.
In this post, we explore how agentic QA automation addresses these challenges and walk through a practical example using Amazon Bedrock AgentCore Browser and Amazon Nova Act to automate testing for a sample retail application.
Benefits of agentic QA testing
Agentic AI shifts QA testing from rule-based automation to intelligent, autonomous testing systems. Unlike conventional automation that follows preprogrammed scripts, agentic AI can observe, learn, adapt, and make decisions in real time. The key advantages include autonomous test generation through UI observation and dynamic adaptation as UI elements change—minimizing the maintenance overhead that consumes QA teams’ time. These systems mimic human interaction patterns, making sure testing occurs from a genuine user perspective rather than through rigid, scripted pathways.
AgentCore Browser for large-scale agentic QA testing
To realize the potential of agentic AI testing at enterprise scale, organizations need robust infrastructure that can support intelligent, autonomous testing agents. AgentCore Browser, a built-in tool of Amazon Bedrock AgentCore, addresses this need by providing a secure, cloud-based browser environment specifically designed for AI agents to interact with websites and applications.
AgentCore Browser includes essential enterprise security features such as session isolation, built-in observability through live viewing, AWS CloudTrail logging, and session replay capabilities. Operating within a containerized ephemeral environment, each browser instance can be shut down after use, providing clean testing states and optimal resource management. For large-scale QA operations, AgentCore Browser can run multiple browser sessions concurrently, so organizations can parallelize testing across different scenarios, environments, and user journeys simultaneously.
Agentic QA with the Amazon Nova Act SDK
The infrastructure capabilities of AgentCore Browser become truly powerful when combined with an agentic SDK like Amazon Nova Act. Amazon Nova Act is an AWS service that helps developers build, deploy, and manage fleets of reliable AI agents for automating production UI workflows. With this SDK, developers can break down complex testing workflows into smaller, reliable commands while maintaining the ability to call APIs and perform direct browser manipulation as needed. This approach offers seamless integration of Python code throughout the testing process. Developers can interleave tests, breakpoints, and assertions directly within the agentic workflow, providing unprecedented control and debugging capabilities. This combination of the AgentCore Browser cloud infrastructure with the Amazon Nova Act agentic SDK creates a comprehensive testing ecosystem that transforms how organizations approach quality assurance.
Practical implementation: Retail application testing
To illustrate this transformation in practice, let’s consider developing a new application for a retail company. We’ve created a mock retail web application to demonstrate the agentic QA process, assuming the application is hosted on AWS infrastructure within a private enterprise network during development and testing phases.
To streamline the test creation process, we use Kiro, an AI-powered coding assistant, to automatically generate UI test cases by analyzing our application code base. Kiro examines the application structure, reviews existing test patterns, and creates comprehensive test cases following the JSON schema format required by Amazon Nova Act. By understanding the application’s features—including navigation, search, filtering, and form submissions—Kiro generates detailed test steps with actions and expected results that are immediately executable through AgentCore Browser. This AI-assisted approach dramatically accelerates test creation while providing comprehensive coverage. The following demonstration shows Kiro generating 15 ready-to-use test cases for our QA testing demo application.

After the test cases are generated, they are placed in the test data directory where pytest automatically discovers and executes them. Each JSON test file becomes an independent test that pytest can run in parallel. The framework uses pytest-xdist to distribute tests across multiple worker processes, automatically utilizing available system resources for optimal performance.
During execution, each test gets its own isolated AgentCore Browser session through the Amazon Nova Act SDK. The Amazon Nova Act agent reads the test steps from the JSON file and executes them—performing actions like clicking buttons or filling forms, then validating that expected results occur. This data-driven approach means teams can create comprehensive test suites by simply writing JSON files, without needing to write Python code for each test scenario. The parallel execution architecture significantly reduces testing time. Tests that would normally run sequentially can now execute simultaneously across multiple browser sessions, with pytest managing the distribution and aggregation of results. An HTML report is automatically generated using pytest-html and the pytest-html-nova-act plugin, providing test outcomes, screenshots, and execution logs for complete visibility into the testing process.
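A simplified sketch of this data-driven setup is shown below. The JSON field names and the run_step placeholder that stands in for the Amazon Nova Act SDK call are assumptions, not the repository's actual schema; running pytest with pytest-xdist (for example, pytest -n auto) then distributes the discovered tests across workers.

# Sketch: discover JSON test cases and run each one as an independent pytest test.
# The JSON schema and run_step helper are illustrative assumptions, not the repository's actual code.
import json
from pathlib import Path

import pytest

TEST_DATA_DIR = Path("test_data")  # hypothetical location of the generated JSON test cases

def load_cases():
    return sorted(TEST_DATA_DIR.glob("*.json"))

def run_step(step: dict) -> None:
    """Placeholder for the Nova Act SDK call that performs the action and checks the expected result."""
    raise NotImplementedError

@pytest.mark.parametrize("case_path", load_cases(), ids=lambda p: p.stem)
def test_case(case_path):
    case = json.loads(case_path.read_text())
    for step in case["steps"]:  # e.g. {"action": "...", "expected_result": "..."}
        run_step(step)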

One of the most powerful capabilities of AgentCore Browser is its ability to run multiple browser sessions concurrently, enabling true parallel test execution at scale. When pytest distributes tests across worker processes, each test spawns its own isolated browser session in the cloud. This means your entire test suite can execute simultaneously rather than waiting for each test to complete sequentially.
The AWS Management Console provides complete visibility into these parallel sessions. As demonstrated in the following video, you can view the active browser sessions running concurrently, monitor their status, and track resource utilization in real time. This observability is critical for understanding test execution patterns and optimizing your testing infrastructure.

Beyond just monitoring session status, AgentCore Browser offers live view and session replay features to watch exactly what Amazon Nova Act is doing during and after test execution. For an active browser session, you can open the live view and observe the agent interacting with your application in real time—clicking buttons, filling forms, navigating pages, and validating results. When you enable session replay, you can review the recorded events by replaying the session after it ends. This allows you to validate test results even after the test execution completes. These capabilities are invaluable for debugging test failures, understanding agent behavior, and gaining confidence in your automated testing process.
For complete deployment instructions and access to the sample retail application code, AWS CloudFormation templates, and pytest testing framework, refer to the accompanying GitHub repository. The repository includes the necessary components to deploy and test the application in your own AWS environment.
Conclusion
In this post, we walked through how AgentCore Browser can help parallelize agentic QA testing for web applications. An agent like Amazon Nova Act can perform automated agentic QA testing with high reliability.

About the authors
Kosti Vasilakakis is a Principal PM at AWS on the Agentic AI team, where he has led the design and development of several Bedrock AgentCore services from the ground up, including Runtime, Browser, Code Interpreter, and Identity. He previously worked on Amazon SageMaker since its early days, launching AI/ML capabilities now used by thousands of companies worldwide. Earlier in his career, Kosti was a data scientist. Outside of work, he builds personal productivity automations, plays tennis, and enjoys life with his wife and kids.
Veda Raman is a Sr Solutions Architect for Generative AI for Amazon Nova and Agentic AI at AWS. She helps customers design and build Agentic AI solutions using Amazon Nova models and Bedrock AgentCore. She previously worked with customers building ML solutions using Amazon SageMaker and also as a serverless solutions architect at AWS.
Omkar Nyalpelly is a Cloud Infrastructure Architect at AWS Professional Services with deep expertise in AWS Landing Zones and DevOps methodologies. His current focus centers on the intersection of cloud infrastructure and AI technologies—specifically leveraging Generative AI and agentic AI systems to build autonomous, self-managing cloud environments. Through his work with enterprise customers, Omkar explores innovative approaches to reduce operational overhead while enhancing system reliability. Outside of his technical pursuits, he enjoys playing cricket, baseball, and exploring creative photography. He holds an MS in Networking and Telecommunications from Southern Methodist University.
Ryan Canty is a Solutions Architect at Amazon AGI Labs with over 10 years of software engineering experience, specializing in designing and scaling enterprise software systems across multiple technology stacks. He works with customers to leverage Amazon Nova Act, an AWS service for building and deploying highly reliable AI agents that automate UI-based workflows at scale, bridging the gap between cutting-edge AI capabilities and practical business applications.

How to Build a Proactive Pre-Emptive Churn Prevention Agent with Intel …

In this tutorial, we build a fully functional Pre-Emptive Churn Agent that proactively identifies at-risk users and drafts personalized re-engagement emails before they cancel. Rather than waiting for churn to occur, we design an agentic loop in which we observe user inactivity, analyze behavioral patterns, strategize incentives, and generate human-ready email drafts using Gemini. We orchestrate the entire process step by step, ensuring each component, from data simulation to manager approval, works seamlessly together. Check out the FULL CODES here.

import os
import time
import json
import random
from datetime import datetime, timedelta
from typing import List, Dict, Any
import textwrap

try:
    import google.generativeai as genai
except ImportError:
    !pip install -q -U google-generativeai
    import google.generativeai as genai

from google.colab import userdata
import getpass

We set up our environment, import all required libraries, and ensure Gemini is available for use. We keep the initialization minimal so the rest of the system loads cleanly. As we run it, we prepare the foundation for the agent-driven workflow that follows. Check out the FULL CODES here.

def setup_gemini():
    print("— Security Check —")
    try:
        api_key = userdata.get('GEMINI_API_KEY')
    except Exception:
        print("Please enter your Google Gemini API Key:")
        api_key = getpass.getpass("API Key: ")
    if not api_key:
        raise ValueError("API Key is required to run the agent.")
    genai.configure(api_key=api_key)
    return genai.GenerativeModel('gemini-2.5-flash')

class MockCustomerDB:
    def __init__(self):
        self.today = datetime.now()
        self.users = self._generate_mock_users()

    def _generate_mock_users(self) -> List[Dict]:
        profiles = [
            {"id": "U001", "name": "Sarah Connor", "plan": "Enterprise",
             "last_login_days_ago": 2, "top_features": ["Reports", "Admin Panel"], "total_spend": 5000},
            {"id": "U002", "name": "John Smith", "plan": "Basic",
             "last_login_days_ago": 25, "top_features": ["Image Editor"], "total_spend": 50},
            {"id": "U003", "name": "Emily Chen", "plan": "Pro",
             "last_login_days_ago": 16, "top_features": ["API Access", "Data Export"], "total_spend": 1200},
            {"id": "U004", "name": "Marcus Aurelius", "plan": "Enterprise",
             "last_login_days_ago": 45, "top_features": ["Team Management"], "total_spend": 8000}
        ]
        return profiles

    def fetch_at_risk_users(self, threshold_days=14) -> List[Dict]:
        # Users whose last login exceeds the inactivity threshold are considered at risk.
        return [u for u in self.users if u['last_login_days_ago'] >= threshold_days]

We configure authentication for Gemini and construct a mock customer database that behaves like a real system. We simulate users with varying levels of inactivity to generate realistic churn scenarios. Check out the FULL CODES here.

class ChurnPreventionAgent:
    def __init__(self, model):
        self.model = model

    def analyze_and_strategize(self, user: Dict) -> Dict:
        print(f" … Analyzing strategy for {user['name']}…")
        prompt = f"""
        You are a Customer Success AI Specialist.
        Analyze this user profile and determine the best 'Win-Back Strategy'.
        USER PROFILE:
        - Name: {user['name']}
        - Plan: {user['plan']}
        - Days Inactive: {user['last_login_days_ago']}
        - Favorite Features: {', '.join(user['top_features'])}
        - Total Spend: ${user['total_spend']}
        TASK:
        1. Determine the 'Churn Probability' (Medium/High/Critical).
        2. Select a specific INCENTIVE.
        3. Explain your reasoning briefly.
        OUTPUT FORMAT:
        {{
          "risk_level": "High",
          "incentive_type": "Specific Incentive",
          "reasoning": "One sentence explanation."
        }}
        """
        try:
            response = self.model.generate_content(prompt)
            # Strip Markdown code fences before parsing the JSON strategy.
            clean_json = response.text.replace("```json", "").replace("```", "").strip()
            return json.loads(clean_json)
        except Exception as e:
            return {
                "risk_level": "Unknown",
                "incentive_type": "General Check-in",
                "reasoning": f"Analysis failed: {str(e)}"
            }

We build the analytical core of our churn agent to evaluate user behavior and select win-back strategies. We let Gemini interpret signals, such as inactivity and usage patterns, to determine risk and incentives. Check out the FULL CODES here.

    # Continues the ChurnPreventionAgent class defined above.
    def draft_engagement_email(self, user: Dict, strategy: Dict) -> str:
        print(f" … Drafting email for {user['name']} using '{strategy['incentive_type']}'…")
        prompt = f"""
        Write a short, empathetic, professional re-engagement email.
        TO: {user['name']}
        CONTEXT: They haven't logged in for {user['last_login_days_ago']} days.
        STRATEGY: {strategy['incentive_type']}
        REASONING: {strategy['reasoning']}
        USER HISTORY: They love {', '.join(user['top_features'])}.
        TONE: Helpful and concise.
        """
        response = self.model.generate_content(prompt)
        return response.text

We generate personalized re-engagement emails based on the strategy output from the previous step. We use Gemini to craft concise, empathetic messaging that aligns with each user’s history. Check out the FULL CODES here.

class ManagerDashboard:
    def review_draft(self, user_name, strategy, draft_text):
        print("\n" + "=" * 60)
        print(f" REVIEW REQUIRED: Re-engagement for {user_name}")
        print(f" Strategy: {strategy['incentive_type']}")
        print(f" Risk Level: {strategy['risk_level']}")
        print("-" * 60)
        print(" DRAFT EMAIL:\n")
        print(textwrap.indent(draft_text, '    '))
        print("-" * 60)
        print("\n[Auto-Simulation] Manager reviewing...")
        time.sleep(1.5)
        if strategy['risk_level'] == "Critical":
            print(" MANAGER DECISION: Approved (Priority Send)")
            return True
        else:
            print(" MANAGER DECISION: Approved")
            return True

We simulate a manager dashboard where human oversight approves or rejects the drafted email. We keep the flow simple but realistic, ensuring the agent’s actions remain aligned with human judgment. Check out the FULL CODES here.

def main():
    print("Initializing Agentic System…")
    try:
        model = setup_gemini()
        db = MockCustomerDB()
        agent = ChurnPreventionAgent(model)
        manager = ManagerDashboard()
    except Exception as e:
        print(f"Setup failed: {e}")
        return

    print("\n AGENT STATUS: Scanning Database for inactive users (>14 days)…")
    at_risk_users = db.fetch_at_risk_users(threshold_days=14)
    print(f"Found {len(at_risk_users)} at-risk users.\n")

    for user in at_risk_users:
        print(f"— Processing Case: {user['id']} ({user['name']}) —")
        strategy = agent.analyze_and_strategize(user)
        email_draft = agent.draft_engagement_email(user, strategy)
        approved = manager.review_draft(user['name'], strategy, email_draft)
        if approved:
            print(f" ACTION: Email queued for sending to {user['name']}.")
        else:
            print(f" ACTION: Email rejected.")
        print("\n")
        time.sleep(1)

if __name__ == "__main__":
    main()

We orchestrate the full system: scanning for at-risk users, analyzing them, drafting messages, and routing everything for approval. We bring all components together into one continuous loop. 

In conclusion, we have completed a churn-prevention pipeline that observes, reasons, drafts, and involves a human reviewer before action. We watch the agent detect risk patterns, craft tailored strategies, and generate professional emails, all while maintaining human oversight for final decisions. This implementation demonstrates how agentic workflows can transform customer success operations by enabling timely, personalized, and scalable interventions. We now have a modular foundation we can expand further, connecting it to real databases, CRMs, web dashboards, or automation systems, to build a truly production-ready churn prevention engine.

Check out the FULL CODES here.
The post How to Build a Proactive Pre-Emptive Churn Prevention Agent with Intelligent Observation and Strategy Formation appeared first on MarkTechPost.

Google DeepMind Researchers Release Gemma Scope 2 as a Full Stack Inte …

Google DeepMind Researchers introduce Gemma Scope 2, an open suite of interpretability tools that exposes how Gemma 3 language models process and represent information across all layers, from 270M to 27B parameters.

Its core goal is simple: give AI safety and alignment teams a practical way to trace model behavior back to internal features instead of relying only on input-output analysis. When a Gemma 3 model jailbreaks, hallucinates, or shows sycophantic behavior, Gemma Scope 2 lets researchers inspect which internal features fired and how those activations flowed through the network.

What is Gemma Scope 2?

Gemma Scope 2 is a comprehensive, open suite of sparse autoencoders and related tools trained on internal activations of the Gemma 3 model family. Sparse autoencoders (SAEs) act as a microscope on the model: they decompose high-dimensional activations into a sparse set of human-inspectable features that correspond to concepts or behaviors.
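Conceptually, an SAE projects one activation vector into a much wider, mostly zero feature vector and then reconstructs the original activation from it. The toy sketch below illustrates only that idea; the layer sizes are made up, and it is not the architecture or training recipe of the released Gemma Scope 2 SAEs.

# Conceptual sketch of a sparse autoencoder (SAE) over one residual-stream activation.
# Illustrative only; not the architecture or training recipe of the released Gemma Scope 2 SAEs.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinySAE(nn.Module):
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)   # project into a wide feature dictionary
        self.decoder = nn.Linear(d_features, d_model)   # reconstruct the original activation

    def forward(self, activation: torch.Tensor):
        features = torch.relu(self.encoder(activation))  # sparse, non-negative feature activations
        reconstruction = self.decoder(features)
        return features, reconstruction

sae = TinySAE(d_model=2304, d_features=16384)            # hypothetical sizes
activation = torch.randn(1, 2304)                        # stand-in for a model activation
features, reconstruction = sae(activation)
top_features = features.topk(5).indices                  # the handful of features that "fired"
print(top_features, F.mse_loss(reconstruction, activation))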

Training Gemma Scope 2 required storing around 110 Petabytes of activation data and fitting over 1 trillion total parameters across all interpretability models.

The suite targets every Gemma 3 variant, including 270M, 1B, 4B, 12B and 27B parameter models, and covers the full depth of the network. This is important because many safety relevant behaviors only appear at larger scales.

What is new compared to the original Gemma Scope?

The first Gemma Scope release focused on Gemma 2 and already enabled research on model hallucination, identifying secrets known by a model and training safer models.

Gemma Scope 2 extends that work in four main ways:

The tools now span the entire Gemma 3 family up to 27B parameters, which is needed to study emergent behaviors observed only in larger models, such as the behavior previously analyzed in the 27B size C2S Scale model for scientific discovery tasks.

Gemma Scope 2 includes SAEs and transcoders trained on every layer of Gemma 3. Skip transcoders and cross layer transcoders help trace multi step computations that are distributed across layers.

The suite applies the Matryoshka training technique so that SAEs learn more useful and stable features and mitigate some flaws identified in the earlier Gemma Scope release.

There are dedicated interpretability tools for Gemma 3 models tuned for chat, which make it possible to analyze multi step behaviors such as jailbreaks, refusal mechanisms and chain of thought faithfulness.

Key Takeaways

Gemma Scope 2 is an open interpretability suite for all Gemma 3 models, from 270M to 27B parameters, with SAEs and transcoders on every layer of both pretrained and instruction tuned variants.

The suite uses sparse autoencoders as a microscope that decomposes internal activations into sparse, concept like features, plus transcoders that track how these features propagate across layers.

Gemma Scope 2 is explicitly positioned for AI safety work to study jailbreaks, hallucinations, sycophancy, refusal mechanisms and discrepancies between internal state and communicated reasoning in Gemma 3.

Check out the Paper, Technical details and Model Weights.
The post Google DeepMind Researchers Release Gemma Scope 2 as a Full Stack Interpretability Suite for Gemma 3 Models appeared first on MarkTechPost.

Exploring the zero operator access design of Mantle

At Amazon, our culture, built on honest and transparent discussion of our growth opportunities, enables us to focus on investing and innovating to continually raise the standard on our ability to deliver value for our customers. Earlier this month, we had the opportunity to share an example of this process at work in Mantle, our next-generation inference engine for Amazon Bedrock. As generative AI inference and fine-tuning workloads continue to evolve, we need to evolve how we serve inference to our customers in an optimized way, which led to the development of Mantle.
As we set out to reimagine the architecture of our next generation inferencing engine, we made raising the bar on security our top priority. AWS shares our customers’ unwavering focus on security and data privacy. This has been central to our business from the start, and it was particularly in focus from the earliest days of Amazon Bedrock. We’ve understood from the start that generative AI inference workloads present an unprecedented opportunity for customers to harness the latent value of their data, but with that opportunity comes the need to ensure the highest standards in security, privacy, and compliance as our customers build generative AI systems that process their most sensitive data and interact with their most critical systems.
As a baseline, Amazon Bedrock is designed with the same operational security standards that you see across AWS. AWS has always used a least privilege model for operations, where each AWS operator has access to only the minimum set of systems required to do their assigned task, limited to the time when that privilege is needed. Any access to systems that store or process customer data or metadata is logged, monitored for anomalies, and audited. AWS guards against any actions that would disable or bypass these controls. Additionally, on Amazon Bedrock your data is never used to train any models. Model providers have no mechanism to access customer data, because inferencing is done only within the Amazon Bedrock-owned account that model providers don’t have access to. This strong security posture has been a key enabler for our customers to unlock the potential of generative AI applications for their sensitive data.
With Mantle, we raised the bar even further. Following the approach of the AWS Nitro System, we have designed Mantle from the ground up to be zero operator access (ZOA), where we have intentionally excluded any technical means for AWS operators to access customer data. Instead, systems and services are administered using automation and secure APIs that protect customer data. With Mantle, there is no mechanism for any AWS operator to sign in to underlying compute systems or access any customer data, such as inference prompts or completions. Interactive communication tools like Secure Shell (SSH), AWS Systems Manager Session Manager, and serial consoles aren’t installed anywhere in Mantle. Additionally, all inference software updates need to be signed and verified before they can be deployed into the service, ensuring that only approved code runs on Mantle.
Mantle uses the recently released EC2 instance attestation capability to configure a hardened, constrained, and immutable compute environment for customer data processing. The services in Mantle that are responsible for handling model weights and conducting inference operations on customer prompts are further backed by the high assurance of cryptographically signed attestation measurements from the Nitro Trusted Platform Module (NitroTPM).
When a customer calls a Mantle endpoint (for example, bedrock-mantle.[regions].api.aws) such as those that serve the Responses API on Amazon Bedrock, customer data (prompts) leaves the customer’s environment through TLS, and is encrypted all the way to the Mantle service, which operates with ZOA. Throughout the entire flow and in Mantle, no operator, whether from AWS, the customer, or a model provider can access the customer data.

Looking forward
Mantle’s ZOA design exemplifies the long-term commitment of AWS to the security and privacy of our customers’ data. It’s this focus that has enabled teams across AWS to invest in further raising the bar for security. At the same time, we’ve made the foundational confidential computing capabilities that we internally use at Amazon, such as NitroTPM Attestation, available to all customers to use on Amazon Elastic Compute Cloud (Amazon EC2).
We’re not stopping here; we’re committed to continuing to invest in enhancing the security of your data and to providing you with more transparency and assurance on how we achieve this.

About the authors
Anthony Liguori is an AWS VP and Distinguished Engineer for Amazon Bedrock, and the lead engineer for Mantle.

AWS AI League: Model customization and agentic showdown

Building intelligent agents to handle complex, real-world tasks can be daunting. Additionally, rather than relying solely on large, pre-trained foundation models, organizations often need to fine-tune and customize smaller, more specialized models to outperform them for their specific use cases. The AWS AI League provides an innovative program to help enterprises overcome the challenges of building advanced AI capabilities through exciting competitions that drive innovation in agentic AI and model customization.
In 2025, the first AWS AI League competition captured the attention of developers, data scientists, and business leaders globally. They came together to solve pressing problems using the latest AI tools and techniques. The grand finale at AWS re:Invent 2025 was an exciting showcase of their ingenuity and skills. Cross-functional teams from leading organizations competed head-to-head, demonstrating their ability to craft effective prompts, fine-tune models, and build powerful AI agents.
Congratulations to our 2025 AWS AI League Champions! After intense competition, these three exceptional builders emerged victorious, sharing a $25,000 prize pool:

1st Place: Hemanth Vediyera from Cisco
2nd Place: Ross Williams from Aqfer
3rd Place: Deepesh Khanna from Capital One

Figure 1: Left to right: Ross, Hemanth, Deepesh

This post explores how the AWS AI League program can be used to host AI competitions that help participants learn model customization and agent building concepts, apply them to real-world business challenges, and showcase their innovative solutions through engaging, game-style formats. We highlight the new agentic AI and model customization challenges, where enterprises can apply to host internal tournaments using AWS credits, and developers can compete at AWS events.
To get started, visit the AWS AI League product page.
What is the AWS AI League Championship?
The AWS AI League experience begins with a hands-on, 2-hour workshop led by AWS experts, followed by self-paced experimentation. The journey culminates in a captivating, gameshow-style grand finale, where you showcase your AI creations and solutions to address pressing business challenges. The following figure shows these three steps.

Figure 2: AWS AI League Championship steps

Building on the success of the 2025 program, we are excited to announce the launch of the AWS AI League 2026 Championship. This year, the competition features two new challenges that allow participants to really put their AI skills to the test:

The agentic AI Challenge allows you to build intelligent agents using Amazon Bedrock AgentCore. Competitors craft customized agent architectures to tackle real-world business problems.
Complementing the agentic AI Challenge, the model customization Challenge now uses the latest fine-tuning recipes in SageMaker Studio. Here you customize models for specific use cases.

For the 2026 AI League championship, the prize pool doubles to $50,000, with tracks catering to developers at different skill levels – from beginners to advanced practitioners.
Build intelligent agents with the agentic AI challenge
The AWS AI League now features an exciting agentic AI challenge, where you build intelligent agents using Amazon Bedrock AgentCore to solve complex problems in a dynamic, game-style competition. In this challenge, agents navigate through a maze-like grid environment, encountering various challenges while seeking a treasure chest. These challenges map to real-world use cases, testing the agents’ ability to handle inappropriate content, execute code, use a browser, and more.
Agents have a time limit to traverse the map, collect points, and overcome the obstacles before reaching the treasure chest. The more points they earn, the higher they rank on the leaderboard. You can fully customize your agents using Amazon Bedrock AgentCore primitives, which enables you to more securely scale and manage production-grade agents. You can also select specific models for supervisor and sub-agents, as well as create custom tools such as Bedrock Guardrails, AgentCore Memory, and AWS Lambda functions to help your agents navigate the challenges. The following figure depicts the obstacles the agent must overcome while traveling to reach the treasure chest.

Figure 3: AWS AI League Agentic Challenge

AWS AI League provides a full user interface (UI) for users to build their intelligent agent solutions. You can use this no-code UI to construct multi-agent architectures and tools, integrating various components such as Amazon SageMaker Studio CodeEditor for interactive coding of custom Lambda functions and tools. This allows you to fully develop and customize your agent-based solutions within the AWS AI League website, without needing to leave the environment.
The following screenshots showcase the agent building experience all within the AWS AI League website.

Figure 4: AWS AI League agent tools

Figure 5: AWS AI League multi agent architecture

Throughout the competition, users receive real-time agent performance feedback, with a large language model (LLM) evaluator providing assessment to help with iteration. The following image showcases how the agent is evaluated during challenges.

Figure 6: AWS AI League agent challenge evaluation

At the grand finale, the top finalists take the stage to showcase their agents’ capabilities in a live, game-show format, demonstrating the power and versatility of agentic AI in solving complex, multi-step problems. The evaluation criteria include time efficiency, accuracy in solving challenges, agent planning, and token consumption efficiency. The following snapshot shows the final round of the Grand Finale at re:Invent 2025.

Figure 7: AWS AI League re:Invent 2025 Grand Finale

Customize models to outperform larger models
AWS AI League is expanding the scope of its model customization challenge, allowing you to use the latest advancements in fine-tuning techniques.
You can access the new model customization experience within Amazon SageMaker Studio, where you can use powerful new training recipes. The goal is to develop highly effective, domain-specific models that outperform larger reference models.
The challenge begins with you honing your model customization skills. Using the tools and techniques you have learned, you apply advanced fine-tuning methods to enhance your model’s performance. After your models are customized, the true test begins: the models are submitted to a leaderboard for performance assessment against a reference model. Your model earns points each time the automated judge deems your customized model’s response more accurate and comprehensive than the reference model’s output. You can showcase your advanced skills, rise to the top of the leaderboard, and potentially unlock new opportunities for your organization.
During the challenge, you receive real-time feedback on your model’s performance from an automated evaluator when you submit to the leaderboard. The leaderboard evaluates submissions against a reference dataset throughout the competition, providing immediate feedback on accuracy to help you iterate and improve your solutions. The following image showcases how an AI critique is used to evaluate the customized model.

Figure 8: AWS AI League model customization evaluation

At the grand finale, the top finalists demonstrate their models’ capabilities in a live, game-show format, showcasing their prompt engineering abilities. During the gameshow, the scoring includes expert evaluation where domain experts and a live audience participate in real-time voting to determine which AI solutions best solve real business challenges. The following image showcases the participant prompt engineering view during a Grand Finale.

Figure 9: AWS AI League model customization Grand Finale participant view

Conclusion
In this post, we explored the new AWS AI League challenges and how they are transforming the way organizations approach AI development. At AWS, we’ve learned that the fastest way to spark innovation is through competition. With AWS AI League, builders can showcase their AI skills, compete, and unlock innovation.
To learn more about hosting an AWS AI League within your organization, visit the AWS AI League page. To dive deeper into building intelligent agents and customizing AI models, explore the AWS AI training catalog on AWS Skill Builder.

About the authors
Marc Karp is an ML Architect with the Amazon SageMaker Service team. He focuses on helping customers design, deploy, and manage ML workloads at scale. In his spare time, he enjoys traveling and exploring new places.
Natasya K. Idries is the Product Marketing Manager for AWS AI/ML Gamified Learning Programs. She is passionate about democratizing AI/ML skills through engaging and hands-on educational initiatives that bridge the gap between advanced technology and practical business implementation. Her expertise in building learning communities and driving digital innovation continues to shape her approach to creating impactful AI education programs. Outside of work, Natasya enjoys traveling, cooking Southeast Asian cuisines and exploring nature trails.

Accelerate Enterprise AI Development using Weights & Biases and Am …

This post is co-written by Thomas Capelle and Ray Strickland from Weights & Biases (W&B).
Generative artificial intelligence (AI) adoption is accelerating across enterprises, evolving from simple foundation model interactions to sophisticated agentic workflows. As organizations transition from proof-of-concepts to production deployments, they require robust tools for development, evaluation, and monitoring of AI applications at scale.
In this post, we demonstrate how to use Foundation Models (FMs) from Amazon Bedrock and the newly launched Amazon Bedrock AgentCore alongside W&B Weave to help build, evaluate, and monitor enterprise AI solutions. We cover the complete development lifecycle from tracking individual FM calls to monitoring complex agent workflows in production.
Overview of W&B Weave
Weights & Biases (W&B) is an AI developer system that provides comprehensive tools for training models, fine-tuning, and leveraging foundation models for enterprises of all sizes across various industries.
W&B Weave offers a unified suite of developer tools to support every stage of your agentic AI workflows. It enables:

Tracing & monitoring: Track large language model (LLM) calls and application logic to debug and analyze production systems.
Systematic iteration: Refine and iterate on prompts, datasets and models.
Experimentation: Experiment with different models and prompts in the LLM Playground.
Evaluation: Use custom or pre-built scorers alongside our comparison tools to systematically assess and enhance application performance. Collect user and expert feedback for real-life testing and evaluation.
Guardrails: Help protect your application with safeguards for content moderation, prompt safety, and more. Use custom or third-party guardrails (including Amazon Bedrock Guardrails) or W&B Weave’s native guardrails.

W&B Weave can be fully managed by Weights & Biases in a multi-tenant or single-tenant environment or can be deployed in a customer’s Amazon Virtual Private Cloud (VPC) directly. In addition, W&B Weave’s integration into the W&B Development Platform provides organizations a seamlessly integrated experience between the model training/fine-tuning workflow and the agentic AI workflow.
To get started, subscribe to the Weights & Biases AI Development Platform through AWS Marketplace. Individuals and academic teams can subscribe to W&B at no additional cost.
Tracking Amazon Bedrock FMs with W&B Weave SDK
W&B Weave integrates seamlessly with Amazon Bedrock through Python and TypeScript SDKs. After installing the library and patching your Bedrock client, W&B Weave automatically tracks the LLM calls:

!pip install weave
import weave
import boto3
import json
from weave.integrations.bedrock.bedrock_sdk import patch_client

weave.init("my_bedrock_app")

# Create and patch the Bedrock client
client = boto3.client("bedrock-runtime")
patch_client(client)

# Use the client as usual
response = client.invoke_model(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
    body=json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 100,
        "messages": [
            {"role": "user", "content": "What is the capital of France?"}
        ]
    }),
    contentType='application/json',
    accept='application/json'
)
response_dict = json.loads(response.get('body').read())
print(response_dict["content"][0]["text"])

This integration automatically versions experiments and tracks configurations, providing complete visibility into your Amazon Bedrock applications without modifying core logic.
Experimenting with Amazon Bedrock FMs in W&B Weave Playground
The W&B Weave Playground accelerates prompt engineering with an intuitive interface for testing and comparing Bedrock models. Key features include:

Direct prompt editing and message retrying
Side-by-side model comparison
Access from trace views for rapid iteration

To begin, add your AWS credentials in the Playground settings, select your preferred Amazon Bedrock FMs, and start experimenting. The interface enables rapid iteration on prompts while maintaining full traceability of experiments.

Evaluating Amazon Bedrock FMs with W&B Weave Evaluations
W&B Weave Evaluations provides dedicated tools for evaluating generative AI models effectively. By using W&B Weave Evaluations alongside Amazon Bedrock, users can efficiently evaluate these models, analyze outputs, and visualize performance across key metrics. Users can apply built-in scorers from W&B Weave, third-party or custom scorers, and human or expert feedback as well. This combination allows for a deeper understanding of the tradeoffs between models, such as differences in cost, accuracy, speed, and output quality.
W&B Weave has a first-class way to track evaluations with Model & Evaluation classes. To set up an evaluation job, customers can:

Define a dataset or list of dictionaries with a collection of examples to be evaluated
Create a list of scoring functions. Each function should take the model output and, optionally, other inputs from your examples, and return a dictionary with the scores
Define an Amazon Bedrock model by using Model class
Evaluate this model by calling Evaluation

Here’s an example of setting up an evaluation job:

import weave
from weave import Evaluation
import asyncio

# Collect your examples
examples = [
    {"question": "What is the capital of France?", "expected": "Paris"},
    {"question": "Who wrote 'To Kill a Mockingbird'?", "expected": "Harper Lee"},
    {"question": "What is the square root of 64?", "expected": "8"},
]

# Define any custom scoring function
@weave.op()
def match_score1(expected: str, output: dict) -> dict:
    # Here is where you'd define the logic to score the model output
    return {'match': expected == output['generated_text']}

@weave.op()
def function_to_evaluate(question: str):
    # here's where you would add your LLM call and return the output
    return {'generated_text': 'Paris'}

# Score your examples using scoring functions
evaluation = Evaluation(
    dataset=examples, scorers=[match_score1]
)

# Start tracking the evaluation
weave.init('intro-example')
# Run the evaluation
asyncio.run(evaluation.evaluate(function_to_evaluate))

The evaluation dashboard visualizes performance metrics, enabling informed decisions about model selection and configuration. For detailed guidance, see our previous post on evaluating LLM summarization with Amazon Bedrock and Weave.
Enhancing Amazon Bedrock AgentCore Observability with W&B Weave
Amazon Bedrock AgentCore is a complete set of services for deploying and operating highly capable agents more securely at enterprise scale. It provides more secure runtime environments, workflow execution tools, and operational controls that work with popular frameworks like Strands Agents, CrewAI, LangGraph, and LlamaIndex, as well as many LLM models – whether from Amazon Bedrock or external sources.
AgentCore includes built-in observability through Amazon CloudWatch dashboards that track key metrics like token usage, latency, session duration, and error rates. It also traces workflow steps, showing which tools were invoked and how the model responded, providing essential visibility for debugging and quality assurance in production.
When working with AgentCore and W&B Weave together, teams can use AgentCore’s built-in operational monitoring and security foundations while also using W&B Weave if it aligns with their existing development workflows. Organizations already invested in the W&B environment may choose to incorporate W&B Weave’s visualization tools alongside AgentCore’s native capabilities. This approach gives teams flexibility to use the observability solution that best fits their established processes and preferences when developing complex agents that chain multiple tools and reasoning steps.

There are two main approaches to add W&B Weave observability to your AgentCore agents: using the native W&B Weave SDK or integrating through OpenTelemetry.
Native W&B Weave SDK
The simplest approach is to use W&B Weave’s @weave.op decorator to automatically track function calls. Initialize W&B Weave with your project name and wrap the functions you want to monitor:

import weave
import os
from typing import Any, Dict

# Agent here refers to your agent framework's agent class, for example Strands Agents:
from strands import Agent

os.environ["WANDB_API_KEY"] = "your_api_key"
weave.init("your_project_name")

@weave.op()
def word_count_op(text: str) -> int:
    return len(text.split())

@weave.op()
def run_agent(agent: Agent, user_message: str) -> Dict[str, Any]:
    result = agent(user_message)
    return {"message": result.message, "model": agent.model.config["model_id"]}

Since AgentCore runs as a Docker container, add W&B Weave to your dependencies (for example, uv add weave) to include it in your container image.
OpenTelemetry Integration
For teams already using OpenTelemetry or wanting vendor-neutral instrumentation, W&B Weave supports OTLP (OpenTelemetry Protocol) directly:

import base64
import json

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# WANDB_API_KEY and WEAVE_PROJECT are assumed to be defined, for example read from the environment
auth_b64 = base64.b64encode(f"api:{WANDB_API_KEY}".encode()).decode()
exporter = OTLPSpanExporter(
    endpoint="https://trace.wandb.ai/otel/v1/traces",
    headers={"Authorization": f"Basic {auth_b64}", "project_id": WEAVE_PROJECT}
)

# Register the exporter and obtain a tracer
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

# Create spans to track execution (agent and user_message come from the earlier example)
with tracer.start_as_current_span("invoke_agent") as span:
    span.set_attribute("input.value", json.dumps({"prompt": user_message}))
    result = agent(user_message)
    span.set_attribute("output.value", json.dumps({"message": result.message}))

This approach maintains compatibility with AgentCore’s existing OpenTelemetry infrastructure while routing traces to W&B Weave for visualization.

When using both AgentCore and W&B Weave together, teams have multiple options for observability. AgentCore’s CloudWatch integration monitors system health, resource utilization, and error rates while providing tracing for agent reasoning and tool selection. W&B Weave offers visualization capabilities that present execution data in formats familiar to teams already using the W&B environment. Both solutions provide visibility into how agents process information and make decisions, allowing organizations to choose the observability approach that best aligns with their existing workflows and preferences.

This dual-layer approach means users can:

Monitor production service level agreements (SLAs) through CloudWatch alerts
Debug complex agent behaviors in W&B Weave’s trace explorer
Optimize token usage and latency with detailed execution breakdowns
Compare agent performance across different prompts and configurations

The integration requires minimal code changes, preserves your existing AgentCore deployment, and scales with your agent complexity. Whether you’re building simple tool-calling agents or orchestrating multi-step workflows, this observability stack provides the insights needed to iterate quickly and deploy confidently.
For implementation details and complete code examples, refer to our previous post.
Conclusion
In this post, we demonstrated how to build and optimize enterprise-grade agentic AI solutions by combining Amazon Bedrock’s FMs and AgentCore with W&B Weave’s comprehensive observability toolkit. We explored how W&B Weave can enhance every stage of the LLM development lifecycle—from initial experimentation in the Playground to systematic evaluation of model performance, and finally to production monitoring of complex agent workflows.
The integration between Amazon Bedrock and W&B Weave provides several key capabilities:

Automatic tracking of Amazon Bedrock FM calls with minimal code changes using the W&B Weave SDK
Rapid experimentation through the W&B Weave Playground’s intuitive interface for testing prompts and comparing models
Systematic evaluation with custom scoring functions to evaluate different Amazon Bedrock models
Comprehensive observability for AgentCore deployments, with CloudWatch metrics providing more robust operational monitoring supplemented by detailed execution traces

To get started:

Request a free trial or subscribe to the Weights & Biases AI Development Platform through AWS Marketplace
Install the W&B Weave SDK and follow our code examples to begin tracking your Bedrock FM calls
Experiment with different models in the W&B Weave Playground by adding your AWS credentials and testing various Amazon Bedrock FMs
Set up evaluations using the W&B Weave Evaluation framework to systematically compare model performance for your use cases
Enhance your AgentCore agents by adding W&B Weave observability using either the native SDK or OpenTelemetry integration

Start with a simple integration to track your Amazon Bedrock calls, then progressively adopt more advanced features as your AI applications grow in complexity. The combination of Amazon Bedrock and W&B Weave’s comprehensive development tools provides the foundation needed to build, evaluate, and maintain production-ready AI solutions at scale.

About the authors
James Yi is a Senior AI/ML Partner Solutions Architect at AWS. He spearheads AWS’s strategic partnerships in Emerging Technologies, guiding engineering teams to design and develop cutting-edge joint solutions in generative AI. He enables field and technical teams to seamlessly deploy, operate, secure, and integrate partner solutions on AWS. James collaborates closely with business leaders to define and execute joint Go-To-Market strategies, driving cloud-based business growth. Outside of work, he enjoys playing soccer, traveling, and spending time with his family.
Ray Strickland is a Senior Partner Solutions Architect at AWS specializing in AI/ML, Agentic AI and Intelligent Document Processing. He enables partners to deploy scalable generative AI solutions using AWS best practices and drives innovation through strategic partner enablement programs. Ray collaborates across multiple AWS teams to accelerate AI adoption and has extensive experience in partner evaluation and enablement.
Thomas Capelle is a Machine Learning Engineer at Weights & Biases. He is responsible for keeping the www.github.com/wandb/examples repository live and up to date. He also builds content on MLOPS, applications of W&B to industries, and fun deep learning in general. Previously he was using deep learning to solve short-term forecasting for solar energy. He has a background in Urban Planning, Combinatorial Optimization, Transportation Economics, and Applied Math.
Scott Juang is the Director of Alliances at Weights & Biases. Prior to W&B, he led a number of strategic alliances at AWS and Cloudera. Scott studied Materials Engineering and has a passion for renewable energy.

Meta AI Open-Sourced Perception Encoder Audiovisual (PE-AV): The Audio …

Meta researchers have introduced Perception Encoder Audiovisual, PEAV, as a new family of encoders for joint audio and video understanding. The model learns aligned audio, video, and text representations in a single embedding space using large scale contrastive training on about 100M audio video pairs with text captions.

From Perception Encoder to PEAV

Perception Encoder, PE, is the core vision stack in Meta’s Perception Models project. It is a family of encoders for images, video, and audio that reaches state of the art on many vision and audio benchmarks using a unified contrastive pretraining recipe. PE core surpasses SigLIP2 on image tasks and InternVideo2 on video tasks. PE lang powers Perception Language Model for multimodal reasoning. PE spatial is tuned for dense prediction tasks such as detection and depth estimation.

PEAV builds on this backbone and extends it to full audio video text alignment. In the Perception Models repository, PE audio visual is listed as the branch that embeds audio, video, audio video, and text into a single joint embedding space for cross modal understanding.

https://ai.meta.com/research/publications/pushing-the-frontier-of-audiovisual-perception-with-large-scale-multimodal-correspondence-learning/

Architecture, Separate Towers and Fusion

The PEAV architecture is composed of a frame encoder, a video encoder, an audio encoder, an audio video fusion encoder, and a text encoder.

The video path uses the existing PE frame encoder on RGB frames, then applies a temporal video encoder on top of frame level features.

The audio path uses DAC VAE as a codec to convert raw waveforms into discrete audio tokens at a fixed frame rate, about one embedding every 40 milliseconds.

These towers feed an audio video fusion encoder that learns a shared representation for both streams. The text encoder projects text queries into several specialized spaces. In practice this gives you a single backbone that can be queried in many ways. You can retrieve video from text, audio from text, audio from video, or retrieve text descriptions conditioned on any combination of modalities without retraining task specific heads.

https://ai.meta.com/research/publications/pushing-the-frontier-of-audiovisual-perception-with-large-scale-multimodal-correspondence-learning/

Data Engine, Synthetic Audiovisual Captions At Scale

The research team proposed a two stage audiovisual data engine that generates high quality synthetic captions for unlabeled clips. The team describes a pipeline that first uses several weak audio caption models, their confidence scores, and separate video captioners as input to a large language model. This LLM produces three caption types per clip, one for audio content, one for visual content, and one for joint audio visual content. An initial PE AV model is trained on this synthetic supervision.

In the second stage, this initial PEAV is paired with a Perception Language Model decoder. Together they refine the captions to better exploit audiovisual correspondences. The two stage engine yields reliable captions for about 100M audio video pairs and uses about 92M unique clips for stage 1 pretraining and 32M additional unique clips for stage 2 fine tuning.

Compared to prior work that often focuses on speech or narrow sound domains, this corpus is designed to be balanced across speech, general sounds, music, and diverse video domains, which is important for general audio visual retrieval and understanding.

Contrastive Objective Across Ten Modality Pairs

PEAV uses a sigmoid based contrastive loss across audio, video, text, and fused representations. The research team explains that the model uses eight contrastive loss pairs during pretraining. These cover combinations such as audio text, video text, audio video text, and fusion related pairs. During fine tuning, two extra pairs are added, which brings the total to ten loss pairs among the different modality and caption types.

This objective is similar in form to contrastive objectives used in recent vision language encoders but generalized to audio video text tri modal training. By aligning all these views in one space, the same encoder can support classification, retrieval, and correspondence tasks with simple dot product similarities.
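As a rough illustration of the objective family (not Meta's implementation or hyperparameters), a pairwise sigmoid contrastive loss for a single modality pair can be written as follows.

# Sketch of a pairwise sigmoid contrastive loss for one modality pair (e.g., audio vs. text).
# Illustrative of the objective family only; not Meta's implementation or hyperparameters.
import torch
import torch.nn.functional as F

def sigmoid_contrastive_loss(emb_a, emb_b, temperature=10.0, bias=-10.0):
    emb_a = F.normalize(emb_a, dim=-1)
    emb_b = F.normalize(emb_b, dim=-1)
    logits = emb_a @ emb_b.T * temperature + bias        # (N, N) pairwise similarities
    labels = 2 * torch.eye(len(emb_a)) - 1               # +1 on matched pairs, -1 elsewhere
    return -F.logsigmoid(labels * logits).mean()

audio_emb = torch.randn(8, 512)   # stand-in batch of audio embeddings
text_emb = torch.randn(8, 512)    # stand-in batch of matching text embeddings
print(sigmoid_contrastive_loss(audio_emb, text_emb))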

Performance Across Audio, Speech, Music And Video

On benchmarks, PEAV targets zero shot retrieval and classification for multiple domains. PEAV achieves state of the art performance on several audio and video benchmarks compared to recent audio text and audio video text models from works such as CLAP, Audio Flamingo, ImageBind, and LanguageBind.

Concrete gains include:

On AudioCaps, text to audio retrieval improves from 35.4 to 45.8 R@1.

On VGGSound, clip level classification accuracy improves from 36.0 to 47.1.

For speech retrieval on VCTK style tasks, PEAV reaches 85.6 accuracy while earlier models are near 0.

On ActivityNet, text to video retrieval improves from 60.4 to 66.5 R@1.

On Kinetics 400, zero shot video classification improves from 76.9 to 78.9, beating models 2 to 4 times larger.

https://ai.meta.com/research/publications/pushing-the-frontier-of-audiovisual-perception-with-large-scale-multimodal-correspondence-learning/

PEA-Frame, Frame Level Audio Text Alignment

Alongside PEAV, Meta releases Perception Encoder Audio Frame (PEA-Frame) for sound event localization. PEA-Frame is an audio text embedding model that outputs one audio embedding per 40 millisecond frame and a single text embedding per query. The model can return temporal spans that mark where in the audio each described event occurs.

PEA-Frame uses frame level contrastive learning to align audio frames with text. This enables precise localization of events such as specific speakers, instruments, or transient sounds in long audio sequences.
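A rough sketch of how frame level embeddings can be turned into event spans is shown below: score each 40 millisecond frame against the text embedding, threshold the scores, and merge contiguous frames. The threshold and tensor shapes are illustrative assumptions, not the PEA-Frame inference code.

# Sketch of turning frame-level audio-text similarities into event spans (one span per contiguous
# run of frames above a threshold). Illustrative only; not the PEA-Frame inference code.
import torch
import torch.nn.functional as F

def localize(frame_embs: torch.Tensor, text_emb: torch.Tensor, threshold=0.3, frame_sec=0.04):
    sims = F.normalize(frame_embs, dim=-1) @ F.normalize(text_emb, dim=-1)   # one score per 40 ms frame
    active = (sims > threshold).tolist()
    spans, start = [], None
    for i, on in enumerate(active + [False]):                                # sentinel closes a trailing span
        if on and start is None:
            start = i
        elif not on and start is not None:
            spans.append((start * frame_sec, i * frame_sec))
            start = None
    return spans

frame_embs = torch.randn(250, 512)   # e.g. a 10 second clip at one embedding per 40 ms
text_emb = torch.randn(512)          # stand-in text query embedding
print(localize(frame_embs, text_emb))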

Role In The Perception Models And SAM Audio Ecosystem

PEAV and PEA-Frame sit inside the broader Perception Models stack, which combines PE encoders with the Perception Language Model for multimodal generation and reasoning.

PEAV is also the core perception engine behind Meta’s new SAM Audio model and its Judge evaluator. SAM Audio uses PEAV embeddings to connect visual prompts and text prompts to sound sources in complex mixtures and to score the quality of separated audio tracks.

Key Takeaways

PEAV is a unified encoder for audio, video, and text, trained with contrastive learning on over 100M videos, and it embeds audio, video, audio-video, and text inputs into a single joint space for cross-modal retrieval and understanding.

The architecture uses separate video and audio towers, with PE-based visual encoding and DAC-VAE audio tokenization, followed by an audiovisual fusion encoder and specialized text heads aligned to different modality pairs.

A two-stage data engine generates synthetic audio, visual, and audio-visual captions, using weaker captioners plus an LLM in stage 1 and PEAV plus the Perception Language Model in stage 2, enabling large-scale multimodal supervision without manual labels.

PEAV establishes a new state of the art on a wide range of audio and video benchmarks through a sigmoid contrastive objective over multiple modality pairs, with six public checkpoints spanning small 16-frame to large all-frame variants and average retrieval improving from about 45 to 51.6.

PEAV, together with the frame-level PEA-Frame variant, forms the perception backbone for Meta’s SAM Audio system, providing the embeddings used for prompt-based audio separation and fine-grained sound event localization across speech, music, and general sounds.

Check out the Paper, Repo and Model Weights.
The post Meta AI Open-Sourced Perception Encoder Audiovisual (PE-AV): The Audiovisual Encoder Powering SAM Audio And Large Scale Multimodal Retrieval appeared first on MarkTechPost.

Google Introduces A2UI (Agent-to-User Interface): An Open Source Protoc …

Google has open sourced A2UI, an Agent to User Interface specification and set of libraries that lets agents describe rich native interfaces in a declarative JSON format while client applications render them with their own components. The project targets a clear problem: how to let remote agents present secure, interactive interfaces across trust boundaries without sending executable code.

What is A2UI?

A2UI is an open standard and implementation that allows agents to speak UI. An agent does not output HTML or JavaScript. It outputs an A2UI response, which is a JSON payload that describes a set of components, their properties and a data model. The client application reads this description and maps each component to its own native widgets, for example Angular components, Flutter widgets, web components, React components or SwiftUI views.

The Problem, Agents Need to Speak UI

Most chat-based agents respond with long text. For tasks such as restaurant booking or data entry, this produces many turns and dense answers. The A2UI launch post shows a restaurant example where a user asks for a table; the agent then asks several follow-up questions in text, which is slow. A better experience is a small form with a date picker, time selector and submit button. A2UI lets the agent request that form as a structured UI description instead of narrating it in natural language.

The problem becomes harder in a multi-agent mesh. In that setting, an orchestrator in one organization may delegate work to a remote A2A agent in another organization. The remote agent cannot touch the Document Object Model of the host application. It can only send messages. Historically that meant HTML or script inside an iframe, an approach that is heavy, often visually inconsistent with the host and risky from a security point of view. A2UI instead defines a format that is as safe as plain data yet expressive enough to describe complex layouts.

Core Design, Security and LLM Friendly Structure

A2UI focuses on security, LLM friendliness and portability.

Security first. A2UI is a declarative data format, not executable code. The client maintains a catalog of trusted components such as Card, Button or TextField. The agent can only reference types in this catalog. This reduces the risk of UI injection and avoids arbitrary script execution from model output.

LLM-friendly representation. The UI is represented as a flat list of components with identifier references. This makes it easier for language models to generate or update interfaces incrementally and supports streaming updates. The agent can adjust a view as the conversation progresses without regenerating a full nested JSON tree (see the sketch after this list).

Framework agnostic. A single A2UI payload can be rendered on multiple clients. The agent describes a component tree and associated data model. The client maps that structure to native widgets in frameworks such as Angular, Flutter, React or SwiftUI. This allows reuse of the same agent logic across web, mobile and desktop surfaces.

Progressive rendering. Because the format is designed for streaming, clients can show partial interfaces while the agent continues computing. Users see the interface assemble in real time rather than waiting for a complete response.
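To give a feel for what such a flat, identifier-referenced description might look like, here is a hypothetical payload for the restaurant-booking form mentioned earlier, written as a Python dict; the field names are illustrative and do not reproduce the official A2UI schema.

# Hypothetical flat component list with id references, in the spirit of the
# A2UI description above; field names are illustrative, not the real schema.
booking_form = {
    "components": [
        {"id": "root",   "type": "Card",       "children": ["date", "time", "submit"]},
        {"id": "date",   "type": "DatePicker", "label": "Date", "bind": "booking.date"},
        {"id": "time",   "type": "TimePicker", "label": "Time", "bind": "booking.time"},
        {"id": "submit", "type": "Button",     "label": "Book table", "action": "submit_booking"},
    ],
    "data": {"booking": {"date": None, "time": None}},
}

Because every component is a flat entry referenced by id, an agent can stream or patch individual entries without regenerating the whole tree.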

Architecture and Data Flow

A2UI is a pipeline that separates generation, transport and rendering.

A user sends a message to an agent through a chat or another surface.

The agent, often backed by Gemini or another model that can generate JSON, produces an A2UI response. This response describes components, layout and data bindings.

The A2UI messages stream to the client over a transport such as the Agent-to-Agent (A2A) protocol or the AG-UI protocol.

The client uses an A2UI renderer library. The renderer parses the payload and resolves each component type into a concrete widget in the host codebase.

User actions, for example button clicks or form submissions, are sent back as events to the agent. The agent may respond with new A2UI messages that update the existing interface.
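The sketch below illustrates the client-side half of that flow under the trusted-catalog idea: a renderer resolves declared component types through a fixed catalog and silently drops anything outside it. The structure and names are hypothetical and do not reflect the actual A2UI renderer libraries.

# Hypothetical client-side rendering loop; not the real A2UI renderer API.
WIDGET_CATALOG = {
    "Card":       lambda c, render: ["Card", [render(cid) for cid in c.get("children", [])]],
    "DatePicker": lambda c, render: ["DatePicker", c["label"]],
    "TimePicker": lambda c, render: ["TimePicker", c["label"]],
    "Button":     lambda c, render: ["Button", c["label"], c["action"]],
}

def render_payload(payload, root_id="root"):
    by_id = {c["id"]: c for c in payload["components"]}

    def render(component_id):
        component = by_id[component_id]
        factory = WIDGET_CATALOG.get(component["type"])
        if factory is None:
            return None   # unknown types are ignored, never executed: UI stays data, not code
        return factory(component, render)

    return render(root_id)

# e.g. render_payload(booking_form), reusing the booking_form dict from the earlier sketch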

Key Takeaways

A2UI is an open standard and library set from Google that lets agents ‘speak UI’ by sending a declarative JSON specification for interfaces, while clients render them using native components such as Angular, Flutter or Lit.

The specification focuses on security by treating UI as data, not code, so agents only reference a client controlled catalog of components, which reduces UI injection risk and avoids executing arbitrary scripts from model output.

The internal format uses an updateable, flat representation of components that is optimized for LLMs, which supports streaming and incremental updates, so agents can progressively refine the interface during a session.

A2UI is transport agnostic and is already used with the A2A protocol and AG UI, which allows orchestrator agents and remote sub agents to send UI payloads across trust boundaries while host applications keep control of branding, layout and accessibility.

The project is in an early-stage public preview at version v0.8, released under Apache 2.0, with reference renderers, quickstart samples and production integrations in projects such as Opal, Gemini Enterprise and Flutter GenUI, making it directly usable by engineers building agentic applications now.

Check out the Github Repo and Technical Details.
The post Google Introduces A2UI (Agent-to-User Interface): An Open Source Protocol for Agent Driven Interfaces appeared first on MarkTechPost.

How to Build a Fully Autonomous Local Fleet-Maintenance Analysis Agent …

In this tutorial, we walk through the process of creating a fully autonomous fleet-analysis agent using SmolAgents and a local Qwen model. We generate telemetry data, load it through a custom tool, and let our agent reason, analyze, and visualize maintenance risks without any external API calls. At each step of implementation, we see how the agent interprets structured logs, applies logical filters, detects anomalies, and finally produces a clear visual warning for fleet managers. Check out the FULL CODES here.

print("Installing libraries... (approx 30-60s)")
!pip install smolagents transformers accelerate bitsandbytes ddgs matplotlib pandas -q

import os
import pandas as pd
import matplotlib.pyplot as plt
from smolagents import CodeAgent, Tool, TransformersModel

We install all required libraries and import the core modules we rely on for building our agent. We set up SmolAgents, Transformers, and basic data-handling tools to process telemetry and run the local model smoothly. At this stage, we prepare our environment and ensure everything loads correctly before moving ahead. Check out the FULL CODES here.

fleet_data = {
    "truck_id": ["T-101", "T-102", "T-103", "T-104", "T-105"],
    "driver": ["Ali", "Sara", "Mike", "Omar", "Jen"],
    "avg_speed_kmh": [65, 70, 62, 85, 60],
    "fuel_efficiency_kml": [3.2, 3.1, 3.3, 1.8, 3.4],
    "engine_temp_c": [85, 88, 86, 105, 84],
    "last_maintenance_days": [30, 45, 120, 200, 15]
}
df = pd.DataFrame(fleet_data)
df.to_csv("fleet_logs.csv", index=False)
print("'fleet_logs.csv' created.")

We generate the dummy fleet dataset that our agent will later analyze. We create a small but realistic set of telemetry fields, convert it into a DataFrame, and save it as a CSV file. Here, we establish the core data source that drives the agent’s reasoning and predictions. Check out the FULL CODES here.

class FleetDataTool(Tool):
    name = "load_fleet_logs"
    description = "Loads vehicle telemetry logs from 'fleet_logs.csv'. Returns the data summary."
    inputs = {}
    output_type = "string"

    def forward(self):
        try:
            df = pd.read_csv("fleet_logs.csv")
            return f"Columns: {list(df.columns)}\nData Sample:\n{df.to_string()}"
        except Exception as e:
            return f"Error loading logs: {e}"

We define the FleetDataTool, which acts as the bridge between the agent and the underlying telemetry file. We give the agent the ability to load and inspect the CSV file to understand its structure. This tool becomes the foundation for every subsequent analysis the model performs. Check out the FULL CODES here.

print("Downloading & Loading Local Model (approx 60-90s)...")
model = TransformersModel(
    model_id="Qwen/Qwen2.5-Coder-1.5B-Instruct",
    device_map="auto",
    max_new_tokens=2048
)
print("Model loaded on GPU.")

agent = CodeAgent(
    tools=[FleetDataTool()],
    model=model,
    add_base_tools=True
)

print("\nAgent is analyzing fleet data... (Check the 'Agent' output below)\n")

query = """
1. Load the fleet logs.
2. Find the truck with the worst fuel efficiency (lowest 'fuel_efficiency_kml').
3. For that truck, check if it is overdue for maintenance (threshold is 90 days).
4. Create a bar chart comparing the 'fuel_efficiency_kml' of ALL trucks.
5. Highlight the worst truck in RED and others in GRAY on the chart.
6. Save the chart as 'maintenance_alert.png'.
"""
response = agent.run(query)

print(f"\nFINAL REPORT: {response}")

We load the Qwen2.5 local model and initialize our CodeAgent with the custom tool. We then craft a detailed query outlining the reasoning steps we want the agent to follow and execute it end-to-end. This is where we watch the agent think, analyze, compute, and even plot, fully autonomously. Check out the FULL CODES here.

if os.path.exists("maintenance_alert.png"):
    print("\nDisplaying Generated Chart:")
    img = plt.imread("maintenance_alert.png")
    plt.figure(figsize=(10, 5))
    plt.imshow(img)
    plt.axis('off')
    plt.show()
else:
    print("No chart image found. Check the agent logs above.")

We check whether the agent successfully saved the generated maintenance chart and display it if available. We visualize the output directly in the notebook, allowing us to confirm that the agent correctly performed data analysis and plotting. This gives us a clean, interpretable result from the entire workflow.

In conclusion, we built an intelligent end-to-end pipeline that enables a local model to autonomously load data, evaluate fleet health, identify the highest-risk vehicle, and generate a diagnostic chart for actionable insights. We see how easily we can extend this framework to real-world datasets, integrate more complex tools, or add multi-step reasoning capabilities for safety, efficiency, or predictive maintenance use cases. Finally, we appreciate how SmolAgents empowers us to create practical agentic systems that execute real code, reason over real telemetry, and deliver insights immediately.

Check out the FULL CODES here.
The post How to Build a Fully Autonomous Local Fleet-Maintenance Analysis Agent Using SmolAgents and Qwen Model appeared first on MarkTechPost.