Build multi-agent site reliability engineering assistants with Amazon Bedrock AgentCore

Site reliability engineers (SREs) face an increasingly complex challenge in modern distributed systems. During production incidents, they must rapidly correlate data from multiple sources—logs, metrics, Kubernetes events, and operational runbooks—to identify root causes and implement solutions. Traditional monitoring tools provide raw data but lack the intelligence to synthesize information across these diverse systems, often leaving SREs to manually piece together the story behind system failures.
With a generative AI solution, SREs can ask their infrastructure questions in natural language. For example, they can ask “Why are the payment-service pods crash looping?” or “What’s causing the API latency spike?” and receive comprehensive, actionable insights that combine infrastructure status, log analysis, performance metrics, and step-by-step remediation procedures. This capability transforms incident response from a manual, time-intensive process into a time-efficient, collaborative investigation.
In this post, we demonstrate how to build a multi-agent SRE assistant using Amazon Bedrock AgentCore, LangGraph, and the Model Context Protocol (MCP). This system deploys specialized AI agents that collaborate to provide the deep, contextual intelligence that modern SRE teams need for effective incident response and infrastructure management. We walk you through the complete implementation, from setting up the demo environment to deploying on Amazon Bedrock AgentCore Runtime for production use.
Solution overview
This solution uses a multi-agent architecture that addresses the challenges of modern SRE operations through intelligent automation. Four specialized AI agents work together under a supervisor agent to provide comprehensive infrastructure analysis and incident response assistance.
The examples in this post use synthetically generated data from our demo environment. The backend servers simulate realistic Kubernetes clusters, application logs, performance metrics, and operational runbooks. In production deployments, these stub servers would be replaced with connections to your actual infrastructure systems, monitoring services, and documentation repositories.
The architecture demonstrates several key capabilities:

Natural language infrastructure queries – You can ask complex questions about your infrastructure in plain English and receive detailed analysis combining data from multiple sources
Multi-agent collaboration – Specialized agents for Kubernetes, logs, metrics, and operational procedures work together to provide comprehensive insights
Real-time data synthesis – Agents access live infrastructure data through standardized APIs and present correlated findings
Automated runbook execution – Agents retrieve and display step-by-step operational procedures for common incident scenarios
Source attribution – Every finding includes explicit source attribution for verification and audit purposes

The following diagram illustrates the solution architecture.

The architecture demonstrates how the SRE support agent integrates seamlessly with Amazon Bedrock AgentCore components:

Customer interface – Receives alerts about degraded API response times and returns comprehensive agent responses
Amazon Bedrock AgentCore Runtime – Manages the execution environment for the multi-agent SRE solution
SRE support agent – Multi-agent collaboration system that processes incidents and orchestrates responses
Amazon Bedrock AgentCore Gateway – Routes requests to specialized tools through OpenAPI interfaces:

Kubernetes API for getting cluster events
Logs API for analyzing log patterns
Metrics API for analyzing performance trends
Runbooks API for searching operational procedures

Amazon Bedrock AgentCore Memory – Stores and retrieves session context and previous interactions for continuity
Amazon Bedrock AgentCore Identity – Handles authentication for tool access using Amazon Cognito integration
Amazon Bedrock AgentCore Observability – Collects and visualizes agent traces for monitoring and debugging
Amazon Bedrock LLMs – Powers the agent intelligence through Anthropic’s Claude large language models (LLMs)

The multi-agent solution uses a supervisor-agent pattern in which a central supervisor orchestrates four specialists, for five agents in total:

Supervisor agent – Analyzes incoming queries and creates investigation plans, routing work to appropriate specialists and aggregating results into comprehensive reports
Kubernetes infrastructure agent – Handles container orchestration and cluster operations, investigating pod failures, deployment issues, resource constraints, and cluster events
Application logs agent – Processes log data to find relevant information, identifies patterns and anomalies, and correlates events across multiple services
Performance metrics agent – Monitors system metrics and identifies performance issues, providing real-time analysis and historical trending
Operational runbooks agent – Provides access to documented procedures, troubleshooting guides, and escalation procedures based on the current situation

Using Amazon Bedrock AgentCore primitives
The solution showcases the power of Amazon Bedrock AgentCore by using multiple core primitives. It supports two providers for Anthropic's LLMs: Amazon Bedrock provides Anthropic's Claude 3.7 Sonnet for AWS-integrated deployments, and the Anthropic API provides Anthropic's Claude 4 Sonnet for direct API access.
The Amazon Bedrock AgentCore Gateway component converts the SRE agent’s backend APIs (Kubernetes, application logs, performance metrics, and operational runbooks) into Model Context Protocol (MCP) tools. This enables agents built with an open-source framework supporting MCP (such as LangGraph in this post) to seamlessly access infrastructure APIs.
Security for the entire solution is provided by Amazon Bedrock AgentCore Identity. It supports ingress authentication to control access for agents connecting to the gateway, and egress authentication to authenticate with the backend servers, providing secure API access without hardcoding credentials.
The serverless execution environment for deploying the SRE agent in production is provided by Amazon Bedrock AgentCore Runtime. It automatically scales from zero to handle concurrent incident investigations while maintaining complete session isolation. Amazon Bedrock AgentCore Runtime supports both OAuth and AWS Identity and Access Management (IAM) for agent authentication. Applications that invoke agents must have appropriate IAM permissions and trust policies. For more information, see Identity and access management for Amazon Bedrock AgentCore.
Amazon Bedrock AgentCore Memory transforms the SRE agent from a stateless system into an intelligent learning assistant that personalizes investigations based on user preferences and historical context. The memory component provides three distinct strategies:

User preferences strategy (/sre/users/{user_id}/preferences) – Stores individual user preferences for investigation style, communication channels, escalation procedures, and report formatting. For example, Alice (a technical SRE) receives detailed systematic analysis with troubleshooting steps, whereas Carol (an executive) receives business-focused summaries with impact analysis.
Infrastructure knowledge strategy (/sre/infrastructure/{user_id}/{session_id}) – Accumulates domain expertise across investigations, enabling agents to learn from past discoveries. When the Kubernetes agent identifies a memory leak pattern, this knowledge becomes available for future investigations, enabling faster root cause identification.
Investigation memory strategy (/sre/investigations/{user_id}/{session_id}) – Maintains historical context of past incidents and their resolutions. This enables the solution to suggest proven remediation approaches and avoid anti-patterns that previously failed.

The memory component demonstrates its value through personalized investigations. When both Alice and Carol investigate “API response times have degraded 3x in the last hour,” they receive identical technical findings but completely different presentations.
Alice receives a technical analysis:

memory_client.retrieve_user_preferences(user_id="Alice")
# Returns: {"investigation_style": "detailed_systematic_analysis", "reports": "technical_exposition_with_troubleshooting_steps"}

Carol receives an executive summary:

memory_client.retrieve_user_preferences(user_id="Carol")
# Returns: {"investigation_style": "business_impact_focused", "reports": "executive_summary_without_technical_details"}

Adding observability to the SRE agent
Adding observability to an SRE agent deployed on Amazon Bedrock AgentCore Runtime is straightforward using the Amazon Bedrock AgentCore Observability primitive. This enables comprehensive monitoring through Amazon CloudWatch with metrics, traces, and logs. Setting up observability requires three steps:

Add the OpenTelemetry packages to your pyproject.toml:

dependencies = [
    # ... other dependencies ...
    "opentelemetry-instrumentation-langchain",
    "aws-opentelemetry-distro~=0.10.1",
]

Configure observability for your agents to enable metrics in CloudWatch.
Start your container using the opentelemetry-instrument utility to automatically instrument your application.

The following command is added to the Dockerfile for the SRE agent:

# Run application with OpenTelemetry instrumentation
CMD ["uv", "run", "opentelemetry-instrument", "uvicorn", "sre_agent.agent_runtime:app", "--host", "0.0.0.0", "--port", "8080"]

As shown in the following screenshot, with observability enabled, you gain visibility into the following:

LLM invocation metrics – Token usage, latency, and model performance across agents
Tool execution traces – Duration and success rates for each MCP tool call
Memory operations – Retrieval patterns and storage efficiency
End-to-end request tracing – Complete request flow from user query to final response

The observability primitive automatically captures these metrics without additional code changes, providing production-grade monitoring capabilities out of the box.
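The automatic instrumentation covers the common cases; if you want to enrich traces with investigation-specific attributes, you can add custom spans with the standard OpenTelemetry API. The following is a minimal sketch, and the span and attribute names are illustrative rather than part of the SRE agent code:

# Minimal sketch: a custom span around an investigation step. Span and
# attribute names are illustrative; opentelemetry-instrument already captures
# LLM invocations and tool calls automatically.
from opentelemetry import trace

tracer = trace.get_tracer("sre_agent")

def run_investigation(query: str, user_id: str) -> str:
    with tracer.start_as_current_span("sre.investigation") as span:
        span.set_attribute("sre.user_id", user_id)
        span.set_attribute("sre.query", query)
        # ... invoke the supervisor agent here ...
        return "investigation report"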
Development to production flow
The SRE agent follows a four-step structured deployment process from local development to production, with detailed procedures documented in Development to Production Flow in the accompanying GitHub repo:

The deployment process maintains consistency across environments: the core agent code (sre_agent/) remains unchanged, and the deployment/ folder contains deployment-specific utilities. The same agent works locally and in production through environment configuration, with Amazon Bedrock AgentCore Gateway providing MCP tools access across different stages of development and deployment.
Implementation walkthrough
In the following section, we focus on how Amazon Bedrock AgentCore Gateway, Memory, and Runtime work together to build this multi-agent collaboration solution and deploy it end-to-end with MCP support and persistent intelligence.
We start by setting up the repository and establishing the local runtime environment with API keys, LLM providers, and demo infrastructure. We then bring core AgentCore components online by creating the gateway for standardized API access, configuring authentication, and establishing tool connectivity. We add intelligence through AgentCore Memory, creating strategies for user preferences and investigation history while loading personas for personalized incident response. Finally, we configure individual agents with specialized tools, integrate memory capabilities, orchestrate collaborative workflows, and deploy to AgentCore Runtime with full observability.
Detailed instructions for each step are provided in the repository:

Use Case Setup Guide – Backend deployment and development setup
Deployment Guide – Production containerization and Amazon Bedrock AgentCore Runtime deployment

Prerequisites
You can find the port forwarding requirements and other setup instructions in the README file’s Prerequisites section.
Convert APIs to MCP tools with Amazon Bedrock AgentCore Gateway
Amazon Bedrock AgentCore Gateway demonstrates the power of protocol standardization by converting existing backend APIs into MCP tools that agent frameworks can consume. This transformation happens seamlessly, requiring only OpenAPI specifications.
Upload OpenAPI specifications
The gateway process begins by uploading your existing API specifications to Amazon Simple Storage Service (Amazon S3). The create_gateway.sh script automatically handles uploading the four API specifications (Kubernetes, Logs, Metrics, and Runbooks) to your configured S3 bucket with proper metadata and content types. These specifications will be used to create API endpoint targets in the gateway.
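For reference, the upload step amounts to a few S3 calls. The following is a minimal sketch of what create_gateway.sh does; the bucket name and local file paths are placeholders for your environment:

# Minimal sketch of uploading the OpenAPI specs to S3 (handled by
# create_gateway.sh in the repo). Bucket name and local paths are placeholders.
import boto3

s3 = boto3.client("s3")
bucket = "my-sre-agent-specs"  # placeholder bucket name

for spec in ["k8s_api.yaml", "logs_api.yaml", "metrics_api.yaml", "runbooks_api.yaml"]:
    s3.upload_file(
        Filename=f"backend/openapi_specs/{spec}",  # hypothetical local path
        Bucket=bucket,
        Key=f"specs/{spec}",
        ExtraArgs={"ContentType": "application/yaml"},
    )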
Create an identity provider and gateway
Authentication is handled seamlessly through Amazon Bedrock AgentCore Identity. The main.py script creates both the credential provider and gateway:

from typing import Any, Dict

# Create AgentCore Gateway with JWT authorization
def create_gateway(
    client: Any,
    gateway_name: str,
    role_arn: str,
    discovery_url: str,
    allowed_clients: list = None,
    description: str = "AgentCore Gateway created via SDK",
    search_type: str = "SEMANTIC",
    protocol_version: str = "2025-03-26",
) -> Dict[str, Any]:
    # Build auth config for Cognito
    auth_config = {"customJWTAuthorizer": {"discoveryUrl": discovery_url}}
    if allowed_clients:
        auth_config["customJWTAuthorizer"]["allowedClients"] = allowed_clients

    protocol_configuration = {
        "mcp": {"searchType": search_type, "supportedVersions": [protocol_version]}
    }

    response = client.create_gateway(
        name=gateway_name,
        roleArn=role_arn,
        protocolType="MCP",
        authorizerType="CUSTOM_JWT",
        authorizerConfiguration=auth_config,
        protocolConfiguration=protocol_configuration,
        description=description,
        exceptionLevel="DEBUG",
    )
    return response
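A usage sketch for this function follows. We assume the AgentCore control plane is reachable through a boto3 client (shown here as bedrock-agentcore-control); the role ARN, Cognito discovery URL, and client ID are placeholders, and main.py in the repo shows the exact client setup:

# Usage sketch for create_gateway; see main.py in the repo for the full flow.
# We assume the control-plane boto3 client name; ARNs and URLs are placeholders.
import boto3

client = boto3.client("bedrock-agentcore-control", region_name="us-east-1")

gateway = create_gateway(
    client=client,
    gateway_name="sre-agent-gateway",
    role_arn="arn:aws:iam::123456789012:role/SREAgentGatewayRole",  # placeholder
    discovery_url="https://cognito-idp.us-east-1.amazonaws.com/<user-pool-id>/.well-known/openid-configuration",
    allowed_clients=["<cognito-app-client-id>"],  # placeholder
)
# Inspect the response for the gateway identifier and MCP endpoint URL
print(gateway)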

Deploy API endpoint targets with credential providers
Each API becomes an MCP target through the gateway. The solution automatically handles credential management:

def create_api_endpoint_target(
    client: Any,
    gateway_id: str,
    s3_uri: str,
    provider_arn: str,
    target_name_prefix: str = "open",
    description: str = "API Endpoint Target for OpenAPI schema",
) -> Dict[str, Any]:
    api_target_config = {"mcp": {"openApiSchema": {"s3": {"uri": s3_uri}}}}

    # API key credential provider configuration
    credential_config = {
        "credentialProviderType": "API_KEY",
        "credentialProvider": {
            "apiKeyCredentialProvider": {
                "providerArn": provider_arn,
                "credentialLocation": "HEADER",
                "credentialParameterName": "X-API-KEY",
            }
        },
    }

    response = client.create_gateway_target(
        gatewayIdentifier=gateway_id,
        name=target_name_prefix,
        description=description,
        targetConfiguration=api_target_config,
        credentialProviderConfigurations=[credential_config],
    )
    return response
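A usage sketch that registers each uploaded specification as a gateway target follows; the gateway ID, S3 URIs, and credential provider ARN are placeholders, and the target name prefixes match the tool names shown in the next section:

# Usage sketch: register each OpenAPI spec as an MCP target on the gateway.
# Gateway ID, bucket, and credential provider ARN are placeholders; the client
# is the same control-plane client used in the previous sketch.
specs = {
    "k8s-api": "s3://my-sre-agent-specs/specs/k8s_api.yaml",
    "logs-api": "s3://my-sre-agent-specs/specs/logs_api.yaml",
    "metrics-api": "s3://my-sre-agent-specs/specs/metrics_api.yaml",
    "runbooks-api": "s3://my-sre-agent-specs/specs/runbooks_api.yaml",
}

for name, s3_uri in specs.items():
    create_api_endpoint_target(
        client=client,
        gateway_id="<gateway-id>",  # from the create_gateway response
        s3_uri=s3_uri,
        provider_arn="<api-key-credential-provider-arn>",  # from AgentCore Identity
        target_name_prefix=name,
        description=f"{name} target for the SRE agent",
    )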

Validate MCP tools are ready for agent framework
Post-deployment, Amazon Bedrock AgentCore Gateway provides a standardized /mcp endpoint secured with JWT tokens. Testing the deployment with mcp_cmds.sh reveals the power of this transformation:

Tool summary:
================
Total tools found: 21

Tool names:
• x_amz_bedrock_agentcore_search
• k8s-api___get_cluster_events
• k8s-api___get_deployment_status
• k8s-api___get_node_status
• k8s-api___get_pod_status
• k8s-api___get_resource_usage
• logs-api___analyze_log_patterns
• logs-api___count_log_events
• logs-api___get_error_logs
• logs-api___get_recent_logs
• logs-api___search_logs
• metrics-api___analyze_trends
• metrics-api___get_availability_metrics
• metrics-api___get_error_rates
• metrics-api___get_performance_metrics
• metrics-api___get_resource_metrics
• runbooks-api___get_common_resolutions
• runbooks-api___get_escalation_procedures
• runbooks-api___get_incident_playbook
• runbooks-api___get_troubleshooting_guide
• runbooks-api___search_runbooks
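If you prefer Python over mcp_cmds.sh, you can list the same tools with the MCP Python SDK's streamable HTTP client. The following is a sketch only: the gateway URL and JWT token are placeholders, and the SDK's client API may differ slightly between versions:

# Sketch: list the gateway's MCP tools with the MCP Python SDK (equivalent to
# what mcp_cmds.sh does). URL and token are placeholders; the SDK API may
# differ slightly between versions.
import asyncio

from mcp import ClientSession
from mcp.client.streamable_http import streamablehttp_client

GATEWAY_URL = "https://<gateway-id>.gateway.bedrock-agentcore.us-east-1.amazonaws.com/mcp"  # placeholder
JWT_TOKEN = "<access-token-from-cognito>"  # placeholder

async def list_gateway_tools() -> None:
    headers = {"Authorization": f"Bearer {JWT_TOKEN}"}
    async with streamablehttp_client(GATEWAY_URL, headers=headers) as (read, write, _):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            for tool in tools.tools:
                print(tool.name)

asyncio.run(list_gateway_tools())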

Universal agent framework compatibility
This MCP-standardized gateway can now be configured as a Streamable-HTTP server for MCP clients, including AWS Strands (Amazon's agent development framework), LangGraph (the framework used in our SRE agent implementation), and CrewAI (a multi-agent collaboration framework).
The advantage of this approach is that existing APIs require no modification—only OpenAPI specifications. Amazon Bedrock AgentCore Gateway handles the following:

Protocol translation – From REST APIs to MCP
Authentication – JWT token validation and credential injection
Security – TLS termination and access control
Standardization – Consistent tool naming and parameter handling

This means you can take existing infrastructure APIs (Kubernetes, monitoring, logging, documentation) and instantly make them available to AI agent frameworks that support MCP—through a single, secure, standardized interface.
Implement persistent intelligence with Amazon Bedrock AgentCore Memory
Whereas Amazon Bedrock AgentCore Gateway provides seamless API access, Amazon Bedrock AgentCore Memory transforms the SRE agent from a stateless system into an intelligent, learning assistant. The memory implementation demonstrates how a few lines of code can enable sophisticated personalization and cross-session knowledge retention.
Initialize memory strategies
The SRE agent memory component is built on Amazon Bedrock AgentCore Memory’s event-based model with automatic namespace routing. During initialization, the solution creates three memory strategies with specific namespace patterns:

from sre_agent.memory.client import SREMemoryClient
from sre_agent.memory.strategies import create_memory_strategies

# Initialize memory client
memory_client = SREMemoryClient(
    memory_name="sre_agent_memory",
    region="us-east-1"
)

# Create three specialized memory strategies
strategies = create_memory_strategies()
for strategy in strategies:
    memory_client.create_strategy(strategy)

The three strategies each serve distinct purposes:

User preferences (/sre/users/{user_id}/preferences) – Individual investigation styles and communication preferences
Infrastructure knowledge (/sre/infrastructure/{user_id}/{session_id}) – Domain expertise accumulated across investigations
Investigation summaries (/sre/investigations/{user_id}/{session_id}) – Historical incident patterns and resolutions
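For illustration, the following hypothetical sketch shows the shape these strategy definitions could take; the actual field names and construction live in sre_agent/memory/strategies.py in the repo:

# Hypothetical sketch of the strategy definitions (the real implementation
# lives in sre_agent/memory/strategies.py); field names are illustrative.
def create_memory_strategies() -> list[dict]:
    return [
        {
            "name": "user_preferences",
            "namespaces": ["/sre/users/{user_id}/preferences"],
            "description": "Investigation style, channels, escalation, report format",
        },
        {
            "name": "infrastructure_knowledge",
            "namespaces": ["/sre/infrastructure/{user_id}/{session_id}"],
            "description": "Domain expertise accumulated across investigations",
        },
        {
            "name": "investigation_summaries",
            "namespaces": ["/sre/investigations/{user_id}/{session_id}"],
            "description": "Historical incident patterns and resolutions",
        },
    ]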

Load user personas and preferences
The solution comes preconfigured with user personas that demonstrate personalized investigations. The manage_memories.py script loads these personas:

# Load Alice - Technical SRE Engineer
alice_preferences = {
    "investigation_style": "detailed_systematic_analysis",
    "communication": ["#alice-alerts", "#sre-team"],
    "escalation": {"contact": "alice.manager@company.com", "threshold": "15min"},
    "reports": "technical_exposition_with_troubleshooting_steps",
    "timezone": "UTC"
}

# Load Carol - Executive/Director
carol_preferences = {
    "investigation_style": "business_impact_focused",
    "communication": ["#carol-executive", "#strategic-alerts"],
    "escalation": {"contact": "carol.director@company.com", "threshold": "5min"},
    "reports": "executive_summary_without_technical_details",
    "timezone": "EST"
}

# Store preferences using memory client
memory_client.store_user_preference("Alice", alice_preferences)
memory_client.store_user_preference("Carol", carol_preferences)

Automatic namespace routing in action
The power of Amazon Bedrock AgentCore Memory lies in its automatic namespace routing. When the SRE agent creates events, it only needs to provide the actor_id—Amazon Bedrock AgentCore Memory automatically determines which namespaces the event belongs to:

# During investigation, the supervisor agent stores context
memory_client.create_event(
    memory_id="sre_agent_memory-abc123",
    actor_id="Alice",  # AgentCore Memory routes this automatically
    session_id="investigation_2025_01_15",
    messages=[("investigation_started", "USER")]
)

# Memory system automatically:
# 1. Checks all strategy namespaces
# 2. Matches actor_id "Alice" to /sre/users/Alice/preferences
# 3. Stores event in User Preferences Strategy
# 4. Makes event available for future retrievals
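Retrieval is scoped by the same namespaces. The following sketch is hypothetical; the method and parameter names are illustrative, and the real wrapper is sre_agent/memory/client.py in the repo:

# Hypothetical retrieval sketch: pull infrastructure knowledge relevant to the
# current query from Alice's session namespace. Method and parameter names are
# illustrative; see sre_agent/memory/client.py for the real wrapper.
past_findings = memory_client.retrieve_memories(
    namespace="/sre/infrastructure/Alice/investigation_2025_01_15",
    query="payment-service pods crash looping",
    top_k=5,
)
for finding in past_findings:
    print(finding)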

Validate the personalized investigation experience
The memory component’s impact becomes clear when both Alice and Carol investigate the same issue. Using identical technical findings, the solution produces completely different presentations of the same underlying content.
Alice’s technical report contains detailed systematic analysis for technical teams:

Technical Investigation Summary

Root Cause: Payment processor memory leak causing OOM kills

Analysis:
– Pod restart frequency increased 300% at 14:23 UTC
– Memory utilization peaked at 8.2GB (80% of container limit)
– JVM garbage collection latency spiked to 2.3s

Next Steps:
1. Implement heap dump analysis (`kubectl exec payment-pod -- jmap`)
2. Review recent code deployments for memory management changes
3. Consider increasing memory limits and implementing graceful shutdown

Carol's executive summary focuses on business impact for executive stakeholders:

Business Impact Assessment
Status: CRITICAL – Customer payment processing degraded
Impact: 23% transaction failure rate, $47K revenue at risk
Timeline: Issue detected 14:23 UTC, resolution ETA 45 minutes
Business Actions:
- Customer communication initiated via status page
- Finance team alerted for revenue impact tracking
- Escalating to VP Engineering if not resolved by 15:15 UTC

The memory component enables this personalization while continuously learning from each investigation, building organizational knowledge that improves incident response over time.
Deploy to production with Amazon Bedrock AgentCore Runtime
Amazon Bedrock AgentCore makes it straightforward to deploy existing agents to production. The process involves three key steps: containerizing your agent, deploying to Amazon Bedrock AgentCore Runtime, and invoking the deployed agent.
Containerize your agent
Amazon Bedrock AgentCore Runtime requires ARM64 containers. The following code shows the complete Dockerfile:

# Use uv's ARM64 Python base image
FROM --platform=linux/arm64 ghcr.io/astral-sh/uv:python3.12-bookworm-slim

WORKDIR /app

# Copy uv files
COPY pyproject.toml uv.lock ./

# Install dependencies
RUN uv sync --frozen --no-dev

# Copy SRE agent module
COPY sre_agent/ ./sre_agent/

# Set environment variables
# Note: Set DEBUG=true to enable debug logging and traces
ENV PYTHONPATH="/app" \
    PYTHONDONTWRITEBYTECODE=1 \
    PYTHONUNBUFFERED=1

# Expose port
EXPOSE 8080

# Run application with OpenTelemetry instrumentation
CMD ["uv", "run", "opentelemetry-instrument", "uvicorn", "sre_agent.agent_runtime:app", "--host", "0.0.0.0", "--port", "8080"]

Existing agents just need a FastAPI wrapper (agent_runtime:app) to become compatible with Amazon Bedrock AgentCore, and we add opentelemetry-instrument to enable observability through Amazon Bedrock AgentCore.
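The wrapper itself is small. The following is a minimal sketch of the shape of agent_runtime:app, assuming the Amazon Bedrock AgentCore Runtime HTTP contract of a POST /invocations endpoint and a GET /ping health check; the actual sre_agent/agent_runtime.py also wires in memory, the gateway token, and the LangGraph supervisor:

# Minimal sketch of a FastAPI wrapper for AgentCore Runtime (the real
# sre_agent/agent_runtime.py does more). We assume the runtime contract of
# POST /invocations for requests and GET /ping for health checks.
from fastapi import FastAPI

app = FastAPI()

@app.get("/ping")
def ping() -> dict:
    return {"status": "healthy"}

@app.post("/invocations")
def invocations(payload: dict) -> dict:
    prompt = payload.get("input", {}).get("prompt", "")
    user_id = payload.get("input", {}).get("user_id", "default")
    # ... invoke the LangGraph supervisor agent here ...
    return {"output": f"Investigation for {user_id}: {prompt}"}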
Deploy to Amazon Bedrock AgentCore Runtime
Deploying to Amazon Bedrock AgentCore Runtime is straightforward with the deploy_agent_runtime.py script:

import boto3

# Create AgentCore client
client = boto3.client('bedrock-agentcore', region_name=region)

# Environment variables for your agent
env_vars = {
    'GATEWAY_ACCESS_TOKEN': gateway_access_token,
    'LLM_PROVIDER': llm_provider,
    'ANTHROPIC_API_KEY': anthropic_api_key  # if using Anthropic
}

# Deploy container to AgentCore Runtime
response = client.create_agent_runtime(
    agentRuntimeName=runtime_name,
    agentRuntimeArtifact={
        'containerConfiguration': {
            'containerUri': container_uri  # Your ECR container URI
        }
    },
    networkConfiguration={"networkMode": "PUBLIC"},
    roleArn=role_arn,
    environmentVariables=env_vars
)

print(f"Agent Runtime ARN: {response['agentRuntimeArn']}")

Amazon Bedrock AgentCore handles the infrastructure, scaling, and session management automatically.
Invoke your deployed agent
Calling your deployed agent is just as simple with invoke_agent_runtime.py:

import json

# Prepare your query with user_id and session_id for memory personalization
payload = json.dumps({
    "input": {
        "prompt": "API response times have degraded 3x in the last hour",
        "user_id": "Alice",  # User for personalized investigation
        "session_id": "investigation-20250127-123456"  # Session for context
    }
})

# Invoke the deployed agent
response = agent_core_client.invoke_agent_runtime(
    agentRuntimeArn=runtime_arn,
    runtimeSessionId=session_id,
    payload=payload,
    qualifier="DEFAULT"
)

# Get the response
response_data = json.loads(response['response'].read())
print(response_data)  # Full response includes output with agent's investigation

Key benefits of Amazon Bedrock AgentCore Runtime
Amazon Bedrock AgentCore Runtime offers the following key benefits:

Zero infrastructure management – No servers, load balancers, or scaling to configure
Built-in session isolation – Each conversation is completely isolated
AWS IAM integration – Secure access control without custom authentication
Automatic scaling – Scales from zero to thousands of concurrent sessions

The complete deployment process, including building containers and handling AWS permissions, is documented in the Deployment Guide.
Real-world use cases
Let’s explore how the SRE agent handles common incident response scenarios with a real investigation.
When facing a production issue, you can query the system in natural language. The solution uses Amazon Bedrock AgentCore Memory to personalize the investigation based on your role and preferences:

export USER_ID=Alice
sre-agent --prompt "API response times have degraded 3x in the last hour"

The supervisor retrieves Alice’s preferences from memory (detailed systematic analysis style) and creates an investigation plan tailored to her role as a Technical SRE:

Investigation Plan
1. Use metrics_agent to analyze API performance metrics including response times, error rates, and resource utilization to identify the extent and pattern of slowdown
2. Use logs_agent to examine application logs for errors related to slow API responses, focusing on database connection issues and memory errors
3. Use kubernetes_agent to check pod status and resource constraints, especially for web-service and database pods, looking for CrashLoopBackOff states and missing ConfigMaps
Complexity: Simple
Auto-execute: Yes
Agents involved: Metrics Agent, Logs Agent, Kubernetes Agent

The agents investigate sequentially according to the plan, each contributing their specialized analysis. The solution then aggregates these findings into a comprehensive executive summary:

Executive Summary
Key Insights
– Root Cause: Database service failure due to missing ConfigMap ‘database-config’ in production namespace, causing database pod to crash repeatedly
– Impact: Severe performance degradation with API response times increased from 150ms to 5000ms (33x slower)
– Severity: High – Database unavailability, memory exhaustion (100%), and CPU saturation (95%) causing 75% error rate
Next Steps
1. Immediate (< 1 hour): Create/update ConfigMap ‘database-config’ in production namespace and restart database pod
2. Short-term (< 24 hours):
– Fix permissions on ‘/var/lib/postgresql/data’ directory
– Increase Java heap space for web-service to address OutOfMemoryErrors
– Optimize UserService.loadAllUsers method causing memory issues
3. Long-term (< 1 week):
– Implement resource monitoring with alerts for CPU (>80%), memory (>90%)
– Optimize slow database queries, particularly "SELECT * FROM users WHERE status='active'"
– Scale up resources or implement autoscaling for web-service
Critical Alerts
– Database pod (database-pod-7b9c4d8f2a-x5m1q) in CrashLoopBackOff state
– Web-service experiencing OutOfMemoryErrors in UserService.loadAllUsers(UserService.java:45)
– Node-3 experiencing memory pressure (>85% usage)
– Web-app-deployment showing readiness probe failures with 503 errors
Troubleshooting Steps
1. Verify ConfigMap status: `kubectl get configmap database-config -n production`
2. Check database pod logs: `kubectl logs database-pod-7b9c4d8f2a-x5m1q -n production`
3. Create/update ConfigMap: `kubectl create configmap database-config --from-file=database.conf -n production`
4. Fix data directory permissions: `kubectl exec database-pod-7b9c4d8f2a-x5m1q -n production -- chmod -R 700 /var/lib/postgresql/data`
5. Restart database pod: `kubectl delete pod database-pod-7b9c4d8f2a-x5m1q -n production`

This investigation demonstrates how Amazon Bedrock AgentCore primitives work together:

Amazon Bedrock AgentCore Gateway – Provides secure access to infrastructure APIs through MCP tools
Amazon Bedrock AgentCore Identity – Handles ingress and egress authentication
Amazon Bedrock AgentCore Runtime – Hosts the multi-agent solution with automatic scaling
Amazon Bedrock AgentCore Memory – Personalizes Alice’s experience and stores investigation knowledge for future incidents
Amazon Bedrock AgentCore Observability – Captures detailed metrics and traces in CloudWatch for monitoring and debugging

The SRE agent demonstrates intelligent agent orchestration, with the supervisor routing work to specialists based on the investigation plan. The solution’s memory capabilities make sure each investigation builds organizational knowledge and provides personalized experiences based on user roles and preferences.
This investigation showcases several key capabilities:

Multi-source correlation – It connects database configuration issues to API performance degradation
Sequential investigation – Agents work systematically through the investigation plan while providing live updates
Source attribution – Findings include the specific tool and data source
Actionable insights – It provides a clear timeline of events and prioritized recovery steps
Cascading failure detection – It can help show how one failure propagates through the system

Business impact
Organizations implementing AI-powered SRE assistance report significant improvements in key operational metrics. Initial investigations that previously took 30–45 minutes can now be completed in 5–10 minutes, providing SREs with comprehensive context before diving into detailed analysis. This dramatic reduction in investigation time translates directly to faster incident resolution and reduced downtime.
The solution improves how SREs interact with their infrastructure. Instead of navigating multiple dashboards and tools, engineers can ask questions in natural language and receive aggregated insights from relevant data sources. This reduction in context switching enables teams to maintain focus during critical incidents and reduces cognitive load during investigations.
Perhaps most importantly, the solution democratizes knowledge across the team. All team members can access the same comprehensive investigation techniques, reducing dependency on tribal knowledge and on-call burden. The consistent methodology provided by the solution makes sure investigation approaches remain uniform across team members and incident types, improving overall reliability and reducing the chance of missed evidence.
The automatically generated investigation reports provide valuable documentation for post-incident reviews and help teams learn from each incident, building organizational knowledge over time. Furthermore, the solution extends existing AWS infrastructure investments, working alongside services like Amazon CloudWatch, AWS Systems Manager, and other AWS operational tools to provide a unified operational intelligence system.
Extending the solution
The modular architecture makes it straightforward to extend the solution for your specific needs.
For example, you can add specialized agents for your domain:

Security agent – For compliance checks and security incident response
Database agent – For database-specific troubleshooting and optimization
Network agent – For connectivity and infrastructure debugging

You can also replace the demo APIs with connections to your actual systems:

Kubernetes integration – Connect to your cluster APIs for pod status, deployments, and events
Log aggregation – Integrate with your log management service (Elasticsearch, Splunk, CloudWatch Logs)
Metrics platform – Connect to your monitoring service (Prometheus, Datadog, CloudWatch Metrics)
Runbook repository – Link to your operational documentation and playbooks stored in wikis, Git repositories, or knowledge bases
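In both cases the gateway pattern stays the same: upload an OpenAPI specification and register it as another target. The following hypothetical sketch reuses create_api_endpoint_target from earlier to expose a security API; the spec location, gateway ID, and provider ARN are placeholders:

# Hypothetical sketch: expose a security/compliance API to the agents by
# registering another OpenAPI target on the existing gateway.
# The spec location, gateway ID, and provider ARN are placeholders.
create_api_endpoint_target(
    client=client,
    gateway_id="<gateway-id>",
    s3_uri="s3://my-sre-agent-specs/specs/security_api.yaml",
    provider_arn="<api-key-credential-provider-arn>",
    target_name_prefix="security-api",
    description="Security and compliance checks for the SRE agent",
)
# A new "security agent" can then be given the resulting security-api___* tools.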

Clean up
To avoid incurring future charges, use the cleanup script to remove the billable AWS resources created during the demo:

# Complete cleanup – deletes AWS resources and local files
./scripts/cleanup.sh

This script automatically performs the following actions:

Stop backend servers
Delete the gateway and its targets
Delete Amazon Bedrock AgentCore Memory resources
Delete the Amazon Bedrock AgentCore Runtime
Remove generated files (gateway URIs, tokens, agent ARNs, memory IDs)

For detailed cleanup instructions, refer to Cleanup Instructions.
Conclusion
The SRE agent demonstrates how multi-agent systems can transform incident response from a manual, time-intensive process into a time-efficient, collaborative investigation that provides SREs with the insights they need to resolve issues quickly and confidently.
By combining the enterprise-grade infrastructure of Amazon Bedrock AgentCore with standardized tool access in MCP, we’ve created a foundation that can adapt as your infrastructure evolves and new capabilities emerge.
The complete implementation is available in our GitHub repository, including demo environments, configuration guides, and extension examples. We encourage you to explore the solution, customize it for your infrastructure, and share your experiences with the community.
To get started building your own SRE assistant, refer to the following resources:

Automate tasks in your application using AI agents
Amazon Bedrock AgentCore Samples GitHub repository
Model Context Protocol documentation
LangGraph documentation

About the authors
Amit Arora is an AI and ML Specialist Architect at Amazon Web Services, helping enterprise customers use cloud-based machine learning services to rapidly scale their innovations. He is also an adjunct lecturer in the MS data science and analytics program at Georgetown University in Washington, D.C.
Dheeraj Oruganty is a Delivery Consultant at Amazon Web Services. He is passionate about building innovative Generative AI and Machine Learning solutions that drive real business impact. His expertise spans Agentic AI Evaluations, Benchmarking and Agent Orchestration, where he actively contributes to research advancing the field. He holds a master’s degree in Data Science from Georgetown University. Outside of work, he enjoys geeking out on cars, motorcycles, and exploring nature.

OpenAI Releases ChatGPT 'Pulse': Proactive, Personalized Daily Briefings for Pro Users

OpenAI introduced ChatGPT Pulse, a proactive experience that compiles personalized, research-backed updates each morning. In preview on mobile and limited to $200/month Pro subscribers, Pulse surfaces topical cards built from a user’s chats, explicit feedback, and opt-in connected apps (e.g., calendar/email), shifting ChatGPT from a request-driven tool to a context-aware assistant.

What Pulse Actually Does Under the Hood

Each day, Pulse performs background research anchored to user signals: recent conversations, long-term interests, thumbs-up/down feedback, and data from connected apps where enabled. The output appears as scannable visual cards (briefs and deep links) rather than an infinite feed, designed for quick triage and drill-down. Early examples include targeted news roundups and context-conditioned suggestions (e.g., travel planning aligned with calendar events).

Data Sources and Controls

Integrations are off by default and can be toggled. When granted, Pulse may use Gmail/Google Calendar context to tailor cards (e.g., meeting prep, itinerary nudges). OpenAI positions this as a user-level personalization layer; reporting notes emphasize optionality and in-app settings for managing connected accounts and memory.

Availability and Rollout Plan

Pulse is rolling out now to Pro on the ChatGPT mobile app as a dedicated tab. OpenAI says it wants broader availability “soon,” with Plus access targeted after product and efficiency improvements. The company reiterated the Pro-first gating due to compute costs.

Product Positioning: Toward Agentic, Goal-Oriented Workflows

OpenAI frames Pulse as the first step toward agent-like behavior where the model tracks goals and initiates updates without prompts. External coverage highlights the shift from chat to assistant workflows that reason over user state and schedule. This aligns with OpenAI’s recent emphasis on agents and proactive help, not passive Q&A.

The Signal from Leadership

Sam Altman summarized the intent succinctly: Pulse is his “favorite feature” to date, starting with Pro. His post also underscores the model’s use of interests and recent chats, hinting at broader personalization as users share preferences over time. OpenAI’s official announcement on X mirrors the blog language around daily, proactive updates.

"Today we are launching my favorite feature of ChatGPT so far, called Pulse. It is initially available to Pro subscribers. Pulse works for you overnight, and keeps thinking about your interests, your connected data, your recent chats, and more. Every morning, you get a…" — Sam Altman (@sama), September 25, 2025

Competitive Context

Pulse lands in a crowded “morning brief” space but differs by tying briefs to your live context and chats rather than generic headlines. It also inches ChatGPT toward hands-on assistant territory seen in agent platforms that watch calendars, draft emails, and pre-stage tasks—yet packaged for consumers inside the ChatGPT app rather than a separate agent runner.

Summary

Pulse formalizes ChatGPT as a proactive system: it reads your signals, checks your day, and delivers a compact, personalized brief—first for Pro on mobile, with Plus on the roadmap once the system is optimized. The implementation details (APIs, enterprise knobs, retention policies) will determine how far it goes beyond morning cards into full agent workflows.


OpenAI Introduces GDPval: A New Evaluation Suite that Measures AI on Real-World Economically Valuable Tasks

OpenAI introduced GDPval, a new evaluation suite designed to measure how AI models perform on real-world, economically valuable tasks across 44 occupations in nine GDP-dominant U.S. sectors. Unlike academic benchmarks, GDPval centers on authentic deliverables—presentations, spreadsheets, briefs, CAD artifacts, audio/video—graded by occupational experts through blinded pairwise comparisons. OpenAI also released a 220-task “gold” subset and an experimental automated grader hosted at evals.openai.com.

From Benchmarks to Billables: How GDPval Builds Tasks

GDPval aggregates 1,320 tasks sourced from industry professionals averaging 14 years of experience. Tasks map to O*NET work activities and include multi-modal file handling (docs, slides, images, audio, video, spreadsheets, CAD), with up to dozens of reference files per task. The gold subset provides public prompts and references; primary scoring still relies on expert pairwise judgments due to subjectivity and format requirements.

https://openai.com/index/gdpval/

What the Data Says: Model vs. Expert

On the gold subset, frontier models approach expert quality on a substantial fraction of tasks under blind expert review, with model progress trending roughly linearly across releases. Reported model-vs-human win/tie rates are near parity for top models; error profiles cluster around instruction-following, formatting, data usage, and hallucinations. Increased reasoning effort and stronger scaffolding (e.g., format checks, artifact rendering for self-inspection) yield predictable gains.

Time–Cost Math: Where AI Pays Off

GDPval runs scenario analyses comparing human-only to model-assisted workflows with expert review. It quantifies (i) human completion time and wage-based cost, (ii) reviewer time/cost, (iii) model latency and API cost, and (iv) empirically observed win rates. Results indicate potential time/cost reductions for many task classes once review overhead is included.

Automated Judging: Useful Proxy, Not Oracle

For the gold subset, an automated pairwise grader shows ~66% agreement with human experts, within ~5 percentage points of human–human agreement (~71%). It’s positioned as an accessibility proxy for rapid iteration, not a replacement for expert review.

https://openai.com/index/gdpval/

Why This Isn’t Yet Another Benchmark

Occupational breadth: Spans top GDP sectors and a wide slice of O*NET work activities, not just narrow domains.

Deliverable realism: Multi-file, multi-modal inputs/outputs stress structure, formatting, and data handling.

Moving ceiling: Uses human preference win rate against expert deliverables, enabling re-baselining as models improve.

Boundary Conditions: Where GDPval Doesn’t Reach

GDPval-v0 targets computer-mediated knowledge work. Physical labor, long-horizon interactivity, and organization-specific tooling are out of scope. Tasks are one-shot and precisely specified; ablations show performance drops with reduced context. Construction and grading are resource-intensive, motivating the automated grader—whose limits are documented—and future expansion.

Fit in the Stack: How GDPval Complements Other Evals

GDPval augments existing OpenAI evals with occupational, multi-modal, file-centric tasks and reports human preference outcomes, time/cost analyses, and ablations on reasoning effort and agent scaffolding. v0 is versioned and expected to broaden coverage and realism over time.

Summary

GDPval formalizes evaluation for economically relevant knowledge work by pairing expert-built tasks with blinded human preference judgments and an accessible automated grader. The framework quantifies model quality and practical time/cost trade-offs while exposing failure modes and the effects of scaffolding and reasoning effort. Scope remains v0—computer-mediated, one-shot tasks with expert review—yet it establishes a reproducible baseline for tracking real-world capability gains across occupations.

Check out the Paper, Technical details, and Dataset on Hugging Face.


Meta FAIR Released Code World Model (CWM): A 32-Billion-Parameter Open-Weights LLM to Advance Research on Code Generation with World Models

Meta FAIR released Code World Model (CWM), a 32-billion-parameter dense decoder-only LLM that injects world modeling into code generation by training on execution traces and long-horizon agent–environment interactions—not just static source text.

What’s new: learning code by predicting execution?

CWM mid-trains on two large families of observation–action trajectories: (1) Python interpreter traces that record local variable states after each executed line, and (2) agentic interactions inside Dockerized repositories that capture edits, shell commands, and test feedback. This grounding is intended to teach semantics (how state evolves) rather than only syntax.

To scale collection, the research team built executable repository images from thousands of GitHub projects and foraged multi-step trajectories via a software-engineering agent (“ForagerAgent”). The release reports ~3M trajectories across ~10k images and 3.15k repos, with mutate-fix and issue-fix variants.

https://ai.meta.com/research/publications/cwm-an-open-weights-llm-for-research-on-code-generation-with-world-models/

Model and context window

CWM is a dense, decoder-only Transformer (no MoE) with 64 layers, GQA (48Q/8KV), SwiGLU, RMSNorm, and Scaled RoPE. Attention alternates local 8k and global 131k sliding-window blocks, enabling 131k tokens effective context; training uses document-causal masking.

Training recipe (pre → mid → post)

General pretraining: 8T tokens (code-heavy) at 8k context.

Mid-training: +5T tokens, long-context (131k) with Python execution traces, ForagerAgent data, PR-derived diffs, IR/compilers, Triton kernels, and Lean math.

Post-training: 100B-token SFT for instruction + reasoning, then multi-task RL (~172B-token) across verifiable coding, math, and multi-turn SWE environments using a GRPO-style algorithm and a minimal toolset (bash/edit/create/submit).

Quantized inference fits on a single 80 GB H100.

Benchmarks

The research team cites the following pass@1 / scores (test-time scaling noted where applicable):

SWE-bench Verified: 65.8% (with test-time scaling).

LiveCodeBench-v5: 68.6%; LCB-v6: 63.5%.

Math-500: 96.6%; AIME-24: 76.0%; AIME-25: 68.2%.

CruxEval-Output: 94.3%.

The research team positions CWM as competitive with similarly sized open-weights baselines and even with larger or closed models on SWE-bench Verified.

For context on SWE-bench Verified’s task design and metrics, see the official benchmark resources.

https://ai.meta.com/research/publications/cwm-an-open-weights-llm-for-research-on-code-generation-with-world-models/

Why world modeling matters for code

The release emphasizes two operational capabilities:

Execution-trace prediction: given a function and a trace start, CWM predicts stack frames (locals) and the executed line at each step via a structured format—usable as a “neural debugger” for grounded reasoning without live execution.

Agentic coding: multi-turn reasoning with tool use against real repos, verified by hidden tests and patch similarity rewards; the setup trains the model to localize faults and generate end-to-end patches (git diff) rather than snippets.

Some details worth noting

Tokenizer: Llama-3 family with reserved control tokens; reserved IDs are used to demarcate trace and reasoning segments during SFT.

Attention layout: the 3:1 local:global interleave is repeated across the depth; long-context training occurs at large token batch sizes to stabilize gradients.

Compute scaling: learning-rate/batch size schedules are derived from internal scaling-law sweeps tailored for long-context overheads.

Summary

CWM is a pragmatic step toward grounded code generation: Meta ties a 32B dense transformer to execution-trace learning and agentic, test-verified patching, releases intermediate/post-trained checkpoints, and gates usage under the FAIR Non-Commercial Research License—making it a useful platform for reproducible ablations on long-context, execution-aware coding without conflating research with production deployment.

Check out the Paper, GitHub Page, and Model on Hugging Face.


DoWhile loops now supported in Amazon Bedrock Flows

Today, we are excited to announce support for DoWhile loops in Amazon Bedrock Flows. With this powerful new capability, you can create iterative, condition-based workflows directly within your Amazon Bedrock flows, using Prompt nodes, AWS Lambda functions, Amazon Bedrock Agents, Amazon Bedrock Flows inline code, Amazon Bedrock Knowledge Bases, Amazon Simple Storage Service (Amazon S3), and other Amazon Bedrock nodes within the loop structure. This feature avoids the need for complex workarounds, enabling sophisticated iteration patterns that use the full range of Amazon Bedrock Flows components. Tasks like content refinement, recursive analysis, and multi-step processing can now seamlessly integrate AI model calls, custom code execution, and knowledge retrieval in repeated cycles. By providing loop support with diverse node types, this feature simplifies generative AI application development and accelerates enterprise adoption of complex, adaptive AI solutions.
Organizations using Amazon Bedrock Flows can now use DoWhile loops to design and deploy workflows for building more scalable and efficient generative AI applications fully within the Amazon Bedrock environment while achieving the following:

Iterative processing – Execute repeated operations until specific conditions are met, enabling dynamic content refinement and recursive improvements
Conditional logic – Implement sophisticated decision-making within flows based on AI outputs and business rules
Complex use cases – Manage multi-step generative AI workflows that require repeated execution and refinement
Builder-friendly – Create and manage loops through both the Amazon Bedrock API and AWS Management Console
Observability – Employ seamless tracking of loop iterations, conditions, and execution paths

In this post, we discuss the benefits of this new feature, and show how to use DoWhile loops in Amazon Bedrock Flows.
Benefits of DoWhile loops in Amazon Bedrock Flows
Using DoWhile loops in Amazon Bedrock Flows offers the following benefits:

Simplified flow control – Create sophisticated iterative workflows without complex orchestration or external services
Flexible processing – Enable dynamic, condition-based execution paths that can adapt based on AI outputs and business rules
Enhanced development experience – Help users build complex iterative workflows through an intuitive interface, without requiring external workflow management

Solution overview
In the following sections, we show how to create a simple Amazon Bedrock flow using DoWhile loops with Lambda functions. Our example showcases a practical application where we construct a flow that generates a blog post on a given topic in an iterative manner until certain acceptance criteria are fulfilled. The flow demonstrates the power of combining different types of Amazon Bedrock Flows nodes within a loop structure, where Prompt nodes generate and fine-tune the blog post, Inline Code nodes allow writing custom Python code to analyze the outputs, and S3 Storage nodes enable storing each version of the blog post during the process for reference. The DoWhile loop continues to execute until the quality of the blog post meets the condition set in the loop controller. This example illustrates how different flow nodes can work together within a loop to progressively transform data until desired conditions are met, providing a foundation for understanding more complex iterative workflows with various node combinations.
Prerequisites
Before implementing the new capabilities, make sure you have the following:

An AWS account
Other Amazon Bedrock services in place:

Create and test your base prompts for customer service interactions in Amazon Bedrock Prompt Management
Create guardrails with relevant rules using Amazon Bedrock Guardrails

Resources in auxiliary AWS services needed for your workflow, such as Lambda, Amazon DynamoDB, and Amazon S3
Required AWS Identity and Access Management (IAM) permissions:

Access to Amazon Bedrock Flows
Appropriate access to large language models (LLMs) in Amazon Bedrock

After these components are in place, you can proceed with using Amazon Bedrock Flows with DoWhile loop capabilities in your generative AI use case.
Create your flow using DoWhile Loop nodes
Complete the following steps to create your flow:

On the Amazon Bedrock console, choose Flows under Builder tools in the navigation pane.
Create a new flow, for example, dowhile-loop-demo. For detailed instructions on creating a flow, see Amazon Bedrock Flows is now generally available with enhanced safety and traceability.
Add a DoWhile loop node.
Add additional nodes according to the solution workflow (discussed in the next section).

Amazon Bedrock provides different node types to build your prompt flow. For this example, we use a DoWhile Loop node for calling different types of nodes for a generative AI-powered application, which creates a blog post on a given topic and checks the quality in every loop. There is one DoWhile Loop node in the flow. This new node type is on the Nodes tab in the left pane, as shown in the following screenshot.

DoWhile loop workflow
A DoWhile loop consists of two parts: the loop and the loop controller. The loop controller evaluates the loop's condition and decides whether to continue or exit the loop. In this example, the loop executes Prompt, Inline Code, and S3 Storage nodes on each iteration.
Let’s go through this flow step-by-step, as illustrated in the preceding screenshot:

A user asks to write a blog post on a specific topic (for example, using the following prompt: {"topic": "AWS Lambda", "Audience": "Chief Technology Officer", "word_count": "500"}). This prompt is sent to the Prompt node (Content_Generator).
The Prompt node (Content_Generator) writes a blog post based on the prompt using one of the LLMs provided by Amazon Bedrock (such as Amazon Nova or Anthropic's Claude), and the result is sent to the Loop Input node. This is the entry point to the DoWhile Loop node.
Three steps happen in tandem:

The Loop Input node forwards the blog post content to another Prompt node (Blog_Analysis_Rating) for rating the post based on criteria mentioned as part of the prompt. The output of this Prompt node is JSON code like the following example. The output of a Prompt node is always of type String. You can modify the prompt to get different types of output according to your needs. However, you can also ask the LLM to output a single rating number.

{
    "overall_rating": 8.5,
    "category_ratings": {
        "clarity_and_readability": 9,
        "value_to_target_audience": 8,
        "engagement_level": 8,
        "technical_accuracy": 9
    }
}

The blog post is sent to the flow output during every iteration. It becomes the final version when the loop condition is no longer met (exiting the loop) or the maximum number of loop iterations is reached.
At the same time, the output of the previous Prompt node (Content_Generator) is forwarded to another Prompt node (Blog_Refinement) by the Loop Input node. This node recreates or modifies the blog post based on the feedback from the analysis.

The output of the Prompt node (Blog_Analysis_Rating) is fed into the Inline Code node, which extracts the rating and returns it as a number (or any other value needed for checking the condition) to the loop controller as an input variable (for example, a rating).

def __func(variable):
    return float(variable["overall_rating"])

__func(variable)

Python code inside the Inline Code node must be treated as untrusted, and appropriate parsing, validation, and data handling should be implemented.

The output of the Inline Code node is fed into the loop condition inside the loop controller and validated against the condition set up in the continue loop. In this example, we check whether the generated blog post's rating is less than or equal to 9. You can check up to five conditions. Additionally, a maximum loop iterations parameter makes sure the loop doesn't continue infinitely.
The step consists of two parts:

A Prompt node (Blog_Refinement) forwards the newly generated blog post to loopinput inside the loop controller.
The loop controller stores the version of the post in Amazon S3 for future reference and comparing the different versions generated.

This path executes if one of the conditions inside the continue loop is met and the maximum loop iterations have not been reached. If the loop continues, the newly modified blog post is forwarded to the input field of the Loop Input node as LoopInput and the loop runs again.
The final output is produced when the DoWhile loop exits, either because the continue condition is no longer met or because the maximum number of iterations is reached. The output is the final version of the blog post.

You can see the output as shown in the following screenshot. The system also provides access to node execution traces, offering detailed insights into each processing step and real-time performance metrics, and highlighting issues that may have occurred during the flow's execution. Traces can be enabled using an API and sent to Amazon CloudWatch Logs. In the API, set the enableTrace field to true in an InvokeFlow request. Each flowOutputEvent in the response is returned alongside a flowTraceEvent.
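The following is a minimal sketch of invoking a flow with traces enabled, assuming the InvokeFlow API of the bedrock-agent-runtime client; the flow and alias identifiers are placeholders, and the input node name must match the input node in your flow:

# Minimal sketch of invoking the flow with traces enabled, assuming the
# bedrock-agent-runtime InvokeFlow API. Flow/alias IDs are placeholders and
# the input node name must match your flow's input node.
import boto3

client = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

response = client.invoke_flow(
    flowIdentifier="<flow-id>",
    flowAliasIdentifier="<flow-alias-id>",
    enableTrace=True,
    inputs=[{
        "nodeName": "FlowInputNode",  # name of your flow's input node
        "nodeOutputName": "document",
        "content": {"document": {"topic": "AWS Lambda",
                                 "Audience": "Chief Technology Officer",
                                 "word_count": "500"}},
    }],
)

for event in response["responseStream"]:
    if "flowTraceEvent" in event:
        print("trace:", event["flowTraceEvent"])    # per-node trace details
    if "flowOutputEvent" in event:
        print("output:", event["flowOutputEvent"])  # final blog post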

You have now successfully created and executed an Amazon Bedrock flow using DoWhile Loop nodes. You can also use Amazon Bedrock APIs to programmatically execute this flow. For additional details on how to configure flows, see Amazon Bedrock Flows is now generally available with enhanced safety and traceability.
Considerations
When working with DoWhile Loop nodes in Amazon Bedrock Flows, note the following:

DoWhile Loop nodes don’t support nested loops (loops within loops)
Each loop controller can evaluate up to five input conditions for its exit criteria
A maximum iteration limit must be specified to help prevent infinite loops and enable controlled execution

Conclusion
The integration of DoWhile loops in Amazon Bedrock Flows marks a significant advancement in iterative workflow capabilities, enabling sophisticated loop-based processing that can incorporate Prompt nodes, Inline Code nodes, S3 Storage nodes, Lambda functions, agents, DoWhile Loop nodes, and Knowledge Base nodes. This enhancement responds directly to enterprise customers’ needs for handling complex, repetitive tasks within their AI workflows, helping developers create adaptive, condition-based solutions without requiring external orchestration tools. By providing support for iterative processing patterns, DoWhile loops help organizations build more sophisticated AI applications that can refine outputs, perform recursive operations, and implement complex business logic directly within the Amazon Bedrock environment. This powerful addition to Amazon Bedrock Flows democratizes the development of advanced AI workflows, making iterative AI processing more accessible and manageable across organizations.
DoWhile loops in Amazon Bedrock Flows are now available in all AWS Regions where Amazon Bedrock Flows is supported, except the AWS GovCloud (US) Regions. To get started, open the Amazon Bedrock console or use the Amazon Bedrock APIs to begin building flows with Amazon Bedrock Flows. To learn more, refer to Create your first flow in Amazon Bedrock and Track each step in your flow by viewing its trace in Amazon Bedrock.
We’re excited to see the innovative applications you will build with these new capabilities. As always, we welcome your feedback through AWS re:Post for Amazon Bedrock or your usual AWS contacts. Join the generative AI builder community at community.aws to share your experiences and learn from others.

About the authors
Shubhankar Sumar is a Senior Solutions Architect at AWS, where he specializes in architecting generative AI-powered solutions for enterprise software and SaaS companies across the UK. With a strong background in software engineering, Shubhankar excels at designing secure, scalable, and cost-effective multi-tenant systems on the cloud. His expertise lies in seamlessly integrating cutting-edge generative AI capabilities into existing SaaS applications, helping customers stay at the forefront of technological innovation.
Jesse Manders is a Senior Product Manager on Amazon Bedrock, the AWS Generative AI developer service. He works at the intersection of AI and human interaction with the goal of creating and improving generative AI products and services to meet our needs. Previously, Jesse held engineering team leadership roles at Apple and Lumileds, and was a senior scientist in a Silicon Valley startup. He has an M.S. and Ph.D. from the University of Florida, and an MBA from the University of California, Berkeley, Haas School of Business.
Eric Li is a Software Development Engineer II at AWS, where he builds core capabilities for Amazon Bedrock and SageMaker to support generative AI applications at scale. His work focuses on designing secure, observable, and cost-efficient systems that help developers and enterprises adopt generative AI with confidence. He is passionate about advancing developer experiences for building with large language models, making it easier to integrate AI into production-ready cloud applications.

How PropHero built an intelligent property investment advisor with con …

This post was written with Lucas Dahan, Dil Dolkun, and Mathew Ng from PropHero.
PropHero is a leading property wealth management service that democratizes access to intelligent property investment advice through big data, AI, and machine learning (ML). For its Spanish and Australian consumer base, PropHero needed an AI-powered advisory system that could engage customers in accurate property investment discussions. The goal was to provide personalized investment insights and to guide and assist users at every stage of their investment journey: from understanding the process and gaining visibility into timelines to securely uploading documents and tracking progress in real time.
PropHero collaborated with the AWS Generative AI Innovation Center to implement an intelligent property investment advisor using AWS generative AI services with continuous evaluation. The solution helps users engage in natural language conversations about property investment strategies and receive personalized recommendations based on PropHero’s comprehensive market knowledge.
In this post, we explore how we built a multi-agent conversational AI system using Amazon Bedrock that delivers knowledge-grounded property investment advice. We cover the agent architecture, the model selection strategy, and the comprehensive continuous evaluation system that supports quality conversations while enabling rapid iteration and improvement.
The challenge: Making property investment knowledge more accessible
Property investment presents numerous challenges for both novice and experienced investors. Information asymmetry creates barriers where comprehensive market data remains expensive or inaccessible. Traditional investment processes are manual, time-consuming, and require extensive market knowledge to navigate effectively. For Spanish and Australian consumers specifically, we needed to build a solution that could provide accurate, contextually relevant property investment advice in Spanish while handling complex, multi-turn conversations about investment strategies. The system needed to maintain high accuracy while delivering responses at scale, continuously learning and improving from customer interactions. Most importantly, it needed to assist users across every phase of their journey, from initial onboarding through to final settlement, ensuring comprehensive support throughout the entire investment process.
Solution overview
We built a complete end-to-end solution using AWS generative AI services, architected around a multi-agent AI advisor with integrated continuous evaluation. The system provides seamless data flow from ingestion through intelligent advisory conversations with real-time quality monitoring. The following diagram illustrates this architecture.

The solution architecture consists of four virtual layers, each serving specific functions in the overall system design.
Data foundation layer
The data foundation provides the storage and retrieval infrastructure for system components:

Amazon DynamoDB – Fast storage for conversation history, evaluation metrics, and user interaction data
Amazon Relational Database Service (Amazon RDS) for PostgreSQL – A PostgreSQL database storing Langfuse observability data, including large language model (LLM) traces and latency metrics
Amazon Simple Storage Service (Amazon S3) – A central data lake storing Spanish FAQ documents, property investment guides, and conversation datasets

Multi-agent AI layer
The AI processing layer encompasses the core intelligence components that power the conversational experience:

Amazon Bedrock – Foundation models (FMs) such as LLMs and rerankers powering specialized agents
Amazon Bedrock Knowledge Bases – Semantic search engine with semantic chunking for FAQ-style content
LangGraph – Orchestration of multi-agent workflows and conversation state management
AWS Lambda – Serverless functions executing multi-agent logic and retrieval of user information for richer context

Continuous evaluation layer
The evaluation infrastructure facilitates continuous quality monitoring and improvement through these components:

Amazon CloudWatch – Real-time monitoring of quality metrics with automated alerting and threshold management
Amazon EventBridge – Real-time event triggers for conversation completion and quality assessment
AWS Lambda – Automated evaluation functions measuring context relevance, response groundedness, and goal accuracy
Amazon QuickSight – Interactive dashboards and analytics for monitoring the respective metrics

Application and integration layer
The integration layer provides secure interfaces for external communication:

Amazon API Gateway – Secure API endpoints for conversational interface and evaluation webhooks

Multi-agent AI advisor architecture
The intelligent advisor uses a multi-agent system orchestrated through LangGraph, which sits in a single Lambda function, where each agent is optimized for specific tasks. The following diagram shows the communication flow among the various agents within the Lambda function.

Agent composition and model selection
Our model selection strategy involved extensive testing to match each component's computational requirements with the most cost-effective Amazon Bedrock model. We evaluated factors including response quality, latency requirements, and cost per token to determine optimal model assignments for each agent type. Each component in the system uses the most appropriate model for its designated function, as outlined in the following table.

Component – Amazon Bedrock model – Purpose
Router Agent – Anthropic Claude 3.5 Haiku – Query classification and routing
General Agent – Amazon Nova Lite – Common questions and conversation management
Advisor Agent – Amazon Nova Pro – Specialized property investment advice
Settlement Agent – Anthropic Claude 3.5 Haiku – Customer support specializing in the pre-settlement phase of investment
Response Agent – Amazon Nova Lite – Final response generation and formatting
Embedding – Cohere Embed Multilingual v3 – Context retrieval
Retriever – Cohere Rerank 3.5 – Context retrieval and ranking
Evaluator – Anthropic Claude 3.5 Haiku – Quality assessment and scoring

End-to-end conversation flow
The conversation processing follows a structured workflow that facilitates accurate responses while maintaining quality standards (a simplified routing sketch follows these steps):

User queries enter through API Gateway and are routed to the router agent.
The router agent determines the appropriate specialized agent based on query analysis.
User information is retrieved at the start for richer context, and knowledge-intensive queries trigger the retriever to access the Amazon Bedrock knowledge base.
Specialized agents process queries with retrieved user information and relevant context from the knowledge base.
The response agent formats and generates the final user-facing response with the appropriate tone.
Parallel evaluation processes assess context relevance, response groundedness, and goal accuracy.
Conversation data is stored in DynamoDB for analysis and improvement.
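The following is a minimal, illustrative LangGraph sketch of the router-to-specialized-agent-to-response pattern described above. The state schema, node names, and the keyword-based routing stub are assumptions for demonstration; real nodes would call the Amazon Bedrock models listed in the earlier table and the knowledge base retriever.

from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class ConvState(TypedDict):
    query: str
    route: str
    answer: str

def router_agent(state: ConvState) -> dict:
    # In production this would call Claude 3.5 Haiku to classify the query.
    route = "advisor" if "invest" in state["query"].lower() else "general"
    return {"route": route}

def advisor_agent(state: ConvState) -> dict:
    # Would retrieve knowledge base context and call Amazon Nova Pro.
    return {"answer": f"[advisor] {state['query']}"}

def general_agent(state: ConvState) -> dict:
    # Would call Amazon Nova Lite for common questions.
    return {"answer": f"[general] {state['query']}"}

def response_agent(state: ConvState) -> dict:
    # Formats the final user-facing response with the appropriate tone.
    return {"answer": state["answer"].strip()}

graph = StateGraph(ConvState)
graph.add_node("router", router_agent)
graph.add_node("advisor", advisor_agent)
graph.add_node("general", general_agent)
graph.add_node("respond", response_agent)
graph.add_edge(START, "router")
graph.add_conditional_edges("router", lambda s: s["route"],
                            {"advisor": "advisor", "general": "general"})
graph.add_edge("advisor", "respond")
graph.add_edge("general", "respond")
graph.add_edge("respond", END)

app = graph.compile()
print(app.invoke({"query": "How should I invest in property?"}))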

Amazon Bedrock Knowledge Bases configuration
The knowledge base uses several AWS services for optimal performance:

Data source – S3 bucket
Chunking strategy – Semantic chunking optimized for FAQ-style content
Embedding model – Cohere Embed Multilingual v3 for Spanish understanding
Reranker – Cohere Rerank 3.5 for reranking retrieved Spanish content
Vector database – Amazon OpenSearch Serverless for scalable semantic search

Sample conversation experience
The system handles natural Spanish conversations seamlessly. For example:
Usuario: “Hola, ¿qué es PropHero exactamente?”
Asistente: “¡Hola! PropHero es una plataforma que te permite crear, optimizar y crecer tu patrimonio inmobiliario delegando todo el proceso en nuestro equipo de profesionales…”
Usuario: “¿Y cómo funciona en la práctica?”
Asistente: “Es muy sencillo. Comenzamos con una sesión con tu Property Coach para definir tu estrategia de inversión personalizada…”
Integrated continuous evaluation system
The continuous evaluation system is integrated into the architecture, operating as a core component. This approach supports quality monitoring in real time alongside conversation processing.
Evaluation metrics
We used the following metrics from the Ragas library for evaluation:

Context Relevance (0–1) – Measures the relevance of retrieved context to user queries, evaluating RAG system effectiveness
Response Groundedness (0–1) – Makes sure responses are factually accurate and derived from PropHero’s official information
Agent Goal Accuracy (0–1) – Binary measure of whether responses successfully address user investment goals

Real-time evaluation workflow
The evaluation system operates seamlessly within the conversation architecture (a skeletal evaluation handler sketch follows these steps):

Amazon DynamoDB Streams triggers – Conversation data written to DynamoDB automatically triggers a Lambda function for evaluation through Amazon DynamoDB Streams
Parallel processing – Lambda functions execute evaluation logic in parallel with response delivery
Multi-dimensional assessment – Each conversation is evaluated across three key dimensions simultaneously
Intelligent scoring with LLM-as-a-judge – Anthropic’s Claude 3.5 Haiku provides consistent evaluation as an LLM judge, offering standardized assessment criteria across conversations.
Monitoring and analytics – CloudWatch captures metrics from the evaluation process, and QuickSight provides dashboards for trend analysis
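The following skeletal Lambda handler illustrates this stream-triggered pattern. The judge model ID, DynamoDB attribute names, metric names, and CloudWatch namespace are assumptions; production code would add error handling and apply the Ragas metrics described earlier rather than the bare judge prompt shown here.

import json
import boto3

cloudwatch = boto3.client("cloudwatch")
bedrock = boto3.client("bedrock-runtime")

JUDGE_MODEL_ID = "anthropic.claude-3-5-haiku-20241022-v1:0"  # assumed model ID

def score_with_judge(conversation: dict, dimension: str) -> float:
    """Ask the LLM judge for a 0-1 score on one evaluation dimension."""
    prompt = (f"Rate the following conversation on {dimension} from 0 to 1. "
              f"Return only the number.\n\n{json.dumps(conversation)}")
    resp = bedrock.converse(
        modelId=JUDGE_MODEL_ID,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    # Assumes the judge returns a bare number, as prompted.
    return float(resp["output"]["message"]["content"][0]["text"].strip())

def handler(event, context):
    for record in event.get("Records", []):
        if record.get("eventName") != "INSERT":
            continue
        item = record["dynamodb"]["NewImage"]  # attribute names are assumptions
        conversation = {"question": item["question"]["S"], "answer": item["answer"]["S"]}
        for dimension in ("context_relevance", "response_groundedness", "goal_accuracy"):
            score = score_with_judge(conversation, dimension)
            cloudwatch.put_metric_data(
                Namespace="PropHero/Evaluation",
                MetricData=[{"MetricName": dimension, "Value": score}],
            )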

The following diagram provides an overview of the Lambda function responsible for continuous evaluation.

Implementation insights and best practices
Our development journey involved a 6-week iterative process with PropHero’s technical team. We conducted testing across different model combinations and evaluated chunking strategies using real customer FAQ data. This journey revealed several architectural optimizations that enhanced system performance, achieved significant cost reductions, and improved user experience.
Model selection strategy
Our approach to model selection demonstrates the importance of matching model capabilities to specific tasks. By using Amazon Nova Lite for simpler tasks and Amazon Nova Pro for complex reasoning, the solution achieves optimal cost-performance balance while maintaining high accuracy standards.
Chunking and retrieval optimization
Semantic chunking proved superior to hierarchical and fixed chunking approaches for FAQ-style content. The Cohere Rerank 3.5 model enabled the system to use fewer chunks (10 vs. 20) while maintaining accuracy, reducing latency and cost.
Multilingual capabilities
The system effectively handles Spanish and English queries by using FMs that support Spanish language on Amazon Bedrock.
Business impact
The PropHero AI advisor delivered measurable business value:

Enhanced customer engagement – A 90% goal accuracy rate makes sure customers receive relevant, actionable property investment advice. Over 50% of our users (and over 70% of paid users) are actively using the AI advisor.
Operational efficiency – Automated responses to common questions reduced customer service workload by 30%, freeing staff to focus on complex customer needs.
Scalable growth – The serverless architecture automatically scales to handle increasing customer demand without manual intervention.
Cost optimization – Strategic model selection achieved high performance while reducing AI costs by 60% compared to using premium models throughout.
Consumer base expansion – Successful Spanish language support enabled PropHero’s expansion into the Spanish consumer base with localized expertise.

Conclusion
The PropHero AI advisor demonstrates how AWS generative AI services can be used to create intelligent, context-aware conversational agents that deliver real business value. By combining a modular agent architecture with a robust evaluation system, PropHero has created a solution that enhances customer engagement while providing accurate and relevant responses. The comprehensive evaluation pipeline has been particularly valuable, providing clear metrics for measuring conversation quality and guiding ongoing improvements. This approach helps make sure the AI advisor will continue to evolve and improve over time. For more information about building multi-agent AI advisors with continuous evaluation, refer to the following resources:

Retrieve data and generate AI responses with Amazon Bedrock Knowledge Bases – With Amazon Bedrock Knowledge Bases, you can implement semantic search with chunking strategies
LangGraph – LangGraph can help you build multi-agent workflows
Ragas – Ragas offers comprehensive LLM evaluation metrics, including context relevance, groundedness, and goal accuracy used in this implementation

To learn more about the Generative AI Innovation Center, get in touch with your account team.

About the authors
Adithya Suresh is a Deep Learning Architect at the AWS Generative AI Innovation Center based in Sydney, where he collaborates directly with enterprise customers to design and scale transformational generative AI solutions for complex business challenges. He uses AWS generative AI services to build bespoke AI systems that drive measurable business value across diverse industries.
Lucas Dahan was the Head of Data & AI at PropHero at the time of writing. He led the technology team transforming property investment through innovative digital solutions.
Dil Dolkun is a Data & AI Engineer on PropHero's tech team and has been instrumental in designing data architectures and multi-agent workflows for PropHero's generative AI property investment advisor system.
Mathew Ng is a Technical Lead at PropHero, who architected and scaled PropHero's cloud-native, high-performance software solution from early-stage startup to successful Series A funding.
Aaron Su is a Solutions Architect at AWS, with a focus on AI and SaaS startups. He helps early-stage companies architect scalable, secure, and cost-effective cloud solutions.

Accelerate benefits claims processing with Amazon Bedrock Data Automat …

In the benefits administration industry, claims processing is a vital operational pillar that makes sure employees and beneficiaries receive timely benefits, such as health, dental, or disability payments, while controlling costs and adhering to regulations like HIPAA and ERISA. Businesses aim to optimize the workflow—covering claim submission, validation, adjudication, payment, and appeals—to enhance employee satisfaction, strengthen provider relationships, and mitigate financial risks. The process includes specific steps like claim submission (through portals or paper), data validation (verifying eligibility and accuracy), adjudication (assessing coverage against plan rules), payment or denial (including check processing for reimbursements), and appeal handling. Efficient claims processing supports competitive benefits offerings, which is crucial for talent retention and employer branding, but requires balancing speed, accuracy, and cost in a highly regulated environment.
Despite its importance, claims processing faces significant challenges in many organizations. Most notably, the reliance on legacy systems and manual processes results in frustratingly slow resolution times, high error rates, and increased administrative costs. Incomplete or inaccurate claim submissions—such as those with missing diagnosis codes or eligibility mismatches—frequently lead to denials and rework, creating frustration for both employees and healthcare providers. Additionally, fraud, waste, and abuse continue to inflate costs, yet detecting these issues without delaying legitimate claims remains challenging. Complex regulatory requirements demand constant system updates, and poor integration between systems—such as Human Resource Information Systems (HRIS) and other downstream systems—severely limits scalability. These issues drive up operational expenses, erode trust in benefits programs, and overburden customer service teams, particularly during appeals processes or peak claims periods.
Generative AI can help address these challenges. With Amazon Bedrock Data Automation, you can automate the generation of useful insights from unstructured multimodal content such as documents, images, audio, and video. Amazon Bedrock Data Automation can be used in the benefits claims process to automate document processing by extracting and classifying documents from claims packets, policy applications, and supporting documents with industry-leading accuracy, reducing manual errors and accelerating resolution times. Its natural language processing capabilities interpret unstructured data, such as provider notes, supporting compliance with plan rules and regulations. By automating repetitive tasks and providing insights, Amazon Bedrock Data Automation helps reduce administrative burdens, enhance experiences for both employees and providers, and support compliance in a cost-effective manner. Furthermore, its scalable architecture enables seamless integration with existing systems, improving data flow across HRIS, claims systems, and provider networks, and advanced analytics help detect fraud patterns to optimize cost control.
In this post, we examine the typical benefit claims processing workflow and identify where generative AI-powered automation can deliver the greatest impact.
Benefit claims processing
When an employee or beneficiary pays out of pocket for an expense covered under their health benefits, they submit a claim for reimbursement. This process requires several supporting documents, including doctor’s prescriptions and proof of payment, which might include check images, receipts, or electronic payment confirmations.
The claims processing workflow involves several critical steps:

Document intake and processing – The system receives and categorizes submitted documentation, including:

Medical records and prescriptions
Proof of payment documentation
Supporting forms and eligibility verification

Payment verification processing – For check-based reimbursements, the system must complete the following steps:

Extract information from check images, including the account number and routing number contained in the MICR line
Verify payee and payer names against the information provided during the claim submission process
Confirm payment amounts match the claimed expenses
Flag discrepancies for human review

Adjudication and reimbursement – When verification is complete, the system performs several actions:

Determine eligibility based on plan rules and coverage limits
Calculate appropriate reimbursement amounts
Initiate payment processing through direct deposit or check issuance
Provide notification to the claimant regarding the status of their reimbursement

In this post, we walk through a real-world scenario to make the complexity of this multi-step process clearer. The following example demonstrates how Amazon Bedrock Data Automation can streamline the claims processing workflow, from initial submission to final reimbursement.
Solution overview
Let’s consider a scenario where a benefit plan participant seeks treatment and pays out of pocket for the doctor’s fee using a check. They then buy the medications prescribed by the doctor at the pharmacy store. Later, they log in to their benefit provider’s portal and submit a claim along with the image of the check and payment receipt for the medications.
This solution uses Amazon Bedrock Data Automation to automate the two most critical and time-consuming aspects of this workflow: document intake and payment verification processing. The following diagram illustrates the benefits claims processing architecture.

The end-to-end process works through four integrated stages: ingestion, extraction, validation, and integration.
Ingestion
When a beneficiary uploads supporting documents (check image and pharmacy receipt) through the company’s benefit claims portal, these documents are securely saved in an Amazon Simple Storage Service (Amazon S3) bucket, triggering the automated claims processing pipeline.
Extraction
After documents are ingested, the system immediately begins with intelligent data extraction:

1. The S3 object upload triggers an AWS Lambda function, which invokes the Amazon Bedrock Data Automation project.
2. Amazon Bedrock Data Automation uses blueprints for file processing and extraction. Blueprints are artifacts used to configure file processing business logic by specifying a list of field names for data extraction, along with their desired data formats (string, number, or Boolean) and natural language context for data normalization and validation rules. Amazon Bedrock Data Automation provides a catalog of sample blueprints out of the box. You can create a custom blueprint for your unique document types that aren't predefined in the catalog. This solution uses two blueprints designed for different document types, as shown in the following screenshot:

The catalog blueprint US-Bank-Check for check processing.
The custom blueprint benefit-claims-pharmacy-receipt-blueprint for pharmacy-specific receipts.

US-Bank-Check is a catalog blueprint provided out of the box by Amazon Bedrock Data Automation. The custom blueprint benefit-claims-pharmacy-receipt-blueprint is created using an AWS CloudFormation template to handle pharmacy receipt processing, addressing a specific document type that wasn't available in the standard blueprint catalog. The benefit administrator wants to look for vendor-specific information such as name, address, and phone details for benefits claims processing. The custom blueprint schema contains natural language explanations of those fields, such as VendorName, VendorAddress, VendorPhone, and additional fields, describing what each field represents, its expected data type, and its inference type (explained in Creating Blueprints for Extraction), as shown in the following screenshot.

3. The two blueprints are added to the Amazon Bedrock Data Automation project. An Amazon Bedrock Data Automation project is a grouping of both standard and custom blueprints that you can use to process different types of files (like documents, audio, and images) using specific configuration settings, where you can control what kind of information you want to extract from each file type. When the project is invoked asynchronously, it automatically applies the appropriate blueprint, extracts information such as confidence scores and bounding box details for each field, and saves results in a separate S3 bucket. This intelligent classification alleviates the need for you to write complex document classification logic.
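As a rough illustration of the first two steps, the following Lambda sketch invokes a Bedrock Data Automation project asynchronously when a document lands in Amazon S3. The client and method names follow the Bedrock Data Automation runtime API, but the bucket names, project ARN, and profile ARN are placeholders, and you should confirm the exact request shape against the current API reference.

import boto3

bda_runtime = boto3.client("bedrock-data-automation-runtime")

def handler(event, context):
    # Triggered by the S3 object upload event from the claims portal bucket.
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]

    response = bda_runtime.invoke_data_automation_async(
        inputConfiguration={"s3Uri": f"s3://{bucket}/{key}"},
        outputConfiguration={"s3Uri": "s3://claims-bda-output-bucket/results/"},  # placeholder
        dataAutomationConfiguration={
            # Placeholder project ARN; the project groups the two blueprints.
            "dataAutomationProjectArn": "arn:aws:bedrock:us-east-1:123456789012:data-automation-project/EXAMPLE",
        },
        # Placeholder profile ARN required by the runtime API.
        dataAutomationProfileArn="arn:aws:bedrock:us-east-1:123456789012:data-automation-profile/us.data-automation-v1",
    )
    return {"invocationArn": response["invocationArn"]}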
The following screenshot illustrates the document classification by the standard catalog blueprint US-Bank-Check.

The following screenshot shows the document classification by the custom blueprint benefit-claims-pharmacy-receipt-blueprint.

Validation
With the data extracted, the system moves to the validation and decision-making process using the business rules specific to each document type.
The business rules are documented in standard operating procedure documents (AnyCompany Benefit Checks Standard Operating procedure.docx and AnyCompany Benefit Claims Standard Operating procedure.docx) and uploaded to an S3 bucket. Then the system creates a knowledge base for Amazon Bedrock with the S3 bucket as the source, as shown in the following screenshot.

When the extracted Amazon Bedrock Data Automation results are saved to the configured S3 bucket, a Lambda function is triggered automatically. Based on the business rules retrieved from the knowledge base for the specific document type and the extracted Amazon Bedrock Data Automation output, an Amazon Nova Lite large language model (LLM) makes the automated approve/deny decision for claims.
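The following sketch approximates that validation step: it retrieves the relevant SOP text from the knowledge base and asks Amazon Nova Lite for a decision. The knowledge base ID, model ID, and prompt wording are assumptions for illustration, not the deployed solution's exact code.

import json
import boto3

agent_runtime = boto3.client("bedrock-agent-runtime")
bedrock_runtime = boto3.client("bedrock-runtime")

KB_ID = "EXAMPLEKBID"              # placeholder knowledge base ID
MODEL_ID = "amazon.nova-lite-v1:0"  # assumed Nova Lite model ID

def adjudicate(document_type: str, extracted_fields: dict) -> str:
    # Pull the business rules for this document type from the knowledge base.
    rules = agent_runtime.retrieve(
        knowledgeBaseId=KB_ID,
        retrievalQuery={"text": f"Adjudication rules for {document_type}"},
    )
    rule_text = "\n".join(r["content"]["text"] for r in rules["retrievalResults"])

    prompt = (
        f"Business rules:\n{rule_text}\n\n"
        f"Extracted claim data:\n{json.dumps(extracted_fields)}\n\n"
        "Decide APPROVE or DENY and give a one-sentence reason."
    )
    response = bedrock_runtime.converse(
        modelId=MODEL_ID,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    return response["output"]["message"]["content"][0]["text"]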
The following screenshot shows the benefit claim adjudication automated decision for US-Bank-Check.

The following screenshot shows the benefit claim adjudication automated decision for benefit-claims-pharmacy-receipt-blueprint.

Integration
The system seamlessly integrates with existing business processes.
When validation is complete, an event is pushed to Amazon EventBridge, which triggers a Lambda function for downstream integration. In this implementation, we use an Amazon DynamoDB table and Amazon Simple Notification Service (Amazon SNS) email for downstream integration. A DynamoDB table is created as part of the deployment stack, which is used to populate details including document classification, extracted data, and automated decision. An email notification is sent for both check and receipts after the final decision is made by the system. The following screenshot shows an example email for pharmacy receipt approval.

This flexible architecture helps you integrate with your existing applications through internal APIs or events to update claim status or trigger additional workflows when validation fails.
Reducing manual effort through intelligent business rules management
Beyond automating document processing, this solution addresses a common operational challenge: Traditionally, customers must write and maintain code for handling business rules around claims adjudication and processing. Every business rule change requires development effort and code updates, slowing time-to-market and increasing maintenance overhead.
Our approach converts business rules and standard operating procedures (SOPs) into knowledge bases using Amazon Bedrock Knowledge Bases, which you can use for automated decision-making. This approach can dramatically reduce time-to-market when business rules change, because updates can be made through knowledge management rather than code deployment.
In the following sections, we walk you through the steps to deploy the solution to your own AWS account.
Prerequisites
To implement the solution provided in this post, you must have the following:

An AWS account
Access to Amazon Titan Text Embeddings V2 and Amazon Nova Lite foundation models (FMs) enabled in Amazon Bedrock

This solution uses Python 3.13 with Boto3 1.38 or later, and the AWS Serverless Application Model Command Line Interface (AWS SAM CLI) version 1.138.0. We assume that you have already installed these on your local machine. If not, refer to the following instructions:

Python 3.13 installation
Install the AWS SAM CLI

Set up code in your local machine
To set up the code, clone the GitHub repository. After you have cloned the repository to your local machine, the project folder structure will look like the following code, as mentioned in the README file:

Deploy the solution in your account
The sample code comes with a CloudFormation template that creates necessary resources. To deploy the solution in your account, follow the deployment instructions in the README file.
Clean up
Deploying this solution in your account will incur costs. Follow the cleanup instructions in the README file to avoid charges when you are done.
Conclusion
Benefits administration companies can significantly enhance their operations by automating claims processing using the solution outlined in this post. This strategic approach directly addresses the industry’s core challenges and can deliver several key advantages:

Enhanced processing efficiency through accelerated claims resolution times, reduced manual error rates, and higher straight-through processing rates that minimize the frustrating delays and manual rework plaguing legacy systems
Streamlined document integration and fraud detection capabilities, where adding new supporting documents becomes seamless through new Amazon Bedrock Data Automation blueprints, while AI-powered analytics identify suspicious patterns without delaying legitimate claims, avoiding traditional months-long development cycles and reducing costly fraud, waste, and abuse
Agile business rule management that enables rapid adaptation to changing HIPAA and ERISA requirements and modification of business rules, significantly reducing administrative costs and time-to-market while improving scalability and integration with existing HRIS and claims systems, ultimately enhancing employee satisfaction, strengthening provider relationships, and supporting competitive benefits offerings that are crucial for talent retention and employer branding

To get started with this solution, refer to the GitHub repo. For more information about Amazon Bedrock Data Automation, refer to Transform unstructured data into meaningful insights using Amazon Bedrock Data Automation and try the Document Processing Using Amazon Bedrock Data Automation workshop.

About the authors
Saurabh Kumar is a Senior Solutions Architect at AWS based out of Raleigh, NC, with expertise in Resilience Engineering, Chaos Engineering, and Generative AI solutions. He advises customers on fault-tolerance strategies and generative AI-driven modernization approaches, helping organizations build robust architectures while leveraging generative AI technologies to drive innovation.
Kiran Lakkireddy is a Principal Solutions Architect at AWS with expertise in Financial Services, Benefits Management and HR Services industries. Kiran provides technology and architecture guidance to customers in their business transformation, with a specialized focus on GenAI security, compliance, and governance. He regularly speaks to customer security leadership on GenAI security, compliance, and governance topics, helping organizations navigate the complex landscape of AI implementation while maintaining robust security standards.
Tamilmanam Sambasivam is a Solutions Architect and AI/ML Specialist at AWS. She helps enterprise customers solve their business problems by recommending the right AWS solutions. Her strong background in information technology (24+ years of experience) helps customers strategize, develop, and modernize solutions to their business problems in the AWS Cloud. In her spare time, Tamil likes to travel and garden.

How to Master Advanced TorchVision v2 Transforms, MixUp, CutMix, and M …

In this tutorial, we explore advanced computer vision techniques using TorchVision’s v2 transforms, modern augmentation strategies, and powerful training enhancements. We walk through the process of building an augmentation pipeline, applying MixUp and CutMix, designing a modern CNN with attention, and implementing a robust training loop. By running everything seamlessly in Google Colab, we position ourselves to understand and apply state-of-the-art practices in deep learning with clarity and efficiency. Check out the FULL CODES here.

!pip install torch torchvision torchaudio --quiet
!pip install matplotlib pillow numpy --quiet

import torch
import torchvision
from torchvision import transforms as T
from torchvision.transforms import v2
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image
import requests
from io import BytesIO

print(f"PyTorch version: {torch.__version__}")
print(f"TorchVision version: {torchvision.__version__}")

We begin by installing the libraries and importing all the essential modules for our workflow. We set up PyTorch, TorchVision v2 transforms, and supporting tools like NumPy, PIL, and Matplotlib, so we are ready to build and test advanced computer vision pipelines. Check out the FULL CODES here.

class AdvancedAugmentationPipeline:
    def __init__(self, image_size=224, training=True):
        self.image_size = image_size
        self.training = training
        base_transforms = [
            v2.ToImage(),
            v2.ToDtype(torch.uint8, scale=True),
        ]
        if training:
            self.transform = v2.Compose([
                *base_transforms,
                v2.Resize((image_size + 32, image_size + 32)),
                v2.RandomResizedCrop(image_size, scale=(0.8, 1.0), ratio=(0.9, 1.1)),
                v2.RandomHorizontalFlip(p=0.5),
                v2.RandomRotation(degrees=15),
                v2.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4, hue=0.1),
                v2.RandomGrayscale(p=0.1),
                v2.GaussianBlur(kernel_size=3, sigma=(0.1, 2.0)),
                v2.RandomPerspective(distortion_scale=0.1, p=0.3),
                v2.RandomAffine(degrees=10, translate=(0.1, 0.1), scale=(0.9, 1.1)),
                v2.ToDtype(torch.float32, scale=True),
                v2.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
            ])
        else:
            self.transform = v2.Compose([
                *base_transforms,
                v2.Resize((image_size, image_size)),
                v2.ToDtype(torch.float32, scale=True),
                v2.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
            ])

    def __call__(self, image):
        return self.transform(image)

We define an advanced augmentation pipeline that adapts to both training and validation modes. We apply powerful TorchVision v2 transforms, such as cropping, flipping, color jittering, blurring, perspective, and affine transformations, during training, while keeping validation preprocessing simple with resizing and normalization. This way, we ensure that we enrich the training data for better generalization while maintaining consistent and stable evaluation. Check out the FULL CODES here.

class AdvancedMixupCutmix:
    def __init__(self, mixup_alpha=1.0, cutmix_alpha=1.0, prob=0.5):
        self.mixup_alpha = mixup_alpha
        self.cutmix_alpha = cutmix_alpha
        self.prob = prob

    def mixup(self, x, y):
        batch_size = x.size(0)
        lam = np.random.beta(self.mixup_alpha, self.mixup_alpha) if self.mixup_alpha > 0 else 1
        index = torch.randperm(batch_size)
        mixed_x = lam * x + (1 - lam) * x[index, :]
        y_a, y_b = y, y[index]
        return mixed_x, y_a, y_b, lam

    def cutmix(self, x, y):
        batch_size = x.size(0)
        lam = np.random.beta(self.cutmix_alpha, self.cutmix_alpha) if self.cutmix_alpha > 0 else 1
        index = torch.randperm(batch_size)
        y_a, y_b = y, y[index]
        bbx1, bby1, bbx2, bby2 = self._rand_bbox(x.size(), lam)
        x[:, :, bbx1:bbx2, bby1:bby2] = x[index, :, bbx1:bbx2, bby1:bby2]
        lam = 1 - ((bbx2 - bbx1) * (bby2 - bby1) / (x.size()[-1] * x.size()[-2]))
        return x, y_a, y_b, lam

    def _rand_bbox(self, size, lam):
        W = size[2]
        H = size[3]
        cut_rat = np.sqrt(1. - lam)
        cut_w = int(W * cut_rat)
        cut_h = int(H * cut_rat)
        cx = np.random.randint(W)
        cy = np.random.randint(H)
        bbx1 = np.clip(cx - cut_w // 2, 0, W)
        bby1 = np.clip(cy - cut_h // 2, 0, H)
        bbx2 = np.clip(cx + cut_w // 2, 0, W)
        bby2 = np.clip(cy + cut_h // 2, 0, H)
        return bbx1, bby1, bbx2, bby2

    def __call__(self, x, y):
        if np.random.random() > self.prob:
            return x, y, y, 1.0
        if np.random.random() < 0.5:
            return self.mixup(x, y)
        else:
            return self.cutmix(x, y)


class ModernCNN(nn.Module):
    def __init__(self, num_classes=10, dropout=0.3):
        super(ModernCNN, self).__init__()
        self.conv1 = self._conv_block(3, 64)
        self.conv2 = self._conv_block(64, 128, downsample=True)
        self.conv3 = self._conv_block(128, 256, downsample=True)
        self.conv4 = self._conv_block(256, 512, downsample=True)
        self.gap = nn.AdaptiveAvgPool2d(1)
        self.attention = nn.Sequential(
            nn.Linear(512, 256),
            nn.ReLU(),
            nn.Linear(256, 512),
            nn.Sigmoid()
        )
        self.classifier = nn.Sequential(
            nn.Dropout(dropout),
            nn.Linear(512, 256),
            nn.BatchNorm1d(256),
            nn.ReLU(),
            nn.Dropout(dropout / 2),
            nn.Linear(256, num_classes)
        )

    def _conv_block(self, in_channels, out_channels, downsample=False):
        stride = 2 if downsample else 1
        return nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 3, stride=stride, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, 3, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True)
        )

    def forward(self, x):
        x = self.conv1(x)
        x = self.conv2(x)
        x = self.conv3(x)
        x = self.conv4(x)
        x = self.gap(x)
        x = torch.flatten(x, 1)
        attention_weights = self.attention(x)
        x = x * attention_weights
        return self.classifier(x)

We strengthen our training with a unified MixUp/CutMix module, where we stochastically blend images or patch-swap regions and compute label interpolation with the exact pixel ratio. We pair this with a modern CNN that stacks progressive conv blocks, applies global average pooling, and uses a learned attention gate before a dropout-regularized classifier, so we improve generalization while keeping inference straightforward. Check out the FULL CODES here.

class AdvancedTrainer:
    def __init__(self, model, device='cuda' if torch.cuda.is_available() else 'cpu'):
        self.model = model.to(device)
        self.device = device
        self.mixup_cutmix = AdvancedMixupCutmix()
        self.optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
        self.scheduler = optim.lr_scheduler.OneCycleLR(
            self.optimizer, max_lr=1e-2, epochs=10, steps_per_epoch=100
        )
        self.criterion = nn.CrossEntropyLoss()

    def mixup_criterion(self, pred, y_a, y_b, lam):
        return lam * self.criterion(pred, y_a) + (1 - lam) * self.criterion(pred, y_b)

    def train_epoch(self, dataloader):
        self.model.train()
        total_loss = 0
        correct = 0
        total = 0
        for batch_idx, (data, target) in enumerate(dataloader):
            data, target = data.to(self.device), target.to(self.device)
            data, target_a, target_b, lam = self.mixup_cutmix(data, target)
            self.optimizer.zero_grad()
            output = self.model(data)
            if lam != 1.0:
                loss = self.mixup_criterion(output, target_a, target_b, lam)
            else:
                loss = self.criterion(output, target)
            loss.backward()
            torch.nn.utils.clip_grad_norm_(self.model.parameters(), max_norm=1.0)
            self.optimizer.step()
            self.scheduler.step()
            total_loss += loss.item()
            _, predicted = output.max(1)
            total += target.size(0)
            if lam != 1.0:
                correct += (lam * predicted.eq(target_a).sum().item() +
                            (1 - lam) * predicted.eq(target_b).sum().item())
            else:
                correct += predicted.eq(target).sum().item()
        return total_loss / len(dataloader), 100. * correct / total

We orchestrate training with AdamW, OneCycleLR, and dynamic MixUp/CutMix so we stabilize optimization and boost generalization. We compute an interpolated loss when mixing, clip gradients for safety, and step the scheduler each batch, so we track loss/accuracy per epoch in a single tight loop. Check out the FULL CODES here.

def demo_advanced_techniques():
    batch_size = 16
    num_classes = 10
    sample_data = torch.randn(batch_size, 3, 224, 224)
    sample_labels = torch.randint(0, num_classes, (batch_size,))
    transform_pipeline = AdvancedAugmentationPipeline(training=True)
    model = ModernCNN(num_classes=num_classes)
    trainer = AdvancedTrainer(model)
    print("Advanced Deep Learning Tutorial Demo")
    print("=" * 50)
    print("\n1. Advanced Augmentation Pipeline:")
    augmented = transform_pipeline(Image.fromarray((sample_data[0].permute(1, 2, 0).numpy() * 255).astype(np.uint8)))
    print(f"   Original shape: {sample_data[0].shape}")
    print(f"   Augmented shape: {augmented.shape}")
    print(f"   Applied transforms: Resize, Crop, Flip, ColorJitter, Blur, Perspective, etc.")
    print("\n2. MixUp/CutMix Augmentation:")
    mixup_cutmix = AdvancedMixupCutmix()
    mixed_data, target_a, target_b, lam = mixup_cutmix(sample_data, sample_labels)
    print(f"   Mixed batch shape: {mixed_data.shape}")
    print(f"   Lambda value: {lam:.3f}")
    print(f"   Technique: {'MixUp' if lam > 0.7 else 'CutMix'}")
    print("\n3. Modern CNN Architecture:")
    model.eval()
    with torch.no_grad():
        output = model(sample_data)
    print(f"   Input shape: {sample_data.shape}")
    print(f"   Output shape: {output.shape}")
    print(f"   Features: Residual blocks, Attention, Global Average Pooling")
    print(f"   Parameters: {sum(p.numel() for p in model.parameters()):,}")
    print("\n4. Advanced Training Simulation:")
    dummy_loader = [(sample_data, sample_labels)]
    loss, acc = trainer.train_epoch(dummy_loader)
    print(f"   Training loss: {loss:.4f}")
    print(f"   Training accuracy: {acc:.2f}%")
    print(f"   Learning rate: {trainer.scheduler.get_last_lr()[0]:.6f}")
    print("\nTutorial completed successfully!")
    print("This code demonstrates state-of-the-art techniques in deep learning:")
    print("• Advanced data augmentation with TorchVision v2")
    print("• MixUp and CutMix for better generalization")
    print("• Modern CNN architecture with attention")
    print("• Advanced training loop with OneCycleLR")
    print("• Gradient clipping and weight decay")


if __name__ == "__main__":
    demo_advanced_techniques()

We run a compact end-to-end demo where we visualize our augmentation pipeline, apply MixUp/CutMix, and double-check the ModernCNN with a forward pass. We then simulate one training epoch on dummy data to verify loss, accuracy, and learning-rate scheduling, so we confirm the full stack works before scaling to a real dataset.

In conclusion, we have successfully developed and tested a comprehensive workflow that integrates advanced augmentations, innovative CNN design, and modern training strategies. By experimenting with TorchVision v2, MixUp, CutMix, attention mechanisms, and OneCycleLR, we not only strengthen model performance but also deepen our understanding of cutting-edge techniques.

Check out the FULL CODES here. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.

The post How to Master Advanced TorchVision v2 Transforms, MixUp, CutMix, and Modern CNN Training for State-of-the-Art Computer Vision? appeared first on MarkTechPost.

Vision-RAG vs Text-RAG: A Technical Comparison for Enterprise Search

Most RAG failures originate at retrieval, not generation. Text-first pipelines lose layout semantics, table structure, and figure grounding during PDF→text conversion, degrading recall and precision before an LLM ever runs. Vision-RAG—retrieving rendered pages with vision-language embeddings—directly targets this bottleneck and shows material end-to-end gains on visually rich corpora.

Pipelines (and where they fail)

Text-RAG. PDF → (parser/OCR) → text chunks → text embeddings → ANN index → retrieve → LLM. Typical failure modes: OCR noise, multi-column flow breakage, table cell structure loss, and missing figure/chart semantics—documented by table- and doc-VQA benchmarks created to measure exactly these gaps.

Vision-RAG. PDF → page raster(s) → VLM embeddings (often multi-vector with late-interaction scoring) → ANN index → retrieve → VLM/LLM consumes high-fidelity crops or full pages. This preserves layout and figure-text grounding; recent systems (ColPali, VisRAG, VDocRAG) validate the approach.

What current evidence supports

Document-image retrieval works and is simpler. ColPali embeds page images and uses late-interaction matching; on the ViDoRe benchmark it outperforms modern text pipelines while remaining end-to-end trainable.

End-to-end lift is measurable. VisRAG reports 25–39% end-to-end improvement over text-RAG on multimodal documents when both retrieval and generation use a VLM.

Unified image format for real-world docs. VDocRAG shows that keeping documents in a unified image format (tables, charts, PPT/PDF) avoids parser loss and improves generalization; it also introduces OpenDocVQA for evaluation.

Resolution drives reasoning quality. High-resolution support in VLMs (e.g., Qwen2-VL/Qwen2.5-VL) is explicitly tied to SoTA results on DocVQA/MathVista/MTVQA; fidelity matters for ticks, superscripts, stamps, and small fonts.

Costs: vision context is (often) order-of-magnitude heavier—because of tokens

Vision inputs inflate token counts via tiling, not necessarily per-token price. For GPT-4o-class models, total tokens ≈ base + (tile_tokens × tiles), so 1–2 MP pages can be ~10× cost of a small text chunk. Anthropic recommends ~1.15 MP caps (~1.6k tokens) for responsiveness. By contrast, Google Gemini 2.5 Flash-Lite prices text/image/video at the same per-token rate, but large images still consume many more tokens. Engineering implication: adopt selective fidelity (crop > downsample > full page).
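As a rough illustration, the following helper applies the base-plus-tiles formula described above. The constants (85 base tokens, 170 tokens per 512×512 tile) reflect commonly cited GPT-4o-class accounting and are assumptions; real providers also apply image-resizing rules before tiling, so treat this as a budgeting sketch rather than exact billing.

import math

def estimate_image_tokens(width_px: int, height_px: int,
                          base_tokens: int = 85, tokens_per_tile: int = 170,
                          tile_size: int = 512) -> int:
    # total tokens ≈ base + (tile_tokens × tiles)
    tiles = math.ceil(width_px / tile_size) * math.ceil(height_px / tile_size)
    return base_tokens + tokens_per_tile * tiles

# A full rendered page vs. a cropped table region:
print(estimate_image_tokens(1240, 1754))  # ~A4 page at 150 DPI -> thousands of tokens
print(estimate_image_tokens(600, 400))    # ROI crop -> a few hundred tokens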

Design rules for production Vision-RAG

Align modalities across embeddings. Use encoders trained for text-image alignment (CLIP-family or VLM retrievers) and, in practice, dual-index: cheap text recall for coverage + vision rerank for precision. ColPali's late-interaction (MaxSim-style) scoring is a strong default for page images; a minimal scoring sketch follows.
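This PyTorch sketch shows MaxSim-style late interaction: each query token keeps its best-matching page-patch embedding, and the per-token maxima are summed into a page score. The embedding shapes, dimensionality, and normalization are illustrative assumptions, not ColPali's exact configuration.

import torch

def maxsim_score(query_emb: torch.Tensor, page_emb: torch.Tensor) -> torch.Tensor:
    """query_emb: (num_query_tokens, dim); page_emb: (num_patches, dim)."""
    q = torch.nn.functional.normalize(query_emb, dim=-1)
    p = torch.nn.functional.normalize(page_emb, dim=-1)
    sim = q @ p.T                        # (num_query_tokens, num_patches)
    return sim.max(dim=-1).values.sum()  # sum of per-query-token maxima

# Rank a handful of candidate pages for one query:
query = torch.randn(12, 128)                       # 12 query tokens, dim 128
pages = [torch.randn(1024, 128) for _ in range(3)]  # 3 pages, 1024 patches each
scores = torch.stack([maxsim_score(query, p) for p in pages])
print(scores.argsort(descending=True))  # pages ordered by MaxSim score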

Feed high-fidelity inputs selectively. Coarse-to-fine: run BM25/DPR, take top-k pages to a vision reranker, then send only ROI crops (tables, charts, stamps) to the generator. This preserves crucial pixels without exploding tokens under tile-based accounting.

Engineer for real documents.
• Tables: if you must parse, use table-structure models (e.g., PubTables-1M/TATR); otherwise prefer image-native retrieval.
• Charts/diagrams: expect tick- and legend-level cues; resolution must retain these. Evaluate on chart-focused VQA sets.
• Whiteboards/rotations/multilingual: page rendering avoids many OCR failure modes; multilingual scripts and rotated scans survive the pipeline.
• Provenance: store page hashes and crop coordinates alongside embeddings to reproduce the exact visual evidence used in answers.

Text-RAG vs. Vision-RAG at a glance:

Ingest pipeline
Text-RAG: PDF → parser/OCR → text chunks → text embeddings → ANN.
Vision-RAG: PDF → page render(s) → VLM page/crop embeddings (often multi-vector, late interaction) → ANN. ColPali is a canonical implementation.

Primary failure modes
Text-RAG: Parser drift, OCR noise, multi-column flow breakage, table structure loss, missing figure/chart semantics. Benchmarks exist because these errors are common.
Vision-RAG: Preserves layout/figures; failures shift to resolution/tiling choices and cross-modal alignment. VDocRAG formalizes "unified image" processing to avoid parsing loss.

Retriever representation
Text-RAG: Single-vector text embeddings; rerank via lexical or cross-encoders.
Vision-RAG: Page-image embeddings with late interaction (MaxSim-style) capture local regions; improves page-level retrieval on ViDoRe.

End-to-end gains (vs Text-RAG)
Text-RAG: Baseline.
Vision-RAG: +25–39% E2E on multimodal docs when both retrieval and generation are VLM-based (VisRAG).

Where it excels
Text-RAG: Clean, text-dominant corpora; low latency/cost.
Vision-RAG: Visually rich/structured docs: tables, charts, stamps, rotated scans, multilingual typography; unified page context helps QA.

Resolution sensitivity
Text-RAG: Not applicable beyond OCR settings.
Vision-RAG: Reasoning quality tracks input fidelity (ticks, small fonts). High-res document VLMs (e.g., Qwen2-VL family) emphasize this.

Cost model (inputs)
Text-RAG: Tokens ≈ characters; cheap retrieval contexts.
Vision-RAG: Image tokens grow with tiling: e.g., OpenAI's base+tiles formula; Anthropic guidance ~1.15 MP ≈ ~1.6k tokens. Even when per-token price is equal (Gemini 2.5 Flash-Lite), high-res pages consume far more tokens.

Cross-modal alignment need
Text-RAG: Not required.
Vision-RAG: Critical: text-image encoders must share geometry for mixed queries; ColPali/ViDoRe demonstrate effective page-image retrieval aligned to language tasks.

Benchmarks to track
Text-RAG: DocVQA (doc QA), PubTables-1M (table structure) for parsing-loss diagnostics.
Vision-RAG: ViDoRe (page retrieval), VisRAG (pipeline), VDocRAG (unified-image RAG).

Evaluation approach
Text-RAG: IR metrics plus text QA; may miss figure-text grounding issues.
Vision-RAG: Joint retrieval+gen on visually rich suites (e.g., OpenDocVQA under VDocRAG) to capture crop relevance and layout grounding.

Operational pattern
Text-RAG: One-stage retrieval; cheap to scale.
Vision-RAG: Coarse-to-fine: text recall → vision rerank → ROI crops to generator; keeps token costs bounded while preserving fidelity. (Tiling math/pricing inform budgets.)

When to prefer
Text-RAG: Contracts/templates, code/wikis, normalized tabular data (CSV/Parquet).
Vision-RAG: Real-world enterprise docs with heavy layout/graphics; compliance workflows needing pixel-exact provenance (page hash + crop coords).

Representative systems
Text-RAG: DPR/BM25 + cross-encoder rerank.
Vision-RAG: ColPali (ICLR'25) vision retriever; VisRAG pipeline; VDocRAG unified image framework.

When Text-RAG is still the right default

Clean, text-dominant corpora (contracts with fixed templates, wikis, code)

Strict latency/cost constraints for short answers

Data already normalized (CSV/Parquet)—skip pixels and query the table store

Evaluation: measure retrieval + generation jointly

Add multimodal RAG benchmarks to your harness—e.g., M²RAG (multi-modal QA, captioning, fact-verification, reranking), REAL-MM-RAG (real-world multi-modal retrieval), and RAG-Check (relevance + correctness metrics for multi-modal context). These catch failure cases (irrelevant crops, figure-text mismatch) that text-only metrics miss.

Summary

Text-RAG remains efficient for clean, text-only data. Vision-RAG is the practical default for enterprise documents with layout, tables, charts, stamps, scans, and multilingual typography. Teams that (1) align modalities, (2) deliver selective high-fidelity visual evidence, and (3) evaluate with multimodal benchmarks consistently get higher retrieval precision and better downstream answers—now backed by ColPali (ICLR 2025), VisRAG’s 25–39% E2E lift, and VDocRAG’s unified image-format results.

References:

https://arxiv.org/abs/2407.01449

https://github.com/illuin-tech/vidore-benchmark

https://huggingface.co/vidore

https://arxiv.org/abs/2410.10594

https://github.com/OpenBMB/VisRAG

https://huggingface.co/openbmb/VisRAG-Ret

https://arxiv.org/abs/2504.09795

https://openaccess.thecvf.com/content/CVPR2025/papers/Tanaka_VDocRAG_Retrieval-Augmented_Generation_over_Visually-Rich_Documents_CVPR_2025_paper.pdf

https://cvpr.thecvf.com/virtual/2025/poster/34926

https://vdocrag.github.io/

https://arxiv.org/abs/2110.00061

https://openaccess.thecvf.com/content/CVPR2022/papers/Smock_PubTables-1M_Towards_Comprehensive_Table_Extraction_From_Unstructured_Documents_CVPR_2022_paper.pdf (CVF Open Access)

https://huggingface.co/datasets/bsmock/pubtables-1m (Hugging Face)

https://arxiv.org/abs/2007.00398

https://www.docvqa.org/datasets

https://qwenlm.github.io/blog/qwen2-vl/

https://arxiv.org/html/2409.12191v1

https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct

https://arxiv.org/abs/2203.10244

https://arxiv.org/abs/2504.05506

https://aclanthology.org/2025.findings-acl.978.pdf

https://arxiv.org/pdf/2504.05506

https://openai.com/api/pricing/

https://docs.claude.com/en/docs/build-with-claude/vision

https://docs.claude.com/en/docs/build-with-claude/token-counting

https://ai.google.dev/gemini-api/docs/pricing

https://arxiv.org/abs/2502.17297

https://openreview.net/forum?id=1oCZoWvb8i

https://github.com/NEUIR/M2RAG

https://arxiv.org/abs/2502.12342

https://aclanthology.org/2025.acl-long.1528/

https://aclanthology.org/2025.acl-long.1528.pdf

https://huggingface.co/collections/ibm-research/real-mm-rag-bench-67d2dc0ddf2dfafe66f09d34

https://research.ibm.com/publications/real-mm-rag-a-real-world-multi-modal-retrieval-benchmark

https://arxiv.org/abs/2501.03995

https://platform.openai.com/docs/guides/images-vision

The post Vision-RAG vs Text-RAG: A Technical Comparison for Enterprise Search appeared first on MarkTechPost.

Alibaba’s Qwen3-Max: Production-Ready Thinking Mode, 1T+ Parameters, …

Alibaba has released Qwen3-Max, a trillion-parameter Mixture-of-Experts (MoE) model positioned as its most capable foundation model to date, with an immediate public on-ramp via Qwen Chat and Alibaba Cloud’s Model Studio API. The launch moves Qwen’s 2025 cadence from preview to production and centers on two variants: Qwen3-Max-Instruct for standard reasoning/coding tasks and Qwen3-Max-Thinking for tool-augmented “agentic” workflows.

What’s new at the model level?

Scale & architecture: Qwen3-Max crosses the 1-trillion-parameter mark with an MoE design (sparse activation per token). Alibaba positions the model as its largest and most capable to date; public briefings and coverage consistently describe it as a 1T-parameter class system rather than another mid-scale refresh.

Training/runtime posture: Qwen3-Max uses a sparse Mixture-of-Experts design and was pretrained on ~36T tokens (~2× Qwen2.5). The corpus skews toward multilingual, coding, and STEM/reasoning data. Post-training follows Qwen3’s four-stage recipe: long CoT cold-start → reasoning-focused RL → thinking/non-thinking fusion → general-domain RL. Alibaba confirms >1T parameters for Max; treat token counts/routing as team-reported until a formal Max tech report is published.

Access: Qwen Chat showcases the general-purpose UX, while Model Studio exposes inference and “thinking mode” toggles (notably, incremental_output=true is required for Qwen3 thinking models). Model listings and pricing sit under Model Studio with regioned availability.

Benchmarks: coding, agentic control, math

Coding (SWE-Bench Verified). Qwen3-Max-Instruct is reported at 69.6 on SWE-Bench Verified. That places it above some non-thinking baselines (e.g., DeepSeek V3.1 non-thinking) and slightly below Claude Opus 4 non-thinking in at least one roundup. Treat these as point-in-time numbers; SWE-Bench evaluations move quickly with harness updates.

Agentic tool use (Tau2-Bench). Qwen3-Max posts 74.8 on Tau2-Bench—an agent/tool-calling evaluation—beating named peers in the same report. Tau2 is designed to test decision-making and tool routing, not just text accuracy, so gains here are meaningful for workflow automation.

Math & advanced reasoning (AIME25, etc.). The Qwen3-Max-Thinking track (with tool use and a “heavy” runtime configuration) is described as near-perfect on key math benchmarks (e.g., AIME25) in multiple secondary sources and earlier preview coverage. Until an official technical report drops, treat “100%” claims as vendor-reported or community-replicated, not peer-reviewed.

https://qwen.ai/

https://qwen.ai/

Why two tracks—Instruct vs. Thinking?

Instruct targets conventional chat/coding/reasoning with tight latency, while Thinking enables longer deliberation traces and explicit tool calls (retrieval, code execution, browsing, evaluators), aimed at higher-reliability “agent” use cases. Critically, Alibaba’s API docs formalize the runtime switch: Qwen3 thinking models only operate with streaming incremental output enabled; commercial defaults are false, so callers must explicitly set it. This is a small but consequential contract detail if you’re instrumenting tools or chain-of-thought-like rollouts.
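To make that contract concrete, here is a minimal sketch of a streaming call against Model Studio’s OpenAI-compatible endpoint. The base URL, model ID, and the placement of incremental_output in extra_body are assumptions drawn from the description above, not verified defaults; check the Model Studio documentation for the authoritative parameter names.

from openai import OpenAI

# Assumptions: Model Studio exposes an OpenAI-compatible endpoint and accepts
# incremental_output as an extra body field for Qwen3 thinking models.
client = OpenAI(
    api_key="YOUR_MODEL_STUDIO_API_KEY",                                # hypothetical key
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",  # assumed regioned endpoint
)

stream = client.chat.completions.create(
    model="qwen3-max",  # assumed model listing name
    messages=[{"role": "user", "content": "Outline a tool-use plan to verify this claim."}],
    stream=True,                              # thinking models only run with streaming output
    extra_body={"incremental_output": True},  # per the docs, must be set explicitly (default is false)
)

for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.content:
        print(delta.content, end="", flush=True)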

How to reason about the gains (signal vs. noise)?

Coding: A 60–70 SWE-Bench Verified score range typically reflects non-trivial repository-level reasoning and patch synthesis under evaluation harness constraints (e.g., environment setup, flaky tests). If your workloads hinge on repo-scale code changes, these deltas matter more than single-file coding toys.

Agentic: Tau2-Bench emphasizes multi-tool planning and action selection. Improvements here usually translate into fewer brittle hand-crafted policies in production agents, provided your tool APIs and execution sandboxes are robust.

Math/verification: “Near-perfect” math numbers from heavy/thinky modes underscore the value of extended deliberation plus tools (calculators, validators). Portability of those gains to open-ended tasks depends on your evaluator design and guardrails.

Summary

Qwen3-Max is not a teaser—it’s a deployable 1T-parameter MoE with documented thinking-mode semantics and reproducible access paths (Qwen Chat, Model Studio). Treat day-one benchmark wins as directionally strong but continue local evals; the hard, verifiable facts are scale (≈36T tokens, >1T params) and the API contract for tool-augmented runs (incremental_output=true). For teams building coding and agentic systems, this is ready for hands-on trials and internal gating against SWE-/Tau2-style suites.

Check out the Technical details, API and Qwen Chat. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.

The post Alibaba’s Qwen3-Max: Production-Ready Thinking Mode, 1T+ Parameters, and Day-One Coding/Agentic Bench Signals appeared first on MarkTechPost.

Coding Implementation to End-to-End Transformer Model Optimization with Hugging Face Optimum, ONNX Runtime, and Quantization

In this tutorial, we walk through how we use Hugging Face Optimum to optimize Transformer models and make them faster while maintaining accuracy. We begin by setting up DistilBERT on the SST-2 dataset, and then we compare different execution engines, including plain PyTorch and torch.compile, ONNX Runtime, and quantized ONNX. By doing this step by step, we get hands-on experience with model export, optimization, quantization, and benchmarking, all inside a Google Colab environment. Check out the FULL CODES here.

!pip -q install "transformers>=4.49" "optimum[onnxruntime]>=1.20.0" "datasets>=2.20" "evaluate>=0.4" accelerate

from pathlib import Path
import os, time, numpy as np, torch
from datasets import load_dataset
import evaluate
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
from optimum.onnxruntime import ORTModelForSequenceClassification, ORTQuantizer
from optimum.onnxruntime.configuration import QuantizationConfig

os.environ.setdefault("OMP_NUM_THREADS", "1")
os.environ.setdefault("MKL_NUM_THREADS", "1")

MODEL_ID = "distilbert-base-uncased-finetuned-sst-2-english"
ORT_DIR = Path("onnx-distilbert")
Q_DIR = Path("onnx-distilbert-quant")
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
BATCH = 16
MAXLEN = 128
N_WARM = 3
N_ITERS = 8

print(f"Device: {DEVICE} | torch={torch.__version__}")

We begin by installing the required libraries and setting up our environment for Hugging Face Optimum with ONNX Runtime. We configure paths, batch size, and iteration settings, and we confirm whether we run on CPU or GPU. Check out the FULL CODES here.

ds = load_dataset("glue", "sst2", split="validation[:20%]")
texts, labels = ds["sentence"], ds["label"]
metric = evaluate.load("accuracy")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

def make_batches(texts, max_len=MAXLEN, batch=BATCH):
    for i in range(0, len(texts), batch):
        yield tokenizer(texts[i:i+batch], padding=True, truncation=True,
                        max_length=max_len, return_tensors="pt")

def run_eval(predict_fn, texts, labels):
    preds = []
    for toks in make_batches(texts):
        preds.extend(predict_fn(toks))
    return metric.compute(predictions=preds, references=labels)["accuracy"]

def bench(predict_fn, texts, n_warm=N_WARM, n_iters=N_ITERS):
    for _ in range(n_warm):
        for toks in make_batches(texts[:BATCH*2]):
            predict_fn(toks)
    times = []
    for _ in range(n_iters):
        t0 = time.time()
        for toks in make_batches(texts):
            predict_fn(toks)
        times.append((time.time() - t0) * 1000)
    return float(np.mean(times)), float(np.std(times))

We load an SST-2 validation slice and prepare tokenization, an accuracy metric, and batching. We define run_eval to compute accuracy from any predictor and bench to warm up and time end-to-end inference. With these helpers, we fairly compare different engines using identical data and batching. Check out the FULL CODES here.

torch_model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID).to(DEVICE).eval()

@torch.no_grad()
def pt_predict(toks):
    toks = {k: v.to(DEVICE) for k, v in toks.items()}
    logits = torch_model(**toks).logits
    return logits.argmax(-1).detach().cpu().tolist()

pt_ms, pt_sd = bench(pt_predict, texts)
pt_acc = run_eval(pt_predict, texts, labels)
print(f"[PyTorch eager] {pt_ms:.1f}±{pt_sd:.1f} ms | acc={pt_acc:.4f}")

compiled_model = torch_model
compile_ok = False
try:
    compiled_model = torch.compile(torch_model, mode="reduce-overhead", fullgraph=False)
    compile_ok = True
except Exception as e:
    print("torch.compile unavailable or failed -> skipping:", repr(e))

@torch.no_grad()
def ptc_predict(toks):
    toks = {k: v.to(DEVICE) for k, v in toks.items()}
    logits = compiled_model(**toks).logits
    return logits.argmax(-1).detach().cpu().tolist()

if compile_ok:
    ptc_ms, ptc_sd = bench(ptc_predict, texts)
    ptc_acc = run_eval(ptc_predict, texts, labels)
    print(f"[torch.compile] {ptc_ms:.1f}±{ptc_sd:.1f} ms | acc={ptc_acc:.4f}")

We load the baseline PyTorch classifier, define a pt_predict helper, and benchmark/score it on SST-2. We then attempt torch.compile for just-in-time graph optimizations and, if successful, run the same benchmarks to compare speed and accuracy under an identical setup. Check out the FULL CODES here.

provider = "CUDAExecutionProvider" if DEVICE == "cuda" else "CPUExecutionProvider"
ort_model = ORTModelForSequenceClassification.from_pretrained(
    MODEL_ID, export=True, provider=provider, cache_dir=ORT_DIR
)

@torch.no_grad()
def ort_predict(toks):
    logits = ort_model(**{k: v.cpu() for k, v in toks.items()}).logits
    return logits.argmax(-1).cpu().tolist()

ort_ms, ort_sd = bench(ort_predict, texts)
ort_acc = run_eval(ort_predict, texts, labels)
print(f"[ONNX Runtime] {ort_ms:.1f}±{ort_sd:.1f} ms | acc={ort_acc:.4f}")

Q_DIR.mkdir(parents=True, exist_ok=True)
quantizer = ORTQuantizer.from_pretrained(ORT_DIR)
qconfig = QuantizationConfig(approach="dynamic", per_channel=False, reduce_range=True)
quantizer.quantize(model_input=ORT_DIR, quantization_config=qconfig, save_dir=Q_DIR)

ort_quant = ORTModelForSequenceClassification.from_pretrained(Q_DIR, provider=provider)

@torch.no_grad()
def ortq_predict(toks):
    logits = ort_quant(**{k: v.cpu() for k, v in toks.items()}).logits
    return logits.argmax(-1).cpu().tolist()

oq_ms, oq_sd = bench(ortq_predict, texts)
oq_acc = run_eval(ortq_predict, texts, labels)
print(f"[ORT Quantized] {oq_ms:.1f}±{oq_sd:.1f} ms | acc={oq_acc:.4f}")

We export the model to ONNX, run it with ONNX Runtime, then apply dynamic quantization with Optimum’s ORTQuantizer and benchmark both to see how latency improves while accuracy stays comparable. Check out the FULL CODES here.

pt_pipe = pipeline("sentiment-analysis", model=torch_model, tokenizer=tokenizer,
                   device=0 if DEVICE == "cuda" else -1)
ort_pipe = pipeline("sentiment-analysis", model=ort_model, tokenizer=tokenizer, device=-1)
samples = [
    "What a fantastic movie—performed brilliantly!",
    "This was a complete waste of time.",
    "I'm not sure how I feel about this one."
]
print("\nSample predictions (PT | ORT):")
for s in samples:
    a = pt_pipe(s)[0]["label"]
    b = ort_pipe(s)[0]["label"]
    print(f"- {s}\n  PT={a} | ORT={b}")

import pandas as pd
rows = [["PyTorch eager", pt_ms, pt_sd, pt_acc],
        ["ONNX Runtime", ort_ms, ort_sd, ort_acc],
        ["ORT Quantized", oq_ms, oq_sd, oq_acc]]
if compile_ok: rows.insert(1, ["torch.compile", ptc_ms, ptc_sd, ptc_acc])
df = pd.DataFrame(rows, columns=["Engine", "Mean ms (↓)", "Std ms", "Accuracy"])
display(df)

print("""
Notes:
- BetterTransformer is deprecated on transformers>=4.49, hence omitted.
- For larger gains on GPU, also try FlashAttention2 models or FP8 with TensorRT-LLM.
- For CPU, tune threads: set OMP_NUM_THREADS/MKL_NUM_THREADS; try NUMA pinning.
- For static (calibrated) quantization, use QuantizationConfig(approach='static') with a calibration set.
""")

We sanity-check predictions with quick sentiment pipelines and print PyTorch vs ONNX labels side by side. We then assemble a summary table to compare latency and accuracy across engines, inserting torch.compile results when available. We conclude with practical notes, allowing us to extend the workflow to other backends and quantization modes.

In conclusion, we can clearly see how Optimum helps us bridge the gap between standard PyTorch models and production-ready, optimized deployments. We achieve speedups with ONNX Runtime and quantization while retaining accuracy, and we also explore how torch.compile provides gains directly within PyTorch. This workflow demonstrates a practical approach to balancing performance and efficiency for Transformer models, providing a foundation that can be further extended with advanced backends, such as OpenVINO or TensorRT.

Check out the FULL CODES here. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.


The post Coding Implementation to End-to-End Transformer Model Optimization with Hugging Face Optimum, ONNX Runtime, and Quantization appeared first on MarkTechPost.

Google AI Introduces the Public Preview of Chrome DevTools MCP: Making Your Coding Agent Control and Inspect a Live Chrome Browser

Google has released a public preview of “Chrome DevTools MCP,” a Model Context Protocol (MCP) server that lets AI coding agents control and inspect a real Chrome instance—recording performance traces, inspecting the DOM and CSS, executing JavaScript, reading console output, and automating user flows. The launch directly targets a well-known limitation in code-generating agents: they usually cannot observe the runtime behavior of the pages they create or modify. By wiring agents into Chrome’s DevTools via MCP, Google is turning static suggestion engines into closed-loop debuggers that run measurements in the browser before proposing fixes.

What exactly is Chrome DevTools MCP?

MCP is an open protocol for connecting LLMs to tools and data. Google’s DevTools MCP acts as a specialized server that exposes Chrome’s debugging surface to MCP-compatible clients. Google’s developer blog positions this as “bringing the power of Chrome DevTools to AI coding assistants,” with concrete workflows like initiating a performance trace (e.g., performance_start_trace) against a target URL, then having the agent analyze the resulting trace to suggest optimizations (for example, diagnosing high Largest Contentful Paint).

Capabilities and tool surface

The official GitHub repository documents a broad tool set. Beyond performance tracing (performance_start_trace, performance_stop_trace, performance_analyze_insight), agents can run navigation primitives (navigate_page, new_page, wait_for), simulate user input (click, fill, drag, hover), and interrogate runtime state (list_console_messages, evaluate_script, list_network_requests, get_network_request). Screenshot and snapshot utilities provide visual and DOM-state capture to support diffs and regressions. The server uses Puppeteer under the hood for reliable automation and waiting semantics, and it speaks to Chrome via the Chrome DevTools Protocol (CDP).
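To illustrate how an agent front end drives this tool surface, the following is a minimal sketch that uses the official Python MCP client SDK to launch the server over stdio and call a few of the documented tools. The tool names match the repository’s list above; the argument keys (url, reload, autoStop, insightName) are illustrative assumptions rather than the confirmed schema.

import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main():
    # Launch chrome-devtools-mcp the same way an MCP client config would: via npx.
    server = StdioServerParameters(command="npx", args=["chrome-devtools-mcp@latest"])
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # Discover the exposed tool surface (navigate_page, performance_start_trace, ...).
            tools = await session.list_tools()
            print([tool.name for tool in tools.tools])

            # Navigate, record a trace, and ask for an insight; argument keys are assumptions.
            await session.call_tool("navigate_page", arguments={"url": "https://web.dev"})
            await session.call_tool("performance_start_trace", arguments={"reload": True, "autoStop": True})
            insight = await session.call_tool("performance_analyze_insight", arguments={"insightName": "LCPBreakdown"})
            print(insight.content)

asyncio.run(main())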

Installation

Setup is intentionally minimal for MCP clients. Google recommends adding a single config stanza that shells out to npx, always tracking the latest server build:

{
  "mcpServers": {
    "chrome-devtools": {
      "command": "npx",
      "args": ["chrome-devtools-mcp@latest"]
    }
  }
}

This server integrates with multiple agent front ends: Gemini CLI, Claude Code, Cursor, and GitHub Copilot’s MCP support. For VS Code/Copilot, the repo documents a code --add-mcp one-liner; for Claude Code, a claude mcp add command mirrors the same npx target. The package targets Node.js ≥22 and current Chrome.

Example agent workflows

Google’s announcement highlights pragmatic prompts that demonstrate end-to-end loops: verify a proposed fix in a live browser; analyze network failures (e.g., CORS or blocked image requests); simulate user behaviors like form submission to reproduce bugs; inspect layout issues by reading DOM/CSS in context; and run automated performance audits to reduce LCP and other Core Web Vitals. These are all operations agents can now validate with actual measurements rather than heuristics.

https://developer.chrome.com/blog/chrome-devtools-mcp?hl=en

Summary

Chrome DevTools MCP’s public preview is a practical inflection point for agentic frontend tooling: it grounds AI assistants in real browser telemetry—performance traces, DOM/CSS state, network and console data—so recommendations are driven by measurements rather than guesswork. The first-party server, shipped by the Chrome DevTools team, is installable via npx and targets MCP-capable clients, with Chrome/CDP under the hood. Expect shorter diagnose-fix loops for regressions and flaky UI flows, plus tighter validation of performance work.

Check out the Technical details and GitHub Page. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.


The post Google AI Introduces the Public Preview of Chrome DevTools MCP: Making Your Coding Agent Control and Inspect a Live Chrome Browser appeared first on MarkTechPost.

Meet VoXtream: An Open-Sourced Full-Stream Zero-Shot TTS Model for Real-Time Use that Begins Speaking from the First Word

Real-time agents, live dubbing, and simultaneous translation die by a thousand milliseconds. Most “streaming” TTS (Text to Speech) stacks still wait for a chunk of text before they emit sound, so the human hears a beat of silence before the voice starts. VoXtream—released by KTH’s Speech, Music and Hearing group—attacks this head-on: it begins speaking after the first word, outputs audio in 80 ms frames, and reports 102 ms first-packet latency (FPL) on a modern GPU (with PyTorch compile).

What exactly is “full-stream” TTS and how is it different from “output streaming”?

Output-streaming systems decode speech in chunks but still require the entire input text upfront; the clock starts late. Full-stream systems consume text as it arrives (word-by-word from an LLM) and emit audio in lockstep. VoXtream implements the latter: it ingests a word stream and generates audio frames continuously, eliminating input-side buffering while maintaining low per-frame compute. The architecture explicitly targets first-word onset rather than only steady-state throughput.

https://arxiv.org/pdf/2509.15969

How does VoXtream start speaking without waiting for future words?

The core trick is a dynamic phoneme look-ahead inside an incremental Phoneme Transformer (PT). PT may peek up to 10 phonemes to stabilize prosody, but it does not wait for that context; generation can start immediately after the first word enters the buffer. This avoids fixed look-ahead windows that add onset delay.

What’s the model stack under the hood?

VoXtream is a single, fully-autoregressive (AR) pipeline with three transformers:

Phoneme Transformer (PT): decoder-only, incremental; dynamic look-ahead ≤ 10 phonemes; phonemization via g2pE at the word level.

Temporal Transformer (TT): AR predictor over Mimi codec semantic tokens plus a duration token that encodes a monotonic phoneme-to-audio alignment (“stay/go” and {1, 2} phonemes per frame). Mimi runs at 12.5 Hz (→ 80 ms frames).

Depth Transformer (DT): AR generator for the remaining Mimi acoustic codebooks, conditioned on TT outputs and a ReDimNet speaker embedding for zero-shot voice prompting. The Mimi decoder reconstructs the waveform frame-by-frame, enabling continuous emission.

Mimi’s streaming codec design and dual-stream tokenization are well documented; VoXtream uses its first codebook as “semantic” context and the rest for high-fidelity reconstruction.

Is it actually fast in practice—or just “fast on paper”?

The repository includes a benchmark script that measures both FPL and real-time factor (RTF). On A100, the research team reports 171 ms / 1.00 RTF without compile and 102 ms / 0.17 RTF with compile; on RTX 3090, 205 ms / 1.19 RTF uncompiled and 123 ms / 0.19 RTF compiled.
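For readers who want to reproduce these numbers on their own hardware, the measurement itself is simple to sketch. The snippet below assumes a hypothetical stream_tts generator that yields one 80 ms audio frame at a time (matching Mimi’s 12.5 Hz frame rate); substitute the actual inference call from the repository’s benchmark script.

import time
import numpy as np

FRAME_SECONDS = 0.080  # one Mimi frame at 12.5 Hz covers 80 ms of audio

def measure_fpl_and_rtf(stream_tts, text):
    """Return (first-packet latency in seconds, real-time factor) for a streaming TTS call."""
    t0 = time.perf_counter()
    first_packet_latency = None
    audio_seconds = 0.0

    for frame in stream_tts(text):  # each yielded frame is one 80 ms packet of samples
        if first_packet_latency is None:
            first_packet_latency = time.perf_counter() - t0
        audio_seconds += FRAME_SECONDS

    wall_clock = time.perf_counter() - t0
    return first_packet_latency, wall_clock / audio_seconds  # RTF < 1.0 means faster than real time

# Stand-in generator so the harness runs end to end; replace with the real model call.
def dummy_stream(text):
    for _ in range(25):  # roughly 2 seconds of audio
        time.sleep(0.01)  # pretend per-frame compute
        yield np.zeros(int(24_000 * FRAME_SECONDS), dtype=np.float32)

fpl, rtf = measure_fpl_and_rtf(dummy_stream, "hello world")
print(f"FPL: {fpl * 1000:.0f} ms | RTF: {rtf:.2f}")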

How does it compare to today’s popular streaming baselines?

The research team evaluates short-form output streaming and full-stream scenarios. On LibriSpeech-long full-stream (where text arrives word-by-word), VoXtream shows lower WER (3.24 %) than CosyVoice2 (6.11 %) and a significant naturalness preference for VoXtream in listener studies (p ≤ 5e-10), while CosyVoice2 scores higher on speaker-similarity—consistent with its flow-matching decoder. In runtime, VoXtream has the lowest FPL among the compared public streaming systems, and with compile it operates >5× faster than real time (RTF ≈ 0.17).


Why does this AR design beat diffusion/flow stacks on onset?

Diffusion/flow vocoders typically generate audio in chunks, so even if the text-audio interleaving is clever, the vocoder imposes a floor on first-packet latency. VoXtream keeps every stage AR and frame-synchronous—PT→TT→DT→Mimi decoder—so the first 80 ms packet emerges after one pass through the stack rather than a multi-step sampler. The introduction surveys prior interleaved and chunked approaches and explains how NAR flow-matching decoders used in IST-LM and CosyVoice2 impede low FPL despite strong offline quality.

Did they get here with huge data—or something smaller and cleaner?

VoXtream trains on a ~9k-hour mid-scale corpus: roughly 4.5k h Emilia and 4.5k h HiFiTTS-2 (22 kHz subset). The team diarized to remove multi-speaker clips, filtered transcripts using ASR, and applied NISQA to drop low-quality audio. Everything is resampled to 24 kHz, and the dataset card spells out the preprocessing pipeline and alignment artifacts (Mimi tokens, MFA alignments, duration labels, and speaker templates).

Are the headline quality metrics holding up outside cherry-picked clips?

Table 1 (zero-shot TTS) shows VoXtream is competitive on WER, UTMOS (MOS predictor), and speaker similarity across SEED-TTS test-en and LibriSpeech test-clean; the research team also runs an ablation: adding the CSM Depth Transformer and speaker encoder notably improves similarity without a significant WER penalty relative to a stripped baseline. The subjective study uses a MUSHRA-like protocol and a second-stage preference test tailored to full-stream generation.


Where does this land in the TTS landscape?

As per the research paper, it positions VoXtream among recent interleaved AR + NAR vocoder approaches and LM-codec stacks. The core contribution isn’t a new codec or a giant model—it’s a latency-focused AR arrangement plus a duration-token alignment that preserves input-side streaming. If you build live agents, the important trade-off is explicit: a small drop in speaker similarity vs. order-of-magnitude lower FPL than chunked NAR vocoders in full-stream conditions.

Check out the PAPER, Model on Hugging Face, GitHub Page and Project Page. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.


The post Meet VoXtream: An Open-Sourced Full-Stream Zero-Shot TTS Model for Real-Time Use that Begins Speaking from the First Word appeared first on MarkTechPost.

Running deep research AI agents on Amazon Bedrock AgentCore

AI agents are evolving beyond basic single-task helpers into more powerful systems that can plan, critique, and collaborate with other agents to solve complex problems. Deep Agents—a recently introduced framework built on LangGraph—bring these capabilities to life, enabling multi-agent workflows that mirror real-world team dynamics. The challenge, however, is not just building such agents but also running them reliably and securely in production. This is where Amazon Bedrock AgentCore Runtime comes in. By providing a secure, serverless environment purpose-built for AI agents and tools, Runtime makes it possible to deploy Deep Agents at enterprise scale without the heavy lifting of managing infrastructure.
In this post, we demonstrate how to deploy Deep Agents on AgentCore Runtime. As shown in the following figure, AgentCore Runtime scales any agent and provides session isolation by allocating a new microVM for each new session.

What is Amazon Bedrock AgentCore?
Amazon Bedrock AgentCore is both framework-agnostic and model-agnostic, giving you the flexibility to deploy and operate advanced AI agents securely and at scale. Whether you’re building with Strands Agents, CrewAI, LangGraph, LlamaIndex, or another framework—and running them on a large language model (LLM)—AgentCore provides the infrastructure to support them. Its modular services are purpose-built for dynamic agent workloads, with tools to extend agent capabilities and controls required for production use. By alleviating the undifferentiated heavy lifting of building and managing specialized agent infrastructure, AgentCore lets you bring your preferred framework and model and deploy without rewriting code.
Amazon Bedrock AgentCore offers a comprehensive suite of capabilities designed to transform local agent prototypes into production-ready systems. These include persistent memory for maintaining context in and across conversations, access to existing APIs using Model Context Protocol (MCP), seamless integration with corporate authentication systems, specialized tools for web browsing and code execution, and deep observability into agent reasoning processes. In this post, we focus specifically on the AgentCore Runtime component.
Core capabilities of AgentCore Runtime
AgentCore Runtime provides a serverless, secure hosting environment specifically designed for agentic workloads. It packages code into a lightweight container with a simple, consistent interface, making it equally well-suited for running agents, tools, MCP servers, or other workloads that benefit from seamless scaling and integrated identity management.
AgentCore Runtime offers extended execution times of up to 8 hours for complex reasoning tasks, handles large payloads for multimodal content, and implements consumption-based pricing that charges only during active processing—not while waiting for LLM or tool responses. Each user session runs in complete isolation within dedicated micro virtual machines (microVMs), maintaining security and helping to prevent cross-session contamination between agent interactions. The runtime works with many frameworks (for example: LangGraph, CrewAI, Strands, and so on) and many foundation model providers, while providing built-in corporate authentication, specialized agent observability, and unified access to the broader AgentCore environment through a single SDK.
Real-world example: Deep Agents integration
In this post we’re going to deploy the recently released Deep Agents implementation example on AgentCore Runtime—showing just how little effort it takes to get the latest agent innovations up and running.

The sample implementation in the preceding diagram includes:

A research agent that conducts deep internet searches using the Tavily API
A critique agent that reviews and provides feedback on generated reports
A main orchestrator that manages the workflow and handles file operations

Deep Agents uses LangGraph’s state management to create a multi-agent system with:

Built-in task planning through a write_todos tool that helps agents break down complex requests
Virtual file system where agents can read/write files to maintain context across interactions
Sub-agent architecture allowing specialized agents to be invoked for specific tasks while maintaining context isolation
Recursive reasoning with high recursion limits (more than 1,000) to handle complex, multi-step workflows

This architecture enables Deep Agents to handle research tasks that require multiple rounds of information gathering, synthesis, and refinement.
The key integration points in our code showcase how agents work with AgentCore. The beauty is in its simplicity—we only need to add a couple of lines of code to make an agent AgentCore-compatible:

# 1. Import the AgentCore runtime
from bedrock_agentcore.runtime import BedrockAgentCoreApp
app = BedrockAgentCoreApp()

# 2. Decorate your agent function with @app.entrypoint
@app.entrypoint
async def langgraph_bedrock(payload):
    # Your existing agent logic remains unchanged
    user_input = payload.get("prompt")

    # Call your agent as before
    stream = agent.astream(
        {"messages": [HumanMessage(content=user_input)]},
        stream_mode="values"
    )

    # Stream responses back
    async for chunk in stream:
        yield chunk

# 3. Add the runtime starter at the bottom
if __name__ == "__main__":
    app.run()

That’s it! The rest of the code—model initialization, API integrations, and agent logic—remains exactly as it was. AgentCore handles the infrastructure while your agent handles the intelligence. This integration pattern works for most Python agent frameworks, making AgentCore truly framework-agnostic.
Deploying to AgentCore Runtime: Step-by-step
Let’s walk through the actual deployment process using the AgentCore Starter ToolKit, which dramatically simplifies the deployment workflow.
Prerequisites
Before you begin, make sure you have:

Python 3.10 or higher
AWS credentials configured
Amazon Bedrock AgentCore SDK installed

Step 1: IAM permissions
There are two different AWS Identity and Access Management (IAM) permissions you need to consider when deploying an agent in an AgentCore Runtime—the role you, as a developer use to create AgentCore resources and the execution role that an agent needs to run in an AgentCore Runtime. While the latter role can now be auto-created by the AgentCore Starter Toolkit (auto_create_execution_role=True), the former must be defined as described in IAM Permissions for AgentCore Runtime.
Step 2: Add a wrapper to your agent
As shown in the preceding Deep Agents example, add the AgentCore imports and decorator to your existing agent code.
Step 3: Deploy using the AgentCore starter toolkit
The starter toolkit provides a three-step deployment process:

from bedrock_agentcore_starter_toolkit import Runtime

# Step 1: Configure
agentcore_runtime = Runtime()
config_response = agentcore_runtime.configure(
    entrypoint="hello.py",  # contains the code we showed earlier in the post
    execution_role=role_arn,  # or auto-create
    auto_create_ecr=True,
    requirements_file="requirements.txt",
    region="us-west-2",
    agent_name="deepagents-research"
)

# Step 2: Launch
launch_result = agentcore_runtime.launch()
print(f"Agent deployed! ARN: {launch_result['agent_arn']}")

# Step 3: Invoke
response = agentcore_runtime.invoke({
    "prompt": "Research the latest developments in quantum computing"
})

Step 4: What happens behind the scenes
When you run the deployment, the starter kit automatically:

Generates an optimized Docker file with Python 3.13-slim base image and OpenTelemetry instrumentation
Builds your container with the dependencies from requirements.txt
Creates an Amazon Elastic Container Registry (Amazon ECR) repository (if auto_create_ecr=True) and pushes your image
Deploys to AgentCore Runtime and monitors the deployment status
Configures networking and observability with Amazon CloudWatch and AWS X-Ray integration

The entire process typically takes 2–3 minutes, after which your agent is ready to handle requests at scale. Each new session is launched in its own fresh AgentCore Runtime microVM, maintaining complete environment isolation.
The starter kit generates a configuration file (.bedrock_agentcore.yaml) that captures your deployment settings, making it straightforward to redeploy or update your agent later.
Invoking your deployed agent
After deployment, you have two options for invoking your agent:
Option 1: Using the start kit (shown in Step 3)

response = agentcore_runtime.invoke({
    "prompt": "Research the latest developments in quantum computing"
})

Option 2: Using boto3 SDK directly

import boto3
import json

agentcore_client = boto3.client('bedrock-agentcore', region_name='us-west-2')
response = agentcore_client.invoke_agent_runtime(
    agentRuntimeArn=agent_arn,
    qualifier="DEFAULT",
    payload=json.dumps({
        "prompt": "Analyze the impact of AI on healthcare in 2024"
    })
)

# Handle streaming response
for event in response['completion']:
    if 'chunk' in event:
        print(event['chunk']['bytes'].decode('utf-8'))

Deep Agents in action
As the code executes in Bedrock AgentCore Runtime, the primary agent orchestrates specialized sub-agents—each with its own purpose, prompt, and tool access—to solve complex tasks more effectively. In this case, the orchestrator prompt (research_instructions) sets the plan (a rough configuration sketch follows the list):

Write the question to question.txt
Fan out to one or more research-agent calls (each on a single sub-topic) using the internet_search tool
Synthesize findings into final_report.md
Call critique-agent to evaluate gaps and structure
Optionally loop back to more research/edits until quality is met
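The following is a rough configuration sketch of that plan, assuming the deepagents package’s create_deep_agent interface and a Tavily-backed search tool; the prompts are abbreviated and the exact wiring in the deployed example may differ.

import os
from deepagents import create_deep_agent
from tavily import TavilyClient

tavily = TavilyClient(api_key=os.environ["TAVILY_API_KEY"])

def internet_search(query: str, max_results: int = 5):
    """Web search tool used by the research sub-agent."""
    return tavily.search(query, max_results=max_results)

research_agent = {
    "name": "research-agent",
    "description": "Researches a single sub-topic in depth.",
    "prompt": "You are a focused researcher. Investigate only the sub-topic you are given.",
    "tools": ["internet_search"],
}

critique_agent = {
    "name": "critique-agent",
    "description": "Reviews final_report.md for gaps, structure, and unsupported claims.",
    "prompt": "You are a tough editor. Critique the report and list concrete improvements.",
}

# research_instructions abbreviated here; the real prompt spells out the
# question.txt / final_report.md workflow described above.
agent = create_deep_agent(
    tools=[internet_search],
    instructions="Save the question to question.txt, research each sub-topic, "
                 "write final_report.md, then have critique-agent review it.",
    subagents=[research_agent, critique_agent],
)

result = agent.invoke(
    {"messages": [{"role": "user", "content": "What changed in EU AI regulation this year?"}]},
    config={"recursion_limit": 1000},
)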


Clean up
When finished, don’t forget to de-allocate the provisioned AgentCore Runtime as well as the container repository that was created during the process:

agentcore_control_client = boto3.client(
    'bedrock-agentcore-control', region_name=region)
ecr_client = boto3.client('ecr', region_name=region)
runtime_delete_response = agentcore_control_client.delete_agent_runtime(
    agentRuntimeId=launch_result.agent_id,
)
response = ecr_client.delete_repository(
    repositoryName=launch_result.ecr_uri.split('/')[1],
    force=True
)

Conclusion
Amazon Bedrock AgentCore represents a paradigm shift in how we deploy AI agents. By abstracting away infrastructure complexity while maintaining framework and model flexibility, AgentCore enables developers to focus on building sophisticated agent logic rather than managing deployment pipelines. Our Deep Agents deployment demonstrates that even complex, multi-agent systems with external API integrations can be deployed with minimal code changes. The combination of enterprise-grade security, built-in observability, and serverless scaling makes AgentCore the best choice for production AI agent deployments. Specifically for deep research agents, AgentCore offers the following unique capabilities that you can explore:

AgentCore Runtime can handle asynchronous processing and long running (up to 8 hours) agents. Asynchronous tasks allow your agent to continue processing after responding to the client and handle long-running operations without blocking responses. Your background research sub-agent could be asynchronously researching for hours.
AgentCore Runtime works with AgentCore Memory, enabling capabilities such as building upon previous findings, remembering research preferences, and maintaining complex investigation context without losing progress between sessions.
You can use AgentCore Gateway to extend your deep research to include proprietary insights from enterprise services and data sources. By exposing these differentiated resources as MCP tools, your agents can quickly take advantage and combine that with publicly available knowledge.

Ready to deploy your agents to production? Here’s how to get started:

Install the AgentCore starter kit: pip install bedrock-agentcore-starter-toolkit
Experiment: Deploy your code by following this step by step guide.

The era of production-ready AI agents is here. With AgentCore, the journey from prototype to production has never been shorter.

About the authors
Vadim Omeltchenko is a Sr. AI/ML Solutions Architect who is passionate about helping AWS customers innovate in the cloud. His prior IT experience was predominantly on the ground.
Eashan Kaushik is a Specialist Solutions Architect AI/ML at Amazon Web Services. He is driven by creating cutting-edge generative AI solutions while prioritizing a customer-centric approach to his work. Before this role, he obtained an MS in Computer Science from NYU Tandon School of Engineering. Outside of work, he enjoys sports, lifting, and running marathons.
Shreyas Subramanian is a Principal data scientist and helps customers by using Machine Learning to solve their business challenges using the AWS platform. Shreyas has a background in large scale optimization and Machine Learning, and in use of Machine Learning and Reinforcement Learning for accelerating optimization tasks.
Mark Roy is a Principal Machine Learning Architect for AWS, helping customers design and build generative AI solutions. His focus since early 2023 has been leading solution architecture efforts for the launch of Amazon Bedrock, the flagship generative AI offering from AWS for builders. Mark’s work covers a wide range of use cases, with a primary interest in generative AI, agents, and scaling ML across the enterprise. He has helped companies in insurance, financial services, media and entertainment, healthcare, utilities, and manufacturing. Prior to joining AWS, Mark was an architect, developer, and technology leader for over 25 years, including 19 years in financial services. Mark holds six AWS Certifications, including the ML Specialty Certification.

Integrate tokenization with Amazon Bedrock Guardrails for secure data …

This post is co-written by Mark Warner, Principal Solutions Architect for Thales, Cyber Security Products.
As generative AI applications make their way into production environments, they integrate with a wider range of business systems that process sensitive customer data. This integration introduces new challenges around protecting personally identifiable information (PII) while maintaining the ability to recover original data when legitimately needed by downstream applications. Consider a financial services company implementing generative AI across different departments. The customer service team needs an AI assistant that can access customer profiles and provide personalized responses that include contact information, for example: “We’ll send your new card to your address at 123 Main Street.” Meanwhile, the fraud analysis team requires the same customer data but must analyze patterns without exposing actual PII, working only with protected representations of sensitive information.
Amazon Bedrock Guardrails helps detect sensitive information, such as PII, in standard format in input prompts or model responses. Sensitive information filters give organizations control over how sensitive data is handled, with options to block requests containing PII or mask the sensitive information with generic placeholders like {NAME} or {EMAIL}. This capability helps organizations comply with data protection regulations while still using the power of large language models (LLMs).
Although masking effectively protects sensitive information, it creates a new challenge: the loss of data reversibility. When guardrails replace sensitive data with generic masks, the original information becomes inaccessible to downstream applications that might need it for legitimate business processes. This limitation can impact workflows where both security and functional data are required.
Tokenization offers a complementary approach to this challenge. Unlike masking, tokenization replaces sensitive data with format-preserving tokens that are mathematically unrelated to the original information but maintain its structure and usability. These tokens can be securely reversed back to their original values when needed by authorized systems, creating a path for secure data flows throughout an organization’s environment.
In this post, we show you how to integrate Amazon Bedrock Guardrails with third-party tokenization services to protect sensitive data while maintaining data reversibility. By combining these technologies, organizations can implement stronger privacy controls while preserving the functionality of their generative AI applications and related systems. The solution described in this post demonstrates how to combine Amazon Bedrock Guardrails with tokenization services from Thales CipherTrust Data Security Platform to create an architecture that protects sensitive data without sacrificing the ability to process that data securely when needed. This approach is particularly valuable for organizations in highly regulated industries that need to balance innovation with compliance requirements.
Amazon Bedrock Guardrails APIs
This section describes the key components and workflow for the integration between Amazon Bedrock Guardrails and a third-party tokenization service.
Amazon Bedrock Guardrails provides two distinct approaches for implementing content safety controls:

Direct integration with model invocation through APIs like InvokeModel and Converse, where guardrails automatically evaluate inputs and outputs as part of the model inference process.
Standalone evaluation through the ApplyGuardrail API, which decouples guardrails assessment from model invocation, allowing evaluation of text against defined policies.

This post uses the ApplyGuardrail API for tokenization integration because it separates content assessment from model invocation, allowing for the insertion of tokenization processing between these steps. This separation creates the necessary space in the workflow to replace guardrail masks with format-preserving tokens before model invocation, or after the model response is handed over to the target application downstream in the process.
The solution extends the typical ApplyGuardrail API implementation by inserting tokenization processing between guardrail evaluation and model invocation, as follows:

The application calls the ApplyGuardrail API to assess the user input for sensitive information.
If no sensitive information is detected (action = “NONE”), the application proceeds to model invocation via the InvokeModel API.
If sensitive information is detected (action = “ANONYMIZED”):

The application captures the detected PII and its positions.
It calls a tokenization service to convert these entities into format-preserving tokens.
It replaces the generic guardrail masks with these tokens.
The application then invokes the foundation model with the tokenized content.

For model responses:

The application applies guardrails to check the output from the model for sensitive information.
It tokenizes detected PII before passing the response to downstream systems.

Solution overview
To illustrate how this workflow delivers value in practice, consider a financial advisory application that helps customers understand their spending patterns and receive personalized financial recommendations. In this example, three distinct application components work together to provide secure, AI-powered financial insights:

Customer gateway service – This trusted frontend orchestrator receives customer queries that often contain sensitive information. For example, a customer might ask: “Hi, this is j.smith@example.com. Based on my last five transactions on acme.com, and my current balance of $2,342.18, should I consider their new credit card offer?”
Financial analysis engine – This AI-powered component analyzes financial patterns and generates recommendations but doesn’t need access to actual customer PII. It works with anonymized or tokenized information.
Response processing service – This trusted service handles the final customer communication, including detokenizing sensitive information before presenting results to the customer.

The following diagram illustrates the workflow for integrating Amazon Bedrock Guardrails with tokenization services in this financial advisory application. AWS Step Functions orchestrates the sequential process of PII detection, tokenization, AI model invocation, and detokenization across the three key components (customer gateway service, financial analysis engine, and response processing service) using AWS Lambda functions.

The workflow operates as follows:

The customer gateway service (for this example, through Amazon API Gateway) receives the user input containing sensitive information.
It calls the ApplyGuardrail API to identify PII or other sensitive information that should be anonymized or blocked.
For detected sensitive elements (such as user names or merchant names), it calls the tokenization service to generate format-preserving tokens.
The input with tokenized values is passed to the financial analysis engine for processing. (For example, “Hi, this is [[TOKEN_123]]. Based on my last five transactions on [[TOKEN_456]] and my current balance of $2,342.18, should I consider their new credit card offer?”)
The financial analysis engine invokes an LLM on Amazon Bedrock to generate financial advice using the tokenized data.
The model response, potentially containing tokenized values, is sent to the response processing service.
This service calls the tokenization service to detokenize the tokens, restoring the original sensitive values.
The final, detokenized response is delivered to the customer.

This architecture maintains data confidentiality throughout the processing flow while preserving the information’s utility. The financial analysis engine works with structurally valid but cryptographically protected data, allowing it to generate meaningful recommendations without exposing sensitive customer information. Meanwhile, the trusted components at the entry and exit points of the workflow can access the actual data when necessary, creating a secure end-to-end solution.
In the following sections, we provide a detailed walkthrough of implementing the integration between Amazon Bedrock Guardrails and tokenization services.
Prerequisites
To implement the solution described in this post, you must have the following components configured in your environment:

An AWS account with Amazon Bedrock enabled in your target AWS Region.
Appropriate AWS Identity and Access Management (IAM) permissions configured following least privilege principles with specific actions enabled: bedrock:CreateGuardrail, bedrock:ApplyGuardrail, and bedrock:InvokeModel.
For AWS Organizations, verify Amazon Bedrock access is permitted by service control policies.
A Python 3.7+ environment with the boto3 library installed. For information about installing the boto3 library, refer to AWS SDK for Python (Boto3).
AWS credentials configured for programmatic access using the AWS Command Line Interface (AWS CLI). For more details, refer to Configuring settings for the AWS CLI.
This implementation requires a deployed tokenization service accessible through REST API endpoints. Although this walkthrough demonstrates integration with Thales CipherTrust, the pattern adapts to tokenization providers offering protect and unprotect API operations. Make sure network connectivity exists between your application environment and both AWS APIs and your tokenization service endpoints, along with valid authentication credentials for accessing your chosen tokenization service. For information about setting up Thales CipherTrust specifically, refer to How Thales Enables PCI DSS Compliance with a Tokenization Solution on AWS.

Configure Amazon Bedrock Guardrails
Configure Amazon Bedrock Guardrails for PII detection and masking through the Amazon Bedrock console or programmatically using the AWS SDK. Sensitive information filter policies can anonymize or redact information from model requests or responses:

import boto3

def create_bedrock_guardrail():
    """
    Create a guardrail in Amazon Bedrock for financial applications with PII protection.
    """
    bedrock = boto3.client('bedrock')

    response = bedrock.create_guardrail(
        name="FinancialServiceGuardrail",
        description="Guardrail for financial applications with PII protection",
        sensitiveInformationPolicyConfig={
            'piiEntitiesConfig': [
                {
                    'type': 'URL',
                    'action': 'ANONYMIZE',
                    'inputAction': 'ANONYMIZE',
                    'outputAction': 'ANONYMIZE',
                    'inputEnabled': True,
                    'outputEnabled': True
                },
                {
                    'type': 'EMAIL',
                    'action': 'ANONYMIZE',
                    'inputAction': 'ANONYMIZE',
                    'outputAction': 'ANONYMIZE',
                    'inputEnabled': True,
                    'outputEnabled': True
                },
                {
                    'type': 'NAME',
                    'action': 'ANONYMIZE',
                    'inputAction': 'ANONYMIZE',
                    'outputAction': 'ANONYMIZE',
                    'inputEnabled': True,
                    'outputEnabled': True
                }
            ]
        },
        blockedInputMessaging="I can't provide information with PII data.",
        blockedOutputsMessaging="I can't generate content with PII data."
    )

    return response

Integrate the tokenization workflow
This section implements the tokenization workflow by first detecting PII entities with the ApplyGuardrail API, then replacing the generic masks with format-preserving tokens from your tokenization service.
Apply guardrails to detect PII entities
Use the ApplyGuardrail API to validate input text from the user and detect PII entities:

import boto3
from botocore.exceptions import ClientError

def invoke_guardrail(user_query):
    """
    Apply Amazon Bedrock Guardrails to validate input text and detect PII entities.

    Args:
        user_query (str): The user's input text to be checked.

    Returns:
        dict: The response from the ApplyGuardrail API.

    Raises:
        ClientError: If there's an error applying the guardrail.
    """
    try:
        bedrock_runtime = boto3.client('bedrock-runtime')

        response = bedrock_runtime.apply_guardrail(
            guardrailIdentifier='your-guardrail-id',  # Replace with your actual guardrail ID
            guardrailVersion='your-guardrail-version',  # Replace with your actual version
            source="INPUT",
            content=[{"text": {"text": user_query}}]
        )

        return response
    except ClientError as e:
        print(f"Error applying guardrail: {e}")
        raise

Invoke tokenization service
The response from the ApplyGuardrail API includes the list of PII entities matching the sensitive information policy. Parse those entities and invoke the tokenization service to generate the tokens.
The following example code uses the Thales CipherTrust tokenization service:

import json
import requests

def thales_ciphertrust_tokenizer(guardrail_response):
    """
    Process PII entities detected by the guardrail and tokenize them using Thales CipherTrust.

    Args:
        guardrail_response (dict): The response from the ApplyGuardrail API.

    Returns:
        list: List of dictionaries containing original values, types, and tokenized responses.

    Raises:
        RuntimeError: If there's an error invoking Thales CipherTrust.
    """
    try:
        protected_results = []

        for assessment in guardrail_response.get("assessments", []):
            pii_entities = assessment.get("sensitiveInformationPolicy", {}).get("piiEntities", [])

            for entity in pii_entities:
                sensitive_value = entity.get("match")
                entity_type = entity.get("type")

                if sensitive_value:
                    # Prepare payload for Thales CipherTrust tokenization service
                    crdp_payload = {
                        "protection_policy_name": "plain-alpha-internal",
                        "DATA_KEY": sensitive_value,
                    }

                    url_str = "http://your-ciphertrust-cname:8090/v1/protect"  # Replace with your actual CipherTrust URL
                    headers = {"Content-Type": "application/json"}

                    # Invoke the Thales CipherTrust tokenization service
                    response = requests.post(url_str, headers=headers, data=json.dumps(crdp_payload))
                    response.raise_for_status()
                    response_json = response.json()

                    protected_results.append({
                        "original_value": sensitive_value,
                        "type": entity_type,
                        "protection_response": response_json
                    })

        return protected_results
    except requests.RequestException as e:
        print(f"Error invoking Thales CipherTrust: {e}")
        raise RuntimeError(f"Error invoking Thales CipherTrust: {e}") from e

Replace guardrail masks with tokens
Next, substitute the generic guardrail masks with the tokens generated by the Thales CipherTrust tokenization service. This enables downstream applications to work with structurally valid data while maintaining security and reversibility.

def process_guardrail_output(protected_results, guardrail_response):
    """
    Process guardrail output by replacing placeholders with protected values.

    Args:
        protected_results (list): List of protected data tokenized by Thales CipherTrust.
        guardrail_response (dict): Guardrail response dictionary.

    Returns:
        list: List of modified output items with placeholders replaced by tokens.

    Raises:
        ValueError: If input parameters are invalid.
        Exception: For any unexpected errors during processing.
    """
    try:
        # Validate input types
        if not isinstance(protected_results, list) or not isinstance(guardrail_response, dict):
            raise ValueError("Invalid input parameters")

        # Extract protection map
        protection_map = {res['type'].upper(): res['protection_response']['protected_data']
                          for res in protected_results}

        # Process outputs
        modified_outputs = []
        for output_item in guardrail_response.get('outputs', []):
            if 'text' in output_item:
                modified_text = output_item['text']

                # Replace all placeholders in one pass
                for pii_type, protected_value in protection_map.items():
                    modified_text = modified_text.replace(f"{{{pii_type}}}", protected_value)

                modified_outputs.append({"text": modified_text})

        return modified_outputs
    except (ValueError, KeyError) as e:
        print(f"Error processing guardrail output: {e}")
        raise
    except Exception as e:
        print(f"Unexpected error while processing guardrail output: {e}")
        raise
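Putting the pieces together, a trusted orchestrator could chain these helpers before invoking the model. The guardrail action check, model ID, and the Converse call below are illustrative assumptions for this sketch; adapt them to your own deployment.

import boto3

def answer_with_tokenized_pii(user_query: str) -> str:
    guardrail_response = invoke_guardrail(user_query)

    if guardrail_response.get("action") == "NONE":
        # No sensitive information detected; pass the raw input through.
        prompt = user_query
    else:
        # Replace generic guardrail masks with reversible, format-preserving tokens.
        protected_results = thales_ciphertrust_tokenizer(guardrail_response)
        tokenized_outputs = process_guardrail_output(protected_results, guardrail_response)
        prompt = tokenized_outputs[0]["text"]

    bedrock_runtime = boto3.client("bedrock-runtime")
    model_response = bedrock_runtime.converse(
        modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # illustrative model ID
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    return model_response["output"]["message"]["content"][0]["text"]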

The result of this process transforms user inputs containing information that match the sensitive information policy applied using Amazon Bedrock Guardrails into unique and reversible tokenized versions.
The following example input contains PII elements:

“Hi, this is john.smith@example.com. Based on my last five transactions on acme.com, and my current balance of $2,342.18, should I consider their new credit card offer?”

The following is an example of the sanitized user input:

“Hi, this is 1001000GC5gDh1.D8eK71@EjaWV.lhC. Based on my last five transactions on 1001000WcFzawG.Jc9Tfc, and my current balance of $2,342.18, should I consider their new credit card offer?”

Downstream application processing
The sanitized input is ready to be used by generative AI applications, including model invocations on Amazon Bedrock. In response to the tokenized input, an LLM invoked by the financial analysis engine would produce a relevant analysis that maintains the secure token format:

“Based on your recent transactions at 1001000WcFzawG.Jc9Tfc and your current account status, I can confirm that the new credit card offer would provide approximately $33 in monthly rewards based on your spending patterns. With annual benefits of around $394 against the $55 annual fee, this card would be beneficial for your profile, 1001000GC5gDh1.D8eK71@EjaWV.lhC.”

When authorized systems need to recover original values, tokens are detokenized. With Thales CipherTrust, this is accomplished using the Detokenize API, which requires the same parameters as in the previous tokenize action. This completes the secure data flow while preserving the ability to recover original information when needed.
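As a sketch of what that reverse call could look like for a trusted downstream component, the snippet below mirrors the protect request shown earlier. The /v1/reveal path, payload key names, and response shape are assumptions; consult your CipherTrust CRDP deployment for the exact contract.

import json
import requests

def thales_ciphertrust_detokenizer(token_value: str) -> str:
    """Recover the original value for a token issued by the protect call above (sketch)."""
    crdp_payload = {
        "protection_policy_name": "plain-alpha-internal",  # same policy used to tokenize
        "protected_data": token_value,                      # assumed key name for the token
    }
    url_str = "http://your-ciphertrust-cname:8090/v1/reveal"  # assumed unprotect endpoint
    headers = {"Content-Type": "application/json"}

    response = requests.post(url_str, headers=headers, data=json.dumps(crdp_payload))
    response.raise_for_status()
    return response.json()["data"]  # assumed response shape: {"data": "<original value>"}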
Clean up
As you follow the approach described in this post, you will create new AWS resources in your account. To avoid incurring additional charges, delete these resources when you no longer need them.
To clean up your resources, complete the following steps:

Delete the guardrails you created. For instructions, refer to Delete your guardrail.
If you implemented the tokenization workflow using Lambda, API Gateway, or Step Functions as described in this post, remove the resources you created.
This post assumes a tokenization solution is already available in your account. If you deployed a third-party tokenization solution (such as Thales CipherTrust) to test this implementation, refer to that solution’s documentation for instructions to properly decommission these resources and stop incurring charges.

Conclusion
This post demonstrated how to combine Amazon Bedrock Guardrails with tokenization to enhance handling of sensitive information in generative AI workflows. By integrating these technologies, organizations can protect PII during processing while maintaining data utility and reversibility for authorized downstream applications.
The implementation illustrated uses Thales CipherTrust Data Security Platform for tokenization, but the architecture supports many tokenization solutions. To learn more about a serverless approach to building custom tokenization capabilities, refer to Building a serverless tokenization solution to mask sensitive data.
This solution provides a practical framework for builders to use the full potential of generative AI with appropriate safeguards. By combining the content safety mechanisms of Amazon Bedrock Guardrails with the data reversibility of tokenization, you can implement responsible AI workflows that align with your application requirements and organizational policies while preserving the functionality needed for downstream systems.
To learn more about implementing responsible AI practices on AWS, see Transform responsible AI from theory into practice.

About the Authors
Nizar Kheir is a Senior Solutions Architect at AWS with more than 15 years of experience spanning various industry segments. He currently works with public sector customers in France and across EMEA to help them modernize their IT infrastructure and foster innovation by harnessing the power of the AWS Cloud.
Mark Warner is a Principal Solutions Architect for Thales, Cyber Security Products division. He works with companies in various industries such as finance, healthcare, and insurance to improve their security architectures. His focus is assisting organizations with reducing risk, increasing compliance, and streamlining data security operations to reduce the probability of a breach.