Integrate external tools with Amazon Quick Agents using Model Context Protocol (MCP)

Amazon Quick supports Model Context Protocol (MCP) integrations for action execution, data access, and AI agent integration. You can expose your application’s capabilities as MCP tools by hosting your own MCP server and configuring an MCP integration in Amazon Quick. Amazon Quick acts as an MCP client and connects to your MCP server endpoint to access the tools you expose. After that connection is in place, Amazon Quick AI agents and automations can invoke your tools to retrieve data and run actions in your product, using the customer’s authentication, authorization, and governance controls.
With an Amazon Quick MCP integration, you can build a repeatable integration contract: you define tools once, publish a stable endpoint, and support the same model across customers. Your customers can build AI agents and automations in Amazon Quick to analyze data, search enterprise knowledge, and run workflows across their business. They get a way to use your product inside Amazon Quick workflows without building custom connectors for every use case.
In this post, you’ll use a six-step checklist to build a new MCP server, or to validate and adjust an existing MCP server, for Amazon Quick integration. The Amazon Quick User Guide describes the MCP client behavior and constraints; this post is a how-to guide covering the implementation details that third-party (3P) partners need to integrate with Amazon Quick through MCP.
Solution overview
Amazon Quick includes an MCP client that you configure through an integration. That integration connects to a remote MCP server, discovers the tools and data sources the server exposes, and makes them available to AI agents and automations. MCP integrations in Amazon Quick support both action execution and data access, including knowledge base creation.
Figure 1 shows how customers use Amazon Quick to invoke application capabilities, exposed as MCP tools by ISVs, enterprise systems, or custom solutions through an MCP integration.

Figure 1. Amazon Quick MCP integration with an external MCP server that exposes application capabilities as MCP tools.

Prerequisites

An Amazon Quick Professional subscription.
An Amazon Quick user with Author or higher permissions to create action connectors.
A remote MCP server endpoint that is reachable from Amazon Quick.
An authentication approach that your MCP server supports: user authentication, service authentication, or no authentication.
A small initial set of product capabilities as APIs to be exposed as MCP tools (start with the operations your customers use most).

Checklist for Amazon Quick MCP integration readiness
Now let’s walk through the six-step process to build the integration with Amazon Quick using MCP:

Step 1: Choose your MCP server deployment model.
Step 2: Implement a remote MCP server compatible with Amazon Quick.
Step 3: Implement authentication and authorization.
Step 4: Document configuration for Amazon Quick customers.
Step 5: Register the MCP integration in Amazon Quick.
Step 6: Operate, monitor, and meter your MCP server.

Use the following steps to either build an MCP server for Amazon Quick or validate an existing server before customers connect it. Steps 1–4 cover server design, implementation, and documentation. Step 5 covers the Amazon Quick integration workflow customers run. Step 6 covers operations.
Step 1: Choose your MCP server deployment model
Decide how you will host your MCP endpoint and isolate tenants. Two common patterns work well:

Shared multi-tenant endpoint: One MCP endpoint serves multiple customers. Your authentication and authorization layer maps each request to a tenant and user, and enforces tenant isolation on every tool call.
Dedicated per-tenant endpoint: Each customer gets a unique MCP endpoint or server instance. You provision and operate a stable URL and credentials for each tenant.

Choose the model that matches your SaaS architecture and support model. If you already run a multi-tenant API tier with tenant-aware authorization, a shared MCP endpoint fits. If you need stronger isolation boundaries or separate compliance controls, dedicated endpoints reduce the blast radius of any single-tenant issue.
Step 2: Implement a remote MCP server compatible with Amazon Quick
Your MCP server must conform to the MCP specification and align with Amazon Quick client constraints. Focus on transport, tool definitions, and operational limits.
Transport and connectivity requirements:

Expose your MCP server over a public endpoint that is reachable from Amazon Quick. Use HTTPS for production.
Support a remote transport. Amazon Quick supports Server-Sent Events (SSE) and streamable HTTP; streamable HTTP is preferred.

Tool and resource requirements:

Define MCP tools using JSON Schema so the Amazon Quick MCP client can discover them and invoke them through listTools and callTool.
Keep tool names consistent and version tool behavior intentionally. Amazon Quick treats the tool list as static after registration; administrators must reestablish the connection for server-side changes to take effect.
If your integration includes data access, expose data sources and resources so that Amazon Quick can use them to create knowledge bases.
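The tool-definition requirement above can be sketched as a plain descriptor. The `get_invoice` tool below is a hypothetical example (the name and fields are illustrative, not an Amazon Quick requirement); its `inputSchema` is standard JSON Schema, which is the structure an MCP client discovers through listTools:

```python
# Minimal sketch of an MCP tool descriptor for a hypothetical "get_invoice" tool.
# The inputSchema is plain JSON Schema; an MCP client discovers this shape
# via the listTools operation and supplies matching arguments via callTool.
GET_INVOICE_TOOL = {
    "name": "get_invoice",
    "description": "Retrieve a single invoice for the authenticated tenant.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "invoice_id": {"type": "string", "description": "Invoice identifier."},
        },
        "required": ["invoice_id"],
        "additionalProperties": False,
    },
}

def validate_args(tool: dict, args: dict) -> list[str]:
    """Return the names of missing required arguments (empty list = valid)."""
    schema = tool["inputSchema"]
    return [k for k in schema.get("required", []) if k not in args]

print(validate_args(GET_INVOICE_TOOL, {"invoice_id": "8831"}))  # []
print(validate_args(GET_INVOICE_TOOL, {}))  # ['invoice_id']
```

In practice an MCP SDK generates and serves these descriptors for you; the point is that the name, description, and schema together are the contract Amazon Quick sees.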

Amazon Quick MCP client limitations:
As of this writing, consider the following constraints when you design your server.

Each MCP operation has a fixed 300-second timeout. Operations that exceed this limit fail with HTTP 424.
Connector creation can fail if the Amazon Quick callback URI is not allow-listed by your identity provider or authorization server. See Step 3 for callback URI details.
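Because every operation must complete inside the 300-second window, it can help to enforce a slightly smaller budget server-side so a long-running tool fails with a clean error payload instead of a client-side HTTP 424. A minimal asyncio sketch (the 280-second budget and the `run_tool` helper are illustrative assumptions, not part of the Amazon Quick contract):

```python
import asyncio

TOOL_BUDGET_SECONDS = 280  # stay safely under the 300-second client timeout

async def run_tool(coro):
    """Run a tool coroutine, converting overruns into a clean error payload."""
    try:
        result = await asyncio.wait_for(coro, timeout=TOOL_BUDGET_SECONDS)
        return {"ok": True, "result": result}
    except asyncio.TimeoutError:
        return {"ok": False, "error": "Tool exceeded the execution budget."}

async def fast_tool():
    # Stand-in for real tool work that finishes well within the budget.
    await asyncio.sleep(0.01)
    return 42

print(asyncio.run(run_tool(fast_tool())))  # {'ok': True, 'result': 42}
```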

If your application doesn’t already have an MCP server, you can:

Build and host your own MCP server using an MCP SDK that supports streamable HTTP or SSE. For MCP developer guidance, refer to the Model Context Protocol documentation. For code samples to host it in AWS, see the deployment guidance GitHub repository.
Run your MCP server on Amazon Bedrock AgentCore Runtime, which supports hosting MCP servers in a managed way. For details about hosting agents or tools, see Host agent or tools with Amazon Bedrock AgentCore Runtime.
Front existing REST APIs or AWS Lambda functions with Amazon Bedrock AgentCore Gateway, which can convert APIs and services into MCP-compatible tools and expose them through gateway endpoints. For an overview, see Introducing Amazon Bedrock AgentCore Gateway.

For an end-to-end Amazon Quick example that uses AgentCore Gateway as the MCP server endpoint, refer to Connect Amazon Quick to enterprise apps and agents with MCP. Similarly, refer to Build your Custom MCP Server on AgentCore Runtime for a code sample.
Step 3: Implement authentication and authorization
Amazon Quick MCP integrations support multiple authentication patterns. Choose the pattern that matches how your customers want Amazon Quick to access your product, then enforce authorization on every tool invocation.
User authentication:

Use OAuth 2.0 authorization code flow when Amazon Quick needs to act on behalf of individual users.
Support OAuth Dynamic Client Registration (DCR) if you want Amazon Quick to register the client automatically. If you do not support DCR, document the client ID, client secret, token URL, authorization URL, and redirect URL that customers must enter during integration setup.
Issue access tokens scoped to tenant and user, and enforce user-level role-based access control (RBAC) for every tool call.
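One way to enforce tenant- and user-scoped authorization on every call is to resolve the validated token’s claims once per request and check them before dispatching the tool. The claim names, roles, and permission map below are illustrative assumptions, not an Amazon Quick or OAuth requirement:

```python
# Hypothetical role-to-tool permission map for an MCP server.
ROLE_PERMISSIONS = {
    "viewer": {"query_ticket", "list_open_tickets"},
    "agent": {"query_ticket", "list_open_tickets", "create_ticket"},
}

def authorize(claims: dict, tool_name: str) -> bool:
    """Allow a tool call only if the (already validated) token carries a
    tenant and a role that grants the requested tool."""
    if not claims.get("tenant_id"):
        return False  # every call must be tenant-scoped
    allowed = ROLE_PERMISSIONS.get(claims.get("role", ""), set())
    return tool_name in allowed

claims = {"tenant_id": "acme", "sub": "user-1", "role": "viewer"}
print(authorize(claims, "query_ticket"))   # True
print(authorize(claims, "create_ticket"))  # False
```

The same check works for service authentication; the difference is that the claims describe a machine client rather than an end user.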

Service authentication (service-to-service):

Use service-to-service authentication when Amazon Quick should call your MCP server as a machine client (for example, shared service accounts or backend automation).
Validate client-credential tokens on every request and enforce tenant-scoped access.

No authentication:

Use no authentication only for public or demo MCP servers. For example, the AWS Knowledge MCP Server does not require authentication (but it is subject to rate limits).

If you front your tools with Amazon Bedrock AgentCore Gateway, Gateway validates inbound requests using OAuth-based authorization aligned with the MCP authorization specification. Gateway functions as an OAuth resource server and can work with identity providers such as Amazon Cognito, Okta, or Auth0. Gateway also supports outbound authentication to downstream APIs and secure credential storage. In this pattern, Amazon Quick authenticates to the Gateway using the authentication method you configure (for example, service-to-service OAuth), and Gateway authenticates to your downstream APIs.
Allowlist requirements for OAuth redirects (required for some IdPs)
Some identity providers block OAuth redirects unless the redirect URI is explicitly allowlisted in the OAuth client configuration. If your OAuth setup fails during integration creation, confirm that your OAuth client app allowlists the Amazon Quick redirect URI for each AWS Region where your customers use Amazon Quick:

https://us-east-1.quicksight.aws.amazon.com/sn/oauthcallback
https://us-west-2.quicksight.aws.amazon.com/sn/oauthcallback
https://ap-southeast-2.quicksight.aws.amazon.com/sn/oauthcallback
https://eu-west-1.quicksight.aws.amazon.com/sn/oauthcallback
https://us-east-1-onebox.quicksight.aws.amazon.com/sn/oauthcallback
https://us-west-2-onebox.quicksight.aws.amazon.com/sn/oauthcallback
https://ap-southeast-2-onebox.quicksight.aws.amazon.com/sn/oauthcallback
https://eu-west-1-onebox.quicksight.aws.amazon.com/sn/oauthcallback

Step 4: Document configuration for Amazon Quick customers
Before connecting to Amazon Quick, verify your server’s baseline compatibility using the MCP Inspector. This standard developer tool acts as a generic MCP client, so you can test connectivity, browse your tool catalog, and simulate tool execution in a controlled sandbox. If your server works with the Inspector, it is protocol-compliant and ready for Amazon Quick integration.
Your integration succeeds when you can authenticate into your MCP server, test your actions using the Test APIs section, and invoke those tools through chat agents and automations.
Add an Amazon Quick integration section to your product documentation that covers:

MCP server endpoint: The exact URL customers enter in the Amazon Quick MCP server endpoint field.
Authentication method: Which Amazon Quick option to choose (user authentication, service authentication, or no authentication), plus the fields and values required.
OAuth details (if used): Required scopes, roles, and any prerequisites such as allowlisting the Amazon Quick callback URI.
Network and security notes: Any allow-list requirements, data residency constraints, or compliance implications.
Tool catalog: The tools you expose, what each tool does, required permissions, and error behavior.

Step 5: Register the MCP integration in Amazon Quick
After your server is ready, your customer can create an MCP integration in the Amazon Quick console. This procedure is based on Set up MCP integration in the Amazon Quick User Guide.

Sign in to the Amazon Quick console with a user that has Author permissions or higher.
Choose Integrations.
Choose Add (+), and then choose Model Context Protocol (MCP).
On the Create integration page, enter a Name, an optional Description, and your MCP server endpoint URL. Choose Next.
Select the authentication method your server supports (user authentication or service authentication), and then enter the required configuration values. If your MCP server supports DCR, you can skip the authentication step; the client credentials exchange happens during the sign-in step.
Choose Create and continue. Review the discovered tools and data capabilities from your MCP server, and then choose Next.
If you want other users to use the integration, share it. When you are finished, choose Done.

Amazon Quick does not poll for schema changes. If you modify tool signatures or add new capabilities, you must advise your customers to re-authenticate or refresh their integration settings to enable these updates.
Step 6: Operate, monitor, and meter your MCP server
Treat your MCP server as production API surface area. Add the operational controls you already use for your SaaS APIs, and make them tenant-aware.

Logging and observability: Log each tool invocation with tenant identifier, user identifier (when available), tool name, latency, status, and error details.
Throttling and quotas: Enforce per-tenant rate limits to protect downstream systems and return clear throttling errors.
Versioning: Coordinate tool changes with your documentation and your customers’ refresh workflow. Treat tool names and schemas as a contract.
Security operations: Support credential rotation, token revocation, and audit trails for administrative actions.
Metering (optional): Record usage per tenant (for example, tool calls or data volume) to align with your SaaS pricing or AWS Marketplace metering.

Clean up
If you created an Amazon Quick MCP integration for testing, delete it when you no longer need it.
To delete an integration, follow Integration workflows in the Amazon Quick User Guide. The high-level steps are:

In the Amazon Quick console, choose Integrations.
From the integrations table, select the integration you want to remove.
From the Actions menu (three-dot menu), choose Delete integration.
In the confirmation dialog, review the integration details and any dependent resources that will be affected.
Choose Delete to confirm removal.

If you used OAuth for the integration, also revoke the Amazon Quick client in your authorization server and delete any test credentials you created.
Conclusion
Amazon Quick MCP integrations give your customers a standard way to connect AI agents and automations to your product. When you expose your capabilities as MCP tools on a remote MCP server, customers can configure the connection in the Amazon Quick console and use your tools across multiple workflows.
Start with a small set of high-value tools, design each tool call to complete within the 300-second limit, and document the exact endpoint and authentication settings customers must use. After you validate the integration workflow in Amazon Quick, expand your tool catalog and add the operational controls you use for any production API.
For next steps, review the Amazon Quick MCP documentation, then use the checklist in this post to validate your server. If you want AWS options to build and host MCP servers, refer to the AgentCore documentation and Deploying model context protocol servers on AWS.

About the authors

Ebbey Thomas
Ebbey Thomas is a Senior Worldwide Generative AI Specialist Solutions Architect at AWS. He designs and implements generative AI solutions that address specific customer business problems. He is recognized for simplifying complexity and delivering measurable business outcomes for clients. Ebbey holds a BS in Computer Engineering and an MS in Information Systems from Syracuse University.

Vishnu Elangovan
Vishnu Elangovan is a Worldwide Agentic AI Solution Architect with over 9+ years of experience in Applied AI/ML and Deep Learning. He loves building and tinkering with scalable AI/ML solutions and considers himself a lifelong learner. Vishnu is a trusted thought leader in the AI/ML community, regularly speaking at leading AI conferences and sharing his expertise on Agentic AI at top-tier events.

Sonali Sahu
Sonali Sahu is leading the Generative AI Specialist Solutions Architecture team at AWS. She is an author, thought leader, and passionate technologist. Her core area of focus is AI and ML, and she frequently speaks at AI and ML conferences and meetups around the world. She has both breadth and depth of experience in technology and the technology industry, with industry expertise in healthcare, the financial sector, and insurance.

Google AI Releases Gemini 3.1 Pro with 1 Million Token Context and 77.1 Percent ARC-AGI-2 Reasoning for AI Agents

Google has officially shifted the Gemini era into high gear with the release of Gemini 3.1 Pro, the first version update in the Gemini 3 series. This release is not just a minor patch; it is a targeted strike at the ‘agentic’ AI market, focusing on reasoning stability, software engineering, and tool-use reliability.

For devs, this update signals a transition. We are moving from models that simply ‘chat’ to models that ‘work.’ Gemini 3.1 Pro is designed to be the core engine for autonomous agents that can navigate file systems, execute code, and reason through scientific problems with a success rate that now rivals—and in some cases exceeds—the industry’s most elite frontier models.

Massive Context, Precise Output

One of the most immediate technical upgrades is the handling of scale. Gemini 3.1 Pro Preview maintains a massive 1M token input context window. To put this in perspective for software engineers: you can now feed the model an entire medium-sized code repository, and it will have enough ‘memory’ to understand the cross-file dependencies without losing the plot.

However, the real news is the 65k token output limit. This 65k window is a significant jump for developers building long-form generators. Whether you are generating a 100-page technical manual or a complex, multi-module Python application, the model can now finish the job in a single turn without hitting an abrupt ‘max token’ wall.

Doubling Down on Reasoning

If Gemini 3.0 was about introducing ‘Deep Thinking,’ Gemini 3.1 is about making that thinking efficient. The performance jumps on rigorous benchmarks are notable:

| Benchmark | Score | What it measures |
| --- | --- | --- |
| ARC-AGI-2 | 77.1% | Ability to solve entirely new logic patterns |
| GPQA Diamond | 94.1% | Graduate-level scientific reasoning |
| SciCode | 58.9% | Python programming for scientific computing |
| Terminal-Bench Hard | 53.8% | Agentic coding and terminal use |
| Humanity’s Last Exam (HLE) | 44.7% | Reasoning against near-human limits |

The 77.1% on ARC-AGI-2 is the headline figure here. The Google team claims this represents more than double the reasoning performance of the original Gemini 3 Pro. This means the model is much less likely to rely on pattern matching from its training data and is more capable of ‘figuring it out’ when faced with a novel edge case in a dataset.

https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-pro/

The Agentic Toolkit: Custom Tools and ‘Antigravity’

The Google team is making a clear play for the developer’s terminal. Along with the main model, they launched a specialized endpoint: gemini-3.1-pro-preview-customtools.

This endpoint is optimized for developers who mix bash commands with custom functions. In previous versions, models often struggled to prioritize which tool to use, sometimes hallucinating a search when a local file read would have sufficed. The customtools variant is specifically tuned to prioritize tools like view_file or search_code, making it a more reliable backbone for autonomous coding agents.

This release also integrates deeply with Google Antigravity, the company’s new agentic development platform. Developers can now utilize a new ‘medium’ thinking level. This allows you to toggle the ‘reasoning budget’—using high-depth thinking for complex debugging while dropping to medium or low for standard API calls to save on latency and cost.

API Breaking Changes and New File Methods

For those already building on the Gemini API, there is a small but critical breaking change. In the Interactions API v1beta, the field total_reasoning_tokens has been renamed to total_thought_tokens. This change aligns with the ‘thought signatures’ introduced in the Gemini 3 family—encrypted representations of the model’s internal reasoning that must be passed back to the model to maintain context in multi-turn agentic workflows.
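If your code reads usage metadata from responses, a small compatibility shim lets the same code run before and after the rename. The field names come from the article; the response is shown as a plain dict purely for illustration:

```python
def thought_tokens(usage: dict) -> int:
    """Read the renamed v1beta field, falling back to the old name
    for responses from pre-3.1 models."""
    return usage.get("total_thought_tokens",
                     usage.get("total_reasoning_tokens", 0))

print(thought_tokens({"total_thought_tokens": 128}))   # 128
print(thought_tokens({"total_reasoning_tokens": 64}))  # 64
```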

The model’s appetite for data has also grown. Key updates to file handling include:

100MB File Limit: The previous 20MB cap for API uploads has been quintupled to 100MB.

Direct YouTube Support: You can now pass a YouTube URL directly as a media source. The model ‘watches’ the video via the URL rather than requiring a manual upload.

Cloud Integration: Support for Cloud Storage buckets and private database pre-signed URLs as direct data sources.

The Economics of Intelligence

Pricing for Gemini 3.1 Pro Preview remains aggressive. For prompts under 200k tokens, input costs are $2 per 1 million tokens, and output is $12 per 1 million. For contexts exceeding 200k, the price scales to $4 input and $18 output.
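Using the quoted rates, a quick per-request cost estimate looks like this (prices in USD per million tokens, with the tier switching at 200k input tokens as described above):

```python
def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate USD cost for one Gemini 3.1 Pro Preview request, using the
    tiered prices quoted above: $2/$12 per 1M tokens under 200k context,
    $4/$18 per 1M tokens over 200k."""
    if input_tokens <= 200_000:
        in_rate, out_rate = 2.0, 12.0
    else:
        in_rate, out_rate = 4.0, 18.0
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

print(round(estimate_cost(100_000, 10_000), 4))  # 0.32
```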

When compared to competitors like Claude Opus 4.6 or GPT-5.2, the Google team is positioning Gemini 3.1 Pro as the ‘efficiency leader.’ According to data from Artificial Analysis, Gemini 3.1 Pro now holds the top spot on their Intelligence Index while costing roughly half as much to run as its nearest frontier peers.

Key Takeaways

Massive 1M/65K Context Window: The model maintains a 1M token input window for large-scale data and repositories, while significantly upgrading the output limit to 65k tokens for long-form code and document generation.

A Leap in Logic and Reasoning: Performance on the ARC-AGI-2 benchmark reached 77.1%, representing more than double the reasoning capability of previous versions. It also achieved a 94.1% on GPQA Diamond for graduate-level science tasks.

Dedicated Agentic Endpoints: The Google team introduced a specialized gemini-3.1-pro-preview-customtools endpoint. It is specifically optimized to prioritize bash commands and system tools (like view_file and search_code) for more reliable autonomous agents.

API Breaking Change: Developers must update their codebases as the field total_reasoning_tokens has been renamed to total_thought_tokens in the v1beta Interactions API to better align with the model’s internal “thought” processing.

Enhanced File and Media Handling: The API file size limit has increased from 20MB to 100MB. Additionally, developers can now pass YouTube URLs directly into the prompt, allowing the model to analyze video content without needing to download or re-upload files.

Check out the Technical details and Try it here. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.
The post Google AI Releases Gemini 3.1 Pro with 1 Million Token Context and 77.1 Percent ARC-AGI-2 Reasoning for AI Agents appeared first on MarkTechPost.

A Coding Implementation to Build Bulletproof Agentic Workflows with Py …

In this tutorial, we build a production-ready agentic workflow that prioritizes reliability over best-effort generation by enforcing strict, typed outputs at every step. We use PydanticAI to define clear response schemas, wire in tools via dependency injection, and ensure the agent can safely interact with external systems, such as a database, without breaking execution. By running everything in a notebook-friendly, async-first setup, we demonstrate how to move beyond fragile chatbot patterns toward robust agentic systems suitable for real enterprise workflows.

!pip -q install "pydantic-ai-slim[openai]" pydantic

import os, json, sqlite3
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Literal, Optional, List

from pydantic import BaseModel, Field, field_validator
from pydantic_ai import Agent, RunContext, ModelRetry

# Load the OpenAI API key: Colab secrets first, then interactive prompt.
if not os.environ.get("OPENAI_API_KEY"):
    try:
        from google.colab import userdata
        os.environ["OPENAI_API_KEY"] = (userdata.get("OPENAI_API_KEY") or "").strip()
    except Exception:
        pass

if not os.environ.get("OPENAI_API_KEY"):
    import getpass
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Paste your OPENAI_API_KEY: ").strip()

assert os.environ.get("OPENAI_API_KEY"), "OPENAI_API_KEY is required."

We set up the execution environment and ensure all required libraries are available for the agent to run correctly. We securely load the OpenAI API key in a Colab-friendly way so the tutorial works without manual configuration changes. We also import all core dependencies that will be shared across schemas, tools, and agent logic.

Copy CodeCopiedUse a different BrowserPriority = Literal[“low”, “medium”, “high”, “critical”]
ActionType = Literal[“create_ticket”, “update_ticket”, “query_ticket”, “list_open_tickets”, “no_action”]
Confidence = Literal[“low”, “medium”, “high”]

class TicketDraft(BaseModel):
title: str = Field(…, min_length=8, max_length=120)
customer: str = Field(…, min_length=2, max_length=60)
priority: Priority
category: Literal[“billing”, “bug”, “feature_request”, “security”, “account”, “other”]
description: str = Field(…, min_length=20, max_length=1000)
expected_outcome: str = Field(…, min_length=10, max_length=250)

class AgentDecision(BaseModel):
action: ActionType
reason: str = Field(…, min_length=20, max_length=400)
confidence: Confidence
ticket: Optional[TicketDraft] = None
ticket_id: Optional[int] = None
follow_up_questions: List[str] = Field(default_factory=list, max_length=5)

@field_validator(“follow_up_questions”)
@classmethod
def short_questions(cls, v):
for q in v:
if len(q) > 140:
raise ValueError(“Each follow-up question must be <= 140 characters.”)
return v

We define the strict data models that act as the contract between the agent and the rest of the system. We use typed fields and validation rules to guarantee that every agent response follows a predictable structure. By enforcing these schemas, we prevent malformed outputs from silently propagating through the workflow.

@dataclass
class SupportDeps:
    db: sqlite3.Connection
    tenant: str
    policy: dict

def utc_now_iso() -> str:
    return datetime.now(timezone.utc).isoformat()

def init_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:", check_same_thread=False)
    conn.execute("""
        CREATE TABLE tickets (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            tenant TEXT NOT NULL,
            title TEXT NOT NULL,
            customer TEXT NOT NULL,
            priority TEXT NOT NULL,
            category TEXT NOT NULL,
            description TEXT NOT NULL,
            expected_outcome TEXT NOT NULL,
            status TEXT NOT NULL,
            created_at TEXT NOT NULL,
            updated_at TEXT NOT NULL
        );
    """)
    conn.commit()
    return conn

def seed_ticket(db: sqlite3.Connection, tenant: str, ticket: TicketDraft, status: str = "open") -> int:
    now = utc_now_iso()
    cur = db.execute(
        """
        INSERT INTO tickets
        (tenant, title, customer, priority, category, description, expected_outcome, status, created_at, updated_at)
        VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
        """,
        (
            tenant,
            ticket.title,
            ticket.customer,
            ticket.priority,
            ticket.category,
            ticket.description,
            ticket.expected_outcome,
            status,
            now,
            now,
        ),
    )
    db.commit()
    return int(cur.lastrowid)

We construct the dependency layer and initialize a lightweight SQLite database for persistence. We model real-world runtime dependencies, such as database connections and tenant policies, and make them injectable into the agent. We also define helper functions that safely insert and manage ticket data during execution.

def build_agent(model_name: str) -> Agent[SupportDeps, AgentDecision]:
    agent = Agent(
        f"openai:{model_name}",
        output_type=AgentDecision,
        output_retries=2,
        instructions=(
            "You are a production support triage agent.\n"
            "Return an output that matches the AgentDecision schema.\n"
            "Use tools when you need DB state.\n"
            "Never invent ticket IDs.\n"
            "If the user intent is unclear, ask concise follow-up questions.\n"
        ),
    )

    @agent.tool
    def create_ticket(ctx: RunContext[SupportDeps], ticket: TicketDraft) -> int:
        deps = ctx.deps
        if ticket.priority in ("critical", "high") and deps.policy.get("require_security_phrase_for_critical", False):
            if ticket.category == "security" and "incident" not in ticket.description.lower():
                raise ModelRetry("For security high/critical, include the word 'incident' in description and retry.")
        return seed_ticket(deps.db, deps.tenant, ticket, status="open")

    @agent.tool
    def update_ticket_status(
        ctx: RunContext[SupportDeps],
        ticket_id: int,
        status: Literal["open", "in_progress", "resolved", "closed"],
    ) -> dict:
        deps = ctx.deps
        now = utc_now_iso()
        cur = deps.db.execute("SELECT id FROM tickets WHERE tenant=? AND id=?", (deps.tenant, ticket_id))
        if not cur.fetchone():
            raise ModelRetry(f"Ticket {ticket_id} not found for this tenant. Ask for the correct ticket_id.")
        deps.db.execute(
            "UPDATE tickets SET status=?, updated_at=? WHERE tenant=? AND id=?",
            (status, now, deps.tenant, ticket_id),
        )
        deps.db.commit()
        return {"ticket_id": ticket_id, "status": status, "updated_at": now}

    @agent.tool
    def query_ticket(ctx: RunContext[SupportDeps], ticket_id: int) -> dict:
        deps = ctx.deps
        cur = deps.db.execute(
            """
            SELECT id, title, customer, priority, category, status, created_at, updated_at
            FROM tickets WHERE tenant=? AND id=?
            """,
            (deps.tenant, ticket_id),
        )
        row = cur.fetchone()
        if not row:
            raise ModelRetry(f"Ticket {ticket_id} not found. Ask the user for a valid ticket_id.")
        keys = ["id", "title", "customer", "priority", "category", "status", "created_at", "updated_at"]
        return dict(zip(keys, row))

    @agent.tool
    def list_open_tickets(ctx: RunContext[SupportDeps], limit: int = 5) -> list:
        deps = ctx.deps
        limit = max(1, min(int(limit), 20))
        cur = deps.db.execute(
            """
            SELECT id, title, priority, category, status, updated_at
            FROM tickets
            WHERE tenant=? AND status IN ('open','in_progress')
            ORDER BY updated_at DESC
            LIMIT ?
            """,
            (deps.tenant, limit),
        )
        rows = cur.fetchall()
        return [
            {"id": r[0], "title": r[1], "priority": r[2], "category": r[3], "status": r[4], "updated_at": r[5]}
            for r in rows
        ]

    @agent.output_validator
    def validate_decision(ctx: RunContext[SupportDeps], out: AgentDecision) -> AgentDecision:
        deps = ctx.deps
        if out.action == "create_ticket" and out.ticket is None:
            raise ModelRetry("You chose create_ticket but did not provide ticket. Provide ticket fields and retry.")
        if out.action in ("update_ticket", "query_ticket") and out.ticket_id is None:
            raise ModelRetry("You chose update/query but did not provide ticket_id. Ask for ticket_id and retry.")
        if out.ticket and out.ticket.priority == "critical" and not deps.policy.get("allow_critical", True):
            raise ModelRetry("This tenant does not allow 'critical'. Downgrade to 'high' and retry.")
        return out

    return agent

This block contains the core agent logic: it assembles a model-agnostic PydanticAI agent. We register typed tools for creating, querying, updating, and listing tickets, allowing the agent to interact with external state in a controlled way. We also enforce output validation so the agent can self-correct whenever its decisions violate business rules.

db = init_db()
deps = SupportDeps(
    db=db,
    tenant="acme_corp",
    policy={"allow_critical": True, "require_security_phrase_for_critical": True},
)

seed_ticket(
    db,
    deps.tenant,
    TicketDraft(
        title="Double-charged on invoice 8831",
        customer="Riya",
        priority="high",
        category="billing",
        description="Customer reports they were billed twice for invoice 8831 and wants a refund and confirmation email.",
        expected_outcome="Issue a refund and confirm resolution to customer.",
    ),
)
seed_ticket(
    db,
    deps.tenant,
    TicketDraft(
        title="App crashes on login after update",
        customer="Sam",
        priority="high",
        category="bug",
        description="After latest update, the app crashes immediately on login. Reproducible on two devices; needs investigation.",
        expected_outcome="Provide a fix or workaround and restore successful logins.",
    ),
)

agent = build_agent("gpt-4o-mini")

async def run_case(prompt: str):
    res = await agent.run(prompt, deps=deps)
    out = res.output
    print(json.dumps(out.model_dump(), indent=2))
    return out

case_a = await run_case(
    "We suspect account takeover: multiple password reset emails and unauthorized logins. "
    "Customer=Leila. Priority=critical. Open a security ticket."
)

case_b = await run_case("List our open tickets and summarize what to tackle first.")

case_c = await run_case("What is the status of ticket 1? If it's open, move it to in_progress.")

agent_alt = build_agent("gpt-4o")
alt_res = await agent_alt.run(
    "Create a feature request ticket: customer=Noah wants 'export to CSV' in analytics dashboard; priority=medium.",
    deps=deps,
)

print(json.dumps(alt_res.output.model_dump(), indent=2))

We wire everything together by seeding initial data and running the agent asynchronously, in a notebook-safe manner. We execute multiple real-world scenarios to show how the agent reasons, calls tools, and returns schema-valid outputs. We also demonstrate how easily we can swap the underlying model while keeping the same workflows and guarantees intact.

In conclusion, we showed how a type-safe agent can reason, call tools, validate its own outputs, and recover from errors without manual intervention. We kept the logic model-agnostic, allowing us to swap underlying LLMs while preserving the same schemas and tools, which is critical for long-term maintainability. Overall, we demonstrated how combining strict schema enforcement, dependency injection, and async execution closes the reliability gap in agentic AI and provides a solid foundation for building dependable production systems.

Check out the Full Codes Here. Also, feel free to follow us on Twitter and don't forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter. You can also join us on Telegram.
The post A Coding Implementation to Build Bulletproof Agentic Workflows with PydanticAI Using Strict Schemas, Tool Injection, and Model-Agnostic Execution appeared first on MarkTechPost.

Zyphra Releases ZUNA: A 380M-Parameter BCI Foundation Model for EEG Da …

Brain-computer interfaces (BCIs) are finally having their ‘foundation model’ moment. Zyphra, a research lab focused on large-scale models, recently released ZUNA, a 380M-parameter foundation model specifically for EEG signals. ZUNA is a masked diffusion auto-encoder designed to perform channel infilling and super-resolution for any electrode layout. This release includes weights under an Apache-2.0 license and an MNE-compatible inference stack.

The Problem with ‘Brittle’ EEG Models

For decades, researchers have struggled with the ‘Wild West’ of EEG data. Different datasets use varying numbers of channels and inconsistent electrode positions. Most deep learning models are trained on fixed channel montages, making them fail when applied to new datasets or recording conditions. Additionally, EEG measurements are often plagued by noise from electrode shifts or subject movement.

ZUNA’s 4D Architecture: Spatial Intelligence

ZUNA solves the generalizability problem by treating brain signals as spatially grounded data. Instead of assuming a fixed grid, ZUNA injects spatiotemporal structure via a 4D rotary positional encoding (4D RoPE).

The model tokenizes multichannel EEG into short temporal windows of 0.125 seconds, or 32 samples. Each token is mapped to a 4D coordinate: its 3D scalp location (x, y, z) and its coarse-time index (t). This allows the model to process arbitrary channel subsets and positions. Because it relies on positional embeddings rather than a fixed schema, ZUNA can ‘imagine’ signal data at any point on the head where a sensor might be missing.
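As an illustration (not Zyphra's actual code), the tokenization scheme described above can be sketched in a few lines of NumPy; the electrode coordinates and helper names here are hypothetical stand-ins:

```python
import numpy as np

def tokenize_eeg(signal, positions, fs=256, window=32):
    """Split (channels, samples) EEG into per-channel windows and tag each
    token with a 4D coordinate: 3D scalp position (x, y, z) + time index t.

    signal:    array of shape (n_channels, n_samples) at fs Hz
    positions: array of shape (n_channels, 3) with electrode (x, y, z)
    """
    n_ch, n_samp = signal.shape
    n_win = n_samp // window                       # 32 samples = 0.125 s at 256 Hz
    tokens = signal[:, : n_win * window].reshape(n_ch, n_win, window)
    # Broadcast (x, y, z) to every window and append the coarse-time index t
    xyz = np.repeat(positions[:, None, :], n_win, axis=1)          # (n_ch, n_win, 3)
    t = np.broadcast_to(np.arange(n_win), (n_ch, n_win))[..., None]
    coords = np.concatenate([xyz, t], axis=-1)                     # (n_ch, n_win, 4)
    return tokens, coords

# 4 hypothetical electrodes, 1 second of 256 Hz data
sig = np.random.randn(4, 256)
pos = np.random.randn(4, 3)
tok, coord = tokenize_eeg(sig, pos)
print(tok.shape, coord.shape)   # (4, 8, 32) (4, 8, 4)
```

Because every token carries its own (x, y, z, t) coordinate rather than a slot in a fixed grid, downstream attention layers can mix tokens from arbitrary montages, which is the property that lets ZUNA accept any electrode layout.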

Image source: https://www.zyphra.com/post/zuna

Diffusion as a Generative Engine

ZUNA uses a diffusion approach because EEG signals are continuous and real-valued. The model pairs a diffusion decoder with an encoder that stores signal information in a latent bottleneck.

During training, Zyphra used a heavy channel-dropout objective. They randomly dropped 90% of channels, replacing them with zeros in the encoder input. The model was then tasked with reconstructing these ‘masked’ signals from the information in the remaining 10% of channels. This forced the model to learn deep cross-channel correlations and a powerful internal representation of brain activity.
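A minimal sketch of that masking objective (illustrative only; the function names and the simple masked-MSE idea are our assumptions, not Zyphra's training code):

```python
import numpy as np

def channel_dropout(batch, drop_frac=0.9, rng=None):
    """Zero out drop_frac of the channels; return the masked encoder input,
    the keep-mask, and the original batch as the reconstruction target.

    batch: (n_channels, n_samples) EEG segment
    """
    rng = rng or np.random.default_rng()
    n_ch = batch.shape[0]
    n_drop = int(round(drop_frac * n_ch))
    dropped = rng.choice(n_ch, size=n_drop, replace=False)
    mask = np.ones(n_ch, dtype=bool)
    mask[dropped] = False                 # False = channel hidden from the encoder
    masked = batch * mask[:, None]        # hidden channels become zeros
    return masked, mask, batch            # target is the full, unmasked signal

x = np.random.randn(20, 1280)             # 20 channels, 5 s at 256 Hz
masked, mask, target = channel_dropout(x, drop_frac=0.9)
# The model sees `masked` (2 surviving channels) and must reconstruct `target`;
# a reconstruction loss would score only the hidden channels, e.g.:
#   loss = ((pred - target)[~mask] ** 2).mean()
print(mask.sum())
```

Scoring only the hidden channels is what forces the model to infer signals it never saw from cross-channel structure, rather than merely copying its input.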

The Massive Data Pipeline: 2 Million Hours

Data quality is the heartbeat of any foundation model. Zyphra aggregated a harmonized corpus spanning 208 public datasets. This massive collection includes:

2 million channel-hours of EEG recordings.

Over 24 million non-overlapping 5-second samples.

A wide range of channel counts from 2 to 256 per recording.

The preprocessing pipeline standardized all signals to a common sampling rate of 256 Hz. They used MNE-Python to apply high-pass filters at 0.5 Hz and an adaptive notch filter to remove line noise. Signals were then z-score normalized to ensure zero-mean and unit-variance while preserving spatial structure.
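The same pipeline can be approximated with SciPy (the post says Zyphra used MNE-Python; the filter orders and Q factor below are illustrative choices, not their exact settings):

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, iirnotch, filtfilt, resample_poly

def preprocess(eeg, fs_in, fs_out=256, line_freq=60.0):
    """High-pass at 0.5 Hz, notch out line noise, resample to 256 Hz,
    then z-score each channel. eeg: (n_channels, n_samples)."""
    # High-pass filter at 0.5 Hz (4th-order Butterworth, zero-phase)
    sos = butter(4, 0.5, btype="highpass", fs=fs_in, output="sos")
    eeg = sosfiltfilt(sos, eeg, axis=-1)
    # Notch filter at the line frequency (MNE's adaptive notch is more elaborate)
    b, a = iirnotch(line_freq, Q=30.0, fs=fs_in)
    eeg = filtfilt(b, a, eeg, axis=-1)
    # Resample to the common 256 Hz rate
    eeg = resample_poly(eeg, fs_out, fs_in, axis=-1)
    # Per-channel z-score: zero mean, unit variance
    mu = eeg.mean(axis=-1, keepdims=True)
    sd = eeg.std(axis=-1, keepdims=True) + 1e-8
    return (eeg - mu) / sd

x = np.random.randn(8, 5120)          # 8 channels, 10 s at 512 Hz
y = preprocess(x, fs_in=512)
print(y.shape)                        # (8, 2560)
```

Normalizing per channel keeps every electrode on a comparable scale while leaving the relative spatial structure across channels intact.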

Benchmarks: Killing the Spherical Spline

For years, the industry standard for filling in missing EEG data has been spherical-spline interpolation. While splines are useful for capturing local smoothness, they have no ‘learned prior’ and fail when gaps between sensors grow too large.

ZUNA consistently outperforms spherical-spline interpolation across multiple benchmarks, including the ANPHY-Sleep dataset and the BCI2000 motor-imagery dataset. The performance gap widens significantly at higher dropout rates. In extreme 90% dropout scenarios—essentially 10x upsampling—ZUNA maintains high reconstruction fidelity while spline methods degrade sharply.


Key Takeaways

Universal Generalization: ZUNA is a 380M-parameter model that works with any EEG system, regardless of the number or position of electrodes. Unlike previous AI models limited to fixed layouts, it generalizes across diverse datasets and novel channel positions.

4D Spatiotemporal Intelligence: The model uses a 4D Rotary Positional Encoding (4D RoPE) system to map brain signals across 3D space (x, y, z) and time (t). This allows it to ‘understand’ the physical geometry of the scalp and accurately predict missing data.

Superior Channel Reconstruction: By training as a masked diffusion autoencoder, ZUNA significantly outperforms traditional spherical-spline interpolation. It excels at ‘super-resolution,’ maintaining high accuracy even when up to 90% of the brain’s signals are missing or corrupted.

Massive Training Scale: The model was trained on a harmonized corpus of 208 datasets, totaling approximately 2 million channel-hours and 24 million unique 5-second samples. This scale allows it to learn deep cross-channel correlations that simpler geometric methods miss.

Check out the Paper, Technical Details, Repo and Model Weights.
The post Zyphra Releases ZUNA: A 380M-Parameter BCI Foundation Model for EEG Data, Advancing Noninvasive Thought-to-Text Development appeared first on MarkTechPost.

Build AI workflows on Amazon EKS with Union.ai and Flyte

As artificial intelligence and machine learning (AI/ML) workflows grow in scale and complexity, it becomes harder for practitioners to organize and deploy their models. AI projects often struggle to move from pilot to production, failing not because the models are bad, but because infrastructure and processes are fragmented and brittle, and the original pilot code base bloats under these additional requirements. This makes it difficult for data scientists and engineers to quickly move from laptop to cluster (local development to production deployment) and reproduce the exact results they saw during the pilot.
In this post, we explain how you can use the Flyte Python SDK to orchestrate and scale AI/ML workflows. We explore how the Union.ai 2.0 system enables deployment of Flyte on Amazon Elastic Kubernetes Service (Amazon EKS), integrating seamlessly with AWS services like Amazon Simple Storage Service (Amazon S3), Amazon Aurora, AWS Identity and Access Management (IAM), and Amazon CloudWatch. We explore the solution through an AI workflow example, using the new Amazon S3 Vectors service.
Common challenges running AI/ML workflows on Kubernetes
AI/ML workflows running on Kubernetes present several orchestration challenges:

Infrastructure complexity – Provisioning the right compute resources (CPUs, GPUs, memory) dynamically across Kubernetes clusters
Experiment-to-production gap – Moving from experimentation to production often requires rebuilding pipelines in different environments
Reproducibility – Tracking data lineage, model versions, and experiment parameters to facilitate reliable results
Cost management – Efficiently utilizing spot instances, automatic scaling, and avoiding over-provisioning
Reliability – Handling failures gracefully with automatic retries, checkpointing, and recovery mechanisms

Purpose-built AI/ML tooling is essential for orchestrating complex workflows, offering specialized capabilities like intelligent caching, automatic versioning, and dynamic resource allocation that streamline development and deployment cycles.
Why Flyte/Union for Amazon EKS
With Flyte on Amazon EKS, Python workflows scale from laptop to cluster with dynamic execution, reproducibility, and compute-aware orchestration. Combined with Union.ai's managed deployment, these workflows run reliably and recover from crashes while fully utilizing Amazon EKS, without the infrastructure overhead. Flyte transforms how you orchestrate AI/ML workloads on Amazon EKS, making workflows simple to build. Key factors include:

Pure Python workflows – Write orchestration logic in Python with 66% less code than traditional orchestrators, alleviating the need to learn domain-specific languages and removing barriers for ML engineers and AI developers migrating existing code
Dynamic execution – Make real-time decisions at runtime with flexible branching, loops, and conditional logic, which is essential for agentic AI systems
Reproducibility by default – Every execution is versioned, cached, and tracked with complete data lineage
Compute-aware orchestration – Dynamically provision the right compute resources for each task, from CPUs for data processing to GPUs for model training
Robustness – Pipelines can quickly recover from failures, isolate errors, and manage checkpoints without manual intervention

Union.ai 2.0 is built on Flyte, the open source, Kubernetes-based workflow orchestration system originally developed at Lyft to power mission-critical ML systems like ETA prediction, pricing, and mapping. After Flyte was open sourced in 2020 and became a Linux Foundation AI & Data project, the core engineering team founded Union.ai to deliver an enterprise-grade service purpose-built for teams running AI/ML workloads on Amazon EKS. Union.ai 2.0 reduces the complexity of managing Kubernetes infrastructure through managed operations, a multi-cloud control plane, and abstracted infrastructure management, while providing ML-based capabilities that help data scientists and engineers focus on building models with enhanced scale, speed, security, and reliability.
Additional benefits of using Union.ai 2.0 include:

Enhanced scalability – Workflows respond at runtime with flexible branching, task fanout, and real-time infrastructure scaling.
Crash-proof reliability – Automatic retries, checkpointing, and failure recovery allow workflows to stay resilient without manual intervention.
Agentic AI runtime – Union.ai is designed for long-lived agentic AI systems, supporting stateful agents and truly durable orchestration.
Compliance – For regulated industries, built-in lineage, auditability, and secure execution (SOC2, RBAC, SSO) are critical. Orchestration on Amazon EKS and Union.ai helps facilitate compliance.
Resource awareness – It offers first-class support for compute provisioning, spot instances, and automatic scaling.

The benefits of Flyte and Union.ai 2.0 elevate modern orchestration to a first-class requirement: dynamic execution, fault tolerance, and resource awareness are now built-in, providing a more developer-friendly experience compared to 1.0.
Amazon EKS provides your compute, storage, and networking backbone. Flyte (the open source project) handles workflow orchestration. Union.ai extends Flyte with infrastructure-aware orchestration, enterprise-grade security, and turnkey scalability, giving you production-ready Flyte without the DIY setup. Both Flyte and Union.ai 2.0 run on Amazon EKS, but serve different needs, as detailed in the following table.

Feature
Open Source Flyte
Union.ai 2.0

Deployment
Self-managed on your EKS cluster
Fully managed or BYOC options

Best for
Teams with Kubernetes expertise
Teams wanting managed operations

Performance
Standard scale
10–100 times greater scale, speed, task fanout, and parallelism

Infrastructure
You manage upgrades, scaling
White-glove managed infrastructure

Enterprise features
No role-based access control
Fine-grained role-based access control, single sign-on, managed secrets, cost dashboards

Support
Community-driven
Enterprise SLA with Union.ai team

Real-time serving
Build your own
Built-in real-time inference and near real-time inference with reusable containers

Enterprises like Woven Toyota, Lockheed Martin, Spotify, and Artera orchestrate millions of dollars of compute annually with Flyte and Union, accelerating experimentation by up to 25 times and cutting iteration cycles by 96%.
Both options (open source Flyte and Union.ai 2.0) integrate with the open source community, facilitating rapid feature rollout and continuous improvement.
Solution overview
Although open source Flyte provides powerful orchestration capabilities, Union.ai 2.0 delivers the same core technology with enterprise-grade management, removing the operational overhead so your team can focus on building AI applications instead of managing infrastructure. This is achieved through a hybrid architecture that combines managed simplicity with complete data control. The Regional control plane handles workflow metadata and coordination, while the Union Operator deploys directly into your EKS clusters—keeping your data, code, and secrets entirely within your AWS perimeter.
The following figure illustrates the operational flow between Union’s control plane and your data plane. The Union-managed control plane (left) orchestrates workflows through Elastic Load Balancing (ELB), storing task data in Amazon S3 and execution metadata in Aurora. Within your Amazon EKS environment (right), the data plane executes workflows that pull customer code from your container registry, access secrets from AWS Secrets Manager, and read/write data to your S3 buckets—with the execution logs flowing to both CloudWatch and the Union control plane for observability.

Union.ai 2.0’s AWS integration architecture is built on six key service components that provide end-to-end workflow management:

Control plane and data plane – The control plane operates within the Union.ai AWS account and serves as the central management interface, providing users with authentication and authorization capabilities, observation and monitoring functions, and system management tools. It also orchestrates execution placement on data plane clusters and handles cluster control and management operations. Union.ai 2.0 maintains one control plane per AWS Region, managing the Regional data planes. Available Regions for data plane deployment include us-west, us-east, eu-west, and eu-central, with ongoing expansion to additional Regions.
Data plane object store – This component stores data comprising files, directories, data frames, models, and Python-pickled types, which are passed as references and read by the control plane.
Container registry – This component contains registry data that include names of workflows, tasks, launch plans, and artifacts; input and output types for workflows and tasks; execution status, start time, end time, and duration of workflows and tasks; version information for workflows, tasks, launch plans, and artifacts; and artifact definitions. With the Union.ai 2.0 architecture, you can retain full ownership of your data and compute resources while it manages the infrastructure operations. The Union.ai 2.0 operator resides in the data plane and handles management tasks with least privilege permissions. It enables cluster lifecycle operations and provides support engineers with system-level log access and change implementation capabilities—without exposing secrets or data. Security is further strengthened through unidirectional communication: the data plane operator initiates the connections to the control plane, not the reverse.
Logging and monitoring – CloudWatch provides centralized logging and monitoring through deep integration with Flyte. The system automatically builds logging links for each execution and displays them in the console, with links pointing directly to the AWS Management Console and the specific log stream for that execution—a feature that significantly accelerates troubleshooting during failures.
Security – Security is handled through IAM roles for service accounts (IRSA), which maps the identity between Kubernetes resources and the AWS services they depend on. These configurations enable more secure, fine-grained access control for backend services, and Union.ai 2.0 adds enterprise role-based access control (RBAC) for user access control on top of these AWS security features.
Storage layer – Amazon S3 serves as the durable storage layer for workflows and data. When you register a workflow with Flyte, your code is compiled into a language-independent representation that captures the workflow definition, input, and output types. This representation is packaged and stored in Amazon S3, where FlytePropeller—Flyte’s execution engine—retrieves it to instruct the respective compute framework (such as Kubernetes or Spark) to run workflows and report status. Raw input data used to train and validate models is also stored in Amazon S3. Union.ai 2.0 now includes a new integration with Amazon S3 Vectors, enabling vector storage for Retrieval Augmented Generation (RAG), semantic search, and agentic AI workflows.

With this robust infrastructure in place, Union.ai 2.0 on Amazon EKS excels at orchestrating a wide range of AI/ML workloads. It handles large-scale model training by orchestrating distributed training pipelines across GPU clusters with automatic resource provisioning and spot instance support. For data processing, it can process petabyte-scale datasets with dynamic parallelism and efficient task fanout, scaling to 100,000 task fanouts with 50,000 concurrent actions in Union.ai 2.0. By using Union.ai 2.0 and Flyte on Amazon EKS, you can build and deploy agentic AI systems—long-running, stateful AI agents that make autonomous decisions at runtime. For production deployments, it supports real-time inference with low-latency model serving, using reusable containers for sub-100 millisecond task startup times. Throughout the entire process, Union.ai 2.0 provides comprehensive MLOps and model lifecycle management, automating everything from experimentation to production deployment with built-in versioning and rollback capabilities.
These capabilities are exemplified in specialized implementations like distributed training on AWS Trainium instances, where Flyte orchestrates large-scale training workloads on Amazon EKS.
Deployment options for Union.ai 2.0 on Amazon EKS
Union.ai 2.0 and Flyte offer three flexible deployment models for Amazon EKS, each balancing managed convenience with operational control. Select the approach that best fits your team’s expertise, compliance requirements, and development velocity:

Union BYOC (fully managed) – The fastest path to production. Union.ai 2.0 manages the infrastructure, upgrades, and scaling while your workloads run in your AWS account. This option is ideal for teams that want to focus entirely on AI development rather than infrastructure operations.
Union Self Managed – You can deploy Union.ai 2.0’s managed control plane while maintaining control of your data and compute resources in your AWS account. This option combines the benefits of managed services with data sovereignty and governance requirements.
Flyte OSS on Amazon EKS – You can deploy and operate open source Flyte directly on your EKS cluster using the AWS Cloud Development Kit (AWS CDK). This option provides maximum control and is ideal for teams with strong Kubernetes expertise who want to customize their deployment.

The Amazon EKS Blueprints for AWS CDK Union add-on helps AWS customers deploy, scale, and optimize AI/ML workloads using Union on Amazon EKS. It provides modular infrastructure as code (IaC) AWS CDK templates and curated deployment blueprints for running scalable AI workloads, including:

Model training and fine-tuning pipelines
Large language model (LLM) inference and serving
Multi-model deployment and management
Agentic AI pipeline orchestration

Union.ai 2.0 and Flyte provide IaC templates for deploying on Amazon EKS:

Terraform modules – Preconfigured modules for deploying Flyte on Amazon EKS with best practices for networking, security, and observability
AWS CDK support – AWS CDK constructs for integrating Union into existing AWS infrastructure
GitOps workflows – Support for Flux and ArgoCD for declarative infrastructure management

The Union add-on is available as of this blog's publication, and the Flyte add-on is coming soon; keep watching the GitHub repo.
These templates automate the provisioning of EKS clusters, node groups (including GPU instances), IAM roles, S3 buckets, Aurora databases, and the required Flyte components.
Prerequisites
To start using this solution, you must have the following prerequisites:

An AWS account with appropriate permissions.
Amazon EKS version on standard support.
Required IAM roles. Using IAM roles for service accounts, Flyte can map identity between the Kubernetes resources and AWS services it depends on. These configurations are for the backend and do not interfere with user-control plane communication.

How Union.ai 2.0 supports Amazon S3 Vectors
As AI applications increasingly rely on vector embeddings for semantic search and RAG, Union.ai 2.0 empowers teams with Amazon S3 Vectors integration, simplifying vector data management at scale. Built into Flyte 2.0, this feature is available today. Amazon S3 Vectors delivers purpose-built, cost-optimized vector storage for semantic search and AI applications. With Amazon S3 level elasticity and durability for storing vector datasets with subsecond query performance, Amazon S3 Vectors is ideal for applications that need to build and grow vector indexes at scale. Union.ai 2.0 provides support for Amazon S3 Vectors for RAG, semantic search, and multi-agent systems. If you’re using Union.ai 2.0 today with Amazon S3 as your object store, you can start using Amazon S3 Vectors immediately with minimal configuration changes.
To set it up, use boto3's dedicated S3 Vectors APIs to store and query vectors. Your existing Amazon S3 IAM roles remain in place; just update their permissions to allow the S3 Vectors actions.

By combining Flyte 2.0’s orchestration with Amazon S3 Vector support, multi-agent trading simulations can scale to hundreds of agents that learn from historical data, share industry insights, and execute coordinated strategies in real time. These architectural advantages support sophisticated AI applications like multi-agent systems that require both semantic memory and real-time coordination.
To learn more, refer to the example use case of a multi-agent trading simulation using Flyte 2.0 with Amazon S3 Vectors. In this example, you will learn to build a trading simulation featuring multiple agents that represent team members in a firm, illustrating their interactions, strategic planning, and collaborative trading activities.
Consider a multi-agent trading simulation where AI agents interact, test strategies, and continuously learn from their experiences. For realistic agent behavior, each agent must retain context from previous interactions, essentially building a memory of semantic artifacts that inform future decisions. The process includes the following steps:

After each simulation round, embed the agent’s learnings into vector representations using embedding models.
Store embeddings in Amazon S3 using Amazon S3 Vectors with appropriate metadata and tags.
During subsequent executions, retrieve relevant memories using semantic search to ground agent decisions in past experience.
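The steps above can be sketched with boto3's S3 Vectors client. The operation and parameter names below follow the service's launch API but should be verified against current boto3 documentation; the bucket, index, and helper names are hypothetical:

```python
def s3vectors_client(region="us-east-1"):
    """Create the S3 Vectors client (requires a recent boto3)."""
    import boto3
    return boto3.client("s3vectors", region_name=region)

BUCKET = "agent-memory-bucket"   # hypothetical vector bucket name
INDEX = "trading-agents"         # hypothetical vector index name

def store_memory(client, agent_id, round_id, embedding, summary):
    """Persist one agent learning as a vector with searchable metadata."""
    client.put_vectors(
        vectorBucketName=BUCKET,
        indexName=INDEX,
        vectors=[{
            "key": f"{agent_id}/round-{round_id}",
            "data": {"float32": list(embedding)},
            "metadata": {"agentId": agent_id, "summary": summary},
        }],
    )

def recall_memories(client, query_embedding, top_k=5):
    """Semantic search over past learnings to ground the next decision."""
    resp = client.query_vectors(
        vectorBucketName=BUCKET,
        indexName=INDEX,
        queryVector={"float32": list(query_embedding)},
        topK=top_k,
        returnMetadata=True,
    )
    return [v["metadata"]["summary"] for v in resp["vectors"]]
```

Injecting the client as a parameter keeps the memory logic testable outside AWS and lets a Flyte task supply a properly scoped client at runtime.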

With Flyte 2.0, your agents already run in an orchestration-aware environment. Amazon S3 becomes your vector store. It’s inexpensive, fast, and fully integrated, alleviating the need for separate vector databases. For the steps and associated code to implement the multi-agent trading simulation, refer to the GitHub repo.
In summary, this architecture helps deliver measurable advantages for production AI systems:

Reduced operational complexity – Consolidate your AI/ML orchestration and vector storage on a single environment, alleviating the need to provision, maintain, and secure separate vector database infrastructure
Significant cost savings – Amazon S3 Vectors delivers significantly lower storage costs compared to purpose-built vector databases, while providing subsecond similarity search performance at scale
Zero-friction AWS integration – Use your existing Amazon S3 infrastructure, IRSA configuration, and virtual private cloud (VPC) networking—no additional authentication layers or network configurations are required
Battle-tested scalability – Build on the 99.999999999% durability and elastic scalability of Amazon S3 to support vector datasets from gigabytes to petabytes without re-architecture

Customer success: Woven by Toyota
Toyota’s autonomous driving arm, Woven by Toyota, faced challenges orchestrating complex AI workloads for their autonomous driving technology, requiring petabyte-scale data processing and GPU-intensive training pipelines. After outgrowing their open source Flyte implementation, they migrated to Union.ai’s managed service on AWS in 2023. The impact was transformative: over 20 times faster ML iteration cycles, millions of dollars in annual cost savings through spot instance optimization, and thousands of parallel workers enabling massive scale.

“Union.ai’s wealth of expertise has enabled us to focus our efforts on key ADAS-related functionalities, move fast, and rely on Union.ai to deliver data at scale,”
– Alborz Alavian, Senior Engineering Manager at Woven by Toyota.

Read the full case study about Woven by Toyota’s migration to Union.ai.
Conclusion
Union.ai and Flyte provide the foundation for reliable, scalable AI on Amazon EKS for your AI/ML workflows, such as building autonomous systems, training LLMs, or orchestrating complex data pipelines. To get started, choose your path:

Enterprise ready – Deploy Union.ai through AWS Marketplace (ISVA Partner)
Resource-Aware AI Orchestration – Trial v2
Open source – Try Flyte at flyte.org
Quick start – Deploy your first AI pipeline with the AI on Amazon EKS Blueprint

About the authors
ND Ngoka is a Senior Solutions Architect at AWS with a specialized focus on AI/ML and storage technologies. He guides customers through complex architectural decisions, enabling them to build resilient, scalable solutions that drive business outcomes.
Samhita Alla is a Senior Solutions Engineer for Partnerships at Union.ai, where she leads the technical execution of strategic integrations across the AI stack, from distributed training and experiment tracking to data platform integrations. She works closely with partners and cross-functional teams to evaluate feasibility, build production-ready solutions, and deliver technical content that drives real-world adoption.
Kristy Cook is Head of Partnerships at Union.ai, where she builds strategic alliances across the AI/ML ecosystem focused on sustained growth. Having forged impactful partnerships at Meta, Yahoo, and Neustar, she brings deep expertise in operationalizing AI solutions at scale.
Jim Fratantoni is a GenAI Account Manager at AWS, focused on helping AI startups scale and co-sell with AWS. He is passionate about working with founders to jointly go to market and drive enterprise customer success.
Theo Rashid is an Applied Scientist at Amazon building probabilistic machine learning and forecasting models. He is an active open source contributor, and is passionate about open source tooling across the machine learning stack, from probabilistic programming libraries to workflow orchestration. He holds a PhD in Epidemiology and Biostatistics from Imperial College London.
Alex Fabisiak is a Senior Applied Scientist at Amazon working on applied forecasting and supply chain problems. He specializes in probabilistic and causal modeling as they relate to optimal policy decisions. He holds a PhD in Finance from UCLA.

Amazon Quick now supports key pair authentication to Snowflake data so …

Modern enterprises face significant challenges connecting business intelligence platforms to cloud data warehouses while maintaining automation. Password-based authentication introduces security vulnerabilities, operational friction, and compliance gaps—especially critical as Snowflake deprecates username and password authentication.
Amazon Quick Sight (a capability of Amazon Quick Suite) now supports key pair authentication for Snowflake integrations, using asymmetric cryptography where RSA key pairs replace traditional passwords. This enhancement addresses a critical need as Snowflake moves toward deprecating password-based authentication, which requires more secure authentication methods. With this new capability, Amazon Quick Suite users can establish secure, passwordless connections to Snowflake data sources using RSA key pairs, providing a seamless and secure integration experience that meets enterprise security standards.
In this blog post, we will guide you through establishing data source connectivity between Amazon Quick Sight and Snowflake through secure key pair authentication.
Prerequisites
Before configuring key pair authentication between Amazon Quick and Snowflake, ensure that you have the following:

An active Amazon Quick account with appropriate permissions – You need administrative access to create and manage data sources, configure authentication settings, and grant permissions to users. An Amazon Quick Enterprise license or an Author role in Amazon Quick Sight Enterprise Edition typically provides sufficient access.
A Snowflake account with ACCOUNTADMIN, SECURITYADMIN, or USERADMIN role – These elevated permissions are essential for modifying user accounts, assigning public keys using ALTER USER commands, and granting warehouse and database permissions. If you don’t have access to these roles, contact your Snowflake administrator for assistance.
OpenSSL installed (for key generation) – This cryptographic toolkit generates RSA key pairs in PKCS#8 format. Most Linux and macOS systems include OpenSSL pre-installed. Windows users can use Windows Subsystem for Linux (WSL) or download OpenSSL separately.
(Optional) AWS Secrets Manager access (for API-based setup) – Required for programmatic configurations, you will need IAM permissions to create and manage secrets, and Amazon Quick Sight API access for automated deployments and infrastructure as code (IaC) implementations.

Solution walkthrough
We will guide you through the following essential steps to establish secure key pair authentication between Amazon Quick Sight and Snowflake:

Generate RSA Key Pair – Create public and private keys using OpenSSL with proper encryption standards
Configure Snowflake User – Assign the public key to your Snowflake user account and verify the setup
Establish Data Source Connectivity – Create your connection through either the Amazon Quick UI for interactive setup or AWS Command Line Interface (AWS CLI) for programmatic deployment

Let’s explore each step in detail and secure your Amazon Quick Sight-Snowflake connection with key pair authentication!
Generate RSA key pair:

Navigate to AWS CloudShell in AWS Management Console and execute the following command to generate the RSA private key. You will be prompted to enter an encryption passphrase. Choose a strong passphrase and store it securely—you will need this later when generating the public key.

openssl genrsa 2048 | openssl pkcs8 -topk8 -inform PEM -out rsa_key.p8

Run the following command to derive the public key from the private key. You will be prompted for the passphrase you entered in the previous step.

openssl rsa -in rsa_key.p8 -pubout -out rsa_key.pub

Extract the private key content (including header and footer):

cat rsa_key.p8

This displays your private key in the format:
-----BEGIN PRIVATE KEY-----[key content]-----END PRIVATE KEY-----
Note: Copy the entire output including the -----BEGIN PRIVATE KEY----- and -----END PRIVATE KEY----- lines. You will use this complete private key (with header and footer) when creating your Snowflake data source connection.

Snowflake requires the public key in a specific format without headers or line breaks. Run these commands to extract and format the key properly.

grep -v KEY rsa_key.pub | tr -d '\n' | awk '{print $1}' > pub.key
cat pub.key

This will display your formatted public key string. Copy this output—you will use it in the next step to configure your Snowflake user account.
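If you prefer to avoid shell quoting pitfalls, the same header-and-newline stripping can be done in a few lines of Python. This is a sketch equivalent to the grep/tr pipeline above; the file name matches the earlier steps:

```python
# Strip the PEM header/footer and line breaks from the public key,
# producing the single-line string Snowflake expects for RSA_PUBLIC_KEY.
def format_public_key(pem_text: str) -> str:
    lines = [
        line.strip()
        for line in pem_text.splitlines()
        if line.strip() and "KEY" not in line  # drop BEGIN/END PUBLIC KEY lines
    ]
    return "".join(lines)

if __name__ == "__main__":
    with open("rsa_key.pub") as f:
        print(format_public_key(f.read()))
```

Running it prints the same single-line key string you would copy into the ALTER USER statement in the next step.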
Assign public key to Snowflake user:

Log in to Snowflake and execute the following SQL commands to assign the public key to your user:

ALTER USER <username> SET RSA_PUBLIC_KEY='<public_key_content>';

Verify the key assignment by running the following command and checking that the RSA_PUBLIC_KEY property is set:

DESCRIBE USER <username>;

Establish your Snowflake connection in Amazon Quick UI:

Navigate to Amazon Quick in AWS Management Console and select Datasets. Then select the Data sources tab and choose Create data source.

In the Create data source pane, enter “snowflake” in Search datasets, select Snowflake, and then choose Next.

In the New Snowflake data source pane, enter the data source name, then select the connection type: Public Network or Private VPC Connection. If you need a VPC connection, see Configure the VPC connection in Amazon Quick.
Then, enter the database server hostname, database name, and warehouse name.
Select Authentication Type as KeyPair and then enter the username of the Snowflake user.
In the Private Key field, paste the complete output from cat rsa_key.p8 (including the BEGIN and END headers). If you have configured a passphrase during key generation, provide it in the optional Passphrase field.
After all the fields are entered, select the Validate connection button.

After the connection is validated, select the Create data source button.
Then in the Data sources list, find the snowflake data source that you created.
From the Action menu, select the Create dataset option.

Establish your Snowflake Connection using the Amazon Quick Sight API:
Using AWS CLI, create the Amazon Quick data source connection to Snowflake by executing the following command:

aws quicksight create-data-source \
--aws-account-id 123456789 \
--data-source-id awsclikeypairtest \
--name "awsclikeypairtest" \
--type SNOWFLAKE \
--data-source-parameters '{
  "SnowflakeParameters": {
    "Host": "hostname.snowflakecomputing.com",
    "Database": "DB_NAME",
    "Warehouse": "WH_NAME",
    "AuthenticationType": "KEYPAIR"
  }
}' \
--credentials '{
  "KeyPairCredentials": {
    "KeyPairUsername": "SNOWFLAKE_USERNAME",
    "PrivateKey": "-----BEGIN ENCRYPTED PRIVATE KEY-----\nPRIVATE_KEY\n-----END ENCRYPTED PRIVATE KEY-----",
    "PrivateKeyPassphrase": "******"
  }
}' \
--permissions '[
  {
    "Principal": "arn:aws:quicksight:us-east-1:123456789:user/default/Admin/username",
    "Actions": [
      "quicksight:DescribeDataSource",
      "quicksight:DescribeDataSourcePermissions",
      "quicksight:PassDataSource",
      "quicksight:UpdateDataSource",
      "quicksight:DeleteDataSource",
      "quicksight:UpdateDataSourcePermissions"
    ]
  }
]' \
--region us-east-1

Use the following command to check the status of creation:

aws quicksight describe-data-source --region us-east-1 --aws-account-id 123456789 --data-source-id awsclikeypairtest

Initially, the status returned from the describe-data-source command will be CREATION_IN_PROGRESS. The status changes to CREATION_SUCCESSFUL when the new data source is ready for use.
Alternatively, when creating the data source programmatically through CreateDataSource, you can store the username, private key, and passphrase in AWS Secrets Manager and reference them using the secret ARN.
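As a sketch of that Secrets Manager variant, the following helper builds the CreateDataSource request with a SecretArn in place of inline credentials. The account ID, ARN, and connection values are placeholders, and you should confirm the exact secret schema your account requires in the Amazon Quick Sight API reference:

```python
# Build CreateDataSource arguments that reference a Secrets Manager secret
# instead of embedding the private key inline. Values are placeholders.
def build_snowflake_data_source(account_id: str, secret_arn: str) -> dict:
    return {
        "AwsAccountId": account_id,
        "DataSourceId": "awsclikeypairtest",
        "Name": "awsclikeypairtest",
        "Type": "SNOWFLAKE",
        "DataSourceParameters": {
            "SnowflakeParameters": {
                "Host": "hostname.snowflakecomputing.com",
                "Database": "DB_NAME",
                "Warehouse": "WH_NAME",
                "AuthenticationType": "KEYPAIR",
            }
        },
        # The secret holds the username, private key, and passphrase.
        "Credentials": {"SecretArn": secret_arn},
    }
```

You would then pass the dictionary to boto3, for example `boto3.client("quicksight").create_data_source(**params)`, keeping the key material out of your shell history and IaC templates.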
After the data source is successfully created, you can navigate to the Quick console. In the Create a Dataset page, you can view the newly created data source connection awsclikeypairtest under the data sources list. You can then continue to create the datasets.
Cleanup
To clean up your resources to avoid incurring additional charges, follow these steps:

Delete the secret created in the AWS Secrets Manager Console.
Delete the data source connection created in Amazon Quick.

Conclusion
Key pair authentication represents a transformative advancement in securing data connectivity between Amazon Quick and Snowflake. By removing password-based vulnerabilities and embracing cryptographic authentication, organizations can achieve superior security posture while maintaining seamless automated workflows. This implementation addresses critical enterprise requirements, such as enhanced security through asymmetric encryption, streamlined service account management, and compliance with evolving authentication standards as Snowflake transitions away from traditional password methods.
Whether deploying through the intuitive Amazon Quick UI or using AWS CLI for Infrastructure as Code implementations, key pair authentication provides flexibility without compromising security. The integration with AWS Secrets Manager helps protect the private keys, while the straightforward setup process enables rapid deployment across development, staging, and production environments.
As data security continues to evolve, adopting key pair authentication positions your organization at the forefront of best practices. Business intelligence teams can now focus on extracting actionable insights from Snowflake data rather than managing authentication complexities, ultimately accelerating time-to-insight and improving operational efficiency.
For further reading, see Snowflake Key-Pair Authentication.

About the authors

Vignessh Baskaran
Vignessh Baskaran is a Sr. Technical Product Manager in the structured DATA domain in Amazon Quick powering BI and GenAI initiatives. He has 9+ years of experience in developing large-scale data and analytics solutions. Prior to this role, he worked as a Sr. Analytics Lead in AWS building comprehensive BI solutions using Quick which were globally adopted across AWS Worldwide Specialist Sales teams. Outside of work, he enjoys watching Cricket, playing Racquetball and exploring different cuisines in Seattle.

Chinnakanu Sai Janakiram
Chinnakanu Sai Janakiram is a Software Development Engineer in Amazon Quick, working on cloud infrastructure automation and feature development using AWS technologies. He has 2+ years of experience building scalable systems across AWS, CI/CD pipelines, CloudFormation, React, and Spring Boot. Prior to this role, he contributed to data and analytics solutions on AWS, improving deployment reliability and scalability across regions. Outside of work, he enjoys following Formula 1 and staying up to date with emerging technologies.

Nithyashree Alwarsamy
Nithyashree Alwarsamy is a Partner Solutions Architect at Amazon Web Services, specializing in data and analytics solutions with a focus on streaming and event-driven architecture. Leveraging deep expertise in modern data architectures, Nithyashree helps organizations unlock the full potential of their data by integrating Snowflake’s cloud-native data platform with the breadth of AWS services.

Andries Engelbrecht
Andries Engelbrecht is a Principal Partner Solutions Engineer at Snowflake working with AWS. He supports product and service integrations, as well as the development of joint solutions with AWS. Andries has over 25 years of experience in the field of data and analytics.

[Tutorial] Building a Visual Document Retrieval Pipeline with ColPali …

In this tutorial, we build an end-to-end visual document retrieval pipeline using ColPali. We focus on making the setup robust by resolving common dependency conflicts and ensuring the environment stays stable. We render PDF pages as images, embed them using ColPali’s multi-vector representations, and rely on late-interaction scoring to retrieve the most relevant pages for a natural-language query. By treating each page visually rather than as plain text, we preserve layout, tables, and figures that are often lost in traditional text-only retrieval.

import subprocess, sys, os, json, hashlib

def pip(cmd):
    subprocess.check_call([sys.executable, "-m", "pip"] + cmd)

pip(["uninstall", "-y", "pillow", "PIL", "torchaudio", "colpali-engine"])
pip(["install", "-q", "--upgrade", "pip"])
pip(["install", "-q", "pillow<12", "torchaudio==2.8.0"])
pip(["install", "-q", "colpali-engine", "pypdfium2", "matplotlib", "tqdm", "requests"])

We prepare a clean and stable execution environment by uninstalling conflicting packages and upgrading pip. We explicitly pin compatible versions of Pillow and torchaudio to avoid runtime import errors. We then install ColPali and its required dependencies so the rest of the tutorial runs without interruptions.

import torch
import requests
import pypdfium2 as pdfium
from PIL import Image
from tqdm import tqdm
import matplotlib.pyplot as plt
from transformers.utils.import_utils import is_flash_attn_2_available
from colpali_engine.models import ColPali, ColPaliProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

MODEL_NAME = "vidore/colpali-v1.3"

model = ColPali.from_pretrained(
    MODEL_NAME,
    torch_dtype=dtype,
    device_map=device,
    attn_implementation="flash_attention_2" if device == "cuda" and is_flash_attn_2_available() else None,
).eval()

processor = ColPaliProcessor.from_pretrained(MODEL_NAME)

We import all required libraries and detect whether a GPU is available for acceleration. We load the ColPali model and processor with the appropriate precision and attention implementation based on the runtime. We ensure the model is ready for inference by switching it to evaluation mode.

PDF_URL = "https://arxiv.org/pdf/2407.01449.pdf"
pdf_bytes = requests.get(PDF_URL).content

pdf = pdfium.PdfDocument(pdf_bytes)
pages = []
MAX_PAGES = 15

for i in range(min(len(pdf), MAX_PAGES)):
    page = pdf[i]
    img = page.render(scale=2).to_pil().convert("RGB")
    pages.append(img)

We download a sample PDF and render its pages as high-resolution RGB images. We limit the number of pages to keep the tutorial lightweight and fast on Colab. We store the rendered pages in memory for direct visual embedding.

page_embeddings = []
batch_size = 2 if device == "cuda" else 1

for i in tqdm(range(0, len(pages), batch_size)):
    batch_imgs = pages[i:i+batch_size]
    batch = processor.process_images(batch_imgs)
    batch = {k: v.to(model.device) for k, v in batch.items()}
    with torch.no_grad():
        emb = model(**batch)
    page_embeddings.extend(list(emb.cpu()))

page_embeddings = torch.stack(page_embeddings)

We generate multi-vector embeddings for each rendered page using ColPali’s image encoder. We process pages in small batches to stay within GPU memory limits. We then stack all page embeddings into a single tensor that supports efficient late-interaction scoring.

def retrieve(query, top_k=3):
    q = processor.process_queries([query])
    q = {k: v.to(model.device) for k, v in q.items()}
    with torch.no_grad():
        q_emb = model(**q).cpu()
    scores = processor.score_multi_vector(q_emb, page_embeddings)[0]
    vals, idxs = torch.topk(scores, top_k)
    return [(int(i), float(v)) for i, v in zip(idxs, vals)]

def show(img, title):
    plt.figure(figsize=(6, 6))
    plt.imshow(img)
    plt.axis("off")
    plt.title(title)
    plt.show()

query = "What is ColPali and what problem does it solve?"
results = retrieve(query, top_k=3)

for rank, (idx, score) in enumerate(results, 1):
    show(pages[idx], f"Rank {rank} — Page {idx+1}")

def search(query, k=5):
    return [{"page": i+1, "score": s} for i, s in retrieve(query, k)]

print(json.dumps(search("late interaction retrieval"), indent=2))

We define the retrieval logic that scores queries against page embeddings using late interaction. We visualize the top-ranked pages to qualitatively inspect retrieval quality. We also expose a small search helper that returns structured results, making the pipeline easy to extend or integrate further.
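Conceptually, the late-interaction score that `score_multi_vector` computes is MaxSim: each query token vector takes its maximum dot-product similarity over the page's token vectors, and those maxima are summed. Here is a pure-Python sketch of that idea (illustrative only, not ColPali's batched implementation):

```python
# MaxSim late-interaction score: for each query token vector, find its
# best-matching page token vector (dot product), then sum those maxima.
def maxsim(query_vecs, page_vecs):
    score = 0.0
    for q in query_vecs:
        best = max(sum(qi * pi for qi, pi in zip(q, p)) for p in page_vecs)
        score += best
    return score

def rank_pages(query_vecs, all_pages, top_k=3):
    # Score every page, then return the top_k (index, score) pairs.
    scores = [(i, maxsim(query_vecs, page)) for i, page in enumerate(all_pages)]
    return sorted(scores, key=lambda t: t[1], reverse=True)[:top_k]
```

Because each query token is matched independently, a page can rank highly even when relevant content is spread across different regions of the layout, which is exactly what makes late interaction effective for visual documents.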

In conclusion, we have a compact yet powerful visual search system that demonstrates how ColPali enables layout-aware document retrieval in practice. We embedded pages once, reuse those embeddings efficiently, and retrieve results with interpretable relevance scores. This workflow gives us a strong foundation for scaling to larger document collections, adding indexing for speed, or layering generation on top of retrieved pages, while keeping the core pipeline simple, reproducible, and Colab-friendly.

Check out the FULL CODES here. Also, feel free to follow us on Twitter and join our 100k+ ML SubReddit and subscribe to our Newsletter.
The post [Tutorial] Building a Visual Document Retrieval Pipeline with ColPali and Late Interaction Scoring appeared first on MarkTechPost.

Tavus Launches Phoenix-4: A Gaussian-Diffusion Model Bringing Real-Tim …

The ‘uncanny valley’ is the final frontier for generative video. We have seen AI avatars that can talk, but they often lack the soul of human interaction. They suffer from stiff movements and a lack of emotional context. Tavus aims to fix this with the launch of Phoenix-4, a new generative AI model designed for the Conversational Video Interface (CVI).

Phoenix-4 represents a shift from static video generation to dynamic, real-time human rendering. It is not just about moving lips; it is about creating a digital human that perceives, times, and reacts with emotional intelligence.

The Power of Three: Raven, Sparrow, and Phoenix

To achieve true realism, Tavus utilizes a 3-part model architecture. Understanding how these models interact is key for developers looking to build interactive agents.

Raven-1 (Perception): This model acts as the ‘eyes and ears.’ It analyzes the user’s facial expressions and tone of voice to understand the emotional context of the conversation.

Sparrow-1 (Timing): This model manages the flow of conversation. It determines when the AI should interrupt, pause, or wait for the user to finish, ensuring the interaction feels natural.

Phoenix-4 (Rendering): The core rendering engine. It uses Gaussian-diffusion to synthesize photorealistic video in real-time.

Source: https://www.tavus.io/post/phoenix-4-real-time-human-rendering-with-emotional-intelligence

Technical Breakthrough: Gaussian-Diffusion Rendering

Phoenix-4 moves away from traditional GAN-based approaches. Instead, it uses a proprietary Gaussian-diffusion rendering model. This allows the AI to calculate complex facial movements, such as the way skin stretching affects light or how micro-expressions appear around the eyes.

This means the model handles spatial consistency better than previous versions. If a digital human turns their head, the textures and lighting remain stable. The model generates these high-fidelity frames at a rate that supports 30 frames per second (fps) streaming, which is essential for maintaining the illusion of life.

Breaking the Latency Barrier: Sub-600ms

In a CVI, speed is everything. If the delay between a user speaking and the AI responding is too long, the ‘human’ feel is lost. Tavus has developed the Phoenix-4 pipeline to achieve an end-to-end conversational latency of sub-600ms.

This is achieved through a ‘stream-first’ architecture. The model uses WebRTC (Web Real-Time Communication) to stream video data directly to the client’s browser. Rather than generating a full video file and then playing it, Phoenix-4 renders and sends video packets incrementally. This ensures that the time to first frame is kept at an absolute minimum.

Programmatic Emotion Control

One of the most powerful features is the Emotion Control API. Developers can now explicitly define the emotional state of a Persona during a conversation.

By passing an emotion parameter in the API request, you can trigger specific behavioral outputs. The model currently supports primary emotional states including:

Joy

Sadness

Anger

Surprise

When the emotion is set to joy, the Phoenix-4 engine adjusts the facial geometry to create a genuine smile, affecting the cheeks and eyes, not just the mouth. This is a form of conditional video generation where the output is influenced by both the text-to-speech phonemes and an emotional vector.

Building with Replicas

Creating a custom ‘Replica’ (a digital twin) requires only 2 minutes of video footage for training. Once the training is complete, the Replica can be deployed via the Tavus CVI SDK.

The workflow is straightforward:

Train: Upload 2 minutes of a person speaking to create a unique replica_id.

Deploy: Use the POST /conversations endpoint to start a session.

Configure: Set the persona_id and the conversation_name.

Connect: Link the provided WebRTC URL to your front-end video component.
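The four steps above can be sketched as a session request. The endpoint path and field names (`replica_id`, `persona_id`, `conversation_name`) come from the workflow; the base URL and API-key header are assumptions, so check the Tavus docs before using them:

```python
# Sketch of a CVI session request for POST /conversations.
# Base URL and auth header are assumptions, not confirmed API details.
def build_conversation_request(replica_id: str, persona_id: str, name: str) -> dict:
    return {
        "url": "https://tavusapi.com/v2/conversations",  # assumed base URL
        "headers": {"x-api-key": "YOUR_API_KEY"},        # assumed auth header
        "json": {
            "replica_id": replica_id,          # from the Train step
            "persona_id": persona_id,          # from the Configure step
            "conversation_name": name,
        },
    }
```

The response would then carry the WebRTC URL you link to your front-end video component in the Connect step.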


Key Takeaways

Gaussian-Diffusion Rendering: Phoenix-4 moves beyond traditional GANs to use Gaussian-diffusion, enabling high-fidelity, photorealistic facial movements and micro-expressions that solve the ‘uncanny valley’ problem.

The AI Trinity (Raven, Sparrow, Phoenix): The architecture relies on three distinct models: Raven-1 for emotional perception, Sparrow-1 for conversational timing/turn-taking, and Phoenix-4 for the final video synthesis.

Ultra-Low Latency: Optimized for the Conversational Video Interface (CVI), the model achieves sub-600ms end-to-end latency, utilizing WebRTC to stream video packets in real-time.

Programmatic Emotion Control: You can use an Emotion Control API to specify states like joy, sadness, anger, or surprise, which dynamically adjusts the character’s facial geometry and expressions.

Rapid Replica Training: Creating a custom digital twin (‘Replica’) is highly efficient, requiring only 2 minutes of video footage to train a unique identity for deployment via the Tavus SDK.

Check out the Technical details, Docs and Try it here.
The post Tavus Launches Phoenix-4: A Gaussian-Diffusion Model Bringing Real-Time Emotional Intelligence And Sub-600ms Latency To Generative Video AI appeared first on MarkTechPost.

Google DeepMind Releases Lyria 3: An Advanced Music Generation AI Mode …

Google DeepMind is pushing the boundaries of generative AI again. This time, the focus is not on text or images. It is on music. The Google team recently introduced Lyria 3, their most advanced music generation model to date. Lyria 3 represents a significant shift in how machines handle complex audio waveforms and creative intent.

With the release of Lyria 3 inside the Gemini app, Google is moving these tools from the research lab to the hands of everyday users. If you are a software engineer or a data scientist, here is what you need to know about the technical landscape of Lyria 3.

The Challenge of AI Music

Building a music model is much harder than building a text model. Text is discrete and linear. Music is continuous and multi-layered. A model must handle melody, harmony, rhythm, and timbre all at once. It must also maintain long-range coherence. This means a song must sound like the same song from the 1st second to the 30th second.

Lyria 3 is designed to solve these problems. It creates high-fidelity audio that includes vocals and multi-instrumental tracks. It does not just piece together loops. It generates full musical arrangements from scratch.

Lyria 3 and the Gemini Integration

Lyria 3 is now available in the Gemini app. Users can type a prompt or even upload an image to receive a 30-second music track. The interesting part is how Google integrates this into a multimodal ecosystem.

In the Gemini app, Lyria 3 allows for a fast ‘prompt-to-audio’ workflow. You can describe a mood, a genre, or a specific set of instruments. The model then outputs a high-quality file. This integration shows that Google is treating audio as a primary modality alongside text and vision.

Key Technical Specifications of Lyria 3

Output Length – 30 seconds
Sample Rate – 48kHz
Audio Format – 16-bit PCM (Stereo)
Input Modalities – Text, Image, Audio
Watermarking – SynthID
Latency – Under 2 seconds for control changes

Real-Time Control: Lyria RealTime

The Lyria RealTime API is where the real innovation happens. Unlike traditional models that work like a ‘jukebox’ (input a prompt and wait for a file), Lyria RealTime operates on a chunk-based autoregression system.

It uses a bidirectional WebSocket connection to maintain a live stream. The model generates audio in 2-second chunks. It looks back at previous context to maintain the ‘groove’ while looking forward at user controls to decide the style. This allows for steering the audio using WeightedPrompts.
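To make the chunk-based autoregression concrete, here is a toy simulation (not the Lyria RealTime API): each step consumes a sliding window of previously emitted chunks as context, plus the current prompt weights, and emits the next chunk in the stream.

```python
# Toy chunk-based autoregressive streamer: each new chunk is produced from
# a sliding window of prior chunks (the "groove") plus current prompt weights.
def stream_chunks(generate_chunk, prompts, n_chunks, context_len=4):
    history = []
    for _ in range(n_chunks):
        context = history[-context_len:]          # look back at recent chunks
        chunk = generate_chunk(context, prompts)  # steer with weighted prompts
        history.append(chunk)
        yield chunk
```

Because `generate_chunk` is called once per chunk, prompt weights can change between calls, which is how a live stream can be steered mid-performance without restarting generation.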

The Music AI Sandbox

For musicians and aspirants, Google DeepMind created the Music AI Sandbox. This is a suite of tools designed for the creative process. It allows users to:

Transform Audio: Take a simple hum or a basic piano line and turn it into a full orchestral arrangement.

Style Transfer: Use MIDI chords to generate a vocal choir.

Instrument Manipulation: Use text prompts to change instruments while keeping the same melody.

This is a clear example of human-in-the-loop AI. It uses latent space representations to allow users to ‘jam’ with the model.

Safety and Attribution: SynthID

Generating music brings up massive questions about copyright. Google DeepMind team addressed this by using SynthID. This tool watermarks AI-generated content by embedding a digital signature directly into the audio waveform.

SynthID is invisible and inaudible to the human ear. However, it can be detected by software. Even if the audio is compressed to MP3, slowed down, or recorded through a microphone (the ‘analog hole’), the watermark remains. This is a critical development in AI ethics. It provides a technical solution to the problem of AI attribution.

How does this make a difference?

Lyria 3 offers several lessons in model architecture:

High Fidelity: Generating audio at 48kHz requires efficient neural networks that can handle massive amounts of data per second.

Causal Streaming: The model must generate audio faster than it is played (real-time factor > 1).

Cross-Modal Embeddings: The ability to steer a model using text or images requires deep understanding of how different data types map to the same latent space.

2026 AI Music Showdown: Lyria 3 vs. Suno vs. Udio

Feature | Google Lyria 3 | Suno (v5 Engine) | Udio (v1.5/Pro)
Best For | Multimodal integration & speed | Catchy pop hits & viral clips | Studio-grade fidelity & control
Primary Workflow | Gemini App / RealTime API | Rapid prototyping (Text-to-Song) | Iterative “co-writing” & Inpainting
Max Track Length | 30 seconds (Gemini Beta) | 8 minutes | 15 minutes (via extensions)
Audio Quality | 48kHz / 16-bit PCM | High-fidelity (Improved v5) | Ultra-realistic / Studio-Grade
Input Modalities | Text, Images, & Audio | Text & Audio Upload | Text & Audio Reference
Unique Feature | SynthID Inaudible Watermark | 12-Stem individual track splitting | Advanced Inpainting & editing
Safety Tech | Digital waveform watermarking | Metadata (Content Credentials) | Metadata (Content Credentials)

Key Takeaways

Multimodal Integration in Gemini: Lyria 3 is now a core part of the Gemini ecosystem, allowing users to generate high-fidelity, 30-second music tracks using text, images, or audio prompts directly within the app.

High-Fidelity ‘Prompt-to-Audio’ Workflow: The model creates complex, multi-layered musical arrangements—including vocals and instruments—at a 48kHz sample rate, moving beyond simple loops to full compositions.

Advanced Long-Range Coherence: A major technical breakthrough of Lyria 3 is its ability to maintain musical continuity, ensuring that melody, rhythm, and style remain consistent from the 1st second to the end of the track.

Real-Time Creative Control: Through the Music AI Sandbox and Lyria RealTime API, developers and artists can ‘steer’ the AI in real-time, transforming simple inputs like humming into full orchestral pieces using latent space manipulation.

Built-in Safety with SynthID: To address copyright and authenticity, every track generated by Lyria includes a SynthID watermark. This digital signature is inaudible to humans but remains detectable by software even after heavy compression or editing.

Check out the Technical details.
The post Google DeepMind Releases Lyria 3: An Advanced Music Generation AI Model that Turns Photos and Text into Custom Tracks with Included Lyrics and Vocals appeared first on MarkTechPost.

Build unified intelligence with Amazon Bedrock AgentCore

Building cohesive and unified customer intelligence across your organization starts with reducing the friction your sales representatives face when toggling between Salesforce, support tickets, and Amazon Redshift. A sales representative preparing for a customer meeting might spend hours clicking through several different dashboards (product recommendations, engagement metrics, revenue analytics, and more) before developing a complete picture of the customer’s situation. At AWS, our sales organization experienced this firsthand as we scaled globally. We needed a way to unify siloed customer data across metrics databases, document repositories, and external industry sources without building complex custom orchestration infrastructure.
We built the Customer Agent & Knowledge Engine (CAKE), a customer-centric chat agent, using Amazon Bedrock AgentCore to solve this challenge. CAKE coordinates specialized retriever tools (querying knowledge graphs in Amazon Neptune, metrics in Amazon DynamoDB, documents in Amazon OpenSearch Service, and external market data through a web search API) alongside a Row-Level Security (RLS) tool for security enforcement, delivering customer insights through natural language queries in under 10 seconds (as observed in agent load tests).
In this post, we demonstrate how to build unified intelligence systems using Amazon Bedrock AgentCore through our real-world implementation of CAKE. You can build custom agents that unlock the following features and benefits:

Coordination of specialized tools through dynamic intent analysis and parallel execution
Integration of purpose-built data stores (Neptune, DynamoDB, OpenSearch Service) with parallel orchestration
Implementation of row-level security and governance within workflows
Production engineering practices for reliability, including template-based reporting to adhere to business semantic and style
Performance optimization through model flexibility

These architectural patterns can help you accelerate development for different use cases, including customer intelligence systems, enterprise AI assistants, or multi-agent systems that coordinate across different data sources.
Why customer intelligence systems need unification
As sales organizations scale globally, they often face three critical challenges: fragmented data across specialized tools (product recommendations, engagement dashboards, revenue analytics, and more) that requires hours to assemble a comprehensive customer view; loss of business semantics in traditional databases, which can’t capture the relationships that explain why metrics matter; and manual consolidation processes that can’t scale with growing data volumes. You need a unified system that can aggregate customer data, understand semantic relationships, and reason through customer needs in business context. This is the gap CAKE fills.
Solution overview
CAKE is a customer-centric chat agent that transforms fragmented data into unified, actionable intelligence. By consolidating internal and external data sources/tables into a single conversational endpoint, CAKE delivers personalized customer insights powered by context-rich knowledge graphs—all in under 10 seconds. Unlike traditional tools that simply report numbers, the semantic foundation of CAKE captures the meaning and relationships between business metrics, customer behaviors, industry dynamics, and strategic contexts. This enables CAKE to explain not just what is happening with a customer, but why it’s happening and how to act.
Amazon Bedrock AgentCore provides the runtime infrastructure that multi-agent AI systems require as a managed service, including inter-agent communication, parallel execution, conversation state tracking, and tool routing. This helps teams focus on defining agent behaviors and business logic rather than implementing distributed systems infrastructure.
For CAKE, we built a custom agent on Amazon Bedrock AgentCore that coordinates five specialized tools, each optimized for different data access patterns:

Neptune retriever tool for graph relationship queries
DynamoDB agent for instant metric lookups
OpenSearch retriever tool for semantic document search
Web search tool for external industry intelligence
Row level security (RLS) tool for security enforcement

The following diagram shows how Amazon Bedrock AgentCore supports the orchestration of these components.

The solution flows through several key phases in response to a question (for example, “What are the top expansion opportunities for this customer?”):

Analyzes intent and routes the query – The supervisor agent, running on Amazon Bedrock AgentCore, analyzes the natural language query to determine its intent. The question requires customer understanding, relationship data, usage metrics, and strategic insights. The agent’s tool-calling logic, using Amazon Bedrock AgentCore Runtime, identifies which specialized tools to activate.
Dispatches tools in parallel – Rather than executing tool calls sequentially, the orchestration layer dispatches multiple retriever tools in parallel, using the scalable execution environment of Amazon Bedrock AgentCore Runtime. The agent manages the execution lifecycle, handling timeouts, retries, and error conditions automatically.
Synthesizes multiple results – As specialized tools return results, Amazon Bedrock AgentCore streams these partial responses to the supervisor agent, which synthesizes them into a coherent answer. The agent reasons about how different data sources relate to each other, identifies patterns, and generates insights that span multiple knowledge domains.
Enforces security boundaries – Before data retrieval begins, the agent invokes the RLS tool to deterministically enforce user permissions. The custom agent then verifies that subsequent tool calls respect these security boundaries, automatically filtering results and helping prevent unauthorized data access. This security layer operates at the infrastructure level, reducing the risk of implementation errors.
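The dispatch-in-parallel phase can be sketched with a thread pool. The retriever functions here are hypothetical stand-ins for the Neptune, DynamoDB, OpenSearch Service, and web search tools; this is an illustration of the pattern, not the CAKE implementation:

```python
from concurrent.futures import ThreadPoolExecutor

# Dispatch several retriever tools concurrently and collect their results.
# Tool callables are hypothetical stand-ins for the real retrievers.
def dispatch_parallel(tools, query, timeout=10.0):
    results = {}
    with ThreadPoolExecutor(max_workers=max(len(tools), 1)) as pool:
        futures = {name: pool.submit(fn, query) for name, fn in tools.items()}
        for name, fut in futures.items():
            try:
                results[name] = fut.result(timeout=timeout)
            except Exception as exc:  # a failing tool degrades gracefully
                results[name] = {"error": str(exc)}
    return results
```

Collecting per-tool errors instead of failing the whole request mirrors the lifecycle handling described above: the supervisor can still synthesize an answer from whichever retrievers succeeded.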

This architecture operates on two parallel tracks: Amazon Bedrock AgentCore provides the runtime for the real-time serving layer that responds to user queries with minimal latency, and an offline data pipeline periodically refreshes the underlying data stores from the analytical data warehouse. In the following sections, we discuss the agent framework design and core solution components, including the knowledge graph, data stores, and data pipeline.
Agent framework design
Our multi-agent system is built on the AWS Strands Agents framework, which provides a model-driven foundation for building agents from many different models and delivers structured reasoning capabilities while maintaining the enterprise controls required for regulatory compliance and predictable performance. The supervisor agent analyzes incoming questions to intelligently select which specialized agents and tools to invoke and how to decompose user queries. The framework exposes agent states and outputs to implement decentralized evaluation at both the agent and supervisor levels. Building on this model-driven approach, we implement agentic reasoning through GraphRAG reasoning chains that construct deterministic inference paths by traversing knowledge relationships. Our agents perform autonomous reasoning within their specialized domains, grounded in pre-defined ontologies, while maintaining the predictable, auditable behavior patterns required for enterprise applications.
The supervisor agent employs a multi-phase selection protocol:

Question analysis – Parse and understand user intent
Source selection – Intelligent routing determines which combination of tools are needed
Query decomposition – Original questions are broken down into specialized sub-questions optimized for each selected tool
Parallel execution – Selected tools execute concurrently through serverless AWS Lambda action groups

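The first three phases of the selection protocol can be sketched as follows. The routing table and keyword matching are assumptions for illustration only; the production system uses LLM-driven routing, not keyword rules.

```python
# Illustrative sketch of the selection protocol; tool names and routing
# hints are hypothetical, and real source selection is LLM-driven.
TOOL_HINTS = {
    "graph_reasoning": ("related", "relationship", "connected"),
    "customer_insights": ("health", "usage", "metrics"),
    "semantic_search": ("notes", "plan", "strategy"),
}

def select_and_decompose(question: str):
    q = question.lower()                                    # phase 1: question analysis
    selected = [tool for tool, hints in TOOL_HINTS.items()  # phase 2: source selection
                if any(hint in q for hint in hints)]
    # phase 3: decomposition - one focused sub-question per selected tool
    return {tool: f"{tool}: {question}" for tool in selected}
    # phase 4 (parallel execution) would fan these out to Lambda action groups

plan = select_and_decompose("How are usage metrics related to the account plan?")
```

Each value in `plan` would become the input to one concurrently executed tool in phase 4.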
Tools are exposed through a hierarchical composition pattern (accounting for data modality—structured vs. unstructured) where high-level agents and tools coordinate multiple specialized sub-tools:

Graph reasoning tool – Manages entity traversal, relationship analysis, and knowledge extraction
Customer insights agent – Coordinates multiple fine-tuned models in parallel for generating customer summaries from tables
Semantic search tool – Orchestrates unstructured text analysis (such as field notes)
Web research tool – Coordinates web/news retrieval

We extend the core AWS Strands Agents framework with enterprise-grade capabilities including customer access validation, token optimization, multi-hop LLM selection for model throttling resilience, and structured GraphRAG reasoning chains. These extensions deliver the autonomous decision-making capabilities of modern agentic systems while facilitating predictable performance and regulatory compliance alignment.
Building the knowledge graph foundation
CAKE’s knowledge graph in Neptune represents customer relationships, product usage patterns, and industry dynamics in a structured format that empowers AI agents to perform efficient reasoning. Unlike traditional databases that store information in isolation, CAKE’s knowledge graph captures the semantic meaning of business entities and their relationships.
Graph construction and entity modeling
We designed the knowledge graph around the AWS sales ontology—the core entities and relationships that sales teams discuss daily:

Customer entities – With properties extracted from data sources including industry classifications, revenue metrics, cloud adoption phase, and engagement scores
Product entities – Representing AWS services, with connections to use cases, industry applications, and customer adoption patterns
Solution entities – Linking products to business outcomes and strategic initiatives
Opportunity entities – Tracking sales pipeline, deal stages, and associated stakeholders
Contact entities – Mapping relationship networks within customer organizations

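As a minimal sketch of how entity data from the warehouse could be shaped for loading, the following prepares node and edge rows in Neptune's Gremlin bulk-load CSV format (`~id`, `~label`, `~from`, `~to` headers). The property names and sample records are illustrative assumptions, not the actual CAKE schema.

```python
import csv
import io

# Hypothetical warehouse extracts; real data comes from scheduled Redshift queries.
customers = [{"id": "cust-1", "industry": "retail", "phase": "migration"}]
adoptions = [{"customer": "cust-1", "product": "prod-redshift", "trend": "increasing"}]

def node_rows():
    """Customer entities as Neptune bulk-loader vertex rows."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["~id", "~label", "industry", "phase"])
    writer.writeheader()
    for c in customers:
        writer.writerow({"~id": c["id"], "~label": "Customer",
                         "industry": c["industry"], "phase": c["phase"]})
    return out.getvalue()

def edge_rows():
    """Adoption relationships as edge rows; edge properties carry the context."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["~id", "~from", "~to", "~label", "trend"])
    writer.writeheader()
    for i, a in enumerate(adoptions):
        # The trend attribute illustrates relationship context stored on the edge.
        writer.writerow({"~id": f"e{i}", "~from": a["customer"], "~to": a["product"],
                         "~label": "USES", "trend": a["trend"]})
    return out.getvalue()
```

Storing the usage trend as an edge property (rather than a separate record) is what lets traversal queries return relationship context alongside the connection itself.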
Amazon Neptune excels at answering questions that require understanding connections—finding how two entities are related, identifying paths between accounts, or discovering indirect relationships that span multiple hops. The offline data construction process runs scheduled queries against Amazon Redshift clusters to prepare data to be loaded in the graph.
Capturing relationship context
CAKE’s knowledge graph captures how relationships connect entities. When the graph connects a customer to a product through an increased usage relationship, it also stores contextual attributes: the rate of increase, the business driver (from account plans), and related product adoption patterns. This contextual richness helps the LLM understand business context and provide explanations grounded in actual relationships rather than statistical correlation alone.
Purpose-built data stores
Rather than storing data in a single database, CAKE uses specialized data stores, each designed for how it gets queried. Our custom agent, running on Amazon Bedrock AgentCore, manages the coordination across these stores—sending queries to the right database, running them at the same time, and combining results—so both users and developers work with what feels like a single data source:

Neptune for graph relationships – Neptune stores the web of connections between customers, accounts, stakeholders, and organizational entities. Neptune excels at multi-hop traversal queries that require expensive joins in relational databases—finding relationship paths between disconnected accounts, or discovering customers in an industry who’ve adopted specific AWS services. When Amazon Bedrock AgentCore identifies a query requiring relationship reasoning, it automatically routes to the Neptune retriever tool.
DynamoDB for instant metrics – DynamoDB operates as a key-value store for precomputed aggregations. Rather than computing customer health scores or engagement metrics on-demand, the offline pipeline pre-computes these values and stores them indexed by customer ID. DynamoDB then delivers sub-10ms lookups, enabling instant report generation. Tool chaining in Amazon Bedrock AgentCore allows it to retrieve metrics from DynamoDB, pass them to the magnifAI agent (our custom table-to-text agent) for formatting, and return polished reports—all without custom integration code.
OpenSearch Service for semantic document search – OpenSearch Service stores unstructured content like account plans and field notes. Using embedding models, OpenSearch Service converts text into vector representations that support semantic matching. When Amazon Bedrock AgentCore receives a query about “digital transformation,” for example, it recognizes the need for semantic search and automatically routes to the OpenSearch Service retriever tool, which finds relevant passages even when documents use different terminology.
S3 for document storage – Amazon Simple Storage Service (Amazon S3) provides the foundation for OpenSearch Service. Account plans are stored as Parquet files in Amazon S3 before being indexed because the source warehouse (Amazon Redshift) has truncation limits that would cut off large documents. This multi-step process—Amazon S3 storage, embedding generation, OpenSearch Service indexing—preserves complete content while maintaining the low latency required for real-time queries.

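The DynamoDB precompute-then-lookup pattern described above can be sketched with an in-memory stand-in. The dict replaces the DynamoDB table and the metric formulas are placeholders; the point is the split between the batch refresh and the constant-time serving lookup.

```python
# Hypothetical raw observations; in production these come from the warehouse.
RAW_EVENTS = {
    "cust-1": [120, 150, 180],   # e.g. monthly usage observations
    "cust-2": [90, 80, 70],
}

def offline_refresh():
    """Batch pipeline: precompute aggregations, keyed by customer ID."""
    return {
        cid: {"avg_usage": sum(v) / len(v), "trend": "up" if v[-1] > v[0] else "down"}
        for cid, v in RAW_EVENTS.items()
    }

METRICS_TABLE = offline_refresh()    # populated on the pipeline's schedule

def get_metrics(customer_id):
    """Serving layer: constant-time lookup, no on-demand aggregation."""
    return METRICS_TABLE.get(customer_id)
```

Because aggregation happens offline, the serving path is a single key lookup, which is what makes the sub-10ms response times described above achievable.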
Building on Amazon Bedrock AgentCore makes these multi-database queries feel like a single, unified data source. When a query requires customer relationships from Neptune, metrics from DynamoDB, and document context from OpenSearch Service, our agent automatically dispatches requests to all three in parallel, manages their execution, and synthesizes their results into a single coherent response.
Data pipeline and continuous refresh
The CAKE offline data pipeline operates as a batch process that runs on a scheduled cadence to keep the serving layer synchronized with the latest business data. The pipeline architecture separates data construction from data serving, so the real-time query layer can maintain low latency while the batch pipeline handles computationally intensive aggregations and graph construction.
The Data Processing Orchestration layer coordinates transformations across multiple target databases. For each database, the pipeline performs the following steps:

Extracts relevant data from Amazon Redshift using optimized queries
Applies business logic transformations specific to each data store’s requirements
Loads processed data into the target database with appropriate indexes and partitioning

For Neptune, this involves extracting entity data, constructing graph nodes and edges with property attributes, and loading the graph structure with semantic relationship types. For DynamoDB, the pipeline computes aggregations and metrics, structures data as key-value pairs optimized for customer ID lookups, and applies atomic updates to maintain consistency. For OpenSearch Service, the pipeline follows a specialized path: large documents are first exported from Amazon Redshift to Amazon S3 as Parquet files, then processed through embedding models to generate vector representations, which are finally loaded into the OpenSearch Service index with appropriate metadata for filtering and retrieval.
Engineering for production: Reliability and accuracy
When transitioning CAKE from prototype to production, we implemented several critical engineering practices to facilitate reliability, accuracy, and trust in AI-generated insights.
Model flexibility
The Amazon Bedrock AgentCore architecture decouples the orchestration layer from the underlying LLM, allowing flexible model selection. We implemented model hopping to provide automatic fallback to alternative models when throttling occurs. This resilience happens transparently within AgentCore’s Runtime—detecting throttling conditions, routing requests to available models, and maintaining response quality without user-visible degradation.
Row-Level Security (RLS) and Data Governance
Before data retrieval occurs, the RLS tool enforces row-level security based on user identity and organizational hierarchy. This security layer operates transparently to users while maintaining strict data governance:

Sales representatives access only customers assigned to their territories
Regional managers view aggregated data across their regions
Executives have broader visibility aligned with their responsibilities

The RLS tool routes queries to appropriate data partitions and applies filters at the database query level, so security can be enforced in the data layer rather than relying on application-level filtering.
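A minimal sketch of this query-level enforcement, assuming an illustrative role hierarchy (the assignment table, role names, and filter semantics are hypothetical, not the production RLS tool):

```python
# Hypothetical identity-to-scope mapping; real data comes from org hierarchy.
ASSIGNMENTS = {
    "rep-1": {"role": "rep", "territory": {"cust-1"}},
    "mgr-1": {"role": "manager", "region": {"cust-1", "cust-2"}},
    "exec-1": {"role": "executive"},
}

def rls_filter(user_id):
    """Return the set of customer IDs the query may touch (None = unrestricted)."""
    user = ASSIGNMENTS[user_id]
    if user["role"] == "rep":
        return user["territory"]
    if user["role"] == "manager":
        return user["region"]
    return None  # executives: broader visibility

def run_query(user_id, rows):
    allowed = rls_filter(user_id)
    # The filter is applied in the data layer, before results leave the store.
    return [r for r in rows if allowed is None or r["customer"] in allowed]

DATA = [{"customer": "cust-1", "score": 0.9}, {"customer": "cust-2", "score": 0.4}]
```

Resolving the allowed set before the query runs, rather than filtering results afterward in application code, is what keeps enforcement deterministic.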
Results and impact
CAKE has transformed how AWS sales teams access and act on customer intelligence. By providing instant access to unified insights through natural language queries, CAKE reduces the time spent searching for information from hours to seconds, according to user surveys and feedback, helping sales representatives focus on strategic customer engagement rather than data gathering.
The multi-agent architecture delivers query responses in seconds for most queries, with the parallel execution model supporting simultaneous data retrieval from multiple sources. The knowledge graph enables sophisticated reasoning that goes beyond simple data aggregation—CAKE explains why trends occur, identifies patterns across seemingly unrelated data points, and generates recommendations grounded in business relationships. Perhaps most importantly, CAKE democratizes access to customer intelligence across the organization. Sales representatives, account managers, solutions architects, and executives interact with the same unified system, providing consistent customer insights while maintaining appropriate security and access controls.
Conclusion
In this post, we showed how Amazon Bedrock AgentCore supports CAKE’s multi-agent architecture. Building multi-agent AI systems traditionally requires significant infrastructure investment, including implementing custom agent coordination protocols, managing parallel execution frameworks, tracking conversation state, handling failure modes, and building security enforcement layers. Amazon Bedrock AgentCore reduces this undifferentiated heavy lifting by providing these capabilities as managed services within Amazon Bedrock.
Amazon Bedrock AgentCore provides the runtime infrastructure for orchestration, and specialized data stores excel at their specific access patterns. Neptune handles relationship traversal, DynamoDB provides instant metric lookups, and OpenSearch Service supports semantic document search, but our custom agent, built on Amazon Bedrock AgentCore, coordinates these components, automatically routing queries to the right tools, executing them in parallel, synthesizing their results, and maintaining security boundaries throughout the workflow. The CAKE experience demonstrates how Amazon Bedrock AgentCore can help teams build multi-agent AI systems, speeding up the process from months of infrastructure development to weeks of business logic implementation. By providing orchestration infrastructure as a managed service, Amazon Bedrock AgentCore helps teams focus on domain expertise and customer value rather than building distributed systems infrastructure from scratch.
To learn more about Amazon Bedrock AgentCore and building multi-agent AI systems, refer to the Amazon Bedrock User Guide, Amazon Bedrock Workshop, and Amazon Bedrock Agents. For the latest news on AWS, see What’s New with AWS.
Acknowledgments
We extend our sincere gratitude to our executive sponsors and mentors whose vision and guidance made this initiative possible: Aizaz Manzar, Director of AWS Global Sales; Ali Imam, Head of Startup Segment; and Akhand Singh, Head of Data Engineering.
We also thank the dedicated team members whose technical expertise and contributions were instrumental in bringing this product to life: Aswin Palliyali Venugopalan, Software Dev Manager; Alok Singh, Senior Software Development Engineer; Muruga Manoj Gnanakrishnan, Principal Data Engineer; Sai Meka, Machine Learning Engineer; Bill Tran, Data Engineer; and Rui Li, Applied Scientist.

About the authors
Monica Jain is a Senior Technical Product Manager at AWS Global Sales and an analytics professional driving AI-powered sales intelligence at scale. She leads the development of generative AI and ML-powered data products—including knowledge graphs, AI-augmented analytics, natural language query systems, and recommendation engines—that improve seller productivity and decision-making. Her work enables AWS executives and sellers worldwide to access real-time insights and accelerate data-driven customer engagement and revenue growth.
M. Umar Javed is a Senior Applied Scientist at AWS, with over 8 years of experience across academia and industry and a PhD in ML theory. At AWS, he builds production-grade generative AI and machine learning solutions, with work spanning multi-agent LLM architectures, research on small language models, knowledge graphs, recommendation systems, reinforcement learning, and multi-modal deep learning. Prior to AWS, Umar contributed to ML research at NREL, CISCO, Oxford, and UCSD. He is a recipient of the ECEE Excellence Award (2021) and contributed to two Donald P. Eckman Awards (2021, 2023).
Damien Forthomme is a Senior Applied Scientist at AWS, leading a Data Science team in AWS Sales, Marketing, and Global Services (SMGS). With more than 10 years of experience and a PhD in Physics, he focuses on using and building advanced machine learning and generative AI tools to surface the right data to the right people at the right time. His work encompasses initiatives such as forecasting, recommendation systems, core foundational datasets creation, and building generative AI products that enhance sales productivity for the organization.
Mihir Gadgil is a Senior Data Engineer in AWS Sales, Marketing, and Global Services (SMGS), specializing in enterprise-scale data solutions and generative AI applications. With over 9 years of experience and a Master’s in Information Technology & Management, he focuses on building robust data pipelines, complex data modeling, and ETL/ELT processes. His expertise drives business transformation through innovative data engineering solutions and advanced analytics capabilities.
Sujit Narapareddy, Head of Data & Analytics at AWS Global Sales, is a technology leader driving global enterprise transformation. He leads data product and platform teams that power AWS's go-to-market through AI-augmented analytics and intelligent automation. With a proven track record in enterprise solutions, he has transformed sales productivity, data governance, and operational excellence. Previously at JPMorgan Chase Business Banking, he shaped next-generation FinTech capabilities through data innovation.
Norman Braddock, Senior Manager of AI Product Management at AWS, is a product leader driving the transformation of business intelligence through agentic AI. He leads the Analytics & Insights Product Management team within Sales, Marketing, and Global Services (SMGS), delivering products that bridge AI model performance with measurable business impact. With a background spanning procurement, manufacturing, and sales operations, he combines deep operational expertise with product innovation to shape the future of autonomous business management.

Evaluating AI agents: Real-world lessons from building agentic systems …

The generative AI industry has undergone a significant transformation from using large language model (LLM)-driven applications to agentic AI systems, marking a fundamental shift in how AI capabilities are architected and deployed. While early generative AI applications primarily relied on LLMs to directly generate text and respond to prompts, the industry has evolved from those static, prompt-response paradigms toward autonomous agent frameworks to build dynamic, goal-oriented systems capable of tool orchestration, iterative problem-solving, and adaptive task execution in production environments.
We have witnessed this evolution at Amazon: since 2025, thousands of agents have been built across Amazon organizations. While single-model benchmarks serve as a crucial foundation for assessing individual LLM performance in LLM-driven applications, agentic AI systems require a fundamental shift in evaluation methodologies. The new paradigm assesses not only the underlying model performance but also the emergent behaviors of the complete system, including the accuracy of tool selection decisions, the coherence of multi-step reasoning processes, the efficiency of memory retrieval operations, and the overall success rates of task completion across production environments.
In this post, we present a comprehensive evaluation framework for Amazon agentic AI systems that addresses the complexity of agentic AI applications at Amazon through two core components: a generic evaluation workflow that standardizes assessment procedures across diverse agent implementations, and an agent evaluation library that provides systematic measurements and metrics in Amazon Bedrock AgentCore Evaluations, along with Amazon use case-specific evaluation approaches and metrics. We also share best practices and experiences captured during engagements with multiple Amazon teams, providing actionable insights for AWS developer communities facing similar challenges in evaluating and deploying agentic AI systems within their own business contexts.
AI agent evaluation framework in Amazon
When builders design, develop, and evaluate AI agents, they face significant challenges. Unlike traditional LLM-driven applications that only generate responses to isolated prompts, AI agents autonomously pursue goals through multi-step reasoning, tool use, and adaptive decision-making across multi-turn interactions. Traditional LLM evaluation methods treat agent systems as black boxes and evaluate only the final outcome, failing to provide sufficient insights to determine why AI agents fail or pinpoint the root causes. Although multiple specific evaluation tools are available in the industry, builders must navigate among them and consolidate results with significant manual effort. Additionally, while agent development frameworks, such as Strands Agents, LangChain, and LangGraph, have built-in evaluation modules, builders want a framework-agnostic evaluation approach rather than being locked into methods within a single framework.
Additionally, robust self-reflection and error handling in AI agents require systematic assessment of how agents detect, classify, and recover from failures across the execution lifecycle in reasoning, tool use, memory handling, and action taking. For example, an evaluation framework must measure the agent's ability to recognize diverse failure scenarios such as inappropriate planning from the reasoning model, invalid tool invocations, malformed parameters, unexpected tool response formats, authentication failures, and memory retrieval errors. A production-grade agent must demonstrate consistent error recovery patterns and resilience in maintaining the coherence of user interactions after encountering exceptions.
To meet these needs, AI agents deployed in production environments at scale require continuous monitoring and systematic evaluation to promptly detect and mitigate agent decay and performance degradation. This demands that the agent evaluation framework streamline the end-to-end process and provide near real-time issue detection, notification, and problem resolution. Finally, incorporating human-in-the-loop (HITL) processes is essential to audit evaluation results, helping to ensure the reliability of system outputs.
To address these challenges, we propose a holistic agentic AI evaluation framework, as shown in the following figure. The framework contains two key components: an automated AI agent evaluation workflow and an AI agent evaluation library.

The automated AI agent evaluation workflow drives the holistic evaluation approach with four steps.
Step 1: Users define inputs for evaluation, typically trace files from agent execution. These can be offline traces, collected after the agent completes the task and uploaded to the framework through a unified API access point, or online traces. Users can also define evaluation dimensions and metrics.
Step 2: The AI agent evaluation library is used to automatically generate default and user-defined evaluation metrics. The methods in the library are described in the next list.
Step 3: The evaluation results are shared through an Amazon Simple Storage Service (Amazon S3) bucket or a dashboard that visualizes the agent trace observability and evaluation results.
Step 4: Results are analyzed through agent performance auditing and monitoring. Builders can define their own rules to send notifications upon agent performance degradation and can take action to resolve problems. Builders can also use HITL mechanisms to schedule periodic human audits of agent trace subsets and evaluation results, supporting consistent agent quality and performance.
The AI agent evaluation library operates across three layers: calculating and generating evaluation metrics for the agent’s final output, assessing individual agent components, and measuring the performance of the underlying LLMs that power the agent.

Bottom layer: Benchmarks multiple foundation models to select the appropriate models powering the AI agent and determine how different models impact the agent's overall quality and latency.
Middle layer: Evaluates the performance of the components of the agent, including intent detection, multi-turn conversation, memory, LLM reasoning and planning, tool-use, and others. For example, the middle layer determines whether the agent understands user intents correctly, how the LLM drives agentic workflow planning through chain-of-thought (CoT) reasoning, whether the tool selection and execution are aligned with the agentic plan, and if the plan is completed successfully.
Upper layer: Assesses the agent’s final response, the task completion, and whether the agent meets the goal defined in the use case. It also covers overall responsibility and safety, the costs, and the customer experience impacts.

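The layered assessment above can be illustrated with a small trace-walking sketch. The trace schema, field names, and metric set are assumptions for illustration; they are not the evaluation library's actual API.

```python
# Hypothetical execution trace with one correct and one incorrect tool call.
trace = {
    "final_response": "Your order ships tomorrow.",
    "steps": [
        {"type": "tool_call", "tool": "order_lookup", "expected": "order_lookup", "error": False},
        {"type": "tool_call", "tool": "refund", "expected": "shipping_eta", "error": True},
    ],
    "goal_met": True,
}

def evaluate(trace):
    calls = [s for s in trace["steps"] if s["type"] == "tool_call"]
    return {
        # middle layer: component-level metrics
        "tool_selection_accuracy": sum(c["tool"] == c["expected"] for c in calls) / len(calls),
        "tool_call_error_rate": sum(c["error"] for c in calls) / len(calls),
        # upper layer: task completion
        "goal_success": trace["goal_met"],
    }
```

Because the metrics are computed per step rather than only on the final answer, a run can pass the upper layer (goal met) while still surfacing a middle-layer problem (a wrong tool choice), which is the diagnostic gap that black-box evaluation misses.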
Amazon Bedrock AgentCore Evaluations provides automated assessment tools to measure how well your agent or tools perform specific tasks, handle edge cases, and maintain consistency across different inputs and contexts. In the agent evaluation library, we provide a set of pre-defined evaluation metrics for the agent’s final response and its components, based on the built-in configurations, evaluators, and metrics of AgentCore Evaluations. We further extended the evaluation library with specialized metrics designed for the heterogeneous scenario complexity and application-specific requirements of Amazon. The primary metrics in the library include:

Final response quality:

Correctness: The factual accuracy and correctness of an AI assistant’s response to a given task.
Faithfulness: Whether an AI assistant’s response remains consistent with the conversation history.
Helpfulness: How effectively an AI assistant’s response helps users appropriately address their query and progress toward their goals.
Response relevance: How well an AI assistant’s response addresses the specific question or request.
Conciseness: How efficiently an AI assistant communicates information, for instance, whether the response is appropriately brief without missing key information.

Task completion: 

Goal success: Whether the AI assistant successfully completed all user goals within a conversation session.
Goal accuracy: Compares the agent’s output to the ground truth.

Tool use:

Tool selection accuracy: Whether the AI assistant chose the appropriate tool for a given situation.
Tool parameter accuracy: Whether the AI assistant correctly used contextual information when making tool calls.
Tool call error rate: The frequency of failures when an AI assistant makes tool calls.
Multi-turn function calling accuracy: Whether multiple tools are called and how often they are called in the correct sequence.

Memory:

Context retrieval: Assesses how accurately the system finds and surfaces the most relevant contexts for a given query from memory, prioritizing relevant information based on similarity or ranking, and balancing precision and recall.

Multi-turn: 

Topic adherence classification: If a multi-turn conversation includes multiple topics, assesses whether the conversation stays on predefined domains and topics during the interaction.
Topic adherence refusal: Determines whether the AI agent refuses to answer questions about a topic.

Reasoning:

Grounding accuracy: Whether the model understands the task and appropriately selects tools, and whether the CoT is aligned with the provided context and data returned by external tools.
Faithfulness score: Measures logical consistency across the reasoning process.
Context score: Whether each step taken by the agent is contextually grounded.

Responsibility and safety:

Hallucination: Whether the outputs align with established knowledge, verifiable data, and logical inference, or include elements that are implausible, misleading, or entirely fictional.
Toxicity: Whether the outputs contain language, suggestions, or attitudes that are harmful, offensive, disrespectful, or promote negativity. This includes content that might be aggressive, demeaning, bigoted, or excessively critical without constructive purpose.
Harmfulness: Whether an AI assistant’s response contains potentially harmful content, including insults, hate speech, violence, inappropriate sexual content, and stereotyping.

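As a concrete example of one metric from the list, multi-turn function calling accuracy can be scored by comparing the agent's ordered tool invocations against a golden sequence. The position-wise scoring rule here is one plausible choice, not the library's defined formula.

```python
def sequence_accuracy(actual, golden):
    """Fraction of golden positions where the agent called the right tool."""
    if not golden:
        return 1.0
    matches = sum(a == g for a, g in zip(actual, golden))
    return matches / len(golden)

# Hypothetical golden sequence and an agent run that swapped two calls.
golden = ["search_products", "check_inventory", "place_order"]
actual = ["search_products", "place_order", "check_inventory"]
score = sequence_accuracy(actual, golden)   # only position 0 matches
```

Stricter variants (for example, requiring an exact prefix match, or penalizing extra calls) trade sensitivity for robustness; the right choice depends on whether out-of-order calls are tolerable in the use case.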
See AgentCore evaluation templates for other agent output quality metrics, or to learn how to create custom evaluators tailored to your specific use cases and evaluation requirements.
Evaluating real-world agent systems used by Amazon
In the past few years, Amazon has been working to advance its approach to building agentic AI applications that address complex business challenges, streamline business processes, improve operational efficiency, and optimize business outcomes—moving from early experimentation to production-scale deployments across multiple business units. These agentic AI applications operate at enterprise scale and are deployed across AWS infrastructure, transforming how work gets done across global operations within Amazon. In this section, we introduce a few real-world agentic AI use cases from Amazon to demonstrate how Amazon teams improve AI agent performance through holistic evaluation using the framework discussed in the previous section.
Evaluating tool-use in the Amazon shopping assistant AI agent
To deliver a smooth shopping experience to Amazon consumers, the Amazon shopping assistant can seamlessly interact with numerous APIs and web services from underlying Amazon systems, as shown in the following figure. The AI agent needs to onboard hundreds, sometimes thousands, of tools from underlying Amazon systems to engage in long-running multi-turn conversations with the consumer. The agent uses these tools to deliver a personalized experience that includes customer profiling, product and inventory discovery, and order placement. However, manually onboarding so many enterprise APIs and web services to an AI agent is a cumbersome process that typically takes months to complete.

Transforming legacy APIs and web services into agent-compatible tools requires systematically defining structured schemas and semantic descriptions for the API and web service endpoints, so that the agent’s reasoning and planning mechanisms can accurately identify and select contextually appropriate tools during task execution. Poorly defined tool schemas and imprecise semantic descriptions result in erroneous tool selection at agent runtime, leading to the invocation of irrelevant APIs that unnecessarily expand the context window, increase inference latency, and escalate computational costs through redundant LLM calls. To address these challenges, Amazon defined cross-organizational standards for tool schemas and descriptions, creating a governance framework that all builder teams involved in tool development and agent integration must follow. The standard establishes uniform specifications for tool interfaces, parameter definitions, capability descriptions, and usage constraints, including tool signatures, input validation schemas, output contracts, and human-readable documentation, helping to ensure that tools developed across diverse organizational units maintain consistent structural patterns and semantic clarity for reliable agent-tool interactions. Even with these standards in place, manually defining tool schemas and descriptions for hundreds or thousands of tools represents a significant engineering burden, and the complexity escalates substantially when multiple APIs require coordinated orchestration to accomplish composite tasks.
Amazon builders implemented an API self-onboarding system that uses LLMs to automate the generation of standardized tool schemas and descriptions. This significantly improved the efficiency of onboarding large numbers of APIs and services as agent-compatible tools, accelerating integration timelines and reducing manual engineering overhead. To evaluate tool selection and tool use after API integration is complete, Amazon teams created golden datasets for regression testing. The datasets are generated synthetically using LLMs from historical API invocation logs for user queries. Using pre-defined tool-selection and tool-use metrics such as tool selection accuracy, tool parameter accuracy, and multi-turn function calling accuracy, Amazon builders can systematically evaluate the shopping assistant AI agent’s capability to correctly identify appropriate tools, populate their parameters with accurate values, and maintain coherent tool invocation sequences across conversational turns. As the agent continues to evolve, the ability to rapidly and reliably integrate new APIs as agent tools and evaluate tool-use performance becomes increasingly crucial. Objective assessment of the agent’s functional reliability in production environments reduces development overhead while maintaining robust performance in agentic AI applications.
Evaluating user intent detection in the Amazon customer service AI agent
In the Amazon customer-service landscape, AI agents are instrumental in handling customer inquiries and resolving issues. At the heart of these systems lies a crucial capability: an orchestration AI agent using its reasoning model to accurately detect customer intent, which determines whether a customer's query is correctly understood and routed to the appropriate specialized resolver implemented by agent tools or subagents, as shown in the following figure. The stakes are high when it comes to intent detection accuracy. When the customer service agent misinterprets a customer's intent, it can trigger a cascade of problems: queries get routed to the wrong specialized resolvers, customers receive irrelevant responses, and frustration builds. This impacts customer experience and leads to increased operational costs as more customers seek intervention from human agents.

To evaluate the agent's reasoning capability for intent detection, the Amazon team developed an LLM simulator that uses LLM-driven virtual customer personas to simulate diverse user scenarios and interactions. The evaluation focuses mainly on the correctness of the intent generated by the orchestration agent and on routing to the correct subagent. The simulation dataset contains a set of user query and ground truth intent pairs collected from anonymized historical customer interactions. Using the simulator, the orchestration agent generates intents for the user queries in the simulation dataset. By comparing the agent's response intent to the ground truth intent, we can validate whether the agent-generated intents match the ground truth.
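The intent-correctness check described above reduces to comparing predicted intents (and the subagents they route to) against ground truth pairs. This is a minimal sketch under stated assumptions: the routing table, sample queries, and the stand-in `fake_agent` are all illustrative, not Amazon's actual system.

```python
# Hypothetical intent-detection evaluation over a small simulation dataset.

ROUTING = {"refund_request": "refunds_agent", "delivery_status": "logistics_agent"}

def evaluate_intents(samples, predict_intent):
    intent_hits = routing_hits = 0
    for s in samples:
        predicted = predict_intent(s["query"])
        intent_hits += predicted == s["ground_truth_intent"]
        # Routing is correct when the predicted intent maps to the same subagent.
        routing_hits += ROUTING.get(predicted) == ROUTING.get(s["ground_truth_intent"])
    n = len(samples)
    return {"intent_accuracy": intent_hits / n, "routing_accuracy": routing_hits / n}

samples = [
    {"query": "Where is my package?", "ground_truth_intent": "delivery_status"},
    {"query": "I want my money back", "ground_truth_intent": "refund_request"},
]

# Stand-in for the orchestration agent's reasoning model:
fake_agent = lambda q: "delivery_status" if "package" in q else "refund_request"
print(evaluate_intents(samples, fake_agent))
# {'intent_accuracy': 1.0, 'routing_accuracy': 1.0}
```

In practice `predict_intent` would call the orchestration agent, and routing accuracy can diverge from intent accuracy when distinct intents resolve to the same subagent.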
In addition to intent correctness, the evaluation covers task completion—the agent's final response and intent resolution—as the final goal of the customer service tasks. For multi-turn conversations, we also include the metrics of topic adherence classification and topic adherence refusal to help ensure conversational coherence and user experience quality. As AI customer service systems continue to evolve, the importance of robust agent reasoning evaluation for user intent detection only grows, and its impact extends beyond immediate customer satisfaction. It also optimizes customer service operation efficiency and service delivery costs, and so maximizes the return on AI investments.
Evaluating multi-agent systems at Amazon
As enterprises increasingly confront multifaceted challenges in complex business environments, ranging from cross-functional workflow orchestration to real-time decision-making under uncertainty, Amazon teams are progressively adopting multi-agent system architectures that decompose monolithic AI solutions into specialized, collaborative agents capable of distributed reasoning, dynamic task allocation, and adaptive problem-solving at scale. One example is the Amazon seller assistant AI agent that encompasses collaborations among multiple AI agents, depicted in the following flow chart.

The agentic workflow, beginning with an LLM planner and task orchestrator, receives user requests, decomposes complex tasks into specialized subtasks, and intelligently assigns each subtask to the most appropriate underlying agent based on its capabilities and current workload. The underlying agents then operate autonomously, executing their assigned tasks by using their specialized tools, reasoning capabilities, and domain expertise to complete objectives without requiring continuous oversight from the orchestrator. Upon task completion, the specialized agents communicate back to the orchestration agent, reporting task status updates, completion confirmations, intermediate results, or escalation requests when they encounter scenarios beyond their operational boundaries. The orchestration agent aggregates these responses, monitors overall progress, handles dependencies between subtasks, and synthesizes the collective outputs into a coherent final result that addresses the original user request. To evaluate this multi-agent collaboration process, the evaluation workflow accounts for both individual agent performance and the overall collective system dynamics. In addition to evaluating the overall task execution quality and the performance of specialized agents in task completion, reasoning, tool use, and memory retrieval, we also need to measure interagent communication patterns, coordination efficiency, and task handoff accuracy. For this, Amazon teams use metrics such as the planning score (successful subtask assignment to subagents), communication score (interagent communication messages for subtask completion), and collaboration success rate (percentage of successful subtask completion). In multi-agent systems evaluation, HITL becomes critical because of the increased complexity and potential for unexpected emergent behaviors that automated metrics might fail to capture.
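The three multi-agent metrics named above can be scored from an orchestration trace. This sketch is illustrative: the trace schema (`subtasks`, `messages`, `well_formed`, etc.) and the example agents are assumptions, not the actual Amazon seller assistant's telemetry.

```python
# Hypothetical scoring of planning, communication, and collaboration metrics
# over a single orchestration trace.

def multi_agent_metrics(trace):
    subtasks = trace["subtasks"]
    n = len(subtasks)
    # Planning score: subtasks assigned to the expected subagent.
    planning = sum(t["assigned_agent"] == t["expected_agent"] for t in subtasks) / n
    # Communication score: fraction of interagent messages that are well formed.
    messages = trace["messages"]
    communication = sum(m["well_formed"] for m in messages) / len(messages)
    # Collaboration success rate: fraction of subtasks completed.
    collaboration = sum(t["completed"] for t in subtasks) / n
    return {"planning_score": planning,
            "communication_score": communication,
            "collaboration_success_rate": collaboration}

trace = {
    "subtasks": [
        {"assigned_agent": "pricing", "expected_agent": "pricing", "completed": True},
        {"assigned_agent": "inventory", "expected_agent": "listing", "completed": False},
    ],
    "messages": [{"well_formed": True}, {"well_formed": True}, {"well_formed": False}],
}
print(multi_agent_metrics(trace))
# planning 0.5, communication ~0.667, collaboration 0.5
```

In production these would be averaged over many traces, with HITL review reserved for the traces automated scoring flags as anomalous.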
Human intervention in the evaluation workflow provides essential oversight for assessing inter-agent communication to identify coordination failure in specific edge cases, evaluating the appropriateness of agent specialization and whether task decomposition aligns with agent capabilities, and validating potential conflict resolution strategies when agents produce contradictory recommendations. It also helps ensure logical consistency when multiple agents contribute to a single decision, and that the collective agent behavior serves the intended business objective. These are the dimensions that are difficult to quantify through automated metrics alone but are critical for production deployment success.
Lessons learned and best practices
Through extensive engagements with Amazon product and engineering teams deploying agentic AI systems in production environments, we have identified critical lessons learned and established best practices that address the unique challenges of evaluating autonomous agent architectures at scale.

Holistic evaluation across multiple dimensions: Agentic application evaluation must extend beyond traditional accuracy metrics to encompass a comprehensive assessment framework that covers agent quality, performance, responsibility, and cost. Quality evaluation includes measuring reasoning coherence, tool selection accuracy, and task completion success rates across diverse scenarios. Performance assessment captures latency, throughput, and resource utilization under production workloads. Responsibility evaluation addresses safety, toxicity, bias mitigation, hallucination detection, and guardrails to align with organizational policies and regulatory requirements. Cost analysis quantifies both direct expenses including model inference, tool invocation, data processing, and indirect costs such as human efforts and error remediation. This multi-dimensional approach helps ensure holistic optimization across balanced trade-offs.
Use case and application-specific evaluation: Besides the standardized metrics discussed in the previous sections, application-specific evaluation metrics also contribute to the overall application assessment. For instance, customer service applications require metrics such as customer satisfaction scores, first-contact resolution rates, and sentiment analysis scores to measure final business outcomes. This approach requires close collaboration with domain experts to define meaningful success criteria and appropriate metrics, and to create evaluation datasets that reflect real-world operational complexity.
Human-in-the-loop (HITL) as a critical evaluation component: As discussed in the multi-agent system evaluation case, HITL is indispensable, particularly for high-stakes decision scenarios. It provides essential evaluation of agent reasoning chains, the coherence of multi-step workflows, and the alignment of agent behavior with business requirements. HITL also helps provide ground truth labels for building golden testing datasets, and calibration of LLM-as-a-judge in the automatic evaluator to align with human preferences.
Continuous evaluation in production environments: Continuous evaluation is essential for maintaining quality because pre-deployment evaluation might not fully capture real-world performance characteristics. Production evaluation monitors real-world performance across diverse user behaviors, usage patterns, and edge cases not represented before deployment, to identify performance degradation over time. You can track key metrics through operational dashboards, implement alert thresholds, automate the anomaly detection process, and establish feedback loops. When issues are detected, you can trigger model retraining, refine context engineering, and realign with your ultimate business objectives.
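The alert-threshold idea in the last practice can be sketched as a rolling comparison against a pre-deployment baseline. The thresholds, window size, and class shape here are illustrative assumptions, not a prescribed implementation.

```python
# Toy production monitor: alert when a rolling average of an evaluation
# metric drops more than `tolerance` below the pre-deployment baseline.

from collections import deque

class MetricMonitor:
    def __init__(self, baseline, tolerance=0.1, window=5):
        self.baseline = baseline          # expected score from pre-deployment eval
        self.tolerance = tolerance        # allowed relative drop before alerting
        self.scores = deque(maxlen=window)

    def record(self, score):
        self.scores.append(score)
        avg = sum(self.scores) / len(self.scores)
        return avg < self.baseline * (1 - self.tolerance)  # True => raise alert

monitor = MetricMonitor(baseline=0.9, tolerance=0.1, window=3)
print(monitor.record(0.88))  # False: within tolerance
print(monitor.record(0.85))  # False: rolling average 0.865 is above 0.81
print(monitor.record(0.60))  # True: rolling average ~0.777 falls below 0.81
```

A real pipeline would feed this from dashboard metrics and wire the alert into the retraining or context-engineering feedback loop described above.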

Conclusion
As AI systems become increasingly complex, the importance of a thorough AI agent evaluation approach cannot be overstated. Through holistic evaluation across quality, performance, responsibility, and cost dimensions, in addition to continuous production monitoring and human-in-the-loop validation, the full lifecycle of agentic AI deployment from development to production can be addressed. You can learn from the presented examples, best practices, and lessons learned in this post—many of which are available in Amazon Bedrock AgentCore Evaluations—to accelerate your own agentic AI initiatives while avoiding common pitfalls in evaluation design and implementation.

About the authors

Yunfei Bai
Yunfei Bai is a Principal Applied AI Architect at AWS. With a background in AI/ML, data science, and analytics, Yunfei helps customers adopt AWS services to deliver business results. He designs AI/ML and data analytics solutions that overcome complex technical challenges and drive strategic objectives. Yunfei has a PhD in Electronic and Electrical Engineering. Outside of work, Yunfei enjoys reading and music.

Winnie Xiong
Winnie Xiong is a Senior Technical Product Manager on Amazon's Benchmarking team. She partners with engineers and scientists to build AI and data solutions that solve complex business challenges for Amazon teams. Her expertise spans model evaluation, agent evaluation, and data management.

Allie Colin
Allie Colin is Head of Product and Science on Amazon's Benchmarking team. She leads a team of scientists and product managers building tools that help Amazonians test their products for quality through the lens of real customer experiences. Previously, she worked at MicroStrategy as Chief of Staff to the CTO, as well as at Deutsche Bank and Northwestern Mutual. Outside of work, Allie is a mom of four who loves the nightly comedy show they put on and enjoys anything that gets her outdoors—hiking, swimming, and traveling.

Kashif Imran
Kashif Imran is a seasoned engineering and product leader with deep expertise in AI/ML, cloud architecture, and large-scale distributed systems. With a decade of experience at AWS, Kashif has driven innovation across cloud and AI technologies. Currently a Senior Manager at Amazon Prime Video, he leads AI-native engineering teams building scalable agentic AI solutions to drive business transformation.

Anthropic Releases Claude 4.6 Sonnet with 1 Million Token Context to S …

Anthropic is officially entering its ‘Thinking’ era. Today, the company announced Claude 4.6 Sonnet, a model designed to transform how devs and data scientists handle complex logic. Alongside this release comes Improved Web Search with Dynamic Filtering, a feature that uses internal code execution to verify facts in real-time.

https://www.anthropic.com/news/claude-sonnet-4-6

Adaptive Thinking: A New Logic Engine

The core update in Claude 4.6 Sonnet is the Adaptive Thinking engine. Accessed via the extended thinking API, this allows the model to ‘pause’ and reason through a problem before generating a final response.

Instead of jumping straight to code, the model creates internal monologues to test logic paths. You can see this in the new Thought interface. For a dev debugging a complex race condition, this means the model identifies the root cause in its ‘thinking’ stage rather than guessing in the code output.

This improves data cleaning tasks. When processing a messy dataset, 4.6 Sonnet spends more compute time analyzing edge cases and schema inconsistencies. This process significantly reduces the ‘hallucinations’ common in faster, non-reasoning models.

The Benchmarks: Closing the Gap with Opus

The performance data for 4.6 Sonnet shows it is now breathing down the neck of the flagship Opus model. In many categories, it is the most efficient ‘workhorse’ model currently available.

| Benchmark Category | Claude 3.5 Sonnet | Claude 4.6 Sonnet | Key Improvement |
| --- | --- | --- | --- |
| SWE-bench Verified | 49.0% | 79.6% | Optimized for complex bug fixing and multi-file editing. |
| OSWorld (Computer Use) | 14.9% | 72.5% | Massive gain in autonomous UI navigation and tool usage. |
| MATH | 71.1% | 88.0% | Enhanced reasoning for advanced algorithmic logic. |
| BrowseComp (Search) | 33.3% | 46.6% | Improved accuracy via native Python-based dynamic filtering. |

The 72.5% score on OSWorld is a major highlight. It suggests that Claude 4.6 Sonnet can now navigate spreadsheets, web browsers, and local files with near-human accuracy. This makes it a prime candidate for building autonomous ‘Computer Use’ agents.

Search Meets Python: Dynamic Filtering

Anthropic’s Improved Web Search with Dynamic Filtering changes how AI interacts with the live web. Most AI search tools simply scrape the first few results they find.

Claude 4.6 Sonnet takes a different path. It uses a Python code execution sandbox to post-process search results. If you search for a library update from 2025, the model writes and runs code to filter out any results that are older than your specified date. It also filters by Site Authority, prioritizing technical hubs like GitHub, Stack Overflow, and official documentation.

This means fewer outdated code snippets. The model performs a ‘Multi-Step Retrieval.’ It does an initial search, parses the HTML, and applies filters to ensure the ‘Noise-to-Signal’ ratio remains low. This increased search accuracy from 33.3% to 46.6% in internal testing.
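The post-processing step described above can be approximated locally: filter search hits by recency and an authority allowlist before ranking. This is a conceptual sketch only; the field names, allowlist, and cutoff date are assumptions, not Anthropic's actual implementation.

```python
# Hypothetical "dynamic filtering" pass over raw search results.

from datetime import date

# Assumed site-authority allowlist (technical hubs prioritized by the article).
AUTHORITY = ("github.com", "stackoverflow.com", "docs.python.org")

def dynamic_filter(results, min_date):
    # Keep results that are both recent enough and from an authoritative host.
    keep = [r for r in results
            if r["published"] >= min_date
            and any(r["url"].startswith(f"https://{d}") for d in AUTHORITY)]
    # Newest first, so fresh documentation outranks older matches.
    return sorted(keep, key=lambda r: r["published"], reverse=True)

results = [
    {"url": "https://github.com/org/lib/releases", "published": date(2025, 6, 1)},
    {"url": "https://randomblog.example/post", "published": date(2025, 7, 1)},
    {"url": "https://stackoverflow.com/q/1", "published": date(2023, 1, 1)},
]
print(dynamic_filter(results, date(2025, 1, 1)))
# Only the GitHub result survives: it is both recent and authoritative.
```

The model's version, per the article, generates and runs this kind of filter code inside its sandbox rather than relying on a fixed allowlist.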

Scaling and Pricing for Production

Anthropic is positioning 4.6 Sonnet as the primary model for production-grade applications. It now features a 1M token context window in beta. This allows developers to feed an entire repository or a massive technical library into the prompt without losing coherence.

Pricing and Availability:

Input Cost: $3 per 1M tokens.

Output Cost: $15 per 1M tokens.

Platforms: Available on the Anthropic API, Amazon Bedrock, and Google Cloud’s Vertex AI.

The model also shows improved adherence to System Prompts. This is critical for devs building agents that require strict JSON formatting or specific ‘persona’ constraints.

https://www.anthropic.com/news/claude-sonnet-4-6

Key Takeaways

Adaptive Thinking Engine: Replacing the old binary ‘extended thinking’ mode, Claude 4.6 Sonnet introduces Adaptive Thinking. Using the new effort parameter, the model can dynamically decide how much reasoning is required for a task, optimizing the balance between speed, cost, and intelligence.

Frontier Agentic Performance: The model sets new industry benchmarks for autonomous agents, scoring 79.6% on SWE-bench Verified for coding and 72.5% on OSWorld for computer use. These scores indicate it can now navigate complex software and UI environments with near-human accuracy.

1 Million Token Context Window: Now available in beta, the context window has expanded to 1M tokens. This allows AI devs to ingest entire multi-repo codebases or massive technical archives in a single prompt without the model losing focus or ‘forgetting’ instructions.

Search via Native Code Execution: The new Improved Web Search with Dynamic Filtering allows Claude to write and run Python code to post-process search results. This ensures the model can programmatically filter for the most recent and authoritative sources (like GitHub or official docs) before generating a response.

Production-Ready Efficiency: Claude 4.6 Sonnet maintains a competitive price of $3 per 1M input tokens and $15 per 1M output tokens. Combined with the new Context Compaction API, developers can now build long-running agents that maintain ‘infinite’ conversation history more cost-effectively.

Check out the Technical details here. Also, feel free to follow us on Twitter and don't forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter. Wait! Are you on Telegram? Now you can join us on Telegram as well.
The post Anthropic Releases Claude 4.6 Sonnet with 1 Million Token Context to Solve Complex Coding and Search for Developers appeared first on MarkTechPost.

How to Build an Advanced, Interactive Exploratory Data Analysis Workfl …

In this tutorial, we demonstrate how to move beyond static, code-heavy charts and build a genuinely interactive exploratory data analysis workflow directly using PyGWalker. We start by preparing the Titanic dataset for large-scale interactive querying. These analysis-ready engineered features reveal the underlying structure of the data while enabling both detailed row-level exploration and high-level aggregated views for deeper insight. Embedding a Tableau-style drag-and-drop interface directly in the notebook enables rapid hypothesis testing, intuitive cohort comparisons, and efficient data-quality inspection, all without the friction of switching between code and visualization tools.

import sys, subprocess, json, math, os
from pathlib import Path

def pip_install(pkgs):
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q"] + pkgs)

pip_install([
    "pygwalker>=0.4.9",
    "duckdb>=0.10.0",
    "pandas>=2.0.0",
    "numpy>=1.24.0",
    "seaborn>=0.13.0"
])

import numpy as np
import pandas as pd
import seaborn as sns

df_raw = sns.load_dataset("titanic").copy()
print("Raw shape:", df_raw.shape)
display(df_raw.head(3))

We set up a clean and reproducible Colab environment by installing all required dependencies for interactive EDA. We load the Titanic dataset and perform an initial sanity check to understand its raw structure and scale. It establishes a stable foundation before any transformation or visualization begins.

def make_safe_bucket(series, bins=None, labels=None, q=None, prefix="bucket"):
    s = pd.to_numeric(series, errors="coerce")
    if q is not None:
        try:
            cuts = pd.qcut(s, q=q, duplicates="drop")
            return cuts.astype("string").fillna("Unknown")
        except Exception:
            pass
    if bins is not None:
        cuts = pd.cut(s, bins=bins, labels=labels, include_lowest=True)
        return cuts.astype("string").fillna("Unknown")
    return s.astype("float64")

def preprocess_titanic_advanced(df):
    out = df.copy()
    out.columns = [c.strip().lower().replace(" ", "_") for c in out.columns]

    for c in ["survived", "pclass", "sibsp", "parch"]:
        if c in out.columns:
            out[c] = pd.to_numeric(out[c], errors="coerce").fillna(-1).astype("int64")

    if "age" in out.columns:
        out["age"] = pd.to_numeric(out["age"], errors="coerce").astype("float64")
        out["age_is_missing"] = out["age"].isna()
        out["age_bucket"] = make_safe_bucket(
            out["age"],
            bins=[0, 12, 18, 30, 45, 60, 120],
            labels=["child", "teen", "young_adult", "adult", "mid_age", "senior"],
        )

    if "fare" in out.columns:
        out["fare"] = pd.to_numeric(out["fare"], errors="coerce").astype("float64")
        out["fare_is_missing"] = out["fare"].isna()
        out["log_fare"] = np.log1p(out["fare"].fillna(0))
        out["fare_bucket"] = make_safe_bucket(out["fare"], q=8)

    for c in ["sex", "class", "who", "embarked", "alone", "adult_male"]:
        if c in out.columns:
            out[c] = out[c].astype("string").fillna("Unknown")

    if "cabin" in out.columns:
        out["deck"] = out["cabin"].astype("string").str.strip().str[0].fillna("Unknown")
        out["deck_is_missing"] = out["cabin"].isna()
    else:
        out["deck"] = "Unknown"
        out["deck_is_missing"] = True

    if "ticket" in out.columns:
        t = out["ticket"].astype("string")
        out["ticket_len"] = t.str.len().fillna(0).astype("int64")
        out["ticket_has_alpha"] = t.str.contains(r"[A-Za-z]", regex=True, na=False)
        out["ticket_prefix"] = t.str.extract(r"^([A-Za-z./\s]+)", expand=False).fillna("None").str.strip()
        out["ticket_prefix"] = out["ticket_prefix"].replace("", "None").astype("string")

    if "sibsp" in out.columns and "parch" in out.columns:
        out["family_size"] = (out["sibsp"] + out["parch"] + 1).astype("int64")
        out["is_alone"] = (out["family_size"] == 1)

    if "name" in out.columns:
        title = out["name"].astype("string").str.extract(r",\s*([^.]+)\.", expand=False).fillna("Unknown").str.strip()
        vc = title.value_counts(dropna=False)
        keep = set(vc[vc >= 15].index.tolist())
        out["title"] = title.where(title.isin(keep), other="Rare").astype("string")
    else:
        out["title"] = "Unknown"

    out["segment"] = (
        out["sex"].fillna("Unknown").astype("string")
        + " | "
        + out["class"].fillna("Unknown").astype("string")
        + " | "
        + out["age_bucket"].fillna("Unknown").astype("string")
    )

    for c in out.columns:
        if out[c].dtype == bool:
            out[c] = out[c].astype("int64")
        if out[c].dtype == "object":
            out[c] = out[c].astype("string")

    return out

df = preprocess_titanic_advanced(df_raw)
print("Prepped shape:", df.shape)
display(df.head(3))

We focus on advanced preprocessing and feature engineering to convert the raw data into an analysis-ready form. We create robust, DuckDB-safe features such as buckets, segments, and engineered categorical signals that enhance downstream exploration. We ensure the dataset is stable, expressive, and suitable for interactive querying.

def data_quality_report(df):
    rows = []
    n = len(df)
    for c in df.columns:
        s = df[c]
        miss = int(s.isna().sum())
        miss_pct = (miss / n * 100.0) if n else 0.0
        nunique = int(s.nunique(dropna=True))
        dtype = str(s.dtype)
        sample = s.dropna().head(3).tolist()
        rows.append({
            "col": c,
            "dtype": dtype,
            "missing": miss,
            "missing_%": round(miss_pct, 2),
            "nunique": nunique,
            "sample_values": sample
        })
    return pd.DataFrame(rows).sort_values(["missing", "nunique"], ascending=[False, False])

dq = data_quality_report(df)
display(dq.head(20))

RANDOM_SEED = 42
MAX_ROWS_FOR_UI = 200_000

df_for_ui = df
if len(df_for_ui) > MAX_ROWS_FOR_UI:
    df_for_ui = df_for_ui.sample(MAX_ROWS_FOR_UI, random_state=RANDOM_SEED).reset_index(drop=True)

agg = (
    df.groupby(["segment", "deck", "embarked"], dropna=False)
    .agg(
        n=("survived", "size"),
        survival_rate=("survived", "mean"),
        avg_fare=("fare", "mean"),
        avg_age=("age", "mean"),
    )
    .reset_index()
)

for c in ["survival_rate", "avg_fare", "avg_age"]:
    agg[c] = agg[c].astype("float64")

Path("/content").mkdir(parents=True, exist_ok=True)
df_for_ui.to_csv("/content/titanic_prepped_for_ui.csv", index=False)
agg.to_csv("/content/titanic_agg_segment_deck_embarked.csv", index=False)

We evaluate data quality and generate a structured overview of missingness, cardinality, and data types. We prepare both a row-level dataset and an aggregated cohort-level table to support fast comparative analysis. The dual representation allows us to explore detailed patterns and high-level trends simultaneously.

import pygwalker as pyg

SPEC_PATH = Path("/content/pygwalker_spec_titanic.json")

def load_spec(path):
    if path.exists():
        try:
            return json.loads(path.read_text())
        except Exception:
            return None
    return None

def save_spec(path, spec_obj):
    try:
        if isinstance(spec_obj, str):
            spec_obj = json.loads(spec_obj)
        path.write_text(json.dumps(spec_obj, indent=2))
        return True
    except Exception:
        return False

def launch_pygwalker(df, spec_path):
    spec = load_spec(spec_path)
    kwargs = {}
    if spec is not None:
        kwargs["spec"] = spec

    try:
        walker = pyg.walk(df, use_kernel_calc=True, **kwargs)
    except TypeError:
        walker = pyg.walk(df, **kwargs) if spec is not None else pyg.walk(df)

    captured = None
    for attr in ["spec", "_spec"]:
        if hasattr(walker, attr):
            try:
                captured = getattr(walker, attr)
                break
            except Exception:
                pass
    for meth in ["to_spec", "export_spec", "get_spec"]:
        if captured is None and hasattr(walker, meth):
            try:
                captured = getattr(walker, meth)()
                break
            except Exception:
                pass

    if captured is not None:
        save_spec(spec_path, captured)

    return walker

walker_rows = launch_pygwalker(df_for_ui, SPEC_PATH)
walker_agg = pyg.walk(agg)

We integrate PyGWalker to transform our prepared tables into a fully interactive, drag-and-drop analytical interface. We persist the visualization specification so that dashboard layouts and encodings survive notebook reruns. It turns the notebook into a reusable, BI-style exploration environment.

HTML_PATH = Path("/content/pygwalker_titanic_dashboard.html")

def export_html_best_effort(df, spec_path, out_path):
    spec = load_spec(spec_path)
    html = None

    try:
        html = pyg.walk(df, spec=spec, return_html=True) if spec is not None else pyg.walk(df, return_html=True)
    except Exception:
        html = None

    if html is None:
        for fn in ["to_html", "export_html"]:
            if hasattr(pyg, fn):
                try:
                    f = getattr(pyg, fn)
                    html = f(df, spec=spec) if spec is not None else f(df)
                    break
                except Exception:
                    continue

    if html is None:
        return None

    if not isinstance(html, str):
        html = str(html)

    out_path.write_text(html, encoding="utf-8")
    return out_path

export_html_best_effort(df_for_ui, SPEC_PATH, HTML_PATH)

We extend the workflow by exporting the interactive dashboard as a standalone HTML artifact. We ensure the analysis can be shared or reviewed without requiring a Python environment or Colab session. It completes the pipeline from raw data to distributable, interactive insight.

Interactive EDA Dashboard

In conclusion, we established a robust pattern for advanced EDA that scales far beyond the Titanic dataset while remaining fully notebook-native. We showed how careful preprocessing, type safety, and feature design allow PyGWalker to operate reliably on complex data, and how combining detailed records with aggregated summaries unlocks powerful analytical workflows. Instead of treating visualization as an afterthought, we used it as a first-class interactive layer, allowing us to iterate, validate assumptions, and extract insights in real time.

Check out the Full Codes here.
The post How to Build an Advanced, Interactive Exploratory Data Analysis Workflow Using PyGWalker and Feature-Engineered Data appeared first on MarkTechPost.

Cloudflare Releases Agents SDK v0.5.0 with Rewritten @cloudflare/ai-ch …

Cloudflare has released the Agents SDK v0.5.0 to address the limitations of stateless serverless functions in AI development. In standard serverless architectures, every LLM call requires rebuilding the session context from scratch, which increases latency and token consumption. The Agents SDK’s latest version (Agents SDK v0.5.0) provides a vertically integrated execution layer where compute, state, and inference coexist at the network edge.

The SDK allows developers to build agents that maintain state over long durations, moving beyond simple request-response cycles. This is achieved through 2 primary technologies: Durable Objects, which provide persistent state and identity, and Infire, a custom-built Rust inference engine designed to optimize edge resources. For devs, this architecture removes the need to manage external database connections or WebSocket servers for state synchronization.

State Management via Durable Objects

The Agents SDK relies on Durable Objects (DO) to provide persistent identity and memory for every agent instance. In traditional serverless models, functions have no memory of previous events unless they query an external database like RDS or DynamoDB, which often adds 50ms to 200ms of latency.

A Durable Object is a stateful micro-server running on Cloudflare’s network with its own private storage. When an agent is instantiated using the Agents SDK, it is assigned a stable ID. All subsequent requests for that user are routed to the same physical instance, allowing the agent to keep its state in memory. Each agent includes an embedded SQLite database with a 1GB storage limit per instance, enabling zero-latency reads and writes for conversation history and task logs.
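The per-agent embedded SQLite idea can be illustrated in plain Python: each agent instance owns its own database, and conversation history is read and written locally with no external database round-trip. This sketches the concept only, not Cloudflare's actual Durable Object API (and it uses an in-memory database per instance for illustration).

```python
# Conceptual analogue of a Durable Object's private SQLite store.

import sqlite3

class AgentStore:
    def __init__(self, agent_id):
        # In Durable Objects the database is private to the instance and
        # persistent; here each instance simply gets its own in-memory DB.
        self.agent_id = agent_id
        self.db = sqlite3.connect(":memory:")
        self.db.execute("CREATE TABLE history (turn INTEGER, role TEXT, content TEXT)")

    def append(self, turn, role, content):
        # Local write: no network hop to an external database.
        self.db.execute("INSERT INTO history VALUES (?, ?, ?)", (turn, role, content))

    def context(self):
        # Local read: rebuild the conversation context for the next LLM call.
        return self.db.execute(
            "SELECT role, content FROM history ORDER BY turn").fetchall()

store = AgentStore("user-42")
store.append(1, "user", "Where is my order?")
store.append(2, "assistant", "Checking now.")
print(store.context())
# [('user', 'Where is my order?'), ('assistant', 'Checking now.')]
```

The point of the architecture is that this read/write path stays inside the agent instance, which is what eliminates the 50ms to 200ms external-database latency the article describes.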

Durable Objects are single-threaded, which simplifies concurrency management. This design ensures that only 1 event is processed at a time for a specific agent instance, eliminating race conditions. If an agent receives multiple inputs simultaneously, they are queued and processed atomically, ensuring the state remains consistent during complex operations.

Infire: Optimizing Inference with Rust

For the inference layer, Cloudflare developed Infire, an LLM engine written in Rust that replaces Python-based stacks like vLLM. Python engines often face performance bottlenecks due to the Global Interpreter Lock (GIL) and garbage collection pauses. Infire is designed to maximize GPU utilization on H100 hardware by reducing CPU overhead.

The engine utilizes Granular CUDA Graphs and Just-In-Time (JIT) compilation. Instead of launching GPU kernels sequentially, Infire compiles a dedicated CUDA graph for every possible batch size on the fly. This allows the driver to execute work as a single monolithic structure, cutting CPU overhead by 82%. Benchmarks show that Infire is 7% faster than vLLM 0.10.0 on unloaded machines, utilizing only 25% CPU compared to vLLM’s >140%.

| Metric | vLLM 0.10.0 (Python) | Infire (Rust) | Improvement |
| --- | --- | --- | --- |
| Throughput Speed | Baseline | 7% Faster | +7% |
| CPU Overhead | >140% CPU usage | 25% CPU usage | -82% |
| Startup Latency | High (Cold Start) | <4 seconds (Llama 3 8B) | Significant |

Infire also uses Paged KV Caching, which breaks memory into non-contiguous blocks to prevent fragmentation. This enables ‘continuous batching,’ where the engine processes new prompts while simultaneously finishing previous generations without a performance drop. This architecture allows Cloudflare to maintain a 99.99% warm request rate for inference.
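The paged KV cache idea above can be shown with a toy block allocator: the cache is split into fixed-size blocks handed out from a free list, so a sequence grows without needing contiguous memory and freed blocks are reused immediately. Block and pool sizes are arbitrary here; this mirrors the general technique, not Infire's implementation.

```python
# Toy paged KV-cache allocator: fixed-size blocks, per-sequence block tables.

class PagedKVCache:
    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks))   # free list of physical block ids
        self.tables = {}                      # seq_id -> list of allocated blocks

    def reserve(self, seq_id, num_tokens):
        """Ensure the sequence has enough blocks to hold num_tokens of KV state."""
        blocks_needed = -(-num_tokens // self.block_size)  # ceiling division
        table = self.tables.setdefault(seq_id, [])
        while len(table) < blocks_needed:
            table.append(self.free.pop())     # non-contiguous: any free block works
        return table

    def release(self, seq_id):
        """Return a finished sequence's blocks to the pool for immediate reuse."""
        self.free.extend(self.tables.pop(seq_id, []))

cache = PagedKVCache(num_blocks=8, block_size=16)
blocks = cache.reserve("prompt-a", 40)   # 40 tokens -> 3 blocks of 16
print(len(blocks))                        # 3
cache.release("prompt-a")
print(len(cache.free))                    # 8: all blocks back in the pool
```

Because freed blocks rejoin the pool instantly, new prompts can be admitted while other generations are still running, which is the mechanism behind continuous batching.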

Code Mode and Token Efficiency

Standard AI agents typically use ‘tool calling,’ where the LLM outputs a JSON object to trigger a function. This process requires a back-and-forth between the LLM and the execution environment for every tool used. Cloudflare’s ‘Code Mode’ changes this by asking the LLM to write a TypeScript program that orchestrates multiple tools at once.

This code executes in a secure V8 isolate sandbox. For complex tasks, such as searching 10 different files, Code Mode provides an 87.5% reduction in token usage. Because intermediate results stay within the sandbox and are not sent back to the LLM for every step, the process is both faster and more cost-effective.

Code Mode also improves security through ‘secure bindings.’ The sandbox has no internet access; it can only interact with Model Context Protocol (MCP) servers through specific bindings in the environment object. These bindings hide sensitive API keys from the LLM, preventing the model from accidentally leaking credentials in its generated code.

February 2026: The v0.5.0 Release

The Agents SDK reached version 0.5.0. This release introduced several utilities for production-ready agents:

this.retry(): A new method for retrying asynchronous operations with exponential backoff and jitter.

Protocol Suppression: Developers can now suppress JSON text frames on a per-connection basis using the shouldSendProtocolMessages hook. This is useful for IoT or MQTT clients that cannot process JSON data.

Stable AI Chat: The @cloudflare/ai-chat package reached version 0.1.0, adding message persistence to SQLite and a “Row Size Guard” that performs automatic compaction when messages approach the 2MB SQLite limit.

| Feature | Description |
| --- | --- |
| this.retry() | Automatic retries for external API calls. |
| Data Parts | Attaching typed JSON blobs to chat messages. |
| Tool Approval | Persistent approval state that survives hibernation. |
| Synchronous Getters | getQueue() and getSchedule() no longer require Promises. |
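The retry-with-exponential-backoff-and-jitter pattern behind this.retry() is language-agnostic; here is a minimal Python sketch of the pattern, not the SDK's actual implementation. The parameter names (`attempts`, `base`, `cap`) and the injectable `sleep` hook are assumptions for illustration.

```python
# Sketch of retry with exponential backoff and full jitter.

import random

def retry(fn, attempts=4, base=0.1, cap=5.0, sleep=None):
    sleep = sleep or (lambda s: None)  # injectable for testing; pass time.sleep in practice
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise                   # out of attempts: surface the failure
            # Full jitter: wait a random amount up to the exponential bound,
            # so simultaneous retries from many agents do not synchronize.
            sleep(random.uniform(0, min(cap, base * 2 ** i)))

calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

print(retry(flaky))  # "ok" after two retried transient failures
```

Capping the backoff (the `cap` argument) keeps worst-case waits bounded even as the exponential term grows.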

Key Takeaways

Stateful Persistence at the Edge: Unlike traditional stateless serverless functions, the Agents SDK uses Durable Objects to provide agents with a permanent identity and memory. This allows each agent to maintain its own state in an embedded SQLite database with 1GB of storage, enabling zero-latency data access without external database calls.

High-Efficiency Rust Inference: Cloudflare’s Infire inference engine, written in Rust, optimizes GPU utilization by using Granular CUDA Graphs to reduce CPU overhead by 82%. Benchmarks show it is 7% faster than Python-based vLLM 0.10.0 and uses Paged KV Caching to maintain a 99.99% warm request rate, significantly reducing cold start latencies.

Token Optimization via Code Mode: ‘Code Mode’ allows agents to write and execute TypeScript programs in a secure V8 isolate rather than making multiple individual tool calls. This deterministic approach reduces token consumption by 87.5% for complex tasks and keeps intermediate data within the sandbox to improve both speed and security.

Universal Tool Integration: The platform fully supports the Model Context Protocol (MCP), a standard that acts as a universal translator for AI tools. Cloudflare has deployed 13 official MCP servers that allow agents to securely manage infrastructure components like DNS, R2 storage, and Workers KV through natural language commands.

Production-Ready Utilities (v0.5.0): The February 2026 release introduced critical reliability features, including a this.retry() utility for asynchronous operations with exponential backoff and jitter. It also added protocol suppression, which allows agents to communicate with binary-only IoT devices and lightweight embedded systems that cannot process standard JSON text frames.

Check out the Technical details.
The post Cloudflare Releases Agents SDK v0.5.0 with Rewritten @cloudflare/ai-chat and New Rust-Powered Infire Engine for Optimized Edge Inference Performance appeared first on MarkTechPost.

How to Build Human-in-the-Loop Plan-and-Execute AI Agents with Explicit User Approval Using LangGraph and Streamlit

In this tutorial, we build a human-in-the-loop travel booking agent that treats the user as a teammate rather than a passive observer. We design the system so the agent first reasons openly by drafting a structured travel plan, then deliberately pauses before taking any action. We expose this proposed plan in a live interface where we can inspect, edit, or reject it, and only after explicit approval do we allow the agent to execute tools. By combining LangGraph interrupts with a Streamlit frontend, we create a workflow that makes agent reasoning visible, controllable, and trustworthy instead of opaque and autonomous.

!pip -q install -U langgraph openai streamlit pydantic
!npm -q install -g localtunnel

import os, getpass, textwrap, json, uuid, time
if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass.getpass("OPENAI_API_KEY (hidden input): ")
os.environ.setdefault("OPENAI_MODEL", "gpt-4.1-mini")

We set up the execution environment by installing all required libraries and utilities needed for agent orchestration and UI exposure. We securely collect the OpenAI API key at runtime so it is never hardcoded or leaked in the notebook. We also configure the model selection upfront to keep the rest of the pipeline clean and reproducible.

app_code = r'''
import os, json, uuid
import streamlit as st
from typing import TypedDict, List, Dict, Any, Optional
from pydantic import BaseModel, Field
from openai import OpenAI

from langgraph.graph import StateGraph, START, END
from langgraph.types import Command, interrupt
from langgraph.checkpoint.memory import InMemorySaver

def tool_search_flights(origin: str, destination: str, depart_date: str, return_date: str, budget_usd: int) -> Dict[str, Any]:
    options = [
        {"airline": "SkyJet", "route": f"{origin}->{destination}", "depart": depart_date, "return": return_date, "price_usd": int(budget_usd*0.55)},
        {"airline": "AeroBlue", "route": f"{origin}->{destination}", "depart": depart_date, "return": return_date, "price_usd": int(budget_usd*0.70)},
        {"airline": "Nimbus Air", "route": f"{origin}->{destination}", "depart": depart_date, "return": return_date, "price_usd": int(budget_usd*0.62)},
    ]
    options = sorted(options, key=lambda x: x["price_usd"])
    return {"tool": "search_flights", "top_options": options[:2]}

def tool_search_hotels(city: str, nights: int, budget_usd: int, preferences: List[str]) -> Dict[str, Any]:
    base = max(60, int(budget_usd / max(nights, 1)))
    picks = [
        {"name": "Central Boutique", "city": city, "nightly_usd": int(base*0.95), "notes": ["walkable", "great reviews"]},
        {"name": "Riverside Stay", "city": city, "nightly_usd": int(base*0.80), "notes": ["quiet", "good value"]},
        {"name": "Modern Loft Hotel", "city": city, "nightly_usd": int(base*1.10), "notes": ["new", "gym"]},
    ]
    if "luxury" in [p.lower() for p in preferences]:
        picks = sorted(picks, key=lambda x: -x["nightly_usd"])
    else:
        picks = sorted(picks, key=lambda x: x["nightly_usd"])
    return {"tool": "search_hotels", "top_options": picks[:2]}

def tool_build_day_by_day(city: str, days: int, vibe: str) -> Dict[str, Any]:
    blocks = []
    for d in range(1, days+1):
        blocks.append({
            "day": d,
            "morning": f"{city}: coffee + a must-see landmark",
            "afternoon": f"{city}: {vibe} activity + local lunch",
            "evening": f"{city}: sunset spot + dinner + optional night walk"
        })
    return {"tool": "draft_itinerary", "days": blocks}
'''

We define the Streamlit application core and implement safe, deterministic tool functions that simulate flights, hotels, and itinerary generation. We design these tools to behave like real-world APIs while still running fully in a Colab environment. We ensure all tool outputs are structured so they can be audited before execution.

app_code += r'''
class TravelPlan(BaseModel):
    trip_title: str = Field(..., description="Short human-friendly title")
    origin: str
    destination: str
    depart_date: str
    return_date: str
    travelers: int = 1
    budget_usd: int = 1500
    preferences: List[str] = Field(default_factory=list)
    vibe: str = "balanced"
    lodging_nights: int = 4
    daily_outline: List[Dict[str, Any]] = Field(default_factory=list)
    tool_calls: List[Dict[str, Any]] = Field(default_factory=list)

class State(TypedDict):
    user_request: str
    plan: Dict[str, Any]
    approval: Dict[str, Any]
    execution: Dict[str, Any]

def make_llm_plan(state: State) -> Dict[str, Any]:
    client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
    model = os.environ.get("OPENAI_MODEL", "gpt-4.1-mini")

    sys = (
        "You are a travel planning agent. "
        "Return a JSON travel plan that matches the provided schema. "
        "Be realistic, concise, and include a tool_calls list describing what you want executed "
        "(e.g., search_flights, search_hotels, draft_itinerary)."
    )

    schema = TravelPlan.model_json_schema()

    resp = client.responses.create(
        model=model,
        input=[
            {"role": "system", "content": sys},
            {"role": "user", "content": state["user_request"]},
            {"role": "user", "content": f"Schema (JSON): {json.dumps(schema)}"}
        ],
    )

    text = resp.output_text.strip()
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("Model did not return JSON. Try again or change model.")
    raw = text[start:end+1]
    plan_obj = json.loads(raw)

    plan = TravelPlan(**plan_obj).model_dump()

    if not plan.get("tool_calls"):
        plan["tool_calls"] = [
            {"name": "search_flights", "args": {"origin": plan["origin"], "destination": plan["destination"], "depart_date": plan["depart_date"], "return_date": plan["return_date"], "budget_usd": plan["budget_usd"]}},
            {"name": "search_hotels", "args": {"city": plan["destination"], "nights": plan["lodging_nights"], "budget_usd": int(plan["budget_usd"]*0.35), "preferences": plan["preferences"]}},
            {"name": "draft_itinerary", "args": {"city": plan["destination"], "days": max(2, plan["lodging_nights"]+1), "vibe": plan["vibe"]}},
        ]

    return {"plan": plan}

def wait_for_approval(state: State) -> Dict[str, Any]:
    payload = {
        "kind": "approval",
        "message": "Review/edit the plan. Approve to execute tools.",
        "plan": state["plan"],
    }
    decision = interrupt(payload)
    return {"approval": decision}

def execute_tools(state: State) -> Dict[str, Any]:
    approval = state.get("approval") or {}
    if not approval.get("approved"):
        return {"execution": {"status": "not_executed", "reason": "User rejected or did not approve."}}

    plan = approval.get("edited_plan") or state["plan"]
    tool_calls = plan.get("tool_calls", [])

    results = []
    for call in tool_calls:
        name = call.get("name")
        args = call.get("args", {})
        if name == "search_flights":
            results.append(tool_search_flights(**args))
        elif name == "search_hotels":
            results.append(tool_search_hotels(**args))
        elif name == "draft_itinerary":
            results.append(tool_build_day_by_day(**args))
        else:
            results.append({"tool": name, "error": "Unknown tool (blocked for safety).", "args": args})

    return {"execution": {"status": "executed", "tool_results": results, "final_plan": plan}}
'''

We formalize the agent’s reasoning using a strict schema that requires the model to output an explicit travel plan rather than free-form text. We generate the plan using the OpenAI model and validate it before allowing it into the workflow. We also auto-inject tool calls if the model omits them to guarantee a complete execution path.

app_code += r'''
def build_graph():
    builder = StateGraph(State)
    builder.add_node("plan", make_llm_plan)
    builder.add_node("approve", wait_for_approval)
    builder.add_node("execute", execute_tools)

    builder.add_edge(START, "plan")
    builder.add_edge("plan", "approve")
    builder.add_edge("approve", "execute")
    builder.add_edge("execute", END)

    memory = InMemorySaver()
    graph = builder.compile(checkpointer=memory)
    return graph

st.set_page_config(page_title="Plan -> Approve -> Execute Travel Agent", layout="wide")
st.title("Human-in-the-Loop Travel Booking Agent (Plan -> Approve/Edit -> Execute)")

with st.sidebar:
    st.header("Runtime")
    if st.button("New Session / Thread"):
        st.session_state.thread_id = str(uuid.uuid4())
        st.session_state.ran_once = False
        st.session_state.interrupt_payload = None
        st.session_state.last_execution = None

thread_id = st.session_state.get("thread_id") or str(uuid.uuid4())
st.session_state.thread_id = thread_id

graph = build_graph()
config = {"configurable": {"thread_id": thread_id}}

st.caption(f"Thread ID: {thread_id}")

req = st.text_area(
    "Describe your trip request",
    value=st.session_state.get("user_request", "Plan a 5-day trip from Dubai to Istanbul in April. Budget $1800. Prefer museums, street food, and a relaxed pace."),
    height=120
)
st.session_state.user_request = req

colA, colB = st.columns([1, 1])
run_plan = colA.button("1) Generate Plan (LLM)")
resume_btn = colB.button("2) Resume After Approval")

if run_plan:
    st.session_state.ran_once = True
    st.session_state.interrupt_payload = None
    st.session_state.last_execution = None

    initial = {"user_request": req, "plan": {}, "approval": {}, "execution": {}}
    out = graph.invoke(initial, config=config)

    if "__interrupt__" in out and out["__interrupt__"]:
        st.session_state.interrupt_payload = out["__interrupt__"][0].value
    else:
        st.session_state.last_execution = out.get("execution")

payload = st.session_state.get("interrupt_payload")

if payload:
    st.subheader("Plan proposed by agent (editable)")
    plan = payload.get("plan", {})
    left, right = st.columns([1, 1])

    with left:
        st.write("**Edit JSON (advanced):**")
        edited_text = st.text_area("Plan JSON", value=json.dumps(plan, indent=2), height=420)

    with right:
        st.write("**Quick actions:**")
        approved = st.radio("Decision", options=["Approve", "Reject"], index=0)
        st.write("Tip: If you edit JSON, keep it valid. You can also reject and re-run planning.")

    try:
        edited_plan = json.loads(edited_text)
        json_ok = True
    except Exception as e:
        json_ok = False
        st.error(f"Invalid JSON: {e}")

    if resume_btn:
        if not json_ok:
            st.stop()

        decision = {
            "approved": (approved == "Approve"),
            "edited_plan": edited_plan
        }
        out2 = graph.invoke(Command(resume=decision), config=config)
        st.session_state.interrupt_payload = None
        st.session_state.last_execution = out2.get("execution")

exec_result = st.session_state.get("last_execution")
if exec_result:
    st.subheader("Execution result")
    st.json(exec_result)
    if exec_result.get("status") == "executed":
        st.success("Tools executed only AFTER approval")
    else:
        st.warning("Not executed (rejected or not approved).")
'''

We construct the LangGraph workflow by separating planning, approval, and execution into distinct nodes. We deliberately interrupt the graph after planning so we can review and control the agent’s intent. We only allow tool execution to proceed when explicit human approval is provided.

import pathlib
pathlib.Path("app.py").write_text(app_code)

!streamlit run app.py --server.port 8501 --server.address 0.0.0.0 & sleep 2
!lt --port 8501

We connect the agent workflow to a live Streamlit interface that supports editing, approval, and rejection of plans. We persist the state across runs using a thread identifier so the agent behaves consistently across interactions. We finally launch the app and make it publicly available, enabling real human-in-the-loop collaboration.

In conclusion, we demonstrated how plan-and-execute agents become significantly more reliable when humans remain in the loop at the right moment. We showed that interrupts are not just a technical feature but a design primitive for building trust, accountability, and collaboration into agent systems. By separating planning from execution and inserting a clear approval boundary, we ensured that tools run only with human consent and context. This pattern scales beyond travel planning to any high-stakes automation, giving us agents that think with us rather than act for us.

Check out the Full Codes here.
The post How to Build Human-in-the-Loop Plan-and-Execute AI Agents with Explicit User Approval Using LangGraph and Streamlit appeared first on MarkTechPost.