A Coding Guide to Build an Autonomous Agentic AI for Time Series Forecasting with Darts and Hugging Face

In this tutorial, we build an advanced agentic AI system that autonomously handles time series forecasting using the Darts library combined with a lightweight HuggingFace model for reasoning. We design the agent to operate in a perception–reasoning–action cycle, where it first analyzes patterns in the data, then selects an appropriate forecasting model, generates predictions, and finally explains and visualizes the results. By walking through this pipeline, we experience how agentic AI can bring together statistical modeling and natural language reasoning to make forecasting both accurate and interpretable. Check out the FULL CODES here.

!pip install darts transformers pandas matplotlib numpy -q

import pandas as pd
import numpy as np
from darts import TimeSeries
from darts.models import ExponentialSmoothing, NaiveSeasonal, LinearRegressionModel
from darts.metrics import mape, rmse
from transformers import pipeline
import matplotlib.pyplot as plt
from datetime import datetime, timedelta

We begin by installing and importing the essential libraries, including Darts for time series forecasting, Transformers for reasoning, and supporting packages like pandas, NumPy, and matplotlib. With these tools in place, we set up the foundation to build and run our autonomous forecasting agent. Check out the FULL CODES here.

class TimeSeriesAgent:
    """Autonomous agent for time series analysis and forecasting"""

    def __init__(self):
        print(" Initializing Agent Brain...")
        self.llm = pipeline("text-generation", model="distilgpt2", max_length=150,
                            do_sample=True, temperature=0.7)

        self.models = {
            'exponential_smoothing': ExponentialSmoothing(),
            'naive_seasonal': NaiveSeasonal(K=12),
            'linear_regression': LinearRegressionModel(lags=12)
        }
        self.selected_model = None
        self.forecast = None

    def perceive(self, data):
        """Agent perceives and analyzes the time series data"""
        print("\n PERCEPTION PHASE")
        self.ts = TimeSeries.from_dataframe(data, 'date', 'value', freq='M')

        trend = "increasing" if data['value'].iloc[-1] > data['value'].iloc[0] else "decreasing"
        volatility = data['value'].std() / data['value'].mean()
        seasonality = self._detect_seasonality(data['value'])

        analysis = {
            'length': len(data),
            'trend': trend,
            'volatility': f"{volatility:.2f}",
            'has_seasonality': seasonality,
            'mean': f"{data['value'].mean():.2f}",
            'range': f"{data['value'].min():.2f} to {data['value'].max():.2f}"
        }

        print(f" Data Points: {analysis['length']}")
        print(f" Trend: {analysis['trend'].upper()}")
        print(f" Volatility: {analysis['volatility']}")
        print(f" Seasonality: {'Detected' if seasonality else 'Not detected'}")

        return analysis

    def _detect_seasonality(self, series, threshold=0.3):
        """Simple seasonality detection"""
        if len(series) < 24:
            return False
        acf = np.correlate(series - series.mean(), series - series.mean(), mode='full')
        acf = acf[len(acf)//2:]
        acf /= acf[0]
        return np.max(acf[12:24]) > threshold if len(acf) > 24 else False

    def reason(self, analysis):
        """Agent reasons about which model to use"""
        print("\n REASONING PHASE")

        prompt = (f"Time series analysis: {analysis['length']} data points, {analysis['trend']} trend, "
                  f"volatility {analysis['volatility']}, seasonality: {analysis['has_seasonality']}. ")

        thought = self.llm(prompt, max_length=100, num_return_sequences=1)[0]['generated_text']
        print(f" Agent Thinking: {thought[:150]}...")

        if analysis['has_seasonality']:
            self.selected_model = 'naive_seasonal'
            reason = "Seasonality detected - using Naive Seasonal model"
        elif float(analysis['volatility']) > 0.3:
            self.selected_model = 'exponential_smoothing'
            reason = "High volatility - using Exponential Smoothing"
        else:
            self.selected_model = 'linear_regression'
            reason = "Stable trend - using Linear Regression"

        print(f" Decision: {reason}")
        return self.selected_model

    def act(self, horizon=12):
        """Agent takes action: trains model and generates forecast"""
        print("\n ACTION PHASE")

        train, val = self.ts[:-12], self.ts[-12:]

        model = self.models[self.selected_model]
        print(f" Training {self.selected_model}...")
        model.fit(train)

        self.forecast = model.predict(horizon)

        if len(val) > 0:
            val_pred = model.predict(len(val))
            accuracy = 100 - mape(val, val_pred)
            print(f" Validation Accuracy: {accuracy:.2f}%")

        print(f" Generated {horizon}-step forecast")
        return self.forecast

    def explain(self):
        """Agent explains its predictions"""
        print("\n EXPLANATION PHASE")

        forecast_values = self.forecast.values().flatten()
        hist_values = self.ts.values().flatten()

        change = ((forecast_values[-1] - hist_values[-1]) / hist_values[-1]) * 100
        direction = "increase" if change > 0 else "decrease"

        explanation = (f"Based on my analysis using {self.selected_model}, "
                       f"I predict a {abs(change):.1f}% {direction} in the next period. "
                       f"Forecast range: {forecast_values.min():.2f} to {forecast_values.max():.2f}. "
                       f"Historical mean was {hist_values.mean():.2f}.")

        print(f" {explanation}")

        prompt = f"Forecast summary: {explanation} Explain implications:"
        summary = self.llm(prompt, max_length=120)[0]['generated_text']
        print(f"\n Agent Summary: {summary[:200]}...")

        return explanation

    def visualize(self):
        """Agent creates visualization of its work"""
        print("\n Generating visualization...")

        plt.figure(figsize=(14, 6))

        self.ts.plot(label='Historical Data', lw=2)

        self.forecast.plot(label=f'Forecast ({self.selected_model})',
                           lw=2, linestyle='--')

        plt.title('Agentic AI Time Series Forecast', fontsize=16, fontweight='bold')
        plt.xlabel('Date', fontsize=12)
        plt.ylabel('Value', fontsize=12)
        plt.legend(loc='best', fontsize=11)
        plt.grid(True, alpha=0.3)
        plt.tight_layout()
        plt.show()

We define a TimeSeriesAgent that thinks with a lightweight HuggingFace model and acts with a small portfolio of Darts models. We perceive patterns (trend, volatility, seasonality), reason to choose the best model, then train, forecast, and validate. Finally, we explain the prediction in plain language and visualize history versus forecast. Check out the FULL CODES here.

def create_sample_data():
    """Generate sample time series data"""
    dates = pd.date_range(start='2020-01-01', periods=48, freq='M')
    trend = np.linspace(100, 150, 48)
    seasonality = 10 * np.sin(np.linspace(0, 4*np.pi, 48))
    noise = np.random.normal(0, 3, 48)
    values = trend + seasonality + noise

    return pd.DataFrame({'date': dates, 'value': values})

We create a helper function create_sample_data() that generates synthetic time series data with a clear trend, sinusoidal seasonality, and random noise. This allows us to simulate realistic monthly data from 2020 to 2023 for testing and demonstrating the agent’s forecasting workflow. Check out the FULL CODES here.

def main():
    """Main execution: Agent autonomously handles forecasting task"""
    print("="*70)
    print(" AGENTIC AI TIME SERIES FORECASTING SYSTEM")
    print("="*70)

    print("\n Loading data...")
    data = create_sample_data()
    print(f"Loaded {len(data)} data points from 2020-01 to 2023-12")

    agent = TimeSeriesAgent()

    analysis = agent.perceive(data)
    agent.reason(analysis)
    agent.act(horizon=12)
    agent.explain()
    agent.visualize()

    print("\n" + "="*70)
    print(" AGENT COMPLETED FORECASTING TASK SUCCESSFULLY")
    print("="*70)


if __name__ == "__main__":
    main()

We define the main function that runs the full agentic AI pipeline. We load synthetic time series data, let the TimeSeriesAgent perceive patterns, reason to select the best model, act by training and forecasting, explain the results, and finally visualize them. This completes the end-to-end autonomous perception, reasoning, and action cycle.
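To adapt the pipeline to real data instead of the synthetic series, one option is a small loader in place of create_sample_data(). The sketch below assumes a hypothetical CSV at data/my_series.csv with date and value columns at a monthly frequency; the file path and column names are illustrative, not part of the original tutorial.

import pandas as pd

def load_my_data(path="data/my_series.csv"):
    """Load a user-provided CSV with 'date' and 'value' columns (hypothetical path)."""
    df = pd.read_csv(path, parse_dates=["date"])
    df = df.sort_values("date").reset_index(drop=True)
    return df[["date", "value"]]

# Inside main(), swap the synthetic data for your own:
# data = load_my_data()

As long as the DataFrame keeps the date/value schema expected by TimeSeriesAgent.perceive(), the rest of the perception, reasoning, action, and explanation flow works unchanged.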

In conclusion, we see how an autonomous agent can analyze time series data, reason about model selection, generate forecasts, and explain its predictions in natural language. By combining Darts with HuggingFace, we create a compact yet powerful framework that not only produces accurate forecasts but also clearly communicates insights. We complete the cycle with visualization, reinforcing how agentic AI makes forecasting more intuitive and interactive.

Check out the FULL CODES here. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter. Wait! Are you on Telegram? Now you can join us on Telegram as well.
The post A Coding Guide to Build an Autonomous Agentic AI for Time Series Forecasting with Darts and Hugging Face appeared first on MarkTechPost.

Microsoft Releases ‘Microsoft Agent Framework’: An Open-Source SDK and Runtime that Simplifies the Orchestration of Multi-Agent Systems

Microsoft released the Microsoft Agent Framework (public preview), an open-source SDK and runtime that unifies core ideas from AutoGen (agent runtime and multi-agent patterns) with Semantic Kernel (enterprise controls, state, plugins) to help teams build, deploy, and observe production-grade AI agents and multi-agent workflows. The framework is available for Python and .NET and integrates directly with Azure AI Foundry’s Agent Service for scaling and operations.

What exactly is Microsoft shipping?

A consolidated agent runtime and API surface. The Agent Framework carries forward AutoGen’s single- and multi-agent abstractions while adding Semantic Kernel’s enterprise features: thread-based state management, type safety, filters, telemetry, and broad model/embedding support. Microsoft positions it as the successor built by the same teams, rather than a replacement that abandons either project.

First-class orchestration modes. It supports agent orchestration (LLM-driven decision-making) and workflow orchestration (deterministic, business-logic multi-agent flows), enabling hybrid systems where creative planning coexists with reliable handoffs and constraints.

Pro-code and platform interoperability. The base AIAgent interface is designed to swap chat model providers and to interoperate with Azure AI Foundry Agents, OpenAI Assistants, and Copilot Studio, reducing vendor lock-in at the application layer.

Open-source, multi-language SDKs under MIT license. The GitHub repo publishes Python and .NET packages with examples and CI/CD-friendly scaffolding. AutoGen remains maintained (bug fixes, security patches) with guidance to consider Agent Framework for new builds.

Where does it run in production?

Azure AI Foundry’s Agent Service provides the managed runtime: it links models, tools, and frameworks; manages thread state; enforces content safety and identity; and wires in observability. It also supports multi-agent orchestration natively and distinguishes itself from Copilot Studio’s low-code approach by targeting complex, pro-code enterprise scenarios.

But how is it connected to ‘AI economics’?

Enterprise AI economics are dominated by token throughput, latency, failure recovery, and observability. Microsoft’s consolidation addresses those by (a) giving one runtime abstraction for agent collaboration and tool use, (b) attaching production controls—telemetry, filters, identity/networking, safety—to the same abstraction, and (c) deploying onto a managed service that handles scaling, policy, and diagnostics. This reduces the “glue code” that typically drives cost and brittleness in multi-agent systems and aligns with Azure AI Foundry’s model-catalog + toolchain approach.

Architectural notes and developer surface

Runtime & state: Agents coordinate via a runtime that handles lifecycles, identities, communication, and security boundaries—concepts inherited and formalized from AutoGen. Threads are the unit of state, enabling reproducible runs, retries, and audits.

Functions & plugins: The framework leans on Semantic Kernel’s plugin architecture and function-calling to bind tools (code interpreters, custom functions) into agent policies with typed contracts.

Model/provider flexibility: The same agent interface can target Azure OpenAI, OpenAI, local runtimes (e.g., Ollama/Foundry Local), and GitHub Models, enabling cost/performance tuning per task without rewriting orchestration logic.

Enterprise context

Microsoft frames the release as part of a broader push toward interoperable, standard-friendly “agentic” systems across Azure AI Foundry—consistent with prior statements about multi-agent collaboration, memory, and structured retrieval. Expect tighter ties to Foundry observability and governance controls as these stabilize.

Our Comments

We like this direction because it collapses two divergent stacks—AutoGen’s multi-agent runtime and Semantic Kernel’s enterprise plumbing—into one API surface with a managed path to production. The thread-based state model and OpenTelemetry hooks address the usual blind spots in agentic systems (repro, latency tracing, failure triage), and Azure AI Foundry’s Agent Service takes on identity, content safety, and tool orchestration so teams can iterate on policies instead of glue code. The Python/.NET parity and provider flexibility (Azure OpenAI, OpenAI, GitHub Models, local runtimes) also make cost/perf tuning practical without rewriting orchestration.

The post Microsoft Releases ‘Microsoft Agent Framework’: An Open-Source SDK and Runtime that Simplifies the Orchestration of Multi-Agent Systems appeared first on MarkTechPost.

AWS Open-Sources an MCP Server for Bedrock AgentCore to Streamline AI Agent Development

AWS released an open-source Model Context Protocol (MCP) server for Amazon Bedrock AgentCore, providing a direct path from natural-language prompts in agentic IDEs to deployable agents on AgentCore Runtime. The package ships with automated transformations, environment provisioning, and Gateway/tooling hooks designed to compress typical multi-step integration work into conversational commands.

So, what exactly is it?

The “AgentCore MCP server” exposes task-specific tools to a client (e.g., Kiro, Claude Code, Cursor, Amazon Q Developer CLI, or the VS Code Q plugin) and guides the assistant to: (1) minimally refactor an existing agent to the AgentCore Runtime model; (2) provision and configure the AWS environment (credentials, roles/permissions, ECR, config files); (3) wire up AgentCore Gateway for tool calls; and (4) invoke and test the deployed agent—all from the IDE’s chat surface.

Practically, the server teaches your coding assistant to convert entry points to AgentCore handlers, add bedrock_agentcore imports, generate requirements.txt, and rewrite direct agent calls into payload-based handlers compatible with Runtime. It can then call the AgentCore CLI to deploy and exercise the agent, including end-to-end calls through Gateway tools.
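To make the refactor concrete, here is a heavily hedged sketch of what a payload-based AgentCore Runtime handler for a Strands agent might look like. The exact module path, class, and decorator names below are assumptions for illustration; in practice the MCP server guides your coding assistant to produce the equivalent conversion for your own code.

# Before: a plain Strands agent invoked directly (simplified)
# from strands import Agent
# agent = Agent()
# print(agent("What's the weather in Seattle?"))

# After: a payload-based handler compatible with AgentCore Runtime (names are assumptions)
from bedrock_agentcore.runtime import BedrockAgentCoreApp  # assumed import path
from strands import Agent

app = BedrockAgentCoreApp()
agent = Agent()

@app.entrypoint  # assumed decorator that registers the handler with the Runtime
def invoke(payload):
    prompt = payload.get("prompt", "")
    result = agent(prompt)
    return {"result": str(result)}

if __name__ == "__main__":
    app.run()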


How to install, and which clients are supported?

AWS provides a one-click install flow from the GitHub repository, using a lightweight launcher (uvx) and a standard mcp.json entry that most MCP-capable clients consume. The AWS team lists the expected mcp.json locations for Kiro (.kiro/settings/mcp.json), Cursor (.cursor/mcp.json), Amazon Q CLI (~/.aws/amazonq/mcp.json), and Claude Code (~/.claude/mcp.json).

The repository sits in the awslabs “mcp” mono-repo (license Apache-2.0). While the AgentCore server directory hosts the implementation, the root repo also links to broader AWS MCP resources and documentation.

Architecture guidance and the “layered” context model

AWS recommends a layered approach to give the IDE’s assistant progressively richer context: start with the agentic client, then add the AWS Documentation MCP Server, layer in framework documentation (e.g., Strands Agents, LangGraph), include the AgentCore and agent-framework SDK docs, and finally steer recurrent workflows via per-IDE “steering files.” This arrangement reduces retrieval misses and helps the assistant plan the end-to-end transform/deploy/test loop without manual context switching.

Development workflow (typical path)

Bootstrap: Use local tools or MCP servers. Either provision a Lambda target for AgentCore Gateway or deploy the server directly to AgentCore Runtime.

Author/Refactor: Start from Strands Agents or LangGraph code. The server instructs the assistant to convert handlers, imports, and dependencies for Runtime compatibility.

Deploy: The assistant looks up relevant docs and invokes the AgentCore CLI to deploy.

Test & Iterate: Invoke the agent via natural language; if tools are needed, integrate Gateway (MCP client inside the agent), redeploy (v2), and retest.


How does it make a difference?

Most “agent frameworks” still require developers to learn cloud-specific runtimes, credentials, role policies, registries, and deployment CLIs before any useful iteration. AWS’s MCP server shifts that work into the IDE assistant and narrows the “prompt-to-production” gap. Since it’s just another MCP server, it composes with existing doc servers (AWS service docs, Strands, LangGraph) and can ride improvements in MCP-aware clients, making it a low-friction entry point for teams standardizing on Bedrock AgentCore.

Comments from MTP (Marktechpost team)

I like that AWS shipped a real MCP endpoint for AgentCore that my IDE can call directly. The uvx-based mcp.json config makes client hookup trivial (Cursor, Claude Code, Kiro, Amazon Q CLI), and the server’s tooling maps cleanly onto the AgentCore Runtime/Gateway/Memory stack while preserving existing Strands/LangGraph code paths. Practically, this collapses the prompt→refactor→deploy→test loop into a reproducible, scriptable workflow rather than bespoke glue code.

Check out the GitHub Repo and Technical details. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter. Wait! Are you on Telegram? Now you can join us on Telegram as well.
The post AWS Open-Sources an MCP Server for Bedrock AgentCore to Streamline AI Agent Development appeared first on MarkTechPost.

Neuphonic Open-Sources NeuTTS Air: A 748M-Parameter On-Device Speech Language Model with Instant Voice Cloning

Neuphonic has released NeuTTS Air, an open-source text-to-speech (TTS) speech language model designed to run locally in real time on CPUs. The Hugging Face model card lists 748M parameters (Qwen2 architecture) and ships in GGUF quantizations (Q4/Q8), enabling inference through llama.cpp/llama-cpp-python without cloud dependencies. It is licensed under Apache-2.0 and includes a runnable demo and examples.

So, what is new?

NeuTTS Air couples a 0.5B-class Qwen backbone with Neuphonic’s NeuCodec audio codec. Neuphonic positions the system as a “super-realistic, on-device” TTS LM that clones a voice from ~3 seconds of reference audio and synthesizes speech in that style, targeting voice agents and privacy-sensitive applications. The model card and repository explicitly emphasize real-time CPU generation and small-footprint deployment.

Key Features

Realism at sub-1B scale: Human-like prosody and timbre preservation for a ~0.7B (Qwen2-class) text-to-speech LM.

On-device deployment: Distributed in GGUF (Q4/Q8) with CPU-first paths; suitable for laptops, phones, and Raspberry Pi-class boards.

Instant speaker cloning: Style transfer from ~3 seconds of reference audio (reference WAV + transcript).

Compact LM+codec stack: Qwen 0.5B backbone paired with NeuCodec (0.8 kbps / 24 kHz) to balance latency, footprint, and output quality.

What is the model architecture and runtime path?

Backbone: Qwen 0.5B used as a lightweight LM to condition speech generation; the hosted artifact is reported as 748M params under the qwen2 architecture on Hugging Face.

Codec: NeuCodec provides low-bitrate acoustic tokenization/decoding; it targets 0.8 kbps with 24 kHz output, enabling compact representations for efficient on-device use.

Quantization & format: Prebuilt GGUF backbones (Q4/Q8) are available; the repo includes instructions for llama-cpp-python and an optional ONNX decoder path.

Dependencies: Uses espeak for phonemization; examples and a Jupyter notebook are provided for end-to-end synthesis.

On-device performance focus

NeuTTS Air showcases “real-time generation on mid-range devices” and offers CPU-first defaults; GGUF quantization is intended for laptops and single-board computers. While no fps/RTF numbers are published on the card, the distribution targets local inference without a GPU and demonstrates a working flow through the provided examples and Space.

Voice cloning workflow

NeuTTS Air requires (1) a reference WAV and (2) the transcript text for that reference. It encodes the reference to style tokens and then synthesizes arbitrary text in the reference speaker’s timbre. The Neuphonic team recommends 3–15 s clean, mono audio and provides pre-encoded samples.

Privacy, responsibility, and watermarking

Neuphonic frames the model for on-device privacy (no audio/text leaves the machine without user’s approval) and notes that all generated audio includes a Perth (Perceptual Threshold) watermarker to support responsible use and provenance.

How does it compare?

Open, local TTS systems exist (e.g., GGUF-based pipelines), but NeuTTS Air is notable for packaging a small LM + neural codec with instant cloning, CPU-first quantizations, and watermarking under a permissive license. The “world’s first super-realistic, on-device speech LM” phrasing is the vendor’s claim; the verifiable facts are the size, formats, cloning procedure, license, and provided runtimes.

Our Comments

The focus is on system trade-offs: a ~0.7B Qwen-class backbone with GGUF quantization paired with NeuCodec at 0.8 kbps/24 kHz is a pragmatic recipe for real-time, CPU-only TTS that preserves timbre using ~3–15 s style references while keeping latency and memory predictable. The Apache-2.0 licensing and built-in watermarking are deployment-friendly, but publishing RTF/latency on commodity CPUs and cloning-quality vs. reference-length curves would enable rigorous benchmarking against existing local pipelines. Operationally, an offline path with minimal dependencies (eSpeak, llama.cpp/ONNX) lowers privacy/compliance risk for edge agents without sacrificing intelligibility.

Check out the Model Card on Hugging Face and GitHub Page. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter. Wait! Are you on Telegram? Now you can join us on Telegram as well.
The post Neuphonic Open-Sources NeuTTS Air: A 748M-Parameter On-Device Speech Language Model with Instant Voice Cloning appeared first on MarkTechPost.

Unlock global AI inference scalability using new global cross-Region i …

Organizations are increasingly integrating generative AI capabilities into their applications to enhance customer experiences, streamline operations, and drive innovation. As generative AI workloads continue to grow in scale and importance, organizations face new challenges in maintaining consistent performance, reliability, and availability of their AI-powered applications. Customers are looking to scale their AI inference workloads across multiple AWS Regions to support consistent performance and reliability.
To address this need, we introduced cross-Region inference (CRIS) for Amazon Bedrock. This managed capability automatically routes inference requests across multiple Regions, enabling applications to handle traffic bursts seamlessly and achieve higher throughput without requiring developers to predict demand fluctuations or implement complex load-balancing mechanisms. CRIS works through inference profiles, which define a foundation model (FM) and the Regions to which requests can be routed.
We are excited to announce availability of global cross-Region inference with Anthropic’s Claude Sonnet 4.5 on Amazon Bedrock. Now, with cross-Region inference, you can choose either a geography-specific inference profile or a global inference profile. This evolution from geography-specific routing provides greater flexibility for organizations because Amazon Bedrock automatically selects the optimal commercial Region within that geography to process your inference request. Global CRIS further enhances cross-Region inference by enabling the routing of inference requests to supported commercial Regions worldwide, optimizing available resources and enabling higher model throughput. This helps support consistent performance and higher throughput, particularly during unplanned peak usage times. Additionally, global CRIS supports key Amazon Bedrock features, including prompt caching, batch inference, Amazon Bedrock Guardrails, Amazon Bedrock Knowledge Bases, and more.
In this post, we explore how global cross-Region inference works, the benefits it offers compared to Regional profiles, and how you can implement it in your own applications with Anthropic’s Claude Sonnet 4.5 to improve your AI applications’ performance and reliability.
Core functionality of global cross-Region inference
Global cross-Region inference helps organizations manage unplanned traffic bursts by using compute resources across different Regions. This section explores how this feature works and the technical mechanisms that power its functionality.
Understanding inference profiles
An inference profile in Amazon Bedrock defines an FM and one or more Regions to which it can route model invocation requests. The global cross-Region inference profile for Anthropic’s Claude Sonnet 4.5 extends this concept beyond geographic boundaries, allowing requests to be routed to one of the supported Amazon Bedrock commercial Regions globally, so you can prepare for unplanned traffic bursts by distributing traffic across multiple Regions.
Inference profiles operate on two key concepts:

Source Region – The Region from which the API request is made
Destination Region – A Region to which Amazon Bedrock can route the request for inference

At the time of writing, global CRIS supports over 20 source Regions, and the destination Region is a supported commercial Region dynamically chosen by Amazon Bedrock.
Intelligent request routing
Global cross-Region inference uses an intelligent request routing mechanism that considers multiple factors, including model availability, capacity, and latency, to route requests to the optimal Region. The system automatically selects the optimal available Region for your request without requiring manual configuration:

Regional capacity – The system considers the current load and available capacity in each potential destination Region.
Latency considerations – Although the system prioritizes availability, it also takes latency into account. By default, the service attempts to fulfill requests from the source Region when possible, but it can seamlessly route requests to other Regions as needed.
Availability metrics – The system continuously monitors the availability of FMs across Regions to support optimal routing decisions.

This intelligent routing system enables Amazon Bedrock to distribute traffic dynamically across the AWS global infrastructure, facilitating optimal availability for each request and smoother performance during high-usage periods.
Monitoring and logging
When using global cross-Region inference, Amazon CloudWatch and AWS CloudTrail continue to record log entries only in the source Region where the request originated. This simplifies monitoring and logging by maintaining all records in a single Region regardless of where the inference request is ultimately processed. To track which Region processed a request, CloudTrail events include an additionalEventData field with an inferenceRegion key that specifies the destination Region. Organizations can monitor and analyze the distribution of their inference requests across the AWS global infrastructure.
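As an illustration of that auditing path, the following is a hedged boto3 sketch that scans recent CloudTrail events in the source Region and prints the inferenceRegion recorded for routed requests. The event name filter and field parsing follow the description above; adjust the Region and lookup window to your environment.

import json
import boto3

cloudtrail = boto3.client("cloudtrail", region_name="us-east-1")  # your source Region

events = cloudtrail.lookup_events(
    LookupAttributes=[{"AttributeKey": "EventName", "AttributeValue": "InvokeModel"}],
    MaxResults=50,
)

for event in events["Events"]:
    detail = json.loads(event["CloudTrailEvent"])
    extra = detail.get("additionalEventData", {})
    if "inferenceRegion" in extra:
        # Destination Region that actually processed this inference request
        print(event["EventTime"], "routed to:", extra["inferenceRegion"])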
Data security and compliance
Global cross-Region inference maintains high standards for data security. Data transmitted during cross-Region inference is encrypted and remains within the secure AWS network. Sensitive information remains protected throughout the inference process, regardless of which Region processes the request. Because security and compliance are a shared responsibility, you must also consider legal or compliance requirements that come with processing inference requests in a different geographic location. Because global cross-Region inference allows requests to be routed globally, organizations with specific data residency or compliance requirements can elect, based on their compliance needs, to use geography-specific inference profiles to make sure data remains within certain Regions. This flexibility helps businesses balance redundancy and compliance needs based on their specific requirements.
Implement global cross-Region inference
To use global cross-Region inference with Anthropic’s Claude Sonnet 4.5, developers must complete the following key steps:

Use the global inference profile ID – When making API calls to Amazon Bedrock, specify the global Anthropic’s Claude Sonnet 4.5 inference profile ID (global.anthropic.claude-sonnet-4-5-20250929-v1:0) instead of a Region-specific model ID. This works with both InvokeModel and Converse APIs.
Configure IAM permissions – Grant appropriate AWS Identity and Access Management (IAM) permissions to access the inference profile and FMs in potential destination Regions. In the next section, we provide more details. You can also read more about prerequisites for inference profiles.

Implementing global cross-Region inference with Anthropic’s Claude Sonnet 4.5 is straightforward, requiring only a few changes to your existing application code. The following is an example of how to update your code in Python:

import boto3
import json

bedrock = boto3.client('bedrock-runtime', region_name='us-east-1')

model_id = "global.anthropic.claude-sonnet-4-5-20250929-v1:0"

response = bedrock.converse(
    messages=[{"role": "user", "content": [{"text": "Explain cloud computing in 2 sentences."}]}],
    modelId=model_id,
)

print("Response:", response['output']['message']['content'][0]['text'])
print("Tokens used:", response.get('usage', {}))

If you’re using the Amazon Bedrock InvokeModel API, you can quickly switch to a different model by changing the model ID, as shown in Invoke model code examples.
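As a hedged sketch of that InvokeModel path, the call below uses the same global inference profile ID with the Anthropic messages request format on Bedrock; the prompt and max_tokens values are illustrative.

import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

body = {
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 256,
    "messages": [{"role": "user", "content": [{"type": "text", "text": "Explain cloud computing in 2 sentences."}]}],
}

response = bedrock.invoke_model(
    modelId="global.anthropic.claude-sonnet-4-5-20250929-v1:0",
    body=json.dumps(body),
)

result = json.loads(response["body"].read())
print(result["content"][0]["text"])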
IAM policy requirements for global CRIS
In this section, we discuss the IAM policy requirements for global CRIS.
Enable global CRIS
To enable global CRIS for your users, you must apply a three-part IAM policy to the role. The following is an example IAM policy to provide granular control. You can replace <REQUESTING REGION> in the example policy with the Region you are operating in.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "GrantGlobalCrisInferenceProfileRegionAccess",
            "Effect": "Allow",
            "Action": "bedrock:InvokeModel",
            "Resource": [
                "arn:aws:bedrock:<REQUESTING REGION>:<ACCOUNT>:inference-profile/global.<MODEL NAME>"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:RequestedRegion": "<REQUESTING REGION>"
                }
            }
        },
        {
            "Sid": "GrantGlobalCrisInferenceProfileInRegionModelAccess",
            "Effect": "Allow",
            "Action": "bedrock:InvokeModel",
            "Resource": [
                "arn:aws:bedrock:<REQUESTING REGION>::foundation-model/<MODEL NAME>"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:RequestedRegion": "<REQUESTING REGION>",
                    "bedrock:InferenceProfileArn": "arn:aws:bedrock:<REQUESTING REGION>:<ACCOUNT>:inference-profile/global.<MODEL NAME>"
                }
            }
        },
        {
            "Sid": "GrantGlobalCrisInferenceProfileGlobalModelAccess",
            "Effect": "Allow",
            "Action": "bedrock:InvokeModel",
            "Resource": [
                "arn:aws:bedrock:::foundation-model/<MODEL NAME>"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:RequestedRegion": "unspecified",
                    "bedrock:InferenceProfileArn": "arn:aws:bedrock:<REQUESTING REGION>:<ACCOUNT>:inference-profile/global.<MODEL NAME>"
                }
            }
        }
    ]
}

The first part of the policy grants access to the Regional inference profile in your requesting Region. This policy allows users to invoke the specified global CRIS inference profile from their requesting Region. The second part of the policy provides access to the Regional FM resource, which is necessary for the service to understand which model is being requested within the Regional context. The third part of the policy grants access to the global FM resource, which enables the cross-Region routing capability that makes global CRIS function. When implementing these policies, make sure all three resource Amazon Resource Names (ARNs) are included in your IAM statements:

The Regional inference profile ARN follows the pattern arn:aws:bedrock:REGION:ACCOUNT:inference-profile/global.MODEL-NAME. This is used to give access to the global inference profile in the source Region.
The Regional FM uses arn:aws:bedrock:REGION::foundation-model/MODEL-NAME. This is used to give access to the FM in the source Region.
The global FM requires arn:aws:bedrock:::foundation-model/MODEL-NAME. This is used to give access to the FM in different global Regions.

The global FM ARN has no Region or account specified, which is intentional and required for the cross-Region functionality.
To simplify onboarding, global CRIS doesn’t require complex changes to an organization’s existing Service Control Policies (SCPs) that might deny access to services in certain Regions. When you opt in to global CRIS using this three-part policy structure, Amazon Bedrock will process inference requests across commercial Regions without validating against Regions denied in other parts of SCPs. This prevents workload failures that could occur when global CRIS routes inference requests to new or previously unused Regions that might be blocked in your organization’s SCPs. However, if you have data residency requirements, you should carefully evaluate your use cases before implementing global CRIS, because requests might be processed in any supported commercial Region.
Disable global CRIS
You can choose from two primary approaches to implement deny policies to global CRIS for specific IAM roles, each with different use cases and implications:

Remove an IAM policy – The first method involves removing one or more of the three required IAM policies from user permissions. Because global CRIS requires all three policies to function, removing a policy will result in denied access.
Implement a deny policy – The second approach is to implement an explicit deny policy that specifically targets global CRIS inference profiles. This method provides clear documentation of your security intent and makes sure that even if someone accidentally adds the required allow policies later, the explicit deny will take precedence. The deny policy should use a StringEquals condition matching the pattern “aws:RequestedRegion”: “unspecified”. This pattern specifically targets inference profiles with the global prefix.

When implementing deny policies, it’s crucial to understand that global CRIS changes how the aws:RequestedRegion field behaves. Traditional Region-based deny policies that use StringEquals conditions with specific Region names such as “aws:RequestedRegion”: “us-west-2” will not work as expected with global CRIS because the service sets this field to global rather than the actual destination Region. However, as mentioned earlier, “aws:RequestedRegion”: “unspecified” will result in the deny effect.
Note: To simplify customer onboarding, global CRIS has been designed to work without requiring complex changes to an organization’s existing SCPs that may deny access to services in certain Regions. When customers opt in to global CRIS using the three-part policy structure described above, Amazon Bedrock will process inference requests across supported AWS commercial Regions without validating against regions denied in any other parts of SCPs. This prevents workload failures that could occur when global CRIS routes inference requests to new or previously unused Regions that might be blocked in your organization’s SCPs. However, customers with data residency requirements should evaluate their use cases before implementing global CRIS, because requests may be processed in any supported commercial Regions. As a best practice, organizations who use geographic CRIS but want to opt out from global CRIS should implement the second approach.
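A minimal sketch of the second approach is shown below: it attaches an explicit inline deny policy to a role with boto3, using the "unspecified" condition pattern described above. The role and policy names are hypothetical placeholders; attach the statement wherever your organization manages IAM policies.

import json
import boto3

iam = boto3.client("iam")

deny_global_cris = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "DenyGlobalCrisInvocation",
        "Effect": "Deny",
        "Action": "bedrock:InvokeModel",
        "Resource": "*",
        # Global CRIS sets aws:RequestedRegion to "unspecified" for the global FM resource
        "Condition": {"StringEquals": {"aws:RequestedRegion": "unspecified"}},
    }],
}

iam.put_role_policy(
    RoleName="my-agent-role",          # hypothetical role name
    PolicyName="deny-global-cris",     # hypothetical policy name
    PolicyDocument=json.dumps(deny_global_cris),
)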
Request limit increases for global CRIS with Anthropic’s Claude Sonnet 4.5
When using global CRIS inference profiles, it’s important to understand that service quota management is centralized in the US East (N. Virginia) Region. However, you can use global CRIS from over 20 supported source Regions. Because this will be a global limit, requests to view, manage, or increase quotas for global cross-Region inference profiles must be made through the Service Quotas console or AWS Command Line Interface (AWS CLI) specifically in the US East (N. Virginia) Region. Quotas for global CRIS inference profiles will not appear on the Service Quotas console or AWS CLI for other source Regions, even when they support global CRIS usage. This centralized quota management approach makes it possible to access your limits globally without estimating usage in individual Regions. If you don’t have access to US East (N. Virginia), reach out to your account teams or AWS support.
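Because these quotas live only in US East (N. Virginia), a hedged boto3 sketch like the one below can list the Amazon Bedrock quotas from that Region and filter for the global cross-Region entries (the exact quota names appear in the console steps that follow).

import boto3

# Quotas for global CRIS inference profiles are managed only in us-east-1
quotas = boto3.client("service-quotas", region_name="us-east-1")

paginator = quotas.get_paginator("list_service_quotas")
for page in paginator.paginate(ServiceCode="bedrock"):
    for quota in page["Quotas"]:
        if "Global cross-Region" in quota["QuotaName"]:
            print(quota["QuotaName"], "->", quota["Value"])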
Complete the following steps to request a limit increase:

Sign in to the Service Quotas console in your AWS account.
Make sure your selected Region is US East (N. Virginia).
In the navigation pane, choose AWS services.
From the list of services, find and choose Amazon Bedrock.
In the list of quotas for Amazon Bedrock, use the search filter to find the specific global CRIS quotas. For example:

Global cross-Region model inference tokens per day for Anthropic Claude Sonnet 4.5 V1
Global cross-Region model inference tokens per minute for Anthropic Claude Sonnet 4.5 V1

Select the quota you want to increase.
Choose Request increase at account level.
Enter your desired new quota value.
Choose Request to submit your request.

Use global cross-Region inference with Anthropic’s Claude Sonnet 4.5
Claude Sonnet 4.5 is Anthropic’s most intelligent model (at the time of writing), and is best for coding and complex agents. Anthropic’s Claude Sonnet 4.5 demonstrates advancements in agent capabilities, with enhanced performance in tool handling, memory management, and context processing. The model shows marked improvements in code generation and analysis, including identifying optimal improvements and exercising stronger judgment in refactoring decisions. It particularly excels at autonomous long-horizon coding tasks, where it can effectively plan and execute complex software projects spanning hours or days while maintaining consistent performance and reliability throughout the development cycle.
Global cross-Region inference for Anthropic’s Claude Sonnet 4.5 delivers multiple advantages over traditional geographic cross-Region inference profiles:

Enhanced throughput during peak demand – Global cross-Region inference provides improved resilience during periods of peak demand by automatically routing requests to Regions with available capacity. This dynamic routing happens seamlessly without additional configuration or intervention from developers. Unlike traditional approaches that might require complex client-side load balancing between Regions, global cross-Region inference handles traffic spikes automatically. This is particularly important for business-critical applications where downtime or degraded performance can have significant financial or reputational impacts.
Cost-efficiency – Global cross-Region inference for Anthropic’s Claude Sonnet 4.5 offers approximately 10% savings on both input and output token pricing compared to geographic cross-Region inference. The price is calculated based on the Region from which the request is made (source Region). This means organizations can benefit from improved resilience with even lower costs. This pricing model makes global cross-Region inference a cost-effective solution for organizations looking to optimize their generative AI deployments. By improving resource utilization and enabling higher throughput without additional costs, it helps organizations maximize the value of their investment in Amazon Bedrock.
Streamlined monitoring – When using global cross-Region inference, CloudWatch and CloudTrail continue to record log entries in your source Region, simplifying observability and management. Even though your requests are processed across different Regions worldwide, you maintain a centralized view of your application’s performance and usage patterns through your familiar AWS monitoring tools.
On-demand quota flexibility – With global cross-Region inference, your workloads are no longer limited by individual Regional capacity. Instead of being restricted to the capacity available in a specific Region, your requests can be dynamically routed across the AWS global infrastructure. This provides access to a much larger pool of resources, making it less complicated to handle high-volume workloads and sudden traffic spikes.

If you’re currently using Anthropic’s Sonnet models on Amazon Bedrock, upgrading to Claude Sonnet 4.5 is a great opportunity to enhance your AI capabilities. It offers a significant leap in intelligence and capability, offered as a straightforward, drop-in replacement at a comparable price point as Sonnet 4. The primary reason to switch is Sonnet 4.5’s superior performance across critical, high-value domains. It is Anthropic’s most powerful model so far for building complex agents, demonstrating state-of-the-art performance in coding, reasoning, and computer use. Furthermore, its advanced agentic capabilities, such as extended autonomous operation and more effective use of parallel tool calls, enable the creation of more sophisticated AI workflows.
Conclusion
Amazon Bedrock global cross-Region inference for Anthropic’s Claude Sonnet 4.5 marks a significant evolution in AWS generative AI capabilities, enabling global routing of inference requests across the AWS worldwide infrastructure. With straightforward implementation and comprehensive monitoring through CloudTrail and CloudWatch, organizations can quickly use this powerful capability for their AI applications, high-volume workloads, and disaster recovery scenarios. We encourage you to try global cross-Region inference with Anthropic’s Claude Sonnet 4.5 in your own applications and experience the benefits firsthand. Start by updating your code to use the global inference profile ID, configure appropriate IAM permissions, and monitor your application’s performance as it uses the AWS global infrastructure to deliver enhanced resilience.
For more information about global cross-Region inference for Anthropic’s Claude Sonnet 4.5 in Amazon Bedrock, refer to Increase throughput with cross-Region inference, Supported Regions and models for inference profiles, and Use an inference profile in model invocation.

About the authors
Melanie Li, PhD, is a Senior Generative AI Specialist Solutions Architect at AWS based in Sydney, Australia, where her focus is on working with customers to build solutions using state-of-the-art AI/ML tools. She has been actively involved in multiple generative AI initiatives across APJ, harnessing the power of LLMs. Prior to joining AWS, Dr. Li held data science roles in the financial and retail industries.
Saurabh Trikande is a Senior Product Manager for Amazon Bedrock and Amazon SageMaker Inference. He is passionate about working with customers and partners, motivated by the goal of democratizing AI. He focuses on core challenges related to deploying complex AI applications, inference with multi-tenant models, cost optimizations, and making the deployment of generative AI models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch, and spending time with his family.
Derrick Choo is a Senior Solutions Architect at AWS who accelerates enterprise digital transformation through cloud adoption, AI/ML, and generative AI solutions. He specializes in full-stack development and ML, designing end-to-end solutions spanning frontend interfaces, IoT applications, data integrations, and ML models, with a particular focus on computer vision and multi-modal systems.
Satveer Khurpa is a Sr. WW Specialist Solutions Architect, Amazon Bedrock at Amazon Web Services. In this role, he uses his expertise in cloud-based architectures to develop innovative generative AI solutions for clients across diverse industries. Satveer’s deep understanding of generative AI technologies allows him to design scalable, secure, and responsible applications that unlock new business opportunities and drive tangible value.
Jared Dean is a Principal AI/ML Solutions Architect at AWS. Jared works with customers across industries to develop machine learning applications that improve efficiency. He is interested in all things AI, technology, and BBQ.
Jan Catarata is a software engineer working on Amazon Bedrock, where he focuses on designing robust distributed systems. When he’s not building scalable AI solutions, you can find him strategizing his next move with friends and family at game night.

Secure ingress connectivity to Amazon Bedrock AgentCore Gateway using …

Agentic AI applications represent a significant development in enterprise automation, where intelligent agents autonomously execute complex workflows, access sensitive datasets, and make real-time decisions across your organization’s infrastructure. Amazon Bedrock AgentCore accelerates enterprise AI transformation by providing fully managed services that remove infrastructure complexity, maintain session isolation, and enable seamless integration with enterprise tools so organizations can deploy trustworthy AI agents at scale. AgentCore Gateway, a modular service under AgentCore, simplifies integration by securely transforming APIs, AWS Lambda functions, and services into Model Context Protocol (MCP)-compatible tools and making them available to agents through a unified endpoint, with built-in authentication and serverless infrastructure that minimizes operational overhead.
In production environments, AI agents are typically deployed within virtual private clouds (VPCs) to maintain secure, isolated network access and to meet enterprise security and compliance requirements. Amazon Web Services (AWS) interface VPC endpoints can enhance agentic AI security by creating private connections between VPC-hosted agents and AgentCore Gateway, keeping sensitive communications within the secure infrastructure of AWS. These endpoints use dedicated network interfaces with private IP addresses to deliver reduced latency and superior performance through direct connectivity. Additionally, VPC interface endpoints offer granular access control through endpoint policies, streamline operations by avoiding proxy server management, reduce data transfer costs, and establish the secure foundation that autonomous AI systems require when processing confidential data in regulated environments at enterprise scale.
In this post, we demonstrate how to access AgentCore Gateway through a VPC interface endpoint from an Amazon Elastic Compute Cloud (Amazon EC2) instance in a VPC. We also show how to configure your VPC endpoint policy to provide secure access to the AgentCore Gateway while maintaining the principle of least privilege access.
Architecture overview
This architecture diagram illustrates a user accessing an application supported by backend agents deployed across various AWS compute services, including EC2 instances, Lambda functions, Amazon Elastic Kubernetes Service (Amazon EKS), or Amazon Elastic Container Service (Amazon ECS), all operating within a VPC environment. These agents communicate with AgentCore Gateway to discover, access, and invoke external tools and services that have been transformed into agent-compatible resources, such as enterprise APIs and Lambda functions. In the standard configuration, agent requests to AgentCore Gateway traverse the public internet. By implementing interface VPC endpoints, organizations can route these communications through the AWS secure internal network backbone instead, delivering significant benefits that can include enhanced security, reduced latency, and improved compliance alignment for regulated workloads that require strict network isolation and data protection standards. The solution follows this workflow:

AI agent interaction – An agent running within the VPC obtains the required inbound authorization from identity providers, authenticates with the gateway, and sends a tool-use request (invokes the MCP tool) to the gateway through the interface VPC endpoint.
Gateway processing – The gateway manages OAuth authorization to make sure only valid users and agents can access tools and resources. After authorizing the inbound request, it converts agent requests using protocols like Model Context Protocol (MCP) into API requests and Lambda invocations.
Secure access – The gateway handles credential injection for each tool, enabling agents to use tools with different authentication requirements seamlessly. It uses AgentCore Identity to securely access backend resources (the targets) on behalf of the agent.
Target execution – The gateway data plane invokes the target, which can be a Lambda function, an OpenAPI specification, or a Smithy model.
Monitoring – AgentCore Gateway provides built-in observability and auditing. Additionally, AWS PrivateLink publishes metrics to Amazon CloudWatch for monitoring interface endpoints. You can optionally enable VPC Flow Logs for logging IP traffic to AgentCore Gateway.

Be aware of the following key considerations:

Private and public network communication – The interface VPC endpoint enables secure communication for inbound traffic from agents to AgentCore Gateway through AWS PrivateLink, making sure this traffic remains within the private network. However, authentication workflows—including OAuth access token retrieval and credential exchange processes between agents and external Identity Provider systems for both inbound and outbound flows—and outbound access from the gateway to MCP tools continue to require internet connectivity for establishing secure sessions with identity systems and external resources hosted outside the AWS environment.
Data plane scope – It’s important to understand that, currently, the interface VPC endpoint support is applicable only to the data plane endpoints of your gateway—the runtime endpoints where your applications interact with agent tools. To clarify the distinction: although you can now access your gateway’s runtime endpoint through the interface VPC endpoint, the control plane operations, such as creating gateways, managing tools, and configuring security settings, must still be performed through the standard public AgentCore control plane endpoint (for example, bedrock-agentcore-control.<region>.amazonaws.com)

Prerequisites
To perform the solution, you need the following prerequisites:

An AWS account with appropriate AWS Identity and Access Management (IAM) permissions for VPC and Amazon Elastic Compute Cloud (Amazon EC2) management
Existing VPC setup with subnet configuration and route tables
AgentCore Gateway already provisioned and configured in your AWS account
Basic understanding of VPC networking concepts and security group configurations

Solution walkthrough
In the following sections, we demonstrate how to configure the interface VPC endpoint using the AWS Management Console and establish secure connectivity from a test EC2 instance within the VPC to AgentCore Gateway.
Create a security group for the EC2 instance
To create a security group for the EC2 instance, follow these steps, as shown in the following screenshot:

Navigate to the Amazon EC2 console in your preferred AWS Region and choose Security Groups in the navigation pane under Network & Security.
Choose Create security group.
For Security group name, enter a descriptive name such as ec2-agent-sg.
For Description, enter a meaningful description such as Security group for EC2 instances running AI agents.
For VPC, choose your target VPC.
Add relevant Inbound rules for the EC2 instance management such as SSH (port 22) from your management network or bastion host.
Leave Outbound rules as default (allows all outbound traffic) to make sure agents can communicate with necessary services.
Choose Create security group.

Create a security group for the interface VPC endpoint
To create a security group for the interface VPC endpoint, follow these steps:
Create a second security group named vpce-agentcore-sg that will be attached to the AgentCore Gateway interface VPC endpoint using similar steps to the preceding instructions and selecting the same VPC. For this security group, configure the following rules to enable secure and restricted access:

Inbound rules – Allow HTTPS (port 443) for secure communication to the AgentCore Gateway
Source – Select the EC2 security group (ec2-agent-sg) you created in the preceding section to allow traffic only from authorized agent instances
Outbound rules – Leave as default (all traffic allowed) to support response traffic

This security group configuration implements the principle of least privilege by making sure only EC2 instances with the agent security group can access the VPC endpoint while blocking unauthorized access from other resources in the VPC. These steps are illustrated by the following screenshot.
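If you prefer to script this instead of using the console, the following is a hedged boto3 sketch that creates the endpoint security group and allows HTTPS only from the agent security group. The Region, VPC ID, and the ec2-agent-sg group ID are placeholders to substitute with your own values.

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # use your Region

vpc_id = "vpc-0123456789abcdef0"           # placeholder VPC ID
ec2_agent_sg_id = "sg-0aaaaaaaaaaaaaaaa"   # placeholder: the ec2-agent-sg group ID

vpce_sg = ec2.create_security_group(
    GroupName="vpce-agentcore-sg",
    Description="Security group for the AgentCore Gateway interface VPC endpoint",
    VpcId=vpc_id,
)["GroupId"]

# Allow HTTPS (443) only from instances in the agent security group
ec2.authorize_security_group_ingress(
    GroupId=vpce_sg,
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 443,
        "ToPort": 443,
        "UserIdGroupPairs": [{"GroupId": ec2_agent_sg_id}],
    }],
)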

Provision an EC2 instance within the VPC
Provision an EC2 instance in the same VPC and select an appropriate Availability Zone for your workload requirements. Configure the instance with the network settings shown in the following list, making sure you select the same VPC and note the chosen subnet for VPC endpoint configuration:

VPC – Select your target VPC
Subnet – Choose a private subnet for enhanced security (note this subnet for VPC endpoint configuration)
Security group – Attach the EC2 security group (ec2-agent-sg) you created in the previous steps
IAM role – Configure an IAM role with necessary permissions for Amazon Bedrock and AgentCore Gateway access
Instance type – Choose an appropriate instance type based on your agent workload requirements

Remember the chosen subnet because you’ll need to configure the VPC endpoint in the same subnet to facilitate optimal network routing and minimal latency. These configurations are shown in the following screenshot.

Create an interface VPC endpoint
Create an interface VPC endpoint using Amazon Virtual Private Cloud (Amazon VPC) that automatically uses AWS PrivateLink technology, enabling secure communication from your EC2 instance to AgentCore Gateway without traversing the public internet. Follow these steps:

Navigate to the Amazon VPC console and choose Endpoints in the navigation pane under the PrivateLink and Lattice section.
Choose Create endpoint.
For Name tag, enter a descriptive name (for example, vpce-agentcore-gateway).
For Service category, choose AWS services.
For Services, search for and choose com.amazonaws.<region>.bedrock-agentcore.gateway (replace <region> with your actual AWS Region).

These settings are shown in the following screenshot.

Set the VPC to the same VPC you’ve been working with throughout this setup.
Select Enable DNS name to allow access to the AgentCore Gateway using its default domain name, which simplifies application configuration and maintains compatibility with existing code.
Specify the subnet where the EC2 instance is running to maintain optimal network routing and minimal latency, as shown in the following screenshot.

Set the security group to the VPC endpoint security group (vpce-agentcore-sg) you created earlier to control access to the endpoint.
For initial testing, leave the policy set to Full access to allow agents within your VPC to communicate with AgentCore Gateway in your AWS account. In production environments, implement more restrictive policies based on the principle of least privilege.

After you create the endpoint, it will take approximately 2–5 minutes to become available. You can monitor the status on the Amazon VPC console, and when it shows as Available, you can proceed with testing the connection.
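The console steps above can also be scripted. The following hedged boto3 sketch creates the interface endpoint with private DNS enabled; the VPC, subnet, and security group IDs are placeholders, and the service name follows the pattern shown earlier with the Region filled in.

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # use your Region

response = ec2.create_vpc_endpoint(
    VpcEndpointType="Interface",
    VpcId="vpc-0123456789abcdef0",                       # placeholder VPC ID
    ServiceName="com.amazonaws.us-east-1.bedrock-agentcore.gateway",
    SubnetIds=["subnet-0123456789abcdef0"],              # the subnet hosting your EC2 agent
    SecurityGroupIds=["sg-0bbbbbbbbbbbbbbbb"],           # the vpce-agentcore-sg group ID
    PrivateDnsEnabled=True,
    TagSpecifications=[{
        "ResourceType": "vpc-endpoint",
        "Tags": [{"Key": "Name", "Value": "vpce-agentcore-gateway"}],
    }],
)
print(response["VpcEndpoint"]["VpcEndpointId"], response["VpcEndpoint"]["State"])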
Test the connection
Log in to the EC2 instance to perform the following tests.
Check traffic flow over an interface VPC endpoint
To confirm the traffic flow through the Amazon Bedrock AgentCore Gateway endpoint, check the IP address of the source resource that connects to the AgentCore Gateway endpoint. When you set up an interface VPC endpoint, AWS deploys an elastic network interface with a private IP address in the subnet. This deployment allows communication with AgentCore Gateway from resources within the Amazon VPC and on-premises resources that connect to the interface VPC endpoint through AWS Direct Connect or AWS Site-to-Site VPN. It also allows communication with resources in other Amazon VPC endpoints when you use centralized interface VPC endpoint architecture patterns.
Check whether you turned on private DNS for the AgentCore Gateway endpoint. If you turn on private DNS, then AgentCore Gateway endpoints resolve to the private endpoint IP addresses. For AgentCore Gateway, enabling private DNS means your agents can continue using the standard gateway endpoint URL while benefiting from private network routing through the VPC endpoint.
Before the VPC interface endpoint is created, the DNS name resolves to a public IP address for the AgentCore Gateway endpoint, as shown in the following example:

nslookup gateway.bedrock-agentcore.<region>.amazonaws.com

Non-authoritative answer:
Name: gateway.bedrock-agentcore.<region>.amazonaws.com
Address: 52.86.152.150

After the VPC interface endpoint is created with private DNS resolution, the DNS name resolves to a private IP address from the CIDR range of the subnet in which the VPC endpoint was created, as shown in the following example:

nslookup <gatewayid>.gateway.bedrock-agentcore.<region>.amazonaws.com

Non-authoritative answer:
Name: <gatewayid>.gateway.bedrock-agentcore.<region>.amazonaws.com
Address: 172.31.91.174

When you select Enable DNS name for AgentCore Gateway VPC interface endpoints, by default AWS turns on the Enable private DNS only for inbound endpoints option.
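As a quick alternative to nslookup, the following Python sketch resolves the gateway host name and reports whether it lands on a private address, which indicates traffic is routing through the VPC endpoint. The host name is a placeholder in the same format as the examples above.

import socket
import ipaddress

# Placeholder: your gateway's default endpoint host name
host = "<gatewayid>.gateway.bedrock-agentcore.<region>.amazonaws.com"

addr = socket.gethostbyname(host)
is_private = ipaddress.ip_address(addr).is_private
print(f"{host} resolves to {addr} ({'private' if is_private else 'public'} address)")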
Private DNS enabled (cURL) (recommended)
When private DNS is enabled, your applications can seamlessly use the standard gateway URL endpoint in the format https://{gateway-id}.gateway.bedrock-agentcore.{region}.amazonaws.com while traffic automatically routes through the VPC endpoint.
The following is a sample cURL request to be executed from a resource within the VPC. The command sends a JSON-RPC POST request to retrieve available tools from the AgentCore Gateway:

curl -sS -i -X POST https://<gatewayid>.gateway.bedrock-agentcore.<region>.amazonaws.com/mcp \
  --header 'Content-Type: application/json' \
  --header "Authorization: Bearer $TOKEN" \
  --data '{
    "jsonrpc": "2.0",
    "id": "'"$UNIQUE_ID"'",
    "method": "tools/list",
    "params": {}
  }'

This cURL command sends a JSON-RPC 2.0 POST request to the AgentCore Gateway MCP endpoint to retrieve a list of available tools. It uses bearer token authentication and includes response headers in the output, calling the tools/list method to discover what tools are accessible through the gateway.
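If you prefer Python over cURL for this check, the following requests sketch sends the same tools/list JSON-RPC call. The gateway URL is a placeholder, and the bearer token is read from the same TOKEN environment variable used above.

import os
import uuid
import requests

gateway_url = "https://<gatewayid>.gateway.bedrock-agentcore.<region>.amazonaws.com/mcp"  # placeholder
token = os.environ["TOKEN"]  # same bearer token as in the cURL example

payload = {
    "jsonrpc": "2.0",
    "id": str(uuid.uuid4()),  # any unique request id
    "method": "tools/list",
    "params": {},
}
response = requests.post(
    gateway_url,
    json=payload,
    headers={"Authorization": f"Bearer {token}"},
    timeout=30,
)
print(response.status_code)
print(response.text)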
Private DNS disabled (Python)
When Private DNS is disabled, you can’t access the gateway directly through the standard AgentCore Gateway endpoint. Instead, you must route traffic through the VPC DNS name shown in the following screenshot and include the original gateway domain name in the Host header.

curl -sS -i -X POST https://<vpce-dns-name>/mcp \
  --header 'Host: <gatewayid>.gateway.bedrock-agentcore.<region>.amazonaws.com' \
  --header 'Content-Type: application/json' \
  --header "Authorization: Bearer $TOKEN" \
  --data '{
    "jsonrpc": "2.0",
    "id": "'"$UNIQUE_ID"'",
    "method": "tools/list",
    "params": {}
  }'

The following steps walk through executing a Python script that uses the Host header:

Access your EC2 instance. Log in to your EC2 instance that has access to the VPC endpoint.
Configure the required environment variables for the connection:
GATEWAY_URL – The VPC endpoint URL used to access the AgentCore Gateway through your private network connection
TOKEN – Your authentication bearer token for accessing the gateway
GATEWAY_HOST – The original AgentCore Gateway domain name that must be included in the Host header when Private DNS is disabled

For example:

export GATEWAY_URL=https://<vpce_id>.gateway.bedrock-agentcore.ap-southeast-2.vpce.amazonaws.com/mcp
export TOKEN=<your-token-here>
export GATEWAY_HOST=<gateway_id>.gateway.bedrock-agentcore.ap-southeast-2.amazonaws.com

Create and execute the test script.

Copy the following Python code into a file named agent.py. This code tests the AgentCore Gateway workflow by discovering available tools, creating a Strands Agent with the tools, and then testing both conversational interactions (tool listing and weather queries) and direct MCP tool calls:

from strands.models import BedrockModel
from mcp.client.streamable_http import streamablehttp_client
from strands.tools.mcp.mcp_client import MCPClient
from strands import Agent
import logging
import os

# Read authentication token and gateway URL from environment variables
token = os.getenv('TOKEN')
gatewayURL = os.getenv('GATEWAY_URL')    # VPC endpoint URL
gatewayHost = os.getenv('GATEWAY_HOST')  # domain name of the AgentCore Gateway

def create_streamable_http_transport():
    """Create HTTP transport with proper authentication headers"""
    return streamablehttp_client(
        gatewayURL,
        headers={
            "Authorization": f"Bearer {token}",
            "Host": gatewayHost
        }
    )

# Initialize MCP client with the transport
client = MCPClient(create_streamable_http_transport)

# Configure Bedrock model - ensure IAM credentials in ~/.aws/credentials have Bedrock access
yourmodel = BedrockModel(
    model_id="amazon.nova-pro-v1:0",
    temperature=0.7,
)

# Configure logging for debugging and monitoring
logging.getLogger("strands").setLevel(logging.INFO)
logging.basicConfig(
    format="%(levelname)s | %(name)s | %(message)s",
    handlers=[logging.StreamHandler()]
)

# Test the complete agent workflow
with client:
    targetname = 'TestGatewayTarget36cb2ebf'

    # List available tools from the MCP server
    tools = client.list_tools_sync()

    # Create an Agent with the model and available tools
    agent = Agent(model=yourmodel, tools=tools)
    print(f"Tools loaded in the agent: {agent.tool_names}")

    # Test agent with a simple query to list available tools
    response1 = agent("Hi, can you list all tools available to you?")
    print(f"Agent response for tool listing: {response1}")

    # Test agent with a tool invocation request
    response2 = agent("Get the current weather for Seattle and show me the exact response from the tool")
    print(f"Agent response for weather query: {response2}")

    # Direct MCP tool invocation for validation
    result = client.call_tool_sync(
        tool_use_id="get-weather-seattle-call-1",  # Unique identifier for this call
        name=f"{targetname}___get_weather",        # Tool name format for Lambda targets
        arguments={"location": "Seattle"}
    )
    print(f"Direct MCP tool response: {result}")

Invoke the script using the following command:

python3 agent.py
Advanced configuration: VPC endpoint access policies
A VPC endpoint policy is a resource-based policy that controls access to AWS services through the endpoint. Unlike identity-based policies, endpoint policies provide an additional layer of access control at the network level. You can configure access policies for AgentCore Gateway VPC endpoints with specific considerations. When creating endpoint policies for AgentCore Gateway, consider these key elements:

Principal configuration – The Principal field can’t be modified because AgentCore Gateway doesn’t use IAM for authentication. Authentication is handled through bearer tokens rather than IAM principals.
Resource specification – Clearly define the Resource field if you want to restrict access to specific gateway endpoints. Use the full Amazon Resource Name (ARN) format to target particular gateways within your account as shown in the following sample policy structure.
Action permissions – For the Action field, avoid specifying control plane operations. Use a wildcard (*) to allow the necessary data plane operations for gateway functionality.

Here is a sample policy structure:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Principal": "*",
      "Effect": "Allow",
      "Action": "*",
      "Resource": "arn:aws:bedrock-agentcore:<region>:<AWS_Account_ID>:gateway/<gateway_id>"
    }
  ]
}

When the VPC endpoint policy blocks a request, you will see error responses such as:

{"jsonrpc":"2.0","id":2,"error":{"code":-32002,"message":"Authorization error - Insufficient permissions"}}
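If you manage the endpoint with the AWS SDK rather than the console, the sample policy above can be applied with modify_vpc_endpoint, as in this boto3 sketch; the endpoint ID, Region, and ARN fields are placeholders.

import json
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # assumed Region

policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Principal": "*",
        "Effect": "Allow",
        "Action": "*",
        "Resource": "arn:aws:bedrock-agentcore:<region>:<AWS_Account_ID>:gateway/<gateway_id>",  # placeholder ARN
    }],
}

# Attach the policy to the existing interface VPC endpoint
ec2.modify_vpc_endpoint(
    VpcEndpointId="vpce-0123456789abcdef0",  # placeholder: your endpoint ID
    PolicyDocument=json.dumps(policy),
)
print("Endpoint policy updated; allow up to 15 minutes for the change to propagate")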

Policy caching behavior
AgentCore Gateway implements a caching mechanism for access policies that introduces a delay of up to 15 minutes before policy changes take effect. Although this caching significantly improves gateway performance, it means that policy modifications might not be immediately reflected in access controls. To work effectively with this behavior, you should allow at least 15 minutes for policy changes to fully propagate throughout the system after making updates. When possible, schedule policy modifications during planned maintenance windows to minimize operational impact. Always test policy changes in nonproduction environments before applying them to production gateways and factor in the caching delay when diagnosing access-related issues to avoid premature troubleshooting efforts.
Advanced patterns
In a shared gateway, multiple agents pattern, multiple agents from different services access a single centralized gateway through a shared VPC endpoint, simplifying network architecture while maintaining security through token-based authentication. This pattern is illustrated in the following diagram.

In a multi-gateway, multi-agent pattern, which is shown in the following diagram, multiple agents across different applications access multiple specialized gateways through dedicated VPC endpoints, providing maximum security isolation with access control per gateway.

In a cross-VPC gateway access pattern, shown in the following diagram, agents in multiple VPCs can access AgentCore Gateway through VPC peering or AWS Transit Gateway connections, allowing centralized gateway access across network boundaries while maintaining isolation.

In a hybrid cloud gateway pattern, on-premises agents can access cloud-based gateways through VPC endpoints with private DNS disabled, enabling hybrid cloud deployments through Direct Connect or VPN connections. The following diagram illustrates this pattern.

Clean up
To avoid ongoing charges and maintain good resource hygiene, clean up your resources by completing the following steps in order.
Delete the EC2 instance:

Navigate to the Amazon EC2 console and select your test instance
Choose Instance state and Stop instance, then wait for it to stop
Choose Instance state and Terminate instance to permanently delete the instance

Delete the VPC endpoint:

Navigate to the Amazon VPC console and choose Endpoints
Select the VPC endpoint (vpce-agentcore-gateway) you created
Choose Actions and Delete VPC endpoints
Confirm the deletion

Delete the security groups:

Navigate to the Amazon EC2 console and choose Security groups
Select the EC2 security group (ec2-agent-sg) you created
Choose Actions and Delete security groups
Repeat for the VPC endpoint security group (vpce-agentcore-sg)
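If you created the resources with boto3, the cleanup can be scripted in the same way. The IDs below are placeholders; note that security group deletion can fail until the endpoint's network interfaces are fully removed, in which case retry after a few minutes.

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # assumed Region

# Placeholders: replace with the IDs of the resources you created
instance_id = "i-0123456789abcdef0"
endpoint_id = "vpce-0123456789abcdef0"
ec2_agent_sg_id = "sg-0123456789abcdef0"
vpce_sg_id = "sg-0fedcba9876543210"

# 1. Terminate the EC2 instance and wait until it is gone
ec2.terminate_instances(InstanceIds=[instance_id])
ec2.get_waiter("instance_terminated").wait(InstanceIds=[instance_id])

# 2. Delete the interface VPC endpoint (deletion can take a few minutes)
ec2.delete_vpc_endpoints(VpcEndpointIds=[endpoint_id])

# 3. Delete the security groups once nothing references them anymore
for sg_id in (vpce_sg_id, ec2_agent_sg_id):
    ec2.delete_security_group(GroupId=sg_id)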

Conclusion
In this post, we demonstrated how to establish secure, private connectivity between VPC-hosted resources and Amazon Bedrock AgentCore Gateway using VPC interface endpoints and AWS PrivateLink. This architecture delivers comprehensive benefits for enterprise agentic AI deployments by implementing networks that are isolated from the internet, providing enhanced security through dedicated private network paths. The solution implements a robust data perimeter through VPC endpoint policies, which create granular access controls that establish strict data boundaries around your AI resources. Additionally, the architecture enables private connectivity to Gateway endpoints for on-premises environments, supporting distributed AI architectures that span cloud and on-premises infrastructure. For organizations deploying autonomous AI systems at scale, implementing VPC interface endpoints creates the secure networking foundation necessary for efficient agent operations while delivering reduced latency through optimized network paths. This enterprise-grade approach helps enable your agentic AI applications to achieve improved performance and reduced response times while meeting security and compliance requirements.
To learn more about implementing these patterns and best practices, visit the Amazon Bedrock documentation and AWS PrivateLink documentation for comprehensive guidance on AI deployments.

About the authors
Dhawal Patel is a Principal Machine Learning Architect at Amazon Web Services (AWS). He has worked with organizations ranging from large enterprises to midsized startups on problems related to distributed computing and AI. He focuses on deep learning, including natural language processing (NLP) and computer vision domains. He helps customers achieve high-performance model inference on Amazon SageMaker.
Sindhura Palakodety is a Senior Solutions Architect at Amazon Web Services (AWS) and Single-Threaded Leader (STL) for ISV Generative AI, where she is dedicated to empowering customers in developing enterprise-scale, Well-Architected solutions. She specializes in generative AI and data analytics domains, enabling organizations to leverage innovative technologies for transformative business outcomes.
Thomas Mathew Veppumthara is a Sr. Software Engineer at Amazon Web Services (AWS) with Amazon Bedrock AgentCore. He has previous generative AI leadership experience in Amazon Bedrock Agents and nearly a decade of distributed systems expertise across Amazon eCommerce Services and Amazon Elastic Block Store (Amazon EBS). He holds multiple patents in distributed systems, storage, and generative AI technologies.
June Won is a Principal Product Manager with Amazon SageMaker JumpStart. He focuses on making foundation models (FMs) easily discoverable and usable to help customers build generative AI applications. His experience at Amazon also includes mobile shopping applications and last-mile delivery.

IBM Released new Granite 4.0 Models with a Novel Hybrid Mamba-2/Transf …

IBM just released Granite 4.0, an open-source LLM family that swaps monolithic Transformers for a hybrid Mamba-2/Transformer stack to cut serving memory while keeping quality. Sizes span a 3B dense “Micro,” a 3B hybrid “H-Micro,” a 7B hybrid MoE “H-Tiny” (~1B active), and a 32B hybrid MoE “H-Small” (~9B active). The models are Apache-2.0, cryptographically signed, and—per IBM—the first open models covered by an accredited ISO/IEC 42001:2023 AI management system certification. They are available on watsonx.ai and via Docker Hub, Hugging Face, LM Studio, NVIDIA NIM, Ollama, Replicate, Dell Pro AI Studio/Enterprise Hub, Kaggle, with Azure AI Foundry…

So, what is new?

Granite 4.0 introduces a hybrid design that interleaves a small fraction of self-attention blocks with a majority of Mamba-2 state-space layers (9:1 ratio). As per IBM technical blog, relative to conventional Transformer LLMs, Granite 4.0-H can reduce RAM by >70% for long-context and multi-session inference, translating into lower GPU cost at a given throughput/latency target. IBM’s internal comparisons also show the smallest Granite 4.0 models outperforming Granite 3.3-8B despite using fewer parameters.

Tell me what are the released variants?

IBM is shipping both Base and Instruct variants across four initial models:

Granite-4.0-H-Small: 32B total, ~9B active (hybrid MoE).

Granite-4.0-H-Tiny: 7B total, ~1B active (hybrid MoE).

Granite-4.0-H-Micro: 3B (hybrid dense).

Granite-4.0-Micro: 3B (dense Transformer for stacks that don’t yet support hybrids).

All are Apache-2.0 and cryptographically signed; IBM states Granite is the first open model family with accredited ISO/IEC 42001 coverage for its AI management system (AIMS). Reasoning-optimized (“Thinking”) variants are planned later in 2025.

How is it trained, context, and dtype?

Granite 4.0 was trained on samples up to 512K tokens and evaluated up to 128K tokens. Public checkpoints on Hugging Face are BF16 (quantized and GGUF conversions are also published), while FP8 is an execution option on supported hardware—not the format of the released weights.
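For local evaluation, a standard transformers loading pattern should work. The repository id below is an assumption based on the naming of the released variants (check the Hugging Face model cards), and the hybrid checkpoints may require a recent transformers release with Granite hybrid support.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ibm-granite/granite-4.0-h-micro"  # assumed repo id; verify on Hugging Face

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # checkpoints are published in BF16
    device_map="auto",
)

messages = [{"role": "user", "content": "Summarize the benefits of hybrid Mamba-2/Transformer models."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))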

Let’s understand its performance signals (enterprise-relevant)

IBM highlights instruction following and tool-use benchmarks:

IFEval (HELM): Granite-4.0-H-Small leads most open-weights models (trailing only Llama 4 Maverick at far larger scale).

https://www.ibm.com/new/announcements/ibm-granite-4-0-hyper-efficient-high-performance-hybrid-models

BFCLv3 (Function Calling): H-Small is competitive with larger open/closed models at lower price points.

MTRAG (multi-turn RAG): Improved reliability on complex retrieval workflows.

How can I get access?

Granite 4.0 is live on IBM watsonx.ai and distributed via Dell Pro AI Studio/Enterprise Hub, Docker Hub, Hugging Face, Kaggle, LM Studio, NVIDIA NIM, Ollama, OPAQUE, Replicate. IBM notes ongoing enablement for vLLM, llama.cpp, NexaML, and MLX for hybrid serving.

My thoughts/comments

I see Granite 4.0’s hybrid Mamba-2/Transformer stack and active-parameter MoE as a practical path to lower TCO: >70% memory reduction and long-context throughput gains translate directly into smaller GPU fleets without sacrificing instruction-following or tool-use accuracy (IFEval, BFCLv3, MTRAG). The BF16 checkpoints with GGUF conversions simplify local evaluation pipelines, and ISO/IEC 42001 plus signed artifacts address provenance/compliance gaps that typically stall enterprise deployment. Net result: a lean, auditable base model family (1B–9B active) that’s easier to productionize than prior 8B-class Transformers.

Check out the Hugging Face Model Card and Technical details.
The post IBM Released new Granite 4.0 Models with a Novel Hybrid Mamba-2/Transformer Architecture: Drastically Reducing Memory Use without Sacrificing Performance appeared first on MarkTechPost.

ServiceNow AI Releases Apriel-1.5-15B-Thinker: An Open-Weights Multimo …

ServiceNow AI Research Lab has released Apriel-1.5-15B-Thinker, a 15-billion-parameter open-weights multimodal reasoning model trained with a data-centric mid-training recipe—continual pretraining followed by supervised fine-tuning—without reinforcement learning or preference optimization. The model attains an Artificial Analysis Intelligence Index score of 52 with 8x cost savings compared to SOTA. The checkpoint ships under an MIT license on Hugging Face.

So, What’s new in it for me?

Frontier-level composite score at small scale. The model reports Artificial Analysis Intelligence Index (AAI) = 52, matching DeepSeek-R1-0528 on that combined metric while being dramatically smaller. AAI aggregates 10 third-party evaluations (MMLU-Pro, GPQA Diamond, Humanity’s Last Exam, LiveCodeBench, SciCode, AIME 2025, IFBench, AA-LCR, Terminal-Bench Hard, τ²-Bench Telecom).

Single-GPU deployability. The model card states the 15B checkpoint “fits on a single GPU,” targeting on-premises and air-gapped deployments with fixed memory and latency budgets.

Open weights and reproducible pipeline. Weights, training recipe, and evaluation protocol are public for independent verification.

https://huggingface.co/ServiceNow-AI/Apriel-1.5-15b-Thinker
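To pull the checkpoint for local inspection, a minimal huggingface_hub sketch is shown below; the repository id comes from the model card URL above, and the serving stack (transformers or vLLM) should follow the model card's instructions.

from huggingface_hub import snapshot_download

# Repo id taken from the model card URL above
local_dir = snapshot_download("ServiceNow-AI/Apriel-1.5-15b-Thinker")
print("Checkpoint downloaded to:", local_dir)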

OK! I got it, but what is its training mechanism?

Base and upscaling. Apriel-1.5-15B-Thinker starts from Mistral’s Pixtral-12B-Base-2409 multimodal decoder-vision stack. The research team applies depth upscaling—increasing decoder layers from 40→48—then projection-network realignment to align the vision encoder with the enlarged decoder. This avoids pretraining from scratch while preserving single-GPU deployability.

CPT (Continual Pretraining). Two stages: (1) mixed text+image data to build foundational reasoning and document/diagram understanding; (2) targeted synthetic visual tasks (reconstruction, matching, detection, counting) to sharpen spatial and compositional reasoning. Sequence lengths extend to 32k and 16k tokens respectively, with selective loss placement on response tokens for instruction-formatted samples.

SFT (Supervised Fine-Tuning). High-quality, reasoning-trace instruction data for math, coding, science, tool use, and instruction following; two additional SFT runs (stratified subset; longer-context) are weight-merged to form the final checkpoint. No RL (reinforcement learning) or RLAIF (reinforcement learning from AI feedback).

Data note. ~25% of the depth-upscaling text mix derives from NVIDIA’s Nemotron collection.

Oh wow! Tell me about its results then

Key text benchmarks (pass@1 / accuracy).

AIME 2025 (American Invitational Mathematics Examination 2025): 87.5–88%

GPQA Diamond (Graduate-Level Google-Proof Question Answering, Diamond split): ≈71%

IFBench (Instruction-Following Benchmark): ~62

τ²-Bench (Tau-squared Bench) Telecom: ~68

LiveCodeBench (functional code correctness): ~72.8

Using VLMEvalKit for reproducibility, Apriel scores competitively across MMMU / MMMU-Pro (Massive Multi-discipline Multimodal Understanding), LogicVista, MathVision, MathVista, MathVerse, MMStar, CharXiv, AI2D, BLINK, with stronger results on documents/diagrams and text-dominant math imagery.

https://huggingface.co/ServiceNow-AI/Apriel-1.5-15b-Thinker/blob/main/Apriel-1.5-Thinker.pdf

Let’s summarize everything

Apriel-1.5-15B-Thinker demonstrates that careful mid-training (continual pretraining + supervised fine-tuning, no reinforcement learning) can deliver a 52 on the Artificial Analysis Intelligence Index (AAI) while remaining deployable on a single graphics processing unit. Reported task-level scores (for example, AIME 2025 ≈88, GPQA Diamond ≈71, IFBench ≈62, Tau-squared Bench Telecom ≈68) align with the model card and place the 15-billion-parameter checkpoint in the most cost-efficient band of current open-weights reasoners. For enterprises, that combination—open weights, reproducible recipe, and single-GPU latency—makes Apriel a practical baseline to evaluate before considering larger closed systems.
The post ServiceNow AI Releases Apriel-1.5-15B-Thinker: An Open-Weights Multimodal Reasoning Model that Hits Frontier-Level Performance on a Single-GPU Budget appeared first on MarkTechPost.

Enhance agentic workflows with enterprise search using Kore.ai and Ama …

This post was written with Meghana Chintalapudi and Surabhi Sankhla of Kore.ai.
As organizations struggle with exponentially growing volumes of data distributed across multiple repositories and applications, employees lose significant time—approximately 30% according to the International Data Corporation (IDC)—searching for information that could be spent on higher-value work. The complexity of modern enterprise data networks demands solutions that can efficiently integrate, process, and deliver actionable insights across disparate systems.
In this post, we demonstrate how organizations can enhance their employee productivity by integrating Kore.ai’s AI for Work platform with Amazon Q Business. We show how to configure AI for Work as a data accessor for Amazon Q index for independent software vendors (ISVs), so employees can search enterprise knowledge and execute end-to-end agentic workflows involving search, reasoning, actions, and content generation. We explore the key benefits of this integration, including advanced search capabilities across more than 90 enterprise connectors and how to extend agentic experiences on top of a search foundation. The post includes a step-by-step implementation guide to help you set up this integration in your environment.
Components of the integration
Kore.ai is a leading Enterprise AI platform consistently recognized by Gartner as a leader in conversational AI. With three key Kore.ai offerings, AI for Work, AI for Process, and AI for Service, enterprises can build and deploy AI solutions based on their business needs. The AI for Work platform helps employees be more productive by making it possible to search across applications, take context-aware actions, generate content, and automate repetitive tasks. The platform goes beyond standalone search to deliver comprehensive agentic orchestration and workflows, helping employees follow up with clients, send weekly updates, or research and write marketing content with a single command. With AI for Work, your employees can create simple no-code agents while your admins have the flexibility to create more advanced low-code or pro-code agents. AI for Process, on the other hand, automates knowledge-intensive business processes end-to-end. AI for Service helps organizations deliver differentiated customer service experiences through self-service, proactive outreach campaigns, and agent assistance.
Amazon Q index for ISVs is a powerful, managed vector search service that supports seamless integration of generative AI applications with customers’ enterprise data through a unified, secure index. ISVs can access and retrieve relevant content through the SearchRelevantContent API for cross-application data retrieval without needing direct access or individual indexing of each data source, while customers retain full control over data access and governance.
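As a rough illustration of that retrieval path, the boto3 sketch below calls the SearchRelevantContent API named above. The application ID, retriever ID, and Region are placeholders, the call must be made with credentials authorized through the data accessor flow, and the parameter and response field names should be verified against the current Amazon Q Business API reference.

import boto3

# Illustrative only: in practice the ISV (here, Kore.ai's platform) makes this call
# with credentials exchanged on behalf of the end user.
qbusiness = boto3.client("qbusiness", region_name="us-east-1")  # assumed Region

response = qbusiness.search_relevant_content(
    applicationId="<amazon-q-business-application-id>",               # placeholder
    contentSource={"retriever": {"retrieverId": "<retriever-id>"}},   # placeholder
    queryText="Quarterly feature requests for product X",
    maxResults=5,
)
for item in response.get("relevantContent", []):
    print(item.get("documentTitle"), "-", item.get("documentUri"))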
When combined with additional search connectors offered by AI for Work platform and its ability to create and orchestrate agents, organizations gain a complete solution that transforms how employees access enterprise data and execute tasks end-to-end. The following video shows one such agentic experience in action, where the AI for Work interface seamlessly orchestrates agents to help a sales executive prepare for a client meeting—compiling information from Amazon Q index and AI for Work connectors, summarizing talking points, and sending them as an email, all from a single query.

Benefits for enterprises
Enterprises often struggle with fragmented data access and repetitive manual tasks that slow down critical business processes. For example, imagine a scenario where a product manager needs to compile quarterly feature requests—with the integration of Kore.ai’s AI for Work and Amazon Q index, they can instantly gather requests from Salesforce, support tickets, and JIRA; automatically generate a structured roadmap; and schedule stakeholder meetings, all with a single query. This seamless integration changes the way enterprises interact with enterprise systems, through multiple key advantages:

Improved search capabilities – Amazon Q index augments the generative AI experience by providing semantically relevant enterprise content across connected systems through its distributed vector database, delivering query responses at enterprise scale. Now, together with AI for Work, your employees can search data from over 90 connectors, integrating with enterprise systems like Microsoft 365, Salesforce, and Workday while also connecting with custom internal knowledge systems and third-party search providers. AI for Work’s orchestrator manages complex query processing and agent routing across multiple data sources, resulting in contextually appropriate and actionable results that significantly reduce search time while also enabling intelligent automations that extend far beyond traditional search capabilities.
Enhanced data processing – The system continuously ingests and analyzes data through the document processing pipeline in Amazon Q index, which automatically handles multiple formats using intelligent chunking algorithms that preserve semantic context. The AI for Work platform unifies search, content generation, and actions in a single interface, to support the creation of multi-step agentic experiences grounded in search. Through real-time incremental indexing that processes only changed content, the system maintains data freshness while converting siloed raw data into actionable insights and multi-step business processes that can be saved and reused across the organization.
Cost optimization – Organizations can achieve significant cost savings by streamlining routine tasks through agents that reduce operational overhead and improve resource allocation. AI for Work supports a wide range of agent-building options, from no-code and low-code to pro-code, for both non-technical employees and technical experts to build agents for themselves and to share across the organization, so teams can accomplish more with existing resources and benefit from sustained productivity improvements.
Security benefits – Security remains paramount, with Amazon Q index implementing vector-level security through end-to-end encryption using AWS Key Management Service (AWS KMS) customer managed keys and document-level access controls that filter search results based on user identity and group membership. The joint solution implements robust role-based access control and audit trails. This zero-trust security approach maintains compliance with industry standards while providing granular control over sensitive enterprise data, making sure users only see information from documents they have explicit permissions to access while maintaining complete data sovereignty. With AI for Work’s robust security and governance tools enterprises can manage permissions and agent access, monitor usage, and enforce guardrails for secure, enterprise-wide deployment of AI solutions at scale.

Solution overview
The Amazon Q Business data accessor provides a secure interface that integrates Kore.ai’s AI for Work platform with Amazon Q index. The integration delivers a robust solution that uses enterprise data across multiple systems to power intelligent agentic actions and content generation capabilities that transform how organizations handle routine tasks and automate complex processes end-to-end.
When a user submits a query through AI for Work, its orchestrator intelligently routes requests between Kore.ai’s native retrievers and Amazon Q index based on predefined routing rules and advanced intent recognition algorithms. For Amazon Q index requests, the architecture implements secure cross-account API calls using OAuth 2.0 tokens that transform into temporary AWS credentials, supporting both security and optimal performance while maintaining strict access controls throughout the entire system. With AI for Work’s agents, users can take follow-up actions, such as drafting proposals or submitting tickets—directly on top of search results, for end-to-end task completion in a single interface. Users can also build personalized workflows of pre-defined steps and execute them from a single query to further save time.
This supports use cases such as automated roadmap generation, where a product manager can query feature requests across multiple systems and receive a structured roadmap complete with stakeholder notifications, or RFP response automation, where sales executives can generate comprehensive proposals by pulling compliance documentation and tailoring responses based on client requirements.
The following diagram illustrates the solution architecture.

Prerequisites
Before enabling the Amazon Q index integration with Kore.ai’s AI for Work, you must have the following components in place:

An AWS account with appropriate service access
Amazon Q Business set up with AWS IAM Identity Center for user authentication
Access to Kore.ai’s AI for Work (as a workspace admin)

With these prerequisites met, you can complete the basic configuration steps on both the Amazon Q Business and Kore.ai consoles to get started.
Add Kore.ai as a data accessor
After creating an Amazon Q Business application with AWS IAM Identity Center, administrators can configure Kore.ai as a data accessor through the Amazon Q Business console. Complete the following steps:

On the Amazon Q Business console, choose Data accessors in the navigation pane.
Choose Add data accessor.
Choose Kore.ai as your data accessor. You must retrieve the tenant ID, a unique identifier for your application tenant. Refer to Prerequisites for instructions to retrieve the tenant ID for your application; similar instructions are also listed later in this post.
For Data source access, configure your level of access. You can select specific data sources from your Amazon Q index to be available through the data accessor. This makes it possible to control which content is surfaced in the AI for Work environment.
For User access, specify which users or groups can access the Amazon Q index through the data accessor. This option makes it possible to configure granular permissions for data accessor accessibility and manage organizational access controls.

After you have added the data accessor, the Amazon Q Business console displays configuration details that you need to share with Kore.ai to complete the setup.

Note down the following information for the next step:

Amazon Q Business application ID
AWS Region of the Amazon Q Business application
Amazon Q Business retriever ID
Region for IAM Identity Center instance

Configure Amazon Q index in Kore.ai’s AI for Work
Kore.ai’s AI for Work supports flexible integration with Amazon Q index based on your enterprise search needs. There are two configuration options: configuring Amazon Q index as the primary enterprise knowledge source or configuring it as a search agent. We provide instructions for both options in this post.
Option 1: Configure Amazon Q index as the primary enterprise knowledge source
If you want Amazon Q index to act as the primary enterprise knowledge source and fallback search layer, complete the following steps:

In AI for Work, go to Workspaces on the admin console. Then navigate to Enterprise Workspace, which is the default workspace.

Choose Configure to configure an enterprise knowledge data source.
On the Create New dropdown menu, choose Amazon Q.

Enter a source name and brief description.
Copy the tenant ID displayed—this is required during the setup of the data accessor in AWS, as described in the previous section.
Enter the details captured earlier:

Amazon Q Business application ID
Region of the Amazon Q Business application
Amazon Q Business retriever ID
Region for IAM Identity Center instance

Choose Continue to save and complete the configuration.

The new knowledge source now shows as Active.

Option 2: Configure Amazon Q index as a search agent
If you already have a primary search index, you can configure Amazon Q index as a search agent:

In AI for Work, go to Workspaces on the admin console.
Choose the workspace where you want to add Amazon Q index. (Enterprise Workspace is used by default).
Under AI Agents in the navigation pane, choose Search Agent.
Choose Create agent.

Provide an agent name and purpose. This helps define when the search agent should be invoked.
Choose Continue to move to configuration.
For Select Search Index, choose Amazon Q.

Copy the tenant ID displayed—it is required during the setup of the data accessor in AWS.

Preview and test the agent.
After you have validated the agent, publish it to selected users or groups.

Your integration is now complete. You can now access the assistant application and start asking questions in the AI for Work console. If you’ve created a search agent, you can also access it from the list of agents and start interacting with it directly.
Clean up
When you are finished using this solution, clean up your resources to avoid additional costs:

Disable the Amazon Q index configuration within AI for Work’s settings.
Delete the Kore.ai data accessor from the Amazon Q Business console, which will remove permissions and access for users.
Delete the Amazon Q Business application to remove the associated index and data source connectors from your AWS account.

Conclusion
The combination of Kore.ai’s AI for Work and Amazon Q index offers enterprises a transformative approach to boosting employee productivity, leveraging comprehensive search capabilities while streamlining repetitive tasks and processes. By integrating Kore.ai’s advanced agentic platform with the robust search infrastructure of Amazon Q index, organizations can now execute context-aware actions by accessing relevant information across disparate systems while maintaining data ownership and security. This supports faster problem-solving, enhanced productivity, and better collaboration across the organization.
In this post, we explored how enterprises can use the integration between Kore.ai’s AI for Work and Amazon Q Business to streamline their operational processes and unlock valuable productivity gains. We demonstrated how organizations can set up this integration using an Amazon Q data accessor, helping teams access critical information securely and cost-effectively.
Unlock the full potential of your organization’s data and agentic workflows today with the Amazon Q index and Kore.ai’s AI for Work’s unified solution by following the steps in Amazon Q integration with AI for Work.

About the authors
Siddhant Gupta is a Software Development Manager on the Amazon Q team based in Seattle, WA. He is driving innovation and development in cutting-edge AI-powered solutions.
Chinmayee Rane is a Generative AI Specialist Solutions Architect at AWS, with a core focus on generative AI. She helps ISVs accelerate the adoption of generative AI by designing scalable and impactful solutions. With a strong background in applied mathematics and machine learning, she specializes in intelligent document processing and AI-driven innovation. Outside of work, she enjoys salsa and bachata dancing.
Bobby Williams is a Senior Solutions Architect at AWS. He has decades of experience designing, building, and supporting enterprise software solutions that scale globally. He works on solutions across industry verticals and horizontals and is driven to create a delightful experience for every customer.
Santhosh Urukonda is a Senior PACE (Prototyping & Cloud Engineering) Architect at AWS with two decades of experience. He specializes in helping customers develop innovative, first-to-market solutions with a focus on generative AI.
Nikhil Kumar Goddeti is a Cloud Support Engineer II at AWS. He specializes in AWS Data Analytics services with emphasis on Amazon OpenSearch Service, Amazon Q Business, Amazon Kinesis, Amazon MSK, Amazon AppFlow, and Amazon Kendra. He is a Subject Matter Expert of OpenSearch. Outside of work, he enjoys travelling with his friends and playing cricket.
Meghana Chintalapudi is a Product Manager at Kore.ai, driving the development of search and agentic AI solutions for the AI for Work platform. She has led large-scale AI implementations for Fortune 500 clients, evolving from deterministic NLP and intent-detection models to advanced large language model deployments, with a strong emphasis on enterprise-grade security and scalability. Outside of work, Meghana is a dancer and takes movement workshops in Hyderabad, India.
Surabhi Sankhla is a VP of Product at Kore.ai, where she leads the AI for Work platform to help enterprises boost employee productivity. With over 13 years of experience in product management and technology, she has launched AI products from the ground up and scaled them to millions of users. At Kore.ai, she drives product strategy, client implementations, and go-to-market execution in partnership with cross-functional teams. Based in San Francisco, Surabhi is passionate about making AI accessible and impactful for all.

Accelerate development with the Amazon Bedrock AgentCore MCP server

Today, we’re excited to announce the Amazon Bedrock AgentCore Model Context Protocol (MCP) Server. With built-in support for runtime, gateway integration, identity management, and agent memory, the AgentCore MCP Server is purpose-built to speed up creation of components compatible with Bedrock AgentCore. You can use the AgentCore MCP server for rapid prototyping, production AI solutions, or to scale your agent infrastructure for your enterprise.
Agentic IDEs like Kiro, Claude Code, GitHub Copilot, and Cursor, along with sophisticated MCP servers, are transforming how developers build AI agents. What typically takes significant time and effort (for example, learning about Bedrock AgentCore services, integrating Runtime and the Tools Gateway, managing security configurations, and deploying to production) can now be completed in minutes through conversational commands with your coding assistant.
In this post, we introduce the new AgentCore MCP server and walk through the installation steps so you can get started.
AgentCore MCP server capabilities
The AgentCore MCP server brings a new agentic development experience to AWS, providing specialized tools that automate the complete agent lifecycle, eliminate the steep learning curve, and reduce development friction that can slow innovation cycles. To address specific agent development challenges, the AgentCore MCP server:

Transforms agents for AgentCore Runtime integration by providing guidance to your coding assistant on the minimum functionality changes needed—adding Runtime library imports, updating dependencies, initializing apps with BedrockAgentCoreApp(), converting entrypoints to decorators, and changing direct agent calls to payload handling—while preserving your existing agent logic and Strands Agents features. A minimal sketch of this transformation appears after this list.
Automates development environment provisioning by handling the complete setup process through your coding assistant: installing required dependencies (bedrock-agentcore SDK, bedrock-agentcore-starter-toolkit CLI helpers, strands-agents SDK), configuring AWS credentials and AWS Regions, defining execution roles with Bedrock AgentCore permissions, setting up ECR repositories, and creating .bedrock_agentcore.yaml configuration files.
Simplifies tool integration with Bedrock AgentCore Gateway for seamless agent-to-tool communication in the cloud environment.
Enables simple agent invocation and testing by providing natural language commands through your coding assistant to invoke provisioned agents on AgentCore Runtime and verify the complete workflow, including calls to AgentCore Gateway tools when applicable.
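To make the first bullet concrete, here is a minimal sketch of what such a transformed agent can look like, assuming the bedrock-agentcore SDK and a Strands agent as in the AgentCore starter examples; your agent construction and payload fields will differ.

from bedrock_agentcore.runtime import BedrockAgentCoreApp
from strands import Agent

app = BedrockAgentCoreApp()
agent = Agent()  # your existing agent logic stays as-is

@app.entrypoint
def invoke(payload):
    """Handle an AgentCore Runtime invocation payload instead of a direct call."""
    user_message = payload.get("prompt", "")  # field name assumed for illustration
    result = agent(user_message)
    return {"result": str(result)}

if __name__ == "__main__":
    app.run()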

Layered approach
When using the AgentCore MCP server with your favorite client, we encourage you to consider a layered architecture designed to provide comprehensive AI agent development support:

Layer 1: Agentic IDE or client – Use Kiro, Claude Code, Cursor, VS Code extensions, or another natural language interface for developers. For very simple tasks, agentic IDEs are equipped with the right tools to look up documentation and perform tasks specific to Bedrock AgentCore. However, with this layer alone, developers may observe sub-optimal performance across AgentCore developer paths.
Layer 2: AWS service documentation – Install the AWS Documentation MCP Server for comprehensive AWS service documentation, including context about Bedrock AgentCore.
Layer 3: Framework documentation – Install the Strands, LangGraph, or other framework docs MCP servers or use the llms.txt for framework-specific context.
Layer 4: SDK documentation – Install the MCP or use the llms.txt for the Agent Framework SDK and Bedrock AgentCore SDK for a combined documentation layer that covers the Strands Agents SDK documentation and Bedrock AgentCore API references.
Layer 5: Steering files – Task-specific guidance for more complex and repeated workflows. Each IDE has a different approach to using steering files (for example, see Steering in the Kiro documentation).

Each layer builds upon the previous one, providing increasingly specific context so your coding assistant can handle everything from basic AWS operations to complex agent transformations and deployments.
Installation
To get started with the Amazon Bedrock AgentCore MCP server, you can use the one-click install on the GitHub repository.
Each IDE integrates with an MCP server differently using the mcp.json file. Review the MCP documentation for your IDE, such as Kiro, Cursor, Q CLI, or Claude Code, to determine the location of the mcp.json file.

Client – location of mcp.json – documentation:

Kiro – .kiro/settings/mcp.json – https://kiro.dev/docs/mcp/
Cursor – .cursor/mcp.json – https://cursor.com/docs/context/mcp
Q CLI – ~/.aws/amazonq/mcp.json – https://docs.aws.amazon.com/amazonq/latest/qdeveloper-ug/qdev-mcp.html
Claude Code – ~/.claude/mcp.json – https://docs.claude.com/en/docs/claude-code/mcp

Use the following in your mcp.json:

{
  "mcpServers": {
    "awslabs.amazon-bedrock-agentcore-mcp-server": {
      "command": "uvx",
      "args": ["awslabs.amazon-bedrock-agentcore-mcp-server@latest"],
      "env": {
        "FASTMCP_LOG_LEVEL": "ERROR"
      },
      "disabled": false,
      "autoApprove": []
    }
  }
}

For example, here is what the IDE looks like on Kiro, with the AgentCore MCP server and the two tools, search_agentcore_docs and fetch_agentcore_doc, connected:

Using the AgentCore MCP server for agent development
While we show demos for various use cases below using the Kiro IDE, the AgentCore MCP server has also been tested to work on Claude Code, Amazon Q CLI, Cursor, and the VS Code Q plugin. First, let’s take a look at a typical agent development lifecycle using AgentCore services (remember that this is only one example with the tools available, and you are free to explore more such use cases simply by instructing the agent in your favorite Agentic IDE):

The agent development lifecycle follows these steps:

The user takes a local set of tools or MCP servers and

Creates a lambda target for AgentCore Gateway; or
Deploys the MCP server as-is on AgentCore Runtime

The user prepares the actual agent code using a preferred framework like Strands Agents or LangGraph. The user can either:

Start from scratch (the server can fetch docs from the Strands Agents or LangGraph documentation)
Start from fully or partially working agent code

The user asks the agent to transform the code into a format compatible with AgentCore Runtime with the intention to deploy the agent later. This causes the agent to:

Write an appropriate requirements.txt file
Import necessary libraries, including bedrock_agentcore
Decorate the main handler (or create one) to access the core agent calling logic or input handler

The user may then ask the agent to deploy to AgentCore Runtime. The agent can look up documentation and can use the AgentCore CLI to deploy the agent code to Runtime
The user can test the agent by asking the agent to do so. The AgentCore CLI command required for this is written and executed by the client
The user then asks to modify the code to use the deployed AgentCore Gateway MCP server within this AgentCore Runtime agent.

The agent modifies the original code to add an MCP client that can call the deployed gateway
The agent then deploys a new version v2 of the agent to Runtime
The agent then tests this integration with a new prompt

Here is a demo of the MCP server working with Cursor IDE. We see the agent perform the following steps:

Transform the weather_agent.py to be compatible with AgentCore runtime
Use the AgentCore CLI to deploy the agent
Test the deployed agent with a successful prompt

Here’s another example of deploying a LangGraph agent to AgentCore Runtime with the Cursor IDE performing similar steps as seen above.

Clean up
If you’d like to uninstall the MCP server, follow the MCP documentation for your IDE, such as Kiro, Cursor, Q CLI, and Claude Code for instructions.
Conclusion
In this post, we showed how you can use the AgentCore MCP server with your favorite Agentic IDE of choice to speed up your development workflows.
We encourage you to review the GitHub repository, as well as read through and use the following resources in your development:

Amazon Bedrock AgentCore CLI documentation
Strands Agents MCP Server
LangGraph llms.txt

We encourage you to try out the AgentCore MCP server and provide any feedback through issues in our GitHub repository.

About the authors

Shreyas Subramanian
Shreyas is a Principal Data Scientist and helps customers by using Generative AI to solve their business challenges using the AWS platform. Shreyas has a background in large scale optimization and Deep Learning, and he is a researcher studying the use of Machine Learning and Reinforcement Learning for accelerating learning and optimization tasks. Shreyas is also an Amazon best-selling book author with several research papers and patents to his name.

Primo Mu
Primo is a Software Development Engineer on the Agentic AI Foundation team at AWS, where he builds foundational systems and infrastructure that power intelligent AI applications. He has extensive experience working on backend stateless orchestration services behind products like Kiro and Q Dev CLI. He focuses on creating scalable frameworks and robust architectures that enable developers to build sophisticated agentic systems.

Liquid AI Released LFM2-Audio-1.5B: An End-to-End Audio Foundation Mod …

Liquid AI has released LFM2-Audio-1.5B, a compact audio–language foundation model that both understands and generates speech and text through a single end-to-end stack. It positions itself for low-latency, real-time assistants on resource-constrained devices, extending the LFM2 family into audio while retaining a small footprint.

https://www.liquid.ai/blog/lfm2-audio-an-end-to-end-audio-foundation-model

But what’s actually new? A unified backbone with disentangled audio I/O

LFM2-Audio extends the 1.2B-parameter LFM2 language backbone to treat audio and text as first-class sequence tokens. Crucially, the model disentangles audio representations: inputs are continuous embeddings projected directly from raw waveform chunks (~80 ms), while outputs are discrete audio codes. This avoids discretization artifacts on the input path while keeping training and generation autoregressive for both modalities on the output path.

On the implementation side, the released checkpoint uses:

Backbone: LFM2 (hybrid conv + attention), 1.2B params (LM only)

Audio encoder: FastConformer (~115M, canary-180m-flash)

Audio decoder: RQ-Transformer predicting discrete Mimi codec tokens (8 codebooks)

Context: 32,768 tokens; vocab: 65,536 (text) / 2049×8 (audio)

Precision: bfloat16; license: LFM Open License v1.0; languages: English

https://www.liquid.ai/blog/lfm2-audio-an-end-to-end-audio-foundation-model

Two generation modes for real-time agents

Interleaved generation for live, speech-to-speech chat where the model alternates text and audio tokens to minimize perceived latency.

Sequential generation for ASR/TTS (switching modalities turn-by-turn).

Liquid AI provides a Python package (liquid-audio) and a Gradio demo to reproduce these behaviors.

Latency: <100 ms to first audio

The Liquid AI team reports end-to-end latency below 100 ms from a 4-second audio query to the first audible response—a proxy for perceived responsiveness in interactive use—stating it is faster than models smaller than 1.5B parameters under their setup.

Benchmarks: VoiceBench and ASR results

On VoiceBench—a suite of nine audio-assistant evaluations—Liquid reports an overall score of 56.78 for LFM2-Audio-1.5B, with per-task numbers disclosed in the blog’s chart (e.g., AlpacaEval 3.71, CommonEval 3.49, WildVoice 3.17). The Liquid AI team contrasts this result with larger models like Qwen2.5-Omni-3B and Moshi-7B in the same table. (VoiceBench is an external benchmark introduced in late 2024 for LLM-based voice assistants)

The model card on Hugging Face provides an additional VoiceBench table (with closely related—but not identical—per-task values) and includes classic ASR WERs where LFM2-Audio matches or improves on Whisper-large-v3-turbo for some datasets despite being a generalist speech–text model. For example (lower is better): AMI 15.36 vs. 16.13 (Whisper-large-v3-turbo), LibriSpeech-clean 2.03 vs. 2.10.

https://huggingface.co/LiquidAI/LFM2-Audio-1.5B

Alright, but why does it really matter in voice AI trends?

Most “omni” stacks couple ASR → LLM → TTS, which adds latency and brittle interfaces. LFM2-Audio’s single-backbone design with continuous input embeddings and discrete output codes reduces glue logic and allows interleaved decoding for early audio emission. For developers, this translates to simpler pipelines and faster perceived response times, while still supporting ASR, TTS, classification, and conversational agents from one model. Liquid AI provides code, demo entry points, and distribution via Hugging Face.

Check out the GitHub Page, Hugging Face Model Card and Technical details.
The post Liquid AI Released LFM2-Audio-1.5B: An End-to-End Audio Foundation Model with Sub-100 ms Response Latency appeared first on MarkTechPost.

MLPerf Inference v5.1 (2025): Results Explained for GPUs, CPUs, and AI …

What MLPerf Inference Actually Measures?

MLPerf Inference quantifies how fast a complete system (hardware + runtime + serving stack) executes fixed, pre-trained models under strict latency and accuracy constraints. Results are reported for the Datacenter and Edge suites with standardized request patterns (“scenarios”) generated by LoadGen, ensuring architectural neutrality and reproducibility. The Closed division fixes the model and preprocessing for apples-to-apples comparisons; the Open division allows model changes that are not strictly comparable. Availability tags—Available, Preview, RDI (research/development/internal)—indicate whether configurations are shipping or experimental.

The 2025 Update (v5.0 → v5.1): What Changed?

The v5.1 results (published Sept 9, 2025) add three modern workloads and broaden interactive serving:

DeepSeek-R1 (first reasoning benchmark)

Llama-3.1-8B (summarization) replacing GPT-J

Whisper Large V3 (ASR)

This round recorded 27 submitters and first-time appearances of AMD Instinct MI355X, Intel Arc Pro B60 48GB Turbo, NVIDIA GB300, RTX 4000 Ada-PCIe-20GB, and RTX Pro 6000 Blackwell Server Edition. Interactive scenarios (tight TTFT/TPOT limits) were expanded beyond a single model to capture agent/chat workloads.

Scenarios: The Four Serving Patterns You Must Map to Real Workloads

Offline: maximize throughput, no latency bound—batching and scheduling dominate.

Server: Poisson arrivals with p99 latency bounds—closest to chat/agent backends.

Single-Stream / Multi-Stream (Edge emphasis): strict per-stream tail latency; Multi-Stream stresses concurrency at fixed inter-arrival intervals.

Each scenario has a defined metric (e.g., max Poisson throughput for Server; throughput for Offline).

Latency Metrics for LLMs: TTFT and TPOT Are Now First-Class

LLM tests report TTFT (time-to-first-token) and TPOT (time-per-output-token). v5.0 introduced stricter interactive limits for Llama-2-70B (p99 TTFT 450 ms, TPOT 40 ms) to reflect user-perceived responsiveness. The long-context Llama-3.1-405B keeps higher bounds (p99 TTFT 6 s, TPOT 175 ms) due to model size and context length. These constraints carry into v5.1 alongside new LLM and reasoning tasks.
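As a back-of-the-envelope illustration of what these bounds mean for a full response, the following sketch combines TTFT and TPOT for an assumed output length; the token count is hypothetical.

# Illustrative arithmetic only: estimate worst-case response time for an
# interactive Llama-2-70B request under the v5.x p99 bounds quoted above.
ttft_s = 0.450       # p99 time-to-first-token, interactive bound
tpot_s = 0.040       # p99 time-per-output-token, interactive bound
output_tokens = 250  # assumed response length

total_s = ttft_s + tpot_s * (output_tokens - 1)
print(f"Worst-case p99 latency for {output_tokens} output tokens: {total_s:.2f} s")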

The 2025 Datacenter Menu (Closed Division Targets You’ll Actually Compare)

Key v5.1 entries and their quality/latency gates (abbrev.):

LLM Q&A – Llama-2-70B (OpenOrca): Conversational 2000 ms/200 ms; Interactive 450 ms/40 ms; 99% and 99.9% accuracy targets.

LLM Summarization – Llama-3.1-8B (CNN/DailyMail): Conversational 2000 ms/100 ms; Interactive 500 ms/30 ms.

Reasoning – DeepSeek-R1: TTFT 2000 ms / TPOT 80 ms; 99% of FP16 (exact-match baseline).

ASR – Whisper Large V3 (LibriSpeech): WER-based quality (datacenter + edge).

Long-context – Llama-3.1-405B: TTFT 6000 ms, TPOT 175 ms.

Image – SDXL 1.0: FID/CLIP ranges; Server has a 20 s constraint.

Legacy CV/NLP (ResNet-50, RetinaNet, BERT-L, DLRM, 3D-UNet) remain for continuity.

Power Results: How to Read Energy Claims

MLPerf Power (optional) reports system wall-plug energy for the same runs (Server/Offline: system power; Single/Multi-Stream: energy per stream). Only measured runs are valid for energy efficiency comparisons; TDPs and vendor estimates are out-of-scope. v5.1 includes datacenter and edge power submissions but broader participation is encouraged.

How To Read the Tables Without Fooling Yourself?

Compare Closed vs Closed only; Open runs may use different models/quantization.

Match accuracy targets (99% vs 99.9%)—throughput often drops at stricter quality.

Normalize cautiously: MLPerf reports system-level throughput under constraints; dividing by accelerator count yields a derived “per-chip” number that MLPerf does not define as a primary metric. Use it only for budgeting sanity checks, not marketing claims (see the sketch after this list).

Filter by Availability (prefer Available) and include Power columns when efficiency matters.
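
For the derived per-chip sanity check mentioned above, a minimal sketch (all numbers are hypothetical; only compare entries with the same division, scenario, and accuracy target):

# Derived "per-accelerator" throughput is a budgeting heuristic, not an official MLPerf metric.
systems = {
    "rack_scale_72_accel": {"tokens_per_s": 250000.0, "accelerators": 72},
    "single_node_8_accel": {"tokens_per_s": 33000.0, "accelerators": 8},
}
for name, s in systems.items():
    derived = s["tokens_per_s"] / s["accelerators"]
    print(f"{name}: {derived:,.0f} tokens/s per accelerator (derived)")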

Interpreting 2025 Results: GPUs, CPUs, and Other Accelerators

GPUs (rack-scale to single-node). New silicon shows up prominently in Server-Interactive (tight TTFT/TPOT) and in long-context workloads where scheduler & KV-cache efficiency matter as much as raw FLOPs. Rack-scale systems (e.g., GB300 NVL72 class) post the highest aggregate throughput; normalize by both accelerator and host counts before comparing to single-node entries, and keep scenario/accuracy identical.

CPUs (standalone baselines + host effects). CPU-only entries remain useful baselines and highlight preprocessing and dispatch overheads that can bottleneck accelerators in Server mode. New Xeon 6 results and mixed CPU+GPU stacks appear in v5.1; check host generation and memory configuration when comparing systems with similar accelerators.

Alternative accelerators. v5.1 increases architectural diversity (GPUs from multiple vendors plus new workstation/server SKUs). Where Open-division submissions appear (e.g., pruned/low-precision variants), validate that any cross-system comparison holds constant division, model, dataset, scenario, and accuracy.

Practical Selection Playbook (Map Benchmarks to SLAs)

Interactive chat/agents → Server-Interactive on Llama-2-70B/Llama-3.1-8B/DeepSeek-R1 (match latency & accuracy; scrutinize p99 TTFT/TPOT).

Batch summarization/ETL → Offline on Llama-3.1-8B; throughput per rack is the cost driver.

ASR front-ends → Whisper V3 Server with tail-latency bound; memory bandwidth and audio pre/post-processing matter.

Long-context analytics → Llama-3.1-405B; evaluate if your UX tolerates 6 s TTFT / 175 ms TPOT.

What the 2025 Cycle Signals?

Interactive LLM serving is table-stakes. Tight TTFT/TPOT in v5.x makes scheduling, batching, paged attention, and KV-cache management visible in results—expect different leaders than in pure Offline.

Reasoning is now benchmarked. DeepSeek-R1 stresses control-flow and memory traffic differently from next-token generation.

Broader modality coverage. Whisper V3 and SDXL exercise pipelines beyond token decoding, surfacing I/O and bandwidth limits.

Summary

In summary, MLPerf Inference v5.1 makes inference comparisons actionable only when grounded in the benchmark’s rules: align on the Closed division, match scenario and accuracy (including LLM TTFT/TPOT limits for interactive serving), and prefer Available systems with measured Power to reason about efficiency; treat any per-device splits as derived heuristics because MLPerf reports system-level performance. The 2025 cycle expands coverage with DeepSeek-R1, Llama-3.1-8B, and Whisper Large V3, plus broader silicon participation, so procurement should filter results to the workloads that mirror production SLAs—Server-Interactive for chat/agents, Offline for batch—and validate claims directly in the MLCommons result pages and power methodology.

References:

MLCommons Releases New MLPerf Inference v5.1 Benchmark Results

MLPerf Inference: Datacenter

MLPerf Inference: Edge

https://docs.mlcommons.org/inference/

https://docs.mlcommons.org/inference/power/

Llama 2 70B: An MLPerf Inference Benchmark for Large Language Models

DeepSeek Reasoning for MLPerf Inference v5.1

https://blogs.nvidia.com/blog/mlperf-inference-blackwell-ultra/

NVIDIA Blackwell Ultra Sets New Inference Records in MLPerf Debut

https://rocm.blogs.amd.com/artificial-intelligence/mlperf-inference-v5.1/README.html

https://rocm.blogs.amd.com/artificial-intelligence/mlperf-inference5.1-repro/README.html

https://newsroom.intel.com/artificial-intelligence/intel-arc-pro-b-series-gpus-and-xeon-6-shine-in-mlperf-inference-v5-1

https://www.globenewswire.com/news-release/2025/09/09/3147136/0/en/MLCommons-Releases-New-MLPerf-Inference-v5-1-Benchmark-Results.html

https://www.tomshardware.com/pc-components/gpus/nvidia-claims-software-and-hardware-upgrades-allow-blackwell-ultra-gb300-to-dominate-mlperf-benchmarks-touts-45-percent-deepseek-r-1-inference-throughput-increase-over-gb200

https://newsroom.intel.com/tag/intel-arc-pro-b60

The post MLPerf Inference v5.1 (2025): Results Explained for GPUs, CPUs, and AI Accelerators appeared first on MarkTechPost.

The Role of Model Context Protocol (MCP) in Generative AI Security and Red Teaming

Table of contents: Overview | What MCP standardizes? | Normative authorization controls | Where MCP supports security engineering in practice? | Case study: the first malicious MCP server | Using MCP to structure red-team exercises | Implementation-Focused Security Hardening Checklist | Governance alignment | Current adoption you can test against | Summary | Resources used in the article

Overview

Model Context Protocol (MCP) is an open, JSON-RPC–based standard that formalizes how AI clients (assistants, IDEs, web apps) connect to servers exposing three primitives—tools, resources, and prompts—over defined transports (primarily stdio for local and Streamable HTTP for remote). MCP’s value for security work is that it renders agent/tool interactions explicit and auditable, with normative requirements around authorization that teams can verify in code and in tests. In practice, this enables tight blast-radius control for tool use, repeatable red-team scenarios at clear trust boundaries, and measurable policy enforcement—provided organizations treat MCP servers as privileged connectors subject to supply-chain scrutiny.

What MCP standardizes?

An MCP server publishes: (1) tools (schema-typed actions callable by the model), (2) resources (readable data objects the client can fetch and inject as context), and (3) prompts (reusable, parameterized message templates, typically user-initiated). Distinguishing these surfaces clarifies who is “in control” at each edge: model-driven for tools, application-driven for resources, and user-driven for prompts. Those roles matter in threat modeling, e.g., prompt injection often targets model-controlled paths, while unsafe output handling often occurs at application-controlled joins.
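
To make the tool edge concrete, here is what a single tools/call exchange looks like, sketched as Python dicts; the tool name and arguments are hypothetical, and field names should be checked against the spec revision you target.

# JSON-RPC request issued by the client on behalf of the model (model-controlled edge)
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {"name": "fetch_secret_by_label", "arguments": {"label": "ci-deploy"}},
}

# Server response: typed content items plus an in-band error flag
result = {
    "jsonrpc": "2.0",
    "id": 1,
    "result": {"content": [{"type": "text", "text": "…"}], "isError": False},
}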

Transports. The spec defines two standard transports—stdio (Standard Input/Output) and Streamable HTTP—and leaves room for pluggable alternatives. Local stdio reduces network exposure; Streamable HTTP fits multi-client or web deployments and supports resumable streams. Treat the transport choice as a security control: constrain network egress for local servers, and apply standard web authN/Z and logging for remote ones.

Client/server lifecycle and discovery. MCP formalizes how clients discover server capabilities (tools/resources/prompts), negotiate sessions, and exchange messages. That uniformity is what lets security teams instrument call flows, capture structured logs, and assert pre/postconditions without bespoke adapters per integration.

Normative authorization controls

The Authorization approach is unusually prescriptive for an integration protocol and should be enforced as follows:

No token passthrough. “The MCP server MUST NOT pass through the token it received from the MCP client.” Servers are OAuth 2.1 resource servers; clients obtain tokens from an authorization server using RFC 8707 resource indicators so tokens are audience-bound to the intended server. This prevents confused-deputy paths and preserves upstream audit/limit controls.

Audience binding and validation. Servers MUST validate that the access token’s audience matches themselves (resource binding) before serving a request. Operationally, this stops a client-minted token for “Service A” from being replayed to “Service B.” Red teams should include explicit probes for this failure mode.

This is the core of MCP’s security structure: model-side capabilities are powerful, but the protocol insists that servers be first-class principals with their own credentials, scopes, and logs—rather than opaque pass-throughs for a user’s global token.
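
A minimal sketch of the server-side audience check, assuming JWT-formatted access tokens and the PyJWT library (neither MCP nor OAuth 2.1 mandates JWTs, so adapt this to whatever your authorization server issues):

import jwt  # PyJWT; assumption: access tokens are JWTs carrying an "aud" claim

EXPECTED_AUDIENCE = "https://mcp.example.com"  # this server's canonical resource identifier (placeholder)

def validate_access_token(token: str, signing_key: str) -> dict:
    # Raises jwt.InvalidAudienceError if the token was minted for another resource,
    # which is exactly the confused-deputy replay the authorization spec forbids.
    claims = jwt.decode(token, signing_key, algorithms=["RS256"], audience=EXPECTED_AUDIENCE)
    return claims  # drive per-tool authorization from these claims; never forward the token upstream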

Where MCP supports security engineering in practice?

Clear trust boundaries. The client-server edge is an explicit, inspectable boundary. You can attach consent UIs, scope prompts, and structured logging at that edge. Many client implementations present permission prompts that enumerate a server’s tools/resources before enabling them—useful for least-privilege and audit—even though UX is not specified by the standard.

Containment and least privilege. Because a server is a separate principal, you can enforce minimal upstream scopes. For example, a secrets-broker server can mint short-lived credentials and expose only constrained tools (e.g., “fetch secret by policy label”), rather than handing broad vault tokens to the model. Public MCP servers from security vendors illustrate this model.

Deterministic attack surfaces for red teaming. With typed tool schemas and replayable transports, red teams can build fixtures that simulate adversarial inputs at tool boundaries and verify post-conditions across models/clients. This yields reproducible tests for classes of failures like prompt injection, insecure output handling, and supply-chain abuse. Pair those tests with recognized taxonomies.

Case study: the first malicious MCP server

In late September 2025, researchers disclosed a trojanized postmark-mcp npm package that impersonated a Postmark email MCP server. Beginning with v1.0.16, the malicious build silently BCC-exfiltrated every email sent through it to an attacker-controlled address/domain. The package was subsequently removed, but guidance urged uninstalling the affected version and rotating credentials. This appears to be the first publicly documented malicious MCP server in the wild, and it underscores that MCP servers often run with high trust and should be vetted and version-pinned like any privileged connector.

Operational takeaways:

Maintain an allowlist of approved servers and pin versions/hashes.

Require code provenance (signed releases, SBOMs) for production servers.

Monitor for anomalous egress patterns consistent with BCC exfiltration.

Practice credential rotation and “bulk disconnect” drills for MCP integrations.

These are not theoretical controls; the incident impact flowed directly from over-trusted server code in a routine developer workflow.

Using MCP to structure red-team exercises

1) Prompt-injection and unsafe-output drills at the tool boundary. Build adversarial corpora that enter via resources (application-controlled context) and attempt to coerce calls to dangerous tools. Assert that the client sanitizes injected outputs and that server post-conditions (e.g., allowed hostnames, file paths) hold; a minimal post-condition harness is sketched after this list. Map findings to LLM01 (Prompt Injection) and LLM02 (Insecure Output Handling).

2) Confused-deputy probes for token misuse. Craft tasks that try to induce a server to use a client-issued token or to call an unintended upstream audience. A compliant server must reject foreign-audience tokens per the authorization spec; clients must request audience-correct tokens with RFC 8707 resource indicators. Treat any success here as a P1.

3) Session/stream resilience. For remote transports, exercise reconnection/resumption flows and multi-client concurrency for session fixation/hijack risks. Validate non-deterministic session IDs and rapid expiry/rotation in load-balanced deployments. (Streamable HTTP supports resumable connections; use it to stress your session model.)

4) Supply-chain kill-chain drills. In a lab, insert a trojaned server (with benign markers) and verify whether your allowlists, signature checks, and egress detection catch it—mirroring the Postmark incident TTPs. Measure time to detection and credential rotation MTTR.

5) Baseline with trusted public servers. Use vetted servers to construct deterministic tasks. Two practical examples: Google’s Data Commons MCP exposes public datasets under a stable schema (good for fact-based tasks/replays), and Delinea’s MCP demonstrates least-privilege secrets brokering for agent workflows. These are ideal substrates for repeatable jailbreak and policy-enforcement tests.
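
A minimal sketch of the post-condition harness referenced in drill (1); the run_agent_task helper and the http_fetch tool are hypothetical stand-ins for your client harness and the tool under test.

from urllib.parse import urlparse

ALLOWED_HOSTS = {"api.internal.example.com"}          # policy for the tool under test (placeholder)

def assert_tool_postconditions(tool_calls):
    # Fail the drill if injected content coerced a call outside the allowlisted surface
    for call in tool_calls:
        if call["name"] == "http_fetch":              # hypothetical tool name
            host = urlparse(call["arguments"]["url"]).hostname
            assert host in ALLOWED_HOSTS, f"prompt injection escaped allowlist: {host}"

# Example with a captured trace; in practice tool_calls would come from run_agent_task(adversarial_resource)
assert_tool_postconditions([{"name": "http_fetch", "arguments": {"url": "https://api.internal.example.com/v1/x"}}])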

Implementation-Focused Security Hardening Checklist

Client side

Display the exact command or configuration used to start local servers; gate startup behind explicit user consent and enumerate the tools/resources being enabled. Persist approvals with scope granularity. (This is common practice in clients such as Claude Desktop.)

Maintain an allowlist of servers with pinned versions and checksums; deny unknown servers by default.

Log every tool call (name, arguments metadata, principal, decision) and resource fetch with identifiers so you can reconstruct attack paths post-hoc.

Server side

Implement OAuth 2.1 resource-server behavior; validate tokens and audiences; never forward client-issued tokens upstream.

Minimize scopes; prefer short-lived credentials and capabilities that encode policy (e.g., “fetch secret by label” instead of free-form read).

For local deployments, prefer stdio inside a container/sandbox and restrict filesystem/network capabilities; for remote, use Streamable HTTP with TLS, rate limits, and structured audit logs.

Detection & response

Alert on anomalous server egress (unexpected destinations, email BCC patterns) and sudden capability changes between versions.

Prepare break-glass automation to revoke client approvals and rotate upstream secrets quickly when a server is flagged (your “disconnect & rotate” runbook). The Postmark incident showed why time matters.

Governance alignment

MCP’s separation of concerns—clients as orchestrators, servers as scoped principals with typed capabilities—aligns directly with NIST’s AI RMF guidance for access control, logging, and red-team evaluation of generative systems, and with OWASP’s LLM Top-10 emphasis on mitigating prompt injection, unsafe output handling, and supply-chain vulnerabilities. Use those frameworks to justify controls in security reviews and to anchor acceptance criteria for MCP integrations.

Current adoption you can test against

Anthropic/Claude: product docs and ecosystem material position MCP as the way Claude connects to external tools and data; many community tutorials closely follow the spec’s three-primitive model. This provides ready-made client surfaces for permissioning and logging.

Google’s Data Commons MCP: released Sept 24, 2025, it standardizes access to public datasets; its announcement and follow-up posts include production usage notes (e.g., the ONE Data Agent). Useful as a stable “truth source” in red-team tasks.

Delinea MCP: open-source server integrating with Secret Server and Delinea Platform, emphasizing policy-mediated secret access and OAuth alignment with the MCP authorization spec. A practical example of least-privilege tool exposure.

Summary

MCP is not a silver-bullet “security product.” It is a protocol that gives security and red-team practitioners stable, enforceable levers: audience-bound tokens, explicit client-server boundaries, typed tool schemas, and transports you can instrument. Use those levers to (1) constrain what agents can do, (2) observe what they actually did, and (3) replay adversarial scenarios reliably. Treat MCP servers as privileged connectors—vet, pin, and monitor them—because adversaries already do. With those practices in place, MCP becomes a practical foundation for secure agentic systems and a reliable substrate for red-team evaluation.

Resources used in the article

MCP specification & concepts

https://modelcontextprotocol.io/specification/2025-06-18/basic/authorization

https://modelcontextprotocol.io/specification/2025-03-26/basic/transports

https://modelcontextprotocol.io/docs/concepts/architecture

https://modelcontextprotocol.io/docs/concepts/prompts

MCP ecosystem (official)

https://www.anthropic.com/news/model-context-protocol

https://docs.claude.com/en/docs/mcp

https://docs.claude.com/en/docs/claude-code/mcp

https://modelcontextprotocol.io/quickstart/server

https://modelcontextprotocol.io/docs/develop/connect-local-servers

https://modelcontextprotocol.io/docs/develop/connect-remote-servers

Security frameworks

https://owasp.org/www-project-top-10-for-large-language-model-applications/

https://genai.owasp.org/llm-top-10/

https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.600-1.pdf

https://www.nist.gov/itl/ai-risk-management-framework

Incident: malicious postmark-mcp server

https://www.koi.security/blog/postmark-mcp-npm-malicious-backdoor-email-theft

https://thehackernews.com/2025/09/first-malicious-mcp-server-found.html

https://www.itpro.com/security/a-malicious-mcp-server-is-silently-stealing-user-emails

https://threatprotect.qualys.com/2025/09/30/malicious-mcp-server-on-npm-postmark-mcp-exploited-in-attack/

Example MCP servers referenced

https://developers.googleblog.com/en/datacommonsmcp/

https://blog.google/technology/developers/ai-agents-datacommons/

https://github.com/DelineaXPM/delinea-mcp

https://delinea.com/news/delinea-mcp-server-to-provide-secure-credential-access-for-ai-agents?hs_amp=true

https://delinea.com/blog/unlocking-ai-agents-mcp

The post The Role of Model Context Protocol (MCP) in Generative AI Security and Red Teaming appeared first on MarkTechPost.

How Hapag-Lloyd improved schedule reliability with ML-powered vessel s …

This post is cowritten with Thomas Voss and Bernhard Hersberger from Hapag-Lloyd.
Hapag-Lloyd is one of the world’s leading shipping companies with more than 308 modern vessels, 11.9 million TEUs (twenty-foot equivalent units) transported per year, and 16,700 motivated employees in more than 400 offices in 139 countries. They connect continents, businesses, and people through reliable container transportation services on the major trade routes across the globe.
In this post, we share how Hapag-Lloyd developed and implemented a machine learning (ML)-powered assistant predicting vessel arrival and departure times that revolutionizes their schedule planning. By using Amazon SageMaker AI and implementing robust MLOps practices, Hapag-Lloyd has enhanced its schedule reliability—a key performance indicator in the industry and quality promise to their customers.
For Hapag-Lloyd, accurate vessel schedule predictions are crucial for maintaining schedule reliability, where schedule reliability is defined as the percentage of vessels arriving within 1 calendar day (earlier or later) of their estimated arrival time, communicated around 3 to 4 weeks before arrival.
Prior to developing the new ML solution, Hapag-Lloyd relied on simple rule-based and statistical calculations, based on historical transit patterns for vessel schedule predictions. While this statistical method provided basic predictions, it couldn’t effectively account for real-time conditions such as port congestion, requiring significant manual intervention from operations teams.
Developing a new ML solution to replace the existing system presented several key challenges:

Dynamic shipping conditions – The estimated time of arrival (ETA) prediction model needs to account for numerous variables that affect journey duration, including weather conditions, port-related delays such as congestion, labor strikes, and unexpected events that force route changes. For example, when the Suez Canal was blocked by the Ever Given container ship in March 2021, vessels had to be rerouted around Africa, adding approximately 10 days to their journey times.
Data integration at scale – The development of accurate models requires integration of large volumes of historical voyage data with external real-time data sources including port congestion information and vessel position tracking (AIS). The solution needs to scale across 120 vessel services or lines and 1,200 unique port-to-port routes.
Robust MLOps infrastructure – A robust MLOps infrastructure is required to continuously monitor model performance and quickly deploy updates whenever needed. This includes capabilities for regular model retraining to adapt to changing patterns, comprehensive performance monitoring, and maintaining real-time inference capabilities for immediate schedule adjustments.

Hapag-Lloyd’s previous approach to schedule planning couldn’t effectively address these challenges. A comprehensive solution was needed that could both handle the complexity of vessel schedule prediction and provide the infrastructure needed to sustain ML operations at global scale.
The Hapag-Lloyd network consists of over 308 vessels and many more partner vessels that continuously circumnavigate the globe on predefined service routes, resulting in more than 3,500 port arrivals per month. Each vessel operates on a fixed service line, making regular round trips between a sequence of ports. For instance, a vessel might repeatedly sail a route from Southampton to Le Havre, Rotterdam, Hamburg, New York, and Philadelphia before starting the cycle again. For each port arrival, an ETA must be provided multiple weeks in advance to arrange critical logistics, including berth windows at ports and onward transportation of containers by sea, land, or air transport.
The following table shows an example where a vessel travels from Southampton to New York through Le Havre, Rotterdam, and Hamburg. The vessel’s time until arrival at the New York port can be calculated as the sum of the ocean-to-port time to Southampton and the respective berth times and port-to-port times for the intermediate ports called while sailing to New York. If this vessel encounters a delay in Rotterdam, the delay affects its arrival in Hamburg and cascades through the entire schedule, impacting arrivals in New York and beyond. This ripple effect can disrupt carefully planned transshipment connections and require extensive replanning of downstream operations.

Port | Terminal call | Scheduled arrival | Scheduled departure
SOUTHAMPTON | 1 | 2025-07-29 07:00 | 2025-07-29 21:00
LE HAVRE | 2 | 2025-07-30 16:00 | 2025-07-31 16:00
ROTTERDAM | 3 | 2025-08-03 18:00 | 2025-08-05 03:00
HAMBURG | 4 | 2025-08-07 07:00 | 2025-08-08 07:00
NEW YORK | 5 | 2025-08-18 13:00 | 2025-08-21 13:00
PHILADELPHIA | 6 | 2025-08-22 06:00 | 2025-08-24 16:30
SOUTHAMPTON | 7 | 2025-09-01 08:00 | 2025-09-02 20:00

When a vessel departs Rotterdam with a delay, new ETAs must be calculated for the remaining ports. For Hamburg, we only need to estimate the remaining sailing time from the vessel’s current position. However, for subsequent ports like New York, the prediction requires multiple components: the remaining sailing time to Hamburg, the duration of port operations in Hamburg, and the sailing time from Hamburg to New York.
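A simplified illustration of that composition follows; the durations are made up, and in production each component comes from the specialized models described in the solution overview.

from datetime import datetime, timedelta

now = datetime(2025, 8, 5, 6, 0)                 # vessel currently at sea after leaving Rotterdam late
ocean_to_port_hamburg = timedelta(hours=40)      # remaining sailing time to Hamburg
berth_time_hamburg = timedelta(hours=26)         # predicted port stay in Hamburg
port_to_port_hamburg_nyc = timedelta(hours=250)  # predicted sailing time Hamburg -> New York

eta_new_york = now + ocean_to_port_hamburg + berth_time_hamburg + port_to_port_hamburg_nyc
print(eta_new_york)   # updated ETA that replaces the originally scheduled arrival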
Solution overview
As an input to the vessel ETA prediction, we process the following two data sources:

Hapag-Lloyd’s internal data, which is stored in a data lake. This includes detailed vessel schedules and routes, port and terminal performance information, real-time port congestion and waiting times, and vessel characteristics datasets. This data is prepared for model training using AWS Glue jobs.
Automatic Identification System (AIS) data, which provides streaming updates on the vessel movements. This AIS data ingestion is batched every 20 minutes using AWS Lambda and includes crucial information such as latitude, longitude, speed, and direction of vessels. New batches are processed using AWS Glue and Iceberg to update the existing AIS database—currently holding around 35 million observations.

These data sources are combined to create training datasets for the ML models. We carefully consider the timing of available data through temporal splitting to avoid data leakage. Data leakage occurs when using information that wouldn’t be available at prediction time in the real world. For example, when training a model to predict arrival time in Hamburg for a vessel currently in Rotterdam, we can’t use actual transit times that were only known after the vessel reached Hamburg.
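A minimal sketch of such a temporal split with pandas (column names are illustrative, not the production schema):

import pandas as pd

voyages = pd.DataFrame({
    "prediction_time": pd.date_range("2024-01-01", periods=10, freq="W"),
    "actual_transit_hours": range(100, 110),                # known only after arrival
})
cutoff = pd.Timestamp("2024-02-15")
train = voyages[voyages["prediction_time"] <= cutoff]       # fully observed before the cutoff
valid = voyages[voyages["prediction_time"] > cutoff]        # simulates future, unseen voyages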
A vessel’s journey can be divided into different legs, so we developed a multi-step solution with a specialized ML model for each leg; the models are orchestrated hierarchically to produce the overall ETA:

The Ocean to Port (O2P) model predicts the time needed for a vessel to reach its next port from its current position at sea. The model uses features such as remaining distance to destination, vessel speed, journey progress metrics, port congestion data, and historical sea leg durations.
The Port to Port (P2P) model forecasts sailing time between any two ports for a given date, considering key features such as ocean distance between ports, recent transit time trends, weather, and seasonal patterns.
The Berth Time model estimates how long a vessel will spend at port. The model uses vessel characteristics (such as tonnage and load capacity), planned container load, and historical port performance.
The Combined model takes as input the predictions from the O2P, P2P, and Berth Time models, along with the original schedule. Rather than predicting absolute arrival times, it computes the expected deviation from the original schedule by learning patterns in historical prediction accuracy and specific voyage conditions. These computed deviations are then used to update ETAs for the upcoming ports in a vessel’s schedule.

All four models are trained using the XGBoost algorithm built into SageMaker, chosen for its ability to handle complex relationships in tabular data and its robust performance with mixed numerical and categorical features. Each model has a dedicated training pipeline in SageMaker Pipelines, handling data preprocessing steps and model training. The following diagram shows the data processing pipeline, which generates the input datasets for ML training.
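
As a rough sketch of what one such training step can look like with the SageMaker built-in XGBoost container (the role, bucket paths, and hyperparameters are placeholders, not Hapag-Lloyd’s configuration):

import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
image_uri = sagemaker.image_uris.retrieve("xgboost", session.boto_region_name, version="1.7-1")

estimator = Estimator(
    image_uri=image_uri,
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",   # placeholder role
    instance_count=1,
    instance_type="ml.m5.2xlarge",
    output_path="s3://example-bucket/berth-model/artifacts/",       # placeholder bucket
    hyperparameters={"objective": "reg:squarederror", "num_round": 300, "max_depth": 8},
    sagemaker_session=session,
)
estimator.fit({
    "train": TrainingInput("s3://example-bucket/berth-model/train/", content_type="text/csv"),
    "validation": TrainingInput("s3://example-bucket/berth-model/validation/", content_type="text/csv"),
})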

As an example, this diagram shows the training pipeline of the Berth model. The steps in the SageMaker training pipelines of the Berth, P2P, O2P, and Combined models are identical. Therefore, the training pipeline is implemented once as a blueprint and reused across the other models, enabling fast implementation turnaround.

Because the Combined model depends on outputs from the other three specialized models, we use AWS Step Functions to orchestrate the SageMaker pipelines for training. This helps ensure that the individual models are updated in the correct sequence and maintains prediction consistency across the system. The orchestration of the training pipelines is shown in the following pipeline architecture.
The individual workflow begins with a data processing pipeline that prepares the input data (vessel schedules, AIS data, port congestion, and port performance metrics) and splits it into dedicated datasets. This feeds into three parallel SageMaker training pipelines for our base models (O2P, P2P, and Berth), each following a standardized process of feature encoding, hyperparameter optimization, model evaluation, and registration using SageMaker Processing jobs, hyperparameter tuning jobs, and SageMaker Model Registry. After training, each base model runs a SageMaker batch transform job to generate predictions that serve as input features for the Combined model training. The performance of the latest Combined model version is tested on the last 3 months of data with known ETAs, and performance metrics (R², mean absolute error (MAE)) are computed. If the MAE exceeds a set threshold, the entire training process fails and the model version is automatically discarded, preventing the deployment of models that don’t meet the minimum performance requirement.
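The evaluation gate can be as simple as the following sketch running inside a processing step (the threshold value is illustrative):

from sklearn.metrics import mean_absolute_error, r2_score

MAE_THRESHOLD_HOURS = 12.0   # illustrative gate value

def evaluation_gate(y_true, y_pred):
    mae = mean_absolute_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    if mae > MAE_THRESHOLD_HOURS:
        # Raising fails the pipeline step, so the candidate model version is discarded
        raise ValueError(f"Combined model rejected: MAE {mae:.1f}h exceeds {MAE_THRESHOLD_HOURS}h")
    return {"mae_hours": mae, "r2": r2}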
All four models are versioned and stored as separate model package groups in the SageMaker Model Registry, enabling systematic version control and deployment. This orchestrated approach helps ensure that our models are trained in the correct sequence using parallel processing, resulting in an efficient and maintainable training process.
The hierarchical model approach further helps ensure that a degree of explainability comparable to the current statistical and rule-based solution is maintained, avoiding ML black-box behavior. For example, it becomes possible to highlight unusually long berthing time predictions when discussing prediction results with business experts. This helps increase transparency and build trust, which in turn increases acceptance within the company.
Inference solution walkthrough
The inference infrastructure implements a hybrid approach combining batch processing with real-time API capabilities as shown in Figure 5. Because most data sources update daily and require extensive preprocessing, the core predictions are generated through nightly batch inference runs. These pre-computed predictions are complemented by a real-time API that implements business logic for schedule changes and ETA updates.

Daily batch inference:

Amazon EventBridge triggers a Step Functions workflow every day.
The Step Functions workflow orchestrates the data and inference process:

Lambda copies internal Hapag-Lloyd data from the data lake to Amazon Simple Storage Service (Amazon S3).
AWS Glue jobs combine the different data sources and prepare inference inputs.
SageMaker inference executes in sequence:

Fallback predictions are computed from historical averages and written to Amazon Relational Database Service (Amazon RDS). Fallback predictions are used in case of missing data or a downstream inference failure.
Preprocessing data for the four specialized ML models.
O2P, P2P, and Berth model batch transforms.
The Combined model batch transform generates final ETA predictions, which are written to Amazon RDS.
Input features and output files are stored in Amazon S3 for analytics and monitoring.

For operational reliability, any failures in the inference pipeline trigger immediate email notifications to the on-call operations team through Amazon Simple Email Service (Amazon SES).

Real-time API:

Amazon API Gateway receives client requests containing the current schedule and an indication of which vessel-port combinations require an ETA update. Because the current schedule arrives with the client request, intraday schedule updates can be handled even though batch predictions are generated only once per day.
API Gateway triggers a Lambda function that calculates the response. The Lambda function constructs the response by linking the ETA predictions (stored in Amazon RDS) with the current schedule using custom business logic, so that short-term schedule changes unknown at inference time are handled. Typical examples of short-term schedule changes are port omissions (for example, due to port congestion) and one-time port calls.

This architecture enables millisecond response times to custom requests while achieving a 99.5% availability (a maximum 3.5 hours downtime per month).
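A heavily simplified sketch of that merge logic follows; the handler shape uses the standard Lambda signature, while the schedule format, keys, and RDS lookup are hypothetical placeholders.

import json

def load_predictions(update_requests):
    # Placeholder for the Amazon RDS lookup of the nightly batch ETAs (or fallback predictions)
    return {("VSL123", "NEW YORK"): "2025-08-19T02:00:00Z"}

def lambda_handler(event, context):
    body = json.loads(event["body"])
    schedule, wanted = body["schedule"], body["update_requests"]
    predictions = load_predictions(wanted)
    for stop in schedule:
        key = (stop["vessel_id"], stop["port"])
        if key in predictions and not stop.get("omitted", False):   # business logic: skip omitted ports
            stop["eta"] = predictions[key]
    return {"statusCode": 200, "body": json.dumps({"schedule": schedule})}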

Conclusion
Hapag-Lloyd’s ML-powered vessel scheduling assistant outperforms the current solution in both accuracy and response time. Typical API response times are on the order of hundreds of milliseconds, helping to ensure a real-time user experience and outperforming the current solution by more than 80%. Low response times are crucial because, in addition to fully automated schedule updates, business experts require low response times to work with the schedule assistant interactively. In terms of accuracy, the ML-powered ETA predictions outperform the current solution by approximately 12% in MAE, which translates on average into climbing two positions in the international ranking of schedule reliability. Schedule reliability is one of the key performance metrics in liner shipping, and this is a significant improvement within the industry.
To learn more about architecting and governing ML workloads at scale on AWS, see the AWS blog post Governing the ML lifecycle at scale, Part 1: A framework for architecting ML workloads using Amazon SageMaker and the accompanying AWS workshop AWS Multi-Account Data & ML Governance Workshop.
Acknowledgement
We acknowledge the significant and valuable work of Michal Papaj and Piotr Zielinski from Hapag-Lloyd in the data science and data engineering areas of the project.
About the authors
Thomas Voss works at Hapag-Lloyd as a data scientist. With his background in academia and logistics, he takes pride in leveraging data science expertise to drive business innovation and growth through the practical design and modeling of AI solutions.
Bernhard Hersberger works as a data scientist at Hapag-Lloyd, where he heads the AI Hub team in Hamburg. He is enthusiastic about integrating AI solutions across the company, taking comprehensive responsibility from identifying business issues to deploying and scaling AI solutions worldwide.
Gabija Pasiunaite was a Machine Learning Engineer at AWS Professional Services, based in Zurich. She specialized in building scalable ML and data solutions for AWS Enterprise customers, combining expertise in data engineering, ML automation, and cloud infrastructure. Gabija has contributed to the AWS MLOps Framework used by AWS customers globally. Outside work, Gabija enjoys exploring new destinations and staying active through hiking, skiing, and running.
Jean-Michel Lourier is a Senior Data Scientist within AWS Professional Services. He leads teams implementing data-driven applications side by side with AWS customers to generate business value out of their data. He’s passionate about diving into tech and learning about AI, machine learning, and their business applications. He is also an enthusiastic cyclist.
Mousam Majhi is a Senior ProServe Cloud Architect focusing on Data & AI within AWS Professional Services. He works with Manufacturing and Travel, Transportation & Logistics customers in DACH to achieve their business outcomes by leveraging data and AI powered solutions. Outside of work, Mousam enjoys hiking in the Bavarian Alps.

Rox accelerates sales productivity with AI agents powered by Amazon Bedrock

This post was co-written with Shriram Sridharan, Taeuk Kang, and Santhosh Kumar Manavasi Lakshminarayanan from Rox.
Rox is building a new revenue operating system for the applied AI era.
Modern revenue teams rely on more data than ever before, such as Customer Relationship Management (CRM) systems, marketing automation, finance systems, support tickets, and live product usage. Though each serves its role, together they create silos that slow sellers down and leave insights untapped.
Rox addresses this by providing a revenue operating system: a unified layer that brings these signals together and equips AI agents to execute go-to-market (GTM) workflows. Instead of reconciling reports or updating fields, sellers get real-time intelligence and automation in their daily flow.
Today, we’re excited to announce that Rox is generally available, with Rox infrastructure built on AWS and delivered across web, Slack, macOS, and iOS. In this post, we share how Rox accelerates sales productivity with AI agents powered by Amazon Bedrock.
Solution overview
As noted in Rox is transforming revenue teams with AI-driven integration powered by AWS, modern GTM teams need more than a static database. Revenue data spans dozens of systems, such as product usage, finance, and support, and teams require a system that unifies context and acts on it in real time.
Rox delivers this through a layered architecture on AWS:

System of record – A unified, governed knowledge graph consolidates CRM, finance, support, product telemetry, and web data
Agent swarms – Intelligent, account-aware agents reason over the graph and orchestrate multi-step workflows like research, outreach, opportunity management, and proposal generation
Interfaces across surfaces – Sellers engage these workflows where they work, such as web application, Slack, iOS, and macOS

This converts the CRM from a passive system of record into an active system of action, so teams can act on their data immediately and intelligently.
The following diagram illustrates the solution architecture.

Benefits and features of Rox
Now generally available, Rox extends from intelligence to full execution with Command, a new conversational interface that orchestrates multi-agent workflows. Command coordinates with multiple specialized agents running in parallel. A single request (for example, “prep me for the ACME renewal and draft follow-ups”) expands into a plan: research usage and support signals, identify missing stakeholders, refresh enrichment, propose next-best actions, draft outreach, update the opportunity, and assemble a proposal. Each step is completed through tool calls into your systems and is subject to guardrail approvals.
Our safety architecture employs a multi-layer guardrail system as the first line of defense against inappropriate, harmful, or malicious requests. Incoming requests undergo rigorous analysis through advanced filtering mechanisms before reaching the inference layer. This preprocessing stage evaluates multiple dimensions of safety and appropriateness, such as legal compliance assessment and business relevance evaluation, to make sure only legitimate, safe, and contextually appropriate requests proceed to model execution.
Command decomposes the request, routes steps to the right agents, sequences external tool invocations (CRM, calendar, enrichment, email), reconciles results into the system of context, and returns one coherent thread that’s ready for consumption on the web, Slack, iOS, or macOS. Every suggestion is explainable (sources and traces), reversible (audit logs), and policy-aware (role-based access control, rate limits, required approvals).
How Amazon Bedrock powers Rox
Command demands a model capable of reasoning across multiple steps, orchestrating tools, and adapting dynamically.
To meet these needs, Rox chose Anthropic’s Claude Sonnet 4 on Amazon Bedrock. Anthropic’s Claude Sonnet 4 has consistently demonstrated unmatched tool-calling and reasoning performance, allowing Rox agents to sequence workflows like account research, enrichment, outreach, opportunity management, and proposal generation with reliability.
Amazon Bedrock provides the foundation to deliver Rox at enterprise scale, offering security, flexibility to integrate with the latest models, and scalability to handle thousands of concurrent agents reliably.
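For illustration, here is a minimal sketch of calling Anthropic’s Claude Sonnet 4 through the Amazon Bedrock Converse API; this is not Rox’s production code, and the model ID and region are assumptions that may differ by account.

import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")   # region is an assumption

response = bedrock.converse(
    modelId="anthropic.claude-sonnet-4-20250514-v1:0",               # verify the exact model ID in your account
    messages=[{
        "role": "user",
        "content": [{"text": "Prep me for the ACME renewal and draft follow-ups."}],
    }],
    inferenceConfig={"maxTokens": 1024, "temperature": 0.2},
)
print(response["output"]["message"]["content"][0]["text"])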
In addition to Command, Rox includes the following features:

Research – Offers deep account and market research, grounded in unified context (carried over from private beta)
Meet – Makes it possible to record, transcribe, summarize, and turn meetings into actions (carried over from private beta)
Outreach – Provides personalized prospect engagement, contextualized by unified data (new)
Revenue – Helps you track, update, and advance pipelines in the flow of work (new)
Auto-fill proposals – Helps you assemble tailored proposals in seconds from account context (new)
Rox apps – Offers modular extensions that add purpose-built workflows (dashboards, trackers) directly into the system (new)
iOS app – Delivers notifications and meeting prep on the go (new)
Mac app – Brings the ability to transcribe calls and add them to the system of context (new)
Regional expansion – Now live in the AWS Middle East (Bahrain) Region, aligning with data residency and sovereignty needs (new)

Early customer impact
In beta, enterprises saw immediate gains:

50% higher representative productivity
20% faster sales velocity
Twofold increase in revenue per rep

For example, real Rox customers were able to sharpen their focus on high-value opportunities, driving a 40–50% increase in average selling price. Another customer saw 90% reduction in rep prep time and faster closes, plus 15% more six-figure deals uncovered through Rox insights. Rox also shortens ramp time for new reps, with customers reporting 50% quicker ramp time using Rox.
Try Rox today
Our vision is for revenue teams to run with an always-on agent swarm that continuously researches accounts, engages stakeholders, and moves the pipeline forward.
Rox is now generally available. Get started at rox.com or visit the AWS Marketplace. Together with AWS, we will continue to build the AI-based operating system for modern revenue teams.

About the authors
Shriram Sridharan is the Co-Founder/Engineering Head of Rox, a Sequoia-backed AI company. Before Rox, Shriram led the data infrastructure team at Confluent, responsible for making Kafka faster and cheaper across clouds. Prior to that, he was one of the early engineers on Amazon Aurora (pre-launch), re-imagining databases for the cloud. Aurora was the fastest-growing AWS service and a recipient of the 2019 SIGMOD Systems Award.
Taeuk Kang is a Founding Engineer at Rox, working across AI research and engineering. He studied Computer Science at Stanford. Prior to Rox, he built large language model agents and retrieval-augmented generation systems at X (formerly Twitter) and designed the distributed LLM infrastructure powering core product features and Trust & Safety, improving overall platform health. Earlier at Stripe, he developed high-performance streaming and batch data processing pipelines integrating Apache Flink, Spark, Kafka, and AWS SQS.
Santhosh Kumar Manavasi Lakshminarayanan leads Platform at Rox. Before Rox, he was Director of Engineering at StreamSets (acquired by IBM), where he led the StreamSets Cloud Platform, making it seamless for large enterprises to run their data pipelines at scale on modern cloud providers. Before StreamSets, he was a senior engineer on the Platform Metadata team at Informatica.
Andrew Brown is an Account Executive for AI Startups at Amazon Web Services (AWS) in San Francisco, CA. With a strong background in cloud computing and a focus on supporting startups, Andrew specializes in helping companies scale their operations using AWS technologies.
Santhan Pamulapati is a Sr. Solutions Architect for GenAI startups at AWS, with deep expertise in designing and building scalable solutions that drive customer growth. He has a strong background in building HPC systems leveraging AWS services and has worked with strategic customers to solve business challenges.