Agentic Design Methodology: How to Build Reliable and Human-Like AI Agents using Parlant

Building robust AI agents differs fundamentally from traditional software development, as it centers on probabilistic model behavior rather than deterministic code execution. This guide provides a neutral overview of methodologies for designing AI agents that are both reliable and adaptable, with an emphasis on creating clear boundaries, effective behaviors, and safe interactions.

What Is Agentic Design?

Agentic design refers to constructing AI systems capable of independent action within defined parameters. Unlike conventional coding, which specifies exact outcomes for inputs, agentic systems require designers to articulate desirable behaviors and trust the model to navigate specifics.

Variability in AI Responses

Traditional software outputs remain constant for identical inputs. In contrast, agentic systems—based on probabilistic models—produce varied yet contextually appropriate responses each time. This makes effective prompt and guideline design critical for both human-likeness and safety.

In an agentic system, a request like “Can you help me reset my password?” might elicit different yet appropriate replies such as “Of course! Please tell me your username,” “Absolutely, let’s get started—what’s your email address?” or “I can assist with that. Do you remember your account ID?” This variability is purposeful, designed to enhance user experience by mimicking the nuance and flexibility of human dialogue. At the same time, this unpredictability requires thoughtful guidelines and safeguards so the system responds safely and consistently across scenarios.

Why Clear Instructions Matter

Language models interpret instructions rather than execute them literally. Vague guidance such as:

agent.create_guideline(
    condition="User expresses frustration",
    action="Try to make them happy"
)

can lead to unpredictable or unsafe behavior, such as unintended offers or promises. Instead, instructions should be concrete, specific, and safe:

agent.create_guideline(
    condition="User is upset by a delayed delivery",
    action="Acknowledge the delay, apologize, and provide a status update"
)

This approach ensures the model’s actions align with organizational policy and user expectations.

Building Compliance: Layers of Control

LLMs can’t be fully “controlled,” but you can still guide and constrain their behavior effectively.

Layer 1: Guidelines

Use guidelines to define and shape normal behavior.

await agent.create_guideline(
    condition="Customer asks about topics outside your scope",
    action="Politely decline and redirect to what you can help with"
)

Layer 2: Canned Responses

For high-risk situations (such as policy or medical advice), use pre-approved canned responses to ensure consistency and safety.

await agent.create_canned_response(
    template="I can help with account questions, but for policy details I'll connect you to a specialist."
)

This layered approach minimizes risk and ensures the agent never improvises in sensitive situations.

Tool Calling: When Agents Take Action

When AI agents take action using tools such as APIs or functions, the process involves more complexity than simply executing a command. For example, if a user says, “Schedule a meeting with Sarah for next week,” the agent must interpret several unclear elements: Which Sarah is being referred to? What specific day and time within “next week” should the meeting be scheduled? And on which calendar?

This illustrates the Parameter Guessing Problem, where the agent attempts to infer missing details that weren’t explicitly provided. To address this, tools should be designed with clear purpose descriptions, parameter hints, and contextual examples to reduce ambiguity. Additionally, tool names should be intuitive and parameter types consistent, helping the agent reliably select and populate inputs. Well-structured tools improve accuracy, reduce errors, and make the interactions smoother and more predictable for both the agent and the user.

This thoughtful tool design practice is essential for effective, safe agent functionality in real-world applications.
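To make this concrete, here is a minimal sketch of a well-described tool signature; the function name, parameters, and docstring are illustrative assumptions rather than part of any particular SDK:

from datetime import datetime
from typing import Optional

def schedule_meeting(
    attendee_email: str,                 # exact email, so the agent cannot guess "which Sarah"
    start_time: datetime,                # a concrete datetime, not a vague phrase like "next week"
    duration_minutes: int = 30,          # documented default the model can rely on
    calendar_id: Optional[str] = None,   # None means "ask the user which calendar to use"
) -> str:
    """Schedule a meeting on the given calendar.

    If any required detail is missing or ambiguous (attendee, exact time,
    calendar), the agent should ask the user rather than guess.
    Returns a confirmation string describing the booked slot.
    """
    calendar = calendar_id or "primary"
    return (f"Booked a {duration_minutes}-minute meeting with {attendee_email} "
            f"on calendar '{calendar}' at {start_time.isoformat()}")

Explicit types, named defaults, and a docstring that spells out when to ask instead of guessing leave the model far less room to invent parameters.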

Agent Design Is Iterative

Unlike static software, agent behavior in agentic systems is not fixed; it matures over time through a continuous cycle of observation, evaluation, and refinement. The process typically begins with implementing straightforward, high-frequency user scenarios—those “happy path” interactions where the agent’s responses can be easily anticipated and validated. Once deployed in a safe testing environment, the agent’s behavior is closely monitored for unexpected answers, user confusion, or any breaches of policy guidelines.

As issues are observed, the agent is systematically improved by introducing targeted rules or refining existing logic to address problematic cases. For example, if users repeatedly decline an upsell offer but the agent continues to bring it up, a focused rule can be added to prevent this behavior within the same session. Through this deliberate, incremental tuning, the agent gradually evolves from a basic prototype into a sophisticated conversational system that is responsive, reliable, and well-aligned with both user expectations and operational constraints.
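For instance, the repeated-upsell case can be handled with a narrowly scoped rule. The following is a hedged sketch using the same create_guideline style as the earlier examples:

await agent.create_guideline(
    condition="Customer has already declined the upsell offer earlier in this session",
    action="Do not raise the upsell again; continue helping with their original request"
)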

Writing Effective Guidelines

Each guideline has three key parts: a condition describing when it applies, an action describing what the agent should do, and an optional set of tools the agent may call while doing it.

Example:

await agent.create_guideline(
    condition="Customer requests a specific appointment time that's unavailable",
    action="Offer the three closest available slots as alternatives",
    tools=[get_available_slots]
)

Structured Conversations: Journeys

For complex tasks such as booking appointments, onboarding, or troubleshooting, simple guidelines alone are often insufficient. This is where Journeys become essential. Journeys provide a framework to design structured, multi-step conversational flows that guide the user through a process smoothly while maintaining a natural dialogue.

For example, a booking flow can be initiated by creating a journey with a clear title and conditions defining when it applies, such as when a customer wants to schedule an appointment. The journey then progresses through states—first asking the customer what type of service they need, then checking availability using an appropriate tool, and finally offering available time slots. This structured approach balances flexibility and control, enabling the agent to handle complex interactions efficiently without losing the conversational feel.

Example: Booking Flow

booking_journey = await agent.create_journey(
    title="Book Appointment",
    conditions=["Customer wants to schedule an appointment"],
    description="Guide customer through the booking process"
)

t1 = await booking_journey.initial_state.transition_to(
    chat_state="Ask what type of service they need"
)
t2 = await t1.target.transition_to(
    tool_state=check_availability_for_service
)
t3 = await t2.target.transition_to(
    chat_state="Offer available time slots"
)

Balancing Flexibility and Predictability

Balancing flexibility and predictability is essential when designing an AI agent. The agent should feel natural and conversational, rather than overly scripted, but it must still operate within safe and consistent boundaries. 

If instructions are too rigid—for example, telling the agent to “Say exactly: ‘Our premium plan is $99/month’”—the interaction can feel mechanical and unnatural. On the other hand, instructions that are too vague, such as “Help them understand our pricing”, can lead to unpredictable or inconsistent responses.

A balanced approach provides clear direction while allowing the agent some adaptability, for example: “Explain our pricing tiers clearly, highlight the value, and ask about the customer’s needs to recommend the best fit.” This ensures the agent remains both reliable and engaging in its interactions.
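Expressed as a guideline in the style used earlier (a sketch, not a prescribed policy), that balanced instruction might look like this:

await agent.create_guideline(
    condition="Customer asks about pricing or plan options",
    action=(
        "Explain the pricing tiers clearly, highlight the value of each, "
        "and ask about the customer's needs to recommend the best fit"
    )
)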

Designing for Real Conversations

Designing for real conversations requires recognizing that, unlike web forms, conversations are non-linear. Users may change their minds, skip steps, or move the discussion in unexpected directions. To handle this effectively, there are several key principles to follow. 

Context preservation ensures the agent keeps track of information already provided so it can respond appropriately. 

Progressive disclosure means revealing options or information gradually, rather than overwhelming the user with everything at once. 

Recovery mechanisms allow the agent to manage misunderstandings or deviations gracefully, for example by rephrasing a response or gently redirecting the conversation for clarity. 
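As a concrete illustration, a recovery mechanism can itself be captured as a guideline; this is a hedged sketch in the same style as the earlier examples:

await agent.create_guideline(
    condition="The customer's reply is ambiguous or contradicts information they gave earlier",
    action=(
        "Briefly summarize what you understood so far, ask one clarifying question, "
        "and confirm before proceeding"
    )
)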

This approach helps create interactions that feel natural, flexible, and user-friendly.

Effective agentic design means starting with core features, focusing on main tasks before tackling rare cases. It involves careful monitoring to spot any issues in the agent’s behavior. Improvements should be based on real observations, adding clear rules to guide better responses. It’s important to balance clear boundaries that keep the agent safe while allowing natural, flexible conversation. For complex tasks, use structured flows called journeys to guide multi-step interactions. Finally, be transparent about what the agent can do and its limits to set proper expectations. This simple process helps create reliable, user-friendly AI agents.

Google Proposes TUMIX: Multi-Agent Test-Time Scaling With Tool-Use Mixture

What if, instead of re-sampling one agent, you could push Gemini-2.5 Pro to 34.1% on HLE by mixing 12–15 tool-using agents that share notes and stop early? Google Cloud AI Research, with collaborators from MIT, Harvard, and Google DeepMind, introduced TUMIX (Tool-Use Mixture)—a test-time framework that ensembles heterogeneous agent styles (text-only, code, search, guided variants) and lets them share intermediate answers over a few refinement rounds, then stop early via an LLM-based judge. The result: higher accuracy at lower cost on hard reasoning benchmarks such as HLE, GPQA-Diamond, and AIME (2024/2025).

https://arxiv.org/pdf/2510.01279

So, what exactly is new?

Mixture over modality, not just more samples: TUMIX runs ~15 agent styles spanning Chain-of-Thought (CoT), code execution, web search, dual-tool agents, and guided variants. Each round, every agent sees (a) the original question and (b) other agents’ previous answers, then proposes a refined answer. This message-passing raises average accuracy early while diversity gradually collapses—so stopping matters.

Adaptive early-termination: An LLM-as-Judge halts refinement once answers exhibit strong consensus (with a minimum round threshold). This preserves accuracy at ~49% of the inference cost vs. fixed-round refinement; token cost drops to ~46% because late rounds are token-heavier.

Auto-designed agents: Beyond human-crafted agents, TUMIX prompts the base LLM to generate new agent types; mixing these with the manual set yields an additional ~+1.2% average lift without extra cost. The empirical “sweet spot” is ~12–15 agent styles.


How does it work?

TUMIX runs a group of heterogeneous agents—text-only Chain-of-Thought, code-executing, web-searching, and guided variants—in parallel, then iterates a small number of refinement rounds where each agent conditions on the original question plus the other agents’ prior rationales and answers (structured note-sharing). After each round, an LLM-based judge evaluates consensus/consistency to decide early termination; if confidence is insufficient, another round is triggered, otherwise the system finalizes via simple aggregation (e.g., majority vote or selector). This mixture-of-tool-use design trades brute-force re-sampling for diverse reasoning paths, improving coverage of correct candidates while controlling token/tool budgets; empirically, benefits saturate around 12–15 agent styles, and stopping early preserves diversity and lowers cost without sacrificing accuracy.
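To make the loop concrete, here is a minimal, self-contained sketch of the share-and-refine pattern with judge-based early stopping. The toy agents and the simple agreement rule are stand-ins for illustration, not the paper's implementation (which uses an LLM judge and real tool-using agents):

from collections import Counter
from typing import Callable, List

Agent = Callable[[str, List[str]], str]  # (question, peers' previous answers) -> answer

def mixture_refine(question: str, agents: List[Agent],
                   max_rounds: int = 3, min_rounds: int = 1,
                   consensus: float = 0.8) -> str:
    """Run heterogeneous agents in rounds, sharing answers between rounds,
    and stop early once a simple judge sees enough agreement."""
    answers = [agent(question, []) for agent in agents]  # round 1: no shared notes yet
    for round_idx in range(1, max_rounds):
        top, count = Counter(answers).most_common(1)[0]
        if round_idx >= min_rounds and count / len(answers) >= consensus:
            return top  # early termination: strong consensus reached
        # each agent refines its answer conditioned on everyone's previous answers
        answers = [agent(question, answers) for agent in agents]
    return Counter(answers).most_common(1)[0][0]  # final aggregation: majority vote

# toy stand-ins for CoT / code / search agent styles
cot_agent = lambda q, peers: "42"
code_agent = lambda q, peers: "42" if peers else "41"
search_agent = lambda q, peers: "42"

print(mixture_refine("What is 6 * 7?", [cot_agent, code_agent, search_agent]))  # -> 42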

Let's discuss the results

Under comparable inference budgets to strong tool-augmented baselines (Self-MoA, Symbolic-MoE, DEI, SciMaster, GSA), TUMIX yields the best average accuracy; a scaled variant (TUMIX+) pushes further with more compute:

HLE (Humanity’s Last Exam): Pro: 21.6% → 34.1% (TUMIX+); Flash: 9.7% → 23.1%. (HLE is a 2,500-question, difficult, multi-domain benchmark finalized in 2025.)

GPQA-Diamond: Pro: up to 88.3%; Flash: up to 82.1%. (GPQA-Diamond is the hardest 198-question subset authored by domain experts.)

AIME 2024/25: Pro: 96.7%; Flash: 86.7% with TUMIX(+) at test time.

Across tasks, TUMIX averages +3.55% over the best prior tool-augmented test-time scaling baseline at similar cost, and +7.8% / +17.4% over no-scaling for Pro/Flash, respectively.


Our Comments

TUMIX is a great approach from Google because it frames test-time scaling as a search problem over heterogeneous tool policies rather than brute-force sampling. The parallel committee (text, code, search) improves candidate coverage, while the LLM-judge enables early-stop that preserves diversity and reduces token/tool spend—useful under latency budgets. The HLE gains (34.1% with Gemini-2.5 Pro) align with the benchmark’s finalized 2,500-question design, and the ~12–15 agent styles “sweet spot” indicates selection—not generation—is the limiting factor.

Check out the Paper.

Can a Small Language Model Predict Kernel Latency, Memory, and Model Accuracy from Code? A New Regression Language Model (RLM) Says Yes

Researchers from Cornell and Google introduce a unified Regression Language Model (RLM) that predicts numeric outcomes directly from code strings—covering GPU kernel latency, program memory usage, and even neural network accuracy and latency—without hand-engineered features. A 300M-parameter encoder–decoder initialized from T5-Gemma achieves strong rank correlations across heterogeneous tasks and languages, using a single text-to-number decoder that emits digits with constrained decoding.

What exactly is new?

Unified code-to-metric regression: One RLM predicts (i) peak memory from high-level code (Python/C/C++ and more), (ii) latency for Triton GPU kernels, and (iii) accuracy and hardware-specific latency from ONNX graphs—by reading raw text representations and decoding numeric outputs. No feature engineering, graph encoders, or zero-cost proxies are required.

Concrete results: Reported correlations include Spearman ρ ≈ 0.93 on APPS LeetCode memory, ρ ≈ 0.52 for Triton kernel latency, ρ > 0.5 average across 17 CodeNet languages, and Kendall τ ≈ 0.46 across five classic NAS spaces—competitive with and in some cases surpassing graph-based predictors.

Multi-objective decoding: Because the decoder is autoregressive, the model conditions later metrics on earlier ones (e.g., accuracy → per-device latencies), capturing realistic trade-offs along Pareto fronts.

https://arxiv.org/abs/2509.26476

Why is this important?

Performance prediction pipelines in compilers, GPU kernel selection, and NAS typically rely on bespoke features, syntax trees, or GNN encoders that are brittle to new ops/languages. Treating regression as next-token prediction over numbers standardizes the stack: tokenize inputs as plain text (source code, Triton IR, ONNX), then decode calibrated numeric strings digit-by-digit with constrained sampling. This reduces maintenance cost and improves transfer to new tasks via fine-tuning.

Data and benchmarks

Code-Regression dataset (HF): Curated to support code-to-metric tasks spanning APPS/LeetCode runs, Triton kernel latencies (KernelBook-derived), and CodeNet memory footprints.

NAS/ONNX suite: Architectures from NASBench-101/201, FBNet, Once-for-All (MB/PN/RN), Twopath, Hiaml, Inception, and NDS are exported to ONNX text to predict accuracy and device-specific latency.

How does it work?

Backbone: Encoder–decoder with a T5-Gemma encoder initialization (~300M params). Inputs are raw strings (code or ONNX). Outputs are numbers emitted as sign/exponent/mantissa digit tokens; constrained decoding enforces valid numerals and supports uncertainty via sampling.

Ablations: (i) Language pretraining accelerates convergence and improves Triton latency prediction; (ii) decoder-only numeric emission outperforms MSE regression heads even with y-normalization; (iii) learned tokenizers specialized for ONNX operators increase effective context; (iv) longer contexts help; (v) scaling to a larger Gemma encoder further improves correlation with adequate tuning.

Training code: The regress-lm library provides text-to-text regression utilities, constrained decoding, and multi-task pretraining/fine-tuning recipes.
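As a rough illustration of the sign/exponent/mantissa emission described above (an illustrative scheme, not the paper's actual tokenizer), the sketch below encodes a float into digit tokens and decodes it back; constrained decoding restricts the model to exactly this kind of vocabulary:

import math

def encode_number(x: float, mantissa_digits: int = 4):
    """Encode a float as sign / exponent / mantissa digit tokens,
    e.g. 0.00123 -> ['+', 'E-3', '1', '2', '3', '0']."""
    sign = '+' if x >= 0 else '-'
    x = abs(x)
    exponent = 0 if x == 0 else math.floor(math.log10(x))
    mantissa = 0 if x == 0 else x / 10 ** exponent  # normalized to [1, 10)
    digits = f"{mantissa:.{mantissa_digits - 1}f}".replace('.', '')[:mantissa_digits]
    return [sign, f"E{exponent}"] + list(digits)

def decode_number(tokens):
    """Invert encode_number: rebuild the float from its tokens."""
    sign = 1.0 if tokens[0] == '+' else -1.0
    exponent = int(tokens[1][1:])
    frac = tokens[3:]
    mantissa = int(tokens[2]) + (int(''.join(frac)) / 10 ** len(frac) if frac else 0.0)
    return sign * mantissa * 10 ** exponent

tokens = encode_number(0.00123)
print(tokens)                 # ['+', 'E-3', '1', '2', '3', '0']
print(decode_number(tokens))  # 0.00123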

Stats that matter

APPS (Python) memory: Spearman ρ > 0.9.

CodeNet (17 languages) memory: average ρ > 0.5; strongest languages include C/C++ (~0.74–0.75).

Triton kernels (A6000) latency: ρ ≈ 0.52.

NAS ranking: average Kendall τ ≈ 0.46 across NASNet, Amoeba, PNAS, ENAS, DARTS; competitive with FLAN and GNN baselines.

Key Takeaways

Unified code-to-metric regression works. A single ~300M-parameter T5Gemma-initialized model (“RLM”) predicts: (a) memory from high-level code, (b) Triton GPU kernel latency, and (c) model accuracy + device latency from ONNX—directly from text, no hand-engineered features.

The research shows Spearman ρ > 0.9 on APPS memory, ≈0.52 on Triton latency, >0.5 average across 17 CodeNet languages, and Kendall-τ ≈ 0.46 on five NAS spaces.

Numbers are decoded as text with constraints. Instead of a regression head, RLM emits numeric tokens with constrained decoding, enabling multi-metric, autoregressive outputs (e.g., accuracy followed by multi-device latencies) and uncertainty via sampling.

The Code-Regression dataset unifies APPS/LeetCode memory, Triton kernel latency, and CodeNet memory; the regress-lm library provides the training/decoding stack.

Our Comments

It is very interesting how this work reframes performance prediction as text-to-number generation: a compact T5Gemma-initialized RLM reads source (Python/C++), Triton kernels, or ONNX graphs and emits calibrated numerics via constrained decoding. The reported correlations—APPS memory (ρ>0.9), Triton latency on RTX A6000 (~0.52), and NAS Kendall-τ ≈0.46—are strong enough to matter for compiler heuristics, kernel pruning, and multi-objective NAS triage without bespoke features or GNNs. The open dataset and library make replication straightforward and lower the barrier to fine-tuning on new hardware or languages.

Check out the Paper, GitHub Page and Dataset Card.

A Coding Guide to Build an Autonomous Agentic AI for Time Series Forecasting with Darts and Hugging Face

In this tutorial, we build an advanced agentic AI system that autonomously handles time series forecasting using the Darts library combined with a lightweight HuggingFace model for reasoning. We design the agent to operate in a perception–reasoning–action cycle, where it first analyzes patterns in the data, then selects an appropriate forecasting model, generates predictions, and finally explains and visualizes the results. By walking through this pipeline, we experience how agentic AI can bring together statistical modeling and natural language reasoning to make forecasting both accurate and interpretable. Check out the FULL CODES here.

!pip install darts transformers pandas matplotlib numpy -q

import pandas as pd
import numpy as np
from darts import TimeSeries
from darts.models import ExponentialSmoothing, NaiveSeasonal, LinearRegressionModel
from darts.metrics import mape, rmse
from transformers import pipeline
import matplotlib.pyplot as plt
from datetime import datetime, timedelta

We begin by installing and importing the essential libraries, including Darts for time series forecasting, Transformers for reasoning, and supporting packages like pandas, NumPy, and matplotlib. With these tools in place, we set up the foundation to build and run our autonomous forecasting agent. Check out the FULL CODES here.

class TimeSeriesAgent:
    """Autonomous agent for time series analysis and forecasting"""

    def __init__(self):
        print(" Initializing Agent Brain...")
        self.llm = pipeline("text-generation", model="distilgpt2", max_length=150,
                            do_sample=True, temperature=0.7)

        self.models = {
            'exponential_smoothing': ExponentialSmoothing(),
            'naive_seasonal': NaiveSeasonal(K=12),
            'linear_regression': LinearRegressionModel(lags=12)
        }
        self.selected_model = None
        self.forecast = None

    def perceive(self, data):
        """Agent perceives and analyzes the time series data"""
        print("\n PERCEPTION PHASE")
        self.ts = TimeSeries.from_dataframe(data, 'date', 'value', freq='M')

        trend = "increasing" if data['value'].iloc[-1] > data['value'].iloc[0] else "decreasing"
        volatility = data['value'].std() / data['value'].mean()
        seasonality = self._detect_seasonality(data['value'])

        analysis = {
            'length': len(data),
            'trend': trend,
            'volatility': f"{volatility:.2f}",
            'has_seasonality': seasonality,
            'mean': f"{data['value'].mean():.2f}",
            'range': f"{data['value'].min():.2f} to {data['value'].max():.2f}"
        }

        print(f" Data Points: {analysis['length']}")
        print(f" Trend: {analysis['trend'].upper()}")
        print(f" Volatility: {analysis['volatility']}")
        print(f" Seasonality: {'Detected' if seasonality else 'Not detected'}")

        return analysis

    def _detect_seasonality(self, series, threshold=0.3):
        """Simple seasonality detection"""
        if len(series) < 24:
            return False
        acf = np.correlate(series - series.mean(), series - series.mean(), mode='full')
        acf = acf[len(acf)//2:]
        acf /= acf[0]
        return np.max(acf[12:24]) > threshold if len(acf) > 24 else False

    def reason(self, analysis):
        """Agent reasons about which model to use"""
        print("\n REASONING PHASE")

        prompt = (f"Time series analysis: {analysis['length']} data points, {analysis['trend']} trend, "
                  f"volatility {analysis['volatility']}, seasonality: {analysis['has_seasonality']}. ")

        thought = self.llm(prompt, max_length=100, num_return_sequences=1)[0]['generated_text']
        print(f" Agent Thinking: {thought[:150]}...")

        if analysis['has_seasonality']:
            self.selected_model = 'naive_seasonal'
            reason = "Seasonality detected - using Naive Seasonal model"
        elif float(analysis['volatility']) > 0.3:
            self.selected_model = 'exponential_smoothing'
            reason = "High volatility - using Exponential Smoothing"
        else:
            self.selected_model = 'linear_regression'
            reason = "Stable trend - using Linear Regression"

        print(f" Decision: {reason}")
        return self.selected_model

    def act(self, horizon=12):
        """Agent takes action: trains model and generates forecast"""
        print("\n ACTION PHASE")

        train, val = self.ts[:-12], self.ts[-12:]

        model = self.models[self.selected_model]
        print(f" Training {self.selected_model}...")
        model.fit(train)

        self.forecast = model.predict(horizon)

        if len(val) > 0:
            val_pred = model.predict(len(val))
            accuracy = 100 - mape(val, val_pred)
            print(f" Validation Accuracy: {accuracy:.2f}%")

        print(f" Generated {horizon}-step forecast")
        return self.forecast

    def explain(self):
        """Agent explains its predictions"""
        print("\n EXPLANATION PHASE")

        forecast_values = self.forecast.values().flatten()
        hist_values = self.ts.values().flatten()

        change = ((forecast_values[-1] - hist_values[-1]) / hist_values[-1]) * 100
        direction = "increase" if change > 0 else "decrease"

        explanation = (f"Based on my analysis using {self.selected_model}, "
                       f"I predict a {abs(change):.1f}% {direction} in the next period. "
                       f"Forecast range: {forecast_values.min():.2f} to {forecast_values.max():.2f}. "
                       f"Historical mean was {hist_values.mean():.2f}.")

        print(f" {explanation}")

        prompt = f"Forecast summary: {explanation} Explain implications:"
        summary = self.llm(prompt, max_length=120)[0]['generated_text']
        print(f"\n Agent Summary: {summary[:200]}...")

        return explanation

    def visualize(self):
        """Agent creates visualization of its work"""
        print("\n Generating visualization...")

        plt.figure(figsize=(14, 6))

        self.ts.plot(label='Historical Data', lw=2)

        self.forecast.plot(label=f'Forecast ({self.selected_model})',
                           lw=2, linestyle='--')

        plt.title('Agentic AI Time Series Forecast', fontsize=16, fontweight='bold')
        plt.xlabel('Date', fontsize=12)
        plt.ylabel('Value', fontsize=12)
        plt.legend(loc='best', fontsize=11)
        plt.grid(True, alpha=0.3)
        plt.tight_layout()
        plt.show()

We define a TimeSeriesAgent that thinks with a lightweight HuggingFace model and acts with a small portfolio of Darts models. We perceive patterns (trend, volatility, seasonality), reason to choose the best model, then train, forecast, and validate. Finally, we explain the prediction in plain language and visualize history versus forecast. Check out the FULL CODES here.

def create_sample_data():
    """Generate sample time series data"""
    dates = pd.date_range(start='2020-01-01', periods=48, freq='M')
    trend = np.linspace(100, 150, 48)
    seasonality = 10 * np.sin(np.linspace(0, 4*np.pi, 48))
    noise = np.random.normal(0, 3, 48)
    values = trend + seasonality + noise

    return pd.DataFrame({'date': dates, 'value': values})

We create a helper function create_sample_data() that generates synthetic time series data with a clear trend, sinusoidal seasonality, and random noise. This allows us to simulate realistic monthly data from 2020 to 2023 for testing and demonstrating the agent’s forecasting workflow. Check out the FULL CODES here.

def main():
    """Main execution: Agent autonomously handles forecasting task"""
    print("="*70)
    print(" AGENTIC AI TIME SERIES FORECASTING SYSTEM")
    print("="*70)

    print("\n Loading data...")
    data = create_sample_data()
    print(f"Loaded {len(data)} data points from 2020-01 to 2023-12")

    agent = TimeSeriesAgent()

    analysis = agent.perceive(data)
    agent.reason(analysis)
    agent.act(horizon=12)
    agent.explain()
    agent.visualize()

    print("\n" + "="*70)
    print(" AGENT COMPLETED FORECASTING TASK SUCCESSFULLY")
    print("="*70)

if __name__ == "__main__":
    main()

We define the main function that runs the full agentic AI pipeline. We load synthetic time series data, let the TimeSeriesAgent perceive patterns, reason to select the best model, act by training and forecasting, explain the results, and finally visualize them. This completes the end-to-end autonomous perception, reasoning, and action cycle.

In conclusion, we see how an autonomous agent can analyze time series data, reason about model selection, generate forecasts, and explain its predictions in natural language. By combining Darts with HuggingFace, we create a compact yet powerful framework that not only produces accurate forecasts but also clearly communicates insights. We complete the cycle with visualization, reinforcing how agentic AI makes forecasting more intuitive and interactive.

Check out the FULL CODES here.

Microsoft Releases ‘Microsoft Agent Framework’: An Open-Source SDK and Runtime that Simplifies the Orchestration of Multi-Agent Systems

Microsoft released the Microsoft Agent Framework (public preview), an open-source SDK and runtime that unifies core ideas from AutoGen (agent runtime and multi-agent patterns) with Semantic Kernel (enterprise controls, state, plugins) to help teams build, deploy, and observe production-grade AI agents and multi-agent workflows. The framework is available for Python and .NET and integrates directly with Azure AI Foundry’s Agent Service for scaling and operations.

What exactly is Microsoft shipping?

A consolidated agent runtime and API surface. The Agent Framework carries forward AutoGen’s single- and multi-agent abstractions while adding Semantic Kernel’s enterprise features: thread-based state management, type safety, filters, telemetry, and broad model/embedding support. Microsoft positions it as the successor built by the same teams, rather than a replacement that abandons either project.

First-class orchestration modes. It supports agent orchestration (LLM-driven decision-making) and workflow orchestration (deterministic, business-logic multi-agent flows), enabling hybrid systems where creative planning coexists with reliable handoffs and constraints.

Pro-code and platform interoperability. The base AIAgent interface is designed to swap chat model providers and to interoperate with Azure AI Foundry Agents, OpenAI Assistants, and Copilot Studio, reducing vendor lock-in at the application layer.

Open-source, multi-language SDKs under MIT license. The GitHub repo publishes Python and .NET packages with examples and CI/CD-friendly scaffolding. AutoGen remains maintained (bug fixes, security patches) with guidance to consider Agent Framework for new builds.

Where does it run in production?

Azure AI Foundry’s Agent Service provides the managed runtime: it links models, tools, and frameworks; manages thread state; enforces content safety and identity; and wires in observability. It also supports multi-agent orchestration natively and distinguishes itself from Copilot Studio’s low-code approach by targeting complex, pro-code enterprise scenarios.

But how is it connected to ‘AI economics’?

Enterprise AI economics are dominated by token throughput, latency, failure recovery, and observability. Microsoft’s consolidation addresses those by (a) giving one runtime abstraction for agent collaboration and tool use, (b) attaching production controls—telemetry, filters, identity/networking, safety—to the same abstraction, and (c) deploying onto a managed service that handles scaling, policy, and diagnostics. This reduces the “glue code” that typically drives cost and brittleness in multi-agent systems and aligns with Azure AI Foundry’s model-catalog + toolchain approach.

Architectural notes and developer surface

Runtime & state: Agents coordinate via a runtime that handles lifecycles, identities, communication, and security boundaries—concepts inherited and formalized from AutoGen. Threads are the unit of state, enabling reproducible runs, retries, and audits.

Functions & plugins: The framework leans on Semantic Kernel’s plugin architecture and function-calling to bind tools (code interpreters, custom functions) into agent policies with typed contracts. (

Model/provider flexibility: The same agent interface can target Azure OpenAI, OpenAI, local runtimes (e.g., Ollama/Foundry Local), and GitHub Models, enabling cost/performance tuning per task without rewriting orchestration logic.

Enterprise context

Microsoft frames the release as part of a broader push toward interoperable, standard-friendly “agentic” systems across Azure AI Foundry—consistent with prior statements about multi-agent collaboration, memory, and structured retrieval. Expect tighter ties to Foundry observability and governance controls as these stabilize.

Our Comments

We like this direction because it collapses two divergent stacks—AutoGen’s multi-agent runtime and Semantic Kernel’s enterprise plumbing—into one API surface with a managed path to production. The thread-based state model and OpenTelemetry hooks address the usual blind spots in agentic systems (repro, latency tracing, failure triage), and Azure AI Foundry’s Agent Service takes on identity, content safety, and tool orchestration so teams can iterate on policies instead of glue code. The Python/.NET parity and provider flexibility (Azure OpenAI, OpenAI, GitHub Models, local runtimes) also make cost/perf tuning practical without rewriting orchestration.


AWS Open-Sources an MCP Server for Bedrock AgentCore to Streamline AI Agent Development

AWS released an open-source Model Context Protocol (MCP) server for Amazon Bedrock AgentCore, providing a direct path from natural-language prompts in agentic IDEs to deployable agents on AgentCore Runtime. The package ships with automated transformations, environment provisioning, and Gateway/tooling hooks designed to compress typical multi-step integration work into conversational commands.

So, what exactly is it?

The “AgentCore MCP server” exposes task-specific tools to a client (e.g., Kiro, Claude Code, Cursor, Amazon Q Developer CLI, or the VS Code Q plugin) and guides the assistant to: (1) minimally refactor an existing agent to the AgentCore Runtime model; (2) provision and configure the AWS environment (credentials, roles/permissions, ECR, config files); (3) wire up AgentCore Gateway for tool calls; and (4) invoke and test the deployed agent—all from the IDE’s chat surface.

Practically, the server teaches your coding assistant to convert entry points to AgentCore handlers, add bedrock_agentcore imports, generate requirements.txt, and rewrite direct agent calls into payload-based handlers compatible with Runtime. It can then call the AgentCore CLI to deploy and exercise the agent, including end-to-end calls through Gateway tools.
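For orientation, here is a hedged sketch of what a payload-based Runtime handler roughly looks like after such a refactor. The module, class, and decorator names follow publicly documented bedrock_agentcore examples but should be treated as assumptions here, not as the server's exact output:

# Names below (BedrockAgentCoreApp, @app.entrypoint, app.run) follow public
# bedrock_agentcore samples; treat them as assumptions for illustration.
from bedrock_agentcore.runtime import BedrockAgentCoreApp

app = BedrockAgentCoreApp()

@app.entrypoint
def invoke(payload: dict) -> dict:
    # Before the refactor: the agent was called directly from a script.
    # After: it is invoked from a payload-based handler that AgentCore Runtime can call.
    user_text = payload.get("prompt", "")
    answer = f"(agent response to: {user_text})"  # stand-in for the real Strands/LangGraph agent call
    return {"result": answer}

if __name__ == "__main__":
    app.run()  # serve the handler locally or on AgentCore Runtime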

https://aws.amazon.com/blogs/machine-learning/accelerate-development-with-the-amazon-bedrock-agentcore-mcpserver/

How to install it, and what’s the client support?

AWS provides a one-click install flow from the GitHub repository, using a lightweight launcher (uvx) and a standard mcp.json entry that most MCP-capable clients consume. The AWS team lists the expected mcp.json locations for Kiro (.kiro/settings/mcp.json), Cursor (.cursor/mcp.json), Amazon Q CLI (~/.aws/amazonq/mcp.json), and Claude Code (~/.claude/mcp.json).

The repository sits in the awslabs “mcp” mono-repo (license Apache-2.0). While the AgentCore server directory hosts the implementation, the root repo also links to broader AWS MCP resources and documentation.

Architecture guidance and the “layered” context model

AWS recommends a layered approach to give the IDE’s assistant progressively richer context: start with the agentic client, then add the AWS Documentation MCP Server, layer in framework documentation (e.g., Strands Agents, LangGraph), include the AgentCore and agent-framework SDK docs, and finally steer recurrent workflows via per-IDE “steering files.” This arrangement reduces retrieval misses and helps the assistant plan the end-to-end transform/deploy/test loop without manual context switching.

Development workflow (typical path)

Bootstrap: Use local tools or MCP servers. Either provision a Lambda target for AgentCore Gateway or deploy the server directly to AgentCore Runtime.

Author/Refactor: Start from Strands Agents or LangGraph code. The server instructs the assistant to convert handlers, imports, and dependencies for Runtime compatibility.

Deploy: The assistant looks up relevant docs and invokes the AgentCore CLI to deploy.

Test & Iterate: Invoke the agent via natural language; if tools are needed, integrate Gateway (MCP client inside the agent), redeploy (v2), and retest.


How does it make a difference?

Most “agent frameworks” still require developers to learn cloud-specific runtimes, credentials, role policies, registries, and deployment CLIs before any useful iteration. AWS’s MCP server shifts that work into the IDE assistant and narrows the “prompt-to-production” gap. Since it’s just another MCP server, it composes with existing doc servers (AWS service docs, Strands, LangGraph) and can ride improvements in MCP-aware clients, making it a low-friction entry point for teams standardizing on Bedrock AgentCore.

Comments from MTP (Marktechpost team)

I like that AWS shipped a real MCP endpoint for AgentCore that my IDE can call directly. The uvx-based mcp.json config makes client hookup trivial (Cursor, Claude Code, Kiro, Amazon Q CLI), and the server’s tooling maps cleanly onto the AgentCore Runtime/Gateway/Memory stack while preserving existing Strands/LangGraph code paths. Practically, this collapses the prompt→refactor→deploy→test loop into a reproducible, scriptable workflow rather than bespoke glue code.

Check out the GitHub Repo and Technical details.

Neuphonic Open-Sources NeuTTS Air: A 748M-Parameter On-Device Speech Language Model with Instant Voice Cloning

Neuphonic has released NeuTTS Air, an open-source text-to-speech (TTS) speech language model designed to run locally in real time on CPUs. The Hugging Face model card lists 748M parameters (Qwen2 architecture) and ships in GGUF quantizations (Q4/Q8), enabling inference through llama.cpp/llama-cpp-python without cloud dependencies. It is licensed under Apache-2.0 and includes a runnable demo and examples.

So, what is new?

NeuTTS Air couples a 0.5B-class Qwen backbone with Neuphonic’s NeuCodec audio codec. Neuphonic positions the system as a “super-realistic, on-device” TTS LM that clones a voice from ~3 seconds of reference audio and synthesizes speech in that style, targeting voice agents and privacy-sensitive applications. The model card and repository explicitly emphasize real-time CPU generation and small-footprint deployment.

Key Features

Realism at sub-1B scale: Human-like prosody and timbre preservation for a ~0.7B (Qwen2-class) text-to-speech LM.

On-device deployment: Distributed in GGUF (Q4/Q8) with CPU-first paths; suitable for laptops, phones, and Raspberry Pi-class boards.

Instant speaker cloning: Style transfer from ~3 seconds of reference audio (reference WAV + transcript).

Compact LM+codec stack: Qwen 0.5B backbone paired with NeuCodec (0.8 kbps / 24 kHz) to balance latency, footprint, and output quality.

Model architecture and runtime path

Backbone: Qwen 0.5B used as a lightweight LM to condition speech generation; the hosted artifact is reported as 748M params under the qwen2 architecture on Hugging Face.

Codec: NeuCodec provides low-bitrate acoustic tokenization/decoding; it targets 0.8 kbps with 24 kHz output, enabling compact representations for efficient on-device use.

Quantization & format: Prebuilt GGUF backbones (Q4/Q8) are available; the repo includes instructions for llama-cpp-python and an optional ONNX decoder path.

Dependencies: Uses espeak for phonemization; examples and a Jupyter notebook are provided for end-to-end synthesis.

On-device performance focus

NeuTTS Air showcases ‘real-time generation on mid-range devices‘ and offers CPU-first defaults; GGUF quantization is intended for laptops and single-board computers. While no fps/RTF numbers are published on the card, the distribution targets local inference without a GPU and demonstrates a working flow through the provided examples and Space.

Voice cloning workflow

NeuTTS Air requires (1) a reference WAV and (2) the transcript text for that reference. It encodes the reference to style tokens and then synthesizes arbitrary text in the reference speaker’s timbre. The Neuphonic team recommends 3–15 s clean, mono audio and provides pre-encoded samples.

Privacy, responsibility, and watermarking

Neuphonic frames the model for on-device privacy (no audio/text leaves the machine without user’s approval) and notes that all generated audio includes a Perth (Perceptual Threshold) watermarker to support responsible use and provenance.

How it compares?

Open, local TTS systems exist (e.g., GGUF-based pipelines), but NeuTTS Air is notable for packaging a small LM + neural codec with instant cloning, CPU-first quantizations, and watermarking under a permissive license. The “world’s first super-realistic, on-device speech LM” phrasing is the vendor’s claim; the verifiable facts are the size, formats, cloning procedure, license, and provided runtimes.

Our Comments

The focus is on system trade-offs: a ~0.7B Qwen-class backbone with GGUF quantization paired with NeuCodec at 0.8 kbps/24 kHz is a pragmatic recipe for real-time, CPU-only TTS that preserves timbre using ~3–15 s style references while keeping latency and memory predictable. The Apache-2.0 licensing and built-in watermarking are deployment-friendly, but publishing RTF/latency on commodity CPUs and cloning-quality vs. reference-length curves would enable rigorous benchmarking against existing local pipelines. Operationally, an offline path with minimal dependencies (eSpeak, llama.cpp/ONNX) lowers privacy/compliance risk for edge agents without sacrificing intelligibility.

Check out the Model Card on Hugging Face and GitHub Page.

Unlock global AI inference scalability using new global cross-Region inference with Anthropic’s Claude Sonnet 4.5

Organizations are increasingly integrating generative AI capabilities into their applications to enhance customer experiences, streamline operations, and drive innovation. As generative AI workloads continue to grow in scale and importance, organizations face new challenges in maintaining consistent performance, reliability, and availability of their AI-powered applications. Customers are looking to scale their AI inference workloads across multiple AWS Regions to support consistent performance and reliability.
To address this need, we introduced cross-Region inference (CRIS) for Amazon Bedrock. This managed capability automatically routes inference requests across multiple Regions, enabling applications to handle traffic bursts seamlessly and achieve higher throughput without requiring developers to predict demand fluctuations or implement complex load-balancing mechanisms. CRIS works through inference profiles, which define a foundation model (FM) and the Regions to which requests can be routed.
We are excited to announce availability of global cross-Region inference with Anthropic’s Claude Sonnet 4.5 on Amazon Bedrock. Now, with cross-Region inference, you can choose either a geography-specific inference profile or a global inference profile. This evolution from geography-specific routing provides greater flexibility for organizations because Amazon Bedrock automatically selects the optimal commercial Region within that geography to process your inference request. Global CRIS further enhances cross-Region inference by enabling the routing of inference requests to supported commercial Regions worldwide, optimizing available resources and enabling higher model throughput. This helps support consistent performance and higher throughput, particularly during unplanned peak usage times. Additionally, global CRIS supports key Amazon Bedrock features, including prompt caching, batch inference, Amazon Bedrock Guardrails, Amazon Bedrock Knowledge Bases, and more.
In this post, we explore how global cross-Region inference works, the benefits it offers compared to Regional profiles, and how you can implement it in your own applications with Anthropic’s Claude Sonnet 4.5 to improve your AI applications’ performance and reliability.
Core functionality of global cross-Region inference
Global cross-Region inference helps organizations manage unplanned traffic bursts by using compute resources across different Regions. This section explores how this feature works and the technical mechanisms that power its functionality.
Understanding inference profiles
An inference profile in Amazon Bedrock defines an FM and one or more Regions to which it can route model invocation requests. The global cross-Region inference profile for Anthropic’s Claude Sonnet 4.5 extends this concept beyond geographic boundaries, allowing requests to be routed to one of the supported Amazon Bedrock commercial Regions globally, so you can prepare for unplanned traffic bursts by distributing traffic across multiple Regions.
Inference profiles operate on two key concepts:

Source Region – The Region from which the API request is made
Destination Region – A Region to which Amazon Bedrock can route the request for inference

At the time of writing, global CRIS supports over 20 source Regions, and the destination Region is a supported commercial Region dynamically chosen by Amazon Bedrock.
Intelligent request routing
Global cross-Region inference uses an intelligent request routing mechanism that considers multiple factors, including model availability, capacity, and latency, to route requests to the optimal Region. The system automatically selects the optimal available Region for your request without requiring manual configuration:

Regional capacity – The system considers the current load and available capacity in each potential destination Region.
Latency considerations – Although the system prioritizes availability, it also takes latency into account. By default, the service attempts to fulfill requests from the source Region when possible, but it can seamlessly route requests to other Regions as needed.
Availability metrics – The system continuously monitors the availability of FMs across Regions to support optimal routing decisions.

This intelligent routing system enables Amazon Bedrock to distribute traffic dynamically across the AWS global infrastructure, facilitating optimal availability for each request and smoother performance during high-usage periods.
Monitoring and logging
When using global cross-Region inference, Amazon CloudWatch and AWS CloudTrail continue to record log entries only in the source Region where the request originated. This simplifies monitoring and logging by maintaining all records in a single Region regardless of where the inference request is ultimately processed. To track which Region processed a request, CloudTrail events include an additionalEventData field with an inferenceRegion key that specifies the destination Region. Organizations can monitor and analyze the distribution of their inference requests across the AWS global infrastructure.
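For example, here is a minimal sketch (assuming CloudTrail log records shaped as described above, e.g. exported from an S3 delivery bucket) that extracts the destination Region from each record:

import json

def destination_regions(cloudtrail_log_json: str):
    """Yield (eventTime, inferenceRegion) pairs from a CloudTrail log file's Records,
    using the additionalEventData.inferenceRegion key described above."""
    log = json.loads(cloudtrail_log_json)
    for record in log.get('Records', []):
        region = record.get('additionalEventData', {}).get('inferenceRegion')
        if region:
            yield record.get('eventTime'), region

# hypothetical record illustrating the shape
sample = json.dumps({"Records": [{
    "eventTime": "2025-10-05T12:00:00Z",
    "eventName": "Converse",
    "awsRegion": "us-east-1",
    "additionalEventData": {"inferenceRegion": "ap-southeast-2"}
}]})

for when, region in destination_regions(sample):
    print(when, "->", region)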
Data security and compliance
Global cross-Region inference maintains high standards for data security. Data transmitted during cross-Region inference is encrypted and remains within the secure AWS network. Sensitive information remains protected throughout the inference process, regardless of which Region processes the request. Because security and compliance is a shared responsibility, you must also consider legal or compliance requirements that come with processing inference request in a different geographic location. Because global cross-Region inference allows requests to be routed globally, organizations with specific data residency or compliance requirements can elect, based on their compliance needs, to use geography-specific inference profiles to make sure data remains within certain Regions. This flexibility helps businesses balance redundancy and compliance needs based on their specific requirements.
Implement global cross-Region inference
To use global cross-Region inference with Anthropic’s Claude Sonnet 4.5, developers must complete the following key steps:

Use the global inference profile ID – When making API calls to Amazon Bedrock, specify the global Anthropic’s Claude Sonnet 4.5 inference profile ID (global.anthropic.claude-sonnet-4-5-20250929-v1:0) instead of a Region-specific model ID. This works with both InvokeModel and Converse APIs.
Configure IAM permissions – Grant appropriate AWS Identity and Access Management (IAM) permissions to access the inference profile and FMs in potential destination Regions. In the next section, we provide more details. You can also read more about prerequisites for inference profiles.

Implementing global cross-Region inference with Anthropic’s Claude Sonnet 4.5 is straightforward, requiring only a few changes to your existing application code. The following is an example of how to update your code in Python:

import boto3
import json

bedrock = boto3.client('bedrock-runtime', region_name='us-east-1')

model_id = "global.anthropic.claude-sonnet-4-5-20250929-v1:0"

response = bedrock.converse(
    messages=[{"role": "user", "content": [{"text": "Explain cloud computing in 2 sentences."}]}],
    modelId=model_id,
)

print("Response:", response['output']['message']['content'][0]['text'])
print("Tokens used:", response.get('usage', {}))

If you’re using the Amazon Bedrock InvokeModel API, you can quickly switch to a different model by changing the model ID, as shown in Invoke model code examples.
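For reference, here is a minimal sketch of the InvokeModel path with the same global inference profile ID, using the Anthropic messages request body documented for Claude models on Amazon Bedrock (treat the prompt and token limit as illustrative):

import boto3
import json

bedrock = boto3.client('bedrock-runtime', region_name='us-east-1')
model_id = "global.anthropic.claude-sonnet-4-5-20250929-v1:0"

body = json.dumps({
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 256,
    "messages": [
        {"role": "user", "content": [{"type": "text", "text": "Explain cloud computing in 2 sentences."}]}
    ],
})

response = bedrock.invoke_model(modelId=model_id, body=body)
result = json.loads(response["body"].read())
print(result["content"][0]["text"])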
IAM policy requirements for global CRIS
In this section, we discuss the IAM policy requirements for global CRIS.
Enable global CRIS
To enable global CRIS for your users, you must apply a three-part IAM policy to the role. The following is an example IAM policy to provide granular control. You can replace <REQUESTING REGION> in the example policy with the Region you are operating in.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "GrantGlobalCrisInferenceProfileRegionAccess",
            "Effect": "Allow",
            "Action": "bedrock:InvokeModel",
            "Resource": [
                "arn:aws:bedrock:<REQUESTING REGION>:<ACCOUNT>:inference-profile/global.<MODEL NAME>"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:RequestedRegion": "<REQUESTING REGION>"
                }
            }
        },
        {
            "Sid": "GrantGlobalCrisInferenceProfileInRegionModelAccess",
            "Effect": "Allow",
            "Action": "bedrock:InvokeModel",
            "Resource": [
                "arn:aws:bedrock:<REQUESTING REGION>::foundation-model/<MODEL NAME>"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:RequestedRegion": "<REQUESTING REGION>",
                    "bedrock:InferenceProfileArn": "arn:aws:bedrock:<REQUESTING REGION>:<ACCOUNT>:inference-profile/global.<MODEL NAME>"
                }
            }
        },
        {
            "Sid": "GrantGlobalCrisInferenceProfileGlobalModelAccess",
            "Effect": "Allow",
            "Action": "bedrock:InvokeModel",
            "Resource": [
                "arn:aws:bedrock:::foundation-model/<MODEL NAME>"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:RequestedRegion": "unspecified",
                    "bedrock:InferenceProfileArn": "arn:aws:bedrock:<REQUESTING REGION>:<ACCOUNT>:inference-profile/global.<MODEL NAME>"
                }
            }
        }
    ]
}

The first part of the policy grants access to the Regional inference profile in your requesting Region. This policy allows users to invoke the specified global CRIS inference profile from their requesting Region. The second part of the policy provides access to the Regional FM resource, which is necessary for the service to understand which model is being requested within the Regional context. The third part of the policy grants access to the global FM resource, which enables the cross-Region routing capability that makes global CRIS function. When implementing these policies, make sure all three resource Amazon Resource Names (ARNs) are included in your IAM statements:

The Regional inference profile ARN follows the pattern arn:aws:bedrock:REGION:ACCOUNT:inference-profile/global.MODEL-NAME. This is used to give access to the global inference profile in the source Region.
The Regional FM uses arn:aws:bedrock:REGION::foundation-model/MODEL-NAME. This is used to give access to the FM in the source Region.
The global FM requires arn:aws:bedrock:::foundation-model/MODEL-NAME. This is used to give access to the FM in different global Regions.

The global FM ARN has no Region or account specified, which is intentional and required for the cross-Region functionality.
To simplify onboarding, global CRIS doesn’t require complex changes to an organization’s existing Service Control Policies (SCPs) that might deny access to services in certain Regions. When you opt in to global CRIS using this three-part policy structure, Amazon Bedrock will process inference requests across commercial Regions without validating against Regions denied in other parts of SCPs. This prevents workload failures that could occur when global CRIS routes inference requests to new or previously unused Regions that might be blocked in your organization’s SCPs. However, if you have data residency requirements, you should carefully evaluate your use cases before implementing global CRIS, because requests might be processed in any supported commercial Region.
Disable global CRIS
You can choose from two primary approaches to implement deny policies to global CRIS for specific IAM roles, each with different use cases and implications:

Remove an IAM policy – The first method involves removing one or more of the three required IAM policies from user permissions. Because global CRIS requires all three policies to function, removing a policy will result in denied access.
Implement a deny policy – The second approach is to implement an explicit deny policy that specifically targets global CRIS inference profiles. This method provides clear documentation of your security intent and makes sure that even if someone accidentally adds the required allow policies later, the explicit deny will take precedence. The deny policy should use a StringEquals condition matching the pattern “aws:RequestedRegion”: “unspecified”. This pattern specifically targets inference profiles with the global prefix.

When implementing deny policies, it’s crucial to understand that global CRIS changes how the aws:RequestedRegion field behaves. Traditional Region-based deny policies that use StringEquals conditions with specific Region names such as “aws:RequestedRegion”: “us-west-2” will not work as expected with global CRIS because the service sets this field to global rather than the actual destination Region. However, as mentioned earlier, “aws:RequestedRegion”: “unspecified” will result in the deny effect.
Note: The onboarding and data residency considerations described earlier apply here as well. As a best practice, organizations that use geographic CRIS but want to opt out of global CRIS should implement the second approach (an explicit deny policy).
Request limit increases for global CRIS with Anthropic’s Claude Sonnet 4.5
When using global CRIS inference profiles, it’s important to understand that service quota management is centralized in the US East (N. Virginia) Region, even though you can use global CRIS from over 20 supported source Regions. Because the quota is a single global limit, requests to view, manage, or increase quotas for global cross-Region inference profiles must be made through the Service Quotas console or AWS Command Line Interface (AWS CLI) specifically in the US East (N. Virginia) Region. Quotas for global CRIS inference profiles will not appear in the Service Quotas console or AWS CLI for other source Regions, even when those Regions support global CRIS usage. This centralized approach makes it possible to manage your limits globally without having to estimate usage in individual Regions. If you don’t have access to US East (N. Virginia), reach out to your account team or AWS Support.
Complete the following steps to request a limit increase:

Sign in to the Service Quotas console in your AWS account.
Make sure your selected Region is US East (N. Virginia).
In the navigation pane, choose AWS services.
From the list of services, find and choose Amazon Bedrock.
In the list of quotas for Amazon Bedrock, use the search filter to find the specific global CRIS quotas. For example:

Global cross-Region model inference tokens per day for Anthropic Claude Sonnet 4.5 V1
Global cross-Region model inference tokens per minute for Anthropic Claude Sonnet 4.5 V1

Select the quota you want to increase.
Choose Request increase at account level.
Enter your desired new quota value.
Choose Request to submit your request.
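If you prefer to manage quotas programmatically, the following sketch uses the AWS SDK for Python (Boto3) against the Service Quotas API in US East (N. Virginia); the name filter and the commented-out quota code are illustrative assumptions, and you would take the real quota code from the listing output.

import boto3

# Global CRIS quotas are managed centrally in US East (N. Virginia)
sq = boto3.client("service-quotas", region_name="us-east-1")

# List Amazon Bedrock quotas and print the global CRIS entries
paginator = sq.get_paginator("list_service_quotas")
for page in paginator.paginate(ServiceCode="bedrock"):
    for quota in page["Quotas"]:
        if "Global cross-Region" in quota["QuotaName"]:  # assumed name pattern
            print(quota["QuotaName"], quota["QuotaCode"], quota["Value"])

# Request an increase for a quota identified above (placeholder quota code)
# sq.request_service_quota_increase(
#     ServiceCode="bedrock",
#     QuotaCode="L-XXXXXXXX",
#     DesiredValue=2000000,
# )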

Use global cross-Region inference with Anthropic’s Claude Sonnet 4.5
Claude Sonnet 4.5 is Anthropic’s most intelligent model (at the time of writing), and is best for coding and complex agents. Anthropic’s Claude Sonnet 4.5 demonstrates advancements in agent capabilities, with enhanced performance in tool handling, memory management, and context processing. The model shows marked improvements in code generation and analysis, including identifying optimal improvements and exercising stronger judgment in refactoring decisions. It particularly excels at autonomous long-horizon coding tasks, where it can effectively plan and execute complex software projects spanning hours or days while maintaining consistent performance and reliability throughout the development cycle.
Global cross-Region inference for Anthropic’s Claude Sonnet 4.5 delivers multiple advantages over traditional geographic cross-Region inference profiles:

Enhanced throughput during peak demand – Global cross-Region inference provides improved resilience during periods of peak demand by automatically routing requests to Regions with available capacity. This dynamic routing happens seamlessly without additional configuration or intervention from developers. Unlike traditional approaches that might require complex client-side load balancing between Regions, global cross-Region inference handles traffic spikes automatically. This is particularly important for business-critical applications where downtime or degraded performance can have significant financial or reputational impacts.
Cost-efficiency – Global cross-Region inference for Anthropic’s Claude Sonnet 4.5 offers approximately 10% savings on both input and output token pricing compared to geographic cross-Region inference. The price is calculated based on the Region from which the request is made (source Region). This means organizations can benefit from improved resilience with even lower costs. This pricing model makes global cross-Region inference a cost-effective solution for organizations looking to optimize their generative AI deployments. By improving resource utilization and enabling higher throughput without additional costs, it helps organizations maximize the value of their investment in Amazon Bedrock.
Streamlined monitoring – When using global cross-Region inference, CloudWatch and CloudTrail continue to record log entries in your source Region, simplifying observability and management. Even though your requests are processed across different Regions worldwide, you maintain a centralized view of your application’s performance and usage patterns through your familiar AWS monitoring tools.
On-demand quota flexibility – With global cross-Region inference, your workloads are no longer limited by individual Regional capacity. Instead of being restricted to the capacity available in a specific Region, your requests can be dynamically routed across the AWS global infrastructure. This provides access to a much larger pool of resources, making it less complicated to handle high-volume workloads and sudden traffic spikes.

If you’re currently using Anthropic’s Sonnet models on Amazon Bedrock, upgrading to Claude Sonnet 4.5 is a great opportunity to enhance your AI capabilities. It offers a significant leap in intelligence and capability as a straightforward, drop-in replacement at a comparable price point to Sonnet 4. The primary reason to switch is Sonnet 4.5’s superior performance across critical, high-value domains. It is Anthropic’s most powerful model so far for building complex agents, demonstrating state-of-the-art performance in coding, reasoning, and computer use. Furthermore, its advanced agentic capabilities, such as extended autonomous operation and more effective use of parallel tool calls, enable the creation of more sophisticated AI workflows.
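In code, switching to global CRIS is a matter of passing the global inference profile ID as the model ID. The following is a minimal sketch using the Amazon Bedrock Converse API with Boto3; the profile ID is illustrative (confirm the exact ID for Anthropic’s Claude Sonnet 4.5 in the Amazon Bedrock console), and any supported source Region can be used in place of us-east-1.

import boto3

# Illustrative global inference profile ID; confirm the exact value in the console
GLOBAL_PROFILE_ID = "global.anthropic.claude-sonnet-4-5-20250929-v1:0"

# Pricing is based on the source Region the request is made from
bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

response = bedrock_runtime.converse(
    modelId=GLOBAL_PROFILE_ID,
    messages=[
        {"role": "user", "content": [{"text": "Summarize the benefits of global cross-Region inference."}]}
    ],
    inferenceConfig={"maxTokens": 512, "temperature": 0.5},
)

print(response["output"]["message"]["content"][0]["text"])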
Conclusion
Amazon Bedrock global cross-Region inference for Anthropic’s Claude Sonnet 4.5 marks a significant evolution in AWS generative AI capabilities, enabling global routing of inference requests across the AWS worldwide infrastructure. With straightforward implementation and comprehensive monitoring through CloudTrail and CloudWatch, organizations can quickly use this powerful capability for their AI applications, high-volume workloads, and disaster recovery scenarios. We encourage you to try global cross-Region inference with Anthropic’s Claude Sonnet 4.5 in your own applications and experience the benefits firsthand. Start by updating your code to use the global inference profile ID, configure appropriate IAM permissions, and monitor your application’s performance as it uses the AWS global infrastructure to deliver enhanced resilience.
For more information about global cross-Region inference for Anthropic’s Claude Sonnet 4.5 in Amazon Bedrock, refer to Increase throughput with cross-Region inference, Supported Regions and models for inference profiles, and Use an inference profile in model invocation.

About the authors
Melanie Li, PhD, is a Senior Generative AI Specialist Solutions Architect at AWS based in Sydney, Australia, where her focus is on working with customers to build solutions using state-of-the-art AI/ML tools. She has been actively involved in multiple generative AI initiatives across APJ, harnessing the power of LLMs. Prior to joining AWS, Dr. Li held data science roles in the financial and retail industries.
Saurabh Trikande is a Senior Product Manager for Amazon Bedrock and Amazon SageMaker Inference. He is passionate about working with customers and partners, motivated by the goal of democratizing AI. He focuses on core challenges related to deploying complex AI applications, inference with multi-tenant models, cost optimizations, and making the deployment of generative AI models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch, and spending time with his family.
Derrick Choo is a Senior Solutions Architect at AWS who accelerates enterprise digital transformation through cloud adoption, AI/ML, and generative AI solutions. He specializes in full-stack development and ML, designing end-to-end solutions spanning frontend interfaces, IoT applications, data integrations, and ML models, with a particular focus on computer vision and multi-modal systems.
Satveer Khurpa is a Sr. WW Specialist Solutions Architect, Amazon Bedrock at Amazon Web Services. In this role, he uses his expertise in cloud-based architectures to develop innovative generative AI solutions for clients across diverse industries. Satveer’s deep understanding of generative AI technologies allows him to design scalable, secure, and responsible applications that unlock new business opportunities and drive tangible value.
Jared Dean is a Principal AI/ML Solutions Architect at AWS. Jared works with customers across industries to develop machine learning applications that improve efficiency. He is interested in all things AI, technology, and BBQ.
Jan Catarata is a software engineer working on Amazon Bedrock, where he focuses on designing robust distributed systems. When he’s not building scalable AI solutions, you can find him strategizing his next move with friends and family at game night.

Secure ingress connectivity to Amazon Bedrock AgentCore Gateway using …

Agentic AI applications represent a significant development in enterprise automation, where intelligent agents autonomously execute complex workflows, access sensitive datasets, and make real-time decisions across your organization’s infrastructure. Amazon Bedrock AgentCore accelerates enterprise AI transformation by providing fully managed services that remove infrastructure complexity, maintain session isolation, and enable seamless integration with enterprise tools so organizations can deploy trustworthy AI agents at scale. AgentCore Gateway, a modular service under AgentCore, simplifies integration by securely transforming APIs, AWS Lambda functions, and services into Model Context Protocol (MCP)-compatible tools and making them available to agents through a unified endpoint, with built-in authentication and serverless infrastructure that minimizes operational overhead.
In production environments, AI agents are typically deployed within virtual private clouds (VPCs) to maintain secure, isolated network access and to meet enterprise security and compliance requirements. Amazon Web Services (AWS) interface VPC endpoints can enhance agentic AI security by creating private connections between VPC-hosted agents and AgentCore Gateway, keeping sensitive communications within the secure infrastructure of AWS. These endpoints use dedicated network interfaces with private IP addresses to deliver reduced latency and superior performance through direct connectivity. Additionally, VPC interface endpoints offer granular access control through endpoint policies, streamline operations by avoiding proxy server management, reduce data transfer costs, and establish the secure foundation that autonomous AI systems require when processing confidential data in regulated environments at enterprise scale.
In this post, we demonstrate how to access AgentCore Gateway through a VPC interface endpoint from an Amazon Elastic Compute Cloud (Amazon EC2) instance in a VPC. We also show how to configure your VPC endpoint policy to provide secure access to the AgentCore Gateway while maintaining the principle of least privilege access.
Architecture overview
This architecture diagram illustrates a user accessing an application supported by backend agents deployed across various AWS compute services, including EC2 instances, Lambda functions, Amazon Elastic Kubernetes Service (Amazon EKS), or Amazon Elastic Container Service (Amazon ECS), all operating within a VPC environment. These agents communicate with AgentCore Gateway to discover, access, and invoke external tools and services that have been transformed into agent-compatible resources, such as enterprise APIs and Lambda functions. In the standard configuration, agent requests to AgentCore Gateway traverse the public internet. By implementing interface VPC endpoints, organizations can route these communications through the AWS secure internal network backbone instead, delivering significant benefits that can include enhanced security, reduced latency, and improved compliance alignment for regulated workloads that require strict network isolation and data protection standards. The solution follows this workflow:

AI agent interaction – An agent running within the VPC obtains the required inbound authorization from identity providers, authenticates with Gateway, and sends a tool-use request (invokes the MCP tool) to the gateway through the interface VPC endpoint.
Gateway processing – The gateway manages OAuth authorization to make sure only valid users and agents can access tools and resources. After authorizing the inbound request, it converts agent requests that use protocols like Model Context Protocol (MCP) into API requests and Lambda invocations.
Secure access – The gateway handles credential injection for each tool, enabling agents to use tools with different authentication requirements seamlessly. It uses AgentCore Identity to securely access backend resources (the targets) on behalf of the agent.
Target execution – The gateway data plane invokes the target, which can be a Lambda function, an OpenAPI specification, or a Smithy model.
Monitoring – AgentCore Gateway provides built-in observability and auditing. Additionally, AWS PrivateLink publishes metrics to Amazon CloudWatch for monitoring interface endpoints. You can optionally enable VPC Flow Logs to log IP traffic to AgentCore Gateway.

Be aware of the following key considerations:

Private and public network communication – The interface VPC endpoint enables secure communication for inbound traffic from agents to AgentCore Gateway through AWS PrivateLink, making sure this traffic remains within the private network. However, authentication workflows—including OAuth access token retrieval and credential exchange processes between agents and external Identity Provider systems for both inbound and outbound flows—and outbound access from the gateway to MCP tools continue to require internet connectivity for establishing secure sessions with identity systems and external resources hosted outside the AWS environment.
Data plane scope – It’s important to understand that, currently, the interface VPC endpoint support is applicable only to the data plane endpoints of your gateway—the runtime endpoints where your applications interact with agent tools. To clarify the distinction: although you can now access your gateway’s runtime endpoint through the interface VPC endpoint, the control plane operations, such as creating gateways, managing tools, and configuring security settings, must still be performed through the standard public AgentCore control plane endpoint (for example, bedrock-agentcore-control.<region>.amazonaws.com).

Prerequisites
To implement the solution, you need the following prerequisites:

An AWS account with appropriate AWS Identity and Access Management (IAM) permissions for VPC and Amazon Elastic Compute Cloud (Amazon EC2) management
Existing VPC setup with subnet configuration and route tables
AgentCore Gateway already provisioned and configured in your AWS account
Basic understanding of VPC networking concepts and security group configurations

Solution walkthrough
In the following sections, we demonstrate how to configure the interface VPC endpoint using the AWS Management Console and establish secure connectivity from a test EC2 instance within the VPC to AgentCore Gateway.
Create a security group for the EC2 instance
To create a security group for the EC2 instance, follow these steps, as shown in the following screenshot:

Navigate to the Amazon EC2 console in your preferred AWS Region and choose Security Groups in the navigation pane under Network & Security.
Choose Create security group.
For Security group name, enter a descriptive name such as ec2-agent-sg.
For Description, enter a meaningful description such as Security group for EC2 instances running AI agents.
For VPC, choose your target VPC.
Add relevant Inbound rules for the EC2 instance management such as SSH (port 22) from your management network or bastion host.
Leave Outbound rules as default (allows all outbound traffic) to make sure agents can communicate with necessary services.
Choose Create security group.

Create a security group for the interface VPC endpoint
To create a security group for the interface VPC endpoint, follow these steps:
Create a second security group named vpce-agentcore-sg to attach to the AgentCore Gateway interface VPC endpoint, following steps similar to the preceding instructions and selecting the same VPC. For this security group, configure the following rules to enable secure and restricted access:

Inbound rules – Allow HTTPS (port 443) for secure communication to the AgentCore Gateway
Source – Select the EC2 security group (ec2-agent-sg) you created in the preceding section to allow traffic only from authorized agent instances
Outbound rules – Leave as default (all traffic allowed) to support response traffic

This security group configuration implements the principle of least privilege by making sure only EC2 instances with the agent security group can access the VPC endpoint while blocking unauthorized access from other resources in the VPC. These steps are illustrated by the following screenshot.
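If you prefer to script this setup, the following Boto3 sketch creates both security groups and adds the HTTPS ingress rule that restricts endpoint access to the agent security group; the VPC ID is a placeholder.

import boto3

ec2 = boto3.client("ec2")
VPC_ID = "vpc-0123456789abcdef0"  # placeholder: your target VPC

# Security group for EC2 instances running AI agents
agent_sg = ec2.create_security_group(
    GroupName="ec2-agent-sg",
    Description="Security group for EC2 instances running AI agents",
    VpcId=VPC_ID,
)["GroupId"]

# Security group for the AgentCore Gateway interface VPC endpoint
vpce_sg = ec2.create_security_group(
    GroupName="vpce-agentcore-sg",
    Description="Security group for the AgentCore Gateway interface endpoint",
    VpcId=VPC_ID,
)["GroupId"]

# Allow HTTPS (443) to the endpoint only from the agent security group
ec2.authorize_security_group_ingress(
    GroupId=vpce_sg,
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 443,
        "ToPort": 443,
        "UserIdGroupPairs": [{"GroupId": agent_sg}],
    }],
)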

Provision an EC2 instance within the VPC
Provision an EC2 instance in the same VPC and select an appropriate Availability Zone for your workload requirements. Configure the instance with the network settings shown in the following list, making sure you select the same VPC and note the chosen subnet for VPC endpoint configuration:

VPC – Select your target VPC
Subnet – Choose a private subnet for enhanced security (note this subnet for VPC endpoint configuration)
Security group – Attach the EC2 security group (ec2-agent-sg) you created in the previous steps
IAM role – Configure an IAM role with necessary permissions for Amazon Bedrock and AgentCore Gateway access
Instance type – Choose an appropriate instance type based on your agent workload requirements

Remember the chosen subnet because you’ll need to configure the VPC endpoint in the same subnet to facilitate optimal network routing and minimal latency. These configurations are shown in the following screenshot.

Create an interface VPC endpoint
Create an interface VPC endpoint using Amazon Virtual Private Cloud (Amazon VPC) that automatically uses AWS PrivateLink technology, enabling secure communication from your EC2 instance to AgentCore Gateway without traversing the public internet. Follow these steps:

Navigate to the Amazon VPC console and choose Endpoints in the navigation pane under the PrivateLink and Lattice section.
Choose Create endpoint.
For Name tag, enter a descriptive name (for example, vpce-agentcore-gateway).
For Service category, choose AWS services.
For Services, search for and choose com.amazonaws.<region>.bedrock-agentcore.gateway (replace <region> with your actual AWS Region).

These settings are shown in the following screenshot.

Set the VPC to the same VPC you’ve been working with throughout this setup.
Select Enable DNS name to allow access to the AgentCore Gateway using its default domain name, which simplifies application configuration and maintains compatibility with existing code.
Specify the subnet where the EC2 instance is running to maintain optimal network routing and minimal latency, as shown in the following screenshot.

Set the security group to the VPC endpoint security group (vpce-agentcore-sg) you created earlier to control access to the endpoint.
For initial testing, leave the policy set to Full access to allow agents within your VPC to communicate with AgentCore Gateway in your AWS account. In production environments, implement more restrictive policies based on the principle of least privilege.

After you create the endpoint, it will take approximately 2–5 minutes to become available. You can monitor the status on the Amazon VPC console, and when it shows as Available, you can proceed with testing the connection.
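You can also create the interface endpoint programmatically. The following Boto3 sketch mirrors the console steps above; the Region, VPC ID, subnet ID, and security group ID are placeholders.

import boto3

REGION = "us-east-1"  # placeholder: your Region
ec2 = boto3.client("ec2", region_name=REGION)

response = ec2.create_vpc_endpoint(
    VpcEndpointType="Interface",
    VpcId="vpc-0123456789abcdef0",                       # placeholder
    ServiceName=f"com.amazonaws.{REGION}.bedrock-agentcore.gateway",
    SubnetIds=["subnet-0123456789abcdef0"],              # subnet of the EC2 instance
    SecurityGroupIds=["sg-0123456789abcdef0"],           # vpce-agentcore-sg
    PrivateDnsEnabled=True,
    TagSpecifications=[{
        "ResourceType": "vpc-endpoint",
        "Tags": [{"Key": "Name", "Value": "vpce-agentcore-gateway"}],
    }],
)

endpoint = response["VpcEndpoint"]
print(endpoint["VpcEndpointId"], endpoint["State"])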
Test the connection
Log in to the EC2 instance to perform the following tests.
Check traffic flow over an interface VPC endpoint
To confirm the traffic flow through the Amazon Bedrock AgentCore Gateway endpoint, check the IP address of the source resource that connects to the AgentCore Gateway endpoint. When you set up an interface VPC endpoint, AWS deploys an elastic network interface with a private IP address in the subnet. This deployment allows communication with AgentCore Gateway from resources within the Amazon VPC and on-premises resources that connect to the interface VPC endpoint through AWS Direct Connect or AWS Site-to-Site VPN. It also allows communication with resources in other Amazon VPC endpoints when you use centralized interface VPC endpoint architecture patterns.
Check whether you turned on private DNS for the AgentCore Gateway endpoint. If you turn on private DNS, then AgentCore Gateway endpoints resolve to the private endpoint IP addresses. For AgentCore Gateway, enabling private DNS means your agents can continue using the standard gateway endpoint URL while benefiting from private network routing through the VPC endpoint.
Before the interface VPC endpoint is created, as shown in the following example, DNS resolves the AgentCore Gateway endpoint to a public IP address:

nslookup <gatewayid>.gateway.bedrock-agentcore.<region>.amazonaws.com

Non-authoritative answer:
Name: <gatewayid>.gateway.bedrock-agentcore.<region>.amazonaws.com
Address: 52.86.152.150

After the interface VPC endpoint is created with private DNS resolution, as shown in the following example, DNS resolves the same name to a private IP address from the CIDR range of the subnet in which the VPC endpoint was created.

nslookup <gatewayid>.gateway.bedrock-agentcore.<region>.amazonaws.com

Non-authoritative answer:
Name: <gatewayid>.gateway.bedrock-agentcore.<region>.amazonaws.com
Address: 172.31.91.174

When you select Enable DNS name for AgentCore Gateway VPC interface endpoints, by default AWS turns on the Enable private DNS only for inbound endpoints option.
Private DNS enabled (cURL) (recommended)
When private DNS is enabled, your applications can seamlessly use the standard gateway URL endpoint in the format https://{gateway-id}.gateway.bedrock-agentcore.{region}.amazonaws.com while traffic automatically routes through the VPC endpoint.
The following is a sample cURL request to be executed from a resource within the VPC. The command sends a JSON-RPC POST request to retrieve available tools from the AgentCore Gateway:

curl -sS -i -X POST https://<gatewayid>.gateway.bedrock-agentcore.<region>.amazonaws.com/mcp \
  --header 'Content-Type: application/json' \
  --header "Authorization: Bearer $TOKEN" \
  --data '{
    "jsonrpc": "2.0",
    "id": "'"$UNIQUE_ID"'",
    "method": "tools/list",
    "params": {}
  }'

This cURL command sends a JSON-RPC 2.0 POST request to the AgentCore Gateway MCP endpoint to retrieve a list of available tools. It uses bearer token authentication and includes response headers in the output, calling the tools/list method to discover what tools are accessible through the gateway.
Private DNS disabled (Python)
When Private DNS is disabled, you can’t access the gateway directly through the standard AgentCore Gateway endpoint. Instead, you must route traffic through the VPC DNS name shown in the following screenshot and include the original gateway domain name in the Host header.

curl -sS -i -X POST https://<vpce-dns-name>/mcp \
  --header 'Host: <gatewayid>.gateway.bedrock-agentcore.<region>.amazonaws.com' \
  --header 'Content-Type: application/json' \
  --header "Authorization: Bearer $TOKEN" \
  --data '{
    "jsonrpc": "2.0",
    "id": "'"$UNIQUE_ID"'",
    "method": "tools/list",
    "params": {}
  }'

The following steps walk through executing a Python script that uses the Host header:

Access your EC2 instance. Log in to your EC2 instance that has access to the VPC endpoint.
Configure the required environment variables for the connection:
GATEWAY_URL – The VPC endpoint URL used to access the AgentCore Gateway through your private network connection
TOKEN – Your authentication bearer token for accessing the gateway
GATEWAY_HOST – The original AgentCore Gateway domain name that must be included in the Host header when Private DNS is disabled

For example:

export GATEWAY_URL=https://<vpce_id>.gateway.bedrock-agentcore.ap-southeast-2.vpce.amazonaws.com/mcp
export TOKEN=<your-token-here>
export GATEWAY_HOST=<gateway_id>.gateway.bedrock-agentcore.ap-southeast-2.amazonaws.com

Create and execute the test script.

Copy the following Python code into a file named agent.py. This code tests the AgentCore Gateway workflow by discovering available tools, creating a Strands Agent with the tools, and then testing both conversational interactions (tool listing and weather queries) and direct MCP tool calls.

from strands.models import BedrockModel
from mcp.client.streamable_http import streamablehttp_client
from strands.tools.mcp.mcp_client import MCPClient
from strands import Agent
import logging
import os

# Read authentication token and gateway URL from environment variables
token = os.getenv('TOKEN')
gatewayURL = os.getenv('GATEWAY_URL')    # VPC endpoint URL
gatewayHost = os.getenv('GATEWAY_HOST')  # domain name of the AgentCore gateway

def create_streamable_http_transport():
    """Create HTTP transport with proper authentication headers"""
    return streamablehttp_client(
        gatewayURL,
        headers={
            "Authorization": f"Bearer {token}",
            "Host": gatewayHost
        }
    )

# Initialize MCP client with the transport
client = MCPClient(create_streamable_http_transport)

# Configure Bedrock model - ensure IAM credentials in ~/.aws/credentials have Bedrock access
yourmodel = BedrockModel(
    model_id="amazon.nova-pro-v1:0",
    temperature=0.7,
)

# Configure logging for debugging and monitoring
logging.getLogger("strands").setLevel(logging.INFO)
logging.basicConfig(
    format="%(levelname)s | %(name)s | %(message)s",
    handlers=[logging.StreamHandler()]
)

# Test the complete agent workflow
with client:
    targetname = 'TestGatewayTarget36cb2ebf'

    # List available tools from the MCP server
    tools = client.list_tools_sync()

    # Create an Agent with the model and available tools
    agent = Agent(model=yourmodel, tools=tools)
    print(f"Tools loaded in the agent: {agent.tool_names}")

    # Test agent with a simple query to list available tools
    response1 = agent("Hi, can you list all tools available to you?")
    print(f"Agent response for tool listing: {response1}")

    # Test agent with a tool invocation request
    response2 = agent("Get the current weather for Seattle and show me the exact response from the tool")
    print(f"Agent response for weather query: {response2}")

    # Direct MCP tool invocation for validation
    result = client.call_tool_sync(
        tool_use_id="get-weather-seattle-call-1",  # Unique identifier for this call
        name=f"{targetname}___get_weather",        # Tool name format for Lambda targets
        arguments={"location": "Seattle"}
    )
    print(f"Direct MCP tool response: {result}")

Invoke the script using the following command:

python3 agent.py
Advanced configuration: VPC endpoint access policies
A VPC endpoint policy is a resource-based policy that controls access to AWS services through the endpoint. Unlike identity-based policies, endpoint policies provide an additional layer of access control at the network level. You can configure access policies for AgentCore Gateway VPC endpoints with specific considerations. When creating endpoint policies for AgentCore Gateway, consider these key elements:

Principal configuration – The Principal field can’t be modified because AgentCore Gateway doesn’t use IAM for authentication. Authentication is handled through bearer tokens rather than IAM principals.
Resource specification – Clearly define the Resource field if you want to restrict access to specific gateway endpoints. Use the full Amazon Resource Name (ARN) format to target particular gateways within your account as shown in the following sample policy structure.
Action permissions – For the Action field, avoid specifying control plane operations. Use a wildcard (*) to allow the necessary data plane operations for gateway functionality.

Here is a sample policy structure:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Principal": "*",
      "Effect": "Allow",
      "Action": "*",
      "Resource": "arn:aws:bedrock-agentcore:<region>:<AWS_Account_ID>:gateway/<gateway_id>"
    }
  ]
}

When the VPC endpoint policy blocks a request, you will see error responses such as:

{"jsonrpc":"2.0","id":2,"error":{"code":-32002,"message":"Authorization error - Insufficient permissions"}}

Policy caching behavior
AgentCore Gateway implements a caching mechanism for access policies that introduces a delay of up to 15 minutes before policy changes take effect. Although this caching significantly improves gateway performance, it means that policy modifications might not be immediately reflected in access controls. To work effectively with this behavior, you should allow at least 15 minutes for policy changes to fully propagate throughout the system after making updates. When possible, schedule policy modifications during planned maintenance windows to minimize operational impact. Always test policy changes in nonproduction environments before applying them to production gateways and factor in the caching delay when diagnosing access-related issues to avoid premature troubleshooting efforts.
Advanced patterns
In a shared gateway, multiple agents pattern, multiple agents from different services access a single centralized gateway through a shared VPC endpoint, simplifying network architecture while maintaining security through token-based authentication. This pattern is illustrated in the following diagram.

In a multi-gateway, multi-agent pattern, which is shown in the following diagram, multiple agents across different applications access multiple specialized gateways through dedicated VPC endpoints, providing maximum security isolation with access control per gateway.

In a cross-VPC gateway access pattern, shown in the following diagram, agents in multiple VPCs can access AgentCore Gateway through VPC peering or AWS Transit Gateway connections, allowing centralized gateway access across network boundaries while maintaining isolation.

In a hybrid cloud gateway pattern, on-premises agents can access cloud-based gateways through VPC endpoints with private DNS disabled, enabling hybrid cloud deployments through Direct Connect or VPN connections. The following diagram illustrates this pattern.

Clean up
To avoid ongoing charges and maintain good resource hygiene, clean up your resources by completing the following steps in order.
Delete the EC2 instance:

Navigate to the Amazon EC2 console and select your test instance
Choose Instance state and Stop instance, then wait for it to stop
Choose Instance state and Terminate instance to permanently delete the instance

Delete the VPC endpoint:

Navigate to the Amazon VPC console and choose Endpoints
Select the VPC endpoint (vpce-agentcore-gateway) you created
Choose Actions and Delete VPC endpoints
Confirm the deletion

Delete the security groups:

Navigate to the Amazon EC2 console and choose Security groups
Select the EC2 security group (ec2-agent-sg) you created
Choose Actions and Delete security groups
Repeat for the VPC endpoint security group (vpce-agentcore-sg)

Conclusion
In this post, we demonstrated how to establish secure, private connectivity between VPC-hosted resources and Amazon Bedrock AgentCore Gateway using VPC interface endpoints and AWS PrivateLink. This architecture delivers comprehensive benefits for enterprise agentic AI deployments by implementing networks that are isolated from the internet, providing enhanced security through dedicated private network paths. The solution implements a robust data perimeter through VPC endpoint policies, which create granular access controls that establish strict data boundaries around your AI resources. Additionally, the architecture enables private connectivity to Gateway endpoints for on-premises environments, supporting distributed AI architectures that span cloud and on-premises infrastructure. For organizations deploying autonomous AI systems at scale, implementing VPC interface endpoints creates the secure networking foundation necessary for efficient agent operations while delivering reduced latency through optimized network paths. This enterprise-grade approach helps enable your agentic AI applications to achieve improved performance and reduced response times while meeting security and compliance requirements.
To learn more about implementing these patterns and best practices, visit the Amazon Bedrock documentation and AWS PrivateLink documentation for comprehensive guidance on AI deployments.

About the authors
Dhawal Patel is a Principal Machine Learning Architect at Amazon Web Services (AWS). He has worked with organizations ranging from large enterprises to midsized startups on problems related to distributed computing and AI. He focuses on deep learning, including natural language processing (NLP) and computer vision domains. He helps customers achieve high-performance model inference on Amazon SageMaker.
Sindhura Palakodety is a Senior Solutions Architect at Amazon Web Services (AWS) and Single-Threaded Leader (STL) for ISV Generative AI, where she is dedicated to empowering customers in developing enterprise-scale, Well-Architected solutions. She specializes in generative AI and data analytics domains, enabling organizations to leverage innovative technologies for transformative business outcomes.
Thomas Mathew Veppumthara is a Sr. Software Engineer at Amazon Web Services (AWS) with Amazon Bedrock AgentCore. He has previous generative AI leadership experience in Amazon Bedrock Agents and nearly a decade of distributed systems expertise across Amazon eCommerce Services and Amazon Elastic Block Store (Amazon EBS). He holds multiple patents in distributed systems, storage, and generative AI technologies.
June Won is a Principal Product Manager with Amazon SageMaker JumpStart. He focuses on making foundation models (FMs) easily discoverable and usable to help customers build generative AI applications. His experience at Amazon also includes mobile shopping applications and last-mile delivery.

IBM Released new Granite 4.0 Models with a Novel Hybrid Mamba-2/Transf …

IBM just released Granite 4.0, an open-source LLM family that swaps monolithic Transformers for a hybrid Mamba-2/Transformer stack to cut serving memory while keeping quality. Sizes span a 3B dense “Micro,” a 3B hybrid “H-Micro,” a 7B hybrid MoE “H-Tiny” (~1B active), and a 32B hybrid MoE “H-Small” (~9B active). The models are Apache-2.0, cryptographically signed, and—per IBM—the first open models covered by an accredited ISO/IEC 42001:2023 AI management system certification. They are available on watsonx.ai and via Docker Hub, Hugging Face, LM Studio, NVIDIA NIM, Ollama, Replicate, Dell Pro AI Studio/Enterprise Hub, Kaggle, with Azure AI Foundry…

So, what is new?

Granite 4.0 introduces a hybrid design that interleaves a small fraction of self-attention blocks with a majority of Mamba-2 state-space layers (9:1 ratio). As per IBM technical blog, relative to conventional Transformer LLMs, Granite 4.0-H can reduce RAM by >70% for long-context and multi-session inference, translating into lower GPU cost at a given throughput/latency target. IBM’s internal comparisons also show the smallest Granite 4.0 models outperforming Granite 3.3-8B despite using fewer parameters.

Tell me, what are the released variants?

IBM is shipping both Base and Instruct variants across four initial models:

Granite-4.0-H-Small: 32B total, ~9B active (hybrid MoE).

Granite-4.0-H-Tiny: 7B total, ~1B active (hybrid MoE).

Granite-4.0-H-Micro: 3B (hybrid dense).

Granite-4.0-Micro: 3B (dense Transformer for stacks that don’t yet support hybrids).

All are Apache-2.0 and cryptographically signed; IBM states Granite is the first open model family with accredited ISO/IEC 42001 coverage for its AI management system (AIMS). Reasoning-optimized (“Thinking”) variants are planned later in 2025.

How is it trained, and what about context and dtype?

Granite 4.0 was trained on samples up to 512K tokens and evaluated up to 128K tokens. Public checkpoints on Hugging Face are BF16 (quantized and GGUF conversions are also published), while FP8 is an execution option on supported hardware—not the format of the released weights.
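For local evaluation of the published checkpoints, a typical Hugging Face Transformers flow looks like the following sketch; the repository ID follows IBM's naming but should be confirmed on the Granite 4.0 model cards, and the hybrid variants require a recent Transformers release with support for the Mamba-2 hybrid architecture.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Repository ID is an assumption; check IBM's Hugging Face organization for exact names
model_id = "ibm-granite/granite-4.0-h-micro"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # published checkpoints are BF16
    device_map="auto",
)

messages = [{"role": "user", "content": "Explain hybrid Mamba-2/Transformer models in two sentences."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))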

Let's understand its performance signals (enterprise-relevant)

IBM highlights instruction following and tool-use benchmarks:

IFEval (HELM): Granite-4.0-H-Small leads most open-weights models (trailing only Llama 4 Maverick at far larger scale).

https://www.ibm.com/new/announcements/ibm-granite-4-0-hyper-efficient-high-performance-hybrid-models

BFCLv3 (Function Calling): H-Small is competitive with larger open/closed models at lower price points.

https://www.ibm.com/new/announcements/ibm-granite-4-0-hyper-efficient-high-performance-hybrid-models

MTRAG (multi-turn RAG): Improved reliability on complex retrieval workflows.

https://www.ibm.com/new/announcements/ibm-granite-4-0-hyper-efficient-high-performance-hybrid-models

How can I get access?

Granite 4.0 is live on IBM watsonx.ai and distributed via Dell Pro AI Studio/Enterprise Hub, Docker Hub, Hugging Face, Kaggle, LM Studio, NVIDIA NIM, Ollama, OPAQUE, Replicate. IBM notes ongoing enablement for vLLM, llama.cpp, NexaML, and MLX for hybrid serving.

My thoughts/comments

I see Granite 4.0’s hybrid Mamba-2/Transformer stack and active-parameter MoE as a practical path to lower TCO: >70% memory reduction and long-context throughput gains translate directly into smaller GPU fleets without sacrificing instruction-following or tool-use accuracy (IFEval, BFCLv3, MTRAG). The BF16 checkpoints with GGUF conversions simplify local evaluation pipelines, and ISO/IEC 42001 plus signed artifacts address provenance/compliance gaps that typically stall enterprise deployment. Net result: a lean, auditable base model family (1B–9B active) that’s easier to productionize than prior 8B-class Transformers.

Check out the Hugging Face Model Card and Technical details.
The post IBM Released new Granite 4.0 Models with a Novel Hybrid Mamba-2/Transformer Architecture: Drastically Reducing Memory Use without Sacrificing Performance appeared first on MarkTechPost.

ServiceNow AI Releases Apriel-1.5-15B-Thinker: An Open-Weights Multimo …

ServiceNow AI Research Lab has released Apriel-1.5-15B-Thinker, a 15-billion-parameter open-weights multimodal reasoning model trained with a data-centric mid-training recipe—continual pretraining followed by supervised fine-tuning—without reinforcement learning or preference optimization. The model attains an Artificial Analysis Intelligence Index score of 52 with 8x cost savings compared to SOTA. The checkpoint ships under an MIT license on Hugging Face.

So, what's new in it for me?

Frontier-level composite score at small scale. The model reports Artificial Analysis Intelligence Index (AAI) = 52, matching DeepSeek-R1-0528 on that combined metric while being dramatically smaller. AAI aggregates 10 third-party evaluations (MMLU-Pro, GPQA Diamond, Humanity’s Last Exam, LiveCodeBench, SciCode, AIME 2025, IFBench, AA-LCR, Terminal-Bench Hard, τ²-Bench Telecom).

Single-GPU deployability. The model card states the 15B checkpoint “fits on a single GPU,” targeting on-premises and air-gapped deployments with fixed memory and latency budgets.

Open weights and reproducible pipeline. Weights, training recipe, and evaluation protocol are public for independent verification.

https://huggingface.co/ServiceNow-AI/Apriel-1.5-15b-Thinker

OK! I got it, but what is its training mechanism?

Base and upscaling. Apriel-1.5-15B-Thinker starts from Mistral’s Pixtral-12B-Base-2409 multimodal decoder-vision stack. The research team applies depth upscaling—increasing decoder layers from 40→48—then projection-network realignment to align the vision encoder with the enlarged decoder. This avoids pretraining from scratch while preserving single-GPU deployability.

CPT (Continual Pretraining). Two stages: (1) mixed text+image data to build foundational reasoning and document/diagram understanding; (2) targeted synthetic visual tasks (reconstruction, matching, detection, counting) to sharpen spatial and compositional reasoning. Sequence lengths extend to 32k and 16k tokens respectively, with selective loss placement on response tokens for instruction-formatted samples.

SFT (Supervised Fine-Tuning). High-quality, reasoning-trace instruction data for math, coding, science, tool use, and instruction following; two additional SFT runs (stratified subset; longer-context) are weight-merged to form the final checkpoint. No RL (reinforcement learning) or RLAIF (reinforcement learning from AI feedback).

Data note. ~25% of the depth-upscaling text mix derives from NVIDIA’s Nemotron collection.

Oh wow! Tell me about its results then?

Key text benchmarks (pass@1 / accuracy).

AIME 2025 (American Invitational Mathematics Examination 2025): 87.5–88%

GPQA Diamond (Graduate-Level Google-Proof Question Answering, Diamond split): ≈71%

IFBench (Instruction-Following Benchmark): ~62

τ²-Bench (Tau-squared Bench) Telecom: ~68

LiveCodeBench (functional code correctness): ~72.8

Using VLMEvalKit for reproducibility, Apriel scores competitively across MMMU / MMMU-Pro (Massive Multi-discipline Multimodal Understanding), LogicVista, MathVision, MathVista, MathVerse, MMStar, CharXiv, AI2D, BLINK, with stronger results on documents/diagrams and text-dominant math imagery.

https://huggingface.co/ServiceNow-AI/Apriel-1.5-15b-Thinker/blob/main/Apriel-1.5-Thinker.pdf

Let's summarize everything

Apriel-1.5-15B-Thinker demonstrates that careful mid-training (continual pretraining + supervised fine-tuning, no reinforcement learning) can deliver a 52 on the Artificial Analysis Intelligence Index (AAI) while remaining deployable on a single graphics processing unit. Reported task-level scores (for example, AIME 2025 ≈88, GPQA Diamond ≈71, IFBench ≈62, Tau-squared Bench Telecom ≈68) align with the model card and place the 15-billion-parameter checkpoint in the most cost-efficient band of current open-weights reasoners. For enterprises, that combination—open weights, reproducible recipe, and single-GPU latency—makes Apriel a practical baseline to evaluate before considering larger closed systems.
The post ServiceNow AI Releases Apriel-1.5-15B-Thinker: An Open-Weights Multimodal Reasoning Model that Hits Frontier-Level Performance on a Single-GPU Budget appeared first on MarkTechPost.

Enhance agentic workflows with enterprise search using Kore.ai and Ama …

This post was written with Meghana Chintalapudi and Surabhi Sankhla of Kore.ai.
As organizations struggle with exponentially growing volumes of data distributed across multiple repositories and applications, employees lose significant time—approximately 30% according to the International Data Corporation (IDC)—searching for information that could be spent on higher-value work. The complexity of modern enterprise data networks demands solutions that can efficiently integrate, process, and deliver actionable insights across disparate systems.
In this post, we demonstrate how organizations can enhance their employee productivity by integrating Kore.ai’s AI for Work platform with Amazon Q Business. We show how to configure AI for Work as a data accessor for Amazon Q index for independent software vendors (ISVs), so employees can search enterprise knowledge and execute end-to-end agentic workflows involving search, reasoning, actions, and content generation. We explore the key benefits of this integration, including advanced search capabilities across more than 90 enterprise connectors and how to extend agentic experiences on top of a search foundation. The post includes a step-by-step implementation guide to help you set up this integration in your environment.
Components of the integration
Kore.ai is a leading Enterprise AI platform consistently recognized by Gartner as a leader in conversational AI. With three key Kore.ai offerings, AI for Work, AI for Process, and AI for Service, enterprises can build and deploy AI solutions based on their business needs. The AI for Work platform helps employees be more productive by making it possible to search across applications, take context-aware actions, generate content, and automate repetitive tasks. The platform goes beyond standalone search to deliver comprehensive agentic orchestration and workflows, helping employees follow up with clients, send weekly updates, or research and write marketing content with a single command. With AI for Work, your employees can create simple no-code agents while your admins have the flexibility to create more advanced low-code or pro-code agents. AI for Process, on the other hand, automates knowledge-intensive business processes end-to-end. AI for Service helps organizations deliver differentiated customer service experiences through self-service, proactive outreach campaigns, and agent assistance.
Amazon Q index for ISVs is a powerful, managed vector search service that supports seamless integration of generative AI applications with customers’ enterprise data through a unified, secure index. ISVs can access and retrieve relevant content through the SearchRelevantContent API for cross-application data retrieval without needing direct access or individual indexing of each data source, while customers retain full control over data access and governance.
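As an illustration of the retrieval path, the following hedged Boto3 sketch shows how an ISV backend might call the SearchRelevantContent API against a customer's Amazon Q index once the data accessor authorization flow is complete (the OAuth token exchange for temporary AWS credentials is omitted); the application ID, retriever ID, and query text are placeholders.

import boto3

qbusiness = boto3.client("qbusiness", region_name="us-east-1")  # Region of the Q Business application

response = qbusiness.search_relevant_content(
    applicationId="<amazon-q-business-application-id>",            # placeholder
    contentSource={"retriever": {"retrieverId": "<retriever-id>"}}, # placeholder
    queryText="What feature requests came in from enterprise customers this quarter?",
    maxResults=10,
)

for item in response.get("relevantContent", []):
    print(item.get("documentTitle"), item.get("documentUri"))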
When combined with additional search connectors offered by AI for Work platform and its ability to create and orchestrate agents, organizations gain a complete solution that transforms how employees access enterprise data and execute tasks end-to-end. The following video shows one such agentic experience in action, where the AI for Work interface seamlessly orchestrates agents to help a sales executive prepare for a client meeting—compiling information from Amazon Q index and AI for Work connectors, summarizing talking points, and sending them as an email, all from a single query.

Benefits for enterprises
Enterprises often struggle with fragmented data access and repetitive manual tasks that slow down critical business processes. For example, imagine a scenario where a product manager needs to compile quarterly feature requests—with the integration of Kore.ai’s AI for Work and Amazon Q index, they can instantly gather requests from Salesforce, support tickets, and JIRA; automatically generate a structured roadmap; and schedule stakeholder meetings, all with a single query. This seamless integration changes the way enterprises interact with enterprise systems, through multiple key advantages:

Improved search capabilities – Amazon Q index augments the generative AI experience by providing semantically relevant enterprise content across connected systems through its distributed vector database, delivering query responses at enterprise scale. Now, together with AI for Work, your employees can search data from over 90 connectors, integrating with enterprise systems like Microsoft 365, Salesforce, and Workday while also connecting with custom internal knowledge systems and third-party search providers. AI for Work’s orchestrator manages complex query processing and agent routing across multiple data sources, resulting in contextually appropriate and actionable results that significantly reduce search time while also enabling intelligent automations that extend far beyond traditional search capabilities.
Enhanced data processing – The system continuously ingests and analyzes data through the document processing pipeline in Amazon Q index, which automatically handles multiple formats using intelligent chunking algorithms that preserve semantic context. The AI for Work platform unifies search, content generation, and actions in a single interface, to support the creation of multi-step agentic experiences grounded in search. Through real-time incremental indexing that processes only changed content, the system maintains data freshness while converting siloed raw data into actionable insights and multi-step business processes that can be saved and reused across the organization.
Cost optimization – Organizations can achieve significant cost savings by streamlining routine tasks through agents that reduce operational overhead and improve resource allocation. AI for Work supports a wide range of agent-building options, from no-code and low-code to pro-code, for both non-technical employees and technical experts to build agents for themselves and to share across the organization, so teams can accomplish more with existing resources and benefit from sustained productivity improvements.
Security benefits – Security remains paramount, with Amazon Q index implementing vector-level security through end-to-end encryption using AWS Key Management Service (AWS KMS) customer managed keys and document-level access controls that filter search results based on user identity and group membership. The joint solution implements robust role-based access control and audit trails. This zero-trust security approach maintains compliance with industry standards while providing granular control over sensitive enterprise data, making sure users only see information from documents they have explicit permissions to access while maintaining complete data sovereignty. With AI for Work’s robust security and governance tools enterprises can manage permissions and agent access, monitor usage, and enforce guardrails for secure, enterprise-wide deployment of AI solutions at scale.

Solution overview
The Amazon Q Business data accessor provides a secure interface that integrates Kore.ai’s AI for Work platform with Amazon Q index. The integration delivers a robust solution that uses enterprise data across multiple systems to power intelligent agentic actions and content generation capabilities that transform how organizations handle routine tasks and automate complex processes end-to-end.
When a user submits a query through AI for Work, its orchestrator intelligently routes requests between Kore.ai’s native retrievers and Amazon Q index based on predefined routing rules and advanced intent recognition algorithms. For Amazon Q index requests, the architecture implements secure cross-account API calls using OAuth 2.0 tokens that transform into temporary AWS credentials, supporting both security and optimal performance while maintaining strict access controls throughout the entire system. With AI for Work’s agents, users can take follow up actions, such as drafting proposals or submitting tickets—directly on top of search results, for end-to-end task completion in a single interface. Users can also build personalized workflows of pre-defined steps and execute them from a single query to further save time.
This supports use cases such as automated roadmap generation, where a product manager can query feature requests across multiple systems and receive a structured roadmap complete with stakeholder notifications, or RFP response automation, where sales executives can generate comprehensive proposals by pulling compliance documentation and tailoring responses based on client requirements.
The following diagram illustrates the solution architecture.

Prerequisites
Before enabling the Amazon Q index integration with Kore.ai’s AI for Work, you must have the following components in place:

An AWS account with appropriate service access
Amazon Q Business set up with AWS IAM Identity Center for user authentication
Access to Kore.ai’s AI for Work (as a workspace admin)

With these prerequisites met, you can complete the basic configuration steps on both the Amazon Q Business and Kore.ai consoles to get started.
Add Kore.ai as a data accessor
After creating an Amazon Q Business application with AWS IAM Identity Center, administrators can configure Kore.ai as a data accessor through the Amazon Q Business console. Complete the following steps:

On the Amazon Q Business console, choose Data accessors in the navigation pane.
Choose Add data accessor.
Choose Kore.ai as your data accessor. You must retrieve the tenant ID, a unique identifier for your application tenant. Refer to Prerequisites for instructions on retrieving the tenant ID for your application; similar instructions are also listed later in this post.
For Data source access, configure your level of access. You can select specific data sources from your Amazon Q index to be available through the data accessor. This makes it possible to control which content is surfaced in the AI for Work environment.
For User access, specify which users or groups can access the Amazon Q index through the data accessor. This option makes it possible to configure granular permissions for data accessor accessibility and manage organizational access controls.

After you have added the data accessor, the Amazon Q Business console displays configuration details that you need to share with Kore.ai to complete the setup.

Note down the following information for the next step:

Amazon Q Business application ID
AWS Region of the Amazon Q Business application
Amazon Q Business retriever ID
Region for IAM Identity Center instance

Configure Amazon Q index in Kore.ai’s AI for Work
Kore.ai’s AI for Work supports flexible integration with Amazon Q index based on your enterprise search needs. There are two configuration options: configuring Amazon Q index as the primary enterprise knowledge source or configuring it as a search agent. We provide instructions for both options in this post.
Option 1: Configure Amazon Q index as the primary enterprise knowledge source
If you want Amazon Q index to act as the primary enterprise knowledge source and fallback search layer, complete the following steps:

In AI for Work, go to Workspaces on the admin console. Then navigate to Enterprise Workspace, which is the default workspace.

Choose Configure to configure an enterprise knowledge data source.
On the Create New dropdown menu, choose Amazon Q.

Enter a source name and brief description.
Copy the tenant ID displayed—this is required during the setup of the data accessor in AWS, as described in the previous section.
Enter the details captured earlier:

Amazon Q Business application ID
Region of the Amazon Q Business application
Amazon Q Business retriever ID
Region for IAM Identity Center instance

Choose Continue to save and complete the configuration.

The new knowledge source now shows as Active.

Option 2: Configure Amazon Q index as a search agent
If you already have a primary search index, you can configure Amazon Q index as a search agent:

In AI for Work, go to Workspaces on the admin console.
Choose the workspace where you want to add Amazon Q index. (Enterprise Workspace is used by default).
Under AI Agents in the navigation pane, choose Search Agent.
Choose Create agent.

Provide an agent name and purpose. This helps define when the search agent should be invoked.
Choose Continue to move to configuration.
For Select Search Index, choose Amazon Q.

Copy the tenant ID displayed—it is required during the setup of the data accessor in AWS.

Preview and test the agent.
After you have validated the agent, publish it to selected users or groups.

Your integration is now complete. You can now access the assistant application and start asking questions in the AI for Work console. If you’ve created a search agent, you can also access it from the list of agents and start interacting with it directly.
Clean up
When you are finished using this solution, clean up your resources to avoid additional costs:

Disable the Amazon Q index configuration within AI for Work’s settings.
Delete the Kore.ai data accessor from the Amazon Q Business console, which will remove permissions and access for users.
Delete the Amazon Q Business application to remove the associated index and data source connectors, on your AWS account.

Conclusion
The combination of Kore.ai’s AI for Work and Amazon Q index offers enterprises a transformative approach to boosting employee productivity by leveraging comprehensive search capabilities while streamlining repetitive tasks and processes. By integrating Kore.ai’s advanced agentic platform with the robust search infrastructure of Amazon Q index, organizations can now execute context-aware actions by accessing relevant information across disparate systems while maintaining data ownership and security. This supports faster problem-solving, enhanced productivity, and better collaboration across the organization.
In this post, we explored how enterprises can use the integration between Kore.ai’s AI for Work and Amazon Q Business to streamline their operational processes and unlock valuable productivity gains. We demonstrated how organizations can set up this integration using an Amazon Q data accessor, helping teams access critical information securely and cost-effectively.
Unlock the full potential of your organization’s data and agentic workflows today with the unified Amazon Q index and Kore.ai AI for Work solution by following the steps in Amazon Q integration with AI for Work.

About the authors
Siddhant Gupta is a Software Development Manager on the Amazon Q team based in Seattle, WA. He is driving innovation and development in cutting-edge AI-powered solutions.
Chinmayee Rane is a Generative AI Specialist Solutions Architect at AWS, with a core focus on generative AI. She helps ISVs accelerate the adoption of generative AI by designing scalable and impactful solutions. With a strong background in applied mathematics and machine learning, she specializes in intelligent document processing and AI-driven innovation. Outside of work, she enjoys salsa and bachata dancing.
Bobby Williams is a Senior Solutions Architect at AWS. He has decades of experience designing, building, and supporting enterprise software solutions that scale globally. He works on solutions across industry verticals and horizontals and is driven to create a delightful experience for every customer.
Santhosh Urukonda is a Senior PACE (Prototyping & Cloud Engineering) Architect at AWS with two decades of experience. He specializes in helping customers develop innovative, first-to-market solutions with a focus on generative AI.
Nikhil Kumar Goddeti is a Cloud Support Engineer II at AWS. He specializes in AWS Data Analytics services with emphasis on Amazon OpenSearch Service, Amazon Q Business, Amazon Kinesis, Amazon MSK, Amazon AppFlow, and Amazon Kendra. He is a subject matter expert in OpenSearch. Outside of work, he enjoys travelling with his friends and playing cricket.
Meghana Chintalapudi is a Product Manager at Kore.ai, driving the development of search and agentic AI solutions for the AI for Work platform. She has led large-scale AI implementations for Fortune 500 clients, evolving from deterministic NLP and intent-detection models to advanced large language model deployments, with a strong emphasis on enterprise-grade security and scalability. Outside of work, Meghana is a dancer and takes movement workshops in Hyderabad, India.
Surabhi Sankhla is a VP of Product at Kore.ai, where she leads the AI for Work platform to help enterprises boost employee productivity. With over 13 years of experience in product management and technology, she has launched AI products from the ground up and scaled them to millions of users. At Kore.ai, she drives product strategy, client implementations, and go-to-market execution in partnership with cross-functional teams. Based in San Francisco, Surabhi is passionate about making AI accessible and impactful for all.

Accelerate development with the Amazon Bedrock AgentCore MCP server

Today, we’re excited to announce the Amazon Bedrock AgentCore Model Context Protocol (MCP) Server. With built-in support for runtime, gateway integration, identity management, and agent memory, the AgentCore MCP Server is purpose-built to speed up the creation of components compatible with Bedrock AgentCore. You can use the AgentCore MCP server for rapid prototyping, for building production AI solutions, or to scale your agent infrastructure for your enterprise.
Agentic IDEs like Kiro, Claude Code, GitHub Copilot, and Cursor, along with sophisticated MCP servers, are transforming how developers build AI agents. What typically takes significant time and effort (for example, learning about Bedrock AgentCore services, integrating Runtime and the Tools Gateway, managing security configurations, and deploying to production) can now be completed in minutes through conversational commands with your coding assistant.
In this post, we introduce the new AgentCore MCP server and walk through the installation steps so you can get started.
AgentCore MCP server capabilities
The AgentCore MCP server brings a new agentic development experience to AWS, providing specialized tools that automate the complete agent lifecycle, eliminate the steep learning curve, and reduce the development friction that can slow innovation cycles. To address specific agent development challenges, the AgentCore MCP server:

Transforms agents for AgentCore Runtime integration by providing guidance to your coding assistant on the minimum functionality changes needed—adding Runtime library imports, updating dependencies, initializing apps with BedrockAgentCoreApp(), converting entrypoints to decorators, and changing direct agent calls to payload handling—while preserving your existing agent logic and Strands Agents features (see the sketch after this list).
Automates development environment provisioning by handling the complete setup process through your coding assistant: installing required dependencies (bedrock-agentcore SDK, bedrock-agentcore-starter-toolkit CLI helpers, strands-agents SDK), configuring AWS credentials and AWS Regions, defining execution roles with Bedrock AgentCore permissions, setting up ECR repositories, and creating .bedrock_agentcore.yaml configuration files.
Simplifies tool integration with Bedrock AgentCore Gateway for seamless agent-to-tool communication in the cloud environment.
Enables simple agent invocation and testing by providing natural language commands through your coding assistant to invoke provisioned agents on AgentCore Runtime and verify the complete workflow, including calls to AgentCore Gateway tools when applicable.
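
To make the first capability concrete, here is a minimal sketch of what a transformed Strands agent entrypoint might look like after your coding assistant applies those changes. The agent construction, the "prompt" payload key, and the exact import paths are illustrative assumptions rather than a prescribed implementation.

# Minimal sketch of an agent adapted for AgentCore Runtime (illustrative pattern).
from bedrock_agentcore.runtime import BedrockAgentCoreApp  # Runtime library import
from strands import Agent

app = BedrockAgentCoreApp()   # initialize the app with BedrockAgentCoreApp()
agent = Agent()               # your existing Strands agent logic stays as-is

@app.entrypoint               # the entrypoint decorator replaces direct invocation
def invoke(payload):
    # Direct agent calls become payload handling: read the user prompt from the payload.
    user_message = payload.get("prompt", "")
    result = agent(user_message)
    return {"result": result.message}

if __name__ == "__main__":
    app.run()                 # serve the agent so AgentCore Runtime can invoke it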

Layered approach
When using the AgentCore MCP server with your favorite client, we encourage you to consider a layered architecture designed to provide comprehensive AI agent development support:

Layer 1: Agentic IDE or client – Use Kiro, Claude Code, Cursor, VS Code extensions, or another natural language interface for developers. For very simple tasks, agentic IDEs are equipped with the right tools to look up documentation and perform tasks specific to Bedrock AgentCore. However, with this layer alone, developers may observe sub-optimal performance across AgentCore developer paths.
Layer 2: AWS service documentation – Install the AWS Documentation MCP Server for comprehensive AWS service documentation, including context about Bedrock AgentCore.
Layer 3: Framework documentation – Install the Strands, LangGraph, or other framework docs MCP servers or use the llms.txt for framework-specific context.
Layer 4: SDK documentation – Install the MCP server or use the llms.txt for the agent framework SDK and the Bedrock AgentCore SDK, for a combined documentation layer that covers the Strands Agents SDK documentation and the Bedrock AgentCore API references.
Layer 5: Steering files – Task-specific guidance for more complex and repeated workflows. Each IDE has a different approach to using steering files (for example, see Steering in the Kiro documentation).

Each layer builds upon the previous one, providing increasingly specific context so your coding assistant can handle everything from basic AWS operations to complex agent transformations and deployments.
Installation
To get started with the Amazon Bedrock AgentCore MCP server, you can use the one-click install in the GitHub repository.
Each IDE integrates with an MCP server differently using its mcp.json file. Review the MCP documentation for your IDE (such as Kiro, Cursor, Q CLI, or Claude Code) to determine the location of mcp.json.

Client, location of mcp.json, and documentation:

Kiro: .kiro/settings/mcp.json (https://kiro.dev/docs/mcp/)
Cursor: .cursor/mcp.json (https://cursor.com/docs/context/mcp)
Q CLI: ~/.aws/amazonq/mcp.json (https://docs.aws.amazon.com/amazonq/latest/qdeveloper-ug/qdev-mcp.html)
Claude Code: ~/.claude/mcp.json (https://docs.claude.com/en/docs/claude-code/mcp)

Use the following in your mcp.json:

{
  "mcpServers": {
    "awslabs.amazon-bedrock-agentcore-mcp-server": {
      "command": "uvx",
      "args": ["awslabs.amazon-bedrock-agentcore-mcp-server@latest"],
      "env": {
        "FASTMCP_LOG_LEVEL": "ERROR"
      },
      "disabled": false,
      "autoApprove": []
    }
  }
}

For example, here is what the IDE looks like on Kiro, with the AgentCore MCP server and the two tools, search_agentcore_docs and fetch_agentcore_doc, connected:

Using the AgentCore MCP server for agent development
While we show demos for various use cases below using the Kiro IDE, the AgentCore MCP server has also been tested with Claude Code, Amazon Q CLI, Cursor, and the VS Code Q plugin. First, let’s take a look at a typical agent development lifecycle using AgentCore services (this is only one example of what’s possible with the available tools; you can explore other use cases simply by instructing the agent in your favorite agentic IDE):

The agent development lifecycle follows these steps:

The user takes a local set of tools or MCP servers and

Creates a Lambda target for AgentCore Gateway; or
Deploys the MCP server as-is on AgentCore Runtime

The user prepares the actual agent code using a preferred framework like Strands Agents or LangGraph. The user can either:

Start from scratch (the server can fetch docs from the Strands Agents or LangGraph documentation)
Start from fully or partially working agent code

The user asks the agent to transform the code into a format compatible with AgentCore Runtime with the intention to deploy the agent later. This causes the agent to:

Write an appropriate requirements.txt file
Import the necessary libraries, including bedrock_agentcore
Decorate the main handler (or create one) to access the core agent-calling logic or input handler

The user may then ask the agent to deploy to AgentCore Runtime. The agent can look up documentation and use the AgentCore CLI to deploy the agent code to Runtime.
The user can test the agent by asking the agent to do so. The AgentCore CLI command required for this is written and executed by the client.
The user then asks the agent to modify the code to use the deployed AgentCore Gateway MCP server within this AgentCore Runtime agent.

The agent modifies the original code to add an MCP client that can call the deployed gateway
The agent then deploys a new version v2 of the agent to Runtime
The agent then tests this integration with a new prompt

Here is a demo of the MCP server working with Cursor IDE. We see the agent perform the following steps:

Transform weather_agent.py to be compatible with AgentCore Runtime
Use the AgentCore CLI to deploy the agent
Test the deployed agent with a successful prompt

Here’s another example of deploying a LangGraph agent to AgentCore Runtime, with the Cursor IDE performing similar steps to those shown above.

Clean up
If you’d like to uninstall the MCP server, follow the MCP documentation for your IDE (such as Kiro, Cursor, Q CLI, or Claude Code) for instructions.
Conclusion
In this post, we showed how you can use the AgentCore MCP server with your agentic IDE of choice to speed up your development workflows.
We encourage you to review the GitHub repository, as well as read through and use the following resources in your development:

Amazon Bedrock AgentCore CLI documentation
Strands Agents MCP Server
LangGraph llms.txt

We encourage you to try out the AgentCore MCP server and provide any feedback through issues in our GitHub repository.

About the authors

Shreyas Subramanian
Shreyas is a Principal Data Scientist who helps customers solve their business challenges with generative AI on the AWS platform. Shreyas has a background in large-scale optimization and deep learning, and he is a researcher studying the use of machine learning and reinforcement learning for accelerating learning and optimization tasks. Shreyas is also an Amazon best-selling book author with several research papers and patents to his name.

Primo Mu
Primo is a Software Development Engineer on the Agentic AI Foundation team at AWS, where he builds foundational systems and infrastructure that power intelligent AI applications. He has extensive experience working on backend stateless orchestration services behind products like Kiro and Q Dev CLI. He focuses on creating scalable frameworks and robust architectures that enable developers to build sophisticated agentic systems.

Liquid AI Released LFM2-Audio-1.5B: An End-to-End Audio Foundation Model with Sub-100 ms Response Latency

Liquid AI has released LFM2-Audio-1.5B, a compact audio–language foundation model that both understands and generates speech and text through a single end-to-end stack. It positions itself for low-latency, real-time assistants on resource-constrained devices, extending the LFM2 family into audio while retaining a small footprint.

https://www.liquid.ai/blog/lfm2-audio-an-end-to-end-audio-foundation-model

But what’s actually new? A unified backbone with disentangled audio I/O

LFM2-Audio extends the 1.2B-parameter LFM2 language backbone to treat audio and text as first-class sequence tokens. Crucially, the model disentangles audio representations: inputs are continuous embeddings projected directly from raw waveform chunks (~80 ms), while outputs are discrete audio codes. This avoids discretization artifacts on the input path while keeping training and generation autoregressive for both modalities on the output path.

On the implementation side, the released checkpoint uses:

Backbone: LFM2 (hybrid conv + attention), 1.2B params (LM only)

Audio encoder: FastConformer (~115M, canary-180m-flash)

Audio decoder: RQ-Transformer predicting discrete Mimi codec tokens (8 codebooks)

Context: 32,768 tokens; vocab: 65,536 (text) / 2049×8 (audio)

Precision: bfloat16; license: LFM Open License v1.0; languages: English


Two generation modes for real-time agents

Interleaved generation for live, speech-to-speech chat where the model alternates text and audio tokens to minimize perceived latency.

Sequential generation for ASR/TTS (switching modalities turn-by-turn).

Liquid AI provides a Python package (liquid-audio) and a Gradio demo to reproduce these behaviors.

Latency: <100 ms to first audio

The Liquid AI team reports end-to-end latency below 100 ms from a 4-second audio query to the first audible response—a proxy for perceived responsiveness in interactive use—stating it is faster than models smaller than 1.5B parameters under their setup.

Benchmarks: VoiceBench and ASR results

On VoiceBench—a suite of nine audio-assistant evaluations—Liquid reports an overall score of 56.78 for LFM2-Audio-1.5B, with per-task numbers disclosed in the blog’s chart (e.g., AlpacaEval 3.71, CommonEval 3.49, WildVoice 3.17). The Liquid AI team contrasts this result with larger models like Qwen2.5-Omni-3B and Moshi-7B in the same table. (VoiceBench is an external benchmark introduced in late 2024 for LLM-based voice assistants.)

The model card on Hugging Face provides an additional VoiceBench table (with closely related—but not identical—per-task values) and includes classic ASR WERs where LFM2-Audio matches or improves on Whisper-large-v3-turbo for some datasets despite being a generalist speech–text model. For example (lower is better): AMI 15.36 vs. 16.13 (Whisper-large-v3-turbo), LibriSpeech-clean 2.03 vs. 2.10.

https://huggingface.co/LiquidAI/LFM2-Audio-1.5B

Alright, but why does it really matter for voice AI?

Most “omni” stacks couple ASR → LLM → TTS, which adds latency and brittle interfaces. LFM2-Audio’s single-backbone design with continuous input embeddings and discrete output codes reduces glue logic and allows interleaved decoding for early audio emission. For developers, this translates to simpler pipelines and faster perceived response times, while still supporting ASR, TTS, classification, and conversational agents from one model. Liquid AI provides code, demo entry points, and distribution via Hugging Face.

Check out the GitHub Page, Hugging Face Model Card, and Technical details.
The post Liquid AI Released LFM2-Audio-1.5B: An End-to-End Audio Foundation Model with Sub-100 ms Response Latency appeared first on MarkTechPost.

MLPerf Inference v5.1 (2025): Results Explained for GPUs, CPUs, and AI Accelerators

What Does MLPerf Inference Actually Measure?

MLPerf Inference quantifies how fast a complete system (hardware + runtime + serving stack) executes fixed, pre-trained models under strict latency and accuracy constraints. Results are reported for the Datacenter and Edge suites with standardized request patterns (“scenarios”) generated by LoadGen, ensuring architectural neutrality and reproducibility. The Closed division fixes the model and preprocessing for apples-to-apples comparisons; the Open division allows model changes, so its results are not strictly comparable. Availability tags—Available, Preview, RDI (research/development/internal)—indicate whether configurations are shipping or experimental.

The 2025 Update (v5.0 → v5.1): What Changed?

The v5.1 results (published Sept 9, 2025) add three modern workloads and broaden interactive serving:

DeepSeek-R1 (first reasoning benchmark)

Llama-3.1-8B (summarization) replacing GPT-J

Whisper Large V3 (ASR)

This round recorded 27 submitters and first-time appearances of AMD Instinct MI355X, Intel Arc Pro B60 48GB Turbo, NVIDIA GB300, RTX 4000 Ada-PCIe-20GB, and RTX Pro 6000 Blackwell Server Edition. Interactive scenarios (tight TTFT/TPOT limits) were expanded beyond a single model to capture agent/chat workloads.

Scenarios: The Four Serving Patterns You Must Map to Real Workloads

Offline: maximize throughput, no latency bound—batching and scheduling dominate.

Server: Poisson arrivals with p99 latency bounds—closest to chat/agent backends.

Single-Stream / Multi-Stream (Edge emphasis): strict per-stream tail latency; Multi-Stream stresses concurrency at fixed inter-arrival intervals.

Each scenario has a defined metric (e.g., max Poisson throughput for Server; throughput for Offline).
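
As a rough mental model for the Server scenario (this is an illustration, not the LoadGen implementation), requests arrive as a Poisson process at a target rate, and a run at that rate only counts if the p99 latency stays inside the benchmark’s bound. The latency numbers in the sketch below are made up purely to show how the tail constraint gates the result.

# Toy illustration of the Server scenario's request pattern; not LoadGen.
import random

def p99_under_poisson_load(target_qps, bound_ms, n_requests=10_000, seed=0):
    rng = random.Random(seed)
    t, latencies = 0.0, []
    for _ in range(n_requests):
        t += rng.expovariate(target_qps)               # exponential gaps => Poisson arrivals
        # t is the scheduled arrival time; the load generator issues queries at such timestamps.
        base = 30.0                                     # pretend steady-state latency (ms)
        spike = 150.0 if rng.random() < 0.02 else 0.0   # occasional tail spike (ms)
        latencies.append(base + spike)
    latencies.sort()
    p99 = latencies[int(0.99 * len(latencies)) - 1]
    return p99, p99 <= bound_ms                         # the run "passes" only within the bound

# With these made-up numbers, p99 lands on the spike latency and the run fails a 100 ms bound.
print(p99_under_poisson_load(target_qps=50, bound_ms=100.0))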

Latency Metrics for LLMs: TTFT and TPOT Are Now First-Class

LLM tests report TTFT (time-to-first-token) and TPOT (time-per-output-token). v5.0 introduced stricter interactive limits for Llama-2-70B (p99 TTFT 450 ms, TPOT 40 ms) to reflect user-perceived responsiveness. The long-context Llama-3.1-405B keeps higher bounds (p99 TTFT 6 s, TPOT 175 ms) due to model size and context length. These constraints carry into v5.1 alongside new LLM and reasoning tasks.
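
To translate these limits into end-to-end response time, a useful back-of-the-envelope relation is total latency ≈ TTFT + (N - 1) × TPOT for N output tokens. The sketch below plugs in the Llama-2-70B interactive bounds; the 200-token reply length is an assumption for illustration.

def response_time_ms(ttft_ms, tpot_ms, output_tokens):
    # Back-of-the-envelope: first token arrives at TTFT, then one TPOT per remaining token.
    return ttft_ms + (output_tokens - 1) * tpot_ms

# Llama-2-70B interactive bounds (p99): TTFT 450 ms, TPOT 40 ms; 200 tokens is an assumed reply.
total_ms = response_time_ms(ttft_ms=450, tpot_ms=40, output_tokens=200)
print(total_ms / 1000, "s")  # 8.41 s worst case at p99 for this example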

The 2025 Datacenter Menu (Closed Division Targets You’ll Actually Compare)

Key v5.1 entries and their quality/latency gates (abbrev.):

LLM Q&A – Llama-2-70B (OpenOrca): Conversational 2000 ms/200 ms; Interactive 450 ms/40 ms; 99% and 99.9% accuracy targets.

LLM Summarization – Llama-3.1-8B (CNN/DailyMail): Conversational 2000 ms/100 ms; Interactive 500 ms/30 ms.

Reasoning – DeepSeek-R1: TTFT 2000 ms / TPOT 80 ms; 99% of FP16 (exact-match baseline).

ASR – Whisper Large V3 (LibriSpeech): WER-based quality (datacenter + edge).

Long-context – Llama-3.1-405B: TTFT 6000 ms, TPOT 175 ms.

Image – SDXL 1.0: FID/CLIP ranges; Server has a 20 s constraint.

Legacy CV/NLP (ResNet-50, RetinaNet, BERT-L, DLRM, 3D-UNet) remain for continuity.

Power Results: How to Read Energy Claims

MLPerf Power (optional) reports system wall-plug energy for the same runs (Server/Offline: system power; Single/Multi-Stream: energy per stream). Only measured runs are valid for energy efficiency comparisons; TDPs and vendor estimates are out-of-scope. v5.1 includes datacenter and edge power submissions but broader participation is encouraged.

How to Read the Tables Without Fooling Yourself

Compare Closed vs Closed only; Open runs may use different models/quantization.

Match accuracy targets (99% vs 99.9%)—throughput often drops at stricter quality.

Normalize cautiously: MLPerf reports system-level throughput under constraints; dividing by accelerator count yields a derived “per-chip” number that MLPerf does not define as a primary metric. Use it only for budgeting sanity checks, not marketing claims (see the sketch after this list).

Filter by Availability (prefer Available) and include Power columns when efficiency matters.
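
For the “normalize cautiously” item, here is a minimal sketch of the kind of budgeting sanity check that is reasonable. The throughput and accelerator counts below are made up, not results from any submission, and the division is only meaningful when division, scenario, and accuracy targets all match.

def derived_per_chip_qps(system_qps, num_accelerators):
    # Derived heuristic only: MLPerf reports system-level throughput, not per-chip metrics.
    return system_qps / num_accelerators

# Made-up figures; compare only when scenario and accuracy targets are identical.
rack_scale  = derived_per_chip_qps(system_qps=12_000, num_accelerators=72)
single_node = derived_per_chip_qps(system_qps=1_500, num_accelerators=8)
print(round(rack_scale, 1), round(single_node, 1))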

Interpreting 2025 Results: GPUs, CPUs, and Other Accelerators

GPUs (rack-scale to single-node). New silicon shows up prominently in Server-Interactive (tight TTFT/TPOT) and in long-context workloads where scheduler & KV-cache efficiency matter as much as raw FLOPs. Rack-scale systems (e.g., GB300 NVL72 class) post the highest aggregate throughput; normalize by both accelerator and host counts before comparing to single-node entries, and keep scenario/accuracy identical.

CPUs (standalone baselines + host effects). CPU-only entries remain useful baselines and highlight preprocessing and dispatch overheads that can bottleneck accelerators in Server mode. New Xeon 6 results and mixed CPU+GPU stacks appear in v5.1; check host generation and memory configuration when comparing systems with similar accelerators.

Alternative accelerators. v5.1 increases architectural diversity (GPUs from multiple vendors plus new workstation/server SKUs). Where Open-division submissions appear (e.g., pruned/low-precision variants), validate that any cross-system comparison holds constant division, model, dataset, scenario, and accuracy.

Practical Selection Playbook (Map Benchmarks to SLAs)

Interactive chat/agents → Server-Interactive on Llama-2-70B/Llama-3.1-8B/DeepSeek-R1 (match latency & accuracy; scrutinize p99 TTFT/TPOT).

Batch summarization/ETL → Offline on Llama-3.1-8B; throughput per rack is the cost driver.

ASR front-ends → Whisper V3 Server with tail-latency bound; memory bandwidth and audio pre/post-processing matter.

Long-context analytics → Llama-3.1-405B; evaluate if your UX tolerates 6 s TTFT / 175 ms TPOT.

What Does the 2025 Cycle Signal?

Interactive LLM serving is table-stakes. Tight TTFT/TPOT in v5.x makes scheduling, batching, paged attention, and KV-cache management visible in results—expect different leaders than in pure Offline.

Reasoning is now benchmarked. DeepSeek-R1 stresses control-flow and memory traffic differently from next-token generation.

Broader modality coverage. Whisper V3 and SDXL exercise pipelines beyond token decoding, surfacing I/O and bandwidth limits.

Summary

In summary, MLPerf Inference v5.1 makes inference comparisons actionable only when grounded in the benchmark’s rules: align on the Closed division, match scenario and accuracy (including LLM TTFT/TPOT limits for interactive serving), and prefer Available systems with measured Power to reason about efficiency; treat any per-device splits as derived heuristics because MLPerf reports system-level performance. The 2025 cycle expands coverage with DeepSeek-R1, Llama-3.1-8B, and Whisper Large V3, plus broader silicon participation, so procurement should filter results to the workloads that mirror production SLAs—Server-Interactive for chat/agents, Offline for batch—and validate claims directly in the MLCommons result pages and power methodology.

References:

MLCommons Releases New MLPerf Inference v5.1 Benchmark Results

MLPerf Inference: Datacenter

MLPerf Inference: Edge

https://docs.mlcommons.org/inference/

https://docs.mlcommons.org/inference/power/

Llama 2 70B: An MLPerf Inference Benchmark for Large Language Models

DeepSeek Reasoning for MLPerf Inference v5.1

https://blogs.nvidia.com/blog/mlperf-inference-blackwell-ultra/

NVIDIA Blackwell Ultra Sets New Inference Records in MLPerf Debut

https://rocm.blogs.amd.com/artificial-intelligence/mlperf-inference-v5.1/README.html

https://rocm.blogs.amd.com/artificial-intelligence/mlperf-inference5.1-repro/README.html

https://newsroom.intel.com/artificial-intelligence/intel-arc-pro-b-series-gpus-and-xeon-6-shine-in-mlperf-inference-v5-1

https://www.globenewswire.com/news-release/2025/09/09/3147136/0/en/MLCommons-Releases-New-MLPerf-Inference-v5-1-Benchmark-Results.html

https://www.tomshardware.com/pc-components/gpus/nvidia-claims-software-and-hardware-upgrades-allow-blackwell-ultra-gb300-to-dominate-mlperf-benchmarks-touts-45-percent-deepseek-r-1-inference-throughput-increase-over-gb200

https://newsroom.intel.com/tag/intel-arc-pro-b60

The post MLPerf Inference v5.1 (2025): Results Explained for GPUs, CPUs, and AI Accelerators appeared first on MarkTechPost.