Why Spatial Supersensing is Emerging as the Core Capability for Multimodal AI Systems?

Even strong 'long-context' AI models fail badly when they must track objects and counts over long, messy video streams, so the next competitive edge will come from models that predict what comes next and selectively remember only surprising, important events, not from simply buying more compute and bigger context windows. A team of researchers from New York University and Stanford introduces Cambrian-S, a family of spatially grounded video multimodal large language models, together with the VSI Super benchmark and the VSI 590K dataset to test and train spatial supersensing in long videos.

https://arxiv.org/pdf/2511.04670

From video question answering to spatial supersensing

The research team frames spatial supersensing as a progression of capabilities beyond language-only reasoning. The stages are semantic perception, streaming event cognition, implicit 3D spatial cognition, and predictive world modeling.

Most current video MLLMs sample sparse frames and rely on language priors. They often answer benchmark questions using captions or single frames rather than continuous visual evidence. Diagnostic tests show that several popular video benchmarks are solvable with limited or text only input, so they do not strongly test spatial sensing.

Cambrian-S targets the higher stages of this hierarchy, where the model must remember spatial layouts across time, reason about object locations and counts and anticipate changes in a 3D world.

VSI Super, a stress test for continual spatial sensing

To expose the gap between current systems and spatial supersensing, the research team designed VSI Super, a two part benchmark that runs on arbitrarily long indoor videos.


VSI Super Recall, or VSR, evaluates long horizon spatial observation and recall. Human annotators take indoor walkthrough videos from ScanNet, ScanNet++ and ARKitScenes and use Gemini to insert an unusual object, such as a teddy bear, into four frames at different spatial locations. These edited sequences are concatenated into streams up to 240 minutes long. The model must report the order of locations where the object appears, which makes this a visual needle in a haystack task with sequential recall.


VSI Super Count, or VSC, measures continual counting under changing viewpoints and rooms. The benchmark concatenates room tour clips from VSI Bench and asks for the total number of instances of a target object across all rooms. The model must handle viewpoint changes, revisits and scene transitions and maintain a cumulative count. Evaluation uses mean relative accuracy for durations from 10 to 120 minutes.

When Cambrian-S 7B is evaluated on VSI Super in a streaming setup at 1 frame per second, accuracy on VSR drops from 38.3 percent at 10 minutes to 6.0 percent at 60 minutes and becomes zero beyond 60 minutes. VSC accuracy is near zero across lengths. Gemini 2.5 Flash also degrades on VSI Super despite a long context window, which shows that brute force context scaling is not sufficient for continual spatial sensing.

VSI 590K, spatially focused instruction data

To test whether data scaling can help, the research team constructs VSI 590K, a spatial instruction corpus with 5,963 videos, 44,858 images and 590,667 question answer pairs from 10 sources.

Sources include 3D annotated real indoor scans such as ScanNet, ScanNet++ V2, ARKitScenes, S3DIS and Aria Digital Twin, simulated scenes from ProcTHOR and Hypersim, and pseudo annotated web data such as YouTube RoomTour, plus the robot datasets Open X Embodiment and AgiBot World.

The dataset defines 12 spatial question types, such as object count, absolute and relative distance, object size, room size and appearance order. Questions are generated from 3D annotations or reconstructions so that spatial relationships are grounded in geometry rather than text heuristics. Ablations show that annotated real videos contribute the largest gains on VSI Bench, followed by simulated data and then pseudo annotated images, and that training on the full mix gives the best spatial performance.


Cambrian-S model family and spatial performance

Cambrian-S builds on Cambrian-1 and uses Qwen2.5 language backbones at 0.5B, 1.5B, 3B and 7B parameters with a SigLIP2 SO400M vision encoder and a two layer MLP connector.

Training follows a four stage pipeline. Stage 1 performs vision language alignment on image text pairs. Stage 2 applies image instruction tuning, equivalent to the improved Cambrian-1 setup. Stage 3 extends to video with general video instruction tuning on a 3 million sample mixture called Cambrian-S 3M. Stage 4 performs spatial video instruction tuning on a mixture of VSI 590K and a subset of the stage 3 data.


On VSI Bench, Cambrian-S 7B reaches 67.5 percent accuracy, outperforming open source baselines like InternVL3.5 8B and Qwen2.5 VL 7B and beating the proprietary Gemini 2.5 Pro by more than 16 absolute points. The model also maintains strong performance on Perception Test, EgoSchema and other general video benchmarks, so the focus on spatial sensing does not destroy general capabilities.

Predictive sensing with latent frame prediction and surprise

To go beyond static context expansion, the research team proposes predictive sensing. They add a Latent Frame Prediction head, which is a two layer MLP that predicts the latent representation of the next video frame in parallel with next token prediction.

Training modifies stage 4. The model uses mean squared error and cosine distance losses between predicted and ground truth latent features, weighted against the language modeling loss. A subset of 290,000 videos from VSI 590K, sampled at 1 frame per second, is reserved for this objective. During this stage the connector, language model and both output heads are trained jointly, while the SigLIP vision encoder remains frozen.
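To make the combined objective concrete, the following is a minimal PyTorch style sketch of how a latent frame prediction loss could be mixed with the language modeling loss. The loss weights and tensor shapes are illustrative assumptions, not values from the paper.

import torch.nn.functional as F

def combined_loss(lm_loss, pred_latents, target_latents, w_lm=1.0, w_mse=1.0, w_cos=1.0):
    # pred_latents, target_latents: (batch, frames, dim) latent features for the
    # predicted and ground truth next frames. The weights are illustrative only.
    mse = F.mse_loss(pred_latents, target_latents)
    # Cosine distance = 1 - cosine similarity, averaged over all frames.
    cos_dist = (1.0 - F.cosine_similarity(pred_latents, target_latents, dim=-1)).mean()
    return w_lm * lm_loss + w_mse * mse + w_cos * cos_dist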


At inference time the cosine distance between predicted and actual features becomes a surprise score. Frames with low surprise are compressed before being stored in long term memory and high surprise frames are retained with more detail. A fixed size memory buffer uses surprise to decide which frames to consolidate or drop and queries retrieve frames that are most relevant to the question.
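The following sketch illustrates this surprise driven memory idea; the threshold value, the pooling based compression, and the buffer size are assumptions for illustration rather than the exact Cambrian-S implementation.

import collections
import torch.nn.functional as F

class SurpriseMemory:
    # Fixed size frame memory that keeps full detail only for surprising frames.
    def __init__(self, max_frames=2048, threshold=0.3):
        self.buffer = collections.deque(maxlen=max_frames)  # drops oldest when full
        self.threshold = threshold  # illustrative value

    def add(self, predicted_feat, actual_feat):
        # Surprise = cosine distance between predicted and observed latent features.
        surprise = 1.0 - F.cosine_similarity(
            predicted_feat.flatten(), actual_feat.flatten(), dim=0
        ).item()
        if surprise < self.threshold:
            # Low surprise: store a compressed summary (here, mean pooled tokens).
            stored = actual_feat.mean(dim=0, keepdim=True)
        else:
            # High surprise: keep the full resolution frame features.
            stored = actual_feat
        self.buffer.append((surprise, stored))
        return surprise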


For VSR, this surprise driven memory system lets Cambrian-S maintain accuracy as video length increases while keeping GPU memory usage stable. It outperforms Gemini 1.5 Flash and Gemini 2.5 Flash on VSR at all tested durations and avoids the sharp degradation seen in models that only extend context.

For VSC, the research team designed a surprise driven event segmentation scheme. The model accumulates features in an event buffer and when a high surprise frame signals a scene change, it summarizes that buffer into a segment level answer and resets the buffer. Aggregating segment answers gives the final count. In streaming evaluation, Gemini Live and GPT Realtime achieve less than 15 percent mean relative accuracy and drop near zero on 120 minute streams, while Cambrian-S with surprise segmentation reaches about 38 percent at 10 minutes and maintains around 28 percent at 120 minutes.
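As a rough sketch of that segmentation loop, the code below accumulates frame features in an event buffer, closes a segment when surprise spikes, and sums per segment counts. The threshold and the predictor and count_in_segment helpers are hypothetical stand-ins, not Cambrian-S APIs.

import torch.nn.functional as F

def streaming_count(frames, predictor, count_in_segment, threshold=0.5):
    # frames: iterable of frame feature tensors
    # predictor(buffer) -> predicted next frame feature
    # count_in_segment(buffer) -> count of the target object in that segment
    total, buffer = 0, []
    for frame in frames:
        if buffer:
            predicted = predictor(buffer)
            surprise = 1.0 - F.cosine_similarity(
                predicted.flatten(), frame.flatten(), dim=0
            ).item()
            if surprise > threshold:
                # Scene change detected: answer for this segment, then reset.
                total += count_in_segment(buffer)
                buffer = []
        buffer.append(frame)
    if buffer:
        total += count_in_segment(buffer)
    return total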

Key Takeaways

Cambrian-S and VSI 590K show that careful spatial data design and strong video MLLMs can significantly improve spatial cognition on VSI Bench, but they still fail on VSI Super, so scale alone does not solve spatial supersensing.

VSI Super, through VSR and VSC, is intentionally built from arbitrarily long indoor videos to stress continual spatial observation, recall and counting, which makes it resistant to brute force context window expansion and standard sparse frame sampling.

Benchmarking shows that frontier models, including Gemini 2.5 Flash and Cambrian-S, degrade sharply on VSI Super even when video lengths remain within their nominal context limits, revealing a structural weakness in current long context multimodal architectures.

The Latent Frame Prediction based predictive sensing module uses next latent frame prediction error, or surprise, to drive memory compression and event segmentation, which yields substantial gains on VSI Super compared to long context baselines while keeping GPU memory usage stable.

The research work positions spatial supersensing as a hierarchy from semantic perception to predictive world modeling and argues that future video MLLMs must incorporate explicit predictive objectives and surprise driven memory, not only larger models and datasets, to handle unbounded streaming video in real applications.

Editorial Comments

This work is a useful stress test of current video MLLMs because it shows that VSI Super is not just a harder benchmark; it exposes a structural failure of long context architectures that still rely on reactive perception. The predictive sensing module, based on Latent Frame Prediction and surprise driven memory, is an important step because it couples spatial sensing with internal world modeling rather than only scaling data and parameters. This research signals a shift from passive video understanding to predictive spatial supersensing as the next design target for multimodal models.

The post Why Spatial Supersensing is Emerging as the Core Capability for Multimodal AI Systems? appeared first on MarkTechPost.

Comparing the Top 6 Inference Runtimes for LLM Serving in 2025

Large language models are now limited less by training and more by how fast and cheaply we can serve tokens under real traffic. That comes down to three implementation details: how the runtime batches requests, how it overlaps prefill and decode, and how it stores and reuses the KV cache. Different engines make different tradeoffs on these axes, which show up directly as differences in tokens per second, P50/P99 latency, and GPU memory usage.

This article compares six runtimes that show up repeatedly in production stacks:

vLLM

TensorRT LLM

Hugging Face Text Generation Inference (TGI v3)

LMDeploy

SGLang

DeepSpeed Inference / ZeRO Inference

1. vLLM

Design

vLLM is built around PagedAttention. Instead of storing each sequence’s KV cache in a large contiguous buffer, it partitions KV into fixed size blocks and uses an indirection layer so each sequence points to a list of blocks.

This gives:

Very low KV fragmentation (reported <4% waste vs 60–80% in naïve allocators)

High GPU utilization with continuous batching

Native support for prefix sharing and KV reuse at block level

Recent versions add KV quantization (FP8) and integrate FlashAttention style kernels.
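As a quick illustration of the serving API, here is a minimal offline inference sketch with vLLM; the model name and sampling settings are placeholders, and version specific options such as KV cache dtype are omitted.

from vllm import LLM, SamplingParams

# Placeholder model; any Hugging Face compatible checkpoint you can access works.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=128)

# Pass many prompts at once; the engine schedules them with continuous batching
# over PagedAttention KV blocks.
outputs = llm.generate(
    ["Summarize PagedAttention in one sentence.",
     "Why does KV fragmentation hurt throughput?"],
    params,
)
for out in outputs:
    print(out.outputs[0].text)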

Performance

vLLM evaluation:

vLLM achieves 14–24× higher throughput than Hugging Face Transformers and 2.2–3.5× higher than early TGI for LLaMA models on NVIDIA GPUs.

KV and memory behavior

PagedAttention provides a KV layout that is both GPU friendly and fragmentation resistant.

FP8 KV quantization reduces KV size and improves decode throughput when compute is not the bottleneck.

Where it fits

Default high performance engine when you need a general LLM serving backend with good throughput, good TTFT, and hardware flexibility.

2. TensorRT LLM

Design

TensorRT LLM is a compilation based engine on top of NVIDIA TensorRT. It generates fused kernels per model and shape, and exposes an executor API used by frameworks such as Triton.

Its KV subsystem is explicit and feature rich:

Paged KV cache

Quantized KV cache (INT8, FP8, with some combinations still evolving)

Circular buffer KV cache

KV cache reuse, including offloading KV to CPU and reusing it across prompts to reduce TTFT

NVIDIA reports that CPU based KV reuse can reduce time to first token by up to 14× on H100 and even more on GH200 in specific scenarios.

Performance

TensorRT LLM is highly tunable, so results vary. Common patterns from public comparisons and vendor benchmarks:

Very low single request latency on NVIDIA GPUs when engines are compiled for the exact model and configuration.

At moderate concurrency, it can be tuned either for low TTFT or for high throughput; at very high concurrency, throughput optimized profiles push P99 up due to aggressive batching.

KV and memory behavior

Paged KV plus quantized KV gives strong control over memory use and bandwidth.

Executor and memory APIs let you design cache aware routing policies at the application layer.

Where it fits

Latency critical workloads and NVIDIA only environments, where teams can invest in engine builds and per model tuning.

3. Hugging Face TGI v3

Design

Text Generation Inference (TGI) is a server focused stack with:

Rust based HTTP and gRPC server

Continuous batching, streaming, safety hooks

Backends for PyTorch and TensorRT and tight Hugging Face Hub integration

TGI v3 adds a new long context pipeline:

Chunked prefill for long inputs

Prefix KV caching so long conversation histories are not recomputed on each request

Performance

For conventional prompts, recent third party work shows:

vLLM often edges out TGI on raw tokens per second at high concurrency due to PagedAttention, but the difference is not huge on many setups.

TGI v3 processes around 3× more tokens and is up to 13× faster than vLLM on long prompts, under a setup with very long histories and prefix caching enabled.

Latency profile:

P50 for short and mid length prompts is similar to vLLM when both are tuned with continuous batching.

For long chat histories, prefill dominates in naive pipelines; TGI v3’s reuse of earlier tokens gives a large win in TTFT and P50.

KV and memory behavior

TGI uses KV caching with paged attention style kernels and reduces memory footprint through chunking of prefill and other runtime changes.

It integrates quantization through bits and bytes and GPTQ and runs across several hardware backends.

Where it fits

Production stacks already on Hugging Face, especially for chat style workloads with long histories where prefix caching gives large real world gains.
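For reference, a minimal Python client for a running TGI server could look like the sketch below; the host, port, and parameter values are assumptions, and long chat histories are where the v3 prefix cache pays off because repeated prefixes are not recomputed.

import requests

# Assumes a TGI server is already running, for example via the official Docker
# image, and listening on port 8080 (an assumption for this sketch).
payload = {
    "inputs": "...long chat history...\nUser: What changed in TGI v3?",
    "parameters": {"max_new_tokens": 128, "temperature": 0.7},
}
resp = requests.post("http://localhost:8080/generate", json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["generated_text"])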

4. LMDeploy

Design

LMDeploy is a toolkit for compression and deployment from the InternLM ecosystem. It exposes two engines:

TurboMind: high performance CUDA kernels for NVIDIA GPUs

PyTorch engine: flexible fallback

Key runtime features:

Persistent, continuous batching

Blocked KV cache with a manager for allocation and reuse

Dynamic split and fuse for attention blocks

Tensor parallelism

Weight only and KV quantization (including AWQ and online INT8 / INT4 KV quant)

The LMDeploy team reports up to 1.8× higher request throughput than vLLM, attributing this to persistent batching, blocked KV and optimized kernels.

Performance

Evaluations show:

For 4 bit Llama style models on A100, LMDeploy can reach higher tokens per second than vLLM under comparable latency constraints, especially at high concurrency.

It also reports that 4 bit inference is about 2.4× faster than FP16 for supported models.

Latency:

Single request TTFT is in the same ballpark as other optimized GPU engines when configured without extreme batch limits.

Under heavy concurrency, persistent batching plus blocked KV let LMDeploy sustain high throughput without TTFT collapse.

KV and memory behavior

Blocked KV cache trades contiguous per sequence buffers for a grid of KV chunks managed by the runtime, similar in spirit to vLLM’s PagedAttention but with a different internal layout.

Support for weight and KV quantization targets large models on constrained GPUs.

Where it fits

NVIDIA centric deployments that want maximum throughput and are comfortable using TurboMind and LMDeploy specific tooling.
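For reference, LMDeploy's high level pipeline API looks roughly like the sketch below; the model ID is a placeholder and the default TurboMind backend settings are assumed, so treat it as a starting point rather than a tuned deployment.

from lmdeploy import pipeline

# Placeholder model ID; 4 bit AWQ variants are where TurboMind's kernels shine.
pipe = pipeline("internlm/internlm2_5-7b-chat")

# Batched prompts are served with persistent batching and the blocked KV cache.
responses = pipe(["What is a blocked KV cache?",
                  "Why does KV quantization help on small GPUs?"])
for r in responses:
    print(r.text)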

5. SGLang

Design

SGLang is both:

A DSL for building structured LLM programs such as agents, RAG workflows and tool pipelines

A runtime that implements RadixAttention, a KV reuse mechanism that shares prefixes using a radix tree structure rather than simple block hashes.

RadixAttention:

Stores KV for many requests in a prefix tree keyed by tokens

Enables high KV hit rates when many calls share prefixes, such as few shot prompts, multi turn chat, or tool chains

Performance

Key Insights:

SGLang achieves up to 6.4× higher throughput and up to 3.7× lower latency than baseline systems such as vLLM, LMQL and others on structured workloads.

Improvements are largest when there is heavy prefix reuse, for example multi turn chat or evaluation workloads with repeated context.

Reported KV cache hit rates range from roughly 50% to 99%, and cache aware schedulers get close to the optimal hit rate on the measured benchmarks.

KV and memory behavior

RadixAttention sits on top of paged attention style kernels and focuses on reuse rather than just allocation.

SGLang integrates well with hierarchical context caching systems that move KV between GPU and CPU when sequences are long, although those systems are usually implemented as separate projects.

Where it fits

Agentic systems, tool pipelines, and heavy RAG applications where many calls share large prompt prefixes and KV reuse matters at the application level.
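A hedged sketch of exploiting prefix reuse: the calls below share one long system prompt, so an SGLang server exposing its OpenAI compatible endpoint can serve the cached KV for that prefix and only compute each question's suffix. The host, port, and model name are assumptions.

from openai import OpenAI

# Assumes an SGLang server is running with an OpenAI compatible endpoint.
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

shared_prefix = "You are a support agent. Use the following product manual: ..."

for question in ["How do I reset the device?", "What voids the warranty?"]:
    resp = client.chat.completions.create(
        model="default",
        messages=[{"role": "system", "content": shared_prefix},
                  {"role": "user", "content": question}],
    )
    print(resp.choices[0].message.content)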

6. DeepSpeed Inference / ZeRO Inference

Design

DeepSpeed provides two pieces relevant for inference:

DeepSpeed Inference: optimized transformer kernels plus tensor and pipeline parallelism

ZeRO Inference / ZeRO Offload: techniques that offload model weights, and in some setups KV cache, to CPU or NVMe to run very large models on limited GPU memory

ZeRO Inference focuses on:

Keeping little or no model weights resident in GPU

Streaming tensors from CPU or NVMe as needed

Targeting throughput and model size rather than low latency

Performance

In the ZeRO Inference OPT 30B example on a single V100 32GB:

Full CPU offload reaches about 43 tokens per second

Full NVMe offload reaches about 30 tokens per second

Both are 1.3–2.4× faster than partial offload configurations, because full offload enables larger batch sizes

These numbers are small compared to GPU resident LLM runtimes on A100 or H100, but they apply to a model that does not fit natively in 32GB.

A recent I/O characterization of DeepSpeed and FlexGen confirms that offload based systems are dominated by small 128 KiB reads and that I/O behavior becomes the main bottleneck.

KV and memory behavior

Model weights and sometimes KV blocks are offloaded to CPU or SSD to fit models beyond GPU capacity.

TTFT and P99 are high compared to pure GPU engines, but the tradeoff is the ability to run very large models that otherwise would not fit.

Where it fits

Offline or batch inference, or low QPS services where model size matters more than latency and GPU count.

Comparison Tables

This table summarizes the main tradeoffs qualitatively:

| Runtime | Main design idea | Relative strength | KV strategy | Typical use case |
| --- | --- | --- | --- | --- |
| vLLM | PagedAttention, continuous batching | High tokens per second at given TTFT | Paged KV blocks, FP8 KV support | General purpose GPU serving, multi hardware |
| TensorRT LLM | Compiled kernels on NVIDIA + KV reuse | Very low latency and high throughput on NVIDIA | Paged, quantized KV, reuse and offload | NVIDIA only, latency sensitive |
| TGI v3 | HF serving layer with long prompt path | Strong long prompt performance, integrated stack | Paged KV, chunked prefill, prefix caching | HF centric APIs, long chat histories |
| LMDeploy | TurboMind kernels, blocked KV, quant | Up to 1.8× vLLM throughput in vendor tests | Blocked KV cache, weight and KV quant | NVIDIA deployments focused on raw throughput |
| SGLang | RadixAttention and structured programs | Up to 6.4× throughput and 3.7× lower latency on structured workloads | Radix tree KV reuse over prefixes | Agents, RAG, high prefix reuse |
| DeepSpeed | GPU CPU NVMe offload for huge models | Enables large models on small GPU; throughput oriented | Offloaded weights and sometimes KV | Very large models, offline or low QPS |

Choosing a runtime in practice

For a production system, the choice tends to collapse to a few simple patterns:

You want a strong default engine with minimal custom work: You can start with vLLM. It gives you good throughput, reasonable TTFT, and solid KV handling on common hardware.

You are committed to NVIDIA and want fine grained control over latency and KV: You can use TensorRT LLM, likely behind Triton or TGI. Plan for model specific engine builds and tuning.

Your stack is already on Hugging Face and you care about long chats: You can use TGI v3. Its long prompt pipeline and prefix caching are very effective for conversation style traffic.

You want maximum throughput per GPU with quantized models: You can use LMDeploy with TurboMind and blocked KV, especially for 4 bit Llama family models.

You are building agents, tool chains or heavy RAG systems: You can use SGLang and design prompts so that KV reuse via RadixAttention is high.

You must run very large models on limited GPUs: You can use DeepSpeed Inference / ZeRO Inference, accept higher latency, and treat the GPU as a throughput engine with SSD in the loop.

Overall, all these engines are converging on the same idea: KV cache is the real bottleneck resource. The winners are the runtimes that treat KV as a first class data structure to be paged, quantized, reused and offloaded, not just a big tensor slapped into GPU memory.
The post Comparing the Top 6 Inference Runtimes for LLM Serving in 2025 appeared first on MarkTechPost.

Connect Amazon Bedrock agents to cross-account knowledge bases

Organizations need seamless access to their structured data repositories to power intelligent AI agents. However, when these resources span multiple AWS accounts, integration challenges can arise. This post explores a practical solution for connecting Amazon Bedrock agents to knowledge bases in Amazon Redshift clusters residing in different AWS accounts.
The challenge
Organizations that build AI agents using Amazon Bedrock can maintain their structured data in Amazon Redshift clusters. When these data repositories exist in separate AWS accounts from their AI agents, they face a significant limitation: Amazon Bedrock Knowledge Bases doesn’t natively support cross-account Redshift integration.
This creates a challenge for enterprises with multi-account architectures who want to:

Leverage existing structured data in Redshift for their AI agents.
Maintain separation of concerns across different AWS accounts.
Avoid duplicating data across accounts.
Ensure proper security and access controls.

Solution overview
Our solution enables cross-account knowledge base integration through a secure, serverless architecture that maintains secure access controls while allowing AI agents to query structured data. The approach uses AWS Lambda as an intermediary to facilitate secure cross-account data access.

The action flow, as shown in the preceding architecture, is as follows:

Users enter their natural language question in the Amazon Bedrock agent, which is configured in the agent account.
The Amazon Bedrock agent invokes a Lambda function through its action group, which provides access to the Amazon Bedrock knowledge base configured in the agent-kb account.
The action group Lambda function running in the agent account assumes an IAM role created in the agent-kb account to connect to the knowledge base in that account.
The Amazon Bedrock knowledge base in the agent-kb account uses an IAM role created in the same account to access the Amazon Redshift data warehouse and query its data.

The solution includes the following key components (an illustrative sketch of the cross-account call follows this list):

Amazon Bedrock agent in the agent account that handles user interactions.
Amazon Redshift serverless workgroup in VPC and private subnet in the agent-kb account containing structured data.
Amazon Bedrock Knowledge base using the Amazon Redshift serverless workgroup as structured data source.
Lambda function in the agent account.
Action group configuration to connect the agent in the agent account to the Lambda function.
IAM roles and policies that enable secure cross-account access.
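To make the cross-account pattern concrete, the following is an illustrative sketch of what the action group Lambda function could do: assume the bedrock_kb_access_role in the agent-kb account with AWS STS, then query the knowledge base using the temporary credentials. The role ARN, knowledge base ID, event parsing, and the choice of the retrieve API are assumptions for illustration; the actual code is in the aws-samples repository referenced later in this post.

import boto3

def lambda_handler(event, context):
    # Assume the cross-account role created in the agent-kb account
    # (example ARN; replace with your bedrock_kb_access_role ARN).
    creds = boto3.client("sts").assume_role(
        RoleArn="arn:aws:iam::999999999999:role/bedrock_kb_access_role",
        RoleSessionName="agent-kb-query",
    )["Credentials"]

    # Bedrock Agent Runtime client using the temporary cross-account credentials.
    kb_client = boto3.client(
        "bedrock-agent-runtime",
        region_name="us-west-2",
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )

    # Query the knowledge base; the event parsing and the response shape
    # expected by the action group are simplified here.
    question = event.get("inputText", "Who are the top 5 customers?")
    result = kb_client.retrieve(
        knowledgeBaseId="XXXXXXXXXX",  # knowledge base ID from the agent-kb account
        retrievalQuery={"text": question},
    )
    return {"results": result.get("retrievalResults", [])}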

Prerequisites
This solution requires you to have the following:

Two AWS accounts. Create an AWS account if you do not have one. Specific permissions are required for both accounts and will be set up in subsequent steps.
Install the AWS CLI (2.24.22 – current version)
Set up authentication using IAM user credentials for the AWS CLI for each account
Make sure you have jq installed; jq is a lightweight command-line JSON processor. For example, on macOS you can install it with the command brew install jq (jq-1.7.1-apple at the time of writing).
Navigate to the Amazon Bedrock console and make sure you enable access to the meta.llama3-1-70b-instruct-v1:0 model in the agent-kb account and access to the us.amazon.nova-pro-v1:0 model in the agent account, in the us-west-2, US West (Oregon) AWS Region.

Assumption
Let's call the AWS account profile that has the Amazon Bedrock agent the agent profile. Similarly, the AWS account profile that has the Amazon Bedrock knowledge base with Amazon Redshift Serverless and the structured data source will be called agent-kb. We will use the us-west-2, US West (Oregon) AWS Region, but feel free to choose another AWS Region as necessary (the prerequisites apply to whichever Region you deploy this solution in). We will use the meta.llama3-1-70b-instruct-v1:0 model for the agent-kb account. This is an available on-demand model in us-west-2. You are free to choose other models with cross-Region inference, but that would mean changing the roles and policies accordingly and enabling model access in all Regions where they are available. Based on our model choice for this solution, the AWS Region must be us-west-2. For the agent, we will use an Amazon Bedrock agent optimized model such as us.amazon.nova-pro-v1:0.
Implementation walkthrough
The following is a step-by-step implementation guide. Make sure to perform all steps in the same AWS Region in both accounts.
These steps deploy and test an end-to-end solution from scratch; if you are already running some of these components, you can skip those steps.

Make a note of the AWS account numbers in the agent and agent-kb accounts. In the implementation steps we will refer to them as follows:

| Profile | AWS account | Description |
| --- | --- | --- |
| agent | 111122223333 | Account for the Bedrock agent |
| agent-kb | 999999999999 | Account for the Bedrock knowledge base |

Note: These steps use example profile names and account numbers; replace them with your actual values before running.
Create the Amazon Redshift Serverless workgroup in the agent-kb account:

Log on to the agent-kb account
Follow the workshop link to create the Amazon Redshift Serverless workgroup in private subnet
Make a note of the namespace, workgroup, and other details and follow the rest of the hands-on workshop instructions.

Set up your data warehouse in the agent-kb account.
Create your AI knowledge base in the agent-kb account. Make a note of the knowledge base ID.
Train your AI Assistant in the agent-kb account.
Test natural language queries in the agent-kb account. You can find the code in aws-samples git repository: sample-for-amazon-bedrock-agent-connect-cross-account-kb.
Create the necessary roles and policies in both accounts. Run the script create_bedrock_agent_kb_roles_policies.sh with the following input parameters.

| Input parameter | Value | Description |
| --- | --- | --- |
| --agent-kb-profile | agent-kb | The agent knowledge base profile that you set up with the AWS CLI with aws_access_key_id and aws_secret_access_key, as mentioned in the prerequisites. |
| --lambda-role | lambda_bedrock_kb_query_role | The IAM role that the Bedrock agent action group Lambda in the agent account will assume to connect to Redshift cross account. |
| --kb-access-role | bedrock_kb_access_role | The IAM role in the agent-kb account that lambda_bedrock_kb_query_role in the agent account assumes to connect to Redshift cross account. |
| --kb-access-policy | bedrock_kb_access_policy | IAM policy attached to the IAM role bedrock_kb_access_role. |
| --lambda-policy | lambda_bedrock_kb_query_policy | IAM policy attached to the IAM role lambda_bedrock_kb_query_role. |
| --knowledge-base-id | XXXXXXXXXX | Replace with the actual knowledge base ID created in Step 4. |
| --agent-account | 111122223333 | Replace with the 12-digit AWS account number where the Bedrock agent is running (agent account). |
| --agent-kb-account | 999999999999 | Replace with the 12-digit AWS account number where the Bedrock knowledge base is running (agent-kb account). |

Download the script (create_bedrock_agent_kb_roles_policies.sh) from the aws-samples GitHub repository.
Open Terminal in Mac or similar bash shell for other platforms.
Locate and change the directory to the downloaded location, provide executable permissions:

cd /my/location
chmod +x create_bedrock_agent_kb_roles_policies.sh

If you are still not clear on the script usage or inputs, you can run the script with the --help option and it will display the usage: ./create_bedrock_agent_kb_roles_policies.sh --help
Run the script with the right input parameters as described in the previous table.

./create_bedrock_agent_kb_roles_policies.sh --agent-profile agent \
  --agent-kb-profile agent-kb \
  --lambda-role lambda_bedrock_kb_query_role \
  --kb-access-role bedrock_kb_access_role \
  --kb-access-policy bedrock_kb_access_policy \
  --lambda-policy lambda_bedrock_kb_query_policy \
  --knowledge-base-id XXXXXXXXXX \
  --agent-account 111122223333 \
  --agent-kb-account 999999999999

On successful execution, the script shows a summary of the IAM roles and policies created in both accounts.
Log on to both the agent and agent-kb account to verify the IAM roles and policies are created.

For the agent account: Make a note of the ARN of the lambda_bedrock_kb_query_role as that will be the value of CloudFormation stack parameter AgentLambdaExecutionRoleArn in the next step.
For the agent-kb account: Make a note of the ARN of the bedrock_kb_access_role as that will be the value of CloudFormation stack parameter TargetRoleArn in the next step.

Run the AWS CloudFormation script to create a Bedrock agent:

Download the CloudFormation script: cloudformation_bedrock_agent_kb_query_cross_account.yaml from the aws-samples GitHub repository.
Log on to the agent account and navigate to the CloudFormation console, and verify you are in the us-west-2 (Oregon) Region, choose Create stack and choose With new resources (standard).
In the Specify template section choose Upload a template file and then Choose file and select the file from (1). Then, choose Next.
Enter the following stack details and choose Next.

| Parameter | Value | Description |
| --- | --- | --- |
| Stack name | bedrock-agent-connect-kb-cross-account-agent | You can choose any name |
| AgentFoundationModelId | us.amazon.nova-pro-v1:0 | Do not change |
| AgentLambdaExecutionRoleArn | arn:aws:iam::111122223333:role/lambda_bedrock_kb_query_role | Replace with your agent account number |
| BedrockAgentDescription | Agent to query inventory data from Redshift Serverless database | Keep this as default |
| BedrockAgentInstructions | You are an assistant that helps users query inventory data from our Redshift Serverless database using the action group. | Do not change |
| BedrockAgentName | bedrock_kb_query_cross_account | Keep this as default |
| KBFoundationModelId | meta.llama3-1-70b-instruct-v1:0 | Do not change |
| KnowledgeBaseId | XXXXXXXXXX | Knowledge base ID from Step 4 |
| TargetRoleArn | arn:aws:iam::999999999999:role/bedrock_kb_access_role | Replace with your agent-kb account number |

Complete the acknowledgement and choose Next.
Scroll down through the page and choose Submit.
You will see the CloudFormation stack is getting created as shown by the status CREATE_IN_PROGRESS.
It will take a few minutes, and you will see the status change to CREATE_COMPLETE indicating creation of all resources. Choose the Outputs tab to make a note of the resources that were created. In summary, the CloudFormation script does the following in the agent account.

Creates a Bedrock agent
Creates an action group
Also creates a Lambda function which is invoked by the Bedrock action group
Defines the OpenAPI schema
Creates necessary roles and permissions for the Bedrock agent
Finally, it prepares the Bedrock agent so that it is ready to test.

Check for model access in Oregon (us-west-2)

Verify Nova Pro (us.amazon.nova-pro-v1:0) model access in the agent account. Navigate to the Amazon Bedrock console and choose Model access under Configure and learn. Search for Model name : Nova Pro to verify access. If not, then enable model access.
Verify access to the meta.llama3-1-70b-instruct-v1:0 model in the agent-kb account. This should already be enabled as we set up the knowledge base earlier.

Run the agent. Log on to agent account. Navigate to Amazon Bedrock console and choose Agents under Build.
Choose the name of the agent and choose Test. You can test the following questions, as mentioned in the workshop's Stage 4: Test Natural Language Queries page. For example:

Who are the top 5 customers in Saudi Arabia?
Who are the top parts suppliers in the United States by volume?
What is the total revenue by region for the year 1998?
Which products have the highest profit margins?
Show me orders with the highest priority from the last quarter of 1997.

Choose Show trace to investigate the agent traces.

Some recommended best practices:

Phrase your question to be more specific
Use terminology that matches your table descriptions
Try questions similar to your curated examples
Verify your question relates to data that exists in the TPCH dataset
Use Amazon Bedrock Guardrails to add configurable safeguards to questions and responses.

Clean up resources
It is recommended that you clean up any resources you do not need anymore to avoid any unnecessary charges:

Navigate to the CloudFormation console for the agent and agent-kb accounts, search for the stack, and choose Delete.
S3 buckets need to be deleted separately.
For deleting the roles and policies created in both accounts, download the script delete-bedrock-agent-kb-roles-policies.sh from the aws-samples GitHub repository.

Open Terminal in Mac or similar bash shell on other platforms.
Locate and change the directory to the downloaded location, provide executable permissions:

cd /my/location
chmod +x delete-bedrock-agent-kb-roles-policies.sh

If you are still not clear on the script usage or inputs, you can run the script with the --help option and it will display the usage: ./delete-bedrock-agent-kb-roles-policies.sh --help
Run the script delete-bedrock-agent-kb-roles-policies.sh with the same values for the same input parameters as in Step 7 when running the create_bedrock_agent_kb_roles_policies.sh script. Note: Enter the correct account numbers for --agent-account and --agent-kb-account before running.

./delete-bedrock-agent-kb-roles-policies.sh --agent-profile agent \
  --agent-kb-profile agent-kb \
  --lambda-role lambda_bedrock_kb_query_role \
  --kb-access-role bedrock_kb_access_role \
  --kb-access-policy bedrock_kb_access_policy \
  --lambda-policy lambda_bedrock_kb_query_policy \
  --agent-account 111122223333 \
  --agent-kb-account 999999999999
The script will ask for confirmation; type yes and press Enter.

Summary
This solution demonstrates how the Amazon Bedrock agent in the agent account can query the Amazon Bedrock knowledge base in the agent-kb account.
Conclusion
This solution uses Amazon Bedrock Knowledge Bases for structured data to create a more integrated approach to cross-account data access. The knowledge base in agent-kb account connects directly to Amazon Redshift Serverless in a private VPC. The Amazon Bedrock agent in the agent account invokes an AWS Lambda function as part of its action group to make a cross-account connection to retrieve response from the structured knowledge base.
This architecture offers several advantages:

Uses Amazon Bedrock Knowledge Bases capabilities for structured data
Provides a more seamless integration between the agent and the data source
Maintains proper security boundaries between accounts
Reduces the complexity of direct database access codes

As Amazon Bedrock continues to evolve, you can take advantage of future enhancements to knowledge base functionality while maintaining your multi-account architecture.

About the Authors
Kunal Ghosh is an expert in AWS technologies. He is passionate about building efficient and effective solutions on AWS, especially involving generative AI, analytics, data science, and machine learning. Besides family time, he likes reading, swimming, biking, and watching movies, and he is a foodie.
Arghya Banerjee is a Sr. Solutions Architect at AWS in the San Francisco Bay Area, focused on helping customers adopt and use the AWS Cloud. He is focused on big data, data lakes, streaming and batch analytics services, and generative AI technologies.
Indranil Banerjee is a Sr. Solutions Architect at AWS in the San Francisco Bay Area, focused on helping customers in the hi-tech and semi-conductor sectors solve complex business problems using the AWS Cloud. His special interests are in the areas of legacy modernization and migration, building analytics platforms and helping customers adopt cutting edge technologies such as generative AI.
Vinayak Datar is Sr. Solutions Manager based in Bay Area, helping enterprise customers accelerate their AWS Cloud journey. He’s focusing on helping customers to convert ideas from concepts to working prototypes to production using AWS generative AI services.

Democratizing AI: How Thomson Reuters Open Arena supports no-code AI f …

This post is cowritten by Laura Skylaki, Vaibhav Goswami, Ramdev Wudali and Sahar El Khoury from Thomson Reuters.
Thomson Reuters (TR) is a leading AI and technology company dedicated to delivering trusted content and workflow automation solutions. With over 150 years of expertise, TR provides essential solutions across legal, tax, accounting, risk, trade, and media sectors in a fast-evolving world.
TR recognized early that AI adoption would fundamentally transform professional work. According to TR’s 2025 Future of Professionals Report, 80% of professionals anticipate AI significantly impacting their work within five years, with projected productivity gains of up to 12 hours per week by 2029. To unlock this immense potential, TR needed a solution to democratize AI creation across its organization.
In this blog post, we explore how TR addressed key business use cases with Open Arena, a highly scalable and flexible no-code AI solution powered by Amazon Bedrock and other AWS services such as Amazon OpenSearch Service, Amazon Simple Storage Service (Amazon S3), Amazon DynamoDB, and AWS Lambda. We’ll explain how TR used AWS services to build this solution, including how the architecture was designed, the use cases it solves, and the business profiles that use it. The system demonstrates TR’s successful approach of using existing TR services for rapid launches while supporting thousands of users, showcasing how organizations can democratize AI access and support business profiles (for example, AI explorers and SMEs) to create applications without coding expertise.
Introducing Open Arena: No-code AI for all
TR introduced Open Arena to non-technical professionals to create their own customized AI solutions. With Open Arena users can use cutting-edge AI powered by Amazon Bedrock in a no-code environment, exemplifying TR’s commitment to democratizing AI access.
Today, Open Arena supports:

High adoption: ~70% employee adoption, with 19,000 monthly active users.
Custom solutions: Thousands of customized AI solutions created without coding, used for internal workflows or integrated into TR products for customers.
Self-served functionality: 100% self-served functionality, so that users, irrespective of technical background, can develop, evaluate, and deploy generative AI solutions.

The Open Arena journey: From prototype to enterprise solution
Conceived as a rapid prototype, Open Arena was developed in under six weeks at the onset of the generative AI boom in early 2023 by TR Labs – TR’s dedicated applied research division focused on the research, development, and application of AI and emerging trends in technologies. The goal was to support internal team exploration of large language models (LLMs) and discover unique use cases by merging LLM capabilities with TR company data.
Open Arena’s introduction significantly increased AI awareness, fostered developer-SME collaboration for groundbreaking concepts, and accelerated AI capability development for TR products. The rapid success and demand for new features quickly highlighted Open Arena’s potential for AI democratization, so TR developed an enterprise version of Open Arena. Built on the TR AI Platform, Open Arena enterprise version offers secure, scalable, and standardized services covering the entire AI development lifecycle, significantly accelerating time to production.
The Open Arena enterprise version uses existing system capabilities for enhanced data access controls, standardized service access, and compliance with TR’s governance and ethical standards. This version introduced self-served capabilities so that every user, irrespective of their technical ability, can create, evaluate, and deploy customized AI solutions in a no-code environment.

“The foundation of the AI Platform has always been about empowerment; in the early days it was about empowering Data Scientists but with the rise of Gen AI, the platform adapted and evolved on empowering users of any background to leverage and create AI Solutions.”
– Maria Apazoglou, Head of AI Engineering, CoCounsel

As of July 2025, the TR Enterprise AI Platform consists of 15 services spanning the entire AI development lifecycle and user personas. Open Arena remains one of its most popular services, serving 19,000 users each month, with usage increasing monthly.
Addressing key enterprise AI challenges across user types
Using the TR Enterprise AI Platform, Open Arena helped thousands of professionals transition into using generative AI. AI-powered innovation is now readily in the hands of everyone, not just AI scientists.
Open Arena successfully addresses four critical enterprise AI challenges:

Enablement: Delivers AI solution building with consistent LLM and service provider experience and support for various user personas, including non-technical.
Security and quality: Streamlines AI solution quality tracking using evaluation and monitoring services, whilst complying with data governance and ethics policies.
Speed and reusability: Automates workflows and uses existing AI solutions and prompts.
Resources and cost management: Tracks and displays generative AI solution resource consumption, supporting transparency and efficiency.

The solution currently supports several AI experiences, including tech support, content creation, coding assistance, data extraction and analysis, proofreading, project management, content summarization, personal development, translation, and problem solving, catering to different user needs across the organization.

Figure 1. Examples of Open Arena use cases.
AI explorers use Open Arena to speed up day-to-day tasks, such as summarizing documents, engaging in LLM chat, building custom workflows, and comparing AI models. AI creators and Subject Matter Experts (SMEs) use Open Arena to build custom AI workflows and experiences and to evaluate solutions without requiring coding knowledge. Meanwhile, developers can develop and deploy new AI solutions at speed, training models, creating new AI skills, and deploying AI capabilities.
Why Thomson Reuters selected AWS for Open Arena
TR strategically chose AWS as a primary cloud provider for Open Arena based on several critical factors:

Comprehensive AI/ML capabilities: Amazon Bedrock offers easy access to a choice of high-performing foundation models from leading AI companies like AI21 Labs, Anthropic, Cohere, DeepSeek, Luma AI, Meta, Mistral AI, OpenAI, Qwen, Stability AI, TwelveLabs, Writer, and Amazon. It supports simple chat and complex RAG workflows, and integrates seamlessly with TR’s existing Enterprise AI Platform.
Enterprise-grade security and governance: Advanced security controls, model access using RBAC, data handling with enhanced security features, single sign-on (SSO) enabled, and clear operational and user data separation across AWS accounts.
Scalable infrastructure: Serverless architecture for automatic scaling, pay-per-use pricing for cost optimization, and global availability with low latency.
Existing relationship and expertise: Strong, established relationship between TR and AWS, existing Enterprise AI Platform on AWS, and deep AWS expertise within TR’s technical teams.

“Our long-standing partnership with AWS and their robust, flexible and innovative services made them the natural choice to power Open Arena and accelerate our AI initiatives.”
– Maria Apazoglou, Head of AI Engineering, CoCounsel

Open Arena architecture: Scalability, extensibility, and security
Designed for a broad enterprise audience, Open Arena prioritizes scalability, extensibility and security while maintaining simplicity for non-technical users to create and deploy AI solutions. The following diagram illustrates the architecture of Open Arena.

Figure 2. Architecture design of Open Arena.
The architecture design facilitates enterprise-grade performance with clear separation between capability and usage, aligning with TR’s enterprise cost and usage tracking requirements.
The following are key components of the solution architecture:

No-code interface: Intuitive UI, visual workflow builder, pre-built templates, drag-and-drop functionality.
Enterprise integration: Seamless integration with TR’s Enterprise AI Platform, SSO enabled, data handling with enhanced security, clear data separation.
Solution management: Searchable repository, public/private sharing, version control, usage analytics.

TR developed Open Arena using AWS services such as Amazon Bedrock, Amazon OpenSearch Service, Amazon DynamoDB, Amazon API Gateway, AWS Lambda, and AWS Step Functions. It uses Amazon Bedrock for foundational model interactions, supporting simple chat and complex Retrieval-Augmented Generation (RAG) tasks. Open Arena uses Amazon Bedrock Flows as the custom workflow builder where users can drag and drop components like prompts, agents, knowledge bases and Lambda functions to create sophisticated AI workflows without coding. The system also integrates with Amazon OpenSearch Service for knowledge bases and external APIs for advanced agent capabilities.
For data separation, orchestration is managed using the Enterprise AI Platform AWS account, capturing operational data. Flow instances and user-specific data reside in the user’s dedicated AWS account, stored in a database. Each user’s data and workflow executions are isolated within their respective AWS accounts, which is required for complying with Thomson Reuters data sovereignty and enterprise security policies with strict regional controls. The system integrates with Thomson Reuters SSO solution to automatically identify users and grant secure, private access to foundational models.
The orchestration layer, centrally hosted within the Enterprise AI Platform AWS account, manages AI workflow activities, including scheduling, deployment, resource provisioning, and governance across user environments.
The system features fully automated provisioning of  Amazon Bedrock Flows directly within each user’s AWS account, avoiding manual setup and accelerating time to value. Using AWS Lambda for serverless compute and DynamoDB for scalable, low-latency storage, the system dynamically allocates resources based on real-time demand. This architecture makes sure prompt flows and supporting infrastructure are deployed and scaled to match workload fluctuations, optimizing performance, cost, and user experience.

“Our decision to adopt a cross-account architecture was driven by a commitment to enterprise security and operational excellence. By isolating orchestration from execution, we make sure that each user’s data remains private and secure within their own AWS account, while still delivering a seamless, centrally-managed experience. This design empowers organizations to innovate rapidly without compromising compliance or control.”
– Thomson Reuters’ architecture team

Evolution of Open Arena: From classic to Amazon Bedrock Flows-powered chain builder
Open Arena has evolved to cater to varying levels of user sophistication:

Open Arena v1 (Classic): Features a form-based interface for simple prompt customization and basic AI workflow deployment within a single AWS account. Its simplicity appeals to novice users for straightforward use cases, though with limited advanced capabilities.
Open Arena v2 (Chain Builder): Introduces a robust, visual workflow builder interface, enabling users to design complex, multi-step AI workflows using drag-and-drop components. With support for advanced node types, parallel execution, and seamless cross-account deployment, Chain Builder dramatically expands the system’s capabilities and accessibility for non-technical users.

Thomson Reuters uses Amazon Bedrock Flows as a core feature of Chain Builder. Users can define, customize, and deploy AI-driven workflows using Amazon Bedrock models. Bedrock Flows supports advanced workflows combining multiple prompt nodes, incorporating AWS Lambda functions, and supporting sophisticated RAG pipelines. Operating seamlessly across user AWS accounts, Bedrock Flows facilitates secure, scalable execution of personalized AI solutions, serving as the fundamental engine for the Chain Builder workflows and driving TR’s ability to deliver robust, enterprise-grade automation and innovation.
What’s next?
TR continues to expand Open Arena’s capabilities through the strategic partnership with AWS, focusing on:

Driving further adoption of Open Arena’s DIY capabilities.
Enhancing flexibility for workflow creation in Chain Builder with custom components, such as inline scripts.
Developing new templates to represent common tasks and workflows.
Enhancing collaboration features within Open Arena.
Extending multimodal capabilities and model integration.
Expanding into new use cases across the enterprise.

“From innovating new product ideas to reimagining daily tasks for Thomson Reuters employees, we continue to push the boundaries of what’s possible with Open Arena.”
– Maria Apazoglou, Head of AI Engineering, CoCounsel

Conclusion
In this blog post, we explored how Thomson Reuters’ Open Arena demonstrates the successful democratization of AI across an enterprise by using AWS services, particularly Amazon Bedrock and Bedrock Flows. With 19,000 monthly active users and 70% employee adoption, the system proves that no-code AI solutions can deliver enterprise-scale impact while maintaining security and governance standards.
By combining the robust infrastructure of AWS with innovative architecture design, TR has created a blueprint for AI democratization that empowers professionals across technical skill levels to harness generative AI for their daily work.
As Open Arena continues to evolve, it exemplifies how strategic cloud partnerships can accelerate AI adoption and transform how organizations approach innovation with generative AI.

About the authors
Laura Skylaki, PhD, leads the Enterprise AI Platform at Thomson Reuters, driving the development of GenAI services that accelerate the creation, testing and deployment of AI solutions, enhancing product value. A recognized expert with a doctorate in stem cell bioinformatics, her extensive experience in AI research and practical application spans legal, tax, and biotech domains. Her machine learning work is published in leading academic journals, and she is a frequent speaker on AI and machine learning
Vaibhav Goswami is a Lead Software Engineer on the AI Platform team at Thomson Reuters, where he leads the development of the Generative AI Platform that empowers users to build and deploy generative AI solutions at scale. With expertise in building production-grade AI systems, he focuses on creating tools and infrastructure that democratize access to cutting-edge AI capabilities across the enterprise.
Ramdev Wudali is a Distinguished Engineer, helping architect and build the AI/ML Platform to enable the Enterprise user, data scientists and researchers to develop Generative AI and machine learning solutions by democratizing access to tools and LLMs. In his spare time, he loves to fold paper to create origami tessellations, and wearing irreverent T-shirts
As the director of AI Platform Adoption and Training, Sahar El Khoury guides users to seamlessly onboard and successfully use the platform services, drawing on her experience in AI and data analysis across robotics (PhD), financial markets, and media.
Vu San Ha Huynh is a Solutions Architect at AWS with a PhD in Computer Science. He helps large Enterprise customers drive innovation across different domains with a focus on AI/ML and Generative AI solutions.
Paul Wright is a Senior Technical Account Manager, with over 20 years experience in the IT industry and over 7 years of dedicated cloud focus. Paul has helped some of the largest enterprise customers grow their business and improve their operational excellence. In his spare time Paul is a huge football and NFL fan.
Mike Bezak is a Senior Technical Account Manager in AWS Enterprise Support. He has over 20 years of experience in information technology, primarily disaster recovery and systems administration. Mike’s current focus is helping customers streamline and optimize their AWS Cloud journey. Outside of AWS, Mike enjoys spending time with family & friends.

Introducing structured output for Custom Model Import in Amazon Bedrock

With Amazon Bedrock Custom Model Import, you can deploy and scale fine-tuned or proprietary foundation models in a fully managed, serverless environment. You can bring your own models into Amazon Bedrock, scale them securely without managing infrastructure, and integrate them with other Amazon Bedrock capabilities.
Today, we are excited to announce the addition of structured output to Custom Model Import. Structured output constrains a model’s generation process in real time so that every token it produces conforms to a schema you define. Rather than relying on prompt-engineering tricks or brittle post-processing scripts, you can now generate structured outputs directly at inference time.
For certain production applications, the predictability of model outputs is more important than their creative flexibility. A customer service chatbot might benefit from varied, natural-sounding responses, but an order processing system needs exact, structured data that conforms to predefined schemas. Structured output bridges this gap by maintaining the intelligence of foundation models while verifying their outputs meet strict formatting requirements.
This represents a shift from free-form text generation to outputs that are consistent, machine-readable, and designed for seamless integration with enterprise systems. While free-form text excels for human consumption, production applications require more precision. Businesses can’t afford the ambiguity of natural language variations when their systems depend on structured outputs to reliably interface with APIs, databases, and automated workflows.
In this post, you will learn how to implement structured output for Custom Model Import in Amazon Bedrock. We will cover what structured output is, how to enable it in your API calls, and how to apply it to real-world scenarios that require structured, predictable outputs.
Understanding structured output
Structured output, also known as constrained decoding, is a method that directs LLM outputs to conform to a predefined schema, such as valid JSON. Rather than allowing the model to freely select tokens based on probability distributions, it introduces constraints during generation that limit choices to only those that maintain structural validity. If a particular token would violate the schema by producing invalid JSON, inserting stray characters, or using an unexpected field name, structured output rejects it and requires the model to select another allowed option. This real-time validation helps keep the final output consistent, machine readable, and immediately usable by downstream applications without the need for additional post-processing.
Without structured output, developers often attempt to enforce structure through prompt instructions like “Respond only in JSON.” While this approach sometimes works, it remains unreliable due to the inherently probabilistic nature of LLMs. These models generate text by sampling from probability distributions, introducing natural variability that makes responses feel human but creates significant challenges for automated systems.
Consider a customer support application that classifies tickets: if responses vary between “This seems like a billing issue,” “I’d classify this as: Billing,” and “Category = BILLING,” downstream code cannot reliably interpret the results. What production systems require instead is predictable, structured output. For example:

{
  "category": "billing",
  "priority": "high",
  "sentiment": "negative"
}

With a response like this, your application can automatically route tickets, trigger workflows, or update databases without human intervention. By providing predictable, schema-aligned responses, structured output transforms LLMs from conversational tools into reliable system components that can be integrated with databases, APIs, and business logic. This capability opens new possibilities for automation while maintaining the intelligent reasoning that underpins the value of these models.
Beyond improving reliability and simplifying post-processing, structured output offers additional benefits that strengthen performance, security, and safety in production environments.

Lower token usage and faster responses: By constraining generation to a defined schema, structured output removes unnecessarily verbose, free-form text, reducing token counts. Because token generation is sequential, shorter outputs directly translate to faster responses and lower latency, improving overall performance and cost efficiency.
Enhanced security against prompt injection: Structured output narrows the model’s expression space and helps prevent it from producing arbitrary or unsafe content. Bad actors cannot inject instructions, code or unexpected text outside the defined structure. Each field must match its expected type and format, making sure outputs remain within safe boundaries.
Safety and policy controls: Structured output enables you to design schemas that inherently help prevent harmful, toxic, or policy-violating content. By limiting fields to approved values, enforcing patterns, and restricting free-form text, schemas make sure outputs align with regulatory requirements.

In the next section, we will explore how structured output works with Custom Model Import in Amazon Bedrock and walk through an example of enabling it in your API calls.
Using structured output with Custom Model Import in Amazon Bedrock
Let’s start by assuming you have already imported a Hugging Face model into Amazon Bedrock using the Custom Model Import feature.
Prerequisites
Before proceeding, make sure you have:

An active AWS account with access to Amazon Bedrock
A custom model created in Amazon Bedrock using the Custom Model Import feature
Appropriate AWS Identity and Access Management (IAM) permissions to invoke models through the Amazon Bedrock Runtime

With these prerequisites in place, let’s explore how to implement structured output with your imported model.
To start using structured output with a Custom Model Import in Amazon Bedrock, begin by configuring your environment. In Python, this involves creating a Bedrock Runtime client and initializing a tokenizer from your imported Hugging Face model.
The Bedrock Runtime client provides access to your imported model using the Bedrock InvokeModel API. The tokenizer applies the correct chat template that aligns with the imported model, which defines how user, system, and assistant messages are combined into a single prompt, how the role markers (for example, <|user|>, <|assistant|>) are inserted, and where the model’s response should begin.
By calling tokenizer.apply_chat_template(messages, tokenize=False), you can generate a prompt that matches the exact input format your model expects, which is essential for consistent and reliable inference, especially when structured output is enabled.

import boto3
import json  # used later when building and parsing requests
from transformers import AutoTokenizer
from botocore.config import Config

# HF model identifier imported into Bedrock
hf_model_id = "<<huggingface_model_id>>"  # Example: "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"
model_arn = "arn:aws:bedrock:<<aws-region>>:<<account-id>>:imported-model/your-model-id"
region = "<<aws-region>>"

# Initialize tokenizer aligned with your imported model
tokenizer = AutoTokenizer.from_pretrained(hf_model_id)

# Initialize Bedrock Runtime client
bedrock_runtime = boto3.client(
    service_name="bedrock-runtime",
    region_name=region)

Implementing structured output
When you invoke a custom model on Amazon Bedrock, you can enable structured output by adding a response_format block to the request payload. This block accepts a JSON schema that defines the structure of the model's response. During inference, the model enforces this schema in real time, making sure that each generated token conforms to the defined structure. Below is a walkthrough demonstrating how to implement structured output using a simple address extraction task.
Step 1: Define the data structure
You can define your expected output using a Pydantic model, which serves as a typed contract for the data you want to extract.

from pydantic import BaseModel, Field

class Address(BaseModel):
    street_number: str = Field(description="Street number")
    street_name: str = Field(description="Street name including type (Ave, St, Rd, etc.)")
    city: str = Field(description="City name")
    state: str = Field(description="Two-letter state abbreviation")
    zip_code: str = Field(description="5-digit ZIP code")

Step 2: Generate the JSON schema
Pydantic can automatically convert your data model into a JSON schema:

schema = Address.model_json_schema()
address_schema = {
    "name": "Address",
    "schema": schema
}

This schema defines each field’s type, description, and requirement, creating a blueprint that the model will follow during generation.
Step 3: Prepare your input messages
Format your input using the chat format expected by your model:

messages = [{
    "role": "user",
    "content": "Extract the address: 456 Tech Boulevard, San Francisco, CA 94105"
}]

Step 4: Apply the chat template
Use your model’s tokenizer to generate the formatted prompt:

prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

Step 5: Build the request payload
Create your request body, including the response_format that references your schema:

request_body = {
    'prompt': prompt,
    'temperature': 0.1,
    'max_gen_len': 1000,
    'top_p': 0.9,
    'response_format': {
        "type": "json_schema",
        "json_schema": address_schema
    }
}

Step 6: Invoke the model
Send the request using the InvokeModel API:

response = bedrock_runtime.invoke_model(
    modelId=model_arn,
    body=json.dumps(request_body),
    accept="application/json",
    contentType="application/json"
)

Step 7: Parse the response
Extract the generated text from the response:

result = json.loads(response['body'].read().decode('utf-8'))
raw_output = result['choices'][0]['text']
print(raw_output)

Because the schema defines required fields, the model’s response will contain them:

{
  "street_number": "456",
  "street_name": "Tech Boulevard",
  "city": "San Francisco",
  "state": "CA",
  "zip_code": "94105"
}

The output is clean, valid JSON that can be consumed directly by your application with no extra parsing, filtering, or cleanup required.
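If you want an additional layer of type safety, you can also validate the returned JSON against the same Pydantic model you used to build the schema. A minimal sketch, assuming Pydantic v2 (whose model_validate_json parses and validates a JSON string in one call) and reusing the Address class and raw_output variable from the previous steps:

# Parse and validate the model response against the Address schema
validated = Address.model_validate_json(raw_output)
print(validated.city)          # "San Francisco"
print(validated.model_dump())  # plain dict, ready for APIs or databases

This is optional, since the schema is already enforced at generation time, but it gives downstream code typed attributes instead of raw dictionary lookups.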
Conclusion
Structured output with Custom Model Import in Amazon Bedrock provides an effective way to generate structured, schema-aligned outputs from your models. By shifting validation into model inference itself, structured output reduces the need for complex post-processing workflows and error handling code.
Structured output generates outputs that are predictable and straightforward to integrate into your systems and supports a variety of use cases, for example, building financial applications that require precise data extraction, healthcare systems that need structured clinical documentation, or customer service systems that demand consistent ticket classification.
Start experimenting with structured output with your Custom Model Import today and transform how your AI applications deliver consistent, production-ready results.

About the authors
Manoj Selvakumar is a Generative AI Specialist Solutions Architect at AWS, where he helps organizations design, prototype, and scale AI-powered solutions in the cloud. With expertise in deep learning, scalable cloud-native systems, and multi-agent orchestration, he focuses on turning emerging innovations into production-ready architectures that drive measurable business value. He is passionate about making complex AI concepts practical and enabling customers to innovate responsibly at scale—from early experimentation to enterprise deployment. Before joining AWS, Manoj worked in consulting, delivering data science and AI solutions for enterprise clients, building end-to-end machine learning systems supported by strong MLOps practices for training, deployment, and monitoring in production.
Yanyan Zhang is a Senior Generative AI Data Scientist at Amazon Web Services, where she has been working on cutting-edge AI/ML technologies as a Generative AI Specialist, helping customers use generative AI to achieve their desired outcomes. Yanyan graduated from Texas A&M University with a PhD in Electrical Engineering. Outside of work, she loves traveling, working out, and exploring new things.
Lokeshwaran Ravi is a Senior Deep Learning Compiler Engineer at AWS, specializing in ML optimization, model acceleration, and AI security. He focuses on enhancing efficiency, reducing costs, and building secure ecosystems to democratize AI technologies, making cutting-edge ML accessible and impactful across industries.
Revendra Kumar is a Senior Software Development Engineer at Amazon Web Services. In his current role, he focuses on model hosting and inference MLOps on Amazon Bedrock. Prior to this, he worked as an engineer on hosting Quantum computers on the cloud and developing infrastructure solutions for on-premises cloud environments. Outside of his professional pursuits, Revendra enjoys staying active by playing tennis and hiking.
Muzart Tuman is a software engineer utilizing his experience in fields like deep learning, machine learning optimization, and AI-driven applications to help solve real-world problems in a scalable, efficient, and accessible manner. His goal is to create impactful tools that not only advance technical capabilities but also inspire meaningful change across industries and communities.

Moonshot AI Releases Kimi K2 Thinking: An Impressive Thinking Model th …

How do we design AI systems that can plan, reason, and act over long sequences of decisions without constant human guidance? Moonshot AI has released Kimi K2 Thinking, an open source thinking agent model that exposes the full reasoning stream of the Kimi K2 Mixture of Experts architecture. It targets workloads that need deep reasoning, long horizon tool use, and stable agent behavior across many steps.

https://moonshotai.github.io/Kimi-K2/thinking.html

What is Kimi K2 Thinking?

Kimi K2 Thinking is described as the latest, most capable version of Moonshot’s open source thinking model. It is built as a thinking agent that reasons step by step and dynamically invokes tools during inference. The model is designed to interleave chain of thought with function calls so it can read, think, call a tool, think again, and repeat for hundreds of steps.

The model sets a new state of the art on Humanity’s Last Exam and BrowseComp, while maintaining coherent behavior across about 200 to 300 sequential tool calls without human interference.

At the same time, K2 Thinking is released as an open weights model with a 256K token context window and native INT4 inference, which reduces latency and GPU memory usage while preserving benchmark performance.

K2 Thinking is already live on kimi.com in chat mode and is accessible through the Moonshot platform API, with a dedicated agentic mode planned to expose the full tool using behavior.

Architecture, MoE design, and context length

Kimi K2 Thinking inherits the Kimi K2 Mixture of Experts design. The model uses a MoE architecture with 1T total parameters and 32B activated parameters per token. It has 61 layers including 1 dense layer, 384 experts with 8 experts selected per token, 1 shared expert, 64 attention heads, and an attention hidden dimension of 7168. The MoE hidden dimension is 2048 per expert.

The vocabulary size is 160K tokens and the context length is 256K. The attention mechanism is Multi head Latent Attention, and the activation function is SwiGLU.

Test time scaling and long horizon thinking

Kimi K2 Thinking is explicitly optimized for test time scaling. The model is trained to expand its reasoning length and tool call depth when facing harder tasks, rather than relying on a fixed short chain of thought.

https://moonshotai.github.io/Kimi-K2/thinking.html

On Humanity’s Last Exam in the no tools setting, K2 Thinking scores 23.9. With tools, the score rises to 44.9, and in the heavy setting it reaches 51.0. On AIME25 with Python, it reports 99.1, and on HMMT25 with Python it reports 95.1. On IMO AnswerBench it scores 78.6, and on GPQA it scores 84.5.

The testing protocol caps thinking token budgets at 96K for HLE, AIME25, HMMT25, and GPQA. It uses 128K thinking tokens for IMO AnswerBench, LiveCodeBench, and OJ Bench, and 32K completion tokens for Longform Writing. On HLE, the maximum step limit is 120 with a 48K reasoning budget per step. On agentic search tasks, the limit is 300 steps with a 24K reasoning budget per step.

Benchmarks in agentic search and coding

On agentic search tasks with tools, K2 Thinking reports 60.2 on BrowseComp, 62.3 on BrowseComp ZH, 56.3 on Seal 0, 47.4 on FinSearchComp T3, and 87.0 on Frames.

On general knowledge benchmarks, it reports 84.6 on MMLU Pro, 94.4 on MMLU Redux, 73.8 on Longform Writing, and 58.0 on HealthBench.

For coding, K2 Thinking achieves 71.3 on SWE bench Verified with tools, 61.1 on SWE bench Multilingual with tools, 41.9 on Multi SWE bench with tools, 44.8 on SciCode, 83.1 on LiveCodeBenchV6, 48.7 on OJ Bench in the C++ setting, and 47.1 on Terminal Bench with simulated tools.

The Moonshot team also defines a Heavy Mode that runs eight trajectories in parallel and then aggregates them to produce a final answer. This is used in some reasoning benchmarks to squeeze out extra accuracy from the same base model.

Native INT4 quantization and deployment

K2 Thinking is trained as a native INT4 model. The research team applies Quantization Aware Training during the post training stage and uses INT4 weight only quantization on the MoE components. This supports INT4 inference with roughly a 2x generation speed improvement in low latency mode while maintaining state of the art performance. All reported benchmark scores are obtained under INT4 precision.

The checkpoints are saved in compressed tensors format and can be unpacked to higher precision formats such as FP8 or BF16 using the official compressed tensors tools. Recommended inference engines include vLLM, SGLang, and KTransformers.
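As a rough starting point, the snippet below sketches how you might serve the released checkpoint with vLLM's offline LLM API. The Hugging Face repo identifier and the parallelism settings are assumptions for illustration; a trillion parameter MoE model requires a multi-GPU or multi-node setup sized to your hardware, so consult the official deployment guide for exact configuration.

from vllm import LLM, SamplingParams

# Hypothetical repo id and parallelism settings, shown for illustration only.
llm = LLM(
    model="moonshotai/Kimi-K2-Thinking",  # assumed Hugging Face identifier
    tensor_parallel_size=8,               # adjust to your GPU topology
    max_model_len=262144,                 # 256K context window
    trust_remote_code=True,
)

params = SamplingParams(temperature=0.6, max_tokens=2048)
outputs = llm.generate(["Plan a literature review on MoE inference."], params)
print(outputs[0].outputs[0].text)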

Key Takeaways

Kimi K2 Thinking is an open weights thinking agent that extends the Kimi K2 Mixture of Experts architecture with explicit long horizon reasoning and tool use, not just short chat style responses.

The model uses a trillion parameter MoE design with 32B activated parameters per token, a 256K context window, and is trained as a native INT4 model with Quantization Aware Training, which gives about 2x faster inference while keeping benchmark performance stable.

K2 Thinking is optimized for test time scaling: it can carry out hundreds of sequential tool calls in a single task and is evaluated under large thinking token budgets and strict step caps, which matters when you try to reproduce its reasoning and agentic results.

On public benchmarks, it leads or is competitive on reasoning, agentic search, and coding tasks such as HLE with tools, BrowseComp, and SWE bench Verified with tools, showing that the thinking oriented variant delivers clear gains over the base non thinking K2 model.

Editorial Comments

Kimi K2 Thinking is a strong signal that test time scaling is now a first class design target for open source reasoning models. Moonshot AI is not only exposing a 1T parameter Mixture of Experts system with 32B active parameters and 256K context window, it is doing so with native INT4 quantization, Quantization Aware Training, and tool orchestration that runs for hundreds of steps in production like settings. Overall, Kimi K2 Thinking shows that open weights reasoning agents with long horizon planning and tool use are becoming practical infrastructure, not just research demos.


Build an Autonomous Wet-Lab Protocol Planner and Validator Using Sales …

In this tutorial, we build a Wet-Lab Protocol Planner & Validator that acts as an intelligent agent for experimental design and execution. We design the system in Python and integrate Salesforce's CodeGen-350M-mono model for natural language reasoning. We structure the pipeline into modular components: ProtocolParser for extracting structured data, such as steps, durations, and temperatures, from textual protocols; InventoryManager for validating reagent availability and expiry; SchedulePlanner for generating timelines and parallelization; and SafetyValidator for identifying biosafety or chemical hazards. The LLM is then used to generate optimization suggestions, effectively closing the loop between perception, planning, validation, and refinement.

import re, json, pandas as pd
from datetime import datetime, timedelta
from collections import defaultdict
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

MODEL_NAME = "Salesforce/codegen-350M-mono"
print("Loading CodeGen model (30 seconds)...")
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)
print("✓ Model loaded!")

We begin by importing essential libraries and loading the Salesforce CodeGen-350M-mono model locally for lightweight, API-free inference. We initialize both the tokenizer and model with float16 precision and automatic device mapping to ensure compatibility and speed on Colab GPUs.

class ProtocolParser:
    def read_protocol(self, text):
        steps = []
        lines = text.split('\n')
        for i, line in enumerate(lines, 1):
            step_match = re.search(r'^(\d+)\.\s+(.+)', line.strip())
            if step_match:
                num, name = step_match.groups()
                context = '\n'.join(lines[i:min(i+4, len(lines))])
                duration = self._extract_duration(context)
                temp = self._extract_temp(context)
                safety = self._check_safety(context)
                steps.append({
                    'step': int(num), 'name': name, 'duration_min': duration,
                    'temp': temp, 'safety': safety, 'line': i, 'details': context[:200]
                })
        return steps

    def _extract_duration(self, text):
        text = text.lower()
        if 'overnight' in text: return 720
        match = re.search(r'(\d+)\s*(?:hour|hr|h)(?:s)?(?!\w)', text)
        if match: return int(match.group(1)) * 60
        match = re.search(r'(\d+)\s*(?:min|minute)(?:s)?', text)
        if match: return int(match.group(1))
        match = re.search(r'(\d+)-(\d+)\s*(?:min|minute)', text)
        if match: return (int(match.group(1)) + int(match.group(2))) // 2
        return 30

    def _extract_temp(self, text):
        text = text.lower()
        if '4°c' in text or '4 °c' in text or '4°' in text: return '4C'
        if '37°c' in text or '37 °c' in text: return '37C'
        if '-20°c' in text or '-80°c' in text: return 'FREEZER'
        if 'room temp' in text or 'rt' in text or 'ambient' in text: return 'RT'
        return 'RT'

    def _check_safety(self, text):
        flags = []
        text_lower = text.lower()
        if re.search(r'bsl-[23]|biosafety', text_lower): flags.append('BSL-2/3')
        if re.search(r'caution|corrosive|hazard|toxic', text_lower): flags.append('HAZARD')
        if 'sharp' in text_lower or 'needle' in text_lower: flags.append('SHARPS')
        if 'dark' in text_lower or 'light-sensitive' in text_lower: flags.append('LIGHT-SENSITIVE')
        if 'flammable' in text_lower: flags.append('FLAMMABLE')
        return flags

class InventoryManager:
    def __init__(self, csv_text):
        from io import StringIO
        self.df = pd.read_csv(StringIO(csv_text))
        self.df['expiry'] = pd.to_datetime(self.df['expiry'])

    def check_availability(self, reagent_list):
        issues = []
        for reagent in reagent_list:
            reagent_clean = reagent.lower().replace('_', ' ').replace('-', ' ')
            matches = self.df[self.df['reagent'].str.lower().str.contains(
                '|'.join(reagent_clean.split()[:2]), na=False, regex=True
            )]
            if matches.empty:
                issues.append(f" {reagent}: NOT IN INVENTORY")
            else:
                row = matches.iloc[0]
                if row['expiry'] < datetime.now():
                    issues.append(f" {reagent}: EXPIRED on {row['expiry'].date()} (lot {row['lot']})")
                elif (row['expiry'] - datetime.now()).days < 30:
                    issues.append(f" {reagent}: Expires soon ({row['expiry'].date()}, lot {row['lot']})")
                if row['quantity'] < 10:
                    issues.append(f" {reagent}: LOW STOCK ({row['quantity']} {row['unit']} remaining)")
        return issues

    def extract_reagents(self, protocol_text):
        reagents = set()
        patterns = [
            r'\b([A-Z][a-z]+(?:\s+[A-Z][a-z]+)*)\s+(?:antibody|buffer|solution)',
            r'\b([A-Z]{2,}(?:-[A-Z0-9]+)?)\b',
            r'(?:add|use|prepare|dilute)\s+([a-z\-]+\s*(?:antibody|buffer|substrate|solution))',
        ]
        for pattern in patterns:
            matches = re.findall(pattern, protocol_text, re.IGNORECASE)
            reagents.update(m.strip() for m in matches if len(m) > 2)
        return list(reagents)[:15]

We define the ProtocolParser and InventoryManager classes to extract structured experimental details and verify reagent inventory. We parse each protocol step for duration, temperature, and safety markers, while the inventory manager validates stock levels, expiry dates, and reagent availability through fuzzy matching.

class SchedulePlanner:
    def make_schedule(self, steps, start_time="09:00"):
        schedule = []
        current = datetime.strptime(f"2025-01-01 {start_time}", "%Y-%m-%d %H:%M")
        day = 1
        for step in steps:
            end = current + timedelta(minutes=step['duration_min'])
            if step['duration_min'] > 480:
                day += 1
                current = datetime.strptime(f"2025-01-0{day} 09:00", "%Y-%m-%d %H:%M")
                end = current
            schedule.append({
                'step': step['step'], 'name': step['name'][:40],
                'start': current.strftime("%H:%M"), 'end': end.strftime("%H:%M"),
                'duration': step['duration_min'], 'temp': step['temp'],
                'day': day, 'can_parallelize': step['duration_min'] > 60,
                'safety': ', '.join(step['safety']) if step['safety'] else 'None'
            })
            if step['duration_min'] <= 480:
                current = end
        return schedule

    def optimize_parallelization(self, schedule):
        parallel_groups = []
        idle_time = 0
        for i, step in enumerate(schedule):
            if step['can_parallelize'] and i + 1 < len(schedule):
                next_step = schedule[i+1]
                if step['temp'] == next_step['temp']:
                    saved = min(step['duration'], next_step['duration'])
                    parallel_groups.append(
                        f" Steps {step['step']} & {next_step['step']} can overlap → Save {saved} min"
                    )
                    idle_time += saved
        return parallel_groups, idle_time

class SafetyValidator:
    RULES = {
        'ph_range': (5.0, 11.0),
        'temp_limits': {'4C': (2, 8), '37C': (35, 39), 'RT': (20, 25)},
        'max_concurrent_instruments': 3,
    }

    def validate(self, steps):
        risks = []
        for step in steps:
            ph_match = re.search(r'ph\s*(\d+\.?\d*)', step['details'].lower())
            if ph_match:
                ph = float(ph_match.group(1))
                if not (self.RULES['ph_range'][0] <= ph <= self.RULES['ph_range'][1]):
                    risks.append(f" Step {step['step']}: pH {ph} OUT OF SAFE RANGE")
            if 'BSL-2/3' in step['safety']:
                risks.append(f" Step {step['step']}: BSL-2 cabinet REQUIRED")
            if 'HAZARD' in step['safety']:
                risks.append(f" Step {step['step']}: Full PPE + chemical hood REQUIRED")
            if 'SHARPS' in step['safety']:
                risks.append(f" Step {step['step']}: Sharps container + needle safety")
            if 'LIGHT-SENSITIVE' in step['safety']:
                risks.append(f" Step {step['step']}: Work in dark/amber tubes")
        return risks

We implement the SchedulePlanner and SafetyValidator to design efficient experiment timelines and enforce lab safety standards. We dynamically generate daily schedules, identify parallelizable steps, and validate potential risks, such as unsafe pH levels, hazardous chemicals, or biosafety-level requirements.

def llm_call(prompt, max_tokens=200):
    try:
        inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512).to(model.device)
        outputs = model.generate(
            **inputs, max_new_tokens=max_tokens, do_sample=True,
            temperature=0.7, top_p=0.9, pad_token_id=tokenizer.eos_token_id
        )
        return tokenizer.decode(outputs[0], skip_special_tokens=True)[len(prompt):].strip()
    except Exception:
        return "Batch similar temperature steps together. Pre-warm instruments."

def agent_loop(protocol_text, inventory_csv, start_time="09:00"):
    print("\n AGENT STARTING PROTOCOL ANALYSIS...\n")
    parser = ProtocolParser()
    steps = parser.read_protocol(protocol_text)
    print(f" Parsed {len(steps)} protocol steps")
    inventory = InventoryManager(inventory_csv)
    reagents = inventory.extract_reagents(protocol_text)
    print(f" Identified {len(reagents)} reagents: {', '.join(reagents[:5])}...")
    inv_issues = inventory.check_availability(reagents)
    validator = SafetyValidator()
    safety_risks = validator.validate(steps)
    planner = SchedulePlanner()
    schedule = planner.make_schedule(steps, start_time)
    parallel_opts, time_saved = planner.optimize_parallelization(schedule)
    total_time = sum(s['duration'] for s in schedule)
    optimized_time = total_time - time_saved
    opt_prompt = f"Protocol has {len(steps)} steps, {total_time} min total. Key bottleneck optimization:"
    optimization = llm_call(opt_prompt, max_tokens=80)
    return {
        'steps': steps, 'schedule': schedule, 'inventory_issues': inv_issues,
        'safety_risks': safety_risks, 'parallelization': parallel_opts,
        'time_saved': time_saved, 'total_time': total_time,
        'optimized_time': optimized_time, 'ai_optimization': optimization,
        'reagents': reagents
    }

We construct the agent loop, integrating perception, planning, validation, and revision into a single, coherent flow. We use CodeGen for reasoning-based optimization to refine step sequencing and propose practical improvements for efficiency and parallel execution.

def generate_checklist(results):
    md = "# WET-LAB PROTOCOL CHECKLIST\n\n"
    md += f"**Total Steps:** {len(results['schedule'])}\n"
    md += f"**Estimated Time:** {results['total_time']} min ({results['total_time']//60}h {results['total_time']%60}m)\n"
    md += f"**Optimized Time:** {results['optimized_time']} min (save {results['time_saved']} min)\n\n"
    md += "## TIMELINE\n"
    current_day = 1
    for item in results['schedule']:
        if item['day'] > current_day:
            md += f"\n### Day {item['day']}\n"
            current_day = item['day']
        parallel = " " if item['can_parallelize'] else ""
        md += f"- [ ] **{item['start']}-{item['end']}** | Step {item['step']}: {item['name']} ({item['temp']}){parallel}\n"
    md += "\n## REAGENT PICK-LIST\n"
    for reagent in results['reagents']:
        md += f"- [ ] {reagent}\n"
    md += "\n## SAFETY & INVENTORY ALERTS\n"
    all_issues = results['safety_risks'] + results['inventory_issues']
    if all_issues:
        for risk in all_issues:
            md += f"- {risk}\n"
    else:
        md += "- No critical issues detected\n"
    md += "\n## OPTIMIZATION TIPS\n"
    for tip in results['parallelization']:
        md += f"- {tip}\n"
    md += f"- AI Suggestion: {results['ai_optimization']}\n"
    return md

def generate_gantt_csv(schedule):
    df = pd.DataFrame(schedule)
    return df.to_csv(index=False)

We create output generators that transform results into human-readable Markdown checklists and Gantt-compatible CSVs. We ensure that every execution produces clear summaries of reagents, time savings, and safety or inventory alerts for streamlined lab operations.

SAMPLE_PROTOCOL = """ELISA Protocol for Cytokine Detection

1. Coating (Day 1, 4°C overnight)
- Dilute capture antibody to 2 μg/mL in coating buffer (pH 9.6)
- Add 100 μL per well to 96-well plate
- Incubate at 4°C overnight (12-16 hours)
- BSL-2 cabinet required

2. Blocking (Day 2)
- Wash plate 3× with PBS-T (200 μL/well)
- Add 200 μL blocking buffer (1% BSA in PBS)
- Incubate 1 hour at room temperature

3. Sample Incubation
- Wash 3× with PBS-T
- Add 100 μL diluted samples/standards
- Incubate 2 hours at room temperature

4. Detection Antibody
- Wash 5× with PBS-T
- Add 100 μL biotinylated detection antibody (0.5 μg/mL)
- Incubate 1 hour at room temperature

5. Streptavidin-HRP
- Wash 5× with PBS-T
- Add 100 μL streptavidin-HRP (1:1000 dilution)
- Incubate 30 minutes at room temperature
- Work in dark

6. Development
- Wash 7× with PBS-T
- Add 100 μL TMB substrate
- Incubate 10-15 minutes (monitor color development)
- Add 50 μL stop solution (2M H2SO4) - CAUTION: corrosive
"""

SAMPLE_INVENTORY = """reagent,quantity,unit,expiry,lot
capture antibody,500,μg,2025-12-31,AB123
blocking buffer,500,mL,2025-11-30,BB456
PBS-T,1000,mL,2026-01-15,PT789
detection antibody,8,μg,2025-10-15,DA321
streptavidin HRP,10,mL,2025-12-01,SH654
TMB substrate,100,mL,2025-11-20,TM987
stop solution,250,mL,2026-03-01,SS147
BSA,100,g,2024-09-30,BS741"""

results = agent_loop(SAMPLE_PROTOCOL, SAMPLE_INVENTORY, start_time="09:00")
print("\n" + "="*70)
print(generate_checklist(results))
print("\n" + "="*70)
print("\n GANTT CSV (first 400 chars):\n")
print(generate_gantt_csv(results['schedule'])[:400])
print("\n Time Savings:", f"{results['time_saved']} minutes via parallelization")

We conduct a comprehensive test run using a sample ELISA protocol and a reagent inventory dataset. We visualize the agent’s outputs, optimized schedule, parallelization gains, and AI-suggested improvements, demonstrating how our planner functions as a self-contained, intelligent lab assistant.

In closing, we demonstrated how agentic AI principles can enhance reproducibility and safety in wet-lab workflows. By parsing free-form experimental text into structured, actionable plans, we automated protocol validation, reagent management, and temporal optimization in a single pipeline. The integration of CodeGen enables on-device reasoning about bottlenecks and safety conditions, allowing for self-contained, data-secure operations. We concluded with a fully functional planner that generates Gantt-compatible schedules, Markdown checklists, and AI-driven optimization tips, establishing a robust foundation for autonomous laboratory planning systems.


Google AI Introduces DS STAR: A Multi Agent Data Science System That P …

How do you turn a vague business style question over messy folders of CSV, JSON and text into reliable Python code without a human analyst in the loop? Google researchers introduce DS STAR (Data Science Agent via Iterative Planning and Verification), a multi agent framework that turns open ended data science questions into executable Python scripts over heterogeneous files. Instead of assuming a clean SQL database and a single query, DS STAR treats the problem as Text to Python and operates directly on mixed formats such as CSV, JSON, Markdown and unstructured text.

https://arxiv.org/pdf/2509.21825

From Text To Python Over Heterogeneous Data

Existing data science agents often rely on Text to SQL over relational databases. This constraint limits them to structured tables and simple schema, which does not match many enterprise environments where data sits across documents, spreadsheets and logs.

DS STAR changes the abstraction. It generates Python code that loads and combines whatever files the benchmark provides. The system first summarizes every file, then uses that context to plan, implement and verify a multi step solution. This design allows DS STAR to work on benchmarks such as DABStep, KramaBench and DA Code, which expect multi step analysis over mixed file types and require answers in strict formats.

https://arxiv.org/pdf/2509.21825

Stage 1: Data File Analysis With Aanalyzer

The first stage builds a structured view of the data lake. For each file (Dᵢ), the Aanalyzer agent generates a Python script (sᵢ_desc) that parses the file and prints essential information such as column names, data types, metadata and text summaries. DS STAR executes this script and captures the output as a concise description (dᵢ).

This process works for both structured and unstructured data. CSV files yield column level statistics and samples, while JSON or text files produce structural summaries and key snippets. The collection {dᵢ} becomes shared context for all later agents.
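As a rough illustration of the idea, not the paper's exact prompts or code, a description script for a CSV file might simply load it and print schema level facts that later agents can read as plain text context:

import pandas as pd

def describe_csv(path: str, sample_rows: int = 3) -> str:
    """Illustrative stand-in for an analyzer-generated description script."""
    df = pd.read_csv(path)
    lines = [
        f"file: {path}",
        f"rows: {len(df)}, columns: {len(df.columns)}",
        "dtypes: " + ", ".join(f"{c}:{t}" for c, t in df.dtypes.astype(str).items()),
        "sample:\n" + df.head(sample_rows).to_string(index=False),
    ]
    return "\n".join(lines)

In DS STAR, the analogous script is itself generated by the Aanalyzer agent per file, so the logic adapts to whatever format the file happens to have.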

https://arxiv.org/pdf/2509.21825

Stage 2: Iterative Planning, Coding And Verification

After file analysis, DS STAR runs an iterative loop that mirrors how a human uses a notebook.

Aplanner creates an initial executable step (p₀) using the query and the file descriptions, for example loading a relevant table.

Acoder turns the current plan (p) into Python code (s). DS STAR executes this code to obtain an observation (r).

Averifier is an LLM based judge. It receives the cumulative plan, the query, the current code and its execution result and returns a binary decision, sufficient or insufficient.

If the plan is insufficient, Arouter decides how to refine it. It either outputs the token Add Step, which appends a new step, or an index of an erroneous step to truncate and regenerate from.

Aplanner is conditioned on the latest execution result (rₖ), so each new step explicitly responds to what went wrong in the previous attempt. The loop of routing, planning, coding, executing and verifying continues until Averifier marks the plan sufficient or the system hits a maximum of 20 refinement rounds.
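A simplified sketch of this control loop, with hypothetical helper objects standing in for the individual agents, looks roughly like the following:

MAX_ROUNDS = 20  # refinement budget reported in the paper

def ds_star_loop(query, file_descriptions, planner, coder, executor, verifier, router):
    """Illustrative reconstruction of the iterative plan-code-verify loop; helper APIs are hypothetical."""
    plan = [planner.initial_step(query, file_descriptions)]
    code = result = None
    for _ in range(MAX_ROUNDS):
        code = coder.implement(plan, query, file_descriptions)
        result = executor.run(code)
        if verifier.is_sufficient(plan, query, code, result):
            break
        decision = router.route(plan, query, result)  # "Add Step" or index of an erroneous step
        if decision != "Add Step":
            plan = plan[:decision]                    # truncate at the faulty step
        plan.append(planner.next_step(plan, query, result))
    return plan, code, result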

https://arxiv.org/pdf/2509.21825

To satisfy strict benchmark formats, a separate Afinalyzer agent converts the final plan into solution code that enforces rules such as rounding and CSV output.

Robustness Modules, Adebugger And Retriever

Realistic pipelines fail on schema drift and missing columns. DS STAR adds Adebugger to repair broken scripts. When code fails, Adebugger receives the script, the traceback and the analyzer descriptions {dᵢ}. It generates a corrected script by conditioning on all three signals, which is important because many data centric bugs require knowledge of column headers, sheet names or schema, not only the stack trace.

KramaBench introduces another challenge, thousands of candidate files per domain. DS STAR handles this with a Retriever. The system embeds the user query and each description (dᵢ) using a pre trained embedding model and selects the top 100 most similar files for the agent context, or all files if there are fewer than 100. In the implementation, the research team used Gemini Embedding 001 for similarity search.
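Conceptually, the retrieval step reduces to embedding the query and each file description and keeping the most similar ones. A minimal sketch that assumes the vectors have already been produced by an embedding model (the paper uses Gemini Embedding 001):

import numpy as np

def retrieve_top_files(query_vec, description_vecs, k=100):
    """Return indices of up to k file descriptions most similar to the query by cosine similarity."""
    D = np.asarray(description_vecs, dtype=float)
    q = np.asarray(query_vec, dtype=float)
    sims = D @ q / (np.linalg.norm(D, axis=1) * np.linalg.norm(q) + 1e-9)
    if len(sims) <= k:
        return list(range(len(sims)))
    return list(np.argsort(-sims)[:k])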

https://arxiv.org/pdf/2509.21825

Benchmark Results On DABStep, KramaBench And DA Code

All main experiments run DS STAR with Gemini 2.5 Pro as the base LLM and allow up to 20 refinement rounds per task.

On DABStep, model only Gemini 2.5 Pro achieves 12.70 percent hard level accuracy. DS STAR with the same model reaches 45.24 percent on hard tasks and 87.50 percent on easy tasks. This is an absolute gain of more than 32 percentage points on the hard split and it outperforms other agents such as ReAct, AutoGen, Data Interpreter, DA Agent and several commercial systems recorded on the public leaderboard.

https://arxiv.org/pdf/2509.21825

The Google research team reports that, compared to the best alternative system on each benchmark, DS STAR improves overall accuracy from 41.0 percent to 45.2 percent on DABStep, from 39.8 percent to 44.7 percent on KramaBench and from 37.0 percent to 38.5 percent on DA Code.

https://arxiv.org/pdf/2509.21825

For KramaBench, which requires retrieving relevant files from large domain specific data lakes, DS STAR with retrieval and Gemini 2.5 Pro achieves a total normalized score of 44.69. The strongest baseline, DA Agent with the same model, reaches 39.79.

https://arxiv.org/pdf/2509.21825

On DA Code, DS STAR again beats DA Agent. On hard tasks, DS STAR reaches 37.1 percent accuracy versus 32.0 percent for DA Agent when both use Gemini 2.5 Pro.

Key Takeaways

DS STAR reframes data science agents as Text to Python over heterogeneous files such as CSV, JSON, Markdown and text, instead of only Text to SQL over clean relational tables.

The system uses a multi agent loop with Aanalyzer, Aplanner, Acoder, Averifier, Arouter and Afinalyzer, which iteratively plans, executes and verifies Python code until the verifier marks the solution as sufficient.

Adebugger and a Retriever module improve robustness, by repairing failing scripts using rich schema descriptions and by selecting the top 100 relevant files from large domain specific data lakes.

With Gemini 2.5 Pro and 20 refinement rounds, DS STAR achieves large gains over prior agents on DABStep, KramaBench and DA Code, for example increasing DABStep hard accuracy from 12.70 percent to 45.24 percent.

Ablations show that analyzer descriptions and routing are critical, and experiments with GPT 5 confirm that the DS STAR architecture is model agnostic, while iterative refinement is essential for solving hard multi step analytics tasks.

Editorial Comments

DS STAR shows that practical data science automation needs explicit structure around large language models, not only better prompts. The combination of Aanalyzer, Averifier, Arouter and Adebugger turns free form data lakes into a controlled Text to Python loop that is measurable on DABStep, KramaBench and DA Code, and portable across Gemini 2.5 Pro and GPT 5. This work moves data agents from table demos toward benchmarked, end to end analytics systems.


Transform your MCP architecture: Unite MCP servers through AgentCore G …

As AI agents are adopted at scale, developer teams can create dozens to hundreds of specialized Model Context Protocol (MCP) servers, each tailored to a specific agent use case, domain, organizational function, or team. Organizations also need to integrate their own existing MCP servers or open source MCP servers into their AI workflows. They need a way to efficiently combine these MCP servers, whether custom-built, publicly available, or open source, into a unified interface that AI agents can readily consume and teams can seamlessly share across the organization.
Earlier this year, we introduced Amazon Bedrock AgentCore Gateway, a fully managed service that serves as a centralized MCP tool server, providing a unified interface where agents can discover, access, and invoke tools. Today, we’re extending support for existing MCP servers as a new target type in AgentCore Gateway. With this capability, you can group multiple task-specific MCP servers aligned to agent goals behind a single, manageable MCP gateway interface. This reduces the operational complexity of maintaining separate gateways, while providing the same centralized tool and authentication management that existed for REST APIs and AWS Lambda functions.
Without a centralized approach, customers face significant challenges: discovering and sharing tools across organizations becomes fragmented, managing authentication across multiple MCP servers grows increasingly complex, and maintaining separate gateway instances for each server quickly becomes unmanageable. Amazon Bedrock AgentCore Gateway helps solve these challenges by treating existing MCP servers as native targets, giving customers a single point of control for routing, authentication, and tool management, and making it as simple to integrate MCP servers as it is to add other targets to the gateway.
Breaking down MCP silos: Why enterprise teams need a unified Gateway
Let’s explore this through a real-world example of an e-commerce ordering system, where different teams maintain specialized MCP servers for their specific domains. Consider an enterprise e-commerce system where different teams have developed specialized MCP servers:

The Shopping Cart team maintains an MCP server with cart management tools
The Product Catalog team runs their MCP server for product browsing and search
The Promotions team operates an MCP server handling promotional logic

Previously, an ordering agent would need to interact with each of these MCP servers separately, managing multiple connections and authentication contexts. With the new MCP server target support in AgentCore Gateway, these specialized servers can now be unified under a single gateway while maintaining their team-specific ownership and access controls. The power of this approach lies in its organizational flexibility. Teams can group their MCP servers based on multiple logical criteria:

Business unit alignment: Organize the MCP servers by business unit
Product feature boundaries: Each product team owns their MCP server with domain-specific tools allowing them to maintain clear ownership while providing a unified interface for their agents
Security and access control: Different MCP servers require different authentication mechanisms. The gateway handles the authentication complexity, making it simple for authorized agents to access the tools they need

The following diagram illustrates how an ordering agent interacts with multiple MCP servers through AgentCore Gateway. The agent connects to the gateway and discovers the available tools. Each team maintains control over their domain-specific tools while contributing to a cohesive agent experience. The gateway handles tool naming collisions, authentication, and provides unified semantic search across the tools.

The AgentCore Gateway serves as an integration hub in modern agentic architectures, offering a unified interface for connecting diverse agent implementations with a wide array of tool providers. The architecture, as illustrated in the diagram, demonstrates how the gateway bridges the gap between agent and tool implementation approaches, now enhanced with the ability to directly integrate MCP server targets.
AgentCore Gateway integration architecture
In AgentCore Gateway, a target defines the APIs, Lambda functions, or other MCP servers that a gateway will provide as tools to an agent. Targets can be Lambda functions, OpenAPI specifications, Smithy models, MCP servers, or other tool definitions.
The target integration side of the architecture showcases the gateway’s versatility in tool integration. With the new MCP server target support, the gateway can directly incorporate tools from public MCP servers, treating them as first-class citizens alongside other target types. This capability extends to federation scenarios where one AgentCore Gateway instance can serve as a target for another, for hierarchical tool organization across organizational boundaries. The gateway can seamlessly integrate with AgentCore Runtime instances that expose agents as tools, private MCP servers maintained by customers, traditional AWS Lambda functions, and both Smithy and AWS service APIs.
Beyond target diversity, the gateway’s authentication architecture provides additional operational benefits. The gateway decouples its inbound authentication from target systems, letting agents access tools that use multiple identity providers through a single interface. This centralized approach simplifies development, deployment, and maintenance of AI agents. Now, the same approach can be used for MCP server targets, where the gateway manages the complexity of interfacing with the server using the configured identity provider for the target.
With this authentication foundation you get sophisticated tool management capabilities through a unified architecture. When an agent requests tool discovery, the gateway provides a consistent view across the integrated targets, with tools from MCP servers appearing alongside Lambda functions and traditional APIs. The semantic search capability operates uniformly across the tool types, so agents can discover relevant tools regardless of their implementation. During tool invocation, the gateway handles the necessary protocol translations, authentication flows, and data transformations, presenting a clean, consistent interface to agents while managing the complexity of different target systems behind the scenes.
The addition of MCP server target support represents a significant evolution in the gateway’s capabilities. Organizations can now directly integrate MCP-native tools while maintaining their investments in traditional APIs and Lambda functions. This flexibility allows for gradual migration strategies where teams can adopt MCP-native implementations at their own pace while facilitating continuous operation of existing integrations. The gateway’s synchronization mechanisms make sure that tool definitions remain current across the different target types, while its authentication and authorization systems provide consistent security controls regardless of the underlying tool implementation.
The gateway combines MCP servers, traditional APIs, and serverless functions into a coherent tool environment. This capability, along with enterprise-grade security and performance, makes it a beneficial infrastructure for agentic computing.

Solution Walkthrough
In this post, we'll guide you through the steps to set up an MCP server target in AgentCore Gateway, which is as simple as adding a new MCP server type target to a new or existing gateway. Adding an MCP server to an AgentCore Gateway lets you centralize tool management and authentication and apply operational best practices for managing MCP servers at scale.

Get started with adding MCP Server into AgentCore Gateway
To get started, you will create an AgentCore Gateway and add your MCP Server as a target.
Prerequisites
Verify you have the following prerequisites:

AWS account with Amazon Bedrock AgentCore access. For more information review Permissions for AgentCore Runtime documentation.
Python 3.12 or later
Basic understanding of OAuth 2.0

You can create gateways and add targets through multiple interfaces:

AWS SDK for Python (Boto3)
AWS Management Console
AWS Command Line Interface (AWS CLI)
AgentCore starter toolkit for fast and straightforward setup

The following practical examples and code snippets demonstrate how to set up and use Amazon Bedrock AgentCore Gateway. For an interactive walkthrough, you can use these Jupyter Notebook samples on GitHub.
Create a gateway
To create a gateway, you can use the AgentCore starter toolkit to create a default authorization configuration with Amazon Cognito for JWT-based inbound authorization. You can also use another OAuth 2.0-compliant authentication provider instead of Cognito.

import time
import boto3

gateway_client = boto3.client("bedrock-agentcore-control")

# Create an authorization configuration that specifies which client is authorized to access this Gateway
auth_config = {
    "customJWTAuthorizer": {
        "allowedClients": ['<cognito_client_id>'],  # Client MUST match with the ClientId configured in Cognito.
        "discoveryUrl": '<cognito_oauth_discovery_url>',
    }
}

# Call the create_gateway API
# This operation is asynchronous so may take time for Gateway creation
# This Gateway will leverage a CUSTOM_JWT authorizer, the Cognito User Pool we reference in auth_config
def deploy_gateway(poll_interval=5):
    create_response = gateway_client.create_gateway(
        name="DemoGateway",
        roleArn="<IAM Role>",  # The IAM Role must have permissions to create/list/get/delete Gateway
        protocolType="MCP",
        authorizerType="CUSTOM_JWT",
        authorizerConfiguration=auth_config,
        description="AgentCore Gateway with MCP Server Target",
    )
    gatewayID = create_response["gatewayId"]
    gatewayURL = create_response["gatewayUrl"]

    # Wait for deployment
    while True:
        status_response = gateway_client.get_gateway(gatewayIdentifier=gatewayID)
        status = status_response["status"]
        if status == "READY":
            print("✅ AgentCore Gateway is READY!")
            break
        elif status in ["FAILED"]:
            print(f"❌ Deployment failed: {status}")
            return None
        print(f"Status: {status} - waiting...")
        time.sleep(poll_interval)

    return gatewayID, gatewayURL  # reuse these identifiers in the later steps

if __name__ == "__main__":
    gatewayID, gatewayURL = deploy_gateway()

# Values with < > need to be replaced with real values

 Create a sample MCP Server
As an example, let’s create a sample MCP server with three simple tools that return static responses. The server uses FastMCP with stateless_http=True which is required for AgentCore Runtime compatibility.

from mcp.server.fastmcp import FastMCP

mcp = FastMCP(host="0.0.0.0", stateless_http=True)

@mcp.tool()
def getOrder() -> int:
    """Get an order"""
    return 123

@mcp.tool()
def updateOrder(orderId: int) -> int:
    """Update existing order"""
    return 456

@mcp.tool()
def cancelOrder(orderId: int) -> int:
    """Cancel existing order"""
    return 789

if __name__ == "__main__":
    mcp.run(transport="streamable-http")

Configure AgentCore Runtime deployment
Next, we will use the starter toolkit to configure the AgentCore Runtime deployment. The toolkit can create the Amazon ECR repository on launch and generate a Dockerfile for deployment on AgentCore Runtime. You can use your own existing MCP server; we're using the following only as an example. In a real-world environment, the inbound authorization for your MCP server will likely differ from the gateway configuration. Refer to this GitHub code example to create an Amazon Cognito user pool for Runtime authorization.

import os
from bedrock_agentcore_starter_toolkit import Runtime
from boto3.session import Session

boto_session = Session()
region = boto_session.region_name
print(f"Using AWS region: {region}")

required_files = ['mcp_server.py', 'requirements.txt']
for file in required_files:
    if not os.path.exists(file):
        raise FileNotFoundError(f"Required file {file} not found")
print("All required files found ✓")

agentcore_runtime = Runtime()

auth_config = {
    "customJWTAuthorizer": {
        "allowedClients": [
            '<runtime_cognito_client_id>'  # Client MUST match with the ClientId configured in Cognito, and can be separate from the Gateway Cognito provider.
        ],
        "discoveryUrl": '<cognito_oauth_discovery_url>',
    }
}

print("Configuring AgentCore Runtime...")
response = agentcore_runtime.configure(
    entrypoint="mcp_server.py",
    auto_create_execution_role=True,
    auto_create_ecr=True,
    requirements_file="requirements.txt",
    region=region,
    authorizer_configuration=auth_config,
    protocol="MCP",
    agent_name="mcp_server_agentcore"
)
print("Configuration completed ✓")

# Values with < > need to be replaced with real values

Launch MCP server to AgentCore Runtime
Now that we have the Dockerfile, let’s launch the MCP server to AgentCore Runtime:

print("Launching MCP server to AgentCore Runtime...")
print("This may take several minutes...")
launch_result = agentcore_runtime.launch()
agent_arn = launch_result.agent_arn
agent_id = launch_result.agent_id
print("Launch completed ✓")

encoded_arn = agent_arn.replace(':', '%3A').replace('/', '%2F')
mcp_url = f"https://bedrock-agentcore.{region}.amazonaws.com/runtimes/{encoded_arn}/invocations?qualifier=DEFAULT"

print(f"Agent ARN: {launch_result.agent_arn}")
print(f"Agent ID: {launch_result.agent_id}")

Create MCP server as target for AgentCore Gateway
Create an AgentCore Identity Resource Credential Provider for the AgentCore Gateway to use as outbound auth to the MCP server agent in AgentCore Runtime:

identity_client = boto3.client('bedrock-agentcore-control', region_name=region)

cognito_provider = identity_client.create_oauth2_credential_provider(
    name="gateway-mcp-server-identity",
    credentialProviderVendor="CustomOauth2",
    oauth2ProviderConfigInput={
        'customOauth2ProviderConfig': {
            'oauthDiscovery': {
                'discoveryUrl': '<cognito_oauth_discovery_url>',
            },
            'clientId': '<runtime_cognito_client_id>',  # Client MUST match with the ClientId configured in Cognito for the Runtime authorizer
            'clientSecret': '<cognito_client_secret>'
        }
    }
)
cognito_provider_arn = cognito_provider['credentialProviderArn']
print(cognito_provider_arn)

# Values with < > need to be replaced with real values

Create a gateway target pointing to the MCP server:

gateway_client = boto3.client("bedrock-agentcore-control", region_name=region)
create_gateway_target_response = gateway_client.create_gateway_target(
    name="mcp-server-target",
    gatewayIdentifier=gatewayID,
    targetConfiguration={"mcp": {"mcpServer": {"endpoint": mcp_url}}},
    credentialProviderConfigurations=[
        {
            "credentialProviderType": "OAUTH",
            "credentialProvider": {
                "oauthCredentialProvider": {
                    "providerArn": cognito_provider_arn,
                    "scopes": ["<cognito_oauth_scopes>"],
                }
            },
        },
    ],
)  # Asynchronously create gateway target
gatewayTargetID = create_gateway_target_response["targetId"]

# Values with < > need to be replaced with real values

After creating a gateway target, implement a polling mechanism to check for the gateway target status using the get_gateway_target API call:

import time

def poll_for_status(interval=5):
    # Poll for READY status
    while True:
        gateway_target_response = gateway_client.get_gateway_target(gatewayIdentifier=gatewayID, targetId=gatewayTargetID)
        status = gateway_target_response["status"]
        if status == 'READY':
            break
        elif status in ['FAILED', 'UPDATE_UNSUCCESSFUL', 'SYNCHRONIZE_UNSUCCESSFUL']:
            raise Exception(f"Gateway target failed with status: {status}")
        time.sleep(interval)

poll_for_status()

Test Gateway with Strands Agents framework
Let's test the Gateway with the Strands Agents integration to list the tools from the MCP server. You can also use other MCP-compatible agents built with different agentic frameworks.

from strands import Agent
from mcp.client.streamable_http import streamablehttp_client
from strands.tools.mcp.mcp_client import MCPClient

def create_streamable_http_transport():
    return streamablehttp_client(gatewayURL, headers={"Authorization": f"Bearer {token}"})

client = MCPClient(create_streamable_http_transport)

with client:
    # Call the listTools
    tools = client.list_tools_sync()
    # Create an Agent with the model and tools
    agent = Agent(model=yourmodel, tools=tools)  # you can replace with any model you like
    # Invoke the agent with the sample prompt. This only invokes MCP listTools and retrieves the list of tools the LLM has access to. It does not actually call any tool.
    agent("Hi, can you list all tools available to you")
    # Invoke the agent with a sample prompt, invoke the tool and display the response
    agent("Get the Order id")

Refreshing tool definitions of your MCP servers in AgentCore Gateway
The SynchronizeGatewayTargets API is a new asynchronous operation that enables on-demand synchronization of tools from MCP server targets. MCP servers host tools which agents can discover and invoke. With time, these tools might need to be updated, or new tools may be introduced in an existing MCP server target. You can connect with external MCP servers through the SynchronizeGatewayTargets API that performs protocol handshakes and indexes available tools. This API provides customers with explicit control over when to refresh their tool definitions, particularly useful after making changes to their MCP server’s tool configurations.
When a target is configured with OAuth authentication, the API first interacts with the AgentCore Identity service to retrieve the necessary credentials from the specified credential provider. These credentials are validated for freshness and availability before communication with the MCP server begins. If the credential retrieval fails or returns expired tokens, the synchronization operation fails immediately with appropriate error details, transitioning the target to a FAILED state. For targets configured without authentication, the API proceeds directly to tool synchronization.
The tool processing workflow begins with an initialize call to the MCP server to establish a session. Following successful initialization, the API makes paginated calls to the MCP server’s tools/list capability, processing tools in batches of 100 to optimize performance and resource utilization. Each batch of tools undergoes normalization where the API adds target-specific prefixes to help prevent naming collisions with tools from other targets. During processing, tool definitions are normalized to facilitate consistency across different target types, while preserving the essential metadata from the original MCP server definitions.
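As a rough illustration of that prefixing and normalization step (not the gateway’s internal implementation; the separator and the preserved fields are assumptions):

def normalize_tools(target_name, tools):
    # Prefix each tool name with its target name to help avoid collisions across targets.
    normalized = []
    for tool in tools:
        normalized.append({
            "name": f"{target_name}___{tool['name']}",   # separator format is an assumption
            "description": tool.get("description", ""),
            "inputSchema": tool.get("inputSchema", {}),
        })
    return normalized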

The synchronization flow begins when:

An Ops Admin initiates the SynchronizeGatewayTargets API, triggering AgentCore Gateway to refresh the configured MCP target.
The gateway obtains an OAuth token from AgentCore Identity for secure access to the MCP target.
The gateway then initializes a secure session with the MCP server to retrieve version capabilities.
Finally, the gateway makes paginated calls to the MCP server tools/list endpoint to retrieve the tool definitions, making sure the gateway maintains a current and accurate list of tools.

The SynchronizeGatewayTargets API addresses a critical challenge in managing MCP targets within AgentCore Gateway: maintaining an accurate representation of available tools while optimizing system performance and resource utilization. Here’s why this explicit synchronization approach is valuable:
Schema consistency management: Without explicit synchronization, AgentCore Gateway would need to either make real-time calls to MCP servers during ListTools operations (impacting latency and reliability) or risk serving stale tool definitions. The SynchronizeGatewayTargets API provides a controlled mechanism where customers can refresh their tool schemas at strategic times, such as after deploying new tools or updating existing ones in their MCP server. This approach makes sure that tool definitions in the gateway accurately reflect the target MCP server’s capabilities without compromising performance.

Performance impact trade-offs: The API implements optimistic locking during synchronization to help prevent concurrent modifications that could lead to inconsistent states. While this means multiple synchronization requests might need to retry if there’s contention, this trade-off is acceptable because:

Tool schema changes are typically infrequent operational events rather than regular runtime occurrences
The performance cost of synchronization is incurred only when explicitly requested, not during regular tool invocations
The cached tool definitions facilitate consistent high performance for ListTools operations between synchronizations

Invoke the synchronize gateway API
The synchronize operation is exposed through the same bedrock-agentcore-control client that was used earlier to create the gateway target.
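The following is a minimal sketch of invoking it with boto3. The method name follows the usual operation-to-snake_case convention, and the parameter names are assumptions to verify against the current API reference:

synchronize_response = gateway_client.synchronize_gateway_targets(
    gatewayIdentifier=gatewayID,      # gateway created earlier
    targetIdList=[gatewayTargetID],   # target(s) to refresh; parameter name is an assumption
)
print(synchronize_response)

# The operation is asynchronous, so you can reuse the poll_for_status() helper defined earlier
# to wait until the target returns to the READY state.
poll_for_status()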

Implicit synchronization of tools schema
During CreateGatewayTarget and UpdateGatewayTarget operations, AgentCore Gateway performs an implicit synchronization that differs from the explicit SynchronizeGatewayTargets API. This implicit synchronization makes sure that MCP targets are created or updated with valid, current tool definitions, aligning with the assurance from AgentCore Gateway that targets in READY state are immediately usable. While this might make create/update operations take longer than with other target types, it helps prevent the complexity and potential issues of having targets without validated tool definitions.

The implicit synchronization flow begins when:

An Ops Admin creates or updates the MCP target using CreateGatewayTarget or UpdateGatewayTarget operations.
AgentCore Gateway configures the new or updated MCP target.
The gateway asynchronously triggers the synchronization process to update the tool definitions.
The gateway obtains an OAuth token from AgentCore Identity for secure access.
The gateway then initializes a secure session with the MCP server to retrieve version capabilities.
Finally, the gateway makes paginated calls to the MCP server’s tools/list endpoint to retrieve the tool definitions, making sure the gateway maintains a current and accurate list of tools.

ListTools behavior for MCP targets
The ListTools operation in AgentCore Gateway provides access to tool definitions previously synchronized from MCP targets, following a cache-first approach that prioritizes performance and reliability. Unlike traditional OpenAPI or Lambda targets where tool definitions are statically defined, MCP target tools are discovered and cached through synchronization operations. When a client calls ListTools, the gateway retrieves tool definitions from its persistent storage rather than making real-time calls to the MCP server. These definitions were previously populated either through implicit synchronization during target creation/update or through explicit SynchronizeGatewayTargets API calls. The operation returns a paginated list of normalized tool definitions.

InvokeTool (tools/call) behavior for MCP targets
The InvokeTool operation for MCP targets handles the actual execution of tools discovered through ListTools, managing real-time communication with the target MCP server. Unlike the cache-based ListTools operation, tools/call requires active communication with the MCP server, introducing specific authentication, session management, and error handling requirements. When a tools/call request arrives, AgentCore Gateway first validates the tool exists in its synchronized definitions. For MCP targets, AgentCore Gateway performs an initial initialize call to establish a session with the MCP server. If the target is configured with OAuth credentials, AgentCore Gateway retrieves fresh credentials from AgentCore Identity before making the initialize call. This makes sure that even if ListTools returned cached tools with expired credentials, the actual invocation uses valid authentication.

The inbound authorization flow begins when:

The MCP client initializes a request with MCP protocol version to AgentCore Gateway.
The client then sends the tools/call request to the gateway.
The gateway obtains an OAuth token from AgentCore Identity for secure access.
The gateway initializes a secure session with the MCP server to invoke and handle the actual execution of the tool.
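To make this concrete, the following is a hedged example of invoking one of the synchronized tools directly over the gateway’s JSON-RPC endpoint, reusing the gatewayURL and bearer token from the earlier examples; the prefixed tool name and its arguments are hypothetical placeholders:

import requests

payload = {
    "jsonrpc": "2.0",
    "id": "invoke-tool-request",
    "method": "tools/call",
    "params": {
        "name": "mcp-server-target___get_order",  # hypothetical prefixed tool name
        "arguments": {"order_id": "12345"},       # hypothetical arguments
    },
}
response = requests.post(
    gatewayURL,
    headers={"Content-Type": "application/json", "Authorization": f"Bearer {token}"},
    json=payload,
    timeout=30,
)
response.raise_for_status()
print(response.json())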

Search tool behavior for MCP targets
The search capability in AgentCore Gateway enables semantic discovery of tools across the different target types, including MCP targets. For MCP targets, the search functionality operates on normalized tool definitions that were captured and indexed during synchronization operations, providing efficient semantic search without real-time MCP server communication.
When tool definitions are synchronized from an MCP target, AgentCore Gateway automatically generates embeddings for each tool’s name, description, and parameter descriptions. These embeddings are stored alongside the normalized tool definitions, enabling semantic search that understands the intent and context of search queries. Unlike traditional keyword matching, this allows agents to discover relevant tools even when exact terminology doesn’t match.

Search for MCP server tools through the gateway
Use the following example to search for tools through the gateway.

import requests
import json

def search_tools(gateway_url, access_token, query):
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {access_token}"
    }

    payload = {
        "jsonrpc": "2.0",
        "id": "search-tools-request",
        "method": "tools/call",
        "params": {
            "name": "x_amz_bedrock_agentcore_search",
            "arguments": {
                "query": query
            }
        }
    }

    response = requests.post(gateway_url, headers=headers, json=payload, timeout=5)
    response.raise_for_status()
    return response.json()

# Example usage
token_response = utils.get_token(user_pool_id, client_id, client_secret, scopeString, REGION)
access_token = token_response['access_token']
results = search_tools(gatewayURL, access_token, "math operations")
print(json.dumps(results, indent=2))

Conclusion
Today’s announcement of MCP server support as a target type in Amazon Bedrock AgentCore Gateway is an advancement in enterprise AI agent development. This new capability addresses critical challenges in scaling MCP server implementations while maintaining security and operational efficiency. By integrating existing MCP servers alongside REST APIs and Lambda functions, AgentCore Gateway provides a more unified, secure, and manageable solution for tool integration at scale. Organizations can now manage their tools through a single, centralized interface while benefiting from unified authentication, simplified tool discovery and reduced maintenance overhead.
For more detailed information and advanced configurations, refer to the code samples on GitHub, the Amazon Bedrock AgentCore Gateway Developer Guide and Amazon AgentCore Gateway pricing.

About the authors
Frank Dallezotte is a Senior Solutions Architect at AWS and is passionate about working with independent software vendors to design and build scalable applications on AWS. He has experience creating software, implementing build pipelines, and deploying these solutions in the cloud.
Ganesh Thiyagarajan is a Senior Solutions Architect at Amazon Web Services (AWS) with over 20 years of experience in software architecture, IT consulting, and solution delivery. He helps ISVs transform and modernize their applications on AWS. He is also part of the AI/ML Technical field community, helping customers build and scale Gen AI solutions.
Dhawal Patel is a Principal Generative AI Tech lead at Amazon Web Services (AWS). He has worked with organizations ranging from large enterprises to mid-sized startups on problems related to Agentic AI, Deep learning, distributed computing.

Generalist AI Introduces GEN-θ: A New Class of Embodied Foundation Mo …

How do you build a single model that can learn physical skills from chaotic real world robot data without relying on simulation? Generalist AI has unveiled GEN-θ, a family of embodied foundation models trained directly on high fidelity raw physical interaction data instead of internet video or simulation. The system is built to establish scaling laws for robotics in the same way that large language models did for text, but now grounded in continuous sensorimotor streams from real robots operating in homes, warehouses and workplaces.

Harmonic Reasoning, thinking and acting in real time

GEN-θ is introduced as an embodied foundation model architecture that builds on the strengths of vision and language models, and extends them with native support for human level reflexes and physical commonsense. The core feature is Harmonic Reasoning, where the model is trained to think and act at the same time over asynchronous, continuous time streams of sensing and acting tokens.

This design targets a robotics specific constraint. Language models can simply spend more time thinking before replying, but robots must act while physics continues to evolve. Harmonic Reasoning creates a harmonic interplay between sensing and acting streams so that GEN-θ can scale to very large model sizes without depending on System 1-System 2 architectures or heavy inference-time guidance controllers.

GEN-θ is explicitly cross embodiment. The same architecture runs on different robots and has been tested on 6DoF, 7DoF and 16+DoF semi humanoid systems, which lets a single pre-training run serve heterogeneous fleets.

Surpassing the intelligence threshold in robotics

The Generalist AI team reports a phase transition in capability as GEN-θ scales in a high data regime. Their scaling experiments also show that models must be large enough to absorb vast amounts of physical interaction data.

The behaviors by model size are as follows:

1B models struggle to absorb complex and diverse sensorimotor data during pretraining and their weights stop absorbing new information, which the research team describe as ossification.

6B models start to benefit from pretraining and show strong multi task capabilities.

7B+ models internalize large scale robotic pretraining so that a few thousand post training steps on downstream tasks are sufficient for transfer.

https://generalistai.com/blog/nov-04-2025-GEN-0

The above image plots next action validation prediction error on a completely withheld long horizon downstream task across model sizes and pre-training compute. 1B models plateau early while 6B and 7B models continue to improve as pretraining increases. The research team connect this phase transition to Moravec’s Paradox, arguing that physical commonsense and dexterity appear to require higher compute thresholds than abstract language reasoning, and that GEN-θ is operating beyond that activation point.

The Generalist AI team states that GEN-θ has been scaled to 10B+ model sizes, and that larger variants adapt to new tasks with progressively less post training.

Scaling laws for robotics

Another focus of this research is scaling laws that relate pre-training data and compute to downstream post training performance. The research team samples checkpoints from GEN-θ training runs on different subsets of the pre-training dataset, then post trains those checkpoints on multi task, language conditioned data. This supervised fine tuning stage spans 16 task sets, covering dexterity tasks such as building Lego, industry workflows such as fast food packing, and generalization tasks that include anything style instructions.

Across various tasks, more pre-training improves validation loss and next action prediction error during post training. At sufficient model scale, the relationship between pre-training dataset size and downstream validation error is well described by a power law of the form

L(D) = (D_c / D)^{α_D}

where D is the number of action trajectories in pre-training, L(D) is the validation error on a downstream task, and D_c and α_D are fitted constants. This formula lets robotics teams estimate how much pre-training data is needed to reach a target next action prediction error, or how much downstream labeled data can be traded for additional pre-training.
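As a quick illustration of how such a fit can be used (the constants below are made up for demonstration, not values reported by Generalist AI):

# Hypothetical fitted constants for one downstream task
D_c, alpha_D = 50_000, 0.25

def predicted_error(num_trajectories):
    # L(D) = (D_c / D) ** alpha_D
    return (D_c / num_trajectories) ** alpha_D

# Error predicted at 1M pre-training trajectories
print(predicted_error(1_000_000))

# Pre-training trajectories needed to reach a target validation error of 0.5
target_error = 0.5
print(D_c / (target_error ** (1 / alpha_D)))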

Data engine and infrastructure at robotics scale

GEN-θ is trained on an in house dataset of 270,000 hours of real world manipulation trajectories collected in thousands of homes, warehouses and workplaces worldwide. The data operation currently adds more than 10,000 new hours per week. The Generalist AI team claims that, as of today, GEN-θ is trained on orders of magnitude more real world manipulation data than prior large robotics datasets.

To sustain this regime, the research team has built custom hardware, data-loaders and network infrastructure, including dedicated internet lines to handle uplink bandwidth from distributed sites. The pipeline uses multi cloud contracts, custom upload machines and on the order of 10,000 compute cores for continual multimodal processing. The research team reports compression of dozens of petabytes of data and data-loading techniques from frontier video foundation models, yielding a system capable of absorbing 6.85 years of real world manipulation experience per day of training.

How you pre-train GEN-θ matters as much as how big it is

The Generalist AI team runs large ablations over 8 pre-training datasets and 10 long horizon task sets. They find that different data mixtures, not just more data, produce models with different behaviors across 3 groups of tasks: dexterity, real world applications and generalization. Performance is measured using validation mean squared error on next actions and reverse Kullback-Leibler divergence between the model policy and a Gaussian around ground truth actions.

Low MSE and low reverse KL models are better candidates for supervised fine-tuning. Models with higher MSE but low reverse KL are more multimodal in their action distributions and can be better starting points for reinforcement learning.
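A toy sketch of the two metrics on a 1-D action, assuming both the policy and the ground-truth target are modeled as Gaussians (a simplification for illustration only):

import torch
from torch.distributions import Normal

policy = Normal(loc=torch.tensor(0.2), scale=torch.tensor(0.3))  # toy model policy over one action dimension
target = Normal(loc=torch.tensor(0.0), scale=torch.tensor(0.1))  # Gaussian centered on the ground-truth action

samples = policy.sample((10_000,))
mse = ((samples - target.loc) ** 2).mean()                                  # validation-style MSE on next actions
reverse_kl = (policy.log_prob(samples) - target.log_prob(samples)).mean()   # Monte Carlo estimate of KL(policy || target)
print(f"MSE={mse:.4f}  reverse KL={reverse_kl:.4f}")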

Key Takeaways

GEN-θ is an embodied foundation model trained on high fidelity raw physical interaction data, not simulation or internet video, and it uses Harmonic Reasoning to think and act simultaneously under real world physics.

Scaling experiments show an intelligence threshold around 7B parameters, where smaller models ossify under high data load and larger models keep improving with more pretraining.

GEN-θ exhibits clear scaling laws, where downstream post training performance follows a power law in the amount of pre-training data, which lets teams predict how much data and compute are needed for target error levels.

The system is trained on more than 270,000 hours of real world manipulation data, growing by about 10,000 hours per week, supported by custom multi cloud infrastructure that can absorb 6.85 years of experience per training day.

Large scale ablations over 8 pretraining datasets and 10 long horizon task sets show that data quality and mixture design, measured with validation MSE and reverse KL, are as important as scale, since different mixtures yield models better suited for supervised finetuning or reinforcement learning.

Editorial Comments

GEN-θ positions embodied foundation models as a serious attempt to bring scaling laws to robotics, using Harmonic Reasoning, large scale multimodal pre-training and explicit analysis of data mixtures. The research shows that 7B+ models, trained on 270,000 hours of real world manipulation data with 10,000 hours added weekly, can cross an intelligence threshold where more physical interaction data predictably improves downstream performance across dexterity, applications and generalization tasks.


How to Build a Model-Native Agent That Learns Internal Planning, Memor …

In this tutorial, we explore how an agent can internalize planning, memory, and tool use within a single neural model rather than relying on external orchestration. We design a compact, model-native agent that learns to perform arithmetic reasoning tasks through reinforcement learning. By combining a stage-aware actor-critic network with a curriculum of increasingly complex environments, we enable the agent to discover how to use internalized “tools” and short-term memory to reach correct solutions end-to-end. We work step by step to observe how learning evolves from simple reasoning to multi-step compositional behavior. Check out the FULL CODES here.

import math, random, torch, torch.nn as nn, torch.nn.functional as F
device = "cuda" if torch.cuda.is_available() else "cpu"; torch.manual_seed(0); random.seed(0)
V = 18; CTX = 10; MUL, ADD, SUB, ANS, STO, RCL, EOS = 11, 12, 13, 14, 15, 16, 17
tok2str = {**{i: str(i) for i in range(10)}, CTX:"[CTX]", MUL:"[MUL]", ADD:"[ADD]", SUB:"[SUB]", ANS:"[ANS]", STO:"[STO]", RCL:"[RCL]", EOS:"[EOS]"}

class ToolEnv:
    def __init__(self, max_steps=7):
        self.max_steps = max_steps
    def sample(self, stage):
        a,b,c,d,e = [random.randint(0,9) for _ in range(5)]
        if stage==0: ctx=[a,b,c]; target=a*b+c
        elif stage==1: ctx=[a,b,c,d]; target=(a*b+c)-d
        else: ctx=[a,b,c,d,e]; target=(a*b+c)-(d*e)
        return ctx, target, (a,b,c,d,e)
    def step_seq(self, actions, abc, stage):
        a,b,c,d,e = abc; last=None; mem=None; steps=0; shaped=0.0
        goal0=a*b; goal1=goal0+c; goal2=goal1-d; goal3=d*e; goal4=goal1-goal3
        for act in actions:
            steps+=1
            if act==MUL: last=(a*b if last is None else last*(d if stage>0 else 1))
            elif act==ADD and last is not None: last+=c
            elif act==SUB and last is not None:
                last -= (e if stage==2 and mem=="use_d" else (d if stage>0 else 0))
            elif act==STO: mem="use_d" if stage>=1 else "ok"
            elif act==RCL and mem is not None:
                last = (d*e) if (stage==2 and mem=="use_d") else (last if last else 0)
            elif act==ANS:
                target=[goal0,goal1,goal2,goal4][stage] if stage==2 else [goal0,goal1,goal2][stage]
                correct=(last==target)
                if stage==0: shaped += 0.25*(last==goal0)+0.5*(last==goal1)
                if stage==1: shaped += 0.25*(last==goal0)+0.5*(last==goal1)+0.75*(last==goal2)
                if stage==2: shaped += 0.2*(last==goal0)+0.4*(last==goal1)+0.6*(last==goal4)+0.6*(last==goal3)
                return (1.0 if correct else 0.0)+0.2*shaped, steps
            if steps>=self.max_steps: break
        return 0.0, steps

We begin by setting up the environment and defining the symbolic tools our agent can use. We create a small synthetic world where each action, such as multiplication, addition, or subtraction, acts as an internal tool. This environment enables us to simulate reasoning tasks in which the agent must plan sequences of tool use to arrive at the correct answer. Check out the FULL CODES here.

class ActorCritic(nn.Module):
    def __init__(self,V,d=96,nstage=3):
        super().__init__()
        self.emb=nn.Embedding(V,d); self.stage_emb=nn.Embedding(nstage,d)
        self.rnn=nn.GRU(d,d,1,batch_first=True); self.pi=nn.Linear(d,V); self.v=nn.Linear(d,1)
    def forward(self,ctx,stage,max_len=6,greedy=False):
        B=ctx.shape[0]; ce=self.emb(ctx).mean(1)+self.stage_emb(stage).unsqueeze(1)
        h=torch.tanh(ce.mean(1)).unsqueeze(0); inp=self.emb(torch.full((B,1),CTX,device=device))
        acts,logps,ents,vals=[],[],[],[]
        for _ in range(max_len):
            out,h=self.rnn(inp,h); val=self.v(out[:,-1]); logits=self.pi(out[:,-1])
            pi=F.log_softmax(logits,dim=-1).exp(); ent=-(pi*torch.log(pi+1e-9)).sum(1)
            a=torch.argmax(logits,1) if greedy else torch.distributions.Categorical(pi).sample()
            logp=F.log_softmax(logits,dim=-1).gather(1,a.unsqueeze(1)).squeeze(1)
            inp=self.emb(a.unsqueeze(1))
            acts.append(a); logps.append(logp); ents.append(ent); vals.append(val.squeeze(1))
        return torch.stack(acts,1), torch.stack(logps,1), torch.stack(ents,1), torch.stack(vals,1)

We then design our model-native policy using an actor-critic structure built around a GRU. We embed both tokens and task stages, allowing the network to adapt its reasoning depth according to task complexity. This setup enables the agent to learn contextually when and how to use internal tools within a single unified model. Check out the FULL CODES here.

env=ToolEnv(); net=ActorCritic(V).to(device)
opt=torch.optim.Adam(net.parameters(),lr=3e-4)
def pad_batch(ctxs):
    L=max(len(c)+1 for c in ctxs)
    out=torch.full((len(ctxs),L),EOS,dtype=torch.long,device=device)
    for i,c in enumerate(ctxs): out[i,:len(c)+1]=torch.tensor(c+[CTX],device=device)
    return out
def run_batch(stage,batch=128,train=True,greedy=False):
    ctxs=[]; metas=[]
    for _ in range(batch):
        c,t,abc=env.sample(stage); ctxs.append(c); metas.append((t,abc))
    ctx=pad_batch(ctxs); stage_t=torch.full((batch,),stage,device=device,dtype=torch.long)
    acts,logps,ents,vals=net(ctx,stage_t,max_len=6,greedy=greedy)
    rewards=[]
    for i in range(batch):
        traj = acts[i].tolist()
        abc = metas[i][1]
        r,_ = env.step_seq(traj,abc,stage)
        rewards.append(r)
    R=torch.tensor(rewards,device=device).float()
    adv=(R-vals.sum(1)).detach()
    if not train: return R.mean().item(), 0.0
    pg=-(logps.sum(1)*adv).mean(); vloss=F.mse_loss(vals.sum(1),R); ent=-ents.mean()
    loss=pg+0.5*vloss+0.01*ent
    opt.zero_grad(); loss.backward(); nn.utils.clip_grad_norm_(net.parameters(),1.0); opt.step()
    return R.mean().item(), loss.item()

We implement the reinforcement learning training loop using an advantage actor-critic (A2C) update. We train the agent end-to-end across batches of synthetic problems, updating policy and value networks simultaneously. Here, we incorporate entropy regularization to promote exploration and prevent premature convergence. Check out the FULL CODES here.

print("Training...")
stages=[0,0,0,1,1,2]
for ep in range(1,61):
    stage=stages[min((ep-1)//10,len(stages)-1)]
    acc,loss=run_batch(stage,batch=192,train=True)
    if ep%5==0:
        with torch.no_grad():
            evals=[run_batch(s,train=False,greedy=True)[0] for s in [0,1,2]]
        print(f"ep={ep:02d} stage={stage} acc={acc:.3f} | eval T0={evals[0]:.3f} "
              f"T1={evals[1]:.3f} T2={evals[2]:.3f} loss={loss:.3f}")

We start the main training process using a curriculum strategy where tasks gradually increase in difficulty. As we train, we evaluate the agent on all stages to observe its ability to generalize from simpler to more complex reasoning steps. The printed metrics show how internal planning improves over time. Check out the FULL CODES here.

def explain(stage):
    c,t,abc=env.sample(stage)
    ctx=pad_batch([c]); stage_t=torch.tensor([stage],device=device)
    with torch.no_grad(): a,_,_,_=net(ctx,stage_t,greedy=True)
    seq=[tok2str[x] for x in a[0].tolist()]
    r,_=env.step_seq(a[0].tolist(),abc,stage)
    return dict(stage=stage,ctx=c,target=t,actions=" ".join(seq),reward=round(float(r),2))
with torch.no_grad():
    for s in [0,1,2]:
        print(f"\nStage {s} samples:")
        for _ in range(5): print(explain(s))
with torch.no_grad():
    finals=[run_batch(s,train=False,greedy=True,batch=1000)[0] for s in [0,1,2]]
    print(f"\nFinal greedy accuracies → T0={finals[0]:.3f}, T1={finals[1]:.3f}, T2={finals[2]:.3f}")

We finish by probing the trained agent and printing example reasoning trajectories. We visualize the sequence of tool tokens the model chooses and verify whether it reaches the correct result. Finally, we evaluate the overall performance, demonstrating that the model successfully integrates planning, memory, and reasoning into an internalized process.

In conclusion, we see that even a neural network can learn internalized planning and tool-use behaviors when trained with reinforcement signals. We successfully move beyond traditional pipeline-style architectures, where memory, planning, and execution are separate, toward a model-native agent that integrates these components as part of its learned dynamics. This approach represents a shift in agentic AI, demonstrating how end-to-end learning can produce emergent reasoning and self-organized decision-making without the need for handcrafted control loops.


OpenAI Introduces IndQA: A Culture Aware Benchmark For Indian Language …

How can we reliably test whether large language models actually understand Indian languages and culture in real world contexts? OpenAI has released IndQA, a benchmark that evaluates how well AI models understand and reason about questions that matter in Indian languages across cultural domains.

Why IndQA?

OpenAI states that about 80 percent of people worldwide do not speak English as their primary language. Yet most benchmarks that measure non English capabilities are still narrow and often rely on translation or multiple choice formats.

Benchmarks such as MMMLU and MGSM are now near saturation at the top end, where strong models cluster near similar scores. This makes it hard to see meaningful progress and does not test whether models understand local context, history and everyday life.

India is OpenAI’s starting point for new region focused benchmarks. India has about 1 billion people who do not use English as their primary language, 22 official languages with at least 7 spoken by more than 50 million people, and it is ChatGPT’s second largest market.

Dataset, Languages And Domains

IndQA evaluates knowledge and reasoning about Indian culture and everyday life in Indian languages. The benchmark spans 2,278 questions across 12 languages and 10 cultural domains, created with 261 domain experts from across India.

The cultural domains are Architecture and Design, Arts and Culture, Everyday Life, Food and Cuisine, History, Law and Ethics, Literature and Linguistics, Media and Entertainment, Religion and Spirituality, and Sports and Recreation. Items are written natively in Bengali, English, Hindi, Hinglish, Kannada, Marathi, Odia, Telugu, Gujarati, Malayalam, Punjabi and Tamil. Hinglish is included to reflect common code switching in Indian conversations.

Each datapoint contains four components, a culturally grounded prompt in an Indian language, an English translation for auditability, rubric criteria for grading and an ideal answer that encodes expert expectations.

Rubric Based Evaluation Pipeline

IndQA uses a rubric based grading procedure instead of exact match accuracy. For each question, domain experts define multiple criteria that describe what a strong answer should include or avoid and assign a weight to each criterion.

A model based grader checks the candidate response against these criteria and marks which ones are satisfied. The final score is the sum of weights for satisfied criteria divided by the total possible score. This behaves like grading a short exam answer, it supports partial credit and captures nuance and cultural correctness, not only surface token overlap.
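As a sketch of that scoring rule (the criteria and weights below are invented for illustration, not taken from IndQA):

def rubric_score(criteria, satisfied):
    # criteria: {criterion_name: weight}; satisfied: names of criteria the grader marked as met
    total = sum(criteria.values())
    earned = sum(weight for name, weight in criteria.items() if name in satisfied)
    return earned / total

criteria = {"names the correct dynasty": 3, "explains the regional context": 2, "avoids anachronisms": 1}
print(rubric_score(criteria, satisfied={"names the correct dynasty", "avoids anachronisms"}))  # 4/6 ≈ 0.67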

https://openai.com/index/introducing-indqa/

Construction Process And Adversarial Filtering

OpenAI describes a four step construction pipeline:

First, they partnered with organizations in India to recruit experts across 10 domains. These experts are native level speakers of the target language and English and have deep subject expertise. They wrote difficult, reasoning heavy prompts anchored in regional context, such as literature, food history, law or media.

Second, they applied adversarial filtering. Every draft question was evaluated with OpenAI’s strongest models at creation time, GPT-4o, OpenAI o3, GPT-4.5 and, partially after public launch, GPT-5. Only questions where a majority of these models failed to produce acceptable answers were kept. This preserves headroom so that future model improvements show up clearly on IndQA (a minimal sketch of this filtering rule follows the four steps).

Third, experts provided detailed criteria for grading each question, similar to an exam rubric. These criteria are reused whenever another model is evaluated on IndQA.

Fourth, experts wrote ideal answers and English translations and then performed peer review and iterative revisions until they signed off on quality.
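To make the second step concrete, here is a minimal sketch of the majority-fail filtering rule (the model list and the acceptability check are placeholders, not OpenAI’s actual pipeline):

def keep_question(question, models, is_acceptable):
    # Keep a draft question only if a majority of the reference models fail to answer it acceptably.
    failures = sum(1 for m in models if not is_acceptable(m, question))
    return failures > len(models) / 2

# Hypothetical usage: keep_question(q, models=["gpt-4o", "o3", "gpt-4.5"], is_acceptable=passes_rubric)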

Measuring Progress On Indian Languages

OpenAI uses IndQA to evaluate recent frontier models and to chart progress on Indian languages over the last couple of years. They report that model performance has improved significantly on IndQA while still leaving substantial room for improvement. Results are stratified by language and by domain and include comparisons of GPT-5 Thinking High with other frontier systems.

Key Takeaways

IndQA is a culturally grounded Indic benchmark: IndQA evaluates how well AI models understand and reason about questions that matter in Indian languages, across culturally specific domains, rather than only testing translation or multiple choice accuracy.

The dataset is expert built and reasonably large: The benchmark contains 2,278 questions across 12 languages and 10 cultural domains, developed in collaboration with 261 domain experts from across India, covering areas like architecture, everyday life, food, history and religion.

Evaluation is rubric based, not exact match: Each datapoint bundles a native language prompt, an English translation, a detailed grading rubric and an ideal answer, and model outputs are graded by a model based system that checks weighted expert defined criteria, which enables partial credit and nuanced cultural evaluation.

Questions are adversarially filtered against OpenAI’s strongest models: Draft questions were filtered by running GPT-4o, OpenAI o3, GPT-4.5 and partially GPT-5, and keeping only those items where most of these models failed, which preserves headroom for future models on IndQA.

Editorial Comments

IndQA is a timely step because it targets a real gap: most existing multilingual benchmarks over-index on English content and translation style tasks, while India has diverse high resource and low resource languages. IndQA brings expert curated, rubric based evaluation for questions that matter in Indian cultural contexts, and uses adversarial filtering against GPT-4o, OpenAI o3, GPT-4.5 and GPT-5 to preserve headroom for frontier models. This launch makes IndQA a practical north star for evaluating Indian language reasoning in modern AI systems.

How Amazon Search increased ML training twofold using AWS Batch for Am …

In this post, we show you how Amazon Search optimized GPU instance utilization by leveraging AWS Batch for SageMaker Training jobs. This managed solution enabled us to orchestrate machine learning (ML) training workloads on GPU-accelerated instance families like P5, P4, and others. We will also provide a step-by-step walkthrough of the use case implementation.
Machine learning at Amazon Search
At Amazon Search, we use hundreds of GPU-accelerated instances to train and evaluate ML models that help our customers discover products they love. Scientists typically train more than one model at a time to find the optimal set of features, model architecture, and hyperparameter settings that optimize the model’s performance. We previously leveraged a first-in-first-out (FIFO) queue to coordinate model training and evaluation jobs. However, we needed to employ a more nuanced criteria to prioritize which jobs should run in what order. Production models needed to run with high priority, exploratory research as medium priority, and hyperparameter sweeps and batch inference as low priority. We also needed a system that could handle interruptions. Should a job fail, or a given instance type become saturated, we needed the job to run on other available compatible instance types while respecting the overall prioritization criteria. Finally, we wanted a managed solution so we could focus more on model development instead of managing infrastructure.
After evaluating multiple options, we chose AWS Batch for Amazon SageMaker Training jobs because it best met our requirements. This solution seamlessly integrated AWS Batch with Amazon SageMaker and allowed us to run jobs per our prioritization criteria. This allows applied scientists to submit multiple concurrent jobs without manual resource management. By leveraging AWS Batch features such as advanced prioritization through fair-share scheduling, we increased peak utilization of GPU-accelerated instances from 40% to over 80%.
Amazon Search: AWS Batch for SageMaker Training Job implementation
We leveraged three AWS technologies to set up our job queue. We used Service Environments to configure the SageMaker AI parameters that AWS Batch uses to submit and manage SageMaker Training jobs. We used Share Identifiers to prioritize our workloads. Finally, we used Amazon CloudWatch to monitor the queue and provide alerting for critical events or deviations from expected behavior. Let’s dive deep into these constructs.
Service environments. We set up service environments to represent the total GPU capacity available for each instance family, such as P5s and P4s. Each service environment was configured with fixed limits based on our team’s reserved capacity in AWS Batch. Note that for teams using SageMaker Training Plans, these limits can be set to the number of reserved instances, making capacity planning more straightforward. By defining these boundaries, we established how the total GPU instance capacity within a service environment was distributed across different production jobs. Each production experiment was allocated a portion of this capacity through Share Identifiers.
Figure 1 provides a real-world example of how we used AWS Batch’s fair-share scheduling to divide 100 GPU instances between Share Identifiers. We allocated 60 instances to ProdExp1, and 40 to ProdExp2. When ProdExp2 used only 25 GPU instances, the remaining 15 could be borrowed by ProdExp1, allowing it to scale up to 75 GPU instances. When ProdExp2 later needed its full 40 GPU instances, the scheduler preempted jobs from ProdExp1 to restore balance. This example used the P4 instance family, but the same approach could apply to any SageMaker-supported EC2 instance family. This ensured that production workloads have guaranteed access to their assigned capacity, while exploratory or ad-hoc experiments could still make use of any idle GPU instances. This design safeguarded critical workloads and improved overall instance utilization by ensuring that no reserved capacity went unused.

Figure 1: AWS Batch fair-share scheduling

Share Identifiers. We used Share Identifiers to allocate fractions of a service environment’s capacity to production experiments. Share Identifiers are string tags applied at job submission time. AWS Batch used these tags to track usage and enforce fair-share scheduling. For initiatives that required dedicated capacity, we defined preset Share Identifiers with quotas in AWS Batch. This reserved capacity for production tracks. These quotas acted as fairness targets rather than hard limits. Idle capacity could still be borrowed, but under contention, AWS Batch enforced fairness by preempting resources from overused identifiers and reassigned them to underused ones.
Within each Share Identifier, job priorities ranging from 0 to 99 determined execution order, but priority-based preemption only triggered when the Share Identifier reached its allocated capacity limit. Figure 2 illustrates how we set up and used our share identifiers. ProdExp1 had 60 p4d instances and ran jobs at various priorities. Job A had a priority of 80, Job B was set to 50, Job C to 30, and Job D to 10. When all 60 instances were occupied and a new high-priority job (priority 90) requiring 15 instances was submitted, the system preempted the lowest priority running job (Job D) to make room, while maintaining the total of 60 instances for that Share Identifier.

Figure 2: Priority scheduling within a Share ID

Amazon CloudWatch. We used Amazon CloudWatch to instrument our SageMaker training jobs. SageMaker automatically publishes metrics on job progress and resource utilization, while AWS Batch provides detailed information on job scheduling and execution. With AWS Batch, we queried the status of each job through the AWS Batch APIs. This made it possible to track jobs as they transitioned through states such as SUBMITTED, PENDING, RUNNABLE, STARTING, RUNNING, SUCCEEDED, and FAILED. We published these metrics and job states to CloudWatch and configured dashboards and alarms to alert anytime we encountered extended wait times, unexpected failures, or underutilized resources. This built-in integration provided both real-time visibility and historical trend analysis, which helped our team maintain operational efficiency across GPU clusters without building custom monitoring systems.
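A simplified sketch of that instrumentation with boto3 (the namespace, metric shape, and source of job IDs are placeholders, not our production code):

import boto3

batch = boto3.client("batch")
cloudwatch = boto3.client("cloudwatch")

def publish_job_states(job_ids):
    # Query AWS Batch for job status and publish one datapoint per job state to CloudWatch.
    jobs = batch.describe_jobs(jobs=job_ids)["jobs"]
    for job in jobs:
        cloudwatch.put_metric_data(
            Namespace="SearchTrainingQueue",  # placeholder namespace
            MetricData=[{
                "MetricName": "JobState",
                "Dimensions": [{"Name": "Status", "Value": job["status"]}],
                "Value": 1,
            }],
        )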
Operational impact on team performance
By adopting AWS Batch for SageMaker Training jobs, we enabled experiments to run without concerns about resource availability or contention. Researchers could submit jobs without waiting for manual scheduling, which increased the number of experiments that could be run in parallel. This led to shorter queue times, higher GPU utilization, and faster turnaround of training results, directly improving both research throughput and delivery timelines.
How to set up AWS Batch for SageMaker Training Jobs
To set up a similar environment, you can follow this tutorial, which shows you how to orchestrate multiple GPU large language model (LLM) fine-tuning jobs using multiple GPU-powered instances. The solution is also available on GitHub.
Prerequisites
To orchestrate multiple SageMaker Training jobs with AWS Batch, first you need to complete the following prerequisites:
Clone the GitHub repository with the assets for this deployment. This repository consists of notebooks that reference assets:

git clone https://github.com/aws/amazon-sagemaker-examples/
cd build_and_train_models/sm-training-queues-pytorch/

Create AWS Batch resources
To create the necessary resources to manage SageMaker Training job queues with AWS Batch, we provide utility functions in the example to automate the creation of the Service Environment, Scheduling Policy, and Job Queue.
The service environment represents the Amazon SageMaker AI capacity limits available to schedule, expressed as a maximum number of instances. The scheduling policy indicates how compute resources are allocated in a job queue between users or workloads. The job queue is the scheduler interface that researchers interact with to submit jobs and interrogate job status. AWS Batch provides two different types of queues we can operate with:

FIFO queues – Queues in which no scheduling policies are required
Fair-share queues – Queues in which a scheduling policy Amazon Resource Name (ARN) is required to orchestrate the submitted jobs

We recommend creating dedicated service environments for each job queue in a 1:1 ratio. FIFO queues provide basic first-in-first-out ordering, while fair-share scheduling (FSS) queues provide more sophisticated scheduling, balancing utilization within a Share Identifier, share weights, and job priority. For customers who don’t need multiple shares but would like the ability to assign a priority on job submission, we recommend creating an FSS queue and using a single share within it for all submissions. To create the resources, execute the following commands:

cd smtj_batch_utils
python create_resources.py

You can navigate the AWS Batch Dashboard, shown in the following screenshot, to explore the created resources.

This automation script created two queues:

ml-c5-xlarge-queue – A FIFO queue with priority 2 used for CPU workloads
ml-g6-12xlarge-queue – A fair-share queue with priority 1 used for GPU workloads

The scheduling policy associated with the ml-g6-12xlarge-queue defines share attributes such as High priority (HIGHPRI), Medium priority (MIDPRI), and Low priority (LOWPRI), along with their share weights. Users can submit jobs to one of the three shares, HIGHPRI, MIDPRI, or LOWPRI, with weights such as 1 for high priority, 3 for medium priority, and 5 for low priority. The following screenshot shows the scheduling policy details:

For instructions on how to set up the service environment and a job queue, refer to the Getting started section in Introducing AWS Batch support for SageMaker Training Jobs blog.
Run LLM fine-tuning jobs on SageMaker AI
We run the notebook notebook.ipynb to start submitting SageMaker Training jobs with AWS Batch. The notebook contains the code to prepare the data used for the workload, upload on Amazon Simple Storage Service (Amazon S3), and define the hyperparameters required by the job to be executed.
To run the fine-tuning workload using SageMaker Training jobs, this example uses the ModelTrainer class. The ModelTrainer class is a newer and more intuitive approach to model training that significantly enhances user experience. It supports distributed training, build your own container (BYOC), and recipes.
For additional information about ModelTrainer, you can refer to Accelerate your ML lifecycle using the new and improved Amazon SageMaker Python SDK – Part 1: ModelTrainer
To set up the fine-tuning workload, complete the following steps:

Select the instance type, the container image for the training job, and define the checkpoint path where the model will be stored:

import sagemaker

instance_type = "ml.g6.12xlarge"
instance_count = 1

image_uri = sagemaker.image_uris.retrieve(
    framework="pytorch",
    region=sagemaker_session.boto_session.region_name,
    version="2.6",
    instance_type=instance_type,
    image_scope="training"
)

Create the ModelTrainer function to encapsulate the training setup. The ModelTrainer class simplifies the experience by encapsulating code and training setup. In this example:

SourceCode – The source code configuration. This is used to configure the source code for running the training job by using your local python scripts.
Compute – The compute configuration. This is used to specify the compute resources for the training job.

from sagemaker.modules.configs import Compute, OutputDataConfig, SourceCode, StoppingCondition
from sagemaker.modules.distributed import Torchrun
from sagemaker.modules.train import ModelTrainer

role = sagemaker.get_execution_role()

# Define the script to be run
source_code = SourceCode(
    source_dir="./scripts",
    requirements="requirements.txt",
    entry_script="train.py",
)

# Define the compute
compute_configs = Compute(
    instance_type=instance_type,
    instance_count=instance_count,
    keep_alive_period_in_seconds=0
)

# Define the training job name
job_name = f"train-deepseek-distill-llama-8b-sft-batch"

# Define the OutputDataConfig path
output_path = f"s3://{bucket_name}/{job_name}"

# Define the ModelTrainer
model_trainer = ModelTrainer(
    training_image=image_uri,
    source_code=source_code,
    base_job_name=job_name,
    compute=compute_configs,
    distributed=Torchrun(),
    stopping_condition=StoppingCondition(max_runtime_in_seconds=7200),
    hyperparameters={
        "config": "/opt/ml/input/data/config/args.yaml"
    },
    output_data_config=OutputDataConfig(s3_output_path=output_path),
    role=role,
)

Set up the input channels for ModelTrainer by creating InputData objects from the provided S3 bucket paths for the training and validation datasets:

from sagemaker.modules.configs import InputData

train_input = InputData(
    channel_name="train",
    data_source=train_dataset_s3_path,
)
val_input = InputData(
    channel_name="val",
    data_source=val_dataset_s3_path,
)
config_input = InputData(
    channel_name="config",
    data_source=train_config_s3_path,
)

TRAINING_INPUTS = [train_input, val_input, config_input]

Queue SageMaker Training jobs
This section and the following are intended to be used interactively so that you can explore how to use the Amazon SageMaker Python SDK to submit jobs to your Batch queues. Follow these steps:

Select the queue to use:

from sagemaker.aws_batch.queue import TrainingQueue
SMTJ_BATCH_QUEUE = "ml-g6-12xlarge-queue"

queue = TrainingQueue(SMTJ_BATCH_QUEUE)

In the next cell, submit two training jobs in the queue:

LOW PRIORITY
MEDIUM PRIORITY

Use the submit API to submit the jobs:

job_name_1 = job_name + "-low-pri"
queued_job_1 = queue.submit(
    model_trainer, TRAINING_INPUTS, job_name_1, priority=5, share_identifier="LOWPRI"
)
job_name_2 = job_name + "-mid-pri"
queued_job_2 = queue.submit(
    model_trainer, TRAINING_INPUTS, job_name_2, priority=3, share_identifier="MIDPRI"
)

Display the status of running and in queue jobs
We can use the job queue list and job queue snapshot APIs to programmatically view a snapshot of the jobs that the queue will run next. For fair-share queues, this ordering is dynamic and occasionally needs to be refreshed because new jobs are submitted to the queue or as share usage changes over time.

from utils.queue_utils import print_queue_state
print_queue_state(queue)

The following screenshot shows the jobs submitted with low priority and medium priority in the Runnable State and in the queue.

You can also refer to the AWS Batch Dashboard, shown in the following screenshot, to analyze the status of the jobs.

As shown in the following screenshot, the first job executed as a SageMaker Training job is the MEDIUM PRIORITY one, respecting the scheduling policy rules defined previously.

You can explore the running training job in the SageMaker AI console, as shown in the following screenshot.

Submit an additional job
You can now submit an additional SageMaker Training job with HIGH PRIORITY to the queue:

job_name_3 = job_name + "-high-pri"
queued_job_3 = queue.submit(
    model_trainer, TRAINING_INPUTS, job_name_3, priority=1, share_identifier="HIGHPRI"
)

You can explore the status from the dashboard, as shown in the following screenshot.

The HIGH PRIORITY job, despite being submitted later in the queue, will be executed before the other runnable jobs by respecting the scheduling policy rules, as shown in the following screenshot.

As the scheduling policy in the screenshot shows, the LOWPRI share has a higher weight factor (5) than the MIDPRI share (3). Since a lower weight signifies higher priority, a LOWPRI job will be executed after a MIDPRI job, even if they are submitted at the same time.

Clean up
To clean up your resources to avoid incurring future charges, follow these steps:

Verify that your training job isn’t running anymore. To do so, on your SageMaker console, choose Training and check Training jobs.
Delete AWS Batch resources by using the command python create_resources.py --clean from the GitHub example or by manually deleting them from the AWS Management Console.

Conclusion
In this post, we demonstrated how Amazon Search used AWS Batch for SageMaker Training Jobs to optimize GPU resource utilization and training job management. The solution transformed their training infrastructure by implementing sophisticated queue management and fair share scheduling, increasing peak GPU utilization from 40% to over 80%. We recommend that organizations facing similar ML training infrastructure challenges explore AWS Batch integration with SageMaker, which provides built-in queue management capabilities and priority-based scheduling. The solution eliminates manual resource coordination while providing workloads with appropriate prioritization through configurable scheduling policies.
To begin implementing AWS Batch with SageMaker training jobs, you can access our sample code and implementation guide in the amazon-sagemaker-examples repository on GitHub. The example demonstrates how to set up AWS Identity and Access Management (IAM) permissions, create AWS Batch resources, and orchestrate multiple GPU-powered training jobs using ModelTrainer class.

The authors would like to thank Charles Thompson and Kanwaljit Khurmi for their collaboration.
About the authors

Mona Mona
Mona is a generative AI Specialist Solutions Architect at Amazon. She is a published author of two books: Natural Language Processing with AWS AI Services and Google Cloud Certified Professional Machine Learning Study Guide.

Mayank Jha
Mayank is a Senior Machine Learning Engineer at Amazon Search working on the model training optimization. He is passionate about finding practical applications for complex problems at hand and aims to develop solutions that have a deep impact on how businesses and people thrive.

Bruno Pistone
Bruno is a Senior generative AI and ML Specialist Solutions Architect for AWS based in Milan. He works with large customers helping them to deeply understand their technical needs and design AI and Machine Learning solutions that make the best use of the AWS Cloud and the Amazon Machine Learning stack. He enjoys spending time with his friends and exploring new places, as well as travelling to new destinations.

James Park
James is a Solutions Architect at Amazon Web Services. He works with Amazon.com to design, build, and deploy technology solutions on AWS, and has a particular interest in AI and machine learning. In his spare time he enjoys seeking out new cultures, new experiences, and staying up to date with the latest technology trends.

How Can We Build Scalable and Reproducible Machine Learning Experiment …

In this tutorial, we explore Hydra, an advanced configuration management framework originally developed and open-sourced by Meta Research. We begin by defining structured configurations using Python dataclasses, which allows us to manage experiment parameters in a clean, modular, and reproducible manner. As we move through the tutorial, we compose configurations, apply runtime overrides, and simulate multirun experiments for hyperparameter sweeps. Check out the FULL CODES here.

import subprocess
import sys
subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", "hydra-core"])

import hydra
from hydra import compose, initialize_config_dir
from omegaconf import OmegaConf, DictConfig
from dataclasses import dataclass, field
from typing import List, Optional
import os
from pathlib import Path

We begin by installing Hydra and importing all the essential modules required for structured configurations, dynamic composition, and file handling. This setup ensures our environment is ready to execute the full tutorial seamlessly on Google Colab. Check out the FULL CODES here.

@dataclass
class OptimizerConfig:
    _target_: str = "torch.optim.SGD"
    lr: float = 0.01

@dataclass
class AdamConfig(OptimizerConfig):
    _target_: str = "torch.optim.Adam"
    lr: float = 0.001
    betas: tuple = (0.9, 0.999)
    weight_decay: float = 0.0

@dataclass
class SGDConfig(OptimizerConfig):
    _target_: str = "torch.optim.SGD"
    lr: float = 0.01
    momentum: float = 0.9
    nesterov: bool = True

@dataclass
class ModelConfig:
    name: str = "resnet"
    num_layers: int = 50
    hidden_dim: int = 512
    dropout: float = 0.1

@dataclass
class DataConfig:
    dataset: str = "cifar10"
    batch_size: int = 32
    num_workers: int = 4
    augmentation: bool = True

@dataclass
class TrainingConfig:
    model: ModelConfig = field(default_factory=ModelConfig)
    data: DataConfig = field(default_factory=DataConfig)
    optimizer: OptimizerConfig = field(default_factory=AdamConfig)
    epochs: int = 100
    seed: int = 42
    device: str = "cuda"
    experiment_name: str = "exp_001"

We define clean, type-safe configurations using Python dataclasses for the model, data, and optimizer settings. This structure allows us to manage complex experiment parameters in a modular and readable way while ensuring consistency across runs. Check out the FULL CODES here.

def setup_config_dir():
    config_dir = Path("./hydra_configs")
    config_dir.mkdir(exist_ok=True)

    main_config = """
defaults:
- model: resnet
- data: cifar10
- optimizer: adam
- _self_

epochs: 100
seed: 42
device: cuda
experiment_name: exp_001
"""
    (config_dir / "config.yaml").write_text(main_config)

    model_dir = config_dir / "model"
    model_dir.mkdir(exist_ok=True)

    (model_dir / "resnet.yaml").write_text("""
name: resnet
num_layers: 50
hidden_dim: 512
dropout: 0.1
""")

    (model_dir / "vit.yaml").write_text("""
name: vision_transformer
num_layers: 12
hidden_dim: 768
dropout: 0.1
patch_size: 16
""")

    data_dir = config_dir / "data"
    data_dir.mkdir(exist_ok=True)

    (data_dir / "cifar10.yaml").write_text("""
dataset: cifar10
batch_size: 32
num_workers: 4
augmentation: true
""")

    (data_dir / "imagenet.yaml").write_text("""
dataset: imagenet
batch_size: 128
num_workers: 8
augmentation: true
""")

    opt_dir = config_dir / "optimizer"
    opt_dir.mkdir(exist_ok=True)

    (opt_dir / "adam.yaml").write_text("""
_target_: torch.optim.Adam
lr: 0.001
betas: [0.9, 0.999]
weight_decay: 0.0
""")

    (opt_dir / "sgd.yaml").write_text("""
_target_: torch.optim.SGD
lr: 0.01
momentum: 0.9
nesterov: true
""")

    return str(config_dir.absolute())

We programmatically create a directory containing YAML configuration files for models, datasets, and optimizers. This approach enables us to demonstrate how Hydra automatically composes configurations from different files, thereby maintaining flexibility and clarity in experiments. Check out the FULL CODES here.

@hydra.main(version_base=None, config_path="hydra_configs", config_name="config")
def train(cfg: DictConfig) -> float:
    print("=" * 80)
    print("CONFIGURATION")
    print("=" * 80)
    print(OmegaConf.to_yaml(cfg))

    print("\n" + "=" * 80)
    print("ACCESSING CONFIGURATION VALUES")
    print("=" * 80)
    print(f"Model: {cfg.model.name}")
    print(f"Dataset: {cfg.data.dataset}")
    print(f"Batch Size: {cfg.data.batch_size}")
    print(f"Optimizer LR: {cfg.optimizer.lr}")
    print(f"Epochs: {cfg.epochs}")

    best_acc = 0.0
    for epoch in range(min(cfg.epochs, 3)):
        acc = 0.5 + (epoch * 0.1) + (cfg.optimizer.lr * 10)
        best_acc = max(best_acc, acc)
        print(f"Epoch {epoch+1}/{cfg.epochs}: Accuracy = {acc:.4f}")

    return best_acc

We implement a training function that leverages Hydra’s configuration system to print, access, and use nested config values. By simulating a simple training loop, we showcase how Hydra cleanly integrates experiment control into real workflows.

def demo_basic_usage():
    print("\n" + " DEMO 1: Basic Configuration\n")
    config_dir = setup_config_dir()
    with initialize_config_dir(version_base=None, config_dir=config_dir):
        cfg = compose(config_name="config")
        print(OmegaConf.to_yaml(cfg))

def demo_config_override():
    print("\n" + " DEMO 2: Configuration Overrides\n")
    config_dir = setup_config_dir()
    with initialize_config_dir(version_base=None, config_dir=config_dir):
        cfg = compose(
            config_name="config",
            overrides=[
                "model=vit",
                "data=imagenet",
                "optimizer=sgd",
                "optimizer.lr=0.1",
                "epochs=50"
            ]
        )
        print(OmegaConf.to_yaml(cfg))

def demo_structured_config():
    print("\n" + " DEMO 3: Structured Config Validation\n")
    from hydra.core.config_store import ConfigStore
    cs = ConfigStore.instance()
    cs.store(name="training_config", node=TrainingConfig)
    with initialize_config_dir(version_base=None, config_dir=setup_config_dir()):
        cfg = compose(config_name="config")
        print(f"Config type: {type(cfg)}")
        print(f"Epochs (validated as int): {cfg.epochs}")

def demo_multirun_simulation():
    print("\n" + " DEMO 4: Multirun Simulation\n")
    config_dir = setup_config_dir()
    experiments = [
        ["model=resnet", "optimizer=adam", "optimizer.lr=0.001"],
        ["model=resnet", "optimizer=sgd", "optimizer.lr=0.01"],
        ["model=vit", "optimizer=adam", "optimizer.lr=0.0001"],
    ]
    results = {}
    for i, overrides in enumerate(experiments):
        print(f"\n--- Experiment {i+1} ---")
        with initialize_config_dir(version_base=None, config_dir=config_dir):
            cfg = compose(config_name="config", overrides=overrides)
            print(f"Model: {cfg.model.name}, Optimizer: {cfg.optimizer._target_}")
            print(f"Learning Rate: {cfg.optimizer.lr}")
            results[f"exp_{i+1}"] = cfg
    return results

def demo_interpolation():
    print("\n" + " DEMO 5: Variable Interpolation\n")
    cfg = OmegaConf.create({
        "model": {"name": "resnet", "layers": 50},
        "experiment": "${model.name}_${model.layers}",
        "output_dir": "/outputs/${experiment}",
        "checkpoint": "${output_dir}/best.ckpt"
    })
    print(OmegaConf.to_yaml(cfg))
    print(f"\nResolved experiment name: {cfg.experiment}")
    print(f"Resolved checkpoint path: {cfg.checkpoint}")

We demonstrate Hydra’s advanced capabilities, including config overrides, structured config validation, multirun simulations, and variable interpolation. Each demo shows how Hydra speeds up experimentation, reduces manual setup, and fosters reproducibility in research.
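One detail worth knowing about interpolation: the ${...} references stay live until they are resolved, so a later edit to model.name would also change the experiment name. A small follow-up sketch using OmegaConf.resolve to freeze the values:

from omegaconf import OmegaConf

cfg = OmegaConf.create({
    "model": {"name": "resnet", "layers": 50},
    "experiment": "${model.name}_${model.layers}",
})
OmegaConf.resolve(cfg)      # replaces the ${...} references with concrete values
cfg.model.name = "vit"      # no longer affects the resolved experiment name
print(cfg.experiment)       # resnet_50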

if __name__ == "__main__":
    demo_basic_usage()
    demo_config_override()
    demo_structured_config()
    demo_multirun_simulation()
    demo_interpolation()
    print("\n" + "=" * 80)
    print("Tutorial complete! Key takeaways:")
    print("✓ Config composition with defaults")
    print("✓ Runtime overrides via command line")
    print("✓ Structured configs with type safety")
    print("✓ Multirun for hyperparameter sweeps")
    print("✓ Variable interpolation")
    print("=" * 80)

We execute all demonstrations in sequence to observe Hydra in action, from loading configs to performing multiruns. By the end, we summarize key takeaways, reinforcing how Hydra enables scalable and elegant experiment management.

In conclusion, we grasp how Hydra, pioneered by Meta Research, simplifies and enhances experiment management through its powerful composition system. We explore structured configs, interpolation, and multirun capabilities that make large-scale machine learning workflows more flexible and maintainable. With this knowledge, you are now equipped to integrate Hydra into your own research or development pipelines, ensuring reproducibility, efficiency, and clarity in every experiment you run.

The post How Can We Build Scalable and Reproducible Machine Learning Experiment Pipelines Using Meta Research Hydra? appeared first on MarkTechPost.

Comparing the Top 7 Large Language Models (LLMs)/Systems for Coding in 2025

Code-oriented large language models have moved from autocomplete to full software engineering systems. In 2025, leading models must fix real GitHub issues, refactor multi-repo backends, write tests, and run as agents over long context windows. The main question for teams is no longer “can it code” but which model fits which constraints.

Here are seven models (and systems around them) that cover most real coding workloads today:

OpenAI GPT-5 / GPT-5-Codex

Anthropic Claude 3.5 Sonnet / Claude 4.x Sonnet with Claude Code

Google Gemini 2.5 Pro

Meta Llama 3.1 405B Instruct

DeepSeek-V2.5-1210 (with DeepSeek-V3 as the successor)

Alibaba Qwen2.5-Coder-32B-Instruct

Mistral Codestral 25.01

The goal of this comparison is not to rank them on a single score. The goal is to show which system to pick for a given benchmark target, deployment model, governance requirement, and IDE or agent stack.

Evaluation dimensions

We compare on six stable dimensions:

Core coding quality: HumanEval, MBPP / MBPP EvalPlus, code generation and repair quality on standard Python tasks.

Repo and bug-fix performance: SWE-bench Verified (real GitHub issues), Aider Polyglot (whole-file edits), RepoBench, LiveCodeBench.

Context and long-context behavior: Documented context limits and practical behavior in long sessions.

Deployment model: Closed API, cloud service, containers, on-premises or fully self-hosted open weights.

Tooling and ecosystem: Native agents, IDE extensions, cloud integration, GitHub and CI/CD support.

Cost and scaling pattern: Token pricing for closed models, hardware footprint and inference pattern for open models.


1. OpenAI GPT-5 / GPT-5-Codex

OpenAI’s GPT-5 is the flagship reasoning and coding model and the default in ChatGPT. For real-world code, OpenAI reports:

SWE-bench Verified: 74.9%

Aider Polyglot: 88%

Both benchmarks simulate real engineering: SWE-bench Verified runs against upstream repos and tests; Aider Polyglot measures whole-file multi-language edits.

Context and variants

gpt-5 (chat) API: 128k token context.

gpt-5-pro / gpt-5-codex: up to 400k combined context in the model card, with typical production limits of roughly 272k input + 128k output tokens for reliability.

GPT-5 and GPT-5-Codex are available in ChatGPT (Plus / Pro / Team / Enterprise) and via the OpenAI API; they are closed-weight, cloud-hosted only.

Strengths

Highest published SWE-bench Verified and Aider Polyglot scores among widely available models.

Very strong at multi-step bug fixing with “thinking” (chain-of-thought) enabled.

Deep ecosystem: ChatGPT, Copilot, and many third-party IDE and agent platforms use GPT-5 backends.

Limits

No self-hosting; all traffic must go through OpenAI or partners.

Long-context calls are expensive if you stream full monorepos, so you need retrieval and diff-only patterns.

Use when you want maximum repo-level benchmark performance and are comfortable with a closed, cloud API.
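For teams that do go the hosted route, integration is a standard API call. A minimal, hedged sketch using the OpenAI Python SDK; the "gpt-5" model string follows this article, and the prompt is illustrative:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
resp = client.chat.completions.create(
    model="gpt-5",  # model name as referenced in this article; confirm against the current model list
    messages=[
        {"role": "system", "content": "You are a careful senior software engineer."},
        {"role": "user", "content": "Write unit tests for a function that parses ISO 8601 dates."},
    ],
)
print(resp.choices[0].message.content)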

2. Anthropic Claude 3.5 Sonnet / Claude 4.x + Claude Code

Claude 3.5 Sonnet was Anthropic’s main coding workhorse before the Claude 4 line. Anthropic highlights it as SOTA on HumanEval, and independent comparisons report:

HumanEval: ≈ 92%

MBPP EvalPlus: ≈ 91%

In 2025, Anthropic released Claude 4 Opus, Sonnet, and Sonnet 4.5, positioning Sonnet 4.5 as its best coding and agent model so far.

Claude Code stack

Claude Code is a repo-aware coding system:

Managed VM connected to your GitHub repo.

File browsing, editing, tests, and PR creation.

SDK for building custom agents that use Claude as a coding backend.

Strengths

Very strong HumanEval / MBPP, good empirical behavior on debugging and code review.

Production-grade coding agent environment with persistent VM and GitHub workflows.

Limits

Closed and cloud-hosted, similar to GPT-5 in governance terms.

Published SWE-bench Verified numbers for Claude 3.5 Sonnet are below GPT-5, though Claude 4.x is likely closer.

Use when you need explainable debugging, code review, and a managed repo-level agent and can accept a closed deployment.
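A minimal, hedged sketch of calling a Claude Sonnet model through the Anthropic Python SDK; the "claude-3-5-sonnet-latest" alias and the prompt are illustrative, and Claude Code itself is driven through its own CLI and SDK rather than this raw API:

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
msg = client.messages.create(
    model="claude-3-5-sonnet-latest",  # illustrative alias; pick the Sonnet version you target
    max_tokens=1024,
    messages=[{"role": "user", "content": "Review this function and suggest a safer version:\n\ndef mean(xs): return sum(xs) / len(xs)"}],
)
print(msg.content[0].text)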

3. Google Gemini 2.5 Pro

Gemini 2.5 Pro is Google DeepMind’s main coding and reasoning model for developers. Google reports the following results:

LiveCodeBench v5: 70.4%

Aider Polyglot (whole-file editing): 74.0%

SWE-bench Verified: 63.8%

These results place Gemini 2.5 Pro above many earlier models and only behind Claude 3.7 and GPT-5 on SWE-bench Verified.

Context and platform

Long-context capability marketed up to 1M tokens across the Gemini family; 2.5 Pro is the stable tier used in Gemini Apps, Google AI Studio, and Vertex AI.

Tight integration with GCP services, BigQuery, Cloud Run, and Google Workspace.

Strengths

Good combination of LiveCodeBench, Aider, SWE-bench scores plus first-class GCP integration.

Strong choice for “data plus application code” when you want the same model for SQL, analytics helpers, and backend code.

Limits

Closed and tied to Google Cloud.

For pure SWE-bench Verified, GPT-5 and the newest Claude Sonnet 4.x are stronger.

Use when your workloads already run on GCP / Vertex AI and you want a long-context coding model inside that stack.
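A minimal, hedged sketch using the google-genai Python SDK; it assumes a GEMINI_API_KEY or GOOGLE_API_KEY in the environment, and Vertex AI users would configure the client for their project instead:

from google import genai

client = genai.Client()  # picks up GEMINI_API_KEY / GOOGLE_API_KEY from the environment
resp = client.models.generate_content(
    model="gemini-2.5-pro",
    contents="Write a BigQuery SQL query that returns the top 10 customers by total order value.",
)
print(resp.text)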

4. Meta Llama 3.1 405B Instruct

Meta’s Llama 3.1 family (8B, 70B, 405B) is open-weight. The 405B Instruct variant is the high-end option for coding and general reasoning. Meta reports the following results:

HumanEval (Python): 89.0

MBPP (base or EvalPlus): ≈ 88.6

These scores put Llama 3.1 405B among the strongest open models on classic code benchmarks.

The official model card states that Llama 3.1 models outperform many open and closed chat models on common benchmarks and are optimized for multilingual dialogue and reasoning.

Strengths

High HumanEval / MBPP scores with open weights and permissive licensing.

Strong general performance (MMLU, MMLU-Pro, etc.), so one model can serve both product features and coding agents.

Limits

405B parameters mean high serving cost and latency unless you have a large GPU cluster.

For strictly code benchmarks at a fixed compute budget, specialized models such as Qwen2.5-Coder-32B and Codestral 25.01 are more cost-efficient.

Use when you want a single open foundation model with strong coding and general reasoning, and you control your own GPU infrastructure.
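A hedged sketch of self-hosting a Llama 3.1 Instruct checkpoint with vLLM; the 405B checkpoint is gated on Hugging Face and needs a multi-GPU cluster, so the tensor_parallel_size value here is purely illustrative and the 8B or 70B variants are the usual starting point:

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-405B-Instruct",  # gated checkpoint; swap in 8B/70B for smaller hardware
    tensor_parallel_size=8,                      # illustrative; match your GPU count
)
params = SamplingParams(temperature=0.2, max_tokens=256)
outputs = llm.generate(
    ["# Python\n# Write a function that merges two sorted lists into one sorted list\n"],
    params,
)
print(outputs[0].outputs[0].text)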

5. DeepSeek-V2.5-1210 (and DeepSeek-V3)

DeepSeek-V2.5-1210 is an upgraded Mixture-of-Experts model that merges the chat and coder lines. The model card reports:

LiveCodeBench (08.01–12.01): improved from 29.2% to 34.38%

MATH-500: 74.8% → 82.8%

DeepSeek has since released DeepSeek-V3, a 671B-parameter MoE with 37B active parameters per token, trained on 14.8T tokens. Its performance is comparable to leading closed models on many reasoning and coding benchmarks, and public dashboards show V3 ahead of V2.5 on key tasks.

Strengths

Open MoE model with solid LiveCodeBench results and good math performance for its size.

Efficient active-parameter count vs total parameters.

Limits

V2.5 is no longer the flagship; DeepSeek-V3 is now the reference model.

Ecosystem is lighter than OpenAI / Google / Anthropic; teams must assemble their own IDE and agent integrations.

Use when you want a self-hosted MoE coder with open weights and are ready to move to DeepSeek-V3 as it matures.
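If you prefer the hosted endpoint while your self-hosted stack matures, DeepSeek exposes an OpenAI-compatible API. A minimal, hedged sketch; the "deepseek-chat" model name and base URL follow DeepSeek's public documentation, but verify them before relying on this:

from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DEEPSEEK_API_KEY",       # placeholder
    base_url="https://api.deepseek.com",   # OpenAI-compatible endpoint
)
resp = client.chat.completions.create(
    model="deepseek-chat",
    messages=[{"role": "user", "content": "Implement binary search in Python and include a few tests."}],
)
print(resp.choices[0].message.content)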

6. Qwen2.5-Coder-32B-Instruct

Qwen2.5-Coder is Alibaba’s code-specific LLM family. The technical report and model card describe six sizes (0.5B to 32B) and continued pretraining on over 5.5T tokens of code-heavy data.

The official benchmarks for Qwen2.5-Coder-32B-Instruct list:

HumanEval: 92.7%

MBPP: 90.2%

LiveCodeBench: 31.4%

Aider Polyglot: 73.7%

Spider: 85.1%

CodeArena: 68.9%

Strengths

Very strong HumanEval / MBPP / Spider results for an open model; often competitive with closed models in pure code tasks.

Multiple parameter sizes make it adaptable to different hardware budgets.

Limits

Less suited for broad general reasoning than a generalist like Llama 3.1 405B or DeepSeek-V3.

English-language documentation and ecosystem tooling are still catching up.

Use when you need a self-hosted, high-accuracy code model and can pair it with a general LLM for non-code tasks.
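A minimal, hedged sketch of running Qwen2.5-Coder-32B-Instruct locally with Hugging Face transformers; it assumes enough GPU memory for the 32B checkpoint, and the smaller Qwen2.5-Coder sizes drop in the same way:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-Coder-32B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Write a Python function that validates an email address with a regex."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))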

7. Mistral Codestral 25.01

Codestral 25.01 is Mistral’s updated code generation model. Mistral’s announcement and follow-up posts state that 25.01 uses a more efficient architecture and tokenizer and generates code roughly 2× faster than the base Codestral model.

Benchmark reports:

HumanEval: 86.6%

MBPP: 80.2%

Spider: 66.5%

RepoBench: 38.0%

LiveCodeBench: 37.9%

Codestral 25.01 supports over 80 programming languages and a 256k token context window, and is optimized for low-latency, high-frequency tasks such as completion and FIM.

Strengths

Very good RepoBench / LiveCodeBench scores for a mid-size open model.

Designed for fast interactive use in IDEs and SaaS, with open weights and a 256k context.

Limits

Absolute HumanEval / MBPP scores sit below Qwen2.5-Coder-32B, which is expected at this parameter class.

Use when you need a compact, fast open code model for completions and FIM at scale.
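A hedged sketch of Codestral's fill-in-the-middle mode through the mistralai Python SDK; the fim.complete call and the "codestral-latest" model name reflect Mistral's public SDK as I understand it, so treat this as an assumption to verify rather than a definitive integration:

import os
from mistralai import Mistral

client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])
resp = client.fim.complete(
    model="codestral-latest",  # assumed alias; check Mistral's current model list
    prompt="def is_palindrome(s: str) -> bool:\n    ",
    suffix="\n\nprint(is_palindrome('level'))",
)
print(resp.choices[0].message.content)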

Head to head comparison

GPT-5 / GPT-5-Codex
Core task: Hosted general model with strong coding and agents
Context: 128k (chat), up to 400k for Pro / Codex
Code benchmarks (examples): 74.9 SWE-bench Verified, 88 Aider Polyglot
Deployment: Closed API, OpenAI / Copilot stack
Integration path: ChatGPT, OpenAI API, Copilot
Best fit: Max SWE-bench / Aider performance in a hosted setting

Claude 3.5 / 4.x + Claude Code
Core task: Hosted models plus repo-level coding VM
Context: 200k-class (varies by tier)
Code benchmarks (examples): ≈92 HumanEval, ≈91 MBPP, 49 SWE-bench Verified (3.5); 4.x stronger but less published
Deployment: Closed API, Anthropic console, Claude Code
Integration path: Claude app, Claude Code, SDKs
Best fit: Repo-level agents and debugging quality

Gemini 2.5 Pro
Core task: Hosted coding and reasoning model on GCP
Context: Long-context, million-class across the Gemini line
Code benchmarks (examples): 70.4 LiveCodeBench, 74 Aider, 63.8 SWE-bench Verified
Deployment: Closed API, Google AI Studio / Vertex AI
Integration path: Gemini Apps, Vertex AI, GCP
Best fit: GCP-centric engineering and data + code

Llama 3.1 405B Instruct
Core task: Open generalist foundation with strong coding
Context: Up to 128k in many deployments
Code benchmarks (examples): 89 HumanEval, ≈88.6 MBPP
Deployment: Open weights, self-hosted or cloud
Integration path: Hugging Face, vLLM, cloud marketplaces
Best fit: Single open foundation model

DeepSeek-V2.5-1210 / V3
Core task: Open MoE coder and chat model
Context: Tens of k, MoE scaling
Code benchmarks (examples): 34.38 LiveCodeBench; V3 stronger on mixed benchmarks
Deployment: Open weights, self-hosted; V3 via providers
Integration path: Hugging Face, vLLM, custom stacks
Best fit: Open MoE experiments and Chinese ecosystem

Qwen2.5-Coder-32B
Core task: Open code-specialized model
Context: Typically 32k–128k depending on host
Code benchmarks (examples): 92.7 HumanEval, 90.2 MBPP, 31.4 LiveCodeBench, 73.7 Aider
Deployment: Open weights, self-hosted or via providers
Integration path: Hugging Face, commercial APIs, local runners
Best fit: Self-hosted high-accuracy code assistant

Codestral 25.01
Core task: Open mid-size code model
Context: 256k context
Code benchmarks (examples): 86.6 HumanEval, 80.2 MBPP, 38 RepoBench, 37.9 LiveCodeBench
Deployment: Open weights, available on multiple clouds
Integration path: Azure, GCP, custom inference, IDE plugins
Best fit: Fast open model for IDE and product integration

What to use when?

You want the strongest hosted repo-level solver: Use GPT-5 / GPT-5-Codex. Claude Sonnet 4.x is the closest competitor, but GPT-5 has the clearest SWE-bench Verified and Aider numbers today.

You want a full coding agent over a VM and GitHub: Use Claude Sonnet + Claude Code for repo-aware workflows and long multi-step debugging sessions.

You are standardized on Google Cloud: Use Gemini 2.5 Pro as the default coding model inside Vertex AI and AI Studio.

You need a single open general foundation: Use Llama 3.1 405B Instruct when you want one open model for application logic, RAG, and code.

You want the strongest open code specialist: Use Qwen2.5-Coder-32B-Instruct, and add a smaller general LLM for non-code tasks if needed.

You want MoE-based open models: Use DeepSeek-V2.5-1210 now and plan for DeepSeek-V3 as you move to the latest upgrade.

You are building IDEs or SaaS products and need a fast open code model: Use Codestral 25.01 for FIM, completion, and mid-size repo work with 256k context.

Editorial comments

GPT-5, Claude Sonnet 4.x, and Gemini 2.5 Pro now define the upper bound of hosted coding performance, especially on SWE-bench Verified and Aider Polyglot. At the same time, open models such as Llama 3.1 405B, Qwen2.5-Coder-32B, DeepSeek-V2.5/V3, and Codestral 25.01 show that it is realistic to run high-quality coding systems on your own infrastructure, with full control over weights and data paths.

For most software engineering teams, the practical answer is a portfolio: one or two hosted frontier models for the hardest multi-service refactors, plus one or two open models for internal tools, regulated code bases, and latency-sensitive IDE integrations.

References

OpenAI – Introducing GPT-5 for developers (SWE-bench Verified, Aider Polyglot) (openai.com)

Vellum, Runbear and other benchmark summaries for GPT-5 coding performance (vellum.ai)

Anthropic – Claude 3.5 Sonnet and Claude 4 announcements (Anthropic)

Kitemetric and other third-party Claude 3.5 Sonnet coding benchmark reviews (Kite Metric)

Google – Gemini 2.5 Pro model page and Google / Datacamp benchmark posts (Google DeepMind)

Meta – Llama 3.1 405B model card and analyses of HumanEval / MBPP scores (Hugging Face)

DeepSeek – DeepSeek-V2.5-1210 model card and update notes; community coverage on V3 (Hugging Face)

Alibaba – Qwen2.5-Coder technical report and Hugging Face model card (arXiv)

Mistral – Codestral 25.01 announcement and benchmark summaries (Mistral AI)

The post Comparing the Top 7 Large Language Models LLMs/Systems for Coding in 2025 appeared first on MarkTechPost.