Moonshot AI Researchers Introduce Seer: An Online Context Learning System for Fast Synchronous Reinforcement Learning RL Rollouts

How do you keep reinforcement learning for large reasoning models from stalling on a few very long, very slow rollouts while GPUs sit underused? A team of researchers from Moonshot AI and Tsinghua University introduces ‘Seer’, a new online context learning system that targets a specific systems bottleneck in reinforcement learning for large language models. In synchronous on policy setups, the rollout phase dominates the cost of each iteration. Seer restructures this phase and reports rollout throughput gains of 74 percent to 97 percent and tail latency reductions of 75 percent to 93 percent compared with a strong synchronous baseline called veRL.

https://arxiv.org/pdf/2511.14617

Why is synchronous rollout slow for reasoning models?

Modern reasoning RL workloads use long chain of thought style outputs. In the Seer experiments, the researchers apply GRPO to three different models, Moonlight, Qwen2 VL 72B and Kimi K2. These workloads run on 32 compute nodes with 8 H800 GPUs per node. The three tasks use 32, 128 and 256 GPUs respectively, with 400, 600 and 800 prompts per iteration and 8 or 16 responses per prompt.

Maximum generation length is large. Moonlight is configured for 65,536 tokens, Qwen2 VL 72B for 40,960 tokens and Kimi K2 for 98,304 tokens. A single long chain of thought request can grow from a few hundred megabytes of KVCache to tens of gigabytes as decoding progresses. This memory growth forces instances to reduce concurrency or to preempt requests, which triggers expensive re decoding.

The research team defines tail requests as the last 10 percent of requests to finish in a rollout. For Moonlight and Qwen2 VL 72B, this tail alone can consume up to 50 percent of the total rollout time in the baseline system. Rollout already dominates iteration time, so this tail effect directly slows RL.


Seer architecture on top of Mooncake and vLLM

Seer keeps the RL algorithm identical to synchronous veRL. Each training iteration uses only data from the current rollout iteration, so the system preserves on policy behavior. The training phase uses Megatron for distributed optimization. The rollout phase uses an in house implementation of vLLM as the inference engine.

To support aggressive request scheduling, Seer relies on a Global KVCache Pool built on the Mooncake disaggregated KVCache architecture used in production for Kimi. Mooncake provides a two tier DRAM and SSD KV cache store shared across inference nodes, which allows Seer to migrate requests without recomputing prefills.

On top of this substrate, Seer introduces three key mechanisms:

Divided Rollout

Context Aware Scheduling

Adaptive Grouped Speculative Decoding

These are orchestrated by a Request Buffer, a Context Manager and an Inference Engine Pool connected to the Global KVCache Pool.


Divided Rollout, fine grained scheduling and migration

Conventional synchronous rollout assigns whole GRPO groups to inference instances. A group is a set of requests that share one prompt. Once assigned, a group stays on the same instance until all responses finish. Due to large variance in output lengths, this leads to load imbalance and long running stragglers.

Seer breaks groups down in two steps. It first decomposes each group into individual requests. It then divides each request into multiple chunks based on generation length. When the scheduler dispatches a request from the Request Buffer, it sets a small max tokens value such as 8,000 tokens for that chunk. After each chunk, the request is re enqueued until it reaches an end of sequence token or its original max tokens limit.

Because KVCache is stored in the Global KVCache Pool, divided requests can move between instances at chunk boundaries without re running the prefill. The scheduler maintains a concurrency level that keeps memory utilization high while avoiding preemption. This reduces waste and smooths KVCache usage across the iteration.
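
To make the chunk-and-re-enqueue pattern concrete, here is a minimal Python sketch of a divided rollout loop. It assumes a hypothetical engine.generate(prompt, prefix_tokens, max_tokens) call standing in for a vLLM style instance backed by the shared KVCache pool; it illustrates the scheduling idea, not Seer's actual implementation.

from collections import deque

CHUNK_TOKENS = 8_000  # per-chunk max tokens, as described above

def divided_rollout(requests, engine):
    """Run each request in fixed-size chunks, re-enqueueing unfinished ones."""
    buffer = deque(requests)   # each request has .prompt, .max_tokens, .generated
    completed = []
    while buffer:
        req = buffer.popleft()
        budget = min(CHUNK_TOKENS, req.max_tokens - len(req.generated))
        # Hypothetical engine call: resumes from the shared KV cache, so no
        # prefill recomputation is needed even if the serving instance changed.
        new_tokens, hit_eos = engine.generate(req.prompt, req.generated, budget)
        req.generated.extend(new_tokens)
        if hit_eos or len(req.generated) >= req.max_tokens:
            completed.append(req)
        else:
            buffer.append(req)  # chunk boundary: request may migrate instances
    return completed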

Context Aware Scheduling using group length statistics

The research team observes that different requests in the same group tend to have correlated output lengths. Seer uses this structure as online context. For each prompt group, it designates one request as the speculative request. The scheduler keeps speculative requests in a high priority queue and serves them with a smallest first policy based on generated tokens so far. Short requests complete quickly and exit. Long requests remain and identify groups that are potential tail candidates.

The Context Manager maintains a length estimate for each group. It updates this estimate to the maximum generated length among completed requests in the group. If no request has finished, it uses the original max tokens as a conservative bound. Once speculative requests are in flight or done, Seer schedules remaining requests with an approximate longest first policy at group level. This design achieves throughput and tail behavior close to an oracle scheduler that knows all output lengths in advance.
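
The sketch below captures this bookkeeping in plain Python, assuming simple request objects with group_id and generated fields; the data structures are illustrative rather than Seer's internal ones.

class ContextManager:
    """Maintains a per-group output length estimate for scheduling."""

    def __init__(self, default_max_tokens):
        self.default_max_tokens = default_max_tokens
        self.estimates = {}  # group_id -> max generated length among finished requests

    def on_request_finished(self, group_id, generated_len):
        self.estimates[group_id] = max(self.estimates.get(group_id, 0), generated_len)

    def estimate(self, group_id):
        # Conservative bound until at least one request in the group completes.
        return self.estimates.get(group_id, self.default_max_tokens)

def pick_next(speculative_queue, remaining, ctx):
    # Speculative requests first, smallest generated length first; then the
    # remaining requests from the group with the largest length estimate.
    if speculative_queue:
        return min(speculative_queue, key=lambda r: len(r.generated))
    return max(remaining, key=lambda r: ctx.estimate(r.group_id))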


Adaptive Grouped Speculative Decoding

Seer adds Adaptive Grouped Speculative Decoding on top of the previous two components to accelerate decoding, especially for long requests in the tail. It introduces a Distributed Grouped Draft Server, or DGDS. DGDS maintains a Compressed Suffix Tree for each group and aggregates token sequences from all requests in that group. Instances asynchronously append generated tokens to DGDS, periodically fetch updated suffix trees and perform local speculative decoding based on the shared pattern statistics.

The system adjusts draft length and the number of paths according to model architecture, batch size and measured acceptance length. For dense and Mixture of Experts models, it pre-computes different speculation thresholds and uses them to bound draft depth for each batch. In late tail stages, concurrency is low, so Seer increases draft depth and enables multi path drafting to raise accepted tokens per step.
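
As a rough illustration of why grouped drafting helps, the following sketch proposes draft tokens for one request by matching its recent suffix against sibling outputs from the same group. DGDS uses a Compressed Suffix Tree with adaptive draft depth and multi path drafting, so this is a deliberately simplified stand-in.

def propose_draft(current_tokens, sibling_outputs, match_len=8, draft_len=8):
    """Copy up to draft_len tokens from a sibling response in the same group
    whose history contains this request's most recent suffix."""
    suffix = current_tokens[-match_len:]
    if not suffix:
        return []
    for sibling in sibling_outputs:  # token sequences shared via the draft server
        for i in range(len(sibling) - len(suffix)):
            if sibling[i:i + len(suffix)] == suffix:
                start = i + len(suffix)
                return sibling[start:start + draft_len]
    return []

def adaptive_draft_len(active_requests, base=4, tail=16):
    # Longer drafts when concurrency is low, as in the late tail stage.
    return tail if active_requests < 32 else base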

Ablation results show that divided rollout yields up to 35 percent throughput improvement over the baseline. Adding Context Aware Scheduling increases this to up to 47 percent over baseline. Enabling grouped speculative decoding raises the total speedup to 77 percent to 87 percent over the baseline in the evaluated iteration.

End to end impact on RL training

The research team evaluates Seer on three RL tasks built on Moonlight, Qwen2 VL 72B and Kimi K2, running 10 rollout iterations per task and measuring output tokens per second and completion time for each rollout. Seer improves rollout throughput by 74 percent to 97 percent across these workloads relative to veRL with the same RL algorithm and vLLM based inference engine.

Tail latency is reduced by 75 percent to 93 percent. For memory constrained tasks, the baseline system spends up to half of its time on the last 10 percent of requests. Seer removes most of this tail by combining divided rollout, Context Aware Scheduling and Adaptive Grouped Speculative Decoding on top of the Mooncake based Global KVCache Pool.

Key Takeaways

Rollout bottleneck: Seer targets the rollout phase of synchronous RL, which accounts for about 63% to 87% of iteration time and is dominated by long tail requests and KV cache fragmentation.

Three core mechanisms: Seer combines divided rollout, context aware scheduling and adaptive grouped speculative decoding to exploit output length and pattern similarity among GRPO responses that share a prompt.

Fine grained scheduling on a global KV cache: Requests are split into chunks and migrated across a Mooncake style Global KVCache Pool, which preserves synchronous on policy RL while keeping GPU memory utilization high and reducing preemptions.

Online context for tail latency reduction: Group level length statistics from speculative requests drive context aware scheduling that approximates an oracle longest first scheduler and sharply reduces the time spent on the last 10 percent of requests.

Measured end to end gains: On production grade RL workloads with Moonlight, Qwen2 VL 72B and Kimi K2, Seer improves rollout throughput by 74% to 97% and reduces long tail latency by 75% to 93% relative to a state of the art synchronous vLLM based baseline.

Editorial Comments

Seer is an important systems contribution because it optimizes the rollout phase in synchronous RL without changing the underlying GRPO algorithm, so it preserves on policy guarantees and reproducibility while fixing a real infrastructure bottleneck. The combination of divided rollout, context aware scheduling and adaptive grouped speculative decoding offers a practical template for other RL stacks that rely on long chain of thought reasoning models and large KVCache footprints. Overall, Seer shows that online context learning at the systems level is now as critical as model architecture for scaling reasoning RL efficiently.

Check out the Paper here.
The post Moonshot AI Researchers Introduce Seer: An Online Context Learning System for Fast Synchronous Reinforcement Learning RL Rollouts appeared first on MarkTechPost.

Google DeepMind Introduces Nano Banana Pro: the Gemini 3 Pro Image Model for Text Accurate and Studio Grade Visuals

Nano Banana Pro, also called Gemini 3 Pro Image, is Google DeepMind’s new image generation and editing model built on Gemini 3 Pro. It is positioned as a state of the art system for creating and editing images that must respect structure, world knowledge and text layout, not only style. Nano Banana Pro follows Nano Banana, which was based on Gemini 2.5 Flash Image and focused on fast, casual image editing such as restoring photos and generating figurines.

From Gemini 2.5 Flash Image to Gemini 3 Pro Image

The earlier Nano Banana model targeted quick creative edits for casual creators. It helped restore old photos and build stylized 3D mini figurines with a simple prompt. Nano Banana Pro keeps that editing flow but runs on top of Gemini 3 Pro, which brings stronger reasoning and real world knowledge into the image stack.

The model can turn prototypes, data tables and handwritten notes into diagrams and infographics that reflect the underlying information, rather than producing only decorative art.

Reasoning Guided, Search Grounded Visuals

A core design point for Nano Banana Pro is reasoning guided generation. Using Gemini 3 Pro, the model can consume text, structured content and references and then plan the image as an explanation of that content. Nano Banana Pro can also connect to Google Search, using the search index as a real time knowledge source.

Clear Text and Multilingual Layouts

Text inside images is a long standing failure mode for many diffusion based generators. Nano Banana Pro addresses this explicitly. Google states that it is the best model in the Gemini family for producing images with correctly rendered and legible text, for both short taglines and full paragraphs.

Gemini 3 Pro’s multilingual reasoning flows into the image model. Nano Banana Pro can render text in multiple languages and also translate text that already appears in products or posters. The documentation shows beverage cans where English text is translated into Korean while the visual design and layout stay unchanged.

Studio Level Control, Consistency and Upscaling

Nano Banana Pro exposes a set of controls aimed at design and production workflows rather than single shot art prompts. On the composition side, the model can use up to 14 input images and maintain the consistency and resemblance of up to 5 people in one workflow. This supports tasks such as combining reference photos into a single fashion editorial, transforming sketches into product shots or keeping the same cast across multiple scenes.

The studio control section of the model page lists several families of controls. Users can vary camera angle and shot type, including wide shot, panoramic and close up, while controlling depth of field and focus on specific subjects in the image. Color and lighting can be adjusted, for example changing day to night, replacing volumetric lighting with bokeh or applying a strong chiaroscuro effect without losing subject identity.

Nano Banana Pro supports explicit upscaling. The official Google blog states that it can generate crisp visuals at 1k, 2k or 4k resolution, and provides examples of progressive zoom in operations that keep detail and composition. Aspect ratio is also programmable. Prompts can convert between ratios such as 1:1, 4:3, 16:9 and cinematic formats while keeping the main character locked in place and adjusting only the background.

Key Takeaways

Nano Banana Pro is Gemini 3 Pro Image, an upgraded image generation and editing model that succeeds Nano Banana, which was based on Gemini 2.5 Flash Image, and is optimized for higher quality and control.

The model integrates Gemini 3 Pro reasoning and Google Search grounding so it can turn factual content, documents and real time data into infographics, recipes, process diagrams and other information dense visuals.

It provides strong text rendering and multilingual support, producing legible typography in images and enabling translation or localization of existing on image text while preserving layout and design.

Nano Banana Pro supports up to 14 input images and maintains resemblance for up to 5 people, with studio style controls for camera angle, depth of field, lighting, aspect ratios and upscaling to 1k, 2k and 4k resolutions.

The model is being deployed across Gemini app, AI Mode in Search, NotebookLM, Google Ads, Workspace apps, Gemini API, Google AI Studio, Vertex AI, Antigravity and Flow, with all outputs watermarked using SynthID plus tier specific visible watermarks.

Editorial Comments

Nano Banana Pro positions Gemini 3 Pro Image as a production oriented image system that links Gemini 3 Pro reasoning, Google Search grounding and structured controls for layout, text and upscaling. It directly addresses long standing issues in text rendering, multilingual localization and subject consistency, while keeping SynthID and visible watermarks as default provenance signals across tiers and surfaces. This launch moves Google’s image stack closer to an integrated, API first visual platform for developers and enterprises.

Check out the Technical details.
The post Google DeepMind Introduces Nano Banana Pro: the Gemini 3 Pro Image Model for Text Accurate and Studio Grade Visuals appeared first on MarkTechPost.

An Implementation of Fully Traced and Evaluated Local LLM Pipeline Using Opik for Transparent, Measurable, and Reproducible AI Workflows

In this tutorial, we implement a complete workflow for building, tracing, and evaluating an LLM pipeline using Opik. We structure the system step-by-step, beginning with a lightweight model, adding prompt-based planning, creating a dataset, and finally running automated evaluations. As we move through each snippet, we see how Opik helps us track every function span, visualize the pipeline’s behavior, and measure output quality with clear, reproducible metrics. By the end, we have a fully instrumented QA system that we can extend, compare, and monitor with ease. Check out the FULL CODES here.

!pip install -q opik transformers accelerate torch

import torch
from transformers import pipeline
import textwrap

import opik
from opik import Opik, Prompt, track
from opik.evaluation import evaluate
from opik.evaluation.metrics import Equals, LevenshteinRatio

device = 0 if torch.cuda.is_available() else -1
print("Using device:", "cuda" if device == 0 else "cpu")

opik.configure()
PROJECT_NAME = "opik-hf-tutorial"

We set up our environment by installing the required libraries and initializing Opik. We load the core modules, detect the device, and configure our project so that every trace flows into the correct workspace. We lay the foundation for the rest of the tutorial. Check out the FULL CODES here.

llm = pipeline(
    "text-generation",
    model="distilgpt2",
    device=device,
)

def hf_generate(prompt: str, max_new_tokens: int = 80) -> str:
    result = llm(
        prompt,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        temperature=0.3,
        pad_token_id=llm.tokenizer.eos_token_id,
    )[0]["generated_text"]
    return result[len(prompt):].strip()

We load a lightweight Hugging Face model and create a small helper function to generate text cleanly. We prepare the LLM to operate locally without external APIs. This gives us a reliable and reproducible generation layer for the rest of the pipeline. Check out the FULL CODES here.

plan_prompt = Prompt(
    name="hf_plan_prompt",
    prompt=textwrap.dedent("""
        You are an assistant that creates a plan to answer a question
        using ONLY the given context.

        Context:
        {{context}}

        Question:
        {{question}}

        Return exactly 3 bullet points as a plan.
    """).strip(),
)

answer_prompt = Prompt(
    name="hf_answer_prompt",
    prompt=textwrap.dedent("""
        You answer based only on the given context.

        Context:
        {{context}}

        Question:
        {{question}}

        Plan:
        {{plan}}

        Answer the question in 2-4 concise sentences.
    """).strip(),
)

We define two structured prompts using Opik’s Prompt class. We control the planning phase and answering phase through clear templates. This helps us maintain consistency and observe how structured prompting impacts model behavior. Check out the FULL CODES here.

DOCS = {
    "overview": """
        Opik is an open-source platform for debugging, evaluating,
        and monitoring LLM and RAG applications. It provides tracing,
        datasets, experiments, and evaluation metrics.
    """,
    "tracing": """
        Tracing in Opik logs nested spans, LLM calls, token usage,
        feedback scores, and metadata to inspect complex LLM pipelines.
    """,
    "evaluation": """
        Opik evaluations are defined by datasets, evaluation tasks,
        scoring metrics, and experiments that aggregate scores,
        helping detect regressions or issues.
    """,
}

@track(project_name=PROJECT_NAME, type="tool", name="retrieve_context")
def retrieve_context(question: str) -> str:
    q = question.lower()
    if "trace" in q or "span" in q:
        return DOCS["tracing"]
    if "metric" in q or "dataset" in q or "evaluate" in q:
        return DOCS["evaluation"]
    return DOCS["overview"]

We construct a tiny document store and a retrieval function that Opik tracks as a tool. We let the pipeline select context based on the user’s question. This allows us to simulate a minimal RAG-style workflow without needing an actual vector database. Check out the FULL CODES here.

@track(project_name=PROJECT_NAME, type="llm", name="plan_answer")
def plan_answer(context: str, question: str) -> str:
    rendered = plan_prompt.format(context=context, question=question)
    return hf_generate(rendered, max_new_tokens=80)

@track(project_name=PROJECT_NAME, type="llm", name="answer_from_plan")
def answer_from_plan(context: str, question: str, plan: str) -> str:
    rendered = answer_prompt.format(
        context=context,
        question=question,
        plan=plan,
    )
    return hf_generate(rendered, max_new_tokens=120)

@track(project_name=PROJECT_NAME, type="general", name="qa_pipeline")
def qa_pipeline(question: str) -> str:
    context = retrieve_context(question)
    plan = plan_answer(context, question)
    answer = answer_from_plan(context, question, plan)
    return answer

print("Sample answer:\n", qa_pipeline("What does Opik help developers do?"))

We bring together planning, reasoning, and answering in a fully traced LLM pipeline. We capture each step with Opik’s decorators so we can analyze spans in the dashboard. By testing the pipeline, we confirm that all components integrate smoothly. Check out the FULL CODES here.

client = Opik()

dataset = client.get_or_create_dataset(
    name="HF_Opik_QA_Dataset",
    description="Small QA dataset for HF + Opik tutorial",
)

dataset.insert([
    {
        "question": "What kind of platform is Opik?",
        "context": DOCS["overview"],
        "reference": "Opik is an open-source platform for debugging, evaluating and monitoring LLM and RAG applications.",
    },
    {
        "question": "What does tracing in Opik log?",
        "context": DOCS["tracing"],
        "reference": "Tracing logs nested spans, LLM calls, token usage, feedback scores, and metadata.",
    },
    {
        "question": "What are the components of an Opik evaluation?",
        "context": DOCS["evaluation"],
        "reference": "An Opik evaluation uses datasets, evaluation tasks, scoring metrics and experiments that aggregate scores.",
    },
])

We create and populate a dataset inside Opik that our evaluation will use. We insert multiple question–answer pairs that cover different aspects of Opik. This dataset will serve as the ground truth for our QA evaluation later. Check out the FULL CODES here.

equals_metric = Equals()
lev_metric = LevenshteinRatio()

def evaluation_task(item: dict) -> dict:
    output = qa_pipeline(item["question"])
    return {
        "output": output,
        "reference": item["reference"],
    }

We define the evaluation task and select two metrics—Equals and LevenshteinRatio—to measure model quality. We ensure the task produces outputs in the exact format required for scoring. This connects our pipeline to Opik’s evaluation engine. Check out the FULL CODES here.

evaluation_result = evaluate(
    dataset=dataset,
    task=evaluation_task,
    scoring_metrics=[equals_metric, lev_metric],
    experiment_name="HF_Opik_QA_Experiment",
    project_name=PROJECT_NAME,
    task_threads=1,
)

print("\nExperiment URL:", evaluation_result.experiment_url)

We run the evaluation experiment using Opik’s evaluate function. We keep the execution sequential for stability in Colab. Once complete, we receive a link to view the experiment details inside the Opik dashboard. Check out the FULL CODES here.

agg = evaluation_result.aggregate_evaluation_scores()

print("\nAggregated scores:")
for metric_name, stats in agg.aggregated_scores.items():
    print(metric_name, "=>", stats)

We aggregate and print the evaluation scores to understand how well our pipeline performs. We inspect the metric results to see where outputs align with references and where improvements are needed. This closes the loop on our fully instrumented LLM workflow.

In conclusion, we set up a small but fully functional LLM evaluation ecosystem powered entirely by Opik and a local model. We observe how traces, prompts, datasets, and metrics come together to give us transparent visibility into the model’s reasoning process. As we finalize our evaluation and review the aggregated scores, we appreciate how Opik lets us iterate quickly, experiment systematically, and validate improvements in a structured and reliable way.

Check out the FULL CODES here.
The post An Implementation of Fully Traced and Evaluated Local LLM Pipeline Using Opik for Transparent, Measurable, and Reproducible AI Workflows appeared first on MarkTechPost.

Perplexity AI Releases TransferEngine and pplx garden to Run Trillion Parameter LLMs on Existing GPU Clusters

How can teams run trillion parameter language models on existing mixed GPU clusters without costly new hardware or deep vendor lock in? Perplexity’s research team has released TransferEngine and the surrounding pplx garden toolkit as open source infrastructure for large language model systems. This provides a way to run models with up to 1 trillion parameters across mixed GPU clusters, without locking into a single cloud provider or buying new GB200 class hardware.

https://arxiv.org/pdf/2510.27656

The real bottleneck, network fabrics not FLOPs

Modern deployments of Mixture of Experts models such as DeepSeek V3 with 671 billion parameters and Kimi K2 with 1 trillion parameters no longer fit on a single 8 GPU server. They must span multiple nodes, so the main constraint becomes the network fabric between GPUs.

Here the hardware landscape is fragmented. NVIDIA ConnectX 7 typically uses Reliable Connection transport with in order delivery. AWS Elastic Fabric Adapter uses Scalable Reliable Datagram transport that is reliable but out of order, and a single GPU may need 4 network adapters at 100 Gbps, or 2 at 200 Gbps, to reach 400 Gbps.

Existing libraries such as DeepEP, NVSHMEM, Mooncake and NIXL tend to optimize for one vendor and either degrade or lack support entirely on other fabrics. Perplexity’s research team directly states in the research paper that there was no viable cross provider solution for LLM inference before this work.

TransferEngine, a portable RDMA layer for LLM systems

TransferEngine addresses this by targeting only the intersection of guarantees across Network Interface Controllers. It assumes that the underlying RDMA transport is reliable, but does not assume any ordering of messages. On top of this, it exposes one sided WriteImm operations and an ImmCounter primitive for completion notification.

The library provides a minimal API in Rust. It offers two sided Send and Recv for control messages, and three main one sided operations, submit_single_write, submit_paged_writes, and submit_scatter, plus a submit_barrier primitive for synchronization across a group of peers. A NetAddr structure identifies peers and an MrDesc structure describes registered memory regions. An alloc_uvm_watcher call creates a device side watcher for CPU GPU synchronization in advanced pipelines.

Internally, TransferEngine spawns one worker thread per GPU and builds a DomainGroup per GPU that coordinates between 1 and 4 RDMA Network Interface Controllers. A single ConnectX 7 provides 400 Gbps. On EFA, the DomainGroup aggregates 4 network adapters at 100 Gbps, or 2 at 200 Gbps, to reach the same bandwidth. The sharding logic knows about all Network Interface Controllers and can split a transfer across them.

Across hardware, the research team reports peak throughput of 400 Gbps on both NVIDIA ConnectX 7 and AWS EFA. This matches single platform solutions and confirms that the abstraction layer does not leave large performance on the table.


pplx garden, the open source package

TransferEngine ships as part of the pplx garden repository on GitHub under an MIT license. The directory structure is straightforward. fabric-lib contains the RDMA TransferEngine library, p2p-all-to-all implements a Mixture of Experts all to all kernel, python-ext provides the Python extension module from the Rust core, and python/pplx_garden contains the Python package code.

The system requirements reflect a modern GPU cluster. Perplexity research team recommends Linux kernel 5.12 or newer for DMA BUF support, CUDA 12.8 or newer, libfabric, libibverbs, GDRCopy, and an RDMA fabric with GPUDirect RDMA enabled. Each GPU should have at least one dedicated RDMA Network Interface Controller.

Disaggregated prefill and decode

The first production use case is disaggregated inference. Prefill and decode run on separate clusters, so the system must stream KvCache from prefill GPUs to decode GPUs at high speed.

TransferEngine uses alloc_uvm_watcher to track progress in the model. During prefill, the model increments a watcher value after each layer’s attention output projection. When the worker observes a change, it issues paged writes for the KvCache pages of that layer, followed by a single write for the remaining context. This approach allows layer by layer streaming of cache pages without fixed world membership, and it avoids the strict ordering constraints of collectives.
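
TransferEngine's API is exposed in Rust, so the following is only a Python style pseudocode sketch of the watcher driven loop described above. The method names mirror the operations named in the paper, but the signatures and wrapper objects here are invented for illustration.

def stream_kvcache(engine, decode_peer, kv_pages, num_layers, remaining_context):
    """Sketch: stream KvCache to the decode cluster layer by layer during prefill."""
    watcher = engine.alloc_uvm_watcher()  # device side progress counter
    last_seen = 0
    while last_seen < num_layers:
        layer = watcher.read()  # incremented after each layer's attention output projection
        for done_layer in range(last_seen, layer):
            # One sided paged writes for the finished layer's KV pages.
            engine.submit_paged_writes(decode_peer, kv_pages[done_layer])
        last_seen = layer
    # A final single write covers the remaining context.
    engine.submit_single_write(decode_peer, remaining_context)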


Fast weight transfer for reinforcement learning

The second system is asynchronous reinforcement learning fine tuning, where training and inference run on separate GPU pools. Traditional designs gather updated parameters to a single rank then broadcast them, which limits throughput to one Network Interface Controller.

Perplexity research team instead uses TransferEngine to perform point to point weight transfer. Each training GPU writes its parameter shard directly into the corresponding inference GPUs using one sided writes. A pipelined execution splits each tensor into stages, host to device copy when Fully Sharded Data Parallel offloads weights, reconstruction and optional quantization, RDMA transfer, and a barrier implemented through scatter and ImmCounter.

In production, this setup delivers weight updates for models such as Kimi K2 at 1 trillion parameters and DeepSeek V3 at 671 billion parameters in about 1.3 seconds from 256 training GPUs to 128 inference GPUs.


Mixture of Experts routing across ConnectX and EFA

The third piece in pplx garden is a point to point Mixture of Experts dispatch and combine kernel. It uses NVLink for intra node traffic and RDMA for inter node traffic. Dispatch and combine are split into separate send and receive phases so that the decoder can micro batch and overlap communication with grouped general matrix multiply.

A host proxy thread polls GPU state and calls TransferEngine when send buffers are ready. Routes are exchanged first, then each rank computes contiguous receive offsets for each expert and writes tokens into private buffers that can be reused between dispatch and combine. This reduces memory footprint and keeps writes large enough to use the full link bandwidth.

On ConnectX 7, Perplexity research team reports state of the art decode latency that is competitive with DeepEP across expert counts. On AWS EFA, the same kernel delivers the first viable MoE decode latencies with higher but still practical values.

In multi node tests with DeepSeek V3 and Kimi K2 on AWS H200 instances, distributing the model across nodes reduces latency at medium batch sizes, which is the common regime for production serving.

Comparison Table

Primary role: TransferEngine (pplx garden) is a portable RDMA point to point layer for LLM systems. DeepEP handles MoE all to all dispatch and combine. NVSHMEM provides general GPU shared memory and collectives. Mooncake is a distributed KV cache for LLM inference.

Hardware focus: TransferEngine targets NVIDIA ConnectX 7 and AWS EFA with multiple NICs per GPU. DeepEP targets NVIDIA ConnectX with GPU initiated RDMA, IBGDA. NVSHMEM targets NVIDIA GPUs on RDMA fabrics including EFA. Mooncake targets RDMA NICs in KV centric serving stacks.

EFA status: TransferEngine has full support, with peak 400 Gbps reported. DeepEP has no support, since it requires IBGDA on ConnectX. NVSHMEM's API works, but MoE use shows severe degradation on EFA. Mooncake's paper reports no EFA support in its RDMA engine.

Portability for LLM systems: TransferEngine is cross vendor, with a single API across ConnectX 7 and EFA. DeepEP is vendor specific and ConnectX focused. NVSHMEM is NVIDIA centric and not viable for EFA MoE routing. Mooncake is focused on KV sharing, with no cross provider support.

Key Takeaways

TransferEngine gives a single RDMA point to point abstraction that works on both NVIDIA ConnectX 7 and AWS EFA, and manages multiple Network Interface Controllers per GPU transparently.

The library exposes one sided WriteImm with ImmCounter, and achieves peak 400 Gbps throughput on both NIC families, which lets it match single vendor stacks while remaining portable.

Perplexity team uses TransferEngine in three production systems, disaggregated prefill decode with KvCache streaming, reinforcement learning weight transfer that updates trillion parameter models in about 1.3 seconds, and Mixture of Experts dispatch combine for large models like Kimi K2.

On ConnectX 7, pplx garden’s MoE kernels provide state of the art decode latency and exceed DeepEP on the same hardware, while on EFA they deliver the first practical MoE latencies for trillion parameter workloads.

Because TransferEngine is open source in pplx garden under an MIT license, teams can run very large Mixture of Experts and dense models on heterogeneous H100 or H200 clusters across cloud providers, without rewriting for each vendor specific networking stack.

Editorial Comments

Perplexity’s release of TransferEngine and pplx garden is a practical contribution for LLM infra teams who are blocked by vendor specific networking stacks and expensive fabric upgrades. A portable RDMA abstraction that reaches peak 400 Gbps on both NVIDIA ConnectX 7 and AWS EFA, supports KvCache streaming, fast reinforcement learning weight transfer, and Mixture of Experts routing, directly addresses trillion parameter serving constraints for real systems.

Check out the Paper and Repo.
The post Perplexity AI Releases TransferEngine and pplx garden to Run Trillion Parameter LLMs on Existing GPU Clusters appeared first on MarkTechPost.

Streamline AI operations with the Multi-Provider Generative AI Gateway …

As organizations increasingly adopt AI capabilities across their applications, the need for centralized management, security, and cost control of AI model access is a required step in scaling AI solutions. The Generative AI Gateway on AWS guidance addresses these challenges by providing guidance for a unified gateway that supports multiple AI providers while offering comprehensive governance and monitoring capabilities.
The Generative AI Gateway is a reference architecture for enterprises looking to implement end-to-end generative AI solutions featuring multiple models, data-enriched responses, and agent capabilities in a self-hosted way. This guidance combines the broad model access of Amazon Bedrock, unified developer experience of Amazon SageMaker AI, and the robust management capabilities of LiteLLM, all while supporting customer access to models from external model providers in a more secure and reliable manner.
LiteLLM is an open source project that addresses common challenges faced by customers deploying generative AI workloads. LiteLLM simplifies multi-provider model access while standardizing production operational requirements including cost tracking, observability, prompt management, and more. In this post we’ll introduce how the Multi-Provider Generative AI Gateway reference architecture provides guidance for deploying LiteLLM into an AWS environment for production generative AI workload management and governance.
The challenge: Managing multi-provider AI infrastructure
Organizations building with generative AI face several complex challenges as they scale their AI initiatives:

Provider fragmentation: Teams often need access to different AI models from various providers—Amazon Bedrock, Amazon SageMaker AI, OpenAI, Anthropic, and others—each with different APIs, authentication methods, and billing models.
Decentralized governance model: Without a unified access point, organizations struggle to implement consistent security policies, usage monitoring, and cost controls across different AI services.
Operational complexity: Managing multiple access paradigms ranging from AWS Identity and Access Management roles to API keys, model-specific rate limits, and failover strategies across providers creates operational overhead and increases the risk of service disruptions.
Cost management: Understanding and controlling AI spending across multiple providers and teams becomes increasingly difficult, particularly as usage scales.
Security and compliance: Facilitating consistent security policies and audit trails across different AI providers presents significant challenges for enterprise governance.

Multi-Provider Generative AI Gateway reference architecture
This guidance addresses these common customer challenges by providing a centralized gateway that abstracts the complexity of multiple AI providers behind a single, managed interface.

Built on AWS services and using the open source LiteLLM project, organizations can use this solution to integrate with AI providers while maintaining centralized control, security, and observability.

Flexible deployment options on AWS
The Multi-Provider Generative AI Gateway supports multiple deployment patterns to meet diverse organizational needs:
Amazon ECS deployment: For teams preferring containerized applications with managed infrastructure, the ECS deployment provides serverless container orchestration with automatic scaling and integrated load balancing.
Amazon EKS deployment: Organizations with existing Kubernetes expertise can use the EKS deployment option, which provides full control over container orchestration while benefiting from a managed Kubernetes control plane. Customers can deploy a new cluster or leverage existing clusters for deployment.
The reference architecture provided for these deployment options is subject to additional security testing based on your organization’s specific security requirements. Conduct additional security testing and review as necessary before deploying anything into production.
Network architecture options
The Multi-Provider Generative AI Gateway supports multiple network architecture options:
Global public-facing deployment: For AI services with global user bases, combine the gateway with Amazon CloudFront (CloudFront) and Amazon Route 53. This configuration provides:

Enhanced security with AWS Shield DDoS protection
Simplified HTTPS management with the Amazon CloudFront default certificates
Global edge caching for improved latency
Intelligent traffic routing across regions

Regional direct access: For single-Region deployments prioritizing low latency and cost optimization, direct access to the Application Load Balancer (ALB) removes the CloudFront layer while maintaining security through properly configured security groups and network ACLs.
Private internal access: Organizations requiring complete isolation can deploy the gateway within a private VPC without internet exposure. This configuration makes sure that the AI model access remains within your secure network perimeter, with ALB security groups restricting traffic to authorized private subnet CIDRs only.
Comprehensive AI governance and management
The Multi-Provider Generative AI Gateway is built to enable robust AI governance standards from a straightforward administrative interface. In addition to policy-based configuration and access management, users can configure advanced capabilities like load-balancing and prompt caching.
Centralized administration interface
The Generative AI Gateway includes a web-based administrative interface in LiteLLM that supports comprehensive management of LLM usage across your organization.
Key capabilities include:
User and team management: Configure access controls at granular levels, from individual users to entire teams, with role-based permissions that align with your organizational structure.
API key management: Centrally manage and rotate API keys for the connected AI providers while maintaining audit trails of key usage and access patterns.
Budget controls and alerting: Set spending limits across providers, teams, and individual users with automated alerts when thresholds are approached or exceeded.
Comprehensive cost controls: Costs are influenced by AWS infrastructure and LLM providers. While it is the customer’s responsibility to configure this solution to meet their cost requirements, customers may review the existing cost settings for additional guidance.
Supports multiple model providers: Compatible with Boto3, OpenAI, and LangGraph SDK, allowing customers to use the best model for the workload regardless of the provider (see the client sketch after this list).
Support for Amazon Bedrock Guardrails: Customers can leverage guardrails created on Amazon Bedrock Guardrails for their generative AI workloads, regardless of the model provider.
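
Because LiteLLM exposes an OpenAI compatible API, applications can call the gateway with the standard OpenAI Python SDK. The sketch below assumes a hypothetical gateway URL, a LiteLLM virtual key issued through the admin interface, and a model alias configured in the gateway; treat it as an illustration rather than a drop-in configuration.

from openai import OpenAI

client = OpenAI(
    base_url="https://genai-gateway.example.com/v1",  # hypothetical gateway endpoint
    api_key="sk-litellm-virtual-key",                 # virtual key issued in the admin UI
)

response = client.chat.completions.create(
    model="bedrock-claude",  # model alias defined in the gateway, not a provider SDK ID
    messages=[{"role": "user", "content": "Summarize our generative AI usage policy."}],
)
print(response.choices[0].message.content)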
Intelligent routing and resilience
Common considerations around model deployment include model and prompt resiliency. These factors determine how failures are handled when responding to a prompt or accessing data stores.
Load balancing and failover: The gateway implements sophisticated routing logic that distributes requests across multiple model deployments and automatically fails over to backup providers when issues are detected.
Retry logic: Built-in retry mechanisms with exponential back-off facilitate reliable service delivery even when individual providers experience transient issues.
Prompt caching: Intelligent caching helps reduce costs by avoiding duplicate requests to expensive AI models while maintaining response accuracy.
Advanced policy management
Model deployment architecture can range from the simple to highly complex. The Multi-Provider Generative AI Gateway features the advanced policy management tools needed to maintain a strong governance posture.
Rate limiting: Configure sophisticated rate limiting policies that can vary by user, API key, model type, or time of day to facilitate fair resource allocation and help prevent abuse.
Model access controls: Restrict access to specific AI models based on user roles, making sure that sensitive or expensive models are only accessible to authorized personnel.
Custom routing rules: Implement business logic that routes requests to specific providers based on criteria such as request type, user location, or cost optimization requirements.
Monitoring and observability
As AI workloads grow to include more components, so too do observability needs. The Multi-Provider Generative AI Gateway architecture integrates with Amazon CloudWatch. This integration enables users to configure a wide range of monitoring and observability solutions, including open-source tools such as Langfuse.
Comprehensive logging and analytics
The gateway interactions are automatically logged to CloudWatch, providing detailed insights into:

Request patterns and usage trends across providers and teams
Performance metrics including latency, error rates, and throughput
Cost allocation and spending patterns by user, team, and model type
Security events and access patterns for compliance reporting

Built-in troubleshooting
The administrative interface provides real-time log viewing capabilities so administrators can quickly diagnose and resolve usage issues without needing to access CloudWatch directly.

Amazon SageMaker integration for expanded model access
Amazon SageMaker helps enhance the Multi-Provider Generative AI Gateway guidance by providing a comprehensive machine learning system that seamlessly integrates with the gateway’s architecture. By using the Amazon SageMaker managed infrastructure for model training, deployment, and hosting, organizations can develop custom foundation models or fine-tune existing ones that can be accessed through the gateway alongside models from other providers. This integration removes the need for separate infrastructure management while facilitating consistent governance across both custom and third-party models. SageMaker AI model hosting capabilities expand the gateway’s model access to include self-hosted models, as well as those available on Amazon Bedrock, OpenAI, and other providers.
Our open source contributions
This reference architecture builds upon our contributions to the LiteLLM open source project, enhancing its capabilities for enterprise deployment on AWS. Our enhancements include improved error handling, enhanced security features, and optimized performance for cloud-native deployments.
Getting started
The Multi-Provider Generative AI Gateway reference architecture is available today through our GitHub repository, complete with:

Infrastructure-as-Code: Amazon CloudFormation and AWS Cloud Development Kit (CDK) templates for automated deployment into an Amazon ECS cluster
Comprehensive documentation: Step-by-step deployment guides and configuration examples
Interactive workshop: Hands-on learning experience to explore the gateway capabilities
Detailed deployment guide: Deployment blog on AWS Builder Center

The code repository describes several flexible deployment options to get started.
Public gateway with global CloudFront distribution
Use CloudFront to provide a globally distributed, low-latency access point for your generative AI services. The CloudFront edge locations deliver content quickly to users around the world, while AWS Shield Standard helps protect against DDoS attacks. This is the recommended configuration for public-facing AI services with a global user base.
Custom domain with CloudFront
For a more branded experience, you can configure the gateway to use your own custom domain name, while still benefiting from the performance and security features of CloudFront. This option is ideal if you want to maintain consistency with your company’s online presence.
Direct access via public Application Load Balancer
Customers who prioritize low-latency over global distribution can opt for a direct-to-ALB deployment, without the CloudFront layer. This simplified architecture can offer cost savings, though it requires extra consideration for web application firewall protection.
Private VPC-only access
For a high level of security, you can deploy the gateway entirely within a private VPC, isolated from the public internet. This configuration is well-suited for processing sensitive data or deploying internal-facing generative AI services. Access is restricted to trusted networks like VPN, Direct Connect, VPC peering, or AWS Transit Gateway.
Learn more and deploy today
Ready to simplify your multi-provider AI infrastructure? Access the complete solution package to explore an interactive learning experience with step-by-step guidance describing each step of the deployment and management process.
Conclusion
The Multi-Provider Generative AI Gateway is a solution guidance intended to help customers get started working on generative AI solutions in a well-architected manner, while taking advantage of the AWS environment of services and complementary open-source packages. Customers can work with models from Amazon Bedrock, Amazon SageMaker JumpStart, or third-party model providers. Operations and management of workloads is conducted via the LiteLLM management interface, and customers can choose to host on ECS or EKS based on their preference.
In addition, we have published a sample that integrates the gateway into an agentic customer service application. The agentic system is orchestrated using LangGraph and deployed on Amazon Bedrock AgentCore. LLM calls are routed through the gateway, providing the flexibility to test agents with different models–whether hosted on AWS or another provider.
This guidance is just one part of a mature generative AI foundation on AWS. For deeper reading on the components of a generative AI system on AWS, see Architect a mature generative AI foundation on AWS, which describes additional components of a generative AI system.

About the authors
Dan Ferguson is a Sr. Solutions Architect at AWS, based in New York, USA. As a machine learning services expert, Dan works to support customers on their journey to integrating ML workflows efficiently, effectively, and sustainably.
Bobby Lindsey is a Machine Learning Specialist at Amazon Web Services. He’s been in technology for over a decade, spanning various technologies and multiple roles. He is currently focused on combining his background in software engineering, DevOps, and machine learning to help customers deliver machine learning workflows at scale. In his spare time, he enjoys reading, research, hiking, biking, and trail running.
Nick McCarthy is a Generative AI Specialist at AWS. He has worked with AWS clients across various industries including healthcare, finance, sports, telecoms and energy to accelerate their business outcomes through the use of AI/ML. Outside of work he loves to spend time traveling, trying new cuisines and reading about science and technology. Nick has a Bachelors degree in Astrophysics and a Masters degree in Machine Learning.
Chaitra Mathur is a GenAI Specialist Solutions Architect at AWS. She works with customers across industries in building scalable generative AI platforms and operationalizing them. Throughout her career, she has shared her expertise at numerous conferences and has authored several blogs in the Machine Learning and Generative AI domains.
Sreedevi Velagala is a Solution Architect within the World-Wide Specialist Organization Technology Solutions team at Amazon Web Services, based in New Jersey. She has been focused on delivering tailored solutions and guidance aligned with the unique needs of diverse clientele across AI/ML, Compute, Storage, Networking and Analytics domains. She has been instrumental in helping customers learn how AWS can lower the compute costs for machine learning workloads using Graviton, Inferentia and Trainium. She leverages her deep technical knowledge and industry expertise to deliver tailored solutions that align with each client’s unique business needs and requirements.

Deploy geospatial agents with Foursquare Spatial H3 Hub and Amazon SageMaker AI

Organizations have used geospatial machine learning (ML) for property risk assessment, disaster response, and infrastructure planning. These systems worked well but couldn’t scale beyond specialized use cases. Each question required multiple geospatial datasets, each with its own model and often its own workflow, limiting these capabilities to a handful of high-value use cases at the largest enterprises that could afford the investment. In this post, you’ll learn how to deploy geospatial AI agents that can answer complex spatial questions in minutes instead of months. By combining Foursquare Spatial H3 Hub’s analysis-ready geospatial data with reasoning models deployed on Amazon SageMaker AI, you can build agents that enable nontechnical domain experts to perform sophisticated spatial analysis through natural language queries—without requiring geographic information system (GIS) expertise or custom data engineering pipelines.
Geospatial intelligence adoption barriers
Two technical barriers have prevented these specialized geospatial systems from achieving broader adoption. First, geospatial data arrives in a bewildering array of formats—satellite imagery stored as GeoTIFF rasters, administrative boundaries stored as shapefile vectors, weather models stored as NetCDF grids, and property records in proprietary cadastral formats—each requiring different parsing libraries and custom data pipelines. Second, joining datasets across spatial granularities is nontrivial: property insurance data geocoded to individual addresses must combine with climate risk data at 1 km grid cells and census demographics aggregated to block groups, requiring organizations to spend months building custom processing pipelines before answering their first business question. In short, there is no universal join key to combine these datasets. This means organizations can’t experiment with geospatial intelligence without first building data engineering pipelines to normalize diverse formats, implement spatial processing for coordinate transformations and resolution resampling, and deploy specialized computing infrastructure.
Solving technical barriers alone wasn’t sufficient. Earlier systems still required 6–12 month implementations with specialized GIS teams. Five enterprise requirements remained unaddressed: making geospatial analysis accessible to nontechnical domain experts, showing how AI reaches conclusions, supporting flexible analysis, delivering interactive response times, and offering cost predictability at scale.
Three technologies converging to address adoption challenges
Addressing these technical and enterprise barriers requires a fundamentally different approach. This architecture combines three technologies to address those gaps:

Foursquare Spatial H3 Hub for analysis-ready data – This service transforms inaccessible raster and vector geospatial data into analysis-ready features, indexed to the H3 hierarchical grid system, in tabular format that data scientists can query using familiar tools such as Spark, Python, and DuckDB. Datasets containing latitude and longitude coordinates, city names, or zip codes can be easily enriched by joining on a common H3 cell, eliminating months of data preparation and specialized GIS expertise.
Reasoning models and agentic AI for adaptive workflows – Models such as DeepSeek-R1 and Llama 3 break down complex problems, reason through multistep workflows, and orchestrate actions across data sources. They dynamically determine which datasets to combine and plan analytical sequences that previously required GIS expertise—transforming static, preconfigured workflows into adaptive reasoning systems.
Amazon SageMaker AI for cost-effective generative AI inference – This Amazon SageMaker AI capability provides managed infrastructure for deploying open source models with optimized inference runtimes, auto scaling, and operational tooling. Teams can focus on building geospatial intelligence capabilities rather than managing underlying infrastructure.

Together, these technologies enable organizations to access analysis-ready geospatial data, deploy adaptive reasoning agents, and run production inference without building specialized infrastructure. In this post, we demonstrate a production geospatial agent that combines Foursquare Spatial H3 Hub with reasoning models deployed on Amazon SageMaker AI.
Analysis-ready geospatial data with Foursquare Spatial H3 Hub
Foursquare’s Spatial H3 Hub eliminates traditional geospatial adoption barriers through a proprietary H3 indexing engine. This engine has transformed dozens of disparate geospatial datasets into an Iceberg catalog ready for immediate analysis, replacing months of data engineering with instant access to analysis-ready geospatial features.
The H3 indexing engine addresses the root cause of geospatial complexity: the vast array of formats and coordinate systems that have historically limited access to geographic information. The engine converts spatial data, raster imagery, or vector datasets by indexing it into the H3 hierarchical spatial grid at global scale. H3 divides the entire Earth into nested hexagonal cells, creating a universal grid system where every location has a standardized identifier. The engine extracts data from raster images or diverse vector shapes such as census tract polygons and converts them into features attached to H3 cell IDs in tabular format, where the cell ID becomes a universal join key that abstracts away format complexity and coordinate systems. An insurance company’s property data, National Oceanic and Atmospheric Administration (NOAA) climate projections, census demographics, and infrastructure networks can all be combined because they share this common spatial index.

The engine also handles the methodological complexities that traditionally required GIS expertise. It can index data to H3 cells at any precision from resolution 0 (about 1,000 km hexagons covering continents) down to resolution 15 (about 1 meter hexagons covering individual buildings). You can choose the appropriate resolution for each use case—coarser resolutions for regional climate analysis, finer resolutions for property-level assessment. When boundaries don’t align perfectly—like a census tract overlapping multiple H3 hexagons—the engine intelligently handles partial overlaps through either fast centroid-based approximation or exact proportional allocation based on intersection areas. It also automatically aggregates or disaggregates data when combining datasets at different scales, eliminating the manual preprocessing that traditionally consumed months of GIS specialist time.
Built on this indexing foundation, Foursquare Spatial H3 Hub delivers an Iceberg catalog containing datasets spanning energy infrastructure, environmental conditions, and natural hazards all originally in diverse raster and vector formats, now pre-indexed to H3 cells at resolution 8 (with additional resolutions available on demand). You can query this data with familiar tools such as SQL, Python, Spark, Snowflake, and Databricks without proprietary GIS software. H3 cell identifiers become straightforward column values that join like any other attribute, so you can rapidly validate geospatial hypotheses by joining their proprietary data with Foursquare’s H3 catalog.
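
To make the universal join key idea concrete, the short sketch below indexes a point dataset to H3 resolution 8 with the open source h3-py library (v4 API) and joins it to an H3 keyed hazard table. The column names, coordinates, and flood values are hypothetical.

import h3  # h3-py v4 API
import pandas as pd

# Hypothetical property records with latitude and longitude.
properties = pd.DataFrame({
    "property_id": [101, 102],
    "lat": [29.7604, 29.9511],
    "lon": [-95.3698, -90.0715],
})

# Index each point to an H3 cell at resolution 8.
properties["h3_cell"] = [
    h3.latlng_to_cell(lat, lon, 8) for lat, lon in zip(properties.lat, properties.lon)
]

# Hypothetical analysis-ready hazard features already keyed by H3 cell.
flood_risk = pd.DataFrame({
    "h3_cell": properties["h3_cell"],
    "flood_depth_100yr_m": [0.4, 1.2],
})

# The H3 cell ID is an ordinary column, so enrichment is a plain tabular join.
enriched = properties.merge(flood_risk, on="h3_cell", how="left")
print(enriched)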
Reasoning models for spatial intelligence
Reasoning models such as DeepSeek-R1 change how AI handles geospatial intelligence. Traditional geospatial systems operated as collections of static, purpose-built models, with separate models for flood risk, wildfire exposure, and earthquake vulnerability. Each model was trained on specific datasets and incapable of answering questions outside its narrow domain. When requirements shifted or new data emerged, organizations faced months of retraining. Reasoning models change this paradigm by decomposing complex problems, planning multistep workflows, and orchestrating actions across data sources dynamically. Rather than requiring pre-trained models for every question, these systems reason through novel scenarios by combining available data in ways never explicitly programmed. Asked “which neighborhoods face compounding climate and economic risks?”, a reasoning agent determines it needs flood exposure data, household income, property density, and neighborhood boundaries and then executes that analytical pipeline by calling appropriate tools and data sources. The agent understands spatial relationships conceptually: point data aggregates to polygons, grid cells map to administrative boundaries, proximity requires appropriate distance metrics. At each step, it reasons about what information comes next and adjusts when data reveals unexpected patterns, transforming geospatial analysis from pre-scripted queries into adaptive investigation.
Deploying agents on Amazon SageMaker AI
Analysis-ready geospatial data and reasoning-capable models solve critical parts of the puzzle, but production deployment creates new challenges. Geospatial agents need sustained inference capacity to process queries, execute reasoning chains, retrieve data, and generate visualizations. Organizations face a choice: build custom inference infrastructure with GPU clusters, load balancers, and auto scaling policies, or rely on commercial large language model (LLM) APIs where costs scale unpredictably with usage and data governance becomes complex.
Amazon SageMaker AI provides managed infrastructure for deploying and operating open source generative AI models in production. You can deploy models from Hugging Face or Amazon SageMaker AI JumpStart—including reasoning models such as DeepSeek-R1, Llama 3, or Qwen—to SageMaker AI real-time or asynchronous inference endpoints without managing underlying infrastructure. Amazon SageMaker AI Inference handles instance provisioning, supports optimized serving runtimes like vLLM and SGLang, and provides auto scaling based on traffic patterns.
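As a hedged sketch of that deployment path, the following code uses the SageMaker Python SDK to deploy a JumpStart-hosted model to a real-time endpoint. The model_id shown is a placeholder; look up the exact identifier for the reasoning model you want in SageMaker JumpStart:

from sagemaker.jumpstart.model import JumpStartModel

# The model_id below is a hypothetical placeholder; substitute the exact
# JumpStart ID for the model you intend to deploy.
model = JumpStartModel(model_id="deepseek-llm-r1-distill-qwen-7b")
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",  # development-sized GPU instance
)

response = predictor.predict({
    "inputs": "Which neighborhoods in Los Angeles face both high flood risk and economic vulnerability?",
    "parameters": {"max_new_tokens": 512},
})
print(response)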
Amazon SageMaker AI Inference capabilities address several operational challenges specific to agent architectures. Geospatial agents handling variable query loads throughout the day benefit from automatic scaling on GPU instances such as G5, P4d, and P5 based on request volume or custom metrics. Long-running spatial analyses that exceed typical API timeouts can route to asynchronous inference endpoints, where SageMaker AI queues requests, processes them, and delivers results to Amazon Simple Storage Service (Amazon S3), enabling complex multi-dataset analyses without client-side timeout issues. For architectures employing multiple models, multi-container endpoints host different models on shared infrastructure with independent scaling policies and traffic routing. Built-in integration with Amazon CloudWatch for monitoring, AWS Identity and Access Management (IAM) for access control, and Amazon Virtual Private Cloud (Amazon VPC) for network isolation simplifies operational requirements.
Foursquare Spatial H3 Hub and Amazon SageMaker AI together reduce operational complexity. Data scientists can focus on building agent capabilities, defining which H3 Hub datasets to query for specific questions, refining prompting strategies for spatial reasoning, and optimizing tool-calling patterns rather than managing underlying infrastructure. Organizations can also experiment with different open source models. Such initiatives, which previously required separate teams for data engineering, model development, and platform operations, have now become accessible to smaller teams without specialized infrastructure expertise.
Designing the Foursquare Spatial Agent
The Foursquare Spatial Agent architecture combines reasoning models deployed on SageMaker AI with tool-calling capabilities that query Foursquare Spatial H3 Hub directly. The agent orchestrates the complete workflow from natural language question to visualization without manual intervention.
Agent workflow
When a user poses a natural language question about spatial relationships—such as “Which neighborhoods in Los Angeles face both high flood risk and economic vulnerability?”—the agent executes a multistep reasoning process. The reasoning model first analyzes the question and identifies required information: flood risk scores, economic indicators like income and employment, and neighborhood boundaries. It then determines which H3 Hub datasets contain relevant information by reasoning over dataset descriptions. With datasets selected, the model calls H3 Hub query tools, constructing SQL queries that join datasets on H3 cell IDs. After executing these queries, the model analyzes results to identify spatial patterns and statistical relationships. Finally, it generates Vega specifications for charts and Kepler.gl specifications for maps that visualize the findings.
This workflow uses the reasoning model’s ability to plan, adapt, and recover from errors. If initial queries return unexpected results, the model can refine its approach, select additional datasets, or adjust spatial operations—capabilities that static, preprogrammed workflows lack.
Design decisions addressing enterprise requirements
Building a production geospatial agent required addressing the five enterprise requirements identified through deployment analysis. Three key design decisions illustrate how the architecture balances accessibility, transparency, and flexibility.
Insurance underwriters understand flood risk and property exposure but don’t write SQL or Python. The agent architecture makes geospatial analysis accessible by accepting natural language questions and translating them into appropriate H3 Hub queries. The reasoning model interprets domain-specific terminology like “vulnerable neighborhoods” or “high-risk areas” and maps these concepts to relevant datasets and analytical operations. This eliminates the bottleneck where domain experts must submit analysis requests to data teams, enabling self-service exploration.
Domain experts also need to understand how the agent arrived at conclusions, especially when analyses inform business decisions. The agent can log its reasoning process at each step: which datasets were considered and why, what spatial operations were planned, which queries were executed, and how results were interpreted. Every visualization includes metadata showing which H3 cells and source datasets contributed to the analysis. This transparency means users can validate the agent’s analytical approach and understand the data sources behind conclusions. If an insurance underwriter sees a high-risk assessment for a property, they can trace back through the reasoning chain to see it combined flood exposure data from Federal Emergency Management Agency (FEMA), wildfire risk from state forestry data, and property characteristics from local assessor records—building confidence in AI-generated insights. Implementation uses structured logging to capture reasoning steps, making the agent’s decision-making process inspectable and debuggable rather than a black box.
Pre-built dashboards serve known questions but fail when analysts need to explore variations. The agent architecture provides flexibility by using tool-calling to dynamically compose analyses. Rather than predefining workflows for every scenario, the reasoning model determines which H3 Hub datasets to query and how to combine them based on the specific question. This enables the agent to handle unforeseen analytical questions without requiring new engineering work for each variation. The agent uses function calling APIs supported by models such as Llama 3 and DeepSeek-R1 to interact with H3 Hub. The model receives tool descriptions specifying available datasets, query parameters, and return formats, then constructs appropriate tool calls during reasoning. SageMaker AI endpoints handle the inference, while custom application logic manages tool execution and result assembly.
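To make the pattern concrete, the following minimal sketch shows what a tool definition and dispatch function for the H3 Hub query tool might look like. The tool name, parameter schema, and stubbed query helper are assumptions for illustration, not Foursquare’s actual API:

import json

# Hypothetical tool description handed to the reasoning model's function calling API.
H3_HUB_TOOLS = [{
    "name": "query_h3_hub",
    "description": "Run a SQL query that joins H3 Hub datasets on h3_cell_id.",
    "parameters": {
        "type": "object",
        "properties": {
            "sql": {"type": "string", "description": "SQL over the H3 Hub Iceberg catalog"}
        },
        "required": ["sql"],
    },
}]

def run_h3_hub_sql(sql: str) -> list[dict]:
    # Placeholder: a real deployment would run this against the Iceberg catalog
    # (for example through Athena or Spark); here we return a canned row.
    return [{"h3_cell_id": "8829a1d4c9fffff", "flood_risk_score": 0.91}]

def dispatch(tool_call: dict) -> str:
    """Execute the tool the model selected and return its result as JSON."""
    if tool_call["name"] == "query_h3_hub":
        return json.dumps(run_h3_hub_sql(tool_call["arguments"]["sql"]))
    raise ValueError(f"Unknown tool: {tool_call['name']}")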
SageMaker AI deployment architecture
The Foursquare Spatial Agent deploys on SageMaker AI real-time inference endpoints with configuration optimized for production geospatial workloads. The deployment uses G5 instances such as g5.2xlarge for development and g5.12xlarge for production, providing cost-effective GPU inference for models in the 7B–70B parameter range commonly used for agent reasoning. A target tracking scaling policy based on the InvocationsPerInstance metric maintains response times during variable load while minimizing costs during low-traffic periods. Spatial analyses involving large geographic extents or many dataset joins route to asynchronous inference endpoints, allowing queries that can take 60 seconds or more to complete without exceeding typical API timeout limits while maintaining responsive behavior for more straightforward queries.
CloudWatch metrics track inference latency, error rates, and token throughput across the deployment. Custom metrics log reasoning chain depth, number of tool calls per query, and dataset access patterns, enabling continuous optimization of agent performance. This deployment architecture provides production-grade reliability while maintaining flexibility for experimentation with different models and prompting strategies.
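The scaling behavior described above can be expressed with a few Application Auto Scaling calls. The following boto3 sketch registers the endpoint variant and attaches a target tracking policy on the InvocationsPerInstance metric; the endpoint name, capacities, and target value are assumptions for illustration:

import boto3

aas = boto3.client("application-autoscaling")
resource_id = "endpoint/spatial-agent-endpoint/variant/AllTraffic"  # hypothetical endpoint name

aas.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

aas.put_scaling_policy(
    PolicyName="invocations-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 10.0,  # target invocations per instance (assumed value)
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 60,
    },
)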
Foursquare Spatial Agent in action
The following demonstrations show how organizations across insurance, banking, and urban planning can use this capability to answer complex spatial questions in minutes—collapsing timelines that previously stretched across quarters into interactive workflows accessible to domain experts without specialized technical skills. In insurance risk assessment, the agent predicts which areas in the Los Angeles region are likely to see increased insurance rates by computing a composite risk score from flood risk, fire hazard severity, crime rate, and FEMA National Risk Index datasets, sources originally published at different spatial resolutions and in different formats but now queryable through common H3 cell IDs. An underwriter asks the question in natural language, and the agent handles dataset selection, spatial joins, risk aggregation, and map visualization without requiring GIS expertise.

For banking market analysis, the agent provides a 360-degree view of Los Angeles’s bank network planning. It combines demographic data including population, income, and age distribution with healthcare facility locations, crime statistics, and points of interest to identify under-served markets and expansion opportunities. This analysis informs data-driven decisions for branch placement, product targeting, and financial inclusion initiatives. Previously, assembling these datasets and performing spatial analysis required weeks of GIS specialist time. Now, the agent delivers results in minutes through conversational interaction.

For urban infrastructure planning, the agent helps the city of Chandler, Arizona, plan sustainable urban development over the next decade. It combines population growth projections, housing development patterns, median income trends, and infrastructure data including buildings, power lines, and cell towers—all indexed to H3 cells. Urban planners explore scenarios by asking questions like “which areas will experience population growth but lack adequate infrastructure?” The agent reasons through the analytical requirements, executes appropriate spatial queries, and generates visualizations showing infrastructure gaps that need investment.

The democratization of geospatial intelligence
Foursquare Spatial H3 Hub, reasoning models, and Amazon SageMaker AI together remove these barriers. Organizations can now access standardized geospatial data, deploy reasoning agents with tool-calling capabilities, and run production inference without building specialized infrastructure.
To deploy geospatial AI agents:

Access Foursquare Spatial H3 Hub for analysis-ready datasets.
Deploy reasoning models on Amazon SageMaker AI with SageMaker JumpStart or Hugging Face.
Build agent capabilities that connect models to H3 Hub datasets through tool-calling.

About the authors
Vikram Gundeti currently serves as the Chief Technology Officer (CTO) of Foursquare, where he leads the technical strategy, decision making, and research for the company’s Geospatial Platform. Before joining Foursquare, Vikram held the position of Principal Engineer at Amazon, where he made his mark as a founding engineer on the Amazon Alexa team.
Amit Modi is a Senior Manager of Product Management at Amazon SageMaker AI, where he focuses on ModelOps and Inference. His analysis of enterprise adoption patterns and design of the SageMaker deployment approach described in this post emerged from work with enterprise customers.
Aditya Badhwar is a Senior Solutions Architect at AWS based out of New York. He works with customers providing technical assistance and architectural guidance on various AWS services. Prior to AWS, Aditya worked for over 16 years in software engineering and architecture roles for various large-scale enterprises.

How Wipro PARI accelerates PLC code generation using Amazon Bedrock

This post is co-written with Rejin Surendran from Wipro Enterprises Limited and Bakrudeen K from ShellKode.
In manufacturing environments, industrial automation engineers face a significant challenge: how to rapidly convert complex process requirements into Programmable Logic Controller (PLC) ladder text code. This traditional, manual process typically requires 3-4 days per query, creating bottlenecks in production workflows. The complexity stems from multiple factors: engineers must meticulously translate high-level requirements into precise machine instructions while managing multiple states and transitions, facilitate compliance with the international PLC programming standard IEC 61131-3, handle complex variable declarations, maintain detailed documentation for industrial compliance, and conduct thorough testing of safety protocols and execution paths.
Wipro PARI is one of the largest global automation companies with over 1,300 employees and three facilities worldwide, with its headquarters in Pune, India. Wipro PARI has the vision to utilize its expertise and resources to bring the best solutions in automation and robotics to its customers.
In this post, we share how Wipro implemented advanced prompt engineering techniques, custom validation logic, and automated code rectification to streamline the development of industrial automation code at scale using Amazon Bedrock. We walk through the architecture along with the key use cases, explain core components and workflows, and share real-world results that show the transformative impact on manufacturing operations.
Why Wipro PARI chose Amazon Bedrock?
Wipro PARI partnered with AWS and ShellKode to develop an innovative solution that transforms this time-intensive PLC code generation process using AI. Using Amazon Bedrock and Anthropic’s Claude models, we have developed a system that:

Reduces PLC code generation time from 3–4 days to approximately 10 minutes per requirement
Improves code accuracy up to 85%
Automates validation against industry standards
Handles complex state management and transition logic automatically
Facilitates proper variable declarations and naming conventions
Maintains compliance documentation and audit trails
Provides a user-friendly interface for industrial engineers

Wipro PARI selected Amazon Bedrock as the foundation for this PLC code generation solution due to its unique combination of enterprise capabilities that align with industrial automation requirements. With the broad model choice available in Amazon Bedrock, the team can use Anthropic’s Claude 3.5 Sonnet for complex code generation while maintaining flexibility to switch models as newer, more capable versions become available without infrastructure changes. The fully managed service reduces the operational overhead of hosting and scaling machine learning (ML) infrastructure, helping Wipro PARI’s engineers focus on domain-specific automation logic rather than model deployment.
Critically for industrial applications, Amazon Bedrock makes sure that the customer data—including proprietary control logic and manufacturing specifications—remains within the AWS environment and is not used to train underlying foundation models (FMs), thereby maintaining strict data privacy and intellectual property protection. This security posture, combined with the AWS compliance certifications, provides the enterprise-grade governance required for manufacturing environments handling sensitive operational data.
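As a minimal sketch of this interaction, the following boto3 snippet calls Anthropic’s Claude 3.5 Sonnet through the Amazon Bedrock Converse API with a toy control logic requirement. The production system wraps far richer prompts, validation, and retries around this call, and you should confirm the model ID available in your Region:

import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

response = bedrock.converse(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # confirm availability in your Region
    messages=[{
        "role": "user",
        "content": [{"text": "Convert this control logic into a structured pseudo query: "
                             "When temperature > 80C AND pressure < 5 bar, turn on cooling pump."}],
    }],
    inferenceConfig={"maxTokens": 1024, "temperature": 0.2},
)
print(response["output"]["message"]["content"][0]["text"])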
Solution overview
In this section, we present the solution architecture and user workflow of the Wipro PLC Code Generator. The following diagram illustrates the end-to-end architecture.

Architecture components
The architecture consists of the following key components:

Frontend client layer – The frontend client layer consists of a React-based, responsive web application that makes it possible for industrial engineers to upload control logic spreadsheets, configure generation settings, and verify generated ladder code with full traceability.
Backend application services layer – The Wipro PARI solution implements a React and FastAPI microservices architecture with over 30 specialized APIs deployed on load-balanced Amazon Elastic Compute Cloud (Amazon EC2) instances within a secure virtual private cloud (VPC) environment for industrial automation PLC code generation, with plans to migrate to Amazon Elastic Container Service (Amazon ECS) in future iterations. The VPC configuration includes public and private subnet isolation with bastion server access control for secure remote management of the industrial control system development service. The backend application services layer is organized into distinct components, including controllers for request handling, core services for business logic, authentication modules for user management, file processing engines for spreadsheet handling, and spreadsheet parsers for extracting control logic specifications from industrial automation documentation.
AI/ML processing layer – The solution includes a dedicated AI/ML processing layer that integrates with Amazon Bedrock and uses multiple Anthropic Claude models depending on task complexity and requirements. The large language model (LLM) integration services transform control logic requirements into intermediate structured pseudo queries, which are then converted into standardized PLC ladder text code through multi-iteration processing. The system handles complex industrial automation scenarios, including parallel execution paths, fork/defork logic, and Boolean expressions commonly found in manufacturing control systems.
Data and storage layer – The generated PLC code undergoes intelligent rectification to fix syntax and logical errors specific to ladder logic programming, followed by systematic validation against predefined industrial guidelines to facilitate code quality and safety compliance. Amazon Simple Storage Service (Amazon S3) buckets store generated code artifacts, templates, and version history for industrial project management. The system uses Amazon Relational Database Service (Amazon RDS) for PostgreSQL databases for persistent state management, project tracking, and maintaining relationships between control logic specifications and generated code.

User workflow
The code generation workflow consists of the following steps:

User input and authentication – An industrial engineer logs in to the React web application, authenticates through role-based access controls, and uploads Excel spreadsheets.
Data processing and transformation – The system processes the uploaded spreadsheets containing control logic specifications for PLC programming requirements through Excel parsers. It extracts the control logic data, validates input specifications against industrial standards, and transforms raw data into structured format suitable for AI processing.
AI-powered code generation – LLM integration services send structured requirements to Amazon Bedrock using Anthropic’s Claude 3.5 Sonnet, which generates intermediate pseudo queries, converts them into standardized PLC ladder text code, and handles complex industrial automation scenarios including parallel execution paths and Boolean expressions. A pseudo query is an intermediate structured representation that translates human-readable control logic requirements from Excel spreadsheets into a standardized format that can be processed by AI models to generate PLC code.

Example specification – When temperature > 80°C AND pressure < 5 bar, turn on cooling pump
Pseudo query – IF (TEMP_SENSOR > 80) AND (PRESSURE_SENSOR < 5) THEN SET COOLING_PUMP = TRUE

Validation and storage – The generated PLC code undergoes automated quality validation against IEC 61131-3 standards, intelligent rectification fixes syntax and logical errors, and validated code artifacts are stored in Amazon S3 with version control and traceability.
Engineer review – The industrial engineer reviews the generated ladder code through the web interface, verifies code quality and safety compliance, downloads validated PLC code for deployment, and maintains project history with a full audit trail for industrial compliance requirements.

The following GIF illustrates the complete user workflow from Excel upload to PLC code generation and download.

Security and compliance
User authentication and authorization are managed through Amazon Cognito, which validates user credentials and enforces role-based access controls to make sure only authorized personnel can access PLC code generation capabilities and sensitive industrial automation data. Security is implemented through AWS Identity and Access Management (IAM) based access controls managing engineer permissions and service-to-service authentication for industrial data protection. Amazon GuardDuty provides continuous threat detection, and AWS CloudTrail maintains comprehensive audit logging of the code generation activities for industrial compliance requirements.
In the following sections, we break down each functionality in detail. The modules used in the solution are integrated through a streamlined workflow to maximize automation and accuracy.
Data formatter
The solution begins with processing the pseudo query inputs, as shown in the following diagram. This crucial first step transforms various input formats into a standardized structure that can be effectively processed by the language model.

The workflow follows these steps:

Users upload the control logic available in a spreadsheet as inputs through the UI interface.
From the uploaded spreadsheet, the formatter intelligently extracts state definitions, transition numbers, associated actions, and forking/de-forking path relationships. This extracted information is useful in the downstream process to validate the PLC code.
The extracted information is stored in S3 buckets for persistence and future reference.
The data formatter constructs a comprehensive prompt containing the original spreadsheet data and specific processing instructions.
This prompt is sent to Anthropic’s Claude 3.5 Sonnet to convert the control logic into a structured pseudo query format. Lengthy descriptions are abbreviated to 20 characters to conform to PLC variable naming conventions.
The data formatter then passes control to the PLC code generator module.

The following code is a sample intermediate pseudo query (the output from the data formatter module). The pseudo query implements a safety monitoring system for industrial machinery that makes sure the machine only operates when the safety conditions are met. It monitors safety doors and emergency buttons, and includes proper reset procedures after a safety violation. Each state network contains the state numbers, the transition variables, and the actions to be performed for each transition.

State Number: 25
Description: Machine Safety Check
State Name: MchSafetyCheck
Action:
Transitions:
 – Condition: IF iSafetyDoorClosed & iEmergencyButtonReleased
   – Goto State Number: 28
 – Condition: IF !iSafetyDoorClosed | iEmergencyButtonPressed
   – Goto State Number: 26

State Number: 26
Description: Machine Safety Violation
State Name: MchSafetyViolation
Action:
  – SET oAlarmLight = TRUE
  – SET oMachineStop = TRUE
Transitions:
 – Condition: IF iAcknowledgeButton & iSafetyDoorClosed & iEmergencyButtonReleased
   – Goto State Number: 27

PLC code generator
To maximize the accuracy of ladder text generation, the solution employs sophisticated prompt engineering techniques and uses Anthropic’s Claude 3.5 Sonnet for code generation. The workflow steps for this part of the solution are shown in the following diagram.

Prompt creation
The prompt creation process consists of the following steps:

The intermediate pseudo query from the data formatter is passed to the PLC code generator module, which initiates the prompt creation process.
The prompt builder builds a detailed task prompt to generate the initial batch of PLC code and the subsequent batches as well. It includes:

PLC programming domain knowledge (state/transition variable naming conventions, network creation patterns for forking/de-forking, condition network structures).
Few-shot examples demonstrating pseudo query to ladder text conversion.
Explicit instructions for handling state transitions, variable declarations, and complex Boolean expressions.

The prompt builder also creates a continuation prompt that instructs the FM to continue generating the PLC code from where it has left off in the previous iteration.

Few-shot sampling
We used a few-shot learning strategy to generate domain-specific outputs by providing relevant examples in the prompt context. Pseudo queries and related metadata including structural characteristics (state transitions, actions, control flow patterns) were indexed in a vector store. At inference, a hybrid retrieval strategy combines semantic similarity and lexical matching with the metadata to fetch the most relevant structurally aligned examples and their corresponding PLC code, which are then dynamically injected into the prompt. See the following code:

PLC_PROMPT = """You are expert in writing code in PLC text ladder code …
##DYNAMIC EXAMPLES
{retrieved_examples}
##DOMAIN VARIABLES
{business_specific_variables}
##USER INPUT
{user_pseudo_code}
##FUNCTIONAL GUIDELINES
{custom_instructions}
"""
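The following is a simplified sketch of that hybrid retrieval step, blending a cosine similarity score over stored embeddings with a lexical overlap score to select the top examples. The 0.7/0.3 weighting and the toy scoring functions are assumptions, not the production configuration:

from dataclasses import dataclass

@dataclass
class Example:
    pseudo_query: str
    plc_code: str
    embedding: list[float]

def lexical_score(query: str, example: Example) -> float:
    # Jaccard overlap on tokens as a simple lexical match signal.
    q, e = set(query.lower().split()), set(example.pseudo_query.lower().split())
    return len(q & e) / max(len(q | e), 1)

def semantic_score(query_emb: list[float], example: Example) -> float:
    # Cosine similarity between the query embedding and the stored example embedding.
    num = sum(a * b for a, b in zip(query_emb, example.embedding))
    den = (sum(a * a for a in query_emb) ** 0.5) * (sum(b * b for b in example.embedding) ** 0.5)
    return num / max(den, 1e-9)

def retrieve(query: str, query_emb: list[float], index: list[Example], k: int = 3) -> list[Example]:
    scored = sorted(
        index,
        key=lambda ex: 0.7 * semantic_score(query_emb, ex) + 0.3 * lexical_score(query, ex),
        reverse=True,
    )
    return scored[:k]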

PLC code generation
The PLC code generation process consists of the following steps (as numbered in the preceding diagram):

The task prompt is passed to Anthropic’s Claude 3.5 Sonnet, which processes the prompt to generate the initial ladder text code containing up to 4,096 tokens (the maximum output tokens limit for the FM).
Because ladder text typically exceeds this limit, our solution implements an iterative generation approach with specialized continuation prompting (a simplified sketch of this loop follows these steps). The system checks if generation is complete and requests additional continuation prompts as needed.
This continuation method maintains context between sequential generations, facilitating consistency throughout the entire code base.
The process continues iteratively until the PLC ladder code is fully generated. The completed code segments are then consolidated and passed to the code rectifier module for further processing.
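A simplified version of this continuation loop might look like the following. The completion marker, truncated context window, and prompts are assumptions; the production prompts are considerably more detailed:

import boto3

bedrock = boto3.client("bedrock-runtime")
MODEL_ID = "anthropic.claude-3-5-sonnet-20240620-v1:0"  # confirm availability in your Region

def generate(prompt: str) -> str:
    resp = bedrock.converse(
        modelId=MODEL_ID,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": 4096},
    )
    return resp["output"]["message"]["content"][0]["text"]

def generate_full_ladder(task_prompt: str, max_iterations: int = 10) -> str:
    segments = [generate(task_prompt)]
    for _ in range(max_iterations):
        if "END_FUNCTION_BLOCK" in segments[-1]:  # assumed completion marker
            break
        continuation_prompt = (
            "Continue the PLC ladder code exactly from where the previous output "
            "stopped, without repeating any lines:\n\n" + segments[-1][-2000:]
        )
        segments.append(generate(continuation_prompt))
    return "".join(segments)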

The following code block shows a sample PLC code generated:

FUNCTION_BLOCK "Machine_Safety_Monitoring"
{ S7_Optimized_Access := 'FALSE' }
VERSION : 0.1
   VAR_INPUT
      iSafetyDoorClosed : Bool;
      iEmergencyButtonReleased : Bool;
      iEmergencyButtonPressed : Bool;
      iAutoRunning : Bool;
      iReset_fault : Bool;
   END_VAR

   VAR
      s25_MchSafetyCheck : Bool;
      s25_MchSafetyCheck_T1 : Bool;
      s25_MchSafetyCheck_T2 : Bool;
      SEQ01_ResetComplete : Bool;
      sStWtResetRel_T1 : Bool;
   END_VAR

NETWORK
TITLE = Transition for STATE Num:25 Machine Safety Check
      A #s25_MchSafetyCheck;
      AN #sStWtResetRel;
      A #sSst;
      A #iSafetyDoorClosed;
      A #iEmergencyButtonReleased;
      = #s25_MchSafetyCheck_T1;
      A #s25_MchSafetyCheck;
      AN #sStWtResetRel;
      A #sSst;
      AN #iSafetyDoorClosed;
      O #iEmergencyButtonPressed;
      = #s25_MchSafetyCheck_T2;
NETWORK
TITLE = STATE Num:25 Machine Safety Check
      A(;
      O #s25_MchSafetyCheck;
      O #sStWtResetRel_T1;
      );
      AN #sStWtResetRel;
      AN #s25_MchSafetyCheck_T1;
      AN #s25_MchSafetyCheck_T2;
      = %L1.0;
      A %L1.0;
      BLD 102;
      = #s25_MchSafetyCheck;
      A %L1.0;
      JNB Label_25;
      L 25;
      T #StateNo;
Label_25:      NOP 0;

Code rectifier
Because PLC ladder logic is inherently complex, LLMs might miss critical functionalities during initial code generation. The solution incorporates a sophisticated rectification system to address these gaps and facilitate high-quality output. The rectification uses a hybrid approach of custom logic containing business guidelines and an FM to perform the rectification task. The following diagram illustrates the workflow.

The rectifier module performs the following steps to help enhance code accuracy:

PLC code generated by the generator module is transferred to the rectifier module for enhancement.
The module facilitates proper handling of parallel execution paths, where sequences split into multiple branches and later re-converge, maintaining proper logic flow throughout the PLC program. This is done by invoking Anthropic’s Claude 3.7 Sonnet, which provides enhanced reasoning capabilities required for complex parallel execution path corrections, with a specialized prompt and the generated PLC code. Node/network mapping scripts are used to track state transitions and sequence tracking.
The module uses data extracted by the formatter (including transition variables’ source and destination states stored in Amazon S3) through the following phases:

Identification phase – Uses specialized Python algorithms to analyze the PLC code structure and cross-references transition variables against their declared source and destination states, flagging incorrect connections (a simplified sketch of this phase follows the list).
Remediation phase – Employs targeted Python routines to systematically remove incorrect connections while preserving the overall logic structure integrity.
Reconstruction phase – Implements custom Python logic to establish proper connections between states following correct sequential execution patterns.

The generated code might contain syntax errors, undeclared variables, or non-compliant naming. Using Anthropic’s Claude 3.5 Sonnet and custom logic, this process involves:

Identifying missing variables that are used within the code but not declared.
Adding missing variables to the declaration section.
Standardizing variable names to make sure the variables follow the Siemens S7-1517 PLC naming conventions.

The rectified PLC code and associated metadata are stored in Amazon S3.
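The following is a simplified sketch of the identification phase, keyed to the naming pattern in the sample above (state networks reference incoming transition variables with O #<var> and load the state number with L <n>). The transition map structure and the regex are illustrative assumptions:

import re

def find_misrouted_transitions(ladder_text: str, expected_destination: dict[str, int]) -> list[str]:
    """Flag transition variables whose destination state in the ladder code
    does not match the destination recorded by the data formatter."""
    issues = []
    for var, expected_state in expected_destination.items():
        # Find the state network that consumes the transition variable and read
        # the state number it loads into #StateNo.
        match = re.search(rf"O\s+#{re.escape(var)};.*?L\s+(\d+);", ladder_text, flags=re.DOTALL)
        if match and int(match.group(1)) != expected_state:
            issues.append(f"{var}: routes to state {match.group(1)}, expected {expected_state}")
    return issues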

Code evaluator
After rectification, the code undergoes a comprehensive validation process:

The validator module analyzes the rectified ladder text against the critical guidelines:

Unique state flags – Verifies that each state has a unique identifier with no duplicates.
Unique transition flags – Confirms the transition identifiers are unique throughout the code.
Proper connection verification – Validates that each transition connects to the correct destination state.
Input transition completeness – Makes sure every state has at least one input transition condition to trigger state changes.
Mutually exclusive conditions – Checks that transition variables within the same state are mutually exclusive to help prevent logic conflicts.

For each validation check, the system generates a detailed pass/fail result with specific information about the issues detected.
A comprehensive validation report is compiled, highlighting remaining issues that might require manual attention from engineers, with clear indicators of their location and nature in the code.
This multi-layered rectification and validation approach significantly helps improve the quality of the generated ladder text, reducing the need for manual intervention and accelerating the overall code development process.
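As an example, a simplified version of the unique state flags check could look like the following. It assumes the sNN_<StateName> declaration pattern shown in the sample output above:

import re
from collections import Counter

def check_unique_state_flags(ladder_text: str) -> dict:
    # Match state flag declarations such as "s25_MchSafetyCheck : Bool;" while
    # skipping transition flags such as "s25_MchSafetyCheck_T1 : Bool;".
    state_numbers = re.findall(r"\bs(\d+)_[A-Za-z0-9]+\s*:\s*Bool;", ladder_text)
    duplicates = [num for num, count in Counter(state_numbers).items() if count > 1]
    return {
        "check": "unique_state_flags",
        "passed": not duplicates,
        "details": f"Duplicate state numbers: {duplicates}" if duplicates else "All state flags are unique",
    }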

UI and user interaction
The solution provides an intuitive UI that helps engineers interact with the system efficiently. The workflow for this part of the solution follows these steps:

Users access the web-based interface to upload control logic spreadsheets or structured text inputs.
The interface provides options to select different models and adjust parameters to optimize generation.
Advanced users can edit the prompts directly to customize the generation process.
The system displays the generated ladder text, pseudo query, and validation report, allowing engineers to quickly assess the output quality.

The entire process from upload to validated code typically completes in 3–7 minutes, depending on the complexity of the input query. The following GIF demonstrates the settings interface where users can configure model parameters including temperature, Top-P, Top-K values, select different models, and customize prompt settings for various projects.

Results and business impact
The solution improves upon Wipro PARI’s previous approach, demonstrating consistent performance across various test cases:

Average validation completion percentage across test cases was 85%
Processing time reduced from 3–4 days to approximately 10 minutes per query
Cost per query generation was approximately $0.40–$0.60
Perfect (100%) validation scores achieved on less complex queries such as “Conveyor controls”
Even complex queries with multiple state transitions achieved validation scores of 70–90%

This automation approach has transformed Wipro PARI’s PLC programming workflow, delivering measurable business impact including 5,000 work-hours saved across projects while minimizing manual coding errors. The solution helped their 200 engineers focus on high-value tasks like code design and application development while accelerating the code generation process. It also helped Wipro PARI win over key automotive clients and create a competitive advantage for complex automation projects. They plan to expand to other major PLC systems, including Rockwell Automation, Schneider Electric, and ABB in the future, helping Wipro PARI to scale their automotive industry expertise.
Conclusion
In this post, we explored how AWS collaborated with Wipro PARI to develop an AI-powered PLC Code Generator that transforms the time-intensive process of creating ladder text code from a given control logic. By using Amazon Bedrock with multiple Anthropic Claude models and a custom validation framework, the solution achieves an average accuracy of 85% while reducing code generation time from 3–4 days to approximately 10 minutes per query.
The Wipro PLC Code Generator represents a milestone in industrial automation programming, directly addressing the productivity challenges faced by Wipro PARI’s engineering consultants. The solution’s approach—combining prompt engineering, iterative code generation, automated rectification, and systematic validation—creates a robust framework that can be applied across various PLC programming scenarios.
Building on the current implementation, Wipro PARI is planning to expand the solution’s capabilities using additional Amazon Bedrock features. The team will implement Amazon Bedrock Guardrails to help enforce content filtering policies that help prevent generation of unsafe control logic and facilitate compliance with IEC 61131-3 standards at the model output level. The roadmap includes building multi-agent workflows using AWS Strands Agents, an open source SDK designed for autonomous AI agents, where specialized agents will handle distinct tasks: one agent for requirements analysis, another for code generation, and a third for automated documentation generation. To scale these agents in production, Wipro PARI will use Amazon Bedrock AgentCore, which provides serverless infrastructure for deploying and scaling agents with enterprise-grade security, session isolation, and built-in identity management. Amazon Bedrock AgentCore Memory will enable the system to maintain context across engineering sessions, allowing agents to remember previous interactions and build upon prior work, and an Amazon Bedrock AgentCore gateway will help securely connect agents to existing PLC validation tools and internal automation systems. Wipro PARI intends to build agents for automated testing, security scanning and automated document generation. In addition, Wipro PARI plans to expand this solution by incorporating additional validation rules, helping enhance the UI, and adding support for complex sequence types and integration with SIEMENS software for direct code deployment.
As industrial automation continues to evolve with increasing complexity, AI-assisted programming tools like the Wipro PLC Code Generator help accelerate development cycles and improve code quality. By reducing the manual burden of code generation and validation, engineers can focus on higher-value tasks such as system optimization and innovation, ultimately contributing to more efficient and reliable manufacturing operations across industries.
To learn more about the resources used in this solution, refer to the following additional resources:

Amazon Bedrock Documentation
Getting started with Amazon Bedrock
Claude by Anthropic in Amazon Bedrock
AWS Industrial Automation Solutions
AWS Blog: Generative AI for Industrial Applications

About the authors
Aparajithan Vaidyanathan is a Principal Enterprise Solutions Architect at AWS. He helps enterprise customers migrate and modernize their workloads on the AWS Cloud. He is a Cloud Architect with 25+ years of experience designing and developing enterprise, large-scale, and distributed software systems. He specializes in generative AI and machine learning, with a focus on moving enterprise GenAI/ML applications to production at scale.
Charu Dixit is a Solutions Architect at Amazon Web Services (AWS), helping GSI customers with cloud transformation strategies and solution design, focusing on containers, networking, and generative AI. With over 8 years of experience at AWS, she specializes in Amazon EKS and ELB, guiding customers through building and modernizing containerized applications at scale. Outside of work, Charu enjoys traveling, drawing and painting, and spending quality time with her family.
Debasish Mishra is a Senior Data Scientist at the AWS Generative AI Innovation Center, where he helps customers leverage AWS AI/ML services to solve complex business challenges through generative AI solutions. With experience spanning fintech, healthcare, sports, automotive, retail, manufacturing, he brings cross-industry expertise to diverse use cases. His specializations include code generation, AI agent frameworks, fine-tuning vision language models and robot foundation models, RAG systems, and multimodal applications. Debasish is passionate about enabling organizations to implement practical, impactful AI solutions.
Divakaran Ullampuzha Mana is the Head of Solution Architecture for Global Service Integrators (GSI) & IT/ITeS at AWS India. He leads solution architects who advise enterprise customers on cloud transformation strategies, with expertise in cloud computing, AI/ML, Generative AI, and digital transformation. Prior to AWS, he held executive leadership positions at Kyndryl and IBM, where he established and scaled cloud migration practices. He is an active thought leader, regularly speaking at industry events and mentoring technologists.
Rejin Surendran is the Global CIO at Wipro Enterprises Limited, where he leads digital transformation initiatives across the enterprise. With over 25 years of experience in technology leadership, he has driven large-scale transformation projects across commercial, supply chain, people, and finance functions. He holds a Master of Management from IIT Bombay and a B.Tech in Electrical & Electronics Engineering from NIT Warangal.
Bakrudeen K is an AWS Ambassador and leads the AI/ML practice at ShellKode, driving innovation in Generative and Agentic AI. He builds advanced AI solutions and Agentic Assistants that enable enterprises to scale intelligent systems responsibly. In 2025, he became the first-ever recipient of the AWS Ambassador Golden Jacket for Agentic AI, a global first within the AWS Ambassador Program.

Allen Institute for AI (AI2) Introduces Olmo 3: An Open Source 7B and …

Allen Institute for AI (AI2) is releasing Olmo 3 as a fully open model family that exposes the entire ‘model flow’, from raw data and code to intermediate checkpoints and deployment ready variants.

Olmo 3 is a dense transformer suite with 7B and 32B parameter models. The family includes Olmo 3-Base, Olmo 3-Think, Olmo 3-Instruct, and Olmo 3-RL Zero. Both 7B and 32B variants share a context length of 65,536 tokens and use the same staged training recipe.

https://allenai.org/blog/olmo3

Dolma 3 Data Suite

At the core of the training pipeline is Dolma 3, a new data collection designed for Olmo 3. Dolma 3 consists of Dolma 3 Mix, Dolma 3 Dolmino Mix, and Dolma 3 Longmino Mix. Dolma 3 Mix is a 5.9T token pre training dataset with web text, scientific PDFs, code repositories, and other natural data. The Dolmino and Longmino subsets are constructed from filtered, higher quality slices of this pool.

Dolma 3 Mix supports the main pre training stage for Olmo 3-Base. AI2 research team then applies Dolma 3 Dolmino Mix, a 100B token mid training set that emphasizes math, code, instruction following, reading comprehension, and thinking oriented tasks. Finally, Dolma 3 Longmino Mix adds 50B tokens for the 7B model and 100B tokens for the 32B model, with a strong focus on long documents and scientific PDFs processed with the olmOCR pipeline. This staged curriculum is what pushes the context limit to 65,536 tokens while maintaining stability and quality.

Large Scale Training on H100 Clusters

Olmo 3-Base 7B trains on Dolma 3 Mix using 1,024 H100 devices, reaching about 7,700 tokens per device per second. Later stages use 128 H100s for Dolmino mid training and 256 H100s for Longmino long context extension.

Base Model Performance Against Open Families

On standard capability benchmarks, Olmo 3-Base 32B is positioned as a leading fully open base model. AI2 research team reports that it is competitive with prominent open weight families such as Qwen 2.5 and Gemma 3 at similar sizes. Compared across a wide suite of tasks, Olmo 3-Base 32B ranks near or above these models while keeping the full data and training configuration open for inspection and reuse.

Reasoning Focused Olmo 3 Think

Olmo 3-Think 7B and Olmo 3-Think 32B sit on top of the base models as reasoning focused variants. They use a three stage post training recipe that includes supervised fine tuning, Direct Preference Optimization, and Reinforcement Learning with Verifiable Rewards within the OlmoRL framework. Olmo 3-Think 32B is described as the strongest fully open reasoning model and it narrows the gap to Qwen 3 32B thinking models while using about six times fewer training tokens.

https://allenai.org/blog/olmo3

Olmo 3 Instruct for Chat and Tool Use

Olmo 3-Instruct 7B is tuned for fast instruction following, multi turn chat, and tool use. It starts from Olmo 3-Base 7B and applies a separate Dolci Instruct data and training pipeline that covers supervised fine tuning, DPO, and RLVR for conversational and function calling workloads. AI2 research team reports that Olmo 3-Instruct matches or outperforms open weight competitors such as Qwen 2.5, Gemma 3, and Llama 3.1 and is competitive with Qwen 3 families at similar scales for several instruction and reasoning benchmarks.
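For reference, a hedged sketch of loading an Olmo 3 Instruct checkpoint with Hugging Face Transformers is shown below. The repository ID is a guess at the naming convention, so check the official AI2 collection for the exact identifier:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "allenai/Olmo-3-7B-Instruct"  # hypothetical repo ID, confirm on the AI2 Hugging Face page
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "Summarize the Dolma 3 data curriculum in two sentences."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(inputs, max_new_tokens=128)[0], skip_special_tokens=True))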

RL Zero for Clean RL Research

Olmo 3-RL Zero 7B is designed for researchers who care about reinforcement learning on language models but need clean separation between pre training data and RL data. It is built as a fully open RL pathway on top of Olmo 3-Base and uses Dolci RL Zero datasets that are decontaminated with respect to Dolma 3.

Comparison Table

Model variant | Training or post training data | Primary use case | Reported position vs other open models
Olmo 3 Base 7B | Dolma 3 Mix pre training, Dolma 3 Dolmino Mix mid training, Dolma 3 Longmino Mix long context | General foundation model, long context reasoning, code, math | Strong fully open 7B base, designed as foundation for Think, Instruct, RL Zero, evaluated against leading open 7B scale bases
Olmo 3 Base 32B | Same Dolma 3 staged pipeline as 7B, with 100B Longmino tokens for long context | High end base for research, long context workloads, RL setups | Described as the best fully open 32B base, comparable to Qwen 2.5 32B and Gemma 3 27B and outperforming Marin, Apertus, LLM360
Olmo 3 Think 7B | Olmo 3 Base 7B, plus Dolci Think SFT, Dolci Think DPO, Dolci Think RL in OlmoRL framework | Reasoning focused 7B model with internal thinking traces | Fully open reasoning model at efficient scale that enables chain of thought research and RL experiments on modest hardware
Olmo 3 Think 32B | Olmo 3 Base 32B, plus the same Dolci Think SFT, DPO, RL pipeline | Flagship reasoning model with long thinking traces | Stated as the strongest fully open thinking model, competitive with Qwen 3 32B thinking models while training on about 6x fewer tokens
Olmo 3 Instruct 7B | Olmo 3 Base 7B, plus Dolci Instruct SFT, Dolci Instruct DPO, Dolci Instruct RL 7B | Instruction following, chat, function calling, tool use | Reported to outperform Qwen 2.5, Gemma 3, Llama 3 and to narrow the gap to Qwen 3 families at similar scale
Olmo 3 RL Zero 7B | Olmo 3 Base 7B, plus Dolci RLZero Math, Code, IF, Mix datasets, decontaminated from Dolma 3 | RLVR research on math, code, instruction following, mixed tasks | Introduced as a fully open RL pathway for benchmarking RLVR on top of a base model with fully open pre training data

Key Takeaways

End to end transparent pipeline: Olmo 3 exposes the full ‘model flow’ from Dolma 3 data construction, through staged pre training and post training, to released checkpoints, evaluation suites, and tooling, enabling fully reproducible LLM research and fine grained debugging.

Dense 7B and 32B models with 65K context: The family covers 7B and 32B dense transformers, all with a 65,536 token context window, trained via a three stage Dolma 3 curriculum: Dolma 3 Mix for main pre training, Dolma 3 Dolmino for mid training, and Dolma 3 Longmino for long context extension.

Strong open base and reasoning models: Olmo 3 Base 32B is positioned as a top fully open base model at its scale, competitive with Qwen 2.5 and Gemma 3, while Olmo 3 Think 32B is described as the strongest fully open thinking model and approaches Qwen 3 32B thinking models using about 6 times fewer training tokens.

Task tuned Instruct and RL Zero variants: Olmo 3 Instruct 7B targets instruction following, multi turn chat, and tool use using Dolci Instruct SFT, DPO, and RLVR data, and is reported to match or outperform Qwen 2.5, Gemma 3, and Llama 3.1 at similar scale. Olmo 3 RL Zero 7B provides a fully open RLVR pathway with Dolci RLZero datasets decontaminated from pre training data for math, code, instruction following, and general chat.

Editorial Comments

Olmo 3 is an unusual release because it operationalizes openness across the full stack, Dolma 3 data recipes, staged pre training, Dolci post training, RLVR in OlmoRL, and evaluation with OLMES and OlmoBaseEval. This reduces ambiguity around data quality, long context training, and reasoning oriented RL, and it creates a concrete baseline for extending Olmo 3 Base, Olmo 3 Think, Olmo 3 Instruct, and Olmo 3 RL Zero in controlled experiments. Overall, Olmo 3 sets a rigorous reference point for transparent, research grade LLM pipelines.


How to Build a Fully Offline Multi-Tool Reasoning Agent with Dynamic P …

In this tutorial, we explore how to build a fully offline, multi-step reasoning agent that uses the Instructor library to generate structured outputs and reliably orchestrate complex tool calls. In this implementation, we design an agent capable of choosing the right tool, validating inputs, planning multi-stage workflows, and recovering from errors. We bring together Instructor, Transformers, and carefully crafted Pydantic schemas to create an intelligent, adaptive system that mirrors real-world agentic AI behavior. Check out the FULL CODES here.

import subprocess
import sys

def install_dependencies():
    import torch
    packages = [
        "instructor",
        "transformers>=4.35.0",
        "torch",
        "accelerate",
        "pydantic>=2.0.0",
        "numpy",
        "pandas"
    ]
    if torch.cuda.is_available():
        packages.append("bitsandbytes")
        print(" GPU detected - installing quantization support")
    else:
        print(" No GPU detected - will use CPU (slower but works)")
    for package in packages:
        subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", package])

try:
    import instructor
except ImportError:
    print(" Installing dependencies...")
    install_dependencies()
    print(" Installation complete!")

from typing import Literal, Optional, List, Union, Dict, Any
from pydantic import BaseModel, Field, validator
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import instructor
import json
from datetime import datetime
import re

We set up our environment by installing all required dependencies and importing the core libraries. As we lay the foundation for the system, we ensure that everything, from Instructor to Transformers, is ready for offline execution. This lets us start with a clean and reliable base for building the agent. Check out the FULL CODES here.

class SQLQuery(BaseModel):
    """Complex SQL generation with validation"""
    table: str
    columns: List[str]
    where_conditions: Optional[Dict[str, Any]] = None
    joins: Optional[List[Dict[str, str]]] = None
    aggregations: Optional[Dict[str, str]] = None
    order_by: Optional[List[str]] = None

    @validator('columns')
    def validate_columns(cls, v):
        if not v:
            raise ValueError("Must specify at least one column")
        return v

class DataTransformation(BaseModel):
    """Schema for complex data pipeline operations"""
    operation: Literal["filter", "aggregate", "join", "pivot", "normalize"]
    source_data: str = Field(description="Reference to data source")
    parameters: Dict[str, Any]
    output_format: Literal["json", "csv", "dataframe"]

class APIRequest(BaseModel):
    """Multi-endpoint API orchestration"""
    endpoints: List[Dict[str, str]] = Field(description="List of endpoints to call")
    authentication: Dict[str, str]
    request_order: Literal["sequential", "parallel", "conditional"]
    error_handling: Literal["stop", "continue", "retry"]
    max_retries: int = Field(default=3, ge=0, le=10)

class CodeGeneration(BaseModel):
    """Generate and validate code snippets"""
    language: Literal["python", "javascript", "sql", "bash"]
    purpose: str
    code: str = Field(description="The generated code")
    dependencies: List[str] = Field(default_factory=list)
    test_cases: List[Dict[str, Any]] = Field(default_factory=list)

    @validator('code')
    def validate_code_safety(cls, v, values):
        dangerous = ['eval(', 'exec(', '__import__', 'os.system']
        if values.get('language') == 'python':
            if any(d in v for d in dangerous):
                raise ValueError("Code contains potentially dangerous operations")
        return v

class MultiToolPlan(BaseModel):
    """Plan for multi-step tool execution"""
    goal: str
    steps: List[Dict[str, Any]] = Field(description="Ordered list of tool calls")
    dependencies: Dict[str, List[str]] = Field(description="Step dependencies")
    fallback_strategy: Optional[str] = None
    estimated_duration: float = Field(description="Seconds")

class ToolCall(BaseModel):
    """Enhanced tool selection with context"""
    reasoning: str
    confidence: float = Field(ge=0.0, le=1.0)
    tool_name: Literal["sql_engine", "data_transformer", "api_orchestrator",
                       "code_generator", "planner", "none"]
    tool_input: Optional[Union[SQLQuery, DataTransformation, APIRequest,
                               CodeGeneration, MultiToolPlan]] = None
    requires_human_approval: bool = False

class ExecutionResult(BaseModel):
    """Rich result with metadata"""
    success: bool
    data: Any
    execution_time: float
    warnings: List[str] = Field(default_factory=list)
    metadata: Dict[str, Any] = Field(default_factory=dict)

We define all the advanced Pydantic schemas that structure how our agent understands SQL queries, data pipelines, API calls, code generation, and multi-step plans. As we build these models, we give our agent strong validation, safety, and clarity in interpreting complex instructions. This becomes the backbone of our agent’s reasoning process. Check out the FULL CODES here.

Copy CodeCopiedUse a different Browserdef sql_engine_tool(params: SQLQuery) -> ExecutionResult:
import time
start = time.time()
mock_tables = {
“users”: [
{“id”: 1, “name”: “Alice”, “age”: 30, “country”: “USA”},
{“id”: 2, “name”: “Bob”, “age”: 25, “country”: “UK”},
{“id”: 3, “name”: “Charlie”, “age”: 35, “country”: “USA”},
],
“orders”: [
{“id”: 1, “user_id”: 1, “amount”: 100, “status”: “completed”},
{“id”: 2, “user_id”: 1, “amount”: 200, “status”: “pending”},
{“id”: 3, “user_id”: 2, “amount”: 150, “status”: “completed”},
]
}
data = mock_tables.get(params.table, [])
if params.where_conditions:
data = [row for row in data if all(
row.get(k) == v for k, v in params.where_conditions.items()
)]
data = [{col: row.get(col) for col in params.columns} for row in data]
warnings = []
if params.aggregations:
warnings.append(“Aggregation simplified in mock mode”)
return ExecutionResult(
success=True,
data=data,
execution_time=time.time() – start,
warnings=warnings,
metadata={“rows_affected”: len(data), “query_type”: “SELECT”}
)

def data_transformer_tool(params: DataTransformation) -> ExecutionResult:
import time
start = time.time()
operations = {
“filter”: lambda d, p: [x for x in d if x.get(p[‘field’]) == p[‘value’]],
“aggregate”: lambda d, p: {“count”: len(d), “operation”: p.get(‘function’, ‘count’)},
“normalize”: lambda d, p: [{k: v/p.get(‘factor’, 1) for k, v in x.items()} for x in d]
}
mock_data = [{“value”: i, “category”: “A” if i % 2 else “B”} for i in range(10)]
op_func = operations.get(params.operation)
if op_func:
result_data = op_func(mock_data, params.parameters)
else:
result_data = mock_data
return ExecutionResult(
success=True,
data=result_data,
execution_time=time.time() – start,
warnings=[],
metadata={“operation”: params.operation, “input_rows”: len(mock_data)}
)

def api_orchestrator_tool(params: APIRequest) -> ExecutionResult:
import time
start = time.time()
results = []
warnings = []
for i, endpoint in enumerate(params.endpoints):
if params.error_handling == “retry” and i == 1:
warnings.append(f”Endpoint {endpoint.get(‘url’)} failed, retrying…”)
results.append({
“endpoint”: endpoint.get(‘url’),
“status”: 200,
“data”: f”Mock response from {endpoint.get(‘url’)}”
})
return ExecutionResult(
success=True,
data=results,
execution_time=time.time() – start,
warnings=warnings,
metadata={“endpoints_called”: len(params.endpoints), “order”: params.request_order}
)

def code_generator_tool(params: CodeGeneration) -> ExecutionResult:
import time
start = time.time()
warnings = []
if len(params.code) > 1000:
warnings.append(“Generated code is quite long, consider refactoring”)
if not params.test_cases:
warnings.append(“No test cases provided for generated code”)
return ExecutionResult(
success=True,
data={“code”: params.code, “language”: params.language, “dependencies”: params.dependencies},
execution_time=time.time() – start,
warnings=warnings,
metadata={“lines_of_code”: len(params.code.split(‘n’))}
)

def planner_tool(params: MultiToolPlan) -> ExecutionResult:
    import time
    start = time.time()
    warnings = []
    # Sanity-check the plan for size and self-referencing dependencies.
    if len(params.steps) > 10:
        warnings.append("Plan has many steps, consider breaking into sub-plans")
    for step_id, deps in params.dependencies.items():
        if step_id in deps:
            warnings.append(f"Circular dependency detected in step {step_id}")
    return ExecutionResult(
        success=True,
        data={"plan": params.steps, "estimated_time": params.estimated_duration},
        execution_time=time.time() - start,
        warnings=warnings,
        metadata={"total_steps": len(params.steps)}
    )

TOOLS = {
    "sql_engine": sql_engine_tool,
    "data_transformer": data_transformer_tool,
    "api_orchestrator": api_orchestrator_tool,
    "code_generator": code_generator_tool,
    "planner": planner_tool
}

We implement the actual tools: SQL execution, data transformation, API orchestration, code generation, and planning. As we write these tool functions, we simulate realistic workflows with controlled outputs and error handling. This allows us to test the agent’s decision-making in an environment that mirrors real-world tasks. Check out the FULL CODES here.
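Before wiring these tools into the agent, we can sanity-check one of them in isolation. The snippet below is a minimal sketch; it assumes the SQLQuery schema defined earlier in the full code exposes only the fields used by sql_engine_tool (table, columns, where_conditions, aggregations) and that none of the omitted fields are required.

# Hypothetical smoke test for sql_engine_tool; field names follow the usage above,
# but the real SQLQuery schema from the full code may declare additional fields.
test_query = SQLQuery(
    table="users",
    columns=["name", "age"],
    where_conditions={"country": "USA"},
)
result = sql_engine_tool(test_query)
print(result.success)    # True
print(result.data)       # [{'name': 'Alice', 'age': 30}, {'name': 'Charlie', 'age': 35}]
print(result.metadata)   # {'rows_affected': 2, 'query_type': 'SELECT'}

Because every tool returns an ExecutionResult, the agent can treat tool outputs uniformly when it reports success, warnings, and metadata.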

class AdvancedToolAgent:
    """Agent with complex reasoning, error recovery, and multi-step planning"""

    def __init__(self, model_name: str = "HuggingFaceH4/zephyr-7b-beta"):
        import torch
        print(f" Loading model: {model_name}")
        model_kwargs = {"device_map": "auto"}
        if torch.cuda.is_available():
            print(" GPU detected - using 8-bit quantization")
            from transformers import BitsAndBytesConfig
            quantization_config = BitsAndBytesConfig(
                load_in_8bit=True,
                llm_int8_threshold=6.0
            )
            model_kwargs["quantization_config"] = quantization_config
        else:
            print(" CPU mode - using smaller model for better performance")
            model_name = "google/flan-t5-base"
            model_kwargs["torch_dtype"] = "auto"
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name,
            **model_kwargs
        )
        self.pipe = pipeline(
            "text-generation", model=self.model, tokenizer=self.tokenizer,
            max_new_tokens=768, temperature=0.7, do_sample=True
        )
        self.client = instructor.from_pipe(self.pipe)
        self.execution_history = []
        print(" Agent initialized!")

    def route_to_tool(self, user_query: str, context: Optional[str] = None) -> ToolCall:
        tool_descriptions = """
        Advanced Tools:
        - sql_engine: Execute complex SQL queries with joins, aggregations, filtering
        - data_transformer: Multi-step data pipelines (filter→aggregate→normalize)
        - api_orchestrator: Call multiple APIs with dependencies, retries, error handling
        - code_generator: Generate safe, validated code with tests in multiple languages
        - planner: Create multi-step execution plans with dependency management
        - none: Answer directly using reasoning
        """
        prompt = f"""{tool_descriptions}

User query: {user_query}
{f'Context from previous steps: {context}' if context else ''}

Analyze the complexity and choose the appropriate tool. For multi-step tasks, use the planner."""
        return self.client(prompt, response_model=ToolCall)

    def execute_with_recovery(self, tool_call: ToolCall, max_retries: int = 2) -> ExecutionResult:
        # Retry transient failures, log every attempt, and fall back to a failed result.
        for attempt in range(max_retries + 1):
            try:
                if tool_call.tool_name == "none":
                    return ExecutionResult(
                        success=True, data="Direct response", execution_time=0.0,
                        warnings=[], metadata={}
                    )
                tool_func = TOOLS.get(tool_call.tool_name)
                if not tool_func:
                    return ExecutionResult(
                        success=False, data=None, execution_time=0.0,
                        warnings=[f"Tool {tool_call.tool_name} not found"], metadata={}
                    )
                result = tool_func(tool_call.tool_input)
                self.execution_history.append({
                    "tool": tool_call.tool_name,
                    "success": result.success,
                    "timestamp": datetime.now().isoformat()
                })
                return result
            except Exception as e:
                if attempt < max_retries:
                    print(f" Attempt {attempt + 1} failed, retrying...")
                    continue
                return ExecutionResult(
                    success=False, data=None, execution_time=0.0,
                    warnings=[f"Failed after {max_retries + 1} attempts: {str(e)}"],
                    metadata={"error": str(e)}
                )

We construct the agent itself, loading the model, building the routing pipeline, and implementing recovery logic. As we define methods for tool selection and execution, we give the agent the ability to understand queries, choose strategies, and gracefully handle failures. Check out the FULL CODES here.

    def run(self, user_query: str, verbose: bool = True) -> Dict[str, Any]:
        if verbose:
            print(f"\n{'='*70}")
            print(f" Complex Query: {user_query}")
            print(f"{'='*70}")
        if verbose:
            print("\n Step 1: Analyzing query complexity & routing...")
        tool_call = self.route_to_tool(user_query)
        if verbose:
            print(f" → Tool: {tool_call.tool_name}")
            print(f" → Confidence: {tool_call.confidence:.2%}")
            print(f" → Reasoning: {tool_call.reasoning}")
            if tool_call.requires_human_approval:
                print(" Requires human approval!")
        if verbose:
            print("\n Step 2: Executing tool with error recovery...")
        result = self.execute_with_recovery(tool_call)
        if verbose:
            print(f" → Success: {result.success}")
            print(f" → Execution time: {result.execution_time:.3f}s")
            if result.warnings:
                print(f" → Warnings: {', '.join(result.warnings)}")
            print(f" → Data preview: {str(result.data)[:200]}...")
        if verbose and result.metadata:
            print("\n Metadata:")
            for key, value in result.metadata.items():
                print(f" • {key}: {value}")
        if verbose:
            print(f"\n{'='*70}\n")
        return {
            "query": user_query,
            "tool_used": tool_call.tool_name,
            "result": result,
            "history_length": len(self.execution_history)
        }

def main():
    agent = AdvancedToolAgent()
    hard_queries = [
        "Generate a SQL query to find all users from USA who have completed orders worth more than $150, and join with their order details",
        "Create a data pipeline that filters records where category='A', then aggregates by count, and normalizes the results by a factor of 100",
        "I need to call 3 APIs sequentially: first authenticate at /auth, then fetch user data at /users/{id}, and finally update preferences at /preferences. If any step fails, retry up to 3 times",
        "Write a Python function that validates email addresses using regex, includes error handling, and has at least 2 test cases. Make sure it doesn't use any dangerous operations",
        "Create a multi-step plan to: 1) Extract data from a database, 2) Transform it using pandas, 3) Generate a report, 4) Send via email. Show dependencies between steps"
    ]
    print("\n" + " HARD MODE: COMPLEX QUERIES ".center(70, "=") + "\n")
    for i, query in enumerate(hard_queries, 1):
        print(f"\n{'#'*70}")
        print(f"# CHALLENGE {i}/{len(hard_queries)}")
        print(f"{'#'*70}")
        try:
            agent.run(query, verbose=True)
        except Exception as e:
            print(f" Critical error: {e}\n")
    print("\n" + f" COMPLETED {len(agent.execution_history)} TOOL EXECUTIONS ".center(70, "=") + "\n")
    print(f" Success rate: {sum(1 for h in agent.execution_history if h['success']) / len(agent.execution_history) * 100:.1f}%")

if __name__ == "__main__":
    main()

We tie everything together with a run() method and a demo main() function that executes multiple hard-mode queries. As we watch the agent analyze, route, execute, and report results, we see the full power of the architecture in action. This final step lets us experience how the system performs under complex, realistic scenarios.

In conclusion, we have built a powerful agent capable of understanding intricate instructions, routing execution across multiple tools, and gracefully recovering from errors, all within a compact, offline system. As we test it on challenging queries, we watch it plan, reason, and execute with clarity and structure. We now appreciate how modular schemas, validated tool calls, and layered execution logic allow us to create agents that behave reliably in complex environments.

Check out the FULL CODES here.
The post How to Build a Fully Offline Multi-Tool Reasoning Agent with Dynamic Planning, Error Recovery, and Intelligent Function Routing appeared first on MarkTechPost.

Meta AI Releases Segment Anything Model 3 (SAM 3) for Promptable Conce …

How do you reliably find, segment and track every instance of any concept across large image and video collections using simple prompts? Meta AI Team has just released Meta Segment Anything Model 3, or SAM 3, an open-sourced unified foundation model for promptable segmentation in images and videos that operates directly on visual concepts instead of only pixels. It detects, segments and tracks objects from both text prompts and visual prompts such as points, boxes and masks. Compared with SAM 2, SAM 3 can exhaustively find all instances of an open vocabulary concept, for example every ‘red baseball cap’ in a long video, using a single model.

From Visual Prompts to Promptable Concept Segmentation

Earlier SAM models focused on interactive segmentation. A user clicked or drew a box and the model produced a single mask. That workflow did not scale to tasks where a system must find all instances of a concept across large image or video collections. SAM 3 formalizes Promptable Concept Segmentation (PCS), which takes concept prompts and returns instance masks and stable identities for every matching object in images and videos.

Concept prompts combine short noun phrases with visual exemplars. The model supports detailed phrases such as ‘yellow school bus’ or ‘player in red’ and can also use exemplar crops as positive or negative examples. Text prompts describe the concept, while exemplar crops help disambiguate fine grained visual differences. SAM 3 can also be used as a vision tool inside multimodal large language models that generate longer referring expressions and then call SAM 3 with distilled concept prompts.

https://ai.meta.com/blog/segment-anything-model-3/?

Architecture, Presence Token and Tracking Design

The SAM 3 model has 848M parameters and consists of a detector and a tracker that share a single vision encoder. The detector is a DETR based architecture that is conditioned on three inputs, text prompts, geometric prompts and image exemplars. This separates the core image representation from the prompting interfaces and lets the same backbone serve many segmentation tasks.

A key change in SAM 3 is the presence token. This component predicts whether each candidate box or mask actually corresponds to the requested concept. It is especially important when the text prompts describe related entities, such as ‘a player in white’ and ‘a player in red’. The presence token reduces confusion between such prompts and improves open vocabulary precision. Recognition, meaning classifying a candidate as the concept, is decoupled from localization, meaning predicting the box and mask shape.

For video, SAM 3 reuses the transformer encoder decoder tracker from SAM 2, but connects it tightly to the new detector. The tracker propagates instance identities across frames and supports interactive refinement. The decoupled detector and tracker design minimizes task interference, scales cleanly with more data and concepts, and still exposes an interactive interface similar to earlier Segment Anything models for point based refinement.

https://ai.meta.com/research/publications/sam-3-segment-anything-with-concepts/

SA-Co Dataset and Benchmark Suite

To train and evaluate Promptable Concept Segmentation (PCS), Meta introduces the SA-Co family of datasets and benchmarks. The SA-Co benchmark contains 270K unique concepts, which is more than 50 times the number of concepts in previous open vocabulary segmentation benchmarks. Every image or video is paired with noun phrases and dense instance masks for all objects that match each phrase, including negative prompts where no objects should match.

The associated data engine has automatically annotated more than 4M unique concepts, which makes SA-Co the largest high quality open vocabulary segmentation corpus as mentioned by Meta. The engine combines large ontologies with automated checks and supports hard negative mining, for example phrases that are visually similar but semantically distinct. This scale is essential for learning a model that can respond robustly to diverse text prompts in real world scenes.

Image and Video Performance

On the SA-Co image benchmarks, SAM 3 reaches between 75 percent and 80 percent of human performance measured with the cgF1 metric. Competing systems such as OWLv2, DINO-X and Gemini 2.5 lag significantly behind. For example, on SA-Co Gold box detection, SAM 3 reports cgF1 of 55.7, while OWLv2 reaches 24.5, DINO-X reaches 22.5 and Gemini 2.5 reaches 14.4. This shows that a single unified model can outperform specialized detectors on open vocabulary segmentation.

In videos, SAM 3 is evaluated on SA-V, YT-Temporal 1B, SmartGlasses, LVVIS and BURST. On SA-V test it reaches 30.3 cgF1 and 58.0 pHOTA. On YT-Temporal 1B test it reaches 50.8 cgF1 and 69.9 pHOTA. On SmartGlasses test it reaches 36.4 cgF1 and 63.6 pHOTA, while on LVVIS and BURST it reaches 36.3 mAP and 44.5 HOTA respectively. These results confirm that a single architecture can handle both image PCS and long horizon video tracking.

https://ai.meta.com/research/publications/sam-3-segment-anything-with-concepts/

SAM 3 as a Data-Centric Benchmarking Opportunity for Annotation Platforms

For data-centric platforms like Encord, SAM 3 is a natural next step after their existing integrations of SAM and SAM 2 for auto-labeling and video tracking, which already let customers auto-annotate more than 90 percent of images with high mask accuracy using foundation models inside Encord’s QA-driven workflows. Similar platforms such as CVAT, SuperAnnotate and Picsellia are standardizing on Segment Anything style models for zero-shot labeling, model-in-the-loop annotation and MLOps pipelines. SAM 3’s promptable concept segmentation and unified image-video tracking create clear editorial and benchmarking opportunities here, for example, quantifying label cost reductions and quality gains when Encord-like stacks move from SAM 2 to SAM 3 on dense video datasets or in multimodal settings.

Key Takeaways

SAM 3 unifies image and video segmentation into a single 848M parameter foundation model that supports text prompts, exemplars, points and boxes for Promptable Concept Segmentation.

The SA-Co data engine and benchmark introduce about 270K evaluated concepts and over 4M automatically annotated concepts, making SAM 3’s training and evaluation stack one of the largest open vocabulary segmentation resources available.

SAM 3 substantially outperforms prior open vocabulary systems, reaching around 75 to 80 percent of human cgF1 on SA-Co and more than doubling OWLv2 and DINO-X on key SA-Co Gold detection metrics.

The architecture decouples a DETR based detector from a SAM 2 style video tracker with a presence head, enabling stable instance tracking across long videos while keeping interactive SAM style refinement.

Editorial Comments

SAM 3 advances Segment Anything from Promptable Visual Segmentation to Promptable Concept Segmentation in a single 848M parameter model that unifies image and video. It leverages the SA-Co benchmark with about 270K evaluated concepts and over 4M automatically annotated concepts to approximate 75 to 80 percent of human performance on cgF1. The decoupled DETR based detector and SAM 2 style tracker with a presence head makes SAM 3 a practical vision foundation model for agents and products. Overall, SAM 3 is now a reference point for open vocabulary segmentation at production scale.

Check out the Paper, Repo and Model Weights.
The post Meta AI Releases Segment Anything Model 3 (SAM 3) for Promptable Concept Segmentation in Images and Videos appeared first on MarkTechPost.

MSD explores applying generative AI to improve the deviation managemen …

This post is co-written with Hossein Salami and Jwalant Vyas from MSD. 
In the biopharmaceutical industry, deviations in the manufacturing process are rigorously addressed. Each deviation is thoroughly documented, and its various aspects and potential impacts are closely examined to help ensure drug product quality, patient safety, and compliance. For leading pharmaceutical companies, managing these deviations robustly and efficiently is crucial to maintaining high standards and minimizing disruptions.
Recently, the Digital Manufacturing Data Science team at Merck & Co., Inc., Rahway, NJ, USA (MSD) recognized an opportunity to streamline aspects of their deviation management process using emerging technologies including vector databases and generative AI, powered by AWS services such as Amazon Bedrock and Amazon OpenSearch. This innovative approach aims to use the organization’s past deviations as a vast, diverse, and reliable knowledge source. Such knowledge can potentially help reduce the time and resources required for—and increase the efficiency of—researching and addressing each new deviation by using learnings from similar cases across the manufacturing network, while maintaining the rigorous standards demanded by Good Manufacturing Practices (GMP) requirements.
Industry trends: AI in pharmaceutical manufacturing
The pharmaceutical industry has been increasingly turning to advanced technologies to enhance various aspects of their operations, from early drug discovery to manufacturing and quality control. The application of AI, particularly generative AI, in streamlining complex processes is a growing trend. Many companies are exploring how these technologies can be applied to areas that traditionally require significant human expertise and time investment, including the above-mentioned deviation management. This shift towards AI-assisted processes is not only about improving efficiency, but also about enhancing the quality and consistency of outcomes in critical areas.
Innovative solution: Generative AI for deviation management
To address some of the major challenges in deviation management, the Digital Manufacturing Data Science team at MSD devised an innovative solution using generative AI (see How can language models assist with pharmaceuticals manufacturing deviations and investigations?). The approach involves first creating a comprehensive knowledge base from past deviation reports, which can be intelligently queried to provide various insights, including helpful information for addressing new cases. In addition to the routine metadata, the knowledge base includes important unstructured data such as observations, analysis processes, and conclusions, typically recorded as natural language text. The solution is designed to facilitate the interaction of different users in manufacturing sites, with different personas and roles, with these knowledge sources. For example, users can quickly and accurately identify and access information about similar past incidents and use that information to hypothesize about potential root causes and define resolutions for a current case. This is facilitated by a hybrid, domain-specific search mechanism implemented through Amazon OpenSearch Service. Subsequently, the information is processed by a large language model (LLM) and presented to the user based on their persona and need. This functionality not only saves time but also uses the wealth of experience and knowledge from previous deviations.
Solution overview: Goals, risks, and opportunities
Deviation investigations have traditionally been a time-consuming, manual process that requires significant human effort and expertise. Investigation teams often spend extensive hours collecting, analyzing, and documenting information, sifting through historical records, and drawing conclusions—a workflow that is not only labor-intensive but also prone to potential human error and inconsistency. The solution aims to achieve several key goals:

Significantly reduce the time and effort required for investigation and closure of a deviation
Provide users with easy access to relevant knowledge, historical information, and data with high accuracy and flexibility based on user persona
Make sure that the information used to derive conclusions is traceable and verifiable

The team is also mindful of potential risks, such as over-reliance on AI-generated suggestions or the possibility of outdated information influencing current investigations. To mitigate these risks, the solution mostly limits the generative AI content creation to low-risk areas and incorporates human oversight and other guardrails. An automated data pipeline helps the knowledge base remain up-to-date with the most recent information and data. To protect proprietary and sensitive manufacturing information, the solution includes data encryption and access controls on different elements.
Additionally, the team sees opportunities for incorporating new elements in the architecture, particularly in the form of agents that can handle specific requests common to certain user personas such as high-level statistics and visualizations for site managers.
Technical architecture: RAG approach with AWS services
The solution architecture uses a Retrieval-Augmented Generation (RAG) approach to enhance the efficiency, relevance, and traceability of deviation investigations. This architecture integrates multiple AWS managed services to build a scalable, secure, and domain-aware AI-driven system.
At the core of the solution is a hybrid retrieval module (leveraging the hybrid search capabilities of Amazon OpenSearch Service) that combines semantic (vector-based) and keyword (lexical) search for high-accuracy information retrieval. This module is built on Amazon OpenSearch Service, which functions as the vector store. OpenSearch indexes embeddings generated from past deviation reports and related documents, enriched with domain-specific metadata such as deviation type, resolution date, impacted product lines, and root cause classification. This supports both deep semantic search and efficient filtering based on structured fields.
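The post does not include the retrieval code, but a simplified hybrid query against an OpenSearch k-NN index might look like the following sketch. The index name, field names, and embedding function are illustrative assumptions, not details from MSD’s implementation.

from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "my-opensearch-domain", "port": 443}], use_ssl=True)

def hybrid_search(query_text, query_vector, deviation_type=None, k=10):
    # Combine a lexical (match) clause and a semantic (k-NN) clause, filtered on metadata.
    bool_query = {
        "should": [
            {"match": {"description": {"query": query_text}}},        # keyword relevance
            {"knn": {"embedding": {"vector": query_vector, "k": k}}}  # semantic similarity
        ]
    }
    if deviation_type:
        bool_query["filter"] = [{"term": {"deviation_type": deviation_type}}]
    return client.search(
        index="deviation-reports",   # illustrative index name
        body={"size": k, "query": {"bool": bool_query}},
    )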
To support structured data storage and management, the system uses Amazon Relational Database Service (Amazon RDS). RDS stores normalized tabular information associated with each deviation case, such as investigation timelines, responsible personnel, and other operational metadata. With RDS, the system can run complex queries across structured dimensions and support reporting, compliance audits, and trend analysis.
A RAG pipeline orchestrates the flow between the retrieval module and a large language model (LLM) hosted in Amazon Bedrock. When a user issues a query, the system first retrieves relevant documents from OpenSearch and structured case data from RDS. These results are then passed as context to the LLM, which generates grounded, contextualized outputs such as:

Summarized investigation histories
Root cause patterns
Comparable past incidents
Suggested next steps or knowledge gaps

High-level architecture of the solution. Domain-specific deviation data are located on Amazon RDS and OpenSearch. Text vector embeddings along with relevant metadata are located on OpenSearch to support a variety of search functionalities.
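A minimal sketch of the retrieve-then-generate step described above, using the Amazon Bedrock Converse API, could look like the following. The model ID and prompt structure are illustrative; the production system adds persona handling, guardrails, and traceability.

import boto3

bedrock = boto3.client("bedrock-runtime")

def answer_with_context(question, retrieved_docs, structured_rows):
    # Ground the LLM in the retrieved deviation reports plus structured case data from RDS.
    context = "\n\n".join(retrieved_docs) + "\n\nStructured case data:\n" + str(structured_rows)
    response = bedrock.converse(
        modelId="anthropic.claude-3-sonnet-20240229-v1:0",  # illustrative model choice
        messages=[{
            "role": "user",
            "content": [{"text": f"Context:\n{context}\n\nQuestion: {question}\n"
                                 "Answer using only the context and cite the source deviation IDs."}],
        }],
        inferenceConfig={"maxTokens": 1024, "temperature": 0.2},
    )
    return response["output"]["message"]["content"][0]["text"]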

Conclusion and next steps
This blog post has explored how MSD is harnessing the power of generative AI and databases to optimize and transform its manufacturing deviation management process. By creating an accurate and multifaceted knowledge base of past events, deviations, and findings, the company aims to significantly reduce the time and effort required for each new case while maintaining the highest standards of quality and compliance.
As next steps, the company plans to conduct a comprehensive review of use cases in the pharma quality domain and build a generative AI-driven, enterprise-scale product by integrating structured and unstructured sources using methods from this innovation. Key capabilities coming from this work include the data architecture, data modeling (including metadata curation), and the generative AI components. Looking ahead, we plan to use the capabilities of Amazon Bedrock Knowledge Bases, which will provide more advanced semantic search and retrieval capabilities while maintaining seamless integration within the AWS environment. If successful, this approach could not only set a new standard for deviation management at MSD, but also pave the way for more efficient, integrated, and knowledge-driven manufacturing quality processes including complaints, audits, and so on.

About the authors
Hossein Salami is a Senior Data Scientist at the Digital Manufacturing organization at MSD. As a Chemical Engineering Ph.D. with a background of more than 9 years of laboratory and process R&D experience, he takes part in leveraging advanced technologies to build data science and AI/ML solutions that address core business problems and applications.
Jwalant (JD) Vyas is the Digital Product Line Lead for the Investigations Digital Product Portfolio at MSD, bringing 25+ years of biopharmaceutical experience across Quality Operations, QMS, Plant Operations, Manufacturing, Supply Chain, and Pharmaceutical Product Development. He leads the digitization of Quality Operations to improve efficiency, strengthen compliance, and enhance decision-making. With deep business domain and technology expertise, he bridges technical depth with strategic leadership.
Duverney Tavares is a Senior Solutions Architect at Amazon Web Services (AWS), specializing in guiding Life Sciences companies through their digital transformation journeys. With over two decades of experience in Data Warehousing, Big Data & Analytics, and Database Management, he uses his expertise to help organizations harness the power of data to drive business growth and innovation.

Accelerating genomics variant interpretation with AWS HealthOmics and …

Genomic research stands at a transformative crossroads where the exponential growth of sequencing data demands equally sophisticated analytical capabilities. According to the 1000 Genomes Project, a typical human genome differs from the reference at 4.1–5.0 million sites, with most variants being SNPs and short indels. These variants, when aggregated across individuals, contribute to differences in disease susceptibility captured through polygenic risk scores (PRS). Genomic analysis workflows struggle to translate such large-scale variant data into actionable insights. They remain fragmented, requiring researchers to manually orchestrate complex pipelines involving variant annotation, quality filtering, and integration with external databases such as ClinVar.

AWS HealthOmics workflows, together with Amazon S3 Tables and Amazon Bedrock AgentCore, provide a transformative solution to these challenges. HealthOmics workflows support seamless annotation of Variant Call Format (VCF) files with functional and clinical ontologies. The VEP-annotated VCF files are then transformed into structured datasets stored in optimized S3 tables to improve query performance across large variant cohorts. The Strands Agents SDK running on Amazon Bedrock AgentCore provides a secure and scalable AI agent application so that researchers can interact with complex genomic datasets without specialized query expertise.
In this blog post, we show you how agentic workflows can accelerate the processing and interpretation of genomics pipelines at scale with a natural language interface. We demonstrate a comprehensive genomic variant interpreter agent that combines automated data processing with intelligent analysis to address the entire workflow from raw VCF file ingestion to conversational query interfaces. Most importantly, this solution removes the technical expertise barrier that has traditionally limited genomic analysis to specialized bioinformaticians. This enables clinical researchers to upload raw VCF files and immediately ask questions like ‘Which patients have pathogenic variants in BRCA1?’ or ‘Show me drug resistance variants in this cohort’. The code for this solution is available in the open-source toolkit repository of starter agents for life sciences on AWS.
Understanding variant annotation in genomic analysis
The foundation of genomic variant interpretation relies on comprehensive annotation pipelines that connect raw genetic variants to biological and clinical context. Variant Effect Predictor (VEP) and ClinVar represent two essential components in modern genomic analysis workflows, each providing complementary information that researchers must integrate to derive meaningful insights.

The comparative visualization illustrates the distinct yet complementary annotation capabilities of ClinVar and VEP for genomic variant interpretation. ClinVar annotations (left) focus primarily on clinical significance assessment, providing curated pathogenicity classifications (CLNSIG), evidence quality metrics (CLNREVSTAT), and disease associations (CLNDN) directly relevant to clinical decision-making. VEP annotations (right) deliver comprehensive functional information including consequence types (missense_variant, synonymous_variant, intron_variant), impact severity classifications (HIGH, MODERATE, LOW, MODIFIER), gene symbols, and transcript-specific effects with detailed positional information.
Current annotation workflow challenges
Variant annotation workflows typically follow a sequential process that includes:

Initial VCF processing: Raw variant call format (VCF) files from sequencing systems require preprocessing to normalize representation and filter low-quality calls.
VEP annotation: Running the Variant Effect Predictor tool requires substantial computational resources, especially for whole genome sequencing data with millions of variants per sample. VEP analysis can take 2-8 hours for a single genome depending on available compute resources and annotation depth.
ClinVar integration: Clinical annotations must be retrieved from ClinVar and matched to variants through a separate process, requiring database lookups and format conversions.
Multi-sample integration: Creating cohort-level analyses requires complex joining operations across samples, typically performed with specialized tools that generate large, flat files difficult to query efficiently.
Interpretation: Scientists must then use various tools to filter, sort, and analyze the annotated data—a process that often requires custom scripts and significant bioinformatics expertise. This technical bottleneck means that clinical researchers cannot independently explore their genomic data, creating delays of days or weeks between asking a biological question and receiving an answer.

Dataset complexity and scale
The scale of genomic variant analysis is exemplified by datasets like the 1000 Genomes Phase 3 Reanalysis with DRAGEN, which contains:

Over 2,500 individual samples from diverse populations
Approximately 85 million unique variants across all samples
Multiple annotation versions (DRAGEN 3.5, 3.7, 4.0, and 4.2) that must be reconciled
Complex structural variants alongside SNPs and indels

This complexity creates significant bottlenecks in traditional analysis pipelines that rely on flat file processing and manual integration steps.
Solution overview
Building genomic cohorts or computing PRS across multiple patients demands significant compute resources to generate joint variant call tables and comprehensive annotations using tools like the Variant Effect Predictor (VEP). Most critically, these workflows create a technical barrier where only bioinformaticians with SQL expertise and deep understanding of variant file formats can extract meaningful insights, leaving clinical researchers dependent on specialized technical teams for basic genomic queries.
The transformative advantage of our AI-powered approach lies in democratizing genomic analysis through natural language interaction. While traditional VEP pipelines require days of technical expertise to answer clinical questions like ‘Which patients have high-impact variants in drug resistance genes?’, with our solution researchers can ask these questions conversationally and receive answers in minutes. This represents a shift from technical dependency to self-service genomic insights, so that clinical researchers, tumor boards, and genomics teams can directly explore their data without waiting for bioinformatics support.
Our solution demonstrates a generative AI-powered genomics variant interpreter agent that combines automated data processing with intelligent natural language analysis. The architecture addresses the entire genomic analysis workflow, from raw VCF file ingestion to conversational query interfaces.

The solution follows six key steps that transform raw genomic data into actionable insights:

Raw VCF processing: Raw VCF files from sequencing providers are uploaded to Amazon S3 storage and trigger AWS Lambda functions through S3 event notifications, which orchestrate AWS HealthOmics workflows.
VEP annotation: AWS HealthOmics workflows automatically process raw VCF files using the Variant Effect Predictor (VEP), enriching variants with functional predictions and clinical annotations in parallel before storing the annotated results back to S3.
Event coordination: Amazon EventBridge monitors workflow completion and triggers Lambda functions that update job status in Amazon DynamoDB. An AWS Batch compute environment on Fargate then transforms the VEP-annotated VCF files and ClinVar annotations into Iceberg format using the PyIceberg module.
Data organization: The PyIceberg loader interacts with the Amazon S3 Tables Iceberg REST endpoint. Amazon S3 Tables registers the table metadata in the AWS Glue Data Catalog, so schema information (columns, data types, partitions) is catalogued for the annotated VCF and ClinVar tables. It also establishes the analytics connector for downstream analytics.
SQL-powered analysis: Amazon Athena provides SQL-based querying over the genomic data in columnar storage format, enabling large-scale analysis with fast query responses across millions of variants.
Natural language interaction: The Strands orchestrator agent, powered by Amazon Bedrock LLMs on AgentCore Runtime, provides a natural language interface through five specialized tools that execute Athena queries:

query_variants_by_gene: Retrieves variants associated with specific genes
query_variants_by_chromosome: Facilitates chromosome-specific variant analysis
compare_sample_variants: Enables comparative genomics across patient samples
analyze_allele_frequencies: Provides population genetics insights
execute_dynamic_genomics_query: Supports flexible, ad-hoc analysis requests

The architecture includes comprehensive security controls through AWS IAM for fine-grained access management and Amazon CloudWatch for monitoring. The automated, event-driven pipeline supports scalable parallel processing of VCF files that automatically adapts to growing genomic datasets while maintaining consistent annotation quality and analytical capabilities.
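As an illustration of the first two steps, the S3-triggered Lambda function could start a HealthOmics workflow run roughly as follows. The workflow ID, role ARN, parameter names, and output location are assumptions for illustration, not values from the reference implementation in the toolkit repository.

import boto3

omics = boto3.client("omics")

def handler(event, context):
    # Triggered by an S3 event notification when a raw VCF lands in the input bucket.
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]

    run = omics.start_run(
        workflowId="1234567",                      # hypothetical VEP annotation workflow
        roleArn="arn:aws:iam::111122223333:role/HealthOmicsRunRole",
        name=f"vep-annotation-{key.rsplit('/', 1)[-1]}",
        parameters={"input_vcf": f"s3://{bucket}/{key}"},  # parameter names depend on the workflow
        outputUri="s3://my-genomics-bucket/annotated/",
    )
    return {"runId": run["id"]}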
Amazon S3 Tables with PyIceberg: Transforming VCF to a structured cohort
Amazon S3 Tables with PyIceberg transforms VEP-annotated VCF files into structured, queryable cohort-level datasets optimized for AI-driven analysis. This creates the data foundation for natural language interfaces to efficiently interact with complex genomic data.
PyIceberg creates Apache Iceberg tables in S3 Tables format, providing the following benefits:

Optimal queries: The agent can perform complex genomic queries across millions of variants with minimal latency through optimized columnar storage, transforming analyses that previously required hours of SQL development and execution into instant conversational responses.
Rich annotation access: The VEP and ClinVar annotations become directly queryable through SQL via Amazon Athena, allowing the AI agent to extract specific genomic insights.
Cohort-level analysis: The structured Iceberg format (PyIceberg) supports efficient comparisons across patient cohorts for population-level queries through natural language.

The separation of variant data from annotation data in S3 Tables creates an ideal foundation for AI-driven analytics: the genomic variant tables contain core positional information that agents can rapidly filter, while the annotation and clinical tables house the rich functional and clinical context needed for interpretation. With this structure, the Strands agent can construct targeted queries that precisely answer user questions through the AWS Glue Data Catalog connector.
This conversion from raw VCF files to structured tables is what makes it possible for researchers to query complex genomic datasets conversationally through the Strands orchestrator agent on Amazon Bedrock AgentCore.
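The post does not show the loader code, but a minimal PyIceberg sketch of appending annotated variants to an S3 Tables-backed Iceberg table might look like this. The catalog properties, endpoint, namespace, table name, and schema are illustrative assumptions and depend on your table bucket and Region.

import pyarrow as pa
from pyiceberg.catalog import load_catalog

# Connect to the Iceberg REST endpoint exposed by Amazon S3 Tables
# (the exact catalog properties depend on your table bucket and Region).
catalog = load_catalog(
    "s3tables",
    **{
        "type": "rest",
        "uri": "https://s3tables.us-east-1.amazonaws.com/iceberg",  # illustrative endpoint
        "warehouse": "arn:aws:s3tables:us-east-1:111122223333:bucket/genomics-tables",
        "rest.sigv4-enabled": "true",
        "rest.signing-name": "s3tables",
        "rest.signing-region": "us-east-1",
    },
)

# Small illustrative batch of annotated variants (schema must match the Iceberg table).
variants = pa.table({
    "sample_id": ["NA21144", "NA21144"],
    "chrom": ["chr10", "chr17"],
    "pos": [111079820, 43093464],
    "ref": ["A", "G"],
    "alt": ["G", "A"],
    "gene_symbol": ["ADRA2A", "BRCA1"],
    "consequence": ["upstream_gene_variant", "missense_variant"],
})

table = catalog.load_table("genomics.variants")  # namespace.table created beforehand
table.append(variants)                           # writes Parquet data files and commits a snapshot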
Intelligent genomic analysis with Strands Agents and AgentCore Runtime
The conversational interface represents the core innovation of our genomics AI solution, built using the Strands Agents SDK and deployed on Amazon Bedrock AgentCore Runtime. This sophisticated AI agent understands complex genomic concepts and translates natural language queries into appropriate analytical operations against the structured genomic datasets.
AgentCore Runtime is a secure, serverless runtime purpose-built for deploying and scaling dynamic AI agents and tools. This solution offers several key advantages for genomic analysis:

Model and framework flexibility: AgentCore services are composable and work with open-source or custom frameworks and models, both in and outside of Amazon Bedrock
Multi-hour agentic workloads: Supports long-running workloads up to 8 hours and payloads up to 100MB
Security: Dedicated microVMs for each user session with complete isolation
Enterprise-grade integration: Built-in authentication via AgentCore Identity with AWS IAM
Observability: Comprehensive tracing of agent reasoning and tool invocations
Private resource access: Connectivity to databases and APIs within Amazon Virtual Private Cloud
Faster time-to-market: Accelerated deployment and development cycles for AI agent solutions

For detailed information on Amazon Bedrock AgentCore capabilities, refer to the Amazon Bedrock AgentCore documentation.
Strands Agents provide a robust foundation for building domain-specific AI agents with specialized capabilities through a model-driven approach that orchestrates genomic analysis tools using an agentic loop concept. This iterative reasoning framework enables agents to dynamically select and execute appropriate tools based on analysis requirements. Our genomic variant interpreter implements five key tools that leverage the structured data created by Amazon S3 Tables:

Variant querying: Translates gene-based questions into precise Athena SQL queries that retrieve associated variants.
Chromosome analysis: Enables region-specific genomic interrogation through natural language.
Sample comparison: Facilitates cross-patient genomic analysis without requiring SQL joins.
Population frequency analysis: Contextualizes findings against reference datasets like 1000 Genomes.
Dynamic query generation: Converts complex natural language requests into optimized SQL.
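A condensed sketch of one such tool and the agent wiring, using the Strands Agents SDK and Amazon Athena, might look like the following. The database name, output location, table and column names, and system prompt are illustrative assumptions; the full implementation lives in the life sciences agents toolkit.

import time
import boto3
from strands import Agent, tool

athena = boto3.client("athena")

def run_athena(sql: str) -> list[dict]:
    # Execute a query and poll until it finishes (simplified, no error handling).
    qid = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": "genomics_db"},               # illustrative database
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )["QueryExecutionId"]
    while athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"] in ("QUEUED", "RUNNING"):
        time.sleep(1)
    rows = athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]
    header = [c["VarCharValue"] for c in rows[0]["Data"]]
    return [dict(zip(header, [c.get("VarCharValue") for c in r["Data"]])) for r in rows[1:]]

@tool
def query_variants_by_gene(gene_symbol: str) -> list[dict]:
    """Return annotated variants for a given gene symbol across the cohort."""
    return run_athena(
        f"SELECT sample_id, chrom, pos, ref, alt, consequence, clin_sig "
        f"FROM variants_annotated WHERE gene_symbol = '{gene_symbol}' LIMIT 100"
    )

agent = Agent(
    tools=[query_variants_by_gene],
    system_prompt="You are a genomic variant interpreter. Use the tools to answer cohort questions.",
)
agent("Which patients have pathogenic variants in BRCA1?")

In the actual solution the agent exposes all five tools listed above, and AgentCore Runtime handles session isolation, identity, and observability around this loop.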

Natural language queries
The agent demonstrates remarkable capability in handling diverse query types. In the traditional model clinical researchers must wait for bioinformatics teams to write custom scripts and run complex analyses. Instead of spending days crafting SQL queries and wrestling with VCF file formats, researchers can now explore their genomic data as naturally as having a conversation with a genomics expert.
Cohort-level analysis
User: “Summarize as a table the total number of variants and pathogenicity per patient in this cohort?”
For this query, the agent:

Uses the execute_dynamic_genomics_query tool.
Analyzes variant data across the cohort of samples.
Generates a comprehensive cohort summary with patient counts and variant statistics.
Presents findings in a structured and tabular format summary.

Cohort-level frequency analysis
User: “Provide me the allelic frequencies of shared pathogenic or likely pathogenic variants in this cohort and 1000 genomes?”
The agent translates this into queries that:

Retrieve the list of pathogenic variants for the patient by running the execute_dynamic_genomics_query and analyze_allele_frequencies tool.
Filter for clinically relevant pathogenic variants.
Extract disease level information from ClinVar and allele frequencies from VEP.
Present results with relevant context.

Comorbidity risk association
User: “Which patients have a variant in the ADRA2A gene at chr10:111079820, and do these patients have any additional high-impact variants linked with statin or insulin resistance?”
For this query, the agent:

Searches for additional risk variants in drug resistance pathways for a specific disease context.
Connects those variants with clinical significance at the individual patient level to assess comorbidity.
Provides the clinical implications of the combined disease and drug resistance pathways.

This natural language interface minimizes the need for researchers to master complex SQL syntax or understand the underlying data structures, democratizing access to genomic insights across clinical and research teams regardless of their technical background.
Advanced analytic processing
In addition to queries, the genomics variant interpreter agent demonstrates advanced analytical capabilities that extend beyond basic variant identification. Researchers can explore complex questions that traditionally required days of analysis.
Clinical decision support
User: ” Perform a thorough analysis on patient NA21144 and provide me the risk stratification for this patient”
For this query, the agent:

Analyzes variants in disease pathways genes, pharmacogenomics, and provides evidence-based recommendations.
Performs risk stratification by combining variant impact predictions with clinical significance classifications.
Identifies variants of uncertain significance.
Flags high-impact variants in clinically relevant genes.

Pharmacogenomics-guided dosing strategy
Researchers can leverage the agent for sophisticated pharmacogenomics pathway analyses across large cohorts through queries like:
User: ” Which major drug-related pathways are significantly enriched with genetic variants in this patient cohort? Provide me the most impactful pharmacogenomic pathways and associated patient IDs ”
This allows exploration of variant frequency distributions, consequence type patterns, and gene-level variant burdens across different populations—all through conversational interfaces without complex SQL or bioinformatics pipelines.

Benefits and limitations
The solution helps to address the challenges described earlier:

Challenge: Initial VCF processing and low-quality calls
Solution: The agent automatically checks variant call quality before making variant interpretation decisions.

Challenge: VEP annotation at scale
Solution: The pipeline automates VCF annotation at scale, in batches of 20, and uses the right compute resources to achieve the required performance.

Challenge: ClinVar integration
Solution: The agent assesses the query context, and a joint query is built dynamically based on the user’s interest.

Challenge: Multi-sample integration
Solution: Amazon S3 Tables integration in Iceberg format makes the cohort of VCF files queryable with good performance.

Challenge: Genomics interpretation
Solution: The agent understands the context and the user’s interest, making informed decisions by carefully reasoning over the relevant evidence from the annotations and in-house data.

The solution has the following limitations:

Lambda Runtime constraints: The current implementation uses AWS Lambda for VCF/GVCF processing, which has a maximum execution time of 15 minutes. This constraint may be insufficient for loading large VCF files or especially large GVCF files into Iceberg S3 Tables, as these operations can take substantially longer than the Lambda timeout limit. For production workloads with large genomic datasets, consider using AWS HealthOmics workflows, AWS Batch, ECS tasks, or EC2 instances with longer execution times to handle the data loading process.
Schema optimization trade-offs: The schema implementation uses sample and chromosome partitioning, which is optimized for patient-level analysis. However, cohort-level analysis typically requires different partitioning strategies and schema designs to achieve optimal performance at scale. Making both patient-level and cohort-level analytics performant within a single schema becomes increasingly challenging as cohort sizes grow beyond hundreds of samples. For large-scale cohort studies (thousands to tens of thousands of samples), consider implementing separate schemas or materialized views optimized for specific analytical patterns, or explore denormalized structures that better support population-level queries.

Future technological evolution
The solution’s modular architecture establishes a foundation for continued innovation in AI-powered genomic analysis. Future versions could integrate additional annotation databases, external APIs, and support multi-modal analysis combining genomic data with clinical records and imaging. Domain-specific fine-tuning on genomic data could further improve interpretation accuracy, while integration with electronic health records would provide point-of-care genomic insights.
A particularly promising direction is multi-agent collaboration in pharmaceutical R&D, where this genomics variant interpreter agent could work alongside specialized agents for drug profiling, target identification, literature evidence, and hypothesis generation. This collaborative agent framework can dramatically accelerate drug discovery pipelines by connecting variant-level insights directly to therapeutic development, streamlining the translation from genetic findings to clinical applications.
Conclusion
This next-generation genomics agentic AI solution represents a fundamental transformation in how researchers and clinicians interact with genomic data. By seamlessly integrating AWS HealthOmics for automated variant annotation and data transformation with Amazon Bedrock AgentCore for intelligent interpretation, we’ve created a comprehensive solution that addresses the entire genomic analysis workflow.
The combination of automated VEP annotation workflows, S3 Tables for transforming VCF data into queryable Iceberg tables, and Strands Agents on Amazon Bedrock AgentCore for natural language interaction creates a system that minimizes traditional barriers between variant annotation, data processing, and clinical interpretation. By automating complex technical processes and providing intuitive interaction methods, researchers can now focus on biological questions rather than technical implementation details.
As genomic data continues to grow exponentially and clinical applications become increasingly sophisticated, systems like this will become essential infrastructure for advancing precision medicine and accelerating scientific discovery. The solution demonstrated with the 1000 Genomes Phase 3 Reanalysis dataset shows how even large-scale genomic cohorts can be analyzed through simple conversational interfaces, democratizing access to advanced genomic insights.
The code for this solution is available on the Life sciences agents toolkit, and we encourage you to explore and build upon this template. For examples to get started with Amazon Bedrock AgentCore, check out the Amazon Bedrock AgentCore repository.

About the authors
Edwin Sandanaraj is a genomics solutions architect at AWS. With a PhD in neuro-oncology and more than 20 years of experience in healthcare genomics data management and analysis, he brings a wealth of knowledge to accelerate precision genomics efforts in Asia-Pacific and Japan. He has a passionate interest in clinical genomics and multi-omics to accelerate precision care using cloud-based solutions.
Hasan Poonawala is a Senior AI/ML Solutions Architect at AWS, working with Healthcare and Life Sciences customers. Hasan helps design, deploy and scale Generative AI and Machine learning applications on AWS. He has over 15 years of combined work experience in machine learning, software development and data science on the cloud. In his spare time, Hasan loves to explore nature and spend time with friends and family.
Charlie Lee is genomics industry lead for Asia-Pacific and Japan at AWS and has a PhD in computer science with a focus on bioinformatics. An industry leader with more than two decades of experience in bioinformatics, genomics, and molecular diagnostics, he is passionate about accelerating research and improving healthcare through genomics with cutting-edge sequencing technologies and cloud computing.

How Rufus scales conversational shopping experiences to millions of Am …

Our team at Amazon builds Rufus, an AI-powered shopping assistant which delivers intelligent, conversational experiences to delight our customers.

More than 250 million customers have used Rufus this year. Monthly users are up 140% YoY and interactions are up 210% YoY. Additionally, customers that use Rufus during a shopping journey are 60% more likely to complete a purchase. To make this possible, our team carefully evaluates every decision, aiming to focus on what matters most: building the best agentic shopping assistant experience. By focusing on customer-driven features, Rufus is now smarter, faster, and more useful.
In this post, we’ll share how our adoption of Amazon Bedrock accelerated the evolution of Rufus.
Building a customer-driven architecture
Defining clear use cases is fundamental to shaping both requirements and implementation, and building an AI-powered shopping assistant is no exception. For a shopping assistant like Rufus, our use cases align with the kinds of questions customers ask, and we aim to exceed their expectations with every answer. For example, a customer may want to know something factual about the shoes they’re considering and ask, “are these shoes waterproof?” Another customer may want to ask Rufus for recommendations and ask, “give me a few good options for shoes suitable for marathon running.” These examples represent just a fraction of the diverse question types we designed Rufus to support by working backwards from customer use cases.
After we define our customer use cases, we design Rufus with the entire stack in mind so it works seamlessly for customers. From initial release to subsequent iterations, we collect metrics to see how well Rufus is doing, with the aim of continuously improving. This means not only measuring how accurately questions are answered using tools like LLM-as-a-judge, but also analyzing factors such as latency, repeat customer engagement, and the number of conversation turns per interaction, to gain deeper insights into customer engagement.
Expanding beyond our in-house LLM
We first launched Rufus by building our own in-house large language model (LLM). The decision to build a custom LLM was driven by the need for a model specialized in shopping domain questions. At first, we considered off-the-shelf models, but most of these did not do well in our shopping evaluations (evals). Other models came with the cost of being larger, and therefore slower and more costly. We didn’t need a model that did well across many domains; we needed a model that did well in the shopping domain while maintaining high accuracy, low latency, and cost performance. By building our custom LLM and deploying it using AWS silicon, we were able to go into production worldwide, supporting large-scale events such as Prime Day, when we used 80,000 AWS Inferentia and Trainium chips.
After the initial success of Rufus, we aimed to expand into use cases requiring advanced reasoning, larger context windows, and multi-step reasoning. However, training an LLM presents a significant challenge: iterations can take weeks or months to complete. With newer more capable models being released at an accelerated pace, we aimed to improve Rufus as quickly as possible and began to evaluate and adopt state-of-the-art models rapidly. To launch these new features and build a truly remarkable shopping assistant Amazon Bedrock was the natural solution.
Accelerating Rufus with Amazon Bedrock
Amazon Bedrock is a comprehensive, secure, and flexible platform for building generative AI applications and agents. Amazon Bedrock connects you to leading foundation models (FMs), services to deploy and operate agents, and tools for fine-tuning, safeguarding, and optimizing models along with knowledge bases to connect applications to your latest data so that you have everything you need to quickly move from experimentation to real-world deployment. Amazon Bedrock gives you access to hundreds of FMs from leading AI companies along with evaluation tools to pick the best model based on your unique performance and cost needs.
Amazon Bedrock provides us great value by:

Managing hosting of leading foundation models (FMs) from different providers and making them available through model-agnostic interfaces such as the Converse API. By providing access to frontier models, Bedrock lets us evaluate and integrate them quickly with minimal changes to our existing systems. This increased our velocity. We can use the best model for the task while balancing characteristics like cost, latency, and accuracy.
Removing significant operational overhead from the Rufus team, such as managing model hosting infrastructure, handling scaling challenges, or maintaining model serving pipelines around the world where Amazon operates. Bedrock handles the heavy lifting, allowing customers to concentrate on building innovative solutions for their unique needs.
Providing global availability for consistent deployment supporting multiple geographic regions. By using Amazon Bedrock we launched in new marketplaces quickly with minimal effort.

Models hosted by Amazon Bedrock also help Rufus support a wide range of experiences across modalities, including text and images. Even within a particular modality like text-to-text, use cases can vary in complexity, traffic, and latency requirements. Some scenarios, such as “planning a camping trip,” “gift recommendations for my mom,” or style advice, require deeper reasoning, multi-turn dialogue, and access to tools like web search to provide contextually rich, personalized answers. Straightforward product inquiries, such as “what is the wattage on this drill?”, can be handled efficiently by smaller, faster models.
Our strategy combines multiple models to power Rufus, including Amazon Nova, Anthropic’s Claude Sonnet, and our custom model, so we can deliver the most reliable, fast, and intuitive customer experience possible.
Integrating Amazon Bedrock with Rufus
With Amazon Bedrock, we can evaluate and select the optimal model for each query type, balancing answer quality, latency, and engagement. Using Amazon Bedrock increased our development velocity by over 6x. Using multiple models gives us the ability to break down a conversation into granular pieces. By doing so, we’re able to answer questions more effectively and we’ve seen meaningful benefits. After we know what models we plan to use, we also take a hybrid approach to providing the model proper context to perform its task effectively. In some cases, we may already have the context that Rufus needs to answer a question. For example, if we know a customer is asking a question about their previous orders, we can provide their order history to the initial inference request of the model. This optimizes the number of inference calls we need to make and also provides more determinism to help avoid downstream errors. In other cases, we can defer the decision to the model, and when it believes it needs more information it can use a tool to retrieve additional context.
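While the Rufus internals are not public, the pattern described here maps onto the Bedrock Converse API’s tool-use support. The sketch below is illustrative only (tool name, schema, and model choice are assumptions) and shows how pre-fetched context and a deferred tool can coexist in a single request.

import boto3

bedrock = boto3.client("bedrock-runtime")

# A tool the model may call when it decides it needs more information.
tool_config = {
    "tools": [{
        "toolSpec": {
            "name": "get_price_history",                     # hypothetical tool name
            "description": "Return historical prices for a product ID.",
            "inputSchema": {"json": {
                "type": "object",
                "properties": {"product_id": {"type": "string"}},
                "required": ["product_id"],
            }},
        }
    }]
}

# Context we already have (for example, recent order history) is supplied up front,
# so the model does not need an extra round trip to fetch it.
order_history = "2024-11-02: trail running shoes; 2024-10-15: camping stove"

response = bedrock.converse(
    modelId="amazon.nova-pro-v1:0",                          # illustrative model choice
    system=[{"text": "You are a shopping assistant. Use tools when you need live data."}],
    messages=[{
        "role": "user",
        "content": [{"text": f"Customer order history: {order_history}\n"
                             "Question: Has this item been on sale in the past thirty days?"}],
    }],
    toolConfig=tool_config,
)

# If the model requests the tool, stopReason is "tool_use"; the application executes
# the tool and sends the result back in a follow-up converse call.
print(response["stopReason"])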
We found that it’s very important to ground the model with the proper information. One of the ways we do this is with Amazon Nova Web Grounding, which interacts with web browsers to retrieve and cite authoritative internet sources, significantly reducing answer defects and improving accuracy and customer trust. In addition to optimizing model accuracy, we’ve also used Amazon Bedrock features such as prompt caching and parallel tool calling to decrease latency wherever possible. These optimizations, from model response quality to service latency, mean that customers who use Rufus are 60% more likely to complete a purchase.
Agentic functionality through tool integration
More importantly, the Amazon Bedrock architecture supports agentic capabilities that make Rufus more useful for shoppers through tool use. Using models on Bedrock, Rufus can dynamically call services as tools to provide personalized, real-time, accurate information or take actions on behalf of the user. When a customer asks Rufus about product availability, pricing, or specifications, Rufus goes far beyond its built-in knowledge. It retrieves relevant information such as your order history and uses integrated tools at inference time to query live databases, check the latest product catalog, and access real-time data. To be more personal, Rufus now has account memory, understanding customers based on their individual shopping activity. Rufus can use information you may have shared previously, such as hobbies you enjoy or a previous mention of a pet, to provide a much more personalized and effective experience.

When building these agentic capabilities, it might be necessary to build a service for your agent to interact with so it can be more effective. For example, Rufus has a Price History feature on the product detail page that lets customers instantly view historical pricing to see if they’re getting a great deal. Shoppers can ask Rufus directly for price history while browsing (for example, “Has this item been on sale in the past thirty days?”) or set an agentic price alert to be notified when a product reaches a target price (“Buy these headphones when they’re 30% off”). With the auto-buy feature, Rufus can complete the purchase on your behalf within 30 minutes of the desired price being met, finalizing the order with your default payment and shipping details. Auto-buy requests remain active for six months, and customers currently using this feature are saving an average of 20% per purchase. The agent itself creates a persistent record in the price alert and auto-buy service, but the system then uses traditional software to manage that record and act on it. This tight integration of models, tools, and services transforms Rufus into a truly dynamic, personalized shopping agent.
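As a loose sketch of that division of labor (hypothetical names, not Amazon’s actual service), the agent’s only job below is to write a durable alert record through a tool call; a plain scheduled job, ordinary software rather than a model, later checks prices and triggers the auto-buy.

# Hypothetical sketch: the agent creates a persistent record, and traditional
# software (a scheduled job) watches prices and acts on it. Names are invented.
from dataclasses import dataclass, field
from datetime import datetime, timedelta

@dataclass
class PriceAlert:
    customer_id: str
    product_id: str
    target_price: float
    auto_buy: bool = False
    created_at: datetime = field(default_factory=datetime.utcnow)

    def expired(self) -> bool:
        # Auto-buy requests stay active for roughly six months, per the feature above.
        return datetime.utcnow() > self.created_at + timedelta(days=182)

ALERT_STORE: list[PriceAlert] = []  # stand-in for a durable datastore

def create_price_alert_tool(customer_id: str, product_id: str,
                            target_price: float, auto_buy: bool = False) -> dict:
    """Tool the agent calls once; it records intent but does not buy anything."""
    alert = PriceAlert(customer_id, product_id, target_price, auto_buy)
    ALERT_STORE.append(alert)
    return {"status": "created", "product_id": product_id, "target_price": target_price}

def price_watcher_job(get_current_price, place_order):
    """Traditional, non-LLM code: runs on a schedule and acts on stored records."""
    for alert in list(ALERT_STORE):
        if alert.expired():
            ALERT_STORE.remove(alert)
            continue
        price = get_current_price(alert.product_id)
        if price <= alert.target_price and alert.auto_buy:
            place_order(alert.customer_id, alert.product_id)  # default payment and shipping
            ALERT_STORE.remove(alert)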

Beyond price tracking, Rufus supports natural, conversational reordering. Customers can simply say, “Reorder everything we used to make pumpkin pie last week,” or “Order the hiking boots and poles I browsed yesterday.” Rufus connects the dots between past activity and current intent and can suggest alternatives if items are unavailable. Rufus uses agentic AI capabilities to automatically add products to the cart for quick review and checkout. In these scenarios, Rufus can determine when to gather information to provide a better answer or to perform an action that’s directed by the customer. These are just two examples of the many agentic features we’ve launched.
The result: AI-powered shopping at Amazon scale
By using Amazon Bedrock, Rufus demonstrates how organizations can build sophisticated AI applications that scale to serve millions of users. The combination of flexible model selection, managed infrastructure, and agentic capabilities enables Amazon to deliver a shopping assistant that’s both intelligent and practical while maintaining tight controls on accuracy, latency, and cost. If you are considering your own AI initiatives, Rufus showcases Bedrock’s potential to simplify the journey from AI experimentation to production deployment, allowing you to focus on customer value rather than infrastructure complexity. We encourage you to try Amazon Bedrock, realize the same benefits we have, and focus on your agentic solutions and their core capabilities.

About the authors
James Park is a ML Specialist Solutions Architect at Amazon Web Services. He works with Amazon.com to design, build, and deploy technology solutions on AWS, and has a particular interest in AI and machine learning. In his spare time he enjoys seeking out new cultures, new experiences, and staying up to date with the latest technology trends.
Shrikar Katti is a Principal TPM at Amazon. His current focus is on driving end-to-end delivery, strategy, and cross-org alignment for a large-scale AI product that transforms the Amazon shopping experience, while ensuring safety, scalability, and operational excellence. In his spare time, he enjoys playing chess and exploring the latest advancements in AI.
Gaurang Sinkar is a Principal Engineer at Amazon. His recent focus is on scaling, performance engineering, and optimizing generative AI solutions. Beyond work, he enjoys spending time with family, traveling, occasional hiking, and playing cricket.
Sean Foo is an engineer at Amazon. His recent focus is building low latency customer experiences and maintaining highly available systems at Amazon scale. In his spare time, he enjoys playing video and board games with friends and wandering around.
Saurabh Trikande is a Senior Product Manager for Amazon Bedrock and Amazon SageMaker Inference. He is passionate about working with customers and partners, motivated by the goal of democratizing AI. He focuses on core challenges related to deploying complex AI applications, inference with multi-tenant models, cost optimizations, and making the deployment of generative AI models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch, and spending time with his family.
Somu Perianayagam is an Engineer at AWS specializing in distributed systems for Amazon DynamoDB and Amazon Bedrock. He builds large-scale, resilient architectures that help customers achieve consistent performance across regions, simplify their data paths, and operate reliably at massive scale.

An Implementation of a Comprehensive Empirical Framework for Benchmark …

In this tutorial, we dive deep into how we systematically benchmark agentic components by evaluating multiple reasoning strategies across diverse tasks. We explore how different architectures, such as Direct, Chain-of-Thought, ReAct, and Reflexion, behave when faced with problems of increasing difficulty, and we quantify their accuracy, efficiency, latency, and tool-usage patterns. By conducting controlled empirical studies, we gain a clearer understanding of why certain agentic strategies succeed, where they fail, and how they trade off speed for depth of reasoning. Check out the FULL CODES here.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from typing import List, Dict, Callable, Tuple
from dataclasses import dataclass
from enum import Enum
import time
from collections import defaultdict

class ReasoningStrategy(Enum):
    DIRECT = "direct"
    CHAIN_OF_THOUGHT = "chain_of_thought"
    REACT = "react"
    REFLEXION = "reflexion"

@dataclass
class AgentResponse:
    answer: str
    steps: int
    time_taken: float
    tool_calls: int
    confidence: float

class BaseAgent:
    def __init__(self, strategy: ReasoningStrategy):
        self.strategy = strategy
        self.tool_count = 0

    def solve(self, problem: str) -> AgentResponse:
        start_time = time.time()
        if self.strategy == ReasoningStrategy.DIRECT:
            answer, steps, tools = self._direct_solve(problem)
        elif self.strategy == ReasoningStrategy.CHAIN_OF_THOUGHT:
            answer, steps, tools = self._cot_solve(problem)
        elif self.strategy == ReasoningStrategy.REACT:
            answer, steps, tools = self._react_solve(problem)
        else:
            answer, steps, tools = self._reflexion_solve(problem)
        time_taken = time.time() - start_time
        confidence = self._calculate_confidence(problem, answer)
        return AgentResponse(answer, steps, time_taken, tools, confidence)

We set up the foundation of our benchmarking framework by importing essential libraries and defining the core agent architectures. We establish different reasoning strategies and construct the BaseAgent class, giving ourselves a flexible structure to simulate diverse agentic behaviors. Through this setup, we establish a unified interface that all agents follow during evaluation. Check out the FULL CODES here.

    def _direct_solve(self, problem: str) -> Tuple[str, int, int]:
        answer = self._compute_answer(problem)
        return answer, 1, 0

    def _cot_solve(self, problem: str) -> Tuple[str, int, int]:
        steps = 3 + len(problem.split()) // 5
        for i in range(steps):
            _ = self._reason_step(problem, i)
        answer = self._compute_answer(problem)
        return answer, steps, 0

    def _react_solve(self, problem: str) -> Tuple[str, int, int]:
        steps = 4
        tool_calls = 2
        for i in range(steps):
            _ = self._reason_step(problem, i)
            if i % 2 == 0:
                self._use_tool(problem)
        answer = self._compute_answer(problem)
        return answer, steps, tool_calls

    def _reflexion_solve(self, problem: str) -> Tuple[str, int, int]:
        steps = 6
        tool_calls = 1
        initial_answer = self._compute_answer(problem)
        reflection = self._reflect(problem, initial_answer)
        answer = self._refine(problem, initial_answer, reflection)
        return answer, steps, tool_calls

    def _reason_step(self, problem: str, step: int) -> str:
        return f"Analyzing aspect {step+1}"

    def _use_tool(self, problem: str):
        self.tool_count += 1
        time.sleep(0.001)

    def _compute_answer(self, problem: str) -> str:
        return f"Solution_{hash(problem) % 100}"

    def _reflect(self, problem: str, answer: str) -> str:
        return "Reflection on approach"

    def _refine(self, problem: str, answer: str, reflection: str) -> str:
        return f"Refined_{answer}"

    def _calculate_confidence(self, problem: str, answer: str) -> float:
        base_confidence = 0.7
        strategy_bonus = {
            ReasoningStrategy.DIRECT: 0.0,
            ReasoningStrategy.CHAIN_OF_THOUGHT: 0.1,
            ReasoningStrategy.REACT: 0.15,
            ReasoningStrategy.REFLEXION: 0.2
        }
        return min(1.0, base_confidence + strategy_bonus[self.strategy] + np.random.uniform(-0.1, 0.1))

We implement how each reasoning strategy behaves internally, including direct answering, chain-of-thought reasoning, ReAct-style interleaving, and Reflexion-based refinement. We simulate reasoning steps, tool usage, and confidence estimation to capture realistic agent behavior patterns. Here, we shape the dynamic personality of each agentic strategy we benchmark. Check out the FULL CODES here.

class BenchmarkTask:
    def __init__(self, name: str, difficulty: float, ground_truth: str):
        self.name = name
        self.difficulty = difficulty
        self.ground_truth = ground_truth

    def evaluate(self, response: AgentResponse) -> Dict[str, float]:
        accuracy = response.confidence * (1 - self.difficulty * 0.3)
        return {
            'accuracy': accuracy,
            'efficiency': 1.0 / (response.steps + 1),
            'latency': response.time_taken,
            'tool_efficiency': 1.0 / (response.tool_calls + 1)
        }

class BenchmarkSuite:
    def __init__(self):
        self.tasks = self._create_tasks()

    def _create_tasks(self) -> List[BenchmarkTask]:
        tasks = []
        task_types = [
            ("Math_Problem", 0.3),
            ("Logic_Puzzle", 0.5),
            ("Code_Debug", 0.6),
            ("Complex_Reasoning", 0.8),
            ("Multi_Step_Planning", 0.7)
        ]
        for i, (task_type, difficulty) in enumerate(task_types):
            for j in range(3):
                task = BenchmarkTask(
                    name=f"{task_type}_{j+1}",
                    difficulty=difficulty + np.random.uniform(-0.1, 0.1),
                    ground_truth=f"GT_{i}_{j}"
                )
                tasks.append(task)
        return tasks

    def run_benchmark(self, agents: List[BaseAgent]) -> pd.DataFrame:
        results = []
        for agent in agents:
            for task in self.tasks:
                response = agent.solve(task.name)
                metrics = task.evaluate(response)
                results.append({
                    'strategy': agent.strategy.value,
                    'task': task.name,
                    'difficulty': task.difficulty,
                    'accuracy': metrics['accuracy'],
                    'efficiency': metrics['efficiency'],
                    'latency': metrics['latency'],
                    'tool_efficiency': metrics['tool_efficiency'],
                    'steps': response.steps,
                    'tool_calls': response.tool_calls
                })
        return pd.DataFrame(results)

We build the complete benchmark suite that generates tasks, executes them across multiple agents, and collects standardized results. We design varied task types and difficulty levels to observe how each reasoning strategy adapts under pressure. This snippet allows us to create a reproducible and systematic evaluation pipeline. Check out the FULL CODES here.

def analyze_results(df: pd.DataFrame):
    agg_metrics = df.groupby('strategy').agg({
        'accuracy': ['mean', 'std'],
        'efficiency': ['mean', 'std'],
        'latency': ['mean', 'std'],
        'steps': 'mean',
        'tool_calls': 'mean'
    }).round(3)
    print(agg_metrics)

    diff_bins = pd.cut(df['difficulty'], bins=3, labels=['Easy', 'Medium', 'Hard'])
    diff_analysis = df.groupby(['strategy', diff_bins])['accuracy'].mean().unstack()
    print(diff_analysis.round(3))

    tradeoff = df.groupby('strategy').agg({
        'accuracy': 'mean',
        'steps': 'mean',
        'latency': 'mean'
    })
    tradeoff['score'] = (tradeoff['accuracy'] / (tradeoff['steps'] * tradeoff['latency'])).round(3)
    print(tradeoff.round(3))

def visualize_results(df: pd.DataFrame):
    fig, axes = plt.subplots(2, 2, figsize=(14, 10))
    sns.barplot(data=df, x='strategy', y='accuracy', ax=axes[0, 0], errorbar='sd')
    axes[0, 0].set_title('Accuracy by Strategy')
    axes[0, 0].tick_params(axis='x', rotation=45)

    for strategy in df['strategy'].unique():
        strategy_df = df[df['strategy'] == strategy]
        axes[0, 1].scatter(strategy_df['steps'], strategy_df['accuracy'], label=strategy, alpha=0.6, s=50)
    axes[0, 1].set_title('Steps vs Accuracy')
    axes[0, 1].legend()

    difficulty_bins = pd.cut(df['difficulty'], bins=3, labels=['Easy', 'Medium', 'Hard'])
    df_plot = df.copy()
    df_plot['difficulty_bin'] = difficulty_bins
    sns.boxplot(data=df_plot, x='difficulty_bin', y='accuracy', hue='strategy', ax=axes[1, 0])
    axes[1, 0].set_title('Performance vs Difficulty')

    scores = df.groupby('strategy').apply(
        lambda x: x['accuracy'].mean() / (x['steps'].mean() * x['latency'].mean())
    ).sort_values()
    axes[1, 1].barh(range(len(scores)), scores.values)
    axes[1, 1].set_yticks(range(len(scores)))
    axes[1, 1].set_yticklabels(scores.index)
    axes[1, 1].set_title('Overall Efficiency Score')

    plt.tight_layout()
    plt.show()

We perform detailed analysis and visualization to understand how strategies differ across metrics like accuracy, efficiency, and latency. We aggregate results, compare performance across difficulty levels, and visualize trade-offs to uncover deeper insights. This step empowers us to interpret the outcomes rather than just compute them. Check out the FULL CODES here.

if __name__ == "__main__":
    agents = [
        BaseAgent(ReasoningStrategy.DIRECT),
        BaseAgent(ReasoningStrategy.CHAIN_OF_THOUGHT),
        BaseAgent(ReasoningStrategy.REACT),
        BaseAgent(ReasoningStrategy.REFLEXION)
    ]

    suite = BenchmarkSuite()
    results_df = suite.run_benchmark(agents)

    analyze_results(results_df)
    visualize_results(results_df)

    print("1. Advanced strategies achieve higher accuracy but require more steps")
    print("2. Chain-of-thought balances accuracy and efficiency")
    print("3. Direct is fastest but less reliable on hard tasks")
    print("4. All strategies degrade on harder tasks but advanced ones degrade slowly")

We bring everything together by running the benchmark suite on all agents and printing the key findings. We execute the analysis pipeline, visualize comparative results, and interpret how strategies behave under identical conditions. This snippet completes the loop, allowing us to observe empirical patterns and derive meaningful conclusions.

In conclusion, we observe how different agentic reasoning paradigms perform when subjected to identical benchmark conditions, and we gain practical insight into how these strategies scale with increasing complexity. As we analyze patterns in accuracy, step count, latency, and tool efficiency, we recognize how advanced strategies succeed through deeper reasoning while incurring computational overhead. We now stand equipped with a structured empirical framework that helps us compare, debug, and optimize agentic behaviors, allowing us to build more capable, data-driven agentic systems.

Check out the FULL CODES here. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter. Wait! Are you on Telegram? Now you can join us on Telegram as well.
The post An Implementation of a Comprehensive Empirical Framework for Benchmarking Reasoning Strategies in Modern Agentic AI Systems appeared first on MarkTechPost.

Google Antigravity Makes the IDE a Control Plane for Agentic Coding

Google has introduced Antigravity as an agentic development platform that sits on top of Gemini 3. It is not just an autocomplete layer; it is an IDE where agents plan, execute, and explain complex software tasks across editor, terminal, and browser surfaces. Antigravity launched on November 18, 2025, alongside Gemini 3, as part of Google’s push toward agent centric developer tools.

What Antigravity Actually Is

Antigravity is described by Google as a new agentic development platform with a familiar AI powered IDE at its core. The goal is to evolve the IDE toward an agent first future, with browser control and asynchronous interaction patterns that let agents autonomously plan and execute end to end software tasks.

In practice, Antigravity looks and behaves like a modern AI editor but treats agents as first class workers. Agents can break down tasks, coordinate with other agents, edit files, run commands, and drive a browser. The developer operates at a task level, while the system manages the low level tool interactions.

Under the hood, Antigravity is an Electron application based on Visual Studio Code. It requires a Google account sign in and ships as a free public preview for macOS, Linux, and Windows.

Models, Pricing, And Runtime Environment

Antigravity exposes multiple foundation models inside the same agent framework. In the current preview, agents can use Gemini 3, Anthropic Claude Sonnet 4.5, and OpenAI GPT OSS models. This gives developers model optionality inside one IDE instead of binding them to a single vendor.

For individual users, Antigravity is available at no charge. Google describes the Gemini 3 Pro usage as subject to generous rate limits that refresh every 5 hours, and notes that only a small fraction of power users are expected to hit them.

Editor View And Manager View

Antigravity introduces 2 main work modes that match different developer mental models. Documentation and coverage consistently describe these as Editor view and Manager view.

Editor view is the default. It looks like a standard IDE with an agent in the side panel. The agent can read and edit files, suggest changes inline, and use the terminal and browser when needed.

Manager view lifts the abstraction from single files to multiple agents and workspaces. This is the place where you coordinate several agent runs rather than editing code line by line.

Artifacts, Not Raw Tool Logs

A key design element in Antigravity is the Artifact system. Instead of exposing only raw tool call logs, agents produce human readable artifacts that summarize what they are doing and why.

Artifacts are structured objects that can include task lists, implementation plans, walkthrough documents, screenshots, and browser recordings. They represent work at a task level rather than at an API call level and are designed to be easier for developers to verify than dense traces of model actions.
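Google has not published a schema for artifacts, but a rough mental model is a typed, task-level record with attachments and verification signals. The Python sketch below is purely illustrative of that idea and is not Antigravity’s actual data model.

# Purely illustrative: one way to think about a task-level artifact record.
# This is not Antigravity's real schema.
from dataclasses import dataclass, field
from enum import Enum

class ArtifactKind(Enum):
    TASK_LIST = "task_list"
    IMPLEMENTATION_PLAN = "implementation_plan"
    WALKTHROUGH = "walkthrough"
    SCREENSHOT = "screenshot"
    BROWSER_RECORDING = "browser_recording"

@dataclass
class Artifact:
    kind: ArtifactKind
    title: str
    summary: str                                            # human readable, task-level description
    attachments: list[str] = field(default_factory=list)    # file paths, image or video references
    verification: dict = field(default_factory=dict)        # e.g. tests run, commands executed
    comments: list[str] = field(default_factory=list)       # developer feedback the agent can consume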

Google positions this as a response to a trust problem in current agent frameworks. Many tools either show every internal step, which overwhelms users, or hide everything and only show the final code diff. Antigravity tries to sit in the middle by surfacing task level artifacts plus enough verification signals so that a developer can audit what the agent did.

Four Design Tenets And Feedback Channels

Antigravity is explicitly built around 4 tenets: trust, autonomy, feedback, and self improvement.

Trust is handled through artifacts and verification steps. Autonomy comes from giving agents access to multiple surfaces, editor, terminal, and browser, so they can run more complex workflows without constant prompts. Feedback is enabled through comments on artifacts, and self improvement is tied to agents learning from past work and reusing successful procedures.

Antigravity allows developers to comment directly on specific artifacts, including text and screenshots. Agents can incorporate this feedback into their ongoing work without discarding the current run. This lets you correct a partial misunderstanding without restarting the whole task.

The platform also exposes a knowledge feature where agents can retain snippets of code or sequences of steps from earlier tasks. Over time, this becomes a reusable internal playbook that agents can query, rather than rediscovering the same strategies for each new project.

Key Takeaways

Antigravity is an agent first development platform that turns the IDE into a control plane where agents operate across editor, terminal and browser surfaces, instead of a narrow inline assistant.

The system is a Visual Studio Code fork that runs as a free public preview on Windows, macOS and Linux, with generous Gemini 3 Pro rate limits and optional use of Claude Sonnet 4.5 and GPT OSS.

Antigravity exposes 2 main modes, Editor view for hands on coding with an agent sidebar and Manager view as a mission control interface to orchestrate multiple agents and workspaces asynchronously.

Agents emit Artifacts: task lists, implementation plans, screenshots, browser recordings and more, which act as verifiable evidence of work instead of raw tool logs and enable asynchronous review workflows.

Feedback and self improvement are built in, developers can attach Google Docs style comments to artifacts across surfaces, and agents incorporate this feedback and learn from a development knowledge base without restarting tasks.

Editorial Comments

Google Antigravity is a pragmatic step toward agentic development. It anchors Gemini 3 Pro inside a real IDE workflow, exposes Editor view and Manager view for supervising agents, and enforces task level visibility through Artifacts. The four tenets, trust, autonomy, feedback, self improvement, are grounded in verifiable outputs and persistent knowledge rather than opaque traces. Overall, Antigravity treats the IDE as a governed environment for autonomous agents, not a chat window with code actions.

Check out the FULL TECHNICAL DETAILS here. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter. Wait! Are you on Telegram? Now you can join us on Telegram as well.
The post Google Antigravity Makes the IDE a Control Plane for Agentic Coding appeared first on MarkTechPost.