How to Build a Model-Native Agent That Learns Internal Planning, Memor …

In this tutorial, we explore how an agent can internalize planning, memory, and tool use within a single neural model rather than relying on external orchestration. We design a compact, model-native agent that learns to perform arithmetic reasoning tasks through reinforcement learning. By combining a stage-aware actor-critic network with a curriculum of increasingly complex environments, we enable the agent to discover how to use internalized “tools” and short-term memory to reach correct solutions end-to-end. We work step by step to observe how learning evolves from simple reasoning to multi-step compositional behavior. Check out the FULL CODES here.

import math, random, torch, torch.nn as nn, torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"; torch.manual_seed(0); random.seed(0)
V = 18; CTX = 10; MUL, ADD, SUB, ANS, STO, RCL, EOS = 11, 12, 13, 14, 15, 16, 17
tok2str = {**{i: str(i) for i in range(10)}, CTX: "[CTX]", MUL: "[MUL]", ADD: "[ADD]", SUB: "[SUB]",
           ANS: "[ANS]", STO: "[STO]", RCL: "[RCL]", EOS: "[EOS]"}

class ToolEnv:
    def __init__(self, max_steps=7):
        self.max_steps = max_steps

    def sample(self, stage):
        a, b, c, d, e = [random.randint(0, 9) for _ in range(5)]
        if stage == 0:
            ctx = [a, b, c]; target = a*b + c
        elif stage == 1:
            ctx = [a, b, c, d]; target = (a*b + c) - d
        else:
            ctx = [a, b, c, d, e]; target = (a*b + c) - (d*e)
        return ctx, target, (a, b, c, d, e)

    def step_seq(self, actions, abc, stage):
        a, b, c, d, e = abc; last = None; mem = None; steps = 0; shaped = 0.0
        goal0 = a*b; goal1 = goal0 + c; goal2 = goal1 - d; goal3 = d*e; goal4 = goal1 - goal3
        for act in actions:
            steps += 1
            if act == MUL:
                last = a*b if last is None else last*(d if stage > 0 else 1)
            elif act == ADD and last is not None:
                last += c
            elif act == SUB and last is not None:
                last -= e if (stage == 2 and mem == "use_d") else (d if stage > 0 else 0)
            elif act == STO:
                mem = "use_d" if stage >= 1 else "ok"
            elif act == RCL and mem is not None:
                last = (d*e) if (stage == 2 and mem == "use_d") else (last if last else 0)
            elif act == ANS:
                # Final target per stage, kept consistent with the targets produced by sample().
                target = [goal1, goal2, goal4][stage]
                correct = (last == target)
                if stage == 0: shaped += 0.25*(last == goal0) + 0.5*(last == goal1)
                if stage == 1: shaped += 0.25*(last == goal0) + 0.5*(last == goal1) + 0.75*(last == goal2)
                if stage == 2: shaped += 0.2*(last == goal0) + 0.4*(last == goal1) + 0.6*(last == goal4) + 0.6*(last == goal3)
                return (1.0 if correct else 0.0) + 0.2*shaped, steps
            if steps >= self.max_steps: break
        return 0.0, steps

We begin by setting up the environment and defining the symbolic tools our agent can use. We create a small synthetic world where each action, such as multiplication, addition, or subtraction, acts as an internal tool. This environment enables us to simulate reasoning tasks in which the agent must plan sequences of tool use to arrive at the correct answer. Check out the FULL CODES here.
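Before training, a quick optional sanity check (not part of the original tutorial) helps confirm the environment behaves as described: sample a stage-0 task and score a hand-written tool sequence with the classes defined above.

# Illustrative sanity check of the environment defined above.
env_demo = ToolEnv()
ctx, target, abcde = env_demo.sample(stage=0)                     # ctx = [a, b, c], target = a*b + c
reward, steps = env_demo.step_seq([MUL, ADD, ANS], abcde, stage=0)
print("context:", ctx, "target:", target, "reward:", reward, "steps:", steps)
# A correct [MUL] [ADD] [ANS] sequence solves stage 0 and earns the full reward of 1.0 plus a small shaping bonus.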

class ActorCritic(nn.Module):
    def __init__(self, V, d=96, nstage=3):
        super().__init__()
        self.emb = nn.Embedding(V, d); self.stage_emb = nn.Embedding(nstage, d)
        self.rnn = nn.GRU(d, d, 1, batch_first=True); self.pi = nn.Linear(d, V); self.v = nn.Linear(d, 1)

    def forward(self, ctx, stage, max_len=6, greedy=False):
        B = ctx.shape[0]
        ce = self.emb(ctx) + self.stage_emb(stage).unsqueeze(1)   # (B, L, d): token embeddings plus stage embedding
        h = torch.tanh(ce.mean(1)).unsqueeze(0)                   # (1, B, d): per-example initial GRU hidden state
        inp = self.emb(torch.full((B, 1), CTX, device=device))
        acts, logps, ents, vals = [], [], [], []
        for _ in range(max_len):
            out, h = self.rnn(inp, h); val = self.v(out[:, -1]); logits = self.pi(out[:, -1])
            pi = F.log_softmax(logits, dim=-1).exp(); ent = -(pi * torch.log(pi + 1e-9)).sum(1)
            a = torch.argmax(logits, 1) if greedy else torch.distributions.Categorical(pi).sample()
            logp = F.log_softmax(logits, dim=-1).gather(1, a.unsqueeze(1)).squeeze(1)
            inp = self.emb(a.unsqueeze(1))
            acts.append(a); logps.append(logp); ents.append(ent); vals.append(val.squeeze(1))
        return torch.stack(acts, 1), torch.stack(logps, 1), torch.stack(ents, 1), torch.stack(vals, 1)

We then design our model-native policy using an actor-critic structure built around a GRU. We embed both tokens and task stages, allowing the network to adapt its reasoning depth according to task complexity. This setup enables the agent to learn contextually when and how to use internal tools within a single unified model. Check out the FULL CODES here.
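As a quick optional check of the interface defined above (not part of the original tutorial), we can run one forward pass on a dummy batch and confirm that each output has shape (batch, max_len):

# Illustrative shape check for the ActorCritic defined above.
demo_net = ActorCritic(V).to(device)
demo_ctx = torch.tensor([[3, 4, 5, CTX], [7, 2, 9, CTX]], device=device)   # two padded stage-0 contexts
demo_stage = torch.tensor([0, 0], device=device)
acts, logps, ents, vals = demo_net(demo_ctx, demo_stage, max_len=6)
print(acts.shape, logps.shape, vals.shape)   # each is torch.Size([2, 6])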

env = ToolEnv(); net = ActorCritic(V).to(device)
opt = torch.optim.Adam(net.parameters(), lr=3e-4)

def pad_batch(ctxs):
    L = max(len(c) + 1 for c in ctxs)
    out = torch.full((len(ctxs), L), EOS, dtype=torch.long, device=device)
    for i, c in enumerate(ctxs): out[i, :len(c)+1] = torch.tensor(c + [CTX], device=device)
    return out

def run_batch(stage, batch=128, train=True, greedy=False):
    ctxs = []; metas = []
    for _ in range(batch):
        c, t, abc = env.sample(stage); ctxs.append(c); metas.append((t, abc))
    ctx = pad_batch(ctxs); stage_t = torch.full((batch,), stage, device=device, dtype=torch.long)
    acts, logps, ents, vals = net(ctx, stage_t, max_len=6, greedy=greedy)
    rewards = []
    for i in range(batch):
        traj = acts[i].tolist()
        abc = metas[i][1]
        r, _ = env.step_seq(traj, abc, stage)
        rewards.append(r)
    R = torch.tensor(rewards, device=device).float()
    adv = (R - vals.sum(1)).detach()
    if not train: return R.mean().item(), 0.0
    pg = -(logps.sum(1) * adv).mean(); vloss = F.mse_loss(vals.sum(1), R); ent = -ents.mean()
    loss = pg + 0.5*vloss + 0.01*ent
    opt.zero_grad(); loss.backward(); nn.utils.clip_grad_norm_(net.parameters(), 1.0); opt.step()
    return R.mean().item(), loss.item()

We implement the reinforcement learning training loop using an advantage actor-critic (A2C) update. We train the agent end-to-end across batches of synthetic problems, updating policy and value networks simultaneously. Here, we incorporate entropy regularization to promote exploration and prevent premature convergence. Check out the FULL CODES here.
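Written out, the loss computed above corresponds to a standard A2C-style objective, where R is the episode reward and V the summed per-step value prediction used in the code:

\mathcal{L} = -\,\mathbb{E}\Big[\big(\textstyle\sum_t \log \pi_\theta(a_t \mid s)\big)\,(R - V)\Big] \;+\; 0.5\,(R - V)^2 \;-\; 0.01\,\mathcal{H}(\pi_\theta)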

print("Training...")
stages = [0, 0, 0, 1, 1, 2]
for ep in range(1, 61):
    stage = stages[min((ep-1)//10, len(stages)-1)]
    acc, loss = run_batch(stage, batch=192, train=True)
    if ep % 5 == 0:
        with torch.no_grad():
            evals = [run_batch(s, train=False, greedy=True)[0] for s in [0, 1, 2]]
        print(f"ep={ep:02d} stage={stage} acc={acc:.3f} | eval T0={evals[0]:.3f} "
              f"T1={evals[1]:.3f} T2={evals[2]:.3f} loss={loss:.3f}")

We start the main training process using a curriculum strategy where tasks gradually increase in difficulty. As we train, we evaluate the agent on all stages to observe its ability to generalize from simpler to more complex reasoning steps. The printed metrics show how internal planning improves over time. Check out the FULL CODES here.

def explain(stage):
    c, t, abc = env.sample(stage)
    ctx = pad_batch([c]); stage_t = torch.tensor([stage], device=device)
    with torch.no_grad(): a, _, _, _ = net(ctx, stage_t, greedy=True)
    seq = [tok2str[x] for x in a[0].tolist()]
    r, _ = env.step_seq(a[0].tolist(), abc, stage)
    return dict(stage=stage, ctx=c, target=t, actions=" ".join(seq), reward=round(float(r), 2))

with torch.no_grad():
    for s in [0, 1, 2]:
        print(f"\nStage {s} samples:")
        for _ in range(5): print(explain(s))

with torch.no_grad():
    finals = [run_batch(s, train=False, greedy=True, batch=1000)[0] for s in [0, 1, 2]]
print(f"\nFinal greedy accuracies → T0={finals[0]:.3f}, T1={finals[1]:.3f}, T2={finals[2]:.3f}")

We finish by probing the trained agent and printing example reasoning trajectories. We visualize the sequence of tool tokens the model chooses and verify whether it reaches the correct result. Finally, we evaluate the overall performance, demonstrating that the model successfully integrates planning, memory, and reasoning into an internalized process.

In conclusion, we see that even a neural network can learn internalized planning and tool-use behaviors when trained with reinforcement signals. We successfully move beyond traditional pipeline-style architectures, where memory, planning, and execution are separate, toward a model-native agent that integrates these components as part of its learned dynamics. This approach represents a shift in agentic AI, demonstrating how end-to-end learning can produce emergent reasoning and self-organized decision-making without the need for handcrafted control loops.

Check out the FULL CODES here. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.
The post How to Build a Model-Native Agent That Learns Internal Planning, Memory, and Multi-Tool Reasoning Through End-to-End Reinforcement Learning appeared first on MarkTechPost.

OpenAI Introduces IndQA: A Culture Aware Benchmark For Indian Language …

How can we reliably test whether large language models actually understand Indian languages and culture in real world contexts? OpenAI has released IndQA, a benchmark that evaluates how well AI models understand and reason about questions that matter in Indian languages across cultural domains.

Why IndQA?

OpenAI states that about 80 percent of people worldwide do not speak English as their primary language. Yet most benchmarks that measure non English capabilities are still narrow and often rely on translation or multiple choice formats.

Benchmarks such as MMMLU and MGSM are now near saturation at the top end, where strong models cluster near similar scores. This makes it hard to see meaningful progress and does not test whether models understand local context, history and everyday life.

India is OpenAI’s starting point for new region focused benchmarks. India has about 1 billion people who do not use English as their primary language, 22 official languages with at least 7 spoken by more than 50 million people, and it is ChatGPT’s second largest market.

Dataset, Languages And Domains

IndQA evaluates knowledge and reasoning about Indian culture and everyday life in Indian languages. The benchmark spans 2,278 questions across 12 languages and 10 cultural domains, created with 261 domain experts from across India.

The cultural domains are Architecture and Design, Arts and Culture, Everyday Life, Food and Cuisine, History, Law and Ethics, Literature and Linguistics, Media and Entertainment, Religion and Spirituality, and Sports and Recreation. Items are written natively in Bengali, English, Hindi, Hinglish, Kannada, Marathi, Odia, Telugu, Gujarati, Malayalam, Punjabi and Tamil. Hinglish is included to reflect common code switching in Indian conversations.

Each datapoint contains four components: a culturally grounded prompt in an Indian language, an English translation for auditability, rubric criteria for grading, and an ideal answer that encodes expert expectations.

Rubric Based Evaluation Pipeline

IndQA uses a rubric based grading procedure instead of exact match accuracy. For each question, domain experts define multiple criteria that describe what a strong answer should include or avoid and assign a weight to each criterion.

A model based grader checks the candidate response against these criteria and marks which ones are satisfied. The final score is the sum of the weights of the satisfied criteria divided by the total possible score. This behaves like grading a short exam answer: it supports partial credit and captures nuance and cultural correctness, not only surface token overlap.
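As a rough illustration of this weighted-rubric scoring, the small Python sketch below computes a score for one hypothetical question; the criteria, weights, and grader verdicts are invented for the example and are not taken from IndQA.

# Hypothetical rubric for one IndQA-style question (criteria and weights invented for illustration).
rubric = [
    {"criterion": "Names the correct literary tradition", "weight": 3.0, "met": True},
    {"criterion": "Explains the regional historical context", "weight": 2.0, "met": True},
    {"criterion": "Avoids conflating two similar festivals", "weight": 2.0, "met": False},
    {"criterion": "Answers in the requested language register", "weight": 1.0, "met": True},
]

score = sum(c["weight"] for c in rubric if c["met"]) / sum(c["weight"] for c in rubric)
print(f"Rubric score: {score:.2f}")   # 6.0 / 8.0 = 0.75, i.e. partial credit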

Image source: https://openai.com/index/introducing-indqa/

Construction Process And Adversarial Filtering

OpenAI describes a four step construction pipeline:

First, they partnered with organizations in India to recruit experts across 10 domains. These experts are native level speakers of the target language and English and have deep subject expertise. They wrote difficult, reasoning heavy prompts anchored in regional context, such as literature, food history, law or media.

Second, they applied adversarial filtering. Every draft question was evaluated with OpenAI’s strongest models at creation time, GPT-4o, OpenAI o3, GPT-4.5 and, partially after public launch, GPT-5. Only questions where a majority of these models failed to produce acceptable answers were kept. This preserves headroom so that future model improvements show up clearly on IndQA.

Third, experts provided detailed criteria for grading each question, similar to an exam rubric. These criteria are reused whenever another model is evaluated on IndQA.

Fourth, experts wrote ideal answers and English translations and then performed peer review and iterative revisions until they signed off on quality.

Measuring Progress On Indian Languages

OpenAI uses IndQA to evaluate recent frontier models and to chart progress on Indian languages over the last couple of years. They report that model performance has improved significantly on IndQA while still leaving substantial room for improvement. Results are stratified by language and by domain and include comparisons of GPT-5 Thinking High with other frontier systems.

Key Takeaways

IndQA is a culturally grounded Indic benchmark: IndQA evaluates how well AI models understand and reason about questions that matter in Indian languages, across culturally specific domains, rather than only testing translation or multiple choice accuracy.

The dataset is expert built and reasonably large: The benchmark contains 2,278 questions across 12 languages and 10 cultural domains, developed in collaboration with 261 domain experts from across India, covering areas like architecture, everyday life, food, history and religion.

Evaluation is rubric based, not exact match: Each datapoint bundles a native language prompt, an English translation, a detailed grading rubric and an ideal answer, and model outputs are graded by a model based system that checks weighted expert defined criteria, which enables partial credit and nuanced cultural evaluation.

Questions are adversarially filtered against OpenAI’s strongest models: Draft questions were filtered by running GPT 4o, OpenAI o3, GPT 4.5 and partially GPT 5, and keeping only those items where most of these models failed, which preserves headroom for future models on IndQA.

Editorial Comments

IndQA is a timely step because it targets a real gap: most existing multilingual benchmarks over-index on English content and translation style tasks, while India has diverse high resource and low resource languages. IndQA brings expert curated, rubric based evaluation for questions that matter in Indian cultural contexts, and uses adversarial filtering against GPT 4o, OpenAI o3, GPT 4.5 and GPT 5 to preserve headroom for frontier models. This launch makes IndQA a practical north star for evaluating Indian language reasoning in modern AI systems.
The post OpenAI Introduces IndQA: A Culture Aware Benchmark For Indian Languages appeared first on MarkTechPost.

How Amazon Search increased ML training twofold using AWS Batch for Am …

In this post, we show you how Amazon Search optimized GPU instance utilization by leveraging AWS Batch for SageMaker Training jobs. This managed solution enabled us to orchestrate machine learning (ML) training workloads on GPU-accelerated instance families like P5, P4, and others. We will also provide a step-by-step walkthrough of the use case implementation.
Machine learning at Amazon Search
At Amazon Search, we use hundreds of GPU-accelerated instances to train and evaluate ML models that help our customers discover products they love. Scientists typically train more than one model at a time to find the optimal set of features, model architecture, and hyperparameter settings that optimize the model’s performance. We previously leveraged a first-in-first-out (FIFO) queue to coordinate model training and evaluation jobs. However, we needed more nuanced criteria to prioritize which jobs should run in what order: production models needed to run at high priority, exploratory research at medium priority, and hyperparameter sweeps and batch inference at low priority. We also needed a system that could handle interruptions. Should a job fail, or a given instance type become saturated, we needed the job to run on other available compatible instance types while respecting the overall prioritization criteria. Finally, we wanted a managed solution so we could focus more on model development instead of managing infrastructure.
After evaluating multiple options, we chose AWS Batch for Amazon SageMaker Training jobs because it best met our requirements. This solution seamlessly integrated AWS Batch with Amazon SageMaker and allowed us to run jobs per our prioritization criteria. This allows applied scientists to submit multiple concurrent jobs without manual resource management. By leveraging AWS Batch features such as advanced prioritization through fair-share scheduling, we increased peak utilization of GPU-accelerated instances from 40% to over 80%.
Amazon Search: AWS Batch for SageMaker Training Job implementation
We leveraged three AWS technologies to set up our job queue. We used Service Environments to configure the SageMaker AI parameters that AWS Batch uses to submit and manage SageMaker Training jobs. We used Share Identifiers to prioritize our workloads. Finally, we used Amazon CloudWatch to monitor jobs and provide alerting for critical events or deviations from expected behavior. Let’s dive deep into these constructs.
Service environments. We set up service environments to represent the total GPU capacity available for each instance family, such as P5s and P4s. Each service environment was configured with fixed limits based on our team’s reserved capacity in AWS Batch. Note that for teams using SageMaker Training Plans, these limits can be set to the number of reserved instances, making capacity planning more straightforward. By defining these boundaries, we established how the total GPU instance capacity within a service environment was distributed across different production jobs. Each production experiment was allocated a portion of this capacity through Share Identifiers.
Figure 1 provides a real-world example of how we used AWS Batch’s fair-share scheduling to divide 100 GPU instances between Share Identifiers. We allocated 60 instances to ProdExp1, and 40 to ProdExp2. When ProdExp2 used only 25 GPU instances, the remaining 15 could be borrowed by ProdExp1, allowing it to scale up to 75 GPU instances. When ProdExp2 later needed its full 40 GPU instances, the scheduler preempted jobs from ProdExp1 to restore balance. This example used the P4 instance family, but the same approach could apply to any SageMaker-supported EC2 instance family. This ensured that production workloads have guaranteed access to their assigned capacity, while exploratory or ad-hoc experiments could still make use of any idle GPU instances. This design safeguarded critical workloads and improved overall instance utilization by ensuring that no reserved capacity went unused.

Figure 1: AWS Batch fair-share scheduling

Share Identifiers. We used Share Identifiers to allocate fractions of a service environment’s capacity to production experiments. Share Identifiers are string tags applied at job submission time. AWS Batch used these tags to track usage and enforce fair-share scheduling. For initiatives that required dedicated capacity, we defined preset Share Identifiers with quotas in AWS Batch. This reserved capacity for production tracks. These quotas acted as fairness targets rather than hard limits. Idle capacity could still be borrowed, but under contention, AWS Batch enforced fairness by preempting resources from overused identifiers and reassigned them to underused ones.
Within each Share Identifier, job priorities ranging from 0 to 99 determined execution order, but priority-based preemption only triggered when the Share Identifier reached its allocated capacity limit. Figure 2 illustrates how we set up and used our Share Identifiers. ProdExp1 had 60 p4d instances and ran jobs at various priorities. Job A had a priority of 80, Job B was set to 50, Job C to 30, and Job D had a priority of 10. When all 60 instances were occupied and a new high-priority job (priority 90) requiring 15 instances was submitted, the system preempted the lowest priority running job (Job D) to make room, while maintaining the total of 60 instances for that Share Identifier.

Figure 2: Priority scheduling within a Share ID

Amazon CloudWatch. We used Amazon CloudWatch to instrument our SageMaker training jobs. SageMaker automatically publishes metrics on job progress and resource utilization, while AWS Batch provides detailed information on job scheduling and execution. With AWS Batch, we queried the status of each job through the AWS Batch APIs. This made it possible to track jobs as they transitioned through states such as SUBMITTED, PENDING, RUNNABLE, STARTING, RUNNING, SUCCEEDED, and FAILED. We published these metrics and job states to CloudWatch and configured dashboards and alarms to alert anytime we encountered extended wait times, unexpected failures, or underutilized resources. This built-in integration provided both real-time visibility and historical trend analysis, which helped our team maintain operational efficiency across GPU clusters without building custom monitoring systems.
Operational impact on team performance
By adopting AWS Batch for SageMaker Training jobs, we enabled experiments to run without concerns about resource availability or contention. Researchers could submit jobs without waiting for manual scheduling, which increased the number of experiments that could be run in parallel. This led to shorter queue times, higher GPU utilization, and faster turnaround of training results, directly improving both research throughput and delivery timelines.
How to set up AWS Batch for SageMaker Training Jobs
To set up a similar environment, you can follow this tutorial, which shows you how to orchestrate multiple GPU large language model (LLM) fine-tuning jobs using multiple GPU-powered instances. The solution is also available on GitHub.
Prerequisites
To orchestrate multiple SageMaker Training jobs with AWS Batch, first you need to complete the following prerequisites:
Clone the GitHub repository with the assets for this deployment. This repository consists of notebooks that reference assets:

git clone https://github.com/aws/amazon-sagemaker-examples/
cd build_and_train_models/sm-training-queues-pytorch/

Create AWS Batch resources
To create the necessary resources to manage SageMaker Training job queues with AWS Batch, we provide utility functions in the example to automate the creation of the Service Environment, Scheduling Policy, and Job Queue.
The service environment represents the Amazon SageMaker AI capacity limits available to schedule, expressed as a maximum number of instances. The scheduling policy indicates how compute resources are allocated in a job queue between users or workloads. The job queue is the scheduler interface that researchers interact with to submit jobs and interrogate job status. AWS Batch provides two different types of queues we can operate with:

FIFO queues – Queues in which no scheduling policies are required
Fair-share queues – Queues in which a scheduling policy Amazon Resource Name (ARN) is required to orchestrate the submitted jobs

We recommend creating dedicated service environments for each job queue in a 1:1 ratio. FIFO queues provide basic first-in, first-out job ordering, while fair-share scheduling (FSS) queues provide more sophisticated scheduling, balancing utilization within a Share Identifier, share weights, and job priority. For customers who don’t need multiple shares but would like the ability to assign a priority on job submission, we recommend creating an FSS queue and using a single share within it for all submissions. To create the resources, execute the following commands:

cd smtj_batch_utils
python create_resources.py

You can navigate the AWS Batch Dashboard, shown in the following screenshot, to explore the created resources.

This automation script created two queues:

ml-c5-xlarge-queue – A FIFO queue with priority 2 used for CPU workloads
ml-g6-12xlarge-queue – A fair-share queue with priority 1 used for GPU workloads

The scheduling policy associated with the queue ml-g6-12xlarge-queue defines share attributes such as high priority (HIGHPRI), medium priority (MIDPRI), and low priority (LOWPRI), along with their queue weights. Users can submit jobs and assign them to one of the three shares, HIGHPRI, MIDPRI, or LOWPRI, with weights such as 1 for high priority, 3 for medium priority, and 5 for low priority. The following screenshot shows the scheduling policy details:

For instructions on how to set up the service environment and a job queue, refer to the Getting started section in Introducing AWS Batch support for SageMaker Training Jobs blog.
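To make the scheduling policy concrete, here is a minimal boto3 sketch that creates a fair-share policy with the HIGHPRI, MIDPRI, and LOWPRI shares and the 1/3/5 weights described above; the policy name and decay settings are illustrative and may differ from the example repository.

import boto3

batch = boto3.client("batch")

# Hypothetical fair-share scheduling policy mirroring the shares described above;
# a lower weightFactor gives a share a larger effective slice of the queue.
response = batch.create_scheduling_policy(
    name="smtj-fairshare-policy",          # illustrative name
    fairsharePolicy={
        "shareDecaySeconds": 3600,         # how quickly past usage stops counting
        "computeReservation": 0,
        "shareDistribution": [
            {"shareIdentifier": "HIGHPRI", "weightFactor": 1.0},
            {"shareIdentifier": "MIDPRI", "weightFactor": 3.0},
            {"shareIdentifier": "LOWPRI", "weightFactor": 5.0},
        ],
    },
)
print(response["arn"])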
Run LLM fine-tuning jobs on SageMaker AI
We run the notebook notebook.ipynb to start submitting SageMaker Training jobs with AWS Batch. The notebook contains the code to prepare the data used for the workload, upload it to Amazon Simple Storage Service (Amazon S3), and define the hyperparameters required by the job to be executed.
To run the fine-tuning workload using SageMaker Training jobs, this example uses the ModelTrainer class. The ModelTrainer class is a newer and more intuitive approach to model training that significantly enhances user experience. It supports distributed training, build your own container (BYOC), and recipes.
For additional information about ModelTrainer, you can refer to Accelerate your ML lifecycle using the new and improved Amazon SageMaker Python SDK – Part 1: ModelTrainer
To set up the fine-tuning workload, complete the following steps:

Select the instance type, the container image for the training job, and define the checkpoint path where the model will be stored:

import sagemaker

instance_type = "ml.g6.12xlarge"
instance_count = 1

image_uri = sagemaker.image_uris.retrieve(
    framework="pytorch",
    region=sagemaker_session.boto_session.region_name,
    version="2.6",
    instance_type=instance_type,
    image_scope="training"
)

Create the ModelTrainer function to encapsulate the training setup. The ModelTrainer class simplifies the experience by encapsulating code and training setup. In this example:

SourceCode – The source code configuration. This is used to configure the source code for running the training job by using your local python scripts.
Compute – The compute configuration. This is used to specify the compute resources for the training job.

from sagemaker.modules.configs import Compute, OutputDataConfig, SourceCode, StoppingCondition
from sagemaker.modules.distributed import Torchrun
from sagemaker.modules.train import ModelTrainer

role = sagemaker.get_execution_role()

# Define the script to be run
source_code = SourceCode(
    source_dir="./scripts",
    requirements="requirements.txt",
    entry_script="train.py",
)

# Define the compute
compute_configs = Compute(
    instance_type=instance_type,
    instance_count=instance_count,
    keep_alive_period_in_seconds=0
)

# Define the training job name
job_name = "train-deepseek-distill-llama-8b-sft-batch"

# Define the OutputDataConfig path
output_path = f"s3://{bucket_name}/{job_name}"

# Define the ModelTrainer
model_trainer = ModelTrainer(
    training_image=image_uri,
    source_code=source_code,
    base_job_name=job_name,
    compute=compute_configs,
    distributed=Torchrun(),
    stopping_condition=StoppingCondition(max_runtime_in_seconds=7200),
    hyperparameters={
        "config": "/opt/ml/input/data/config/args.yaml"
    },
    output_data_config=OutputDataConfig(s3_output_path=output_path),
    role=role,
)

Set up the input channels for ModelTrainer by creating InputData objects from the provided S3 bucket paths for the training and validation datasets:

from sagemaker.modules.configs import InputData

train_input = InputData(
    channel_name="train",
    data_source=train_dataset_s3_path,
)
val_input = InputData(
    channel_name="val",
    data_source=val_dataset_s3_path,
)
config_input = InputData(
    channel_name="config",
    data_source=train_config_s3_path,
)

TRAINING_INPUTS = [train_input, val_input, config_input]

Queue SageMaker Training jobs
This section and the following are intended to be used interactively so that you can explore how to use the Amazon SageMaker Python SDK to submit jobs to your Batch queues. Follow these steps:

Select the queue to use:

from sagemaker.aws_batch.queue import TrainingQueue
SMTJ_BATCH_QUEUE = "ml-g6-12xlarge-queue"

queue = TrainingQueue(SMTJ_BATCH_QUEUE)

In the next cell, submit two training jobs in the queue:

LOW PRIORITY
MEDIUM PRIORITY

Use the API submit to submit all the jobs:

job_name_1 = job_name + "-low-pri"
queued_job_1 = queue.submit(
    model_trainer, TRAINING_INPUTS, job_name_1, priority=5, share_identifier="LOWPRI"
)
job_name_2 = job_name + "-mid-pri"
queued_job_2 = queue.submit(
    model_trainer, TRAINING_INPUTS, job_name_2, priority=3, share_identifier="MIDPRI"
)

Display the status of running and in queue jobs
We can use the job queue list and job queue snapshot APIs to programmatically view a snapshot of the jobs that the queue will run next. For fair-share queues, this ordering is dynamic and occasionally needs to be refreshed as new jobs are submitted to the queue or as share usage changes over time.

from utils.queue_utils import print_queue_state
print_queue_state(queue)

The following screenshot shows the jobs submitted with low priority and medium priority in the Runnable State and in the queue.

You can also refer to the AWS Batch Dashboard, shown in the following screenshot, to analyze the status of the jobs.

As shown in the following screenshot, the first job executed with the SageMaker Training job is the MEDIUM PRIORITY one, by respecting the scheduling policy rules defined previously.

You can explore the running training job in the SageMaker AI console, as shown in the following screenshot.

Submit an additional job
You can now submit an additional SageMaker Training job with HIGH PRIORITY to the queue:

job_name_3 = job_name + "-high-pri"
queued_job_3 = queue.submit(
    model_trainer, TRAINING_INPUTS, job_name_3, priority=1, share_identifier="HIGHPRI"
)

You can explore the status from the dashboard, as shown in the following screenshot.

The HIGH PRIORITY job, despite being submitted later in the queue, will be executed before the other runnable jobs by respecting the scheduling policy rules, as shown in the following screenshot.

As the scheduling policy in the screenshot shows, the LOWPRI share has a higher weight factor (5) than the MIDPRI share (3). Since a lower weight signifies higher priority, a LOWPRI job will be executed after a MIDPRI job, even if they are submitted at the same time.

Clean up
To clean up your resources to avoid incurring future charges, follow these steps:

Verify that your training job isn’t running anymore. To do so, on your SageMaker console, choose Training and check Training jobs.
Delete AWS Batch resources by using the command python create_resources.py --clean from the GitHub example or by manually deleting them from the AWS Management Console.

Conclusion
In this post, we demonstrated how Amazon Search used AWS Batch for SageMaker Training Jobs to optimize GPU resource utilization and training job management. The solution transformed their training infrastructure by implementing sophisticated queue management and fair share scheduling, increasing peak GPU utilization from 40% to over 80%. We recommend that organizations facing similar ML training infrastructure challenges explore AWS Batch integration with SageMaker, which provides built-in queue management capabilities and priority-based scheduling. The solution eliminates manual resource coordination while providing workloads with appropriate prioritization through configurable scheduling policies.
To begin implementing AWS Batch with SageMaker training jobs, you can access our sample code and implementation guide in the amazon-sagemaker-examples repository on GitHub. The example demonstrates how to set up AWS Identity and Access Management (IAM) permissions, create AWS Batch resources, and orchestrate multiple GPU-powered training jobs using ModelTrainer class.

The authors would like to thank Charles Thompson and Kanwaljit Khurmi for their collaboration.
About the authors

Mona Mona
Mona is a Generative AI Specialist Solutions Architect at Amazon. She is a published author of two books: Natural Language Processing with AWS AI Services and Google Cloud Certified Professional Machine Learning Study Guide.

Mayank Jha
Mayank is a Senior Machine Learning Engineer at Amazon Search working on the model training optimization. He is passionate about finding practical applications for complex problems at hand and aims to develop solutions that have a deep impact on how businesses and people thrive.

Bruno Pistone
Bruno is a Senior generative AI and ML Specialist Solutions Architect for AWS based in Milan. He works with large customers helping them to deeply understand their technical needs and design AI and Machine Learning solutions that make the best use of the AWS Cloud and the Amazon Machine Learning stack. He enjoys spending time with his friends and exploring new places, as well as travelling to new destinations.

James Park
James is a Solutions Architect at Amazon Web Services. He works with Amazon.com to design, build, and deploy technology solutions on AWS, and has a particular interest in AI and machine learning. In his spare time he enjoys seeking out new cultures, new experiences, and staying up to date with the latest technology trends.

How Can We Build Scalable and Reproducible Machine Learning Experiment …

In this tutorial, we explore Hydra, an advanced configuration management framework originally developed and open-sourced by Meta Research. We begin by defining structured configurations using Python dataclasses, which allows us to manage experiment parameters in a clean, modular, and reproducible manner. As we move through the tutorial, we compose configurations, apply runtime overrides, and simulate multirun experiments for hyperparameter sweeps. Check out the FULL CODES here.

import subprocess
import sys
subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", "hydra-core"])

import hydra
from hydra import compose, initialize_config_dir
from omegaconf import OmegaConf, DictConfig
from dataclasses import dataclass, field
from typing import List, Optional
import os
from pathlib import Path

We begin by installing Hydra and importing all the essential modules required for structured configurations, dynamic composition, and file handling. This setup ensures our environment is ready to execute the full tutorial seamlessly on Google Colab. Check out the FULL CODES here.

@dataclass
class OptimizerConfig:
    _target_: str = "torch.optim.SGD"
    lr: float = 0.01

@dataclass
class AdamConfig(OptimizerConfig):
    _target_: str = "torch.optim.Adam"
    lr: float = 0.001
    betas: tuple = (0.9, 0.999)
    weight_decay: float = 0.0

@dataclass
class SGDConfig(OptimizerConfig):
    _target_: str = "torch.optim.SGD"
    lr: float = 0.01
    momentum: float = 0.9
    nesterov: bool = True

@dataclass
class ModelConfig:
    name: str = "resnet"
    num_layers: int = 50
    hidden_dim: int = 512
    dropout: float = 0.1

@dataclass
class DataConfig:
    dataset: str = "cifar10"
    batch_size: int = 32
    num_workers: int = 4
    augmentation: bool = True

@dataclass
class TrainingConfig:
    model: ModelConfig = field(default_factory=ModelConfig)
    data: DataConfig = field(default_factory=DataConfig)
    optimizer: OptimizerConfig = field(default_factory=AdamConfig)
    epochs: int = 100
    seed: int = 42
    device: str = "cuda"
    experiment_name: str = "exp_001"

We define clean, type-safe configurations using Python dataclasses for the model, data, and optimizer settings. This structure allows us to manage complex experiment parameters in a modular and readable way while ensuring consistency across runs. Check out the FULL CODES here.
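The _target_ fields in the optimizer configs are what let Hydra build real objects from configuration. As a small optional sketch (the tiny nn.Linear model below is only for illustration and is not part of the tutorial), hydra.utils.instantiate turns such a config node into an actual optimizer:

# Illustrative sketch: turning a _target_-style config into a real object.
import torch.nn as nn
from hydra.utils import instantiate
from omegaconf import OmegaConf

sgd_cfg = OmegaConf.structured(SGDConfig(lr=0.05))
demo_model = nn.Linear(10, 2)                                  # placeholder model for the example
optimizer = instantiate(sgd_cfg, params=demo_model.parameters())
print(type(optimizer).__name__, optimizer.defaults["lr"])      # SGD 0.05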

def setup_config_dir():
    config_dir = Path("./hydra_configs")
    config_dir.mkdir(exist_ok=True)

    main_config = """
defaults:
  - model: resnet
  - data: cifar10
  - optimizer: adam
  - _self_

epochs: 100
seed: 42
device: cuda
experiment_name: exp_001
"""
    (config_dir / "config.yaml").write_text(main_config)

    model_dir = config_dir / "model"
    model_dir.mkdir(exist_ok=True)

    (model_dir / "resnet.yaml").write_text("""
name: resnet
num_layers: 50
hidden_dim: 512
dropout: 0.1
""")

    (model_dir / "vit.yaml").write_text("""
name: vision_transformer
num_layers: 12
hidden_dim: 768
dropout: 0.1
patch_size: 16
""")

    data_dir = config_dir / "data"
    data_dir.mkdir(exist_ok=True)

    (data_dir / "cifar10.yaml").write_text("""
dataset: cifar10
batch_size: 32
num_workers: 4
augmentation: true
""")

    (data_dir / "imagenet.yaml").write_text("""
dataset: imagenet
batch_size: 128
num_workers: 8
augmentation: true
""")

    opt_dir = config_dir / "optimizer"
    opt_dir.mkdir(exist_ok=True)

    (opt_dir / "adam.yaml").write_text("""
_target_: torch.optim.Adam
lr: 0.001
betas: [0.9, 0.999]
weight_decay: 0.0
""")

    (opt_dir / "sgd.yaml").write_text("""
_target_: torch.optim.SGD
lr: 0.01
momentum: 0.9
nesterov: true
""")

    return str(config_dir.absolute())

We programmatically create a directory containing YAML configuration files for models, datasets, and optimizers. This approach enables us to demonstrate how Hydra automatically composes configurations from different files, thereby maintaining flexibility and clarity in experiments. Check out the FULL CODES here.

@hydra.main(version_base=None, config_path="hydra_configs", config_name="config")
def train(cfg: DictConfig) -> float:
    print("=" * 80)
    print("CONFIGURATION")
    print("=" * 80)
    print(OmegaConf.to_yaml(cfg))

    print("\n" + "=" * 80)
    print("ACCESSING CONFIGURATION VALUES")
    print("=" * 80)
    print(f"Model: {cfg.model.name}")
    print(f"Dataset: {cfg.data.dataset}")
    print(f"Batch Size: {cfg.data.batch_size}")
    print(f"Optimizer LR: {cfg.optimizer.lr}")
    print(f"Epochs: {cfg.epochs}")

    best_acc = 0.0
    for epoch in range(min(cfg.epochs, 3)):
        acc = 0.5 + (epoch * 0.1) + (cfg.optimizer.lr * 10)
        best_acc = max(best_acc, acc)
        print(f"Epoch {epoch+1}/{cfg.epochs}: Accuracy = {acc:.4f}")

    return best_acc

We implement a training function that leverages Hydra’s configuration system to print, access, and use nested config values. By simulating a simple training loop, we showcase how Hydra cleanly integrates experiment control into real workflows. Check out the FULL CODES here.

def demo_basic_usage():
    print("\nDEMO 1: Basic Configuration\n")
    config_dir = setup_config_dir()
    with initialize_config_dir(version_base=None, config_dir=config_dir):
        cfg = compose(config_name="config")
        print(OmegaConf.to_yaml(cfg))

def demo_config_override():
    print("\nDEMO 2: Configuration Overrides\n")
    config_dir = setup_config_dir()
    with initialize_config_dir(version_base=None, config_dir=config_dir):
        cfg = compose(
            config_name="config",
            overrides=[
                "model=vit",
                "data=imagenet",
                "optimizer=sgd",
                "optimizer.lr=0.1",
                "epochs=50"
            ]
        )
        print(OmegaConf.to_yaml(cfg))

def demo_structured_config():
    print("\nDEMO 3: Structured Config Validation\n")
    from hydra.core.config_store import ConfigStore
    cs = ConfigStore.instance()
    cs.store(name="training_config", node=TrainingConfig)
    with initialize_config_dir(version_base=None, config_dir=setup_config_dir()):
        cfg = compose(config_name="config")
        print(f"Config type: {type(cfg)}")
        print(f"Epochs (validated as int): {cfg.epochs}")

def demo_multirun_simulation():
    print("\nDEMO 4: Multirun Simulation\n")
    config_dir = setup_config_dir()
    experiments = [
        ["model=resnet", "optimizer=adam", "optimizer.lr=0.001"],
        ["model=resnet", "optimizer=sgd", "optimizer.lr=0.01"],
        ["model=vit", "optimizer=adam", "optimizer.lr=0.0001"],
    ]
    results = {}
    for i, overrides in enumerate(experiments):
        print(f"\n--- Experiment {i+1} ---")
        with initialize_config_dir(version_base=None, config_dir=config_dir):
            cfg = compose(config_name="config", overrides=overrides)
            print(f"Model: {cfg.model.name}, Optimizer: {cfg.optimizer._target_}")
            print(f"Learning Rate: {cfg.optimizer.lr}")
            results[f"exp_{i+1}"] = cfg
    return results

def demo_interpolation():
    print("\nDEMO 5: Variable Interpolation\n")
    cfg = OmegaConf.create({
        "model": {"name": "resnet", "layers": 50},
        "experiment": "${model.name}_${model.layers}",
        "output_dir": "/outputs/${experiment}",
        "checkpoint": "${output_dir}/best.ckpt"
    })
    print(OmegaConf.to_yaml(cfg))
    print(f"\nResolved experiment name: {cfg.experiment}")
    print(f"Resolved checkpoint path: {cfg.checkpoint}")

We demonstrate Hydra’s advanced capabilities, including config overrides, structured config validation, multi-run simulations, and variable interpolation. Each demo showcases how Hydra accelerates experimentation speed, streamlines manual setup, and fosters reproducibility in research. Check out the FULL CODES here.

if __name__ == "__main__":
    demo_basic_usage()
    demo_config_override()
    demo_structured_config()
    demo_multirun_simulation()
    demo_interpolation()
    print("\n" + "=" * 80)
    print("Tutorial complete! Key takeaways:")
    print("✓ Config composition with defaults")
    print("✓ Runtime overrides via command line")
    print("✓ Structured configs with type safety")
    print("✓ Multirun for hyperparameter sweeps")
    print("✓ Variable interpolation")
    print("=" * 80)

We execute all demonstrations in sequence to observe Hydra in action, from loading configs to performing multiruns. By the end, we summarize key takeaways, reinforcing how Hydra enables scalable and elegant experiment management.

In conclusion, we grasp how Hydra, pioneered by Meta Research, simplifies and enhances experiment management through its powerful composition system. We explore structured configs, interpolation, and multirun capabilities that make large-scale machine learning workflows more flexible and maintainable. With this knowledge, you are now equipped to integrate Hydra into your own research or development pipelines, ensuring reproducibility, efficiency, and clarity in every experiment you run.

Check out the FULL CODES here. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.
The post How Can We Build Scalable and Reproducible Machine Learning Experiment Pipelines Using Meta Research Hydra? appeared first on MarkTechPost.

Comparing the Top 7 Large Language Models LLMs/Systems for Coding in 2 …

Code-oriented large language models moved from autocomplete to software engineering systems. In 2025, leading models must fix real GitHub issues, refactor multi-repo backends, write tests, and run as agents over long context windows. The main question for teams is not “can it code” but which model fits which constraints.

Here are seven models (and systems around them) that cover most real coding workloads today:

OpenAI GPT-5 / GPT-5-Codex

Anthropic Claude 3.5 Sonnet / Claude 4.x Sonnet with Claude Code

Google Gemini 2.5 Pro

Meta Llama 3.1 405B Instruct

DeepSeek-V2.5-1210 (with DeepSeek-V3 as the successor)

Alibaba Qwen2.5-Coder-32B-Instruct

Mistral Codestral 25.01

The goal of this comparison is not to rank them on a single score. The goal is to show which system to pick for a given benchmark target, deployment model, governance requirement, and IDE or agent stack.

Evaluation dimensions

We compare on six stable dimensions:

Core coding quality: HumanEval, MBPP / MBPP EvalPlus, code generation and repair quality on standard Python tasks.

Repo and bug-fix performance: SWE-bench Verified (real GitHub issues), Aider Polyglot (whole-file edits), RepoBench, LiveCodeBench.

Context and long-context behavior: Documented context limits and practical behavior in long sessions.

Deployment model: Closed API, cloud service, containers, on-premises or fully self-hosted open weights.

Tooling and ecosystem: Native agents, IDE extensions, cloud integration, GitHub and CI/CD support.

Cost and scaling pattern: Token pricing for closed models, hardware footprint and inference pattern for open models.

Image source: marktechpost.com

1. OpenAI GPT-5 / GPT-5-Codex

OpenAI’s GPT-5 is the flagship reasoning and coding model and the default in ChatGPT. For real-world code, OpenAI reports:

SWE-bench Verified: 74.9%

Aider Polyglot: 88%

Both benchmarks simulate real engineering: SWE-bench Verified runs against upstream repos and tests; Aider Polyglot measures whole-file multi-language edits.

Context and variants

gpt-5 (chat) API: 128k token context.

gpt-5-pro / gpt-5-codex: up to 400k combined context in the model card, with typical production limits around ≈272k input + 128k output for reliability.

GPT-5 and GPT-5-Codex are available in ChatGPT (Plus / Pro / Team / Enterprise) and via the OpenAI API; they are closed-weight, cloud-hosted only.

Strengths

Highest published SWE-bench Verified and Aider Polyglot scores among widely available models.

Very strong at multi-step bug fixing with “thinking” (chain-of-thought) enabled.

Deep ecosystem: ChatGPT, Copilot, and many third-party IDE and agent platforms use GPT-5 backends.

Limits

No self-hosting; all traffic must go through OpenAI or partners.

Long-context calls are expensive if you stream full monorepos, so you need retrieval and diff-only patterns.

Use when you want maximum repo-level benchmark performance and are comfortable with a closed, cloud API.

2. Anthropic Claude 3.5 Sonnet / Claude 4.x + Claude Code

Claude 3.5 Sonnet was Anthropic’s main coding workhorse before the Claude 4 line. Anthropic highlights it as SOTA on HumanEval, and independent comparisons report:

HumanEval: ≈ 92%

MBPP EvalPlus: ≈ 91%

In 2025, Anthropic released Claude 4 Opus, Sonnet, and Sonnet 4.5, positioning Sonnet 4.5 as its best coding and agent model so far.

Claude Code stack

Claude Code is a repo-aware coding system:

Managed VM connected to your GitHub repo.

File browsing, editing, tests, and PR creation.

SDK for building custom agents that use Claude as a coding backend.

Strengths

Very strong HumanEval / MBPP, good empirical behavior on debugging and code review.

Production-grade coding agent environment with persistent VM and GitHub workflows.

Limits

Closed and cloud-hosted, similar to GPT-5 in governance terms.

Published SWE-bench Verified numbers for Claude 3.5 Sonnet are below GPT-5, though Claude 4.x is likely closer.

Use when you need explainable debugging, code review, and a managed repo-level agent and can accept a closed deployment.

3. Google Gemini 2.5 Pro

Gemini 2.5 Pro is Google DeepMind’s main coding and reasoning model for developers. It reports the following performance results:

LiveCodeBench v5: 70.4%

Aider Polyglot (whole-file editing): 74.0%

SWE-bench Verified: 63.8%

These results place Gemini 2.5 Pro above many earlier models and only behind Claude 3.7 and GPT-5 on SWE-bench Verified.

Context and platform

Long-context capability marketed up to 1M tokens across the Gemini family; 2.5 Pro is the stable tier used in Gemini Apps, Google AI Studio, and Vertex AI.

Tight integration with GCP services, BigQuery, Cloud Run, and Google Workspace.

Strengths

Good combination of LiveCodeBench, Aider, SWE-bench scores plus first-class GCP integration.

Strong choice for “data plus application code” when you want the same model for SQL, analytics helpers, and backend code.

Limits

Closed and tied to Google Cloud.

For pure SWE-bench Verified, GPT-5 and the newest Claude Sonnet 4.x are stronger.

Use when your workloads already run on GCP / Vertex AI and you want a long-context coding model inside that stack.

4. Meta Llama 3.1 405B Instruct

Meta’s Llama 3.1 family (8B, 70B, 405B) is open-weight. The 405B Instruct variant is the high-end option for coding and general reasoning. It reports the following performance results:

HumanEval (Python): 89.0

MBPP (base or EvalPlus): ≈ 88.6

These scores put Llama 3.1 405B among the strongest open models on classic code benchmarks.

The official model card states that Llama 3.1 models outperform many open and closed chat models on common benchmarks and are optimized for multilingual dialogue and reasoning.

Strengths

High HumanEval / MBPP scores with open weights and permissive licensing.

Strong general performance (MMLU, MMLU-Pro, etc.), so one model can serve both product features and coding agents.

Limits

405B parameters mean high serving cost and latency unless you have a large GPU cluster.

For strictly code benchmarks at a fixed compute budget, specialized models such as Qwen2.5-Coder-32B and Codestral 25.01 are more cost-efficient.

Use when you want a single open foundation model with strong coding and general reasoning, and you control your own GPU infrastructure.

5. DeepSeek-V2.5-1210 (and DeepSeek-V3)

DeepSeek-V2.5-1210 is an upgraded Mixture-of-Experts model that merges the chat and coder lines. The model card reports:

LiveCodeBench (08.01–12.01): improved from 29.2% to 34.38%

MATH-500: 74.8% → 82.8%

DeepSeek has since released DeepSeek-V3, a 671B-parameter MoE with 37B active per token, trained on 14.8T tokens. The performance is comparable to leading closed models on many reasoning and coding benchmarks, and public dashboards show V3 ahead of V2.5 on key tasks.

Strengths

Open MoE model with solid LiveCodeBench results and good math performance for its size.

Efficient active-parameter count vs total parameters.

Limits

V2.5 is no longer the flagship; DeepSeek-V3 is now the reference model.

Ecosystem is lighter than OpenAI / Google / Anthropic; teams must assemble their own IDE and agent integrations.

Use when you want a self-hosted MoE coder with open weights and are ready to move to DeepSeek-V3 as it matures.

6. Qwen2.5-Coder-32B-Instruct

Qwen2.5-Coder is Alibaba’s code-specific LLM family. The technical report and model card describe six sizes (0.5B to 32B) and continued pretraining on over 5.5T tokens of code-heavy data.

The official benchmarks for Qwen2.5-Coder-32B-Instruct list:

HumanEval: 92.7%

MBPP: 90.2%

LiveCodeBench: 31.4%

Aider Polyglot: 73.7%

Spider: 85.1%

CodeArena: 68.9%

Strengths

Very strong HumanEval / MBPP / Spider results for an open model; often competitive with closed models in pure code tasks.

Multiple parameter sizes make it adaptable to different hardware budgets.

Limits

Less suited for broad general reasoning than a generalist like Llama 3.1 405B or DeepSeek-V3.

Documentation and ecosystem are catching up in English-language tooling.

Use when you need a self-hosted, high-accuracy code model and can pair it with a general LLM for non-code tasks.

7. Mistral Codestral 25.01

Codestral 25.01 is Mistral’s updated code generation model. Mistral’s announcement and follow-up posts state that 25.01 uses a more efficient architecture and tokenizer and generates code roughly 2× faster than the base Codestral model.

Benchmark reports:

HumanEval: 86.6%

MBPP: 80.2%

Spider: 66.5%

RepoBench: 38.0%

LiveCodeBench: 37.9%

Codestral 25.01 supports over 80 programming languages and a 256k token context window, and is optimized for low-latency, high-frequency tasks such as completion and FIM.

Strengths

Very good RepoBench / LiveCodeBench scores for a mid-size open model.

Designed for fast interactive use in IDEs and SaaS, with open weights and a 256k context.

Limits

Absolute HumanEval / MBPP scores sit below Qwen2.5-Coder-32B, which is expected at this parameter class.

Use when you need a compact, fast open code model for completions and FIM at scale.

Head to head comparison

GPT-5 / GPT-5-Codex: hosted general model with strong coding and agents. Context: 128k (chat), up to 400k for Pro / Codex. Code benchmarks (examples): 74.9 SWE-bench, 88 Aider. Deployment: closed API, OpenAI / Copilot stack. Integration path: ChatGPT, OpenAI API, Copilot. Best fit: maximum SWE-bench / Aider performance in a hosted setting.

Claude 3.5 / 4.x + Claude Code: hosted models plus a repo-level coding VM. Context: 200k-class (varies by tier). Code benchmarks (examples): ≈92 HumanEval, ≈91 MBPP, 49 SWE-bench (3.5); 4.x stronger but less published. Deployment: closed API, Anthropic console, Claude Code. Integration path: Claude app, Claude Code, SDKs. Best fit: repo-level agents and debugging quality.

Gemini 2.5 Pro: hosted coding and reasoning model on GCP. Context: long-context, million-class across the Gemini line. Code benchmarks (examples): 70.4 LiveCodeBench, 74 Aider, 63.8 SWE-bench. Deployment: closed API, Google AI Studio / Vertex AI. Integration path: Gemini Apps, Vertex AI, GCP. Best fit: GCP-centric engineering and data plus code.

Llama 3.1 405B Instruct: open generalist foundation with strong coding. Context: up to 128k in many deployments. Code benchmarks (examples): 89 HumanEval, ≈88.6 MBPP. Deployment: open weights, self-hosted or cloud. Integration path: Hugging Face, vLLM, cloud marketplaces. Best fit: a single open foundation model.

DeepSeek-V2.5-1210 / V3: open MoE coder and chat model. Context: tens of thousands of tokens, with MoE scaling. Code benchmarks (examples): 34.38 LiveCodeBench; V3 stronger on mixed benchmarks. Deployment: open weights, self-hosted; V3 via providers. Integration path: Hugging Face, vLLM, custom stacks. Best fit: open MoE experiments and the Chinese ecosystem.

Qwen2.5-Coder-32B: open code-specialized model. Context: typically 32k–128k depending on host (32B parameter model). Code benchmarks (examples): 92.7 HumanEval, 90.2 MBPP, 31.4 LiveCodeBench, 73.7 Aider. Deployment: open weights, self-hosted or via providers. Integration path: Hugging Face, commercial APIs, local runners. Best fit: self-hosted high-accuracy code assistant.

Codestral 25.01: open mid-size code model. Context: 256k. Code benchmarks (examples): 86.6 HumanEval, 80.2 MBPP, 38 RepoBench, 37.9 LiveCodeBench. Deployment: open weights, available on multiple clouds. Integration path: Azure, GCP, custom inference, IDE plugins. Best fit: fast open model for IDE and product integration.

What to use when?

You want the strongest hosted repo-level solver: Use GPT-5 / GPT-5-Codex. Claude Sonnet 4.x is the closest competitor, but GPT-5 has the clearest SWE-bench Verified and Aider numbers today.

You want a full coding agent over a VM and GitHub: Use Claude Sonnet + Claude Code for repo-aware workflows and long multi-step debugging sessions.

You are standardized on Google Cloud: Use Gemini 2.5 Pro as the default coding model inside Vertex AI and AI Studio.

You need a single open general foundation: Use Llama 3.1 405B Instruct when you want one open model for application logic, RAG, and code.

You want the strongest open code specialist: Use Qwen2.5-Coder-32B-Instruct, and add a smaller general LLM for non-code tasks if needed.

You want MoE-based open models: Use DeepSeek-V2.5-1210 now and plan for DeepSeek-V3 as you move to the latest upgrade.

You are building IDEs or SaaS products and need a fast open code model: Use Codestral 25.01 for FIM, completion, and mid-size repo work with 256k context.

Editorial comments

GPT-5, Claude Sonnet 4.x, and Gemini 2.5 Pro now define the upper bound of hosted coding performance, especially on SWE-bench Verified and Aider Polyglot. At the same time, open models such as Llama 3.1 405B, Qwen2.5-Coder-32B, DeepSeek-V2.5/V3, and Codestral 25.01 show that it is realistic to run high-quality coding systems on your own infrastructure, with full control over weights and data paths.

For most software engineering teams, the practical answer is a portfolio: one or two hosted frontier models for the hardest multi-service refactors, plus one or two open models for internal tools, regulated code bases, and latency-sensitive IDE integrations.

References

OpenAI – Introducing GPT-5 for developers (SWE-bench Verified, Aider Polyglot) (openai.com)

Vellum, Runbear and other benchmark summaries for GPT-5 coding performance (vellum.ai)

Anthropic – Claude 3.5 Sonnet and Claude 4 announcements (Anthropic)

Kitemetric and other third-party Claude 3.5 Sonnet coding benchmark reviews (Kite Metric)

Google – Gemini 2.5 Pro model page and Google / Datacamp benchmark posts (Google DeepMind)

Meta – Llama 3.1 405B model card and analyses of HumanEval / MBPP scores (Hugging Face)

DeepSeek – DeepSeek-V2.5-1210 model card and update notes; community coverage on V3 (Hugging Face)

Alibaba – Qwen2.5-Coder technical report and Hugging Face model card (arXiv)

Mistral – Codestral 25.01 announcement and benchmark summaries (Mistral AI)

The post Comparing the Top 7 Large Language Models LLMs/Systems for Coding in 2025 appeared first on MarkTechPost.

Cache-to-Cache(C2C): Direct Semantic Communication Between Large Langu …

Can large language models collaborate without sending a single token of text? A team of researchers from Tsinghua University, Infinigence AI, The Chinese University of Hong Kong, Shanghai AI Laboratory, and Shanghai Jiao Tong University says yes. Cache-to-Cache (C2C) is a new communication paradigm in which large language models exchange information through their KV-Cache rather than through generated text.

https://arxiv.org/pdf/2510.03215

Text communication is the bottleneck in multi LLM systems

Current multi LLM systems mostly use text to communicate. One model writes an explanation, another model reads it as context.

This design has three practical costs:

Internal activations are compressed into short natural language messages. Much of the semantic signal in the KV-Cache never crosses the interface.

Natural language is ambiguous. Even with structured protocols, a coder model may encode structural signals, such as the role of an HTML <p> tag, that do not survive a vague textual description.

Every communication step requires token by token decoding, which dominates latency in long analytical exchanges.

The C2C work asks a direct question: can KV-Cache be treated as the communication channel instead?

Oracle experiments, can KV-Cache carry communication

The research team first runs two oracle style experiments to test whether KV-Cache is a useful medium.

Cache enrichment oracle

They compare three setups on multiple choice benchmarks:

Direct, prefill on the question only.

Few shot, prefill on exemplars plus question, longer cache.

Oracle, prefill on exemplars plus question, then discard the exemplar segment and keep only the question aligned slice of the cache, so cache length is the same as Direct.
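To make the oracle concrete, here is a minimal, hedged tensor-level sketch of what keeping only the question aligned slice of the cache means. The shapes, lengths and variable names are illustrative assumptions, not the paper's implementation:

import torch

n_layers, n_heads, head_dim = 4, 8, 64
exemplar_len, question_len = 96, 32
full_len = exemplar_len + question_len

# Per-layer (key, value) cache produced by prefilling "exemplars + question"
kv_full = [
    (torch.randn(1, n_heads, full_len, head_dim),
     torch.randn(1, n_heads, full_len, head_dim))
    for _ in range(n_layers)
]

# Oracle: discard the exemplar segment and keep only the question aligned slice,
# so the cache length matches the Direct setup that prefills on the question alone.
kv_oracle = [(k[:, :, exemplar_len:, :], v[:, :, exemplar_len:, :]) for k, v in kv_full]
assert kv_oracle[0][0].shape[2] == question_len   # same length as Direct, enriched content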


Oracle improves accuracy from 58.42 percent to 62.34 percent at the same cache length, while Few shot reaches 63.39 percent. This demonstrates that enriching the question KV-Cache itself, even without more tokens, improves performance. Layer wise analysis shows that enriching only selected layers is better than enriching all layers, which later motivates a gating mechanism.

Cache transformation oracle

Next, they test whether KV-Cache from one model can be transformed into the space of another model. A three layer MLP is trained to map KV-Cache from Qwen3 4B to Qwen3 0.6B. t-SNE plots show that the transformed cache lies inside the target cache manifold, but only in a sub region.


C2C, direct semantic communication through KV-Cache

Based on these oracles, the research team defines Cache-to-Cache communication between a Sharer and a Receiver model.

During prefill, both models read the same input and produce layer wise KV-Cache. For each Receiver layer, C2C selects a mapped Sharer layer and applies a C2C Fuser to produce a fused cache. During decoding, the Receiver predicts tokens conditioned on this fused cache instead of its original cache.

The C2C Fuser follows a residual integration principle and has three modules:

Projection module concatenates Sharer and Receiver KV-Cache vectors, applies a projection layer, then a feature fusion layer.

Dynamic weighting module modulates heads based on the input so that some attention heads rely more on Sharer information.

Learnable gate adds a per layer gate that decides whether to inject Sharer context into that layer. The gate uses a Gumbel sigmoid during training and becomes binary at inference.
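As a rough illustration of these three modules, the following is a minimal PyTorch sketch of one per layer fuser operating on flattened key (or value) states. It is a conceptual reconstruction under assumed shapes; the class, parameter names and exact operations are ours, not the released C2C code:

import torch
import torch.nn as nn
import torch.nn.functional as F

class C2CFuserSketch(nn.Module):
    """Per-layer fuser sketch: projection, per-head weighting, gated residual injection."""
    def __init__(self, d_head: int, n_heads: int):
        super().__init__()
        d = n_heads * d_head
        self.proj = nn.Linear(2 * d, d)        # projection over concatenated Sharer + Receiver states
        self.fuse = nn.Linear(d, d)            # feature fusion layer
        self.head_w = nn.Linear(d, n_heads)    # input-conditioned per-head weighting
        self.gate_logit = nn.Parameter(torch.zeros(1))  # learnable per-layer gate
        self.n_heads, self.d_head = n_heads, d_head

    def forward(self, recv_kv, sharer_kv, training: bool = True):
        # recv_kv, sharer_kv: [batch, seq, n_heads * d_head] flattened key (or value) states
        fused = self.fuse(F.gelu(self.proj(torch.cat([recv_kv, sharer_kv], dim=-1))))
        w = torch.sigmoid(self.head_w(recv_kv))                      # [batch, seq, n_heads]
        fused = fused.view(*fused.shape[:2], self.n_heads, self.d_head) * w.unsqueeze(-1)
        fused = fused.flatten(-2)
        if training:  # Gumbel-sigmoid style relaxation during training
            u = torch.rand_like(self.gate_logit)
            gate = torch.sigmoid(self.gate_logit + torch.log(u) - torch.log1p(-u))
        else:         # hard binary gate at inference
            gate = (self.gate_logit > 0).float()
        return recv_kv + gate * fused   # residual integration into the Receiver cache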

Sharer and Receiver can come from different families and sizes, so C2C also defines:

Token alignment by decoding Receiver tokens to strings and re encoding them with the Sharer tokenizer, then choosing Sharer tokens with maximal string coverage.

Layer alignment using a terminal strategy that pairs top layers first and walks backward until the shallower model is fully covered.
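A small, hedged sketch of the terminal layer alignment idea, where top layers are paired first and the mapping walks backward; the exact rule in the released code may differ:

def terminal_layer_alignment(n_receiver_layers: int, n_sharer_layers: int):
    # Pair the top layers first, then walk backward; clamp at layer 0 once the
    # shallower stack runs out. Returns {receiver_layer: sharer_layer}.
    mapping = {}
    for offset in range(n_receiver_layers):
        r = n_receiver_layers - 1 - offset
        s = max(0, n_sharer_layers - 1 - offset)
        mapping[r] = s
    return mapping

print(terminal_layer_alignment(4, 6))   # {3: 5, 2: 4, 1: 3, 0: 2}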


During training, both LLMs are frozen. Only the C2C module is trained, using a next token prediction loss on Receiver outputs. The main C2C fusers are trained on the first 500k samples of the OpenHermes2.5 dataset, and evaluated on OpenBookQA, ARC Challenge, MMLU Redux and C Eval.

Accuracy and latency, C2C versus text communication

Across many Sharer Receiver combinations built from Qwen2.5, Qwen3, Llama3.2 and Gemma3, C2C consistently improves Receiver accuracy and reduces latency. Key results:

C2C achieves about 8.5 to 10.5 percent higher average accuracy than individual models.

C2C outperforms text communication by about 3.0 to 5.0 percent on average.

C2C delivers around 2x average speedup in latency compared with text based collaboration, and in some configurations the speedup is larger.

A concrete example uses Qwen3 0.6B as Receiver and Qwen2.5 0.5B as Sharer. On MMLU Redux, the Receiver alone reaches 35.53 percent, text to text reaches 41.03 percent, and C2C reaches 42.92 percent. Average time per query for text to text is 1.52 units, while C2C stays close to the single model at 0.40. Similar patterns appear on OpenBookQA, ARC Challenge and C Eval.

On LongBenchV1, with the same pair, C2C outperforms text communication across all sequence length buckets. For sequences of 0 to 4k tokens, text communication reaches 29.47 while C2C reaches 36.64. Gains remain for 4k to 8k and for longer contexts.


Key Takeaways

Cache-to-Cache communication lets a Sharer model send information to a Receiver model directly via KV-Cache, so collaboration does not need intermediate text messages, which removes the token bottleneck and reduces semantic loss in multi model systems.

Two oracle studies show that enriching only the question aligned slice of the cache improves accuracy at constant sequence length, and that KV-Cache from a larger model can be mapped into a smaller model’s cache space through a learned projector, confirming cache as a viable communication medium.

C2C Fuser architecture combines Sharer and Receiver caches with a projection module, dynamic head weighting and a learnable per layer gate, and integrates everything in a residual way, which allows the Receiver to selectively absorb Sharer semantics without destabilizing its own representation.

Consistent accuracy and latency gains are observed across Qwen2.5, Qwen3, Llama3.2 and Gemma3 model pairs, with about 8.5 to 10.5 percent average accuracy improvement over a single model, 3 to 5 percent gains over text to text communication, and around 2x faster responses because unnecessary decoding is removed.

Editorial Comments

Cache-to-Cache reframes multi LLM communication as a direct semantic transfer problem, not a prompt engineering problem. By projecting and fusing KV-Cache between Sharer and Receiver with a neural fuser and learnable gating, C2C uses the deep specialized semantics of both models while avoiding explicit intermediate text generation, which is an information bottleneck and a latency cost. With 8.5 to 10.5 percent higher accuracy and about 2x lower latency than text communication, C2C is a strong systems level step toward KV native collaboration between models.

Check out the Paper and Codes.
The post Cache-to-Cache(C2C): Direct Semantic Communication Between Large Language Models via KV-Cache Fusion appeared first on MarkTechPost.

Iterate faster with Amazon Bedrock AgentCore Runtime direct code deplo …

Amazon Bedrock AgentCore is an agentic platform for building, deploying, and operating effective agents securely at scale. Amazon Bedrock AgentCore Runtime is a fully managed service of Bedrock AgentCore, which provides low latency serverless environments to deploy agents and tools. It provides session isolation, supports multiple agent frameworks including popular open-source frameworks, and handles multimodal workloads and long-running agents.
AgentCore Runtime supports container based deployments where the container definition is provided in a Dockerfile, and the agent is built as a container image. Customers who have container build and deploy pipelines benefit from this method, where agent deployment can be integrated into existing pipelines. 
Today, AgentCore Runtime has launched a second method to deploy agents – direct code deployment (for Python). Agent code and its dependencies can be packaged as a zip archive, eliminating the need for a Dockerfile and Amazon ECR dependencies. This makes it straightforward for developers to prototype and iterate faster. This method is a good fit for customers who would rather not maintain Docker expertise and container infrastructure when deploying agents.
In this post, we’ll demonstrate how to use direct code deployment (for Python).
Introducing AgentCore Runtime direct code deployment
With the container deployment method, developers create a Dockerfile, build ARM-compatible containers, manage ECR repositories, and upload a new container image for every code change. This works well where container DevOps pipelines have already been established to automate deployments.
However, customers looking for fully managed deployments can benefit from direct code deployment, which can significantly reduce deployment effort and improve developer productivity. Direct code deployment provides a secure and scalable path from rapidly prototyping agent capabilities to deploying production workloads at scale.
We’ll discuss the strengths of each deployment option to help you choose the right approach for your use case. 

With direct code deployment, developers create a zip archive of code and dependencies, upload to Amazon S3, and configure the bucket in the agent configuration. When using the AgentCore starter toolkit, the toolkit handles dependency detection, packaging, and upload which provides a much-simplified developer experience. Direct code deployment is also supported using the API.
Let’s compare the deployment steps at a high level between the two methods:
Container-based deployment
The container-based deployment method involves the following steps:

Create a Dockerfile
Build ARM-compatible container
Create ECR repository
Upload to ECR
Deploy to AgentCore Runtime

Direct code deployment
The direct code deployment method involves the following steps:

Package your code and dependencies into a zip archive
Upload it to S3
Configure the bucket in agent configuration
Deploy to AgentCore Runtime
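If you want to script steps 1 and 2 yourself rather than rely on the starter toolkit, a minimal sketch could look like the following. The bucket name, object key and file layout are placeholders, and the AgentCore configuration itself is still done with the starter toolkit or API as shown later in this post:

import zipfile
from pathlib import Path

import boto3

BUCKET = "my-agentcore-artifacts"       # placeholder bucket that you own
KEY = "agents/my-agent/package.zip"     # placeholder object key

def build_zip(project_dir: str, out_path: str) -> str:
    # Package all project files (skipping the virtual environment and the archive itself)
    project = Path(project_dir)
    with zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in project.rglob("*"):
            if (path.is_file()
                    and ".venv" not in path.parts
                    and path.resolve() != Path(out_path).resolve()):
                zf.write(path, path.relative_to(project))
    return out_path

if __name__ == "__main__":
    archive = build_zip(".", "package.zip")
    boto3.client("s3").upload_file(archive, BUCKET, KEY)   # step 2: upload to S3
    print(f"Uploaded {archive} to s3://{BUCKET}/{KEY}")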

How to use direct code deployment
Let’s illustrate how direct code deployment works with an agent created with Strands Agents SDK and using the AgentCore starter-toolkit to deploy the agent.
Prerequisites
Before you begin, make sure you have the following:

Any version of Python from 3.10 to 3.13
Your preferred package manager installed. In this example, we use the uv package manager.
AWS account for creating and deploying agents
Amazon Bedrock model access to Anthropic Claude Sonnet 4.0

Step 1: Initialize your project
Set up a new Python project using the uv package manager, then navigate into the project directory:

uv init <project> --python 3.13
cd <project>

Step 2: Add the dependencies for the project
Install the required Bedrock AgentCore libraries and development tools for your project. In this example, dependencies are added to the pyproject.toml file; alternatively, they can be specified in a requirements.txt file:

uv add bedrock-agentcore strands-agents strands-agents-tools
uv add --dev bedrock-agentcore-starter-toolkit
source .venv/bin/activate

Step 3: Create an agent.py file
Create the main agent implementation file that defines your AI agent’s behavior:

from bedrock_agentcore import BedrockAgentCoreApp
from strands import Agent, tool
from strands_tools import calculator
from strands.models import BedrockModel
import logging

app = BedrockAgentCoreApp(debug=True)

# Logging setup
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Create a custom tool
@tool
def weather():
    """ Get weather """
    return "sunny"

model_id = "us.anthropic.claude-sonnet-4-20250514-v1:0"
model = BedrockModel(
    model_id=model_id,
)

agent = Agent(
    model=model,
    tools=[calculator, weather],
    system_prompt="You're a helpful assistant. You can do simple math calculation, and tell the weather."
)

@app.entrypoint
def invoke(payload):
    """Your AI agent function"""
    user_input = payload.get("prompt", "Hello! How can I help you today?")
    logger.info("\n User input: %s", user_input)
    response = agent(user_input)
    logger.info("\n Agent result: %s ", response.message)
    return response.message['content'][0]['text']

if __name__ == "__main__":
    app.run()

Step 4: Deploy to AgentCore Runtime
Configure and deploy your agent to the AgentCore Runtime environment:

agentcore configure --entrypoint agent.py --name <agent-name>

This will launch an interactive session where you configure the S3 bucket to upload the zip deployment package to and choose a deployment configuration type (as shown in the following configuration). To opt for direct code deployment, choose option 1 – Code Zip.
Deployment Configuration
Select deployment type:

1. Code Zip (recommended) – Simple, serverless, no Docker required
2. Container – For custom runtimes or complex dependencies

agentcore launch

This command creates a zip deployment package, uploads it to the specified S3 bucket, and launches the agent in the AgentCore Runtime environment, making it ready to receive and process requests.
To test the solution, let’s prompt the agent to see how the weather is:

agentcore invoke '{"prompt":"How is the weather today?"}'

The first deployment takes approximately 30 seconds to complete, but subsequent updates to the agent benefit from the streamlined direct code deployment process and should take less than half the time, supporting faster iteration cycles during development.
When to choose direct code instead of container-based deployment
Let’s look at some of the dimensions and see how the direct code and container-based deployment options are different. This will help you choose the option that’s right for you:

Deployment process: Direct code deploys agents as zip files with no Docker, ECR, or CodeBuild required. Container-based deployment uses Docker and ECR with full Dockerfile control.
Deployment time: Although there is not much difference during first deployment of an agent, subsequent updates to the agent are significantly faster with direct code deployment (from an average of 30 seconds for containers to about 10 seconds for direct code deployment).
Artifact storage: Direct code stores ZIP packages in an S3 bucket. Container-based deployment stores Docker images in Amazon ECR. Direct code deployment incurs storage costs at standard S3 storage rates (starting February 27, 2026) as artifacts are stored in the service account. Container-based deployment incurs Amazon ECR charges in your account.
Customization: Direct code deployment supports custom dependencies through ZIP-based packaging, while container based depends on a Dockerfile.
Package size: Direct code deployment limits the package size to 250MB whereas container-based packages can be up to 2GB in size.
Language Support: Direct code currently supports Python 3.10, 3.11, 3.12, and 3.13. Container-based deployment supports many languages and runtimes.

Our general guidance is:
Container-based deployment is the right choice when your package exceeds 250MB, you have existing container CI/CD pipelines, or you need highly specialized dependencies and custom packaging requirements. Choose containers if you require multi-language support, custom system dependencies or direct control over artifact storage and versioning in your account.
Direct code deployment is the right choice when your package is under 250MB, you use Python 3.10-3.13 with common frameworks like LangGraph, Strands, or CrewAI, and you need rapid prototyping with fast iteration cycles. Choose direct code if your build process is straightforward without complex dependencies, and you want to remove the Docker/ECR/CodeBuild setup.
A hybrid approach works well for many teams: use direct code for rapid prototyping and experimentation where fast iteration and simple setup accelerate development, then graduate to containers for production when package size, multi-language requirements, or specialized build processes demand it.
Conclusion
Amazon Bedrock AgentCore direct code deployment makes iterative agent development cycles even faster, while still benefiting from enterprise security and scale of deployments. Developers can now rapidly prototype and iterate by deploying their code directly, without having to create a container. To get started with Amazon Bedrock AgentCore direct code deployment, visit the AWS documentation.

About the authors
Chaitra Mathur is a GenAI Specialist Solutions Architect at AWS. She works with customers across industries in building scalable generative AI platforms and operationalizing them. Throughout her career, she has shared her expertise at numerous conferences and has authored several blogs in the Machine Learning and Generative AI domains.
Qingwei Li is a Machine Learning Specialist at Amazon Web Services. He received his Ph.D. in Operations Research after he broke his advisor’s research grant account and failed to deliver the Nobel Prize he promised. Currently he helps customers in the financial service and insurance industry build machine learning solutions on AWS. In his spare time, he likes reading and teaching.
Kosti Vasilakakis is a Principal PM at AWS on the Agentic AI team, where he has led the design and development of several Bedrock AgentCore services from the ground up, including Runtime, Browser, Code Interpreter, and Identity. He previously worked on Amazon SageMaker since its early days, launching AI/ML capabilities now used by thousands of companies worldwide. Earlier in his career, Kosti was a data scientist. Outside of work, he builds personal productivity automations, plays tennis, and enjoys life with his wife and kids.

How to Build Supervised AI Models When You Don’t Have Annotated Data

One of the biggest challenges in real-world machine learning is that supervised models require labeled data—yet in many practical scenarios, the data you start with is almost always unlabeled. Manually annotating thousands of samples isn’t just slow; it’s expensive, tedious, and often impractical.

This is where active learning becomes a game-changer.

Active learning is a subset of machine learning in which the algorithm is not a passive consumer of data—it becomes an active participant. Instead of labeling the entire dataset upfront, the model intelligently selects which data points it wants labeled next. It interactively queries a human or oracle for labels on the most informative samples, allowing it to learn faster using far fewer annotations. Check out the FULL CODES here.

Here’s how the workflow typically looks:

Begin by labeling a small seed portion of the dataset to train an initial, weak model.

Use this model to generate predictions and confidence scores on the unlabeled data.

Compute a confidence metric (e.g., probability gap) for each prediction; a short sketch of two such metrics follows this list.

Select only the lowest-confidence samples—the ones the model is most unsure about.

Manually label these uncertain samples and add them to the training set.

Retrain the model and repeat the cycle of predict → rank confidence → label → retrain.
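Before walking through the full implementation, here is a minimal sketch of two common confidence metrics, least confidence (used later in this tutorial) and the probability gap (margin); the probability values are invented for illustration:

import numpy as np

probs = np.array([[0.55, 0.45],    # an uncertain prediction
                  [0.95, 0.05]])   # a confident prediction

least_confidence = 1 - probs.max(axis=1)            # higher value = more uncertain
sorted_p = np.sort(probs, axis=1)[:, ::-1]
margin = sorted_p[:, 0] - sorted_p[:, 1]            # smaller gap = more uncertain

print(least_confidence)   # [0.45 0.05]
print(margin)             # [0.1  0.9 ]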

After several iterations, the model can achieve near–fully supervised performance while requiring far fewer manually labeled samples.

In this article, we’ll walk through how to apply this strategy step-by-step and show how active learning can help you build high-quality supervised models with minimal labeling effort. Check out the FULL CODES here.

Installing & Importing the libraries

Copy CodeCopiedUse a different Browserpip install numpy pandas scikit-learn matplotlib

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

For this tutorial, we will be using the make_classification dataset from the sklearn library

Copy CodeCopiedUse a different BrowserSEED = 42 # For reproducibility
N_SAMPLES = 1000 # Total number of data points
INITIAL_LABELED_PERCENTAGE = 0.10 # Your constraint: Start with 10% labeled data
NUM_QUERIES = 20 # Number of times we ask the "human" to label a confusing sample

NUM_QUERIES = 20 represents the annotation budget in an active learning setup. In a real-world workflow, this would mean the model selects the 20 most confusing samples and sends them to human annotators to label—each annotation costing time and money. In our simulation, we replicate this process automatically: during each iteration, the model selects one uncertain sample, the code instantly retrieves its true label (acting as the human oracle), and the model is retrained with this new information. 

Thus, setting NUM_QUERIES = 20 means we’re simulating the benefit of labeling only 20 strategically chosen samples and observing how much the model improves with that limited but valuable human effort.

Data Generation and Splitting Strategy for Active Learning

This block handles data generation and the initial split that powers the entire Active Learning experiment. It first uses make_classification to create 1,000 synthetic samples for a two-class problem. The dataset is then split into a 10% held-out test set for final evaluation and a 90% pool for training. From this pool, only 10% is kept as the small initial labeled set—matching the constraint of starting with very limited annotations—while the remaining 90% becomes the unlabeled pool. This setup creates the realistic low-label scenario Active Learning is designed for, with a large pool of unlabeled samples ready for strategic querying. Check out the FULL CODES here.

Copy CodeCopiedUse a different BrowserX, y = make_classification(
    n_samples=N_SAMPLES, n_features=10, n_informative=5, n_redundant=0,
    n_classes=2, n_clusters_per_class=1, flip_y=0.1, random_state=SEED
)

# 1. Split into 90% Pool (samples to be queried) and 10% Test (final evaluation)
X_pool, X_test, y_pool, y_test = train_test_split(
    X, y, test_size=0.10, random_state=SEED, stratify=y
)

# 2. Split the 90% Pool into Initial Labeled (10% of the pool) and Unlabeled (90% of the pool)
X_labeled_current, X_unlabeled_full, y_labeled_current, y_unlabeled_full = train_test_split(
    X_pool, y_pool, test_size=1.0 - INITIAL_LABELED_PERCENTAGE,
    random_state=SEED, stratify=y_pool
)

# A set to track indices in the unlabeled pool for efficient querying and removal
unlabeled_indices_set = set(range(X_unlabeled_full.shape[0]))

print(f"Initial Labeled Samples (STARTING N): {len(y_labeled_current)}")
print(f"Unlabeled Pool Samples: {len(unlabeled_indices_set)}")

Initial Training and Baseline Evaluation

This block trains the initial Logistic Regression model using only the small labeled seed set and evaluates its accuracy on the held-out test set. The labeled sample count and baseline accuracy are then stored as the first points in the performance history, establishing a starting benchmark before Active Learning begins. Check out the FULL CODES here.

Copy CodeCopiedUse a different Browserlabeled_size_history = []
accuracy_history = []

# Train the baseline model on the small initial labeled set
baseline_model = LogisticRegression(random_state=SEED, max_iter=2000)
baseline_model.fit(X_labeled_current, y_labeled_current)

# Evaluate performance on the held-out test set
y_pred_init = baseline_model.predict(X_test)
accuracy_init = accuracy_score(y_test, y_pred_init)

# Record the baseline point (x=90, y=0.8800)
labeled_size_history.append(len(y_labeled_current))
accuracy_history.append(accuracy_init)

print(f"INITIAL BASELINE (N={labeled_size_history[0]}): Test Accuracy: {accuracy_history[0]:.4f}")

Active Learning Loop

This block contains the heart of the Active Learning process, where the model iteratively selects the most uncertain sample, receives its true label, retrains, and evaluates performance. In each iteration, the current model predicts probabilities for all unlabeled samples, identifies the one with the highest uncertainty (least confidence), and “queries” its true label—simulating a human annotator. The newly labeled data point is added to the training set, a fresh model is retrained, and accuracy is recorded. Repeating this cycle for 20 queries demonstrates how targeted labeling quickly improves model performance with minimal annotation effort. Check out the FULL CODES here.

Copy CodeCopiedUse a different Browsercurrent_model = baseline_model # Start the loop with the baseline model

print(f"\nStarting Active Learning Loop ({NUM_QUERIES} Queries)...")

# -----------------------------------------------
# The Active Learning Loop (Query, Annotate, Retrain, Evaluate)
# Purpose: Run 20 iterations to demonstrate strategic labeling gains.
# -----------------------------------------------
for i in range(NUM_QUERIES):
    if not unlabeled_indices_set:
        print("Unlabeled pool is empty. Stopping.")
        break

    # --- A. QUERY STRATEGY: Find the Least Confident Sample ---
    # 1. Get probability predictions from the CURRENT model for all unlabeled samples
    probabilities = current_model.predict_proba(X_unlabeled_full)
    max_probabilities = np.max(probabilities, axis=1)

    # 2. Calculate Uncertainty Score (1 - Max Confidence)
    uncertainty_scores = 1 - max_probabilities

    # 3. Identify the index of the sample with the MAXIMUM uncertainty score
    current_indices_list = list(unlabeled_indices_set)
    current_uncertainty = uncertainty_scores[current_indices_list]
    most_uncertain_idx_in_subset = np.argmax(current_uncertainty)
    query_index_full = current_indices_list[most_uncertain_idx_in_subset]
    query_uncertainty_score = uncertainty_scores[query_index_full]

    # --- B. HUMAN ANNOTATION SIMULATION ---
    # This is the single critical step where the human annotator intervenes.
    # We look up the true label (y_unlabeled_full) for the sample the model asked for.
    X_query = X_unlabeled_full[query_index_full].reshape(1, -1)
    y_query = np.array([y_unlabeled_full[query_index_full]])

    # Update the Labeled Set: Add the new annotated sample (N becomes N+1)
    X_labeled_current = np.vstack([X_labeled_current, X_query])
    y_labeled_current = np.hstack([y_labeled_current, y_query])
    # Remove the sample from the unlabeled pool
    unlabeled_indices_set.remove(query_index_full)

    # --- C. RETRAIN and EVALUATE ---
    # Train the NEW model on the larger, improved labeled set
    current_model = LogisticRegression(random_state=SEED, max_iter=2000)
    current_model.fit(X_labeled_current, y_labeled_current)

    # Evaluate the new model on the held-out test set
    y_pred = current_model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)

    # Record results for plotting
    labeled_size_history.append(len(y_labeled_current))
    accuracy_history.append(accuracy)

    # Output status
    print(f"\nQUERY {i+1}: Labeled Samples: {len(y_labeled_current)}")
    print(f" > Test Accuracy: {accuracy:.4f}")
    print(f" > Uncertainty Score: {query_uncertainty_score:.4f}")

final_accuracy = accuracy_history[-1]

Final Result

The experiment successfully validated the efficiency of Active Learning. By focusing annotation efforts on only 20 strategically selected samples (increasing the labeled set from 90 to 110), the model’s performance on the unseen Test Set improved from 0.8800 (88%) to 0.9100 (91%). 

This 3 percentage point increase in accuracy was achieved with a minimal increase in annotation effort—roughly a 22% increase in the size of the training data resulted in a measurable and meaningful performance boost. 

In essence, the Active Learner acts as an intelligent curator, ensuring that every dollar or minute spent on human labeling provides the maximum possible benefit, proving that smart labeling is far more valuable than random or bulk labeling. Check out the FULL CODES here.

Plotting the results

Copy CodeCopiedUse a different Browserplt.figure(figsize=(10, 6))
plt.plot(labeled_size_history, accuracy_history, marker='o', linestyle='-', color='#00796b', label='Active Learning (Least Confidence)')
plt.axhline(y=final_accuracy, color='red', linestyle='--', alpha=0.5, label='Final Accuracy')
plt.title('Active Learning: Accuracy vs. Number of Labeled Samples')
plt.xlabel('Number of Labeled Samples')
plt.ylabel('Test Set Accuracy')
plt.grid(True, linestyle='--', alpha=0.7)
plt.legend()
plt.tight_layout()
plt.show()

Check out the FULL CODES here.
The post How to Build Supervised AI Models When You Don’t Have Annotated Data appeared first on MarkTechPost.

Anyscale and NovaSky Team Releases SkyRL tx v0.1.0: Bringing Tinker Co …

How can AI teams run Tinker style reinforcement learning on large language models using their own infrastructure with a single unified engine? The Anyscale and NovaSky (UC Berkeley) team has released SkyRL tx v0.1.0, which gives developers a way to run a Tinker compatible training and inference engine directly on their own hardware, while keeping the same minimal API that Tinker exposes in the managed service.

The research team describes SkyRL tx as a unified training and inference engine that implements the Tinker API and allows people to run a Tinker like service on their own infrastructure. This v0.1.0 version is the first of its series that supports reinforcement learning end to end, and it also makes sampling significantly faster.

Tinker API in brief

Tinker from Thinking Machines is a training API built around four core functions. forward_backward performs a forward pass and a backward pass and accumulates gradients. optim_step updates model weights based on those gradients. sample generates tokens for interaction, evaluation or RL actions. save_state writes checkpoints for resuming training.

Instead of a full task specific fine tuning abstraction, Tinker exposes these low level primitives so that users can implement their own supervised or reinforcement learning loops in regular Python code, while the service handles GPU scheduling and distributed execution.
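To make these primitives concrete, here is a hedged sketch of what a user-written loop over them could look like. The client object and exact method signatures are assumptions for illustration only; consult the Tinker and SkyRL tx documentation for the real API surface:

def training_loop(client, dataset, steps=100):
    # `client` is assumed to expose the four Tinker primitives described above.
    for step, batch in enumerate(dataset):
        if step >= steps:
            break
        client.forward_backward(batch)          # forward + backward, gradients accumulate
        client.optim_step()                     # apply accumulated gradients to the weights
        if step % 10 == 0:
            sample = client.sample(prompt="2 + 2 =", max_tokens=8)   # rollout / eval sample
            print(step, sample)
        if step % 50 == 0:
            client.save_state(name=f"ckpt-{step}")                   # checkpoint for resuming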

SkyRL tx targets this exact API and implements an open backend that users can deploy locally. It keeps the Tinker programming model, while removing the need to rely only on the hosted environment.

Where SkyRL tx fits inside SkyRL

SkyRL is a full stack reinforcement learning library for large language models that includes skyrl-agent for long horizon agents, skyrl-train for training, and skyrl-gym for tool use environments such as math, coding, search and SQL.

Within this stack, skyrl-tx is marked as an experimental cross platform library that exposes a local Tinker like REST API for model post training. SkyRL tx therefore becomes the system layer that connects RL logic, environments and training code to concrete GPU resources through the Tinker interface.

Architecture, inference engine that also trains

The SkyRL tx architecture is described as an inference engine that also supports backward passes. It has four main components:

REST API server that processes incoming requests from different users.

Database that tracks metadata about models, checkpoints, requests and futures, and also acts as a job queue. The current implementation uses SQLite behind an interface that also supports other SQL databases such as Postgres.

Engine that schedules and batches requests across users. Each engine instance serves a single base model and can attach many LoRA adapters.

Worker that executes forward and backward passes and holds model definitions and optimizer states. Multiple workers will enable more advanced multi node sharding in upcoming versions.

What v0.1.0 adds

The v0.1.0 release focuses on reinforcement learning support and performance improvements. The official release highlights several concrete changes:

Sampling is now much faster, since it is jitted and properly batched and sharded in the engine.

Different sampling parameters per request, per request seeds and stop tokens are now supported, which is useful when many experiments share a base model.

After several fixes, the RL loop now runs properly through the engine.

Gradient checkpointing support and micro batching for sampling are implemented.

Postgres is now supported as a database backend, next to SQLite.

Running RL end to end on 8 H100 GPUs

The official release contains a specific code recipe for running reinforcement learning end to end on a cluster with 8 H100 GPUs.

First, users clone the SkyRL repository and in the skyrl-tx folder start the engine with:

Copy CodeCopiedUse a different Browseruv run --extra gpu --extra tinker -m tx.tinker.api \
    --base-model Qwen/Qwen3-4B \
    --max-lora-adapters 3 \
    --max-lora-rank 1 \
    --tensor-parallel-size 8 \
    --train-micro-batch-size 8 > out.log

Then they clone the Tinker Cookbook from the Thinking Machines team and in the tinker_cookbook/recipes folder run:

Copy CodeCopiedUse a different Browserexport TINKER_API_KEY=dummy
export WANDB_API_KEY=<your key>
uv run --with wandb --with tinker rl_loop.py \
    base_url=http://localhost:8000 \
    model_name="Qwen/Qwen3-4B" \
    lora_rank=1 \
    max_length=1024 \
    save_every=100

This produces a reward curve that confirms the RL loop runs correctly through the local SkyRL tx backend.

Key Takeaways

SkyRL tx v0.1.0 implements a local, Tinker compatible engine that unifies training and inference for LLM post training.

The system exposes Tinker primitives, forward_backward, optim_step, sample and save_state over REST, while handling batching, LoRA adapters and device placement internally.

Architecture is split into API server, SQL database, scheduling engine and workers that execute forward and backward passes for a single base model with multiple LoRA adapters.

v0.1.0 adds end to end reinforcement learning support, faster jitted and sharded sampling, per request sampling parameters, gradient checkpointing, micro batching and Postgres support.

Editorial Comments

SkyRL tx v0.1.0 is a practical step for dev teams that want Tinker style reinforcement learning on their own clusters with a consistent Tinker API surface. The design that treats the system as an inference engine that also runs backward passes is clean and reduces stack divergence. Support for LoRA, gradient checkpointing, micro batching and Postgres is a concrete systems upgrade. Overall, this release turns Tinker compatibility into an actionable local RL backend for LLM post training.

Check out the Repo and Official Release.
The post Anyscale and NovaSky Team Releases SkyRL tx v0.1.0: Bringing Tinker Compatible Reinforcement Learning RL Engine To Local GPU Clusters appeared first on MarkTechPost.

How to Design a Persistent Memory and Personalized Agentic AI System w …

In this tutorial, we explore how to build an intelligent agent that remembers, learns, and adapts to us over time. We implement a Persistent Memory & Personalisation system using simple, rule-based logic to simulate how modern Agentic AI frameworks store and recall contextual information. As we progress, we see how the agent’s responses evolve with experience, how memory decay helps prevent overload, and how personalisation improves performance. We aim to understand, step by step, how persistence transforms a static chatbot into a context-aware, evolving digital companion. Check out the FULL CODES here.

Copy CodeCopiedUse a different Browserimport math, time, random
from typing import List

class MemoryItem:
    def __init__(self, kind:str, content:str, score:float=1.0):
        self.kind = kind
        self.content = content
        self.score = score
        self.t = time.time()

class MemoryStore:
    def __init__(self, decay_half_life=1800):
        self.items: List[MemoryItem] = []
        self.decay_half_life = decay_half_life

    def _decay_factor(self, item:MemoryItem):
        dt = time.time() - item.t
        return 0.5 ** (dt / self.decay_half_life)

We establish the foundation for our agent’s long-term memory. We define the MemoryItem class to hold each piece of information and build a MemoryStore with an exponential decay mechanism, so that stored information ages over time, much like human memory. Check out the FULL CODES here.

Copy CodeCopiedUse a different Browser    def add(self, kind:str, content:str, score:float=1.0):
        self.items.append(MemoryItem(kind, content, score))

    def search(self, query:str, topk=3):
        scored = []
        for it in self.items:
            decay = self._decay_factor(it)
            sim = len(set(query.lower().split()) & set(it.content.lower().split()))
            final = (it.score * decay) + sim
            scored.append((final, it))
        scored.sort(key=lambda x: x[0], reverse=True)
        return [it for _, it in scored[:topk] if _ > 0]

    def cleanup(self, min_score=0.1):
        new = []
        for it in self.items:
            if it.score * self._decay_factor(it) > min_score:
                new.append(it)
        self.items = new

We expand the memory system by adding methods to insert, search, and clean old memories. We implement a simple similarity function and a decay-based cleanup routine, enabling the agent to remember relevant facts while automatically forgetting weak or outdated ones. Check out the FULL CODES here.

Copy CodeCopiedUse a different Browserclass Agent:
    def __init__(self, memory:MemoryStore, name="PersonalAgent"):
        self.memory = memory
        self.name = name

    def _llm_sim(self, prompt:str, context:List[str]):
        base = "OK. "
        if any("prefers short" in c for c in context):
            base = ""
        reply = base + f"I considered {len(context)} past notes. "
        if "summarize" in prompt.lower():
            return reply + "Summary: " + " | ".join(context[:2])
        if "recommend" in prompt.lower():
            if any("cybersecurity" in c for c in context):
                return reply + "Recommended: write more cybersecurity articles."
            if any("rag" in c for c in context):
                return reply + "Recommended: build an agentic RAG demo next."
            return reply + "Recommended: continue with your last topic."
        return reply + "Here's my response to: " + prompt

    def perceive(self, user_input:str):
        ui = user_input.lower()
        if "i like" in ui or "i prefer" in ui:
            self.memory.add("preference", user_input, 1.5)
        if "topic:" in ui:
            self.memory.add("topic", user_input, 1.2)
        if "project" in ui:
            self.memory.add("project", user_input, 1.0)

    def act(self, user_input:str):
        mems = self.memory.search(user_input, topk=4)
        ctx = [m.content for m in mems]
        answer = self._llm_sim(user_input, ctx)
        self.memory.add("dialog", f"user said: {user_input}", 0.6)
        self.memory.cleanup()
        return answer, ctx

We design an intelligent agent that utilizes memory to inform its responses. We create a mock language model simulator that adapts replies based on stored preferences and topics. At the same time, the perception function enables the agent to dynamically capture new insights about the user. Check out the FULL CODES here.

Copy CodeCopiedUse a different Browserdef evaluate_personalisation(agent:Agent):
    agent.memory.add("preference", "User likes cybersecurity articles", 1.6)
    q = "Recommend what to write next"
    ans_personal, _ = agent.act(q)
    empty_mem = MemoryStore()
    cold_agent = Agent(empty_mem)
    ans_cold, _ = cold_agent.act(q)
    gain = len(ans_personal) - len(ans_cold)
    return ans_personal, ans_cold, gain

Now we give our agent the ability to act and evaluate itself. We allow it to recall memories to shape contextual answers and add a small evaluation loop to compare personalised responses versus a memory-less baseline, quantifying how much the memory helps. Check out the FULL CODES here.

Copy CodeCopiedUse a different Browsermem = MemoryStore(decay_half_life=60)
agent = Agent(mem)

print("=== Demo: teaching the agent about yourself ===")
inputs = [
    "I prefer short answers.",
    "I like writing about RAG and agentic AI.",
    "Topic: cybersecurity, phishing, APTs.",
    "My current project is to build an agentic RAG Q&A system."
]
for inp in inputs:
    agent.perceive(inp)

print("\n=== Now ask the agent something ===")
user_q = "Recommend what to write next in my blog"
ans, ctx = agent.act(user_q)
print("USER:", user_q)
print("AGENT:", ans)
print("USED MEMORY:", ctx)

print("\n=== Evaluate personalisation benefit ===")
p, c, g = evaluate_personalisation(agent)
print("With memory :", p)
print("Cold start :", c)
print("Personalisation gain (chars):", g)

print("\n=== Current memory snapshot ===")
for it in agent.memory.items:
    print(f"- {it.kind} | {it.content[:60]}... | score~{round(it.score,2)}")

Finally, we run the full demo to see our agent in action. We feed it user inputs, observe how it recommends personalised actions, and check its memory snapshot. We witness the emergence of adaptive behaviour, proof that persistent memory transforms a static script into a learning companion.

In conclusion, we demonstrate how adding memory and personalisation makes our agent more human-like, capable of remembering preferences, adapting plans, and forgetting outdated details naturally. We observe that even simple mechanisms such as decay and retrieval significantly improve the agent’s relevance and response quality. By the end, we realize that persistent memory is the foundation of next-generation Agentic AI, one that learns continuously, tailors experiences intelligently, and maintains context dynamically in a fully local, offline setup.

Check out the FULL CODES here.
The post How to Design a Persistent Memory and Personalized Agentic AI System with Decay and Self-Evaluation? appeared first on MarkTechPost.

How Switchboard, MD automates real-time call transcription in clinical …

In high-volume healthcare contact centers, every patient conversation carries both clinical and operational significance, making accurate real-time transcription necessary for automated workflows. Accurate, instant transcription enables intelligent automation without sacrificing clarity or care, so that teams can automate electronic medical record (EMR) record matching, streamline workflows, and eliminate manual data entry. By removing routine process steps, staff can stay fully focused on patient conversations, improving both the experience and the outcome. As healthcare systems seek to balance efficiency with empathy, real-time transcription has become a capability for delivering responsive, high-quality care at scale.
Switchboard, MD is a physician-led AI and data science company with a mission to prioritize the human connection in medicine. Its service improves patient engagement and outcomes, while reducing inefficiency and burnout. By designing and deploying clinically relevant solutions, Switchboard, MD helps providers and operators collaborate more effectively to deliver great experiences for both patients and staff. One of its key solutions is streamlining the contact center using AI voice automation, real-time medical record matching, and suggested next steps, which has led to significant reductions in queue times and call abandonment rates.
With more than 20,000 calls handled each month, Switchboard, MD supports healthcare providers in delivering timely, personalized communication at scale. Its AI platform is already helping reduce call queue times, improve patient engagement, and streamline contact center operations for clinics and health systems. Customers using Switchboard have seen outcomes such as:

75% reduction in queue times
59% reduction in call abandonment rate

Despite these early successes, Switchboard faced a critical challenge: their existing transcription approach couldn’t scale economically while maintaining the accuracy required for clinical workflows. Cost and word error rate (WER) weren’t just operational metrics—they were critical enablers for scaling automation and expanding Switchboard’s impact across more patient interactions.
In this post, we examine the specific challenges Switchboard, MD faced with scaling transcription accuracy and cost-effectiveness in clinical environments, their evaluation process for selecting the right transcription solution, and the technical architecture they implemented using Amazon Connect and Amazon Kinesis Video Streams. This post details the impressive results achieved and demonstrates how they were able to use this foundation to automate EMR matching and give healthcare staff more time to focus on patient care. Finally, we’ll look at the broader implications for healthcare AI automation and how other organizations can implement similar solutions using Amazon Bedrock.
Choosing an accurate, scalable, and cost-effective transcription model for contact center automation
Switchboard, MD needed a transcription solution that delivered high accuracy at a sustainable cost. In clinical settings, transcription accuracy is critical because errors can compromise EMR record matching, affect recommended treatment plans, and disrupt automated workflows. At the same time, scaling support for thousands of calls each week meant that inference costs couldn’t be ignored.
Switchboard initially explored multiple paths, including evaluating open source models such as OpenAI’s Whisper model hosted locally. But these options presented tradeoffs—either in performance, cost, or integration complexity.
After testing, the team determined that Amazon Nova Sonic provided the right combination of transcription quality and efficiency needed to support their healthcare use case. The model performed reliably across live caller audio, even in noisy or variable conditions. It delivered:

80–90% lower transcription costs
A word error rate of 4% on Switchboard’s proprietary evaluation dataset
Low-latency output that aligned with their need for real-time processing

Equally important, Nova Sonic integrated smoothly into Switchboard’s existing architecture, minimizing engineering lift and accelerating deployment. With this foundation, the team reduced manual transcription steps and scaled accurate, real-time automation across thousands of patient interactions.

“Our vision is to restore the human connection in medicine by removing administrative barriers that get in the way of meaningful interaction. Nova Sonic gave us the speed and accuracy we needed to transcribe calls in real time—so our customers can focus on what truly matters: the patient conversation. By reducing our transcription costs by 80–90%, it’s also made real-time automation sustainable at scale.” – Dr. Blake Anderson, Founder, CEO, and CTO, Switchboard, MD

Architecture and implementation
Switchboard’s architecture uses Amazon Connect to capture live audio from both patients and representatives. Switchboard processes audio streams through Amazon Kinesis Video Streams, which handles the real-time media conversion before routing the data to containerized AWS Lambda functions. Switchboard’s Lambda functions establish bidirectional streaming connections with Amazon Nova Sonic using BedrockRuntimeClient’s InvokeModelWithBidirectionalStream API. This novel architecture creates separate transcription streams for each conversation participant, which Switchboard recombines to create the complete transcription record. The entire processing pipeline runs in a serverless environment, providing scalable operation designed to handle thousands of concurrent calls while using Nova Sonic’s real-time speech-to-text capabilities for immediate transcription processing.
Nova Sonic integration: Real-time speech processing
Harnessing Amazon Nova Sonic’s advanced audio streaming and processing, Switchboard built the capability to separate and recombine each speaker’s stream and transcript. This makes Amazon Nova Sonic particularly effective for Switchboard’s healthcare applications, where accurate transcription and speaker identification are crucial.
Amazon Nova Sonic offers configurable settings that can be optimized for different healthcare use cases, with the flexibility to prioritize either transcription or speech generation based on specific needs. A key cost-optimization feature is the ability to adjust speech output tokens – organizations can set lower token values when primarily focused on transcription, resulting in significant cost savings while maintaining high accuracy. This versatility and cost flexibility makes Amazon Nova Sonic a valuable tool for healthcare organizations like Switchboard looking to implement voice-enabled solutions.
Why serverless: Strategic advantages for healthcare innovation
Switchboard’s choice of a serverless architecture using Amazon Connect, Amazon Kinesis Video Streams, and containerized Lambda functions represents a strategic decision that maximizes operational efficiency while minimizing infrastructure overhead. The serverless approach eliminates the need to provision, manage, and monitor underlying infrastructure, so that Switchboard’s engineering team can focus on developing clinical automation features rather than server management. This architecture provides built-in fault tolerance and high availability for critical healthcare communications without requiring extensive configuration from Switchboard’s team.
Switchboard’s event-driven architecture, shown in the following figure, enables the system to scale from handling dozens to thousands of concurrent calls, with AWS automatically managing capacity provisioning behind the scenes. The pay-as-you-go billing model helps Switchboard pay only for compute resources used during call processing, optimizing costs while eliminating the risk of over-provisioning servers that would sit idle during low-volume periods.

Conclusion
Switchboard, MD’s implementation of Amazon Nova Sonic demonstrates how the right transcription technology can transform healthcare operations. By achieving 80–90% cost reductions while maintaining clinical-grade accuracy, they’ve created a sustainable foundation for scaling AI-powered patient interactions across the healthcare industry.
By building on Amazon Bedrock, Switchboard now has the flexibility to expand automation across more use cases and provider networks. Their success exemplifies how healthcare innovators can combine accuracy, speed, and efficiency to transform how care teams connect with patients—one conversation at a time.
Get started with Amazon Nova on the Amazon Bedrock console. Learn more about Amazon Nova models at the Amazon Nova product page.

About the authors
Tanner Jones is a Technical Account Manager in AWS Enterprise Support, where he helps customers navigate and optimize their production applications on AWS. He specializes in helping customers develop applications that incorporate AI agents, with a particular focus on building safe multi-agent systems.
Anuj Jauhari is a Sr. Product Marketing Manager at AWS, where he helps customers innovate and drive business impact with generative AI solutions built on Amazon Nova models.
Jonathan Woods is a Solutions Architect at AWS based in Nashville currently working with SMB customers. He has a passion for communicating AWS technology to businesses in a relevant way making it easy for customers to innovate. Outside of work, he tries keeping up with his three kids.
Nauman Zulfiqar is a senior account manager based in New York working with SMB clients. He loves building and maintaining strong customer relationships, understanding their business challenges and serving as the customer’s primary business advocate within AWS.

How to Create AI-ready APIs?

Postman recently released a comprehensive checklist and developer guide for building AI-ready APIs, highlighting a simple truth: even the most powerful AI models are only as good as the data they receive—and that data comes through your APIs. If your endpoints are inconsistent, unclear, or unreliable, models waste time fixing bad inputs instead of producing insight. Postman’s playbook distills years of best practices into practical steps that help teams make their APIs predictable, machine-readable, and dependable for AI workloads.

This article summarizes the key ideas from that playbook. As we move into a world where Agents—not humans—will make purchases, compare options, and interact with services, APIs must evolve. Unlike developers, Agents can’t compensate for messy docs or ambiguous behavior. They rely on standardized patterns and automatically generated, machine-consumable documentation that stays in sync with your schema. The goal is simple: create APIs that humans and AI agents can understand instantly, so your systems can scale smarter and unlock their full potential.

Machine consumable metadata

Humans can infer missing details from vague API docs, but AI agents can’t—they rely entirely on explicit, machine-readable metadata. Instead of saying “this endpoint returns user preferences,” an AI-ready API must define everything: request type, parameter schema, response structure, and object definitions. Clear, explicit metadata like the hedged sketch below removes ambiguity, ensures agents don’t guess, and makes APIs fully understandable to machines.
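As a hedged illustration, explicit machine-readable metadata for such an endpoint might look like the following; the endpoint path, fields and values are invented for this example and are not Postman's schema:

# Illustrative (invented) machine-readable description of a "user preferences" endpoint.
endpoint_metadata = {
    "method": "GET",
    "path": "/v1/users/{user_id}/preferences",
    "parameters": {
        "user_id": {"type": "string", "required": True, "description": "Unique user identifier"},
    },
    "response": {
        "status": 200,
        "content_type": "application/json",
        "schema": {
            "type": "object",
            "properties": {
                "language": {"type": "string", "enum": ["en", "es", "fr"]},
                "marketing_opt_in": {"type": "boolean"},
            },
            "required": ["language", "marketing_opt_in"],
        },
    },
}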

Rich Error Semantics

Developers can interpret vague errors like “Something went wrong,” but AI agents can’t—they need precise, structured guidance. AI-ready APIs must clearly spell out what failed, why it failed, and how to fix it. Rich error metadata with fields like code, message, expected, and received removes guesswork and enables agents to self-correct instead of getting stuck.
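A hedged example of such a structured error payload is shown below; the field names follow the article's suggestion of code, message, expected and received, but the exact format is invented for illustration:

# Illustrative (invented) structured error response, not a specific product's format.
error_response = {
    "status": 422,
    "error": {
        "code": "INVALID_PARAMETER",
        "message": "date_from must be an ISO-8601 date",
        "field": "date_from",
        "expected": "YYYY-MM-DD",
        "received": "13/01/2025",
        "retryable": True,
    },
}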

Introspection Capabilities

For APIs to be AI-ready, they must move beyond human-centric, vague documentation. Unlike developers who can infer missing details using context and RESTful conventions, AI agents rely entirely on structured data for planning and execution. This means APIs must provide complete introspection through a full schema, explicitly defining all endpoints, parameters, data schemas, and error codes. Without this clarity, AI systems are forced to guess, which inevitably leads to broken workflows and unreliable, hallucinated behavior.

Consistent Naming Patterns

AI systems rely on consistent patterns, so predictable naming conventions make your API far easier for them to understand and navigate. When endpoints and fields follow clear, uniform structures—like proper REST methods and consistent casing—AI can infer relationships and behaviors without guesswork. This reduces ambiguity and enables more accurate automation, reasoning, and integration across your entire API.

Predictable behaviour

AI agents need strict consistency—same inputs should always produce the same structure, format, and fields. Humans can troubleshoot inconsistent responses using intuition, but AI can’t assume or investigate; it only learns from the patterns you provide. If naming, nesting, or errors vary across endpoints, the agent becomes unreliable or breaks entirely. To be AI-ready, your API must enforce predictable responses, uniform naming, consistent error handling, and zero hidden edge cases. In short: inconsistent inputs lead to inconsistent agent behavior.

Proper documentation

Humans can look things up when docs are unclear, but AI agents can’t—they only know what your API explicitly tells them. Without clear, complete documentation, an agent can’t discover endpoints, understand parameters, predict responses, or recover from errors. Good documentation isn’t optional for AI-ready APIs—it’s the only way agents can learn and reliably interact with your system.

Reliable and fast

AI agents act as orchestrators, making rapid and often parallel API calls—so your API’s speed and reliability directly impact their performance. Humans can wait out slow responses or retry manually, but agents will time out, fail, or break entire workflows. In fast, automated environments, an AI system is only as strong as the APIs it relies on. If your API can’t keep up, neither can your AI.

Discoverability

Humans can track down missing APIs through wikis, chats, code, or intuition—but AI agents can’t. If an API isn’t clearly published with structured, searchable metadata, it simply doesn’t exist to them. AI systems depend on standardized, discoverable specs and examples to understand how to use an API. Making your API visible, accessible, and well-indexed—through platforms like the Postman API Network—ensures both developers and agents can reliably find and integrate it.
The post How to Create AI-ready APIs? appeared first on MarkTechPost.

LongCat-Flash-Omni: A SOTA Open-Source Omni-Modal Model with 560B Para …

How do you design a single model that can listen, see, read and respond in real time across text, image, video and audio without losing efficiency? Meituan’s LongCat team has released LongCat Flash Omni, an open source omni modal model with 560 billion parameters and about 27 billion active per token, built on the shortcut connected Mixture of Experts design that LongCat Flash introduced. The model extends the text backbone to vision, video and audio, and it keeps a 128K context so it can run long conversations and document level understanding in one stack.

https://github.com/meituan-longcat/LongCat-Flash-Omni?tab=readme-ov-file

Architecture and Modal Attachments

LongCat Flash Omni keeps the language model unchanged and adds perception modules around it. A LongCat ViT encoder processes both images and video frames, so there is no separate video tower. An audio encoder, together with the LongCat Audio Codec, turns speech into discrete tokens, and the decoder can output speech from the same LLM stream, which enables real time audio visual interaction.

Streaming and Feature Interleaving

The research team describes chunk wise audio visual feature interleaving, where audio features, video features and timestamps are packed into 1 second segments. Video is sampled at 2 frames per second by default, and the rate is then adjusted according to video length; the report does not tie the sampling rule to user or model speaking phases, so the correct description is duration conditioned sampling. This keeps latency low while still providing spatial context for GUI, OCR and video QA tasks.
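The following toy sketch (our own illustration, not LongCat code) shows the general shape of this packing: per second chunks that hold duration conditioned video frames next to that second’s audio tokens. The sampling rule, token counts, and placeholder strings are invented for illustration only.

# Toy illustration of chunk-wise audio-visual interleaving (not LongCat code).
from dataclasses import dataclass
from typing import List

@dataclass
class AVChunk:
    start_s: int                # chunk start timestamp in seconds
    video_frames: List[str]     # frame features sampled within this second
    audio_tokens: List[str]     # discrete audio tokens for this second

def sampling_fps(duration_s: float, default_fps: float = 2.0) -> float:
    # Duration-conditioned sampling: longer videos are sampled more sparsely
    # so the token budget stays roughly constant (this rule is invented).
    if duration_s <= 120:
        return default_fps
    return max(0.5, default_fps * 120 / duration_s)

def interleave(duration_s: int, audio_tokens_per_s: int = 16) -> List[AVChunk]:
    fps = sampling_fps(duration_s)
    chunks = []
    for t in range(duration_s):
        frames = [f"frame@{t + i / fps:.2f}s" for i in range(int(round(fps)))]
        audio = [f"audio_{t}_{k}" for k in range(audio_tokens_per_s)]
        chunks.append(AVChunk(start_s=t, video_frames=frames, audio_tokens=audio))
    return chunks

print(interleave(3)[0])  # first 1-second chunk with its frames and audio tokens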

Curriculum from Text to Omni

Training follows a staged curriculum. The research team first trains the LongCat Flash text backbone, which activates 18.6B to 31.3B parameters per token (about 27B on average). It then applies text speech continued pretraining, multimodal continued pretraining with image and video, context extension to 128K, and finally audio encoder alignment.

Systems Design, Modality Decoupled Parallelism

Because the encoders and the LLM have different compute patterns, Meituan uses modality decoupled parallelism. Vision and audio encoders run with hybrid sharding and activation recomputation, while the LLM runs with pipeline, context and expert parallelism, and a ModalityBridge aligns embeddings and gradients. The research team reports that multimodal supervised fine tuning keeps more than 90 percent of the throughput of text only training, which is the main systems result in this release.


Benchmarks and Positioning

LongCat Flash Omni reaches 61.4 on OmniBench, this is higher than Qwen 3 Omni Instruct at 58.5 and Qwen 2.5 Omni at 55.0, but lower than Gemini 2.5 Pro at 66.8. On VideoMME it scores 78.2, which is close to GPT 4o and Gemini 2.5 Flash, and on VoiceBench it reaches 88.7, slightly higher than GPT 4o Audio in the same table.

Key Takeaways

LongCat Flash Omni is an open source omni modal model built on Meituan’s 560B MoE backbone. It activates about 27B parameters per token through shortcut connected MoE with zero computation experts, so it keeps large capacity while staying inference friendly.

The model attaches unified vision video encoding and a streaming audio path to the existing LongCat Flash LLM, using 2 fps default video sampling with duration conditioned adjustment, and packs audio visual features into 1 second chunks for synchronized decoding, which is what enables real time any to any interaction.

LongCat Flash Omni scores 61.4 on OmniBench, above Qwen 3 Omni Instruct at 58.5, but below Gemini 2.5 Pro at 66.8.

Meituan uses modality decoupled parallelism: vision and audio encoders run with hybrid sharding, while the LLM runs with pipeline, context and expert parallelism. The team reports more than 90 percent of text only throughput for multimodal SFT, which is the main systems contribution of the release.

Editorial Comments

This release shows that Meituan is trying to make omni modal interaction practical, not experimental. It keeps the 560B shortcut connected Mixture of Experts with about 27B activated parameters, so the language backbone stays compatible with earlier LongCat releases. It adds streaming audio visual perception with 2 fps default video sampling and duration conditioned adjustment, so latency remains low without losing spatial grounding. It reports over 90 percent of text only throughput in multimodal supervised fine tuning through modality decoupled parallelism.

Check out the Paper, Model Weights and GitHub Repo. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter. You can also join us on Telegram.
The post LongCat-Flash-Omni: A SOTA Open-Source Omni-Modal Model with 560B Parameters with 27B activated, Excelling at Real-Time Audio-Visual Interaction appeared first on MarkTechPost.

Comparing the Top 6 OCR (Optical Character Recognition) Models/Systems …

Optical character recognition has moved from plain text extraction to document intelligence. Modern systems must read scanned and digital PDFs in one pass, preserve layout, detect tables, extract key value pairs, and work with more than one language. Many teams now also want OCR that can feed RAG and agent pipelines directly. In 2025, 6 systems cover most real workloads:

Google Cloud Document AI, Enterprise Document OCR

Amazon Textract

Microsoft Azure AI Document Intelligence

ABBYY FineReader Engine and FlexiCapture

PaddleOCR 3.0

DeepSeek OCR, Contexts Optical Compression

The goal of this comparison is not to rank them on a single metric, because they target different constraints. The goal is to show which system to use for a given document volume, deployment model, language set, and downstream AI stack.


Evaluation dimensions

We compare on 6 stable dimensions:

Core OCR quality on scanned, photographed and digital PDFs.

Layout and structure: tables, key value pairs, selection marks, reading order.

Language and handwriting coverage.

Deployment model: fully managed, container, on premises, self hosted.

Integration with LLM, RAG and IDP tools.

Cost at scale.

1. Google Cloud Document AI, Enterprise Document OCR

Google’s Enterprise Document OCR takes PDFs and images, whether scanned or digital, and returns text with layout, tables, key value pairs and selection marks. It also exposes handwriting recognition in 50 languages and can detect math and font style. This matters for financial statements, educational forms and archives. Output is structured JSON that can be sent to Vertex AI or any RAG system.

Strengths

High quality OCR on business documents.

Strong layout graph and table detection.

One pipeline for digital and scanned PDFs, which keeps ingestion simple.

Enterprise grade, with IAM and data residency.

Limits

It is a metered Google Cloud service.

Custom document types still require configuration.

Use when your data is already on Google Cloud or when you must preserve layout for a later LLM stage.
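For orientation, a single synchronous call with the Python client looks roughly like this (a sketch assuming the google-cloud-documentai library; the project, location, processor ID, and file name are placeholders):

# Sketch: send a local PDF to a Document AI OCR processor and read back the
# full text plus a rough table count. All identifiers below are placeholders.
from google.cloud import documentai

def ocr_pdf(path: str, project: str, location: str, processor_id: str):
    client = documentai.DocumentProcessorServiceClient()
    name = client.processor_path(project, location, processor_id)
    with open(path, "rb") as f:
        raw = documentai.RawDocument(content=f.read(), mime_type="application/pdf")
    result = client.process_document(
        request=documentai.ProcessRequest(name=name, raw_document=raw)
    )
    doc = result.document
    table_count = sum(len(page.tables) for page in doc.pages)
    return doc.text, table_count

# text, n_tables = ocr_pdf("invoice.pdf", "my-project", "us", "my-processor-id")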

2. Amazon Textract

Textract provides two API lanes, synchronous for small documents and asynchronous for large multipage PDFs. It extracts text, tables, forms and signatures and returns them as blocks with relationships. AnalyzeDocument in 2025 can also answer queries over the page, which simplifies invoice or claim extraction. The integration with S3, Lambda and Step Functions makes it easy to turn Textract into an ingestion pipeline.

Strengths

Reliable table and key value extraction for receipts, invoices and insurance forms.

Clear sync and batch processing model.

Tight AWS integration, good for serverless and IDP on S3.

Limits

Image quality has a visible effect, so camera uploads may need preprocessing.

Customization is more limited than Azure custom models.

Locked to AWS.

Use when the workload is already in AWS and you need structured JSON out of the box.
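For orientation, a synchronous AnalyzeDocument call with boto3 looks roughly like this (a sketch; the bucket, object key, and query text are placeholders, and large multipage PDFs would go through the asynchronous StartDocumentAnalysis path instead):

# Sketch: synchronous AnalyzeDocument with tables, forms, and one natural-language query.
import boto3

textract = boto3.client("textract", region_name="us-east-1")

response = textract.analyze_document(
    Document={"S3Object": {"Bucket": "my-docs-bucket", "Name": "invoices/inv-001.png"}},
    FeatureTypes=["TABLES", "FORMS", "QUERIES"],
    QueriesConfig={"Queries": [{"Text": "What is the invoice total?"}]},
)

# Results come back as typed blocks linked by parent/child relationships.
for block in response["Blocks"]:
    if block["BlockType"] == "QUERY_RESULT":
        print("Query answer:", block.get("Text"))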

3. Microsoft Azure AI Document Intelligence

Azure’s service, renamed from Form Recognizer, combines OCR, generic layout, prebuilt models and custom neural or template models. The 2025 release added layout and read containers, so enterprises can run the same model on premises. The layout model extracts text, tables, selection marks and document structure and is designed for further processing by LLMs.

Strengths

Best in class custom document models for line of business forms.

Containers for hybrid and air gapped deployments.

Prebuilt models for invoices, receipts and identity documents.

Clean JSON output.

Limits

Accuracy on some non English documents can still be slightly behind ABBYY.

Pricing and throughput must be planned because it is still a cloud first product.

Use when you need to teach the system your own templates or when you are a Microsoft shop that wants the same model in Azure and on premises.
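For orientation, a prebuilt layout call with the Python SDK looks roughly like this (a sketch using the azure-ai-formrecognizer package; the endpoint, key, and file name are placeholders):

# Sketch: analyze layout (text, tables, selection marks) with the prebuilt-layout model.
from azure.core.credentials import AzureKeyCredential
from azure.ai.formrecognizer import DocumentAnalysisClient

client = DocumentAnalysisClient(
    endpoint="https://<your-resource>.cognitiveservices.azure.com/",
    credential=AzureKeyCredential("<your-key>"),
)

with open("contract.pdf", "rb") as f:
    poller = client.begin_analyze_document("prebuilt-layout", document=f)
result = poller.result()

print("Pages:", len(result.pages), "| Tables:", len(result.tables))
for table in result.tables:
    print(f"Table: {table.row_count} rows x {table.column_count} columns")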

4. ABBYY FineReader Engine and FlexiCapture

ABBYY stays relevant in 2025 because of three things: accuracy on printed documents, very wide language coverage, and deep control over preprocessing and zoning. The current Engine and FlexiCapture products support 190 or more languages, export structured data, and can be embedded in Windows, Linux and VM workloads. ABBYY is also strong in regulated sectors where data cannot leave the premises.

Strengths

Very high recognition quality on scanned contracts, passports, old documents.

Largest language set in this comparison.

FlexiCapture can be tuned to messy recurring documents.

Mature SDKs.

Limits

License cost is higher than open source.

Deep learning based scene text is not the focus.

Scaling to hundreds of nodes needs engineering.

Use when you must run on premises, must process many languages, or must pass compliance audits.

5. PaddleOCR 3.0

PaddleOCR 3.0 is an Apache licensed open source toolkit that aims to bridge images and PDFs to LLM ready structured data. It ships with PP OCRv5 for multilingual recognition, PP StructureV3 for document parsing and table reconstruction, and PP ChatOCRv4 for key information extraction. It supports 100 plus languages, runs on CPU and GPU, and has mobile and edge variants.

Strengths

Free and open, no per page cost.

Fast on GPU, usable on edge.

Covers detection, recognition and structure in one project.

Active community.

Limits

You must deploy, monitor and update it.

For European or financial layouts you often need postprocessing or fine tuning.

Security and durability are your responsibility.

Use when you want full control, or you want to build a self hosted document intelligence service for LLM RAG.
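As a quick illustration of the self-hosted path, a basic recognition call looks roughly like this (a sketch assuming PaddleOCR’s Python interface; model downloads happen on first use, and the exact result format and pipeline entry points should be checked against the 3.0 documentation):

# Minimal sketch: detection + recognition on a single image with PaddleOCR.
# Language choice and result handling are assumptions, not verified against 3.0.
from paddleocr import PaddleOCR

ocr = PaddleOCR(lang="en")            # pulls detection/recognition models on first run
result = ocr.ocr("invoice_scan.png")  # nested list of text lines with boxes and scores
print(result)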

6. DeepSeek OCR, Contexts Optical Compression

DeepSeek OCR was released in October 2025. It is not a classical OCR system but an LLM centric vision language model that compresses long text and documents into high resolution images, then decodes them. The public model card and blog report around 97 percent decoding accuracy at 10 times compression and around 60 percent at 20 times compression. It is MIT licensed, built around a 3B decoder, and already supported in vLLM and Hugging Face. This makes it interesting for teams that want to reduce token cost before calling an LLM.

Strengths

Self hosted, GPU ready.

Excellent for long context and mixed text plus tables because compression happens before decoding.

Open license.

Fits modern agentic stacks.

Limits

There is no standard public benchmark yet that puts it against Google or AWS, so enterprises must run their own tests.

Requires a GPU with enough VRAM.

Accuracy depends on chosen compression ratio.

Use when you want OCR that is optimized for LLM pipelines rather than for archive digitization.

Head to head comparison

Core task
- Google Document AI: OCR for scanned and digital PDFs, returns text, layout, tables, key value pairs, selection marks.
- Amazon Textract: OCR for text, tables, forms, IDs, invoices, receipts, with sync and async APIs.
- Azure Document Intelligence: OCR plus prebuilt and custom models, layout, containers for on premises.
- ABBYY FineReader Engine / FlexiCapture: High accuracy OCR and document capture for large, multilingual, on premises workloads.
- PaddleOCR 3.0: Open source OCR and document parsing, PP OCRv5, PP StructureV3, PP ChatOCRv4.
- DeepSeek OCR: LLM centric OCR that compresses document images and decodes them for long context AI.

Text and layout
- Google Document AI: Blocks, paragraphs, lines, words, symbols, tables, key value pairs, selection marks.
- Amazon Textract: Text, relationships, tables, forms, query responses, lending analysis.
- Azure Document Intelligence: Text, tables, key value pairs, selection marks, figure extraction, structured JSON, v4 layout model.
- ABBYY: Zoning, tables, form fields, classification through FlexiCapture.
- PaddleOCR 3.0: StructureV3 rebuilds tables and document hierarchy, KIE modules available.
- DeepSeek OCR: Reconstructs content after optical compression, good for long pages, needs local evaluation.

Handwriting
- Google Document AI: Printed and handwriting for 50 languages.
- Amazon Textract: Handwriting in forms and free text.
- Azure Document Intelligence: Handwriting supported in read and layout models.
- ABBYY: Printed very strong, handwriting available via capture templates.
- PaddleOCR 3.0: Supported, may need domain tuning.
- DeepSeek OCR: Depends on image and compression ratio, not yet benchmarked against the cloud services.

Languages
- Google Document AI: 200+ OCR languages, 50 handwriting languages.
- Amazon Textract: Main business languages, invoices, IDs, receipts.
- Azure Document Intelligence: Major business languages, expanding in v4.x.
- ABBYY: 190 to 201 languages depending on edition, widest in this comparison.
- PaddleOCR 3.0: 100+ languages in the v3.0 stack.
- DeepSeek OCR: Multilingual via the VLM decoder, coverage good but not exhaustively published, test per project.

Deployment
- Google Document AI: Fully managed Google Cloud.
- Amazon Textract: Fully managed AWS, synchronous and asynchronous jobs.
- Azure Document Intelligence: Managed Azure service plus read and layout containers (2025) for on premises.
- ABBYY: On premises, VM, customer cloud, SDK centric.
- PaddleOCR 3.0: Self hosted, CPU, GPU, edge, mobile.
- DeepSeek OCR: Self hosted, GPU, vLLM ready, license to verify.

Integration path
- Google Document AI: Exports structured JSON to Vertex AI, BigQuery, RAG pipelines.
- Amazon Textract: Native to S3, Lambda, Step Functions, AWS IDP.
- Azure Document Intelligence: Azure AI Studio, Logic Apps, AKS, custom models, containers.
- ABBYY: BPM, RPA, ECM, IDP platforms.
- PaddleOCR 3.0: Python pipelines, open RAG stacks, custom document services.
- DeepSeek OCR: LLM and agent stacks that want to reduce tokens first, vLLM and Hugging Face supported.

Cost model
- Google Document AI: Pay per 1,000 pages, volume discounts.
- Amazon Textract: Pay per page or document, AWS billing.
- Azure Document Intelligence: Consumption based, container licensing for local runs.
- ABBYY: Commercial license, per server or per volume.
- PaddleOCR 3.0: Free, infrastructure cost only.
- DeepSeek OCR: Free repo, GPU cost, license to confirm.

Best fit
- Google Document AI: Mixed scanned and digital PDFs on Google Cloud, layout preserved.
- Amazon Textract: AWS ingestion of invoices, receipts, loan packages at scale.
- Azure Document Intelligence: Microsoft shops that need custom models and hybrid deployment.
- ABBYY: Regulated, multilingual, on premises processing.
- PaddleOCR 3.0: Self hosted document intelligence for LLM and RAG.
- DeepSeek OCR: Long document LLM pipelines that need optical compression.

What to use when

Cloud IDP on invoices, receipts, medical forms: Amazon Textract or Azure Document Intelligence.

Mixed scanned and digital PDFs for banks and telcos on Google Cloud: Google Document AI Enterprise Document OCR.

Government archive or publisher with 150 plus languages and no cloud: ABBYY FineReader Engine and FlexiCapture.

Startup or media company building its own RAG over PDFs: PaddleOCR 3.0.

LLM platform that wants to shrink context before inference: DeepSeek OCR.

Editorial Comments

Google Document AI, Amazon Textract, and Azure AI Document Intelligence all deliver layout aware OCR with tables, key value pairs, and selection marks as structured JSON outputs, while ABBYY FineReader Engine 12 R7 and FlexiCapture export structured data in XML and the new JSON format and support 190 to 201 languages for on premises processing. PaddleOCR 3.0 provides Apache licensed PP OCRv5, PP StructureV3, and PP ChatOCRv4 for self hosted document parsing. DeepSeek OCR reports 97% decoding precision below 10x compression and about 60% at 20x, so enterprises must run local benchmarks before rollout in production workloads. Overall, OCR in 2025 is document intelligence first, recognition second.

References:

Google Cloud Document AI – Enterprise Document OCR: https://docs.cloud.google.com/document-ai/docs/enterprise-document-ocr (Google Cloud Documentation)

Google Cloud – Document AI product page: https://cloud.google.com/document-ai (Google Cloud)

Amazon Textract – product page: https://aws.amazon.com/textract/ (Amazon Web Services, Inc.)

Amazon Textract – analyzing documents (tables, forms, queries, signatures): https://docs.aws.amazon.com/textract/latest/dg/how-it-works-analyzing.html (AWS Documentation)

Microsoft Azure AI Document Intelligence – docs: https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/ (Microsoft Learn)

Microsoft Azure AI Document Intelligence – product page: https://azure.microsoft.com/en-us/products/ai-services/ai-document-intelligence (Microsoft Azure)

ABBYY FineReader Engine 12 R7 – release post: https://www.abbyy.com/blog/finereader-engine-12-r7-release/ (ABBYY)

ABBYY FlexiCapture – product page: https://www.abbyy.com/flexicapture/ (ABBYY)

PaddleOCR – official GitHub repo: https://github.com/PaddlePaddle/PaddleOCR (GitHub)

DeepSeek OCR – official launch blog (Contexts Optical Compression): https://deepseek.ai/blog/deepseek-ocr-context-compression (deepseek.ai)

DeepSeek OCR – GitHub repository: https://github.com/deepseek-ai/DeepSeek-OCR (GitHub)

DeepSeek OCR – coverage on compression ratios: https://venturebeat.com/ai/deepseek-drops-open-source-model-that-compresses-text-10x-through-images (venturebeat.com)

The post Comparing the Top 6 OCR (Optical Character Recognition) Models/Systems in 2025 appeared first on MarkTechPost.

A Coding Implementation of a Comprehensive Enterprise AI Benchmarking …

In this tutorial, we develop a comprehensive benchmarking framework to evaluate various types of agentic AI systems on real-world enterprise software tasks. We design a suite of diverse challenges, from data transformation and API integration to workflow automation and performance optimization, and assess how various agents, including rule-based, LLM-powered, and hybrid ones, perform across these domains. By running structured benchmarks and visualizing key performance metrics, such as accuracy, execution time, and success rate, we gain a deeper understanding of each agent’s strengths and trade-offs in enterprise environments. Check out the Full Codes here.

import json
import time
import random
from typing import Dict, List, Any, Callable
from dataclasses import dataclass, asdict
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

@dataclass
class Task:
    id: str
    name: str
    description: str
    category: str
    complexity: int
    expected_output: Any

@dataclass
class BenchmarkResult:
    task_id: str
    agent_name: str
    success: bool
    execution_time: float
    accuracy: float
    error_message: str = ""

class EnterpriseTaskSuite:
    def __init__(self):
        self.tasks = self._create_tasks()

    def _create_tasks(self) -> List[Task]:
        return [
            Task("data_transform", "CSV Data Transformation",
                 "Transform customer data by aggregating sales", "data_processing", 3,
                 {"total_sales": 15000, "avg_order": 750}),
            Task("api_integration", "REST API Integration",
                 "Parse API response and extract key metrics", "integration", 2,
                 {"status": "success", "active_users": 1250}),
            Task("workflow_automation", "Multi-Step Workflow",
                 "Execute data validation -> processing -> reporting", "automation", 4,
                 {"validated": True, "processed": 100, "report_generated": True}),
            Task("error_handling", "Error Recovery",
                 "Handle malformed data gracefully", "reliability", 3,
                 {"errors_caught": 5, "recovery_success": True}),
            Task("optimization", "Query Optimization",
                 "Optimize database query performance", "performance", 5,
                 {"execution_time_ms": 45, "rows_scanned": 1000}),
            Task("data_validation", "Schema Validation",
                 "Validate data against business rules", "validation", 2,
                 {"valid_records": 95, "invalid_records": 5}),
            Task("reporting", "Executive Dashboard",
                 "Generate KPI summary report", "analytics", 3,
                 {"revenue": 125000, "growth": 0.15, "customer_count": 450}),
            Task("integration_test", "System Integration",
                 "Test end-to-end integration flow", "testing", 4,
                 {"all_systems_connected": True, "latency_ms": 120}),
        ]

    def get_task(self, task_id: str) -> Task:
        return next((t for t in self.tasks if t.id == task_id), None)

We define the core data structures for our benchmarking system. We create the Task and BenchmarkResult data classes and initialize the EnterpriseTaskSuite, which holds multiple enterprise-relevant tasks such as data transformation, reporting, and integration. This lays the foundation for consistently evaluating different types of agents across these tasks. Check out the Full Codes here.

class BaseAgent:
    def __init__(self, name: str):
        self.name = name

    def execute(self, task: Task) -> Dict[str, Any]:
        raise NotImplementedError

class RuleBasedAgent(BaseAgent):
    def execute(self, task: Task) -> Dict[str, Any]:
        time.sleep(random.uniform(0.1, 0.3))
        if task.category == "data_processing":
            return {"total_sales": 15000 + random.randint(-500, 500),
                    "avg_order": 750 + random.randint(-50, 50)}
        elif task.category == "integration":
            return {"status": "success", "active_users": 1250}
        elif task.category == "automation":
            return {"validated": True, "processed": 98, "report_generated": True}
        else:
            return task.expected_output

We introduce the base agent structure and implement the RuleBasedAgent, which mimics traditional automation logic using predefined rules. We simulate how such agents execute tasks deterministically while maintaining speed and reliability, giving us a baseline for comparison with more advanced agents. Check out the Full Codes here.

class LLMAgent(BaseAgent):
    def execute(self, task: Task) -> Dict[str, Any]:
        time.sleep(random.uniform(0.2, 0.5))
        accuracy_boost = 0.95 if task.complexity >= 4 else 0.90
        result = {}
        for key, value in task.expected_output.items():
            # Perturb numeric fields only; booleans are copied as-is so they are
            # not silently turned into floats (bool is a subclass of int in Python).
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                variation = value * (1 - accuracy_boost)
                result[key] = value + random.uniform(-variation, variation)
            else:
                result[key] = value
        return result

class HybridAgent(BaseAgent):
    def execute(self, task: Task) -> Dict[str, Any]:
        time.sleep(random.uniform(0.15, 0.35))
        if task.complexity <= 2:
            return task.expected_output
        else:
            result = {}
            for key, value in task.expected_output.items():
                if isinstance(value, (int, float)) and not isinstance(value, bool):
                    variation = value * 0.03
                    result[key] = value + random.uniform(-variation, variation)
                else:
                    result[key] = value
            return result

We develop two intelligent agent types, the LLMAgent, representing reasoning-based AI systems, and the HybridAgent, which combines rule-based precision with LLM adaptability. We design these agents to show how learning-based methods improve task accuracy, especially for complex enterprise workflows. Check out the Full Codes here.

class BenchmarkEngine:
    def __init__(self, task_suite: EnterpriseTaskSuite):
        self.task_suite = task_suite
        self.results: List[BenchmarkResult] = []

    def run_benchmark(self, agent: BaseAgent, iterations: int = 3):
        print(f"\n{'='*60}")
        print(f"Benchmarking Agent: {agent.name}")
        print(f"{'='*60}")
        for task in self.task_suite.tasks:
            print(f"\nTask: {task.name} (Complexity: {task.complexity}/5)")
            for i in range(iterations):
                result = self._execute_task(agent, task, i + 1)
                self.results.append(result)
                status = "✓ PASS" if result.success else "✗ FAIL"
                print(f"  Run {i+1}: {status} | Time: {result.execution_time:.3f}s | Accuracy: {result.accuracy:.2%}")
Here, we build the core of our benchmarking engine, which manages agent evaluation across the defined task suite. We implement methods to run each agent multiple times per task, log results, and measure key parameters like execution time and accuracy. This creates a systematic and repeatable benchmarking loop. Check out the Full Codes here.

    # These methods continue the BenchmarkEngine class defined above.
    def _execute_task(self, agent: BaseAgent, task: Task, run_num: int) -> BenchmarkResult:
        start_time = time.time()
        try:
            output = agent.execute(task)
            execution_time = time.time() - start_time
            accuracy = self._calculate_accuracy(output, task.expected_output)
            success = accuracy >= 0.85
            return BenchmarkResult(task_id=task.id, agent_name=agent.name, success=success,
                                   execution_time=execution_time, accuracy=accuracy)
        except Exception as e:
            execution_time = time.time() - start_time
            return BenchmarkResult(task_id=task.id, agent_name=agent.name, success=False,
                                   execution_time=execution_time, accuracy=0.0, error_message=str(e))

    def _calculate_accuracy(self, output: Dict, expected: Dict) -> float:
        if not output:
            return 0.0
        scores = []
        for key, expected_val in expected.items():
            if key not in output:
                scores.append(0.0)
                continue
            actual_val = output[key]
            if isinstance(expected_val, bool):
                scores.append(1.0 if actual_val == expected_val else 0.0)
            elif isinstance(expected_val, (int, float)):
                diff = abs(actual_val - expected_val)
                tolerance = abs(expected_val * 0.1)
                score = max(0, 1 - (diff / (tolerance + 1e-9)))
                scores.append(score)
            else:
                scores.append(1.0 if actual_val == expected_val else 0.0)
        return np.mean(scores) if scores else 0.0

We define the task execution logic and the accuracy computation. We measure each agent’s performance by comparing their outputs against expected results using a scoring mechanism. This step ensures our benchmarking process is quantitative and fair, providing insights into how closely agents align with business expectations. Check out the Full Codes here.

    # Reporting and visualization methods, continuing the BenchmarkEngine class.
    def generate_report(self):
        df = pd.DataFrame([asdict(r) for r in self.results])
        print(f"\n{'='*60}")
        print("BENCHMARK REPORT")
        print(f"{'='*60}\n")
        for agent_name in df['agent_name'].unique():
            agent_df = df[df['agent_name'] == agent_name]
            print(f"{agent_name}:")
            print(f"  Success Rate: {agent_df['success'].mean():.1%}")
            print(f"  Avg Execution Time: {agent_df['execution_time'].mean():.3f}s")
            print(f"  Avg Accuracy: {agent_df['accuracy'].mean():.2%}\n")
        return df

    def visualize_results(self, df: pd.DataFrame):
        fig, axes = plt.subplots(2, 2, figsize=(14, 10))
        fig.suptitle('Enterprise Agent Benchmarking Results', fontsize=16, fontweight='bold')
        success_rate = df.groupby('agent_name')['success'].mean()
        axes[0, 0].bar(success_rate.index, success_rate.values, color=['#3498db', '#e74c3c', '#2ecc71'])
        axes[0, 0].set_title('Success Rate by Agent', fontweight='bold')
        axes[0, 0].set_ylabel('Success Rate')
        axes[0, 0].set_ylim(0, 1.1)
        for i, v in enumerate(success_rate.values):
            axes[0, 0].text(i, v + 0.02, f'{v:.1%}', ha='center', fontweight='bold')
        time_data = df.groupby('agent_name')['execution_time'].mean()
        axes[0, 1].bar(time_data.index, time_data.values, color=['#3498db', '#e74c3c', '#2ecc71'])
        axes[0, 1].set_title('Average Execution Time', fontweight='bold')
        axes[0, 1].set_ylabel('Time (seconds)')
        for i, v in enumerate(time_data.values):
            axes[0, 1].text(i, v + 0.01, f'{v:.3f}s', ha='center', fontweight='bold')
        df.boxplot(column='accuracy', by='agent_name', ax=axes[1, 0])
        axes[1, 0].set_title('Accuracy Distribution', fontweight='bold')
        axes[1, 0].set_xlabel('Agent')
        axes[1, 0].set_ylabel('Accuracy')
        plt.sca(axes[1, 0])
        plt.xticks(rotation=15)
        task_complexity = {t.id: t.complexity for t in self.task_suite.tasks}
        df['complexity'] = df['task_id'].map(task_complexity)
        complexity_perf = df.groupby(['agent_name', 'complexity'])['accuracy'].mean().unstack()
        complexity_perf.plot(kind='line', ax=axes[1, 1], marker='o', linewidth=2)
        axes[1, 1].set_title('Accuracy by Task Complexity', fontweight='bold')
        axes[1, 1].set_xlabel('Task Complexity')
        axes[1, 1].set_ylabel('Accuracy')
        axes[1, 1].legend(title='Agent', loc='best')
        axes[1, 1].grid(True, alpha=0.3)
        plt.tight_layout()
        plt.show()

if __name__ == "__main__":
    print("Enterprise Software Benchmarking for Agentic Agents")
    print("=" * 60)
    task_suite = EnterpriseTaskSuite()
    benchmark = BenchmarkEngine(task_suite)
    agents = [RuleBasedAgent("Rule-Based Agent"), LLMAgent("LLM Agent"), HybridAgent("Hybrid Agent")]
    for agent in agents:
        benchmark.run_benchmark(agent, iterations=3)
    results_df = benchmark.generate_report()
    benchmark.visualize_results(results_df)
    results_df.to_csv('agent_benchmark_results.csv', index=False)
    print("\nResults exported to: agent_benchmark_results.csv")

We generate detailed reports and create visual analytics for performance comparison. We analyze metrics such as success rate, execution time, and accuracy across agents and task complexities. Finally, we export the results to a CSV file, completing a full enterprise-grade evaluation workflow.

In conclusion, we implemented a robust, extensible benchmarking system that enables us to measure and compare the efficiency, adaptability, and accuracy of multiple agentic AI approaches. We observed how different architectures excel at different levels of task complexity and how visual analytics highlight performance trends. This process enables us to evaluate existing agents and provides a strong foundation for next-generation enterprise AI agents, optimized for reliability and intelligence.

Check out the Full Codes here. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter. You can also join us on Telegram.
The post A Coding Implementation of a Comprehensive Enterprise AI Benchmarking Framework to Evaluate Rule-Based, LLM, and Hybrid Agentic AI Systems Across Real-World Tasks appeared first on MarkTechPost.