Transform your MCP architecture: Unite MCP servers through AgentCore Gateway

As AI agents are adopted at scale, developer teams can create dozens to hundreds of specialized Model Context Protocol (MCP) servers, each tailored to a specific agent use case, domain, organizational function, or team. Organizations also need to integrate their own existing MCP servers or open source MCP servers into their AI workflows. There is a need for a way to efficiently combine these existing MCP servers, whether custom-built, publicly available, or open source, into a unified interface that AI agents can readily consume and teams can seamlessly share across the organization.
Earlier this year, we introduced Amazon Bedrock AgentCore Gateway, a fully managed service that serves as a centralized MCP tool server, providing a unified interface where agents can discover, access, and invoke tools. Today, we’re extending support for existing MCP servers as a new target type in AgentCore Gateway. With this capability, you can group multiple task-specific MCP servers aligned to agent goals behind a single, manageable MCP gateway interface. This reduces the operational complexity of maintaining separate gateways, while providing the same centralized tool and authentication management that existed for REST APIs and AWS Lambda functions.
Without a centralized approach, customers face significant challenges: discovering and sharing tools across organizations becomes fragmented, managing authentication across multiple MCP servers grows increasingly complex, and maintaining separate gateway instances for each server quickly becomes unmanageable. Amazon Bedrock AgentCore Gateway helps solve these challenges by treating existing MCP servers as native targets, giving customers a single point of control for routing, authentication, and tool management. This makes integrating MCP servers as simple as adding any other target to the gateway.
Breaking down MCP silos: Why enterprise teams need a unified Gateway
Let’s explore this through a real-world example of an e-commerce ordering system, where different teams maintain specialized MCP servers for their specific domains. Consider an enterprise e-commerce system where different teams have developed specialized MCP servers:

The Shopping Cart team maintains an MCP server with cart management tools
The Product Catalog team runs their MCP server for product browsing and search
The Promotions team operates an MCP server handling promotional logic

Previously, an ordering agent would need to interact with each of these MCP servers separately, managing multiple connections and authentication contexts. With the new MCP server target support in AgentCore Gateway, these specialized servers can now be unified under a single gateway while maintaining their team-specific ownership and access controls. The power of this approach lies in its organizational flexibility. Teams can group their MCP servers based on multiple logical criteria:

Business unit alignment: Organize the MCP servers by business unit
Product feature boundaries: Each product team owns their MCP server with domain-specific tools allowing them to maintain clear ownership while providing a unified interface for their agents
Security and access control: Different MCP servers require different authentication mechanisms. The gateway handles the authentication complexity, making it simple for authorized agents to access the tools they need

The following diagram illustrates how an ordering agent interacts with multiple MCP servers through AgentCore Gateway. The agent connects to the gateway and discovers the available tools. Each team maintains control over their domain-specific tools while contributing to a cohesive agent experience. The gateway handles tool naming collisions, authentication, and provides unified semantic search across the tools.

The AgentCore Gateway serves as an integration hub in modern agentic architectures, offering a unified interface for connecting diverse agent implementations with a wide array of tool providers. The architecture, as illustrated in the diagram, demonstrates how the gateway bridges the gap between agent and tool implementation approaches, now enhanced with the ability to directly integrate MCP server targets.
AgentCore Gateway integration architecture
In AgentCore Gateway, a target defines the APIs, Lambda functions, or other MCP servers that a gateway will provide as tools to an agent. Targets can be Lambda functions, OpenAPI specifications, Smithy models, MCP servers, or other tool definitions.
The target integration side of the architecture showcases the gateway’s versatility in tool integration. With the new MCP server target support, the gateway can directly incorporate tools from public MCP servers, treating them as first-class citizens alongside other target types. This capability extends to federation scenarios where one AgentCore Gateway instance can serve as a target for another, for hierarchical tool organization across organizational boundaries. The gateway can seamlessly integrate with AgentCore Runtime instances that expose agents as tools, private MCP servers maintained by customers, traditional AWS Lambda functions, and both Smithy and AWS service APIs.
Beyond target diversity, the gateway’s authentication architecture provides additional operational benefits. The gateway decouples its inbound authentication from target systems, letting agents access tools that use multiple identity providers through a single interface. This centralized approach simplifies development, deployment, and maintenance of AI agents. Now, the same approach can be used for MCP server targets, where the gateway manages the complexity of interfacing with the server using the configured identity provider for the target.
With this authentication foundation you get sophisticated tool management capabilities through a unified architecture. When an agent requests tool discovery, the gateway provides a consistent view across the integrated targets, with tools from MCP servers appearing alongside Lambda functions and traditional APIs. The semantic search capability operates uniformly across the tool types, so agents can discover relevant tools regardless of their implementation. During tool invocation, the gateway handles the necessary protocol translations, authentication flows, and data transformations, presenting a clean, consistent interface to agents while managing the complexity of different target systems behind the scenes.
The addition of MCP server target support represents a significant evolution in the gateway’s capabilities. Organizations can now directly integrate MCP-native tools while maintaining their investments in traditional APIs and Lambda functions. This flexibility allows for gradual migration strategies where teams can adopt MCP-native implementations at their own pace while facilitating continuous operation of existing integrations. The gateway’s synchronization mechanisms make sure that tool definitions remain current across the different target types, while its authentication and authorization systems provide consistent security controls regardless of the underlying tool implementation.
The gateway combines MCP servers, traditional APIs, and serverless functions into a coherent tool environment. This capability, along with enterprise-grade security and performance, makes it a beneficial infrastructure for agentic computing.

Solution Walkthrough
In this post, we guide you through the steps to set up an MCP server target in AgentCore Gateway, which is as simple as adding a new MCP server target type to a new or existing gateway. Adding an MCP server to an AgentCore Gateway lets you centralize tool management and authentication and apply operational best practices for managing MCP servers at scale.

Get started with adding an MCP server to AgentCore Gateway
To get started, you will create an AgentCore Gateway and add your MCP Server as a target.
Prerequisites
Verify you have the following prerequisites:

AWS account with Amazon Bedrock AgentCore access. For more information, review the Permissions for AgentCore Runtime documentation.
Python 3.12 or later
Basic understanding of OAuth 2.0

You can create gateways and add targets through multiple interfaces:

AWS SDK for Python (Boto3)
AWS Management Console
AWS Command Line Interface (AWS CLI)
AgentCore starter toolkit for fast and straightforward setup

The following practical examples and code snippets demonstrate how to set up and use Amazon Bedrock AgentCore Gateway. For an interactive walkthrough, you can use these Jupyter Notebook samples on GitHub.
Create a gateway
To create a gateway, you need an inbound authorization configuration. The AgentCore starter toolkit can create a default configuration that uses Amazon Cognito for JWT-based inbound authorization, or you can use another OAuth 2.0-compliant identity provider instead of Cognito. The following example creates the gateway with Boto3:

import time
import boto3

gateway_client = boto3.client("bedrock-agentcore-control")

# Create an authorization configuration that specifies which client is authorized to access this gateway
auth_config = {
    "customJWTAuthorizer": {
        "allowedClients": ["<cognito_client_id>"],  # Client MUST match the ClientId configured in Cognito
        "discoveryUrl": "<cognito_oauth_discovery_url>",
    }
}

# Call the create_gateway API
# This operation is asynchronous, so it may take time for the gateway to be created
# This gateway uses a CUSTOM_JWT authorizer backed by the Cognito user pool referenced in auth_config
def deploy_gateway(poll_interval=5):
    create_response = gateway_client.create_gateway(
        name="DemoGateway",
        roleArn="<IAM Role>",  # The IAM role must have permissions to create/list/get/delete gateways
        protocolType="MCP",
        authorizerType="CUSTOM_JWT",
        authorizerConfiguration=auth_config,
        description="AgentCore Gateway with MCP Server Target",
    )
    gatewayID = create_response["gatewayId"]
    gatewayURL = create_response["gatewayUrl"]

    # Wait for deployment
    while True:
        status_response = gateway_client.get_gateway(gatewayIdentifier=gatewayID)
        status = status_response["status"]
        if status == "READY":
            print("✅ AgentCore Gateway is READY!")
            return gatewayID, gatewayURL
        elif status in ["FAILED"]:
            print(f"❌ Deployment failed: {status}")
            return None, None
        print(f"Status: {status} - waiting...")
        time.sleep(poll_interval)

if __name__ == "__main__":
    gatewayID, gatewayURL = deploy_gateway()

# Values in < > need to be replaced with real values

Create a sample MCP Server
As an example, let's create a sample MCP server with three simple tools that return static responses. The server uses FastMCP with stateless_http=True, which is required for AgentCore Runtime compatibility.

from mcp.server.fastmcp import FastMCP

mcp = FastMCP(host="0.0.0.0", stateless_http=True)

@mcp.tool()
def getOrder() -> int:
    """Get an order"""
    return 123

@mcp.tool()
def updateOrder(orderId: int) -> int:
    """Update existing order"""
    return 456

@mcp.tool()
def cancelOrder(orderId: int) -> int:
    """Cancel existing order"""
    return 789

if __name__ == "__main__":
    mcp.run(transport="streamable-http")

Configure AgentCore Runtime deployment
Next, we will use the starter toolkit to configure the AgentCore Runtime deployment. The toolkit can create the Amazon ECR repository on launch and generate a Dockerfile for deployment on AgentCore Runtime. You can use your own existing MCP server; we're using the following only as an example. In a real-world environment, the inbound authorization for your MCP server will likely differ from the gateway configuration. Refer to this GitHub code example to create an Amazon Cognito user pool for Runtime authorization.

import os

from bedrock_agentcore_starter_toolkit import Runtime
from boto3.session import Session

boto_session = Session()
region = boto_session.region_name
print(f"Using AWS region: {region}")

required_files = ['mcp_server.py', 'requirements.txt']
for file in required_files:
    if not os.path.exists(file):
        raise FileNotFoundError(f"Required file {file} not found")
print("All required files found ✓")

agentcore_runtime = Runtime()

auth_config = {
    "customJWTAuthorizer": {
        "allowedClients": [
            '<runtime_cognito_client_id>'  # Client MUST match the ClientId configured in Cognito, and can be separate from the gateway's Cognito provider
        ],
        "discoveryUrl": '<cognito_oauth_discovery_url>',
    }
}

print("Configuring AgentCore Runtime...")
response = agentcore_runtime.configure(
    entrypoint="mcp_server.py",
    auto_create_execution_role=True,
    auto_create_ecr=True,
    requirements_file="requirements.txt",
    region=region,
    authorizer_configuration=auth_config,
    protocol="MCP",
    agent_name="mcp_server_agentcore"
)
print("Configuration completed ✓")

# Values in < > need to be replaced with real values

Launch MCP server to AgentCore Runtime
Now that we have the Dockerfile, let’s launch the MCP server to AgentCore Runtime:

print("Launching MCP server to AgentCore Runtime...")
print("This may take several minutes...")
launch_result = agentcore_runtime.launch()
agent_arn = launch_result.agent_arn
agent_id = launch_result.agent_id
print("Launch completed ✓")

encoded_arn = agent_arn.replace(':', '%3A').replace('/', '%2F')
mcp_url = f"https://bedrock-agentcore.{region}.amazonaws.com/runtimes/{encoded_arn}/invocations?qualifier=DEFAULT"

print(f"Agent ARN: {launch_result.agent_arn}")
print(f"Agent ID: {launch_result.agent_id}")

Create MCP server as target for AgentCore Gateway
Create an AgentCore Identity Resource Credential Provider for the AgentCore Gateway to use as outbound auth to the MCP server agent in AgentCore Runtime:

identity_client = boto3.client('bedrock-agentcore-control', region_name=region)

cognito_provider = identity_client.create_oauth2_credential_provider(
    name="gateway-mcp-server-identity",
    credentialProviderVendor="CustomOauth2",
    oauth2ProviderConfigInput={
        'customOauth2ProviderConfig': {
            'oauthDiscovery': {
                'discoveryUrl': '<cognito_oauth_discovery_url>',
            },
            'clientId': '<runtime_cognito_client_id>',  # Client MUST match the ClientId configured in Cognito for the Runtime authorizer
            'clientSecret': '<cognito_client_secret>'
        }
    }
)
cognito_provider_arn = cognito_provider['credentialProviderArn']
print(cognito_provider_arn)

# Values in < > need to be replaced with real values

Create a gateway target pointing to the MCP server:

gateway_client = boto3.client("bedrock-agentcore-control", region_name=region)
create_gateway_target_response = gateway_client.create_gateway_target(
    name="mcp-server-target",
    gatewayIdentifier=gatewayID,
    targetConfiguration={"mcp": {"mcpServer": {"endpoint": mcp_url}}},
    credentialProviderConfigurations=[
        {
            "credentialProviderType": "OAUTH",
            "credentialProvider": {
                "oauthCredentialProvider": {
                    "providerArn": cognito_provider_arn,
                    "scopes": ["<cognito_oauth_scopes>"],
                }
            },
        },
    ],
)  # Asynchronously creates the gateway target
gatewayTargetID = create_gateway_target_response["targetId"]

# Values in < > need to be replaced with real values

After creating a gateway target, implement a polling mechanism to check the gateway target status using the get_gateway_target API call:

import time

def poll_for_status(interval=5):
    # Poll for READY status
    while True:
        gateway_target_response = gateway_client.get_gateway_target(gatewayIdentifier=gatewayID, targetId=gatewayTargetID)
        status = gateway_target_response["status"]
        if status == 'READY':
            break
        elif status in ['FAILED', 'UPDATE_UNSUCCESSFUL', 'SYNCHRONIZE_UNSUCCESSFUL']:
            raise Exception(f"Gateway target failed with status: {status}")
        time.sleep(interval)

poll_for_status()

Test Gateway with Strands Agents framework
Let's test the gateway with the Strands Agents integration to list the tools from the MCP server. You can also use other MCP-compatible agents built with different agentic frameworks.
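The Strands client in the next snippet needs the gateway URL from the gateway creation step and a bearer token issued by the gateway's inbound identity provider. The following sketch shows one way to obtain such a token from a Cognito OAuth 2.0 token endpoint using the client credentials grant; the endpoint, client ID, client secret, and scope are placeholders, and the sample notebooks provide a utils.get_token helper for the same purpose.

import requests

def get_bearer_token(token_endpoint, client_id, client_secret, scope):
    # Exchange client credentials for an access token (OAuth 2.0 client_credentials grant)
    response = requests.post(
        token_endpoint,  # for example: https://<your-domain>.auth.<region>.amazoncognito.com/oauth2/token
        data={
            "grant_type": "client_credentials",
            "client_id": client_id,
            "client_secret": client_secret,
            "scope": scope,
        },
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        timeout=10,
    )
    response.raise_for_status()
    return response.json()["access_token"]

token = get_bearer_token("<cognito_token_endpoint>", "<cognito_client_id>", "<cognito_client_secret>", "<cognito_oauth_scopes>")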

from strands import Agent
from mcp.client.streamable_http import streamablehttp_client
from strands.tools.mcp.mcp_client import MCPClient

def create_streamable_http_transport():
    # gatewayURL comes from the gateway creation step; token is the bearer token for inbound auth
    return streamablehttp_client(gatewayURL, headers={"Authorization": f"Bearer {token}"})

client = MCPClient(create_streamable_http_transport)

with client:
    # Call listTools
    tools = client.list_tools_sync()
    # Create an agent with the model and tools
    agent = Agent(model=yourmodel, tools=tools)  # you can replace yourmodel with any model you like
    # Invoke the agent with a sample prompt. This only invokes MCP listTools and retrieves the list of tools the LLM has access to; it does not actually call any tool.
    agent("Hi, can you list all tools available to you?")
    # Invoke the agent with a sample prompt that invokes the tool and display the response
    agent("Get the Order id")

Refreshing tool definitions of your MCP servers in AgentCore Gateway
The SynchronizeGatewayTargets API is a new asynchronous operation that enables on-demand synchronization of tools from MCP server targets. MCP servers host tools that agents can discover and invoke. Over time, these tools might need to be updated, or new tools may be introduced in an existing MCP server target. You can connect with external MCP servers through the SynchronizeGatewayTargets API, which performs protocol handshakes and indexes the available tools. This API gives customers explicit control over when to refresh their tool definitions, which is particularly useful after making changes to an MCP server's tool configurations.
When a target is configured with OAuth authentication, the API first interacts with the AgentCore Identity service to retrieve the necessary credentials from the specified credential provider. These credentials are validated for freshness and availability before communication with the MCP server begins. If the credential retrieval fails or returns expired tokens, the synchronization operation fails immediately with appropriate error details, transitioning the target to a FAILED state. For targets configured without authentication, the API proceeds directly to tool synchronization.
The tool processing workflow begins with an initialize call to the MCP server to establish a session. Following successful initialization, the API makes paginated calls to the MCP server’s tools/list capability, processing tools in batches of 100 to optimize performance and resource utilization. Each batch of tools undergoes normalization where the API adds target-specific prefixes to help prevent naming collisions with tools from other targets. During processing, tool definitions are normalized to facilitate consistency across different target types, while preserving the essential metadata from the original MCP server definitions.

The synchronization flow begins when:

An Ops Admin initiates the SynchronizeGatewayTargets API, triggering AgentCore Gateway to refresh the configured MCP target.
The gateway obtains an OAuth token from AgentCore Identity for secure access to the MCP target.
The gateway then initializes a secure session with the MCP server to retrieve version capabilities.
Finally, the gateway makes paginated calls to the MCP server tools/list endpoint to retrieve the tool definitions, making sure the gateway maintains a current and accurate list of tools.

The SynchronizeGatewayTargets API addresses a critical challenge in managing MCP targets within AgentCore Gateway: maintaining an accurate representation of available tools while optimizing system performance and resource utilization. Here’s why this explicit synchronization approach is valuable:
Schema consistency management: Without explicit synchronization, AgentCore Gateway would need to either make real-time calls to MCP servers during ListTools operations (impacting latency and reliability) or risk serving stale tool definitions. The SynchronizeGatewayTargets API provides a controlled mechanism where customers can refresh their tool schemas at strategic times, such as after deploying new tools or updating existing ones in their MCP server. This approach makes sure that tool definitions in the gateway accurately reflect the target MCP server’s capabilities without compromising performance.

Performance impact trade-offs: The API implements optimistic locking during synchronization to help prevent concurrent modifications that could lead to inconsistent states. While this means multiple synchronization requests might need to retry if there’s contention, this trade-off is acceptable because:

Tool schema changes are typically infrequent operational events rather than regular runtime occurrences
The performance cost of synchronization is incurred only when explicitly requested, not during regular tool invocations
The cached tool definitions facilitate consistent high performance for ListTools operations between synchronizations

Invoke the synchronize gateway API
Use the following example to invoke the synchronize gateway operation for an MCP server target.
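The snippet is a minimal sketch using the Boto3 bedrock-agentcore-control client; the method and parameter names mirror the SynchronizeGatewayTargets API, so verify them against the current SDK documentation for your Boto3 version. It reuses gateway_client, gatewayID, gatewayTargetID, and the poll_for_status helper from the earlier snippets.

# Trigger an on-demand refresh of the tool definitions for the MCP server target
sync_response = gateway_client.synchronize_gateway_targets(
    gatewayIdentifier=gatewayID,
    targetIdList=[gatewayTargetID],
)
print(sync_response)

# Synchronization is asynchronous; poll until the target transitions back to READY
poll_for_status()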


Implicit synchronization of tools schema
During CreateGatewayTarget and UpdateGatewayTarget operations, AgentCore Gateway performs an implicit synchronization that differs from the explicit SynchronizeGatewayTargets API. This implicit synchronization makes sure that MCP targets are created or updated with valid, current tool definitions, aligning with the assurance from AgentCore Gateway that targets in READY state are immediately usable. While this might make create/update operations take longer than with other target types, it helps prevent the complexity and potential issues of having targets without validated tool definitions.

The implicit synchronization flow begins when:

An Ops Admin creates or updates the MCP target using CreateGatewayTarget or UpdateGatewayTarget operations.
AgentCore Gateway configures the new or updated MCP target.
The gateway asynchronously triggers the synchronization process to update the tool definitions.
The gateway obtains an OAuth token from AgentCore Identity for secure access.
The gateway then initializes a secure session with the MCP server to retrieve version capabilities.
Finally, the gateway makes paginated calls to the MCP server’s tools/list endpoint to retrieve the tool definitions, making sure the gateway maintains a current and accurate list of tools.

ListTools behavior for MCP targets
The ListTools operation in AgentCore Gateway provides access to tool definitions previously synchronized from MCP targets, following a cache-first approach that prioritizes performance and reliability. Unlike traditional OpenAPI or Lambda targets where tool definitions are statically defined, MCP target tools are discovered and cached through synchronization operations. When a client calls ListTools, the gateway retrieves tool definitions from its persistent storage rather than making real-time calls to the MCP server. These definitions were previously populated either through implicit synchronization during target creation/update or through explicit SynchronizeGatewayTargets API calls. The operation returns a paginated list of normalized tool definitions.
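To inspect what the gateway serves from this cache, you can send a standard MCP tools/list request to the gateway endpoint, mirroring the JSON-RPC pattern used by the search example later in this post. The sketch below reuses gatewayURL and the bearer token from the earlier snippets.

import requests

def list_gateway_tools(gateway_url, access_token):
    # Standard MCP JSON-RPC request; the gateway answers from its synchronized tool cache
    headers = {"Content-Type": "application/json", "Authorization": f"Bearer {access_token}"}
    payload = {"jsonrpc": "2.0", "id": "list-tools-request", "method": "tools/list", "params": {}}
    response = requests.post(gateway_url, headers=headers, json=payload, timeout=10)
    response.raise_for_status()
    return response.json()["result"]["tools"]

for tool in list_gateway_tools(gatewayURL, token):
    print(tool["name"])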

InvokeTool (tools/call) behavior for MCP targets
The InvokeTool operation for MCP targets handles the actual execution of tools discovered through ListTools, managing real-time communication with the target MCP server. Unlike the cache-based ListTools operation, tools/call requires active communication with the MCP server, introducing specific authentication, session management, and error handling requirements. When a tools/call request arrives, AgentCore Gateway first validates the tool exists in its synchronized definitions. For MCP targets, AgentCore Gateway performs an initial initialize call to establish a session with the MCP server. If the target is configured with OAuth credentials, AgentCore Gateway retrieves fresh credentials from AgentCore Identity before making the initialize call. This makes sure that even if ListTools returned cached tools with expired credentials, the actual invocation uses valid authentication.

The inbound authorization flow begins when:

The MCP client initializes a request with MCP protocol version to AgentCore Gateway.
The client then sends the tools/call request to the gateway.
The gateway obtains an OAuth token from AgentCore Identity for secure access.
The gateway initializes a secure session with the MCP server to invoke and handle the actual execution of the tool.
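For reference, the sketch below sends a tools/call request directly to the gateway, following the same JSON-RPC pattern as the search example later in this post. The tool name shown is hypothetical; the gateway prefixes synchronized tool names with the target name to prevent collisions, so take the exact name from a tools/list response.

import requests

def invoke_gateway_tool(gateway_url, access_token, tool_name, arguments):
    # MCP JSON-RPC tools/call request; the gateway handles outbound auth and session setup with the MCP server
    headers = {"Content-Type": "application/json", "Authorization": f"Bearer {access_token}"}
    payload = {
        "jsonrpc": "2.0",
        "id": "invoke-tool-request",
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }
    response = requests.post(gateway_url, headers=headers, json=payload, timeout=10)
    response.raise_for_status()
    return response.json()

# Hypothetical prefixed tool name; confirm the real name via tools/list
print(invoke_gateway_tool(gatewayURL, token, "mcp-server-target___getOrder", {}))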

Search tool behavior for MCP targets
The search capability in AgentCore Gateway enables semantic discovery of tools across the different target types, including MCP targets. For MCP targets, the search functionality operates on normalized tool definitions that were captured and indexed during synchronization operations, providing efficient semantic search without real-time MCP server communication.
When tool definitions are synchronized from an MCP target, AgentCore Gateway automatically generates embeddings for each tool’s name, description, and parameter descriptions. These embeddings are stored alongside the normalized tool definitions, enabling semantic search that understands the intent and context of search queries. Unlike traditional keyword matching, this allows agents to discover relevant tools even when exact terminology doesn’t match.

Search for MCP server tools through the gateway
Use the following example to search for tools through the gateway.

import requests
import json

def search_tools(gateway_url, access_token, query):
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {access_token}"
    }

    payload = {
        "jsonrpc": "2.0",
        "id": "search-tools-request",
        "method": "tools/call",
        "params": {
            "name": "x_amz_bedrock_agentcore_search",
            "arguments": {
                "query": query
            }
        }
    }

    response = requests.post(gateway_url, headers=headers, json=payload, timeout=5)
    response.raise_for_status()
    return response.json()

# Example usage
# utils.get_token is a helper from the sample notebooks that exchanges Cognito client credentials for an access token
token_response = utils.get_token(user_pool_id, client_id, client_secret, scopeString, REGION)
access_token = token_response['access_token']
results = search_tools(gatewayURL, access_token, "math operations")
print(json.dumps(results, indent=2))

Conclusion
Today’s announcement of MCP server support as a target type in Amazon Bedrock AgentCore Gateway is an advancement in enterprise AI agent development. This new capability addresses critical challenges in scaling MCP server implementations while maintaining security and operational efficiency. By integrating existing MCP servers alongside REST APIs and Lambda functions, AgentCore Gateway provides a more unified, secure, and manageable solution for tool integration at scale. Organizations can now manage their tools through a single, centralized interface while benefiting from unified authentication, simplified tool discovery and reduced maintenance overhead.
For more detailed information and advanced configurations, refer to the code samples on GitHub, the Amazon Bedrock AgentCore Gateway Developer Guide and Amazon AgentCore Gateway pricing.

About the authors
Frank Dallezotte is a Senior Solutions Architect at AWS and is passionate about working with independent software vendors to design and build scalable applications on AWS. He has experience creating software, implementing build pipelines, and deploying these solutions in the cloud.
Ganesh Thiyagarajan is a Senior Solutions Architect at Amazon Web Services (AWS) with over 20 years of experience in software architecture, IT consulting, and solution delivery. He helps ISVs transform and modernize their applications on AWS. He is also part of the AI/ML Technical field community, helping customers build and scale Gen AI solutions.
Dhawal Patel is a Principal Generative AI Tech lead at Amazon Web Services (AWS). He has worked with organizations ranging from large enterprises to mid-sized startups on problems related to Agentic AI, Deep learning, distributed computing.

How to Build a Model-Native Agent That Learns Internal Planning, Memory, and Multi-Tool Reasoning Through End-to-End Reinforcement Learning

In this tutorial, we explore how an agent can internalize planning, memory, and tool use within a single neural model rather than relying on external orchestration. We design a compact, model-native agent that learns to perform arithmetic reasoning tasks through reinforcement learning. By combining a stage-aware actor-critic network with a curriculum of increasingly complex environments, we enable the agent to discover how to use internalized “tools” and short-term memory to reach correct solutions end-to-end. We work step by step to observe how learning evolves from simple reasoning to multi-step compositional behavior. Check out the FULL CODES here.

import math, random, torch, torch.nn as nn, torch.nn.functional as F
device = "cuda" if torch.cuda.is_available() else "cpu"; torch.manual_seed(0); random.seed(0)
V = 18; CTX = 10; MUL, ADD, SUB, ANS, STO, RCL, EOS = 11, 12, 13, 14, 15, 16, 17
tok2str = {**{i: str(i) for i in range(10)}, CTX:"[CTX]", MUL:"[MUL]", ADD:"[ADD]", SUB:"[SUB]", ANS:"[ANS]", STO:"[STO]", RCL:"[RCL]", EOS:"[EOS]"}

class ToolEnv:
    def __init__(self, max_steps=7):
        self.max_steps = max_steps
    def sample(self, stage):
        a,b,c,d,e = [random.randint(0,9) for _ in range(5)]
        if stage==0: ctx=[a,b,c]; target=a*b+c
        elif stage==1: ctx=[a,b,c,d]; target=(a*b+c)-d
        else: ctx=[a,b,c,d,e]; target=(a*b+c)-(d*e)
        return ctx, target, (a,b,c,d,e)
    def step_seq(self, actions, abc, stage):
        a,b,c,d,e = abc; last=None; mem=None; steps=0; shaped=0.0
        goal0=a*b; goal1=goal0+c; goal2=goal1-d; goal3=d*e; goal4=goal1-goal3
        for act in actions:
            steps+=1
            if act==MUL: last=(a*b if last is None else last*(d if stage>0 else 1))
            elif act==ADD and last is not None: last+=c
            elif act==SUB and last is not None:
                last -= (e if stage==2 and mem=="use_d" else (d if stage>0 else 0))
            elif act==STO: mem="use_d" if stage>=1 else "ok"
            elif act==RCL and mem is not None:
                last = (d*e) if (stage==2 and mem=="use_d") else (last if last else 0)
            elif act==ANS:
                target=[goal0,goal1,goal2,goal4][stage] if stage==2 else [goal0,goal1,goal2][stage]
                correct=(last==target)
                if stage==0: shaped += 0.25*(last==goal0)+0.5*(last==goal1)
                if stage==1: shaped += 0.25*(last==goal0)+0.5*(last==goal1)+0.75*(last==goal2)
                if stage==2: shaped += 0.2*(last==goal0)+0.4*(last==goal1)+0.6*(last==goal4)+0.6*(last==goal3)
                return (1.0 if correct else 0.0)+0.2*shaped, steps
            if steps>=self.max_steps: break
        return 0.0, steps

We begin by setting up the environment and defining the symbolic tools our agent can use. We create a small synthetic world where each action, such as multiplication, addition, or subtraction, acts as an internal tool. This environment enables us to simulate reasoning tasks in which the agent must plan sequences of tool use to arrive at the correct answer. Check out the FULL CODES here.

class ActorCritic(nn.Module):
    def __init__(self,V,d=96,nstage=3):
        super().__init__()
        self.emb=nn.Embedding(V,d); self.stage_emb=nn.Embedding(nstage,d)
        self.rnn=nn.GRU(d,d,1,batch_first=True); self.pi=nn.Linear(d,V); self.v=nn.Linear(d,1)
    def forward(self,ctx,stage,max_len=6,greedy=False):
        B=ctx.shape[0]; ce=self.emb(ctx).mean(1)+self.stage_emb(stage).unsqueeze(1)
        h=torch.tanh(ce.mean(1)).unsqueeze(0); inp=self.emb(torch.full((B,1),CTX,device=device))
        acts,logps,ents,vals=[],[],[],[]
        for _ in range(max_len):
            out,h=self.rnn(inp,h); val=self.v(out[:,-1]); logits=self.pi(out[:,-1])
            pi=F.log_softmax(logits,dim=-1).exp(); ent=-(pi*torch.log(pi+1e-9)).sum(1)
            a=torch.argmax(logits,1) if greedy else torch.distributions.Categorical(pi).sample()
            logp=F.log_softmax(logits,dim=-1).gather(1,a.unsqueeze(1)).squeeze(1)
            inp=self.emb(a.unsqueeze(1))
            acts.append(a); logps.append(logp); ents.append(ent); vals.append(val.squeeze(1))
        return torch.stack(acts,1), torch.stack(logps,1), torch.stack(ents,1), torch.stack(vals,1)

We then design our model-native policy using an actor-critic structure built around a GRU. We embed both tokens and task stages, allowing the network to adapt its reasoning depth according to task complexity. This setup enables the agent to learn contextually when and how to use internal tools within a single unified model. Check out the FULL CODES here.

env=ToolEnv(); net=ActorCritic(V).to(device)
opt=torch.optim.Adam(net.parameters(),lr=3e-4)

def pad_batch(ctxs):
    L=max(len(c)+1 for c in ctxs)
    out=torch.full((len(ctxs),L),EOS,dtype=torch.long,device=device)
    for i,c in enumerate(ctxs): out[i,:len(c)+1]=torch.tensor(c+[CTX],device=device)
    return out

def run_batch(stage,batch=128,train=True,greedy=False):
    ctxs=[]; metas=[]
    for _ in range(batch):
        c,t,abc=env.sample(stage); ctxs.append(c); metas.append((t,abc))
    ctx=pad_batch(ctxs); stage_t=torch.full((batch,),stage,device=device,dtype=torch.long)
    acts,logps,ents,vals=net(ctx,stage_t,max_len=6,greedy=greedy)
    rewards=[]
    for i in range(batch):
        traj = acts[i].tolist()
        abc = metas[i][1]
        r,_ = env.step_seq(traj,abc,stage)
        rewards.append(r)
    R=torch.tensor(rewards,device=device).float()
    adv=(R-vals.sum(1)).detach()
    if not train: return R.mean().item(), 0.0
    pg=-(logps.sum(1)*adv).mean(); vloss=F.mse_loss(vals.sum(1),R); ent=-ents.mean()
    loss=pg+0.5*vloss+0.01*ent
    opt.zero_grad(); loss.backward(); nn.utils.clip_grad_norm_(net.parameters(),1.0); opt.step()
    return R.mean().item(), loss.item()

We implement the reinforcement learning training loop using an advantage actor-critic (A2C) update. We train the agent end-to-end across batches of synthetic problems, updating policy and value networks simultaneously. Here, we incorporate entropy regularization to promote exploration and prevent premature convergence. Check out the FULL CODES here.

print("Training...")
stages=[0,0,0,1,1,2]
for ep in range(1,61):
    stage=stages[min((ep-1)//10,len(stages)-1)]
    acc,loss=run_batch(stage,batch=192,train=True)
    if ep%5==0:
        with torch.no_grad():
            evals=[run_batch(s,train=False,greedy=True)[0] for s in [0,1,2]]
            print(f"ep={ep:02d} stage={stage} acc={acc:.3f} | eval T0={evals[0]:.3f} "
                  f"T1={evals[1]:.3f} T2={evals[2]:.3f} loss={loss:.3f}")

We start the main training process using a curriculum strategy where tasks gradually increase in difficulty. As we train, we evaluate the agent on all stages to observe its ability to generalize from simpler to more complex reasoning steps. The printed metrics show how internal planning improves over time. Check out the FULL CODES here.

def explain(stage):
    c,t,abc=env.sample(stage)
    ctx=pad_batch([c]); stage_t=torch.tensor([stage],device=device)
    with torch.no_grad(): a,_,_,_=net(ctx,stage_t,greedy=True)
    seq=[tok2str[x] for x in a[0].tolist()]
    r,_=env.step_seq(a[0].tolist(),abc,stage)
    return dict(stage=stage,ctx=c,target=t,actions=" ".join(seq),reward=round(float(r),2))

with torch.no_grad():
    for s in [0,1,2]:
        print(f"\nStage {s} samples:")
        for _ in range(5): print(explain(s))

with torch.no_grad():
    finals=[run_batch(s,train=False,greedy=True,batch=1000)[0] for s in [0,1,2]]
    print(f"\nFinal greedy accuracies → T0={finals[0]:.3f}, T1={finals[1]:.3f}, T2={finals[2]:.3f}")

We finish by probing the trained agent and printing example reasoning trajectories. We visualize the sequence of tool tokens the model chooses and verify whether it reaches the correct result. Finally, we evaluate the overall performance, demonstrating that the model successfully integrates planning, memory, and reasoning into an internalized process.

In conclusion, we see that even a neural network can learn internalized planning and tool-use behaviors when trained with reinforcement signals. We successfully move beyond traditional pipeline-style architectures, where memory, planning, and execution are separate, toward a model-native agent that integrates these components as part of its learned dynamics. This approach represents a shift in agentic AI, demonstrating how end-to-end learning can produce emergent reasoning and self-organized decision-making without the need for handcrafted control loops.

Check out the FULL CODES here.
The post How to Build a Model-Native Agent That Learns Internal Planning, Memory, and Multi-Tool Reasoning Through End-to-End Reinforcement Learning appeared first on MarkTechPost.

Generalist AI Introduces GEN-θ: A New Class of Embodied Foundation Models Built for Multimodal Training Directly on High-Fidelity Raw Physical Interaction

How do you build a single model that can learn physical skills from chaotic real world robot data without relying on simulation? Generalist AI has unveiled GEN-θ, a family of embodied foundation models trained directly on high fidelity raw physical interaction data instead of internet video or simulation. The system is built to establish scaling laws for robotics in the same way that large language models did for text, but now grounded in continuous sensorimotor streams from real robots operating in homes, warehouses and workplaces.

Harmonic Reasoning, thinking and acting in real time

GEN-θ is introduced as an embodied foundation model architecture that builds on the strengths of vision and language models, and extends them with native support for human level reflexes and physical commonsense. The core feature is Harmonic Reasoning, where the model is trained to think and act at the same time over asynchronous, continuous time streams of sensing and acting tokens.

This design targets a robotics specific constraint. Language models can simply spend more time thinking before replying, but robots must act while physics continues to evolve. Harmonic Reasoning creates a harmonic interplay between sensing and acting streams so that GEN-θ can scale to very large model sizes without depending on  System1-System2 architectures or heavy inference time guidance controllers.

GEN-θ is explicitly cross embodiment. The same architecture runs on different robots and has been tested on 6DoF, 7DoF and 16+DoF semi humanoid systems, which lets a single pre-training run serve heterogeneous fleets.

Surpassing the intelligence threshold in robotics

The Generalist AI team reports a phase transition in capability as GEN-θ scales in a high data regime. Their scaling experiments also show that the models must be large enough to absorb vast amounts of physical interaction data.

Their behaviors are as follows:

1B models struggle to absorb complex and diverse sensorimotor data during pretraining, and their weights stop absorbing new information, which the research team describes as ossification.

6B models start to benefit from pretraining and show strong multi task capabilities.

7B+ models internalize large scale robotic pretraining so that a few thousand post training steps on downstream tasks are sufficient for transfer.

(Image source: https://generalistai.com/blog/nov-04-2025-GEN-0)

The figure above plots next action validation prediction error on a completely withheld long horizon downstream task across model sizes and pre-training compute. 1B models plateau early, while 6B and 7B models continue to improve as pretraining increases. The research team connects this phase transition to Moravec's Paradox, arguing that physical commonsense and dexterity appear to require higher compute thresholds than abstract language reasoning, and that GEN-θ is operating beyond that activation point.

The Generalist AI team states that GEN-θ has been scaled to 10B+ model sizes, and that larger variants adapt to new tasks with increasingly less post training.

Scaling laws for robotics

Another focus of this research is scaling laws that relate pre-training data and compute to downstream post training performance. The research team samples checkpoints from GEN-θ training runs on different subsets of the pre-training dataset, then post trains those checkpoints on multi task, language conditioned data. This supervised fine tuning stage spans 16 task sets, covering dexterity tasks such as building Lego, industry workflows such as fast food packing, and generalization tasks that include anything style instructions.

Across various tasks, more pre-training improves validation loss and next action prediction error during post training. At sufficient model scale, the relationship between pre-training dataset size and downstream validation error is well described by a power law of the form:

L(D) = (D_c / D)^(α_D)

where D is the number of action trajectories in pre-training and L(D) is the validation error on a downstream task. This formula lets robotics teams estimate how much pre-training data is needed to reach a target next action prediction error, or how much downstream labeled data can be traded for additional pre-training.
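As a quick illustration of how such a fit could be used, the snippet below inverts the power law to estimate the number of pre-training trajectories needed to reach a target validation error. The constants D_c and α_D are made-up placeholders, not values reported by Generalist AI.

# Hypothetical fitted constants for one downstream task (illustrative only)
D_c = 1.0e6      # scale constant, in action trajectories
alpha_D = 0.25   # power-law exponent

def required_trajectories(target_error):
    # Invert L(D) = (D_c / D) ** alpha_D  ->  D = D_c / target_error ** (1 / alpha_D)
    return D_c / (target_error ** (1.0 / alpha_D))

print(f"{required_trajectories(0.5):.3e} trajectories needed for a validation error of 0.5")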

Data engine and infrastructure at robotics scale

GEN-θ is trained on an in house dataset of 270,000 hours of real world manipulation trajectories collected in thousands of homes, warehouses and workplaces worldwide. The data operation currently adds more than 10,000 new hours per week. Generalist AI team claims that GEN-θ is trained on orders of magnitude more real world manipulation data than prior large robotics datasets as of today.

To sustain this regime, the research team has built custom hardware, data-loaders and network infrastructure, including dedicated internet lines to handle uplink bandwidth from distributed sites. The pipeline uses multi cloud contracts, custom upload machines and on the order of 10,000 compute cores for continual multimodal processing. The research team reports compression of dozens of petabytes of data and data-loading techniques from frontier video foundation models, yielding a system capable of absorbing 6.85 years of real world manipulation experience per day of training.

How you pre-train GEN-θ matters as much as how big it is

The Generalist AI team runs large ablations over 8 pre-training datasets and 10 long horizon task sets. They find that different data mixtures, not just more data, produce models with different behaviors across 3 groups of tasks: dexterity, real world applications, and generalization. Performance is measured using validation mean squared error (MSE) on next actions and reverse Kullback-Leibler (KL) divergence between the model policy and a Gaussian around ground truth actions.

Low MSE and low reverse KL models are better candidates for supervised fine-tuning. Models with higher MSE but low reverse KL are more multimodal in their action distributions and can be better starting points for reinforcement learning.

Key Takeaways

GEN-θ is an embodied foundation model trained on high fidelity raw physical interaction data, not simulation or internet video, and it uses Harmonic Reasoning to think and act simultaneously under real world physics.

Scaling experiments show an intelligence threshold around 7B parameters, where smaller models ossify under high data load and larger models keep improving with more pretraining.

GEN-θ exhibits clear scaling laws, where downstream post training performance follows a power law in the amount of pre-training data, which lets teams predict how much data and compute are needed for target error levels.

The system is trained on more than 270,000 hours of real world manipulation data, growing by about 10,000 hours per week, supported by custom multi cloud infrastructure that can absorb 6.85 years of experience per training day.

Large scale ablations over 8 pretraining datasets and 10 long horizon task sets show that data quality and mixture design, measured with validation MSE and reverse KL, are as important as scale, since different mixtures yield models better suited for supervised finetuning or reinforcement learning.

Editorial Comments

GEN-θ positions embodied foundation models as a serious attempt to bring scaling laws to robotics, using Harmonic Reasoning, large scale multimodal pre-training and explicit analysis of data mixtures. The research shows that 7B+ models, trained on 270,000 hours of real world manipulation data with 10,000 hours added weekly, can cross an intelligence threshold where more physical interaction data predictably improves downstream performance across dexterity, applications and generalization tasks.

Check out the Technical details.
The post Generalist AI Introduces GEN-θ: A New Class of Embodied Foundation Models Built for Multimodal Training Directly on High-Fidelity Raw Physical Interaction appeared first on MarkTechPost.

OpenAI Introduces IndQA: A Culture Aware Benchmark For Indian Languages

How can we reliably test whether large language models actually understand Indian languages and culture in real world contexts? OpenAI has released IndQA, a benchmark that evaluates how well AI models understand and reason about questions that matter in Indian languages across cultural domains.

Why IndQA?

OpenAI states that about 80 percent of people worldwide do not speak English as their primary language. Yet most benchmarks that measure non English capabilities are still narrow and often rely on translation or multiple choice formats.

Benchmarks such as MMMLU and MGSM are now near saturation at the top end, where strong models cluster near similar scores. This makes it hard to see meaningful progress and does not test whether models understand local context, history and everyday life.

India is OpenAI’s starting point for new region focused benchmarks. India has about 1 billion people who do not use English as their primary language, 22 official languages with at least 7 spoken by more than 50 million people, and it is ChatGPT’s second largest market.

Dataset, Languages And Domains

IndQA evaluates knowledge and reasoning about Indian culture and everyday life in Indian languages. The benchmark spans 2,278 questions across 12 languages and 10 cultural domains, created with 261 domain experts from across India.

The cultural domains are Architecture and Design, Arts and Culture, Everyday Life, Food and Cuisine, History, Law and Ethics, Literature and Linguistics, Media and Entertainment, Religion and Spirituality, and Sports and Recreation. Items are written natively in Bengali, English, Hindi, Hinglish, Kannada, Marathi, Odia, Telugu, Gujarati, Malayalam, Punjabi and Tamil. Hinglish is included to reflect common code switching in Indian conversations.

Each datapoint contains four components: a culturally grounded prompt in an Indian language, an English translation for auditability, rubric criteria for grading, and an ideal answer that encodes expert expectations.
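A datapoint can be pictured roughly as the following structure; the field names and example content are hypothetical, chosen only to illustrate the four components described above.

example_datapoint = {
    "language": "Hindi",
    "domain": "Food and Cuisine",
    "prompt": "<culturally grounded question written natively in Hindi>",
    "prompt_english": "<English translation of the prompt for auditability>",
    "rubric": [
        {"criterion": "<what a strong answer must include>", "weight": 2.0},
        {"criterion": "<a common mistake the answer must avoid>", "weight": 1.0},
    ],
    "ideal_answer": "<expert written reference answer>",
}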

Rubric Based Evaluation Pipeline

IndQA uses a rubric based grading procedure instead of exact match accuracy. For each question, domain experts define multiple criteria that describe what a strong answer should include or avoid and assign a weight to each criterion.

A model based grader checks the candidate response against these criteria and marks which ones are satisfied. The final score is the sum of weights for satisfied criteria divided by the total possible score. This behaves like grading a short exam answer, it supports partial credit and captures nuance and cultural correctness, not only surface token overlap.
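The grading rule described above reduces to a weighted fraction. The sketch below shows that arithmetic; in IndQA the decision about which criteria are satisfied comes from a model based grader, so the boolean flags here stand in for its output.

def rubric_score(criteria):
    # criteria: list of (weight, satisfied) pairs produced by the grader
    total = sum(weight for weight, _ in criteria)
    earned = sum(weight for weight, satisfied in criteria if satisfied)
    return earned / total if total else 0.0

# Example: two of three weighted criteria satisfied -> partial credit
print(rubric_score([(2.0, True), (1.0, True), (1.0, False)]))  # 0.75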

(Image source: https://openai.com/index/introducing-indqa/)

Construction Process And Adversarial Filtering

OpenAI describes a four step construction pipeline:

First, they partnered with organizations in India to recruit experts across 10 domains. These experts are native level speakers of the target language and English and have deep subject expertise. They wrote difficult, reasoning heavy prompts anchored in regional context, such as literature, food history, law or media.

Second, they applied adversarial filtering. Every draft question was evaluated with OpenAI’s strongest models at creation time, GPT-4o, OpenAI o3, GPT-4.5 and, partially after public launch, GPT-5. Only questions where a majority of these models failed to produce acceptable answers were kept. This preserves headroom so that future model improvements show up clearly on IndQA.

Third, experts provided detailed criteria for grading each question, similar to an exam rubric. These criteria are reused whenever another model is evaluated on IndQA.

Fourth, experts wrote ideal answers and English translations and then performed peer review and iterative revisions until they signed off on quality.

Measuring Progress On Indian Languages

OpenAI uses IndQA to evaluate recent frontier models and to chart progress over the last couple years on Indian languages. They report that model performance has improved significantly on IndQA while still leaving substantial room for improvement. Results are stratified by language and by domain and include comparisons of GPT-5 Thinking High with other frontier systems.

Key Takeaways

IndQA is a culturally grounded Indic benchmark: IndQA evaluates how well AI models understand and reason about questions that matter in Indian languages, across culturally specific domains, rather than only testing translation or multiple choice accuracy.

The dataset is expert built and reasonably large: The benchmark contains 2,278 questions across 12 languages and 10 cultural domains, developed in collaboration with 261 domain experts from across India, covering areas like architecture, everyday life, food, history and religion.

Evaluation is rubric based, not exact match: Each datapoint bundles a native language prompt, an English translation, a detailed grading rubric and an ideal answer, and model outputs are graded by a model based system that checks weighted expert defined criteria, which enables partial credit and nuanced cultural evaluation.

Questions are adversarially filtered against OpenAI’s strongest models: Draft questions were filtered by running GPT 4o, OpenAI o3, GPT 4.5 and partially GPT 5, and keeping only those items where most of these models failed, which preserves headroom for future models on IndQA.

Editorial Comments

IndQA is a timely step because it targets a real gap, most existing multilingual benchmarks over index on English content and translation style tasks while India has diverse high resource and low resource languages. IndQA brings expert curated, rubric based evaluation for questions that matter in Indian cultural contexts, and uses adversarial filtering against GPT 4o, OpenAI o3, GPT 4.5 and GPT 5 to preserve headroom for frontier models. This launch makes IndQA a practical north star for evaluating Indian language reasoning in modern AI systems.
The post OpenAI Introduces IndQA: A Culture Aware Benchmark For Indian Languages appeared first on MarkTechPost.

How Amazon Search increased ML training twofold using AWS Batch for Amazon SageMaker Training jobs

In this post, we show you how Amazon Search optimized GPU instance utilization by leveraging AWS Batch for SageMaker Training jobs. This managed solution enabled us to orchestrate machine learning (ML) training workloads on GPU-accelerated instance families like P5, P4, and others. We will also provide a step-by-step walkthrough of the use case implementation.
Machine learning at Amazon Search
At Amazon Search, we use hundreds of GPU-accelerated instances to train and evaluate ML models that help our customers discover products they love. Scientists typically train more than one model at a time to find the optimal set of features, model architecture, and hyperparameter settings that optimize the model's performance. We previously leveraged a first-in-first-out (FIFO) queue to coordinate model training and evaluation jobs. However, we needed more nuanced criteria to prioritize which jobs should run in what order: production models needed to run with high priority, exploratory research with medium priority, and hyperparameter sweeps and batch inference with low priority. We also needed a system that could handle interruptions. Should a job fail, or a given instance type become saturated, we needed the job to run on other available compatible instance types while respecting the overall prioritization criteria. Finally, we wanted a managed solution so we could focus more on model development instead of managing infrastructure.
After evaluating multiple options, we chose AWS Batch for Amazon SageMaker Training jobs because it best met our requirements. This solution seamlessly integrated AWS Batch with Amazon SageMaker and allowed us to run jobs per our prioritization criteria. This allows applied scientists to submit multiple concurrent jobs without manual resource management. By leveraging AWS Batch features such as advanced prioritization through fair-share scheduling, we increased peak utilization of GPU-accelerated instances from 40% to over 80%.
Amazon Search: AWS Batch for SageMaker Training Job implementation
We leveraged three AWS technologies to set up our job queue. We used Service Environments to configure the SageMaker AI parameters that AWS Batch uses to submit and manage SageMaker Training jobs. We used Share Identifiers to prioritize our workloads. Finally, we used Amazon CloudWatch for monitoring and alerting on critical events or deviations from expected behavior. Let's dive deep into these constructs.
Service environments. We set up service environments to represent the total GPU capacity available for each instance family, such as P5s and P4s. Each service environment was configured with fixed limits based on our team’s reserved capacity in AWS Batch. Note that for teams using SageMaker Training Plans, these limits can be set to the number of reserved instances, making capacity planning more straightforward. By defining these boundaries, we established how the total GPU instance capacity within a service environment was distributed across different production jobs. Each production experiment was allocated a portion of this capacity through Share Identifiers.
Figure 1 provides a real-world example of how we used AWS Batch's fair-share scheduling to divide 100 GPU instances between Share Identifiers. We allocated 60 instances to ProdExp1, and 40 to ProdExp2. When ProdExp2 used only 25 GPU instances, the remaining 15 could be borrowed by ProdExp1, allowing it to scale up to 75 GPU instances. When ProdExp2 later needed its full 40 GPU instances, the scheduler preempted jobs from ProdExp1 to restore balance. This example used the P4 instance family, but the same approach could apply to any SageMaker-supported EC2 instance family. This ensured that production workloads had guaranteed access to their assigned capacity, while exploratory or ad hoc experiments could still make use of any idle GPU instances. This design safeguarded critical workloads and improved overall instance utilization by ensuring that no reserved capacity went unused.

Figure 1: AWS Batch fair-share scheduling

Share Identifiers. We used Share Identifiers to allocate fractions of a service environment’s capacity to production experiments. Share Identifiers are string tags applied at job submission time. AWS Batch used these tags to track usage and enforce fair-share scheduling. For initiatives that required dedicated capacity, we defined preset Share Identifiers with quotas in AWS Batch. This reserved capacity for production tracks. These quotas acted as fairness targets rather than hard limits. Idle capacity could still be borrowed, but under contention, AWS Batch enforced fairness by preempting resources from overused identifiers and reassigned them to underused ones.
Within each Share Identifier, job priorities ranging from 0 to 99 determined execution order, but priority-based preemption only triggered when the Share Identifier reached its allocated capacity limit. Figure 2 illustrates how we set up and used our Share Identifiers. ProdExp1 had 60 p4d instances and ran jobs at various priorities: Job A had a priority of 80, Job B was set to 50, Job C was set to 30, and Job D had a priority of 10. When all 60 instances were occupied and a new high-priority job (priority 90) requiring 15 instances was submitted, the system preempted the lowest priority running job (Job D) to make room, while maintaining the total of 60 instances for that Share Identifier.

Figure 2: Priority scheduling within a Share ID
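To make the share and priority concepts concrete, the following sketch shows how a job could be submitted to a fair-share queue with the standard AWS Batch SubmitJob API. The queue, job definition, and job names here are hypothetical, and SageMaker Training jobs submitted through AWS Batch use their own submission path, so treat this only as an illustration of the shareIdentifier and schedulingPriorityOverride parameters:

import boto3

batch = boto3.client("batch")

# Illustrative values only; the queue and job definition names are hypothetical.
response = batch.submit_job(
    jobName="prodexp1-high-priority-job",
    jobQueue="prod-p4d-fair-share-queue",
    jobDefinition="training-job-definition",
    shareIdentifier="ProdExp1",        # usage is tracked against ProdExp1's share
    schedulingPriorityOverride=90,     # higher values win within the same share identifier
)
print(response["jobId"])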

Amazon CloudWatch. We used Amazon CloudWatch to instrument our SageMaker training jobs. SageMaker automatically publishes metrics on job progress and resource utilization, while AWS Batch provides detailed information on job scheduling and execution. With AWS Batch, we queried the status of each job through the AWS Batch APIs. This made it possible to track jobs as they transitioned through states such as SUBMITTED, PENDING, RUNNABLE, STARTING, RUNNING, SUCCEEDED, and FAILED. We published these metrics and job states to CloudWatch and configured dashboards and alarms to alert anytime we encountered extended wait times, unexpected failures, or underutilized resources. This built-in integration provided both real-time visibility and historical trend analysis, which helped our team maintain operational efficiency across GPU clusters without building custom monitoring systems.
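As a rough sketch of this kind of instrumentation, the snippet below counts jobs per state in a queue with the AWS Batch ListJobs API and publishes the counts as custom CloudWatch metrics. The queue name and metric namespace are assumptions, and queues that hold SageMaker Training jobs may expose job status through different AWS Batch APIs, so adapt accordingly:

import boto3

batch = boto3.client("batch")
cloudwatch = boto3.client("cloudwatch")

queue_name = "prod-p4d-fair-share-queue"  # hypothetical queue name
for status in ["SUBMITTED", "PENDING", "RUNNABLE", "STARTING", "RUNNING"]:
    jobs = batch.list_jobs(jobQueue=queue_name, jobStatus=status)["jobSummaryList"]
    cloudwatch.put_metric_data(
        Namespace="TrainingQueues",  # hypothetical namespace
        MetricData=[{
            "MetricName": "JobCount",
            "Dimensions": [
                {"Name": "Queue", "Value": queue_name},
                {"Name": "Status", "Value": status},
            ],
            "Value": len(jobs),
            "Unit": "Count",
        }],
    )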
Operational impact on team performance
By adopting AWS Batch for SageMaker Training jobs, we enabled experiments to run without concerns about resource availability or contention. Researchers could submit jobs without waiting for manual scheduling, which increased the number of experiments that could be run in parallel. This led to shorter queue times, higher GPU utilization, and faster turnaround of training results, directly improving both research throughput and delivery timelines.
How to set up AWS Batch for SageMaker Training Jobs
To set up a similar environment, you can follow this tutorial, which shows you how to orchestrate multiple GPU large language model (LLM) fine-tuning jobs using multiple GPU-powered instances. The solution is also available on GitHub.
Prerequisites
To orchestrate multiple SageMaker Training jobs with AWS Batch, first you need to complete the following prerequisites:
Clone the GitHub repository with the assets for this deployment. This repository contains the notebooks and the assets they reference:

git clone https://github.com/aws/amazon-sagemaker-examples/
cd build_and_train_models/sm-training-queues-pytorch/

Create AWS Batch resources
To create the necessary resources to manage SageMaker Training job queues with AWS Batch, we provide utility functions in the example to automate the creation of the Service Environment, Scheduling Policy, and Job Queue.
The service environment represents the Amazon SageMaker AI capacity limits available to schedule, expressed as a maximum number of instances. The scheduling policy indicates how compute resources are allocated in a job queue between users or workloads. The job queue is the scheduler interface that researchers interact with to submit jobs and query job status. AWS Batch provides two different queue types we can operate with:

FIFO queues – Queues in which no scheduling policies are required
Fair-share queues – Queues in which a scheduling policy Amazon Resource Name (ARN) is required to orchestrate the submitted jobs

We recommend creating dedicated service environments for each job queue in a 1:1 ratio. FIFO queues run jobs in submission order and require no scheduling policy, while fair-share scheduling (FSS) queues provide more sophisticated scheduling, balancing utilization within a Share Identifier, share weights, and job priority. For customers who don’t need multiple shares but would like the ability to assign a priority on job submission, we recommend creating an FSS queue and using a single share within it for all submissions. To create the resources, execute the following commands:

cd smtj_batch_utils
python create_resources.py

You can navigate the AWS Batch Dashboard, shown in the following screenshot, to explore the created resources.

This automation script created two queues:

ml-c5-xlarge-queue – A FIFO queue with priority 2 used for CPU workloads
ml-g6-12xlarge-queue – A fair-share queue with priority 1 used for GPU workloads

The scheduling policy associated with the ml-g6-12xlarge-queue defines three shares: High priority (HIGHPRI), Medium priority (MIDPRI), and Low priority (LOWPRI), each with its own weight. Users submit jobs to one of these shares, with weights of 1 for high priority, 3 for medium priority, and 5 for low priority (a lower weight receives a larger portion of capacity). The following screenshot shows the scheduling policy details:

For instructions on how to set up the service environment and a job queue, refer to the Getting started section in the Introducing AWS Batch support for SageMaker Training Jobs blog post.
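For reference, a fair-share scheduling policy similar to the one described above could also be created directly with boto3. This is a sketch, not the code from the example repository; the policy name and decay settings are assumptions, and the weights mirror the HIGHPRI/MIDPRI/LOWPRI values shown in the screenshot:

import boto3

batch = boto3.client("batch")

response = batch.create_scheduling_policy(
    name="ml-g6-12xlarge-scheduling-policy",  # hypothetical policy name
    fairsharePolicy={
        "shareDecaySeconds": 3600,            # assumed decay window
        "computeReservation": 0,
        "shareDistribution": [
            {"shareIdentifier": "HIGHPRI", "weightFactor": 1.0},
            {"shareIdentifier": "MIDPRI", "weightFactor": 3.0},
            {"shareIdentifier": "LOWPRI", "weightFactor": 5.0},
        ],
    },
)
print(response["arn"])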
Run LLM fine-tuning jobs on SageMaker AI
We run the notebook notebook.ipynb to start submitting SageMaker Training jobs with AWS Batch. The notebook contains the code to prepare the data used for the workload, upload it to Amazon Simple Storage Service (Amazon S3), and define the hyperparameters required by the job to be executed.
To run the fine-tuning workload using SageMaker Training jobs, this example uses the ModelTrainer class. The ModelTrainer class is a newer and more intuitive approach to model training that significantly enhances user experience. It supports distributed training, build your own container (BYOC), and recipes.
For additional information about ModelTrainer, you can refer to Accelerate your ML lifecycle using the new and improved Amazon SageMaker Python SDK – Part 1: ModelTrainer
To set up the fine-tuning workload, complete the following steps:

Select the instance type, the container image for the training job, and define the checkpoint path where the model will be stored:

import sagemaker

# Create a SageMaker session to resolve the current AWS Region
sagemaker_session = sagemaker.Session()

instance_type = "ml.g6.12xlarge"
instance_count = 1

image_uri = sagemaker.image_uris.retrieve(
    framework="pytorch",
    region=sagemaker_session.boto_session.region_name,
    version="2.6",
    instance_type=instance_type,
    image_scope="training"
)

Create the ModelTrainer object to encapsulate the training setup. The ModelTrainer class simplifies the experience by bundling the code and training configuration. In this example:

SourceCode – The source code configuration. This is used to configure the source code for running the training job by using your local Python scripts.
Compute – The compute configuration. This is used to specify the compute resources for the training job.

from sagemaker.modules.configs import Compute, OutputDataConfig, SourceCode, StoppingCondition
from sagemaker.modules.distributed import Torchrun
from sagemaker.modules.train import ModelTrainer

role = sagemaker.get_execution_role()

# Define the script to be run
source_code = SourceCode(
    source_dir="./scripts",
    requirements="requirements.txt",
    entry_script="train.py",
)

# Define the compute
compute_configs = Compute(
    instance_type=instance_type,
    instance_count=instance_count,
    keep_alive_period_in_seconds=0
)

# define Training Job Name
job_name = "train-deepseek-distill-llama-8b-sft-batch"

# define OutputDataConfig path
output_path = f"s3://{bucket_name}/{job_name}"

# Define the ModelTrainer
model_trainer = ModelTrainer(
    training_image=image_uri,
    source_code=source_code,
    base_job_name=job_name,
    compute=compute_configs,
    distributed=Torchrun(),
    stopping_condition=StoppingCondition(max_runtime_in_seconds=7200),
    hyperparameters={
        "config": "/opt/ml/input/data/config/args.yaml"
    },
    output_data_config=OutputDataConfig(s3_output_path=output_path),
    role=role,
)

Set up the input channels for ModelTrainer by creating InputData objects from the provided S3 bucket paths for the training and validation datasets:

from sagemaker.modules.configs import InputData

train_input = InputData(
    channel_name="train",
    data_source=train_dataset_s3_path,
)
val_input = InputData(
    channel_name="val",
    data_source=val_dataset_s3_path,
)
config_input = InputData(
    channel_name="config",
    data_source=train_config_s3_path,
)

TRAINING_INPUTS = [train_input, val_input, config_input]

Queue SageMaker Training jobs
This section and the following are intended to be used interactively so that you can explore how to use the Amazon SageMaker Python SDK to submit jobs to your Batch queues. Follow these steps:

Select the queue to use:

from sagemaker.aws_batch.queue import TrainingQueue
SMTJ_BATCH_QUEUE = "ml-g6-12xlarge-queue"

queue = TrainingQueue(SMTJ_BATCH_QUEUE)

In the next cell, submit two training jobs to the queue:

LOW PRIORITY
MEDIUM PRIORITY

Use the submit API to submit both jobs:

job_name_1 = job_name + "-low-pri"
queued_job_1 = queue.submit(
    model_trainer, TRAINING_INPUTS, job_name_1, priority=5, share_identifier="LOWPRI"
)
job_name_2 = job_name + "-mid-pri"
queued_job_2 = queue.submit(
    model_trainer, TRAINING_INPUTS, job_name_2, priority=3, share_identifier="MIDPRI"
)

Display the status of running and queued jobs
We can use the job queue list and job queue snapshot APIs to programmatically view a snapshot of the jobs that the queue will run next. For fair-share queues, this ordering is dynamic and needs to be refreshed occasionally as new jobs are submitted to the queue or as share usage changes over time.

from utils.queue_utils import print_queue_state
print_queue_state(queue)

The following screenshot shows the jobs submitted with low priority and medium priority in the RUNNABLE state in the queue.

You can also refer to the AWS Batch Dashboard, shown in the following screenshot, to analyze the status of the jobs.

As shown in the following screenshot, the first SageMaker Training job executed is the MEDIUM PRIORITY one, in accordance with the scheduling policy rules defined previously.

You can explore the running training job in the SageMaker AI console, as shown in the following screenshot.

Submit an additional job
You can now submit an additional SageMaker Training job with HIGH PRIORITY to the queue:

job_name_3 = job_name + "-high-pri"
queued_job_3 = queue.submit(
    model_trainer, TRAINING_INPUTS, job_name_3, priority=1, share_identifier="HIGHPRI"
)

You can explore the status from the dashboard, as shown in the following screenshot.

The HIGH PRIORITY job, despite being submitted later, will be executed before the other runnable jobs, in accordance with the scheduling policy rules, as shown in the following screenshot.

As the scheduling policy in the screenshot shows, the LOWPRI share has a higher weight factor (5) than the MIDPRI share (3). Since a lower weight signifies higher priority, a LOWPRI job will be executed after a MIDPRI job, even if they are submitted at the same time.

Clean up
To clean up your resources to avoid incurring future charges, follow these steps:

Verify that your training job isn’t running anymore. To do so, on your SageMaker console, choose Training and check Training jobs.
Delete AWS Batch resources by using the command python create_resources.py --clean from the GitHub example or by manually deleting them from the AWS Management Console.

Conclusion
In this post, we demonstrated how Amazon Search used AWS Batch for SageMaker Training Jobs to optimize GPU resource utilization and training job management. The solution transformed their training infrastructure by implementing sophisticated queue management and fair-share scheduling, increasing peak GPU utilization from 40% to over 80%.

We recommend that organizations facing similar ML training infrastructure challenges explore AWS Batch integration with SageMaker, which provides built-in queue management capabilities and priority-based scheduling. The solution eliminates manual resource coordination while providing workloads with appropriate prioritization through configurable scheduling policies.
To begin implementing AWS Batch with SageMaker training jobs, you can access our sample code and implementation guide in the amazon-sagemaker-examples repository on GitHub. The example demonstrates how to set up AWS Identity and Access Management (IAM) permissions, create AWS Batch resources, and orchestrate multiple GPU-powered training jobs using the ModelTrainer class.

The authors would like to thank Charles Thompson and Kanwaljit Khurmi for their collaboration.
About the authors

Mona Mona
Mona is a generative AI Specialist Solutions Architect at Amazon. She is a published author of two books: Natural Language Processing with AWS AI Services and Google Cloud Certified Professional Machine Learning Study Guide.

Mayank Jha
Mayank is a Senior Machine Learning Engineer at Amazon Search working on the model training optimization. He is passionate about finding practical applications for complex problems at hand and aims to develop solutions that have a deep impact on how businesses and people thrive.

Bruno Pistone
Bruno is a Senior generative AI and ML Specialist Solutions Architect for AWS based in Milan. He works with large customers helping them to deeply understand their technical needs and design AI and Machine Learning solutions that make the best use of the AWS Cloud and the Amazon Machine Learning stack. He enjoys spending time with his friends and exploring new places, as well as travelling to new destinations.

James Park
James is a Solutions Architect at Amazon Web Services. He works with Amazon.com to design, build, and deploy technology solutions on AWS, and has a particular interest in AI and machine learning. In his spare time he enjoys seeking out new cultures, new experiences, and staying up to date with the latest technology trends.

How Can We Build Scalable and Reproducible Machine Learning Experiment Pipelines Using Meta Research Hydra?

In this tutorial, we explore Hydra, an advanced configuration management framework originally developed and open-sourced by Meta Research. We begin by defining structured configurations using Python dataclasses, which allows us to manage experiment parameters in a clean, modular, and reproducible manner. As we move through the tutorial, we compose configurations, apply runtime overrides, and simulate multirun experiments for hyperparameter sweeps. Check out the FULL CODES here.

import subprocess
import sys
subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", "hydra-core"])

import hydra
from hydra import compose, initialize_config_dir
from omegaconf import OmegaConf, DictConfig
from dataclasses import dataclass, field
from typing import List, Optional
import os
from pathlib import Path

We begin by installing Hydra and importing all the essential modules required for structured configurations, dynamic composition, and file handling. This setup ensures our environment is ready to execute the full tutorial seamlessly on Google Colab. Check out the FULL CODES here.

@dataclass
class OptimizerConfig:
    _target_: str = "torch.optim.SGD"
    lr: float = 0.01

@dataclass
class AdamConfig(OptimizerConfig):
    _target_: str = "torch.optim.Adam"
    lr: float = 0.001
    betas: tuple = (0.9, 0.999)
    weight_decay: float = 0.0

@dataclass
class SGDConfig(OptimizerConfig):
    _target_: str = "torch.optim.SGD"
    lr: float = 0.01
    momentum: float = 0.9
    nesterov: bool = True

@dataclass
class ModelConfig:
    name: str = "resnet"
    num_layers: int = 50
    hidden_dim: int = 512
    dropout: float = 0.1

@dataclass
class DataConfig:
    dataset: str = "cifar10"
    batch_size: int = 32
    num_workers: int = 4
    augmentation: bool = True

@dataclass
class TrainingConfig:
    model: ModelConfig = field(default_factory=ModelConfig)
    data: DataConfig = field(default_factory=DataConfig)
    optimizer: OptimizerConfig = field(default_factory=AdamConfig)
    epochs: int = 100
    seed: int = 42
    device: str = "cuda"
    experiment_name: str = "exp_001"

We define clean, type-safe configurations using Python dataclasses for the model, data, and optimizer settings. This structure allows us to manage complex experiment parameters in a modular and readable way while ensuring consistency across runs. Check out the FULL CODES here.

def setup_config_dir():
    config_dir = Path("./hydra_configs")
    config_dir.mkdir(exist_ok=True)

    main_config = """
defaults:
  - model: resnet
  - data: cifar10
  - optimizer: adam
  - _self_

epochs: 100
seed: 42
device: cuda
experiment_name: exp_001
"""
    (config_dir / "config.yaml").write_text(main_config)

    model_dir = config_dir / "model"
    model_dir.mkdir(exist_ok=True)

    (model_dir / "resnet.yaml").write_text("""
name: resnet
num_layers: 50
hidden_dim: 512
dropout: 0.1
""")

    (model_dir / "vit.yaml").write_text("""
name: vision_transformer
num_layers: 12
hidden_dim: 768
dropout: 0.1
patch_size: 16
""")

    data_dir = config_dir / "data"
    data_dir.mkdir(exist_ok=True)

    (data_dir / "cifar10.yaml").write_text("""
dataset: cifar10
batch_size: 32
num_workers: 4
augmentation: true
""")

    (data_dir / "imagenet.yaml").write_text("""
dataset: imagenet
batch_size: 128
num_workers: 8
augmentation: true
""")

    opt_dir = config_dir / "optimizer"
    opt_dir.mkdir(exist_ok=True)

    (opt_dir / "adam.yaml").write_text("""
_target_: torch.optim.Adam
lr: 0.001
betas: [0.9, 0.999]
weight_decay: 0.0
""")

    (opt_dir / "sgd.yaml").write_text("""
_target_: torch.optim.SGD
lr: 0.01
momentum: 0.9
nesterov: true
""")

    return str(config_dir.absolute())

We programmatically create a directory containing YAML configuration files for models, datasets, and optimizers. This approach enables us to demonstrate how Hydra automatically composes configurations from different files, thereby maintaining flexibility and clarity in experiments. Check out the FULL CODES here.

@hydra.main(version_base=None, config_path="hydra_configs", config_name="config")
def train(cfg: DictConfig) -> float:
    print("=" * 80)
    print("CONFIGURATION")
    print("=" * 80)
    print(OmegaConf.to_yaml(cfg))

    print("\n" + "=" * 80)
    print("ACCESSING CONFIGURATION VALUES")
    print("=" * 80)
    print(f"Model: {cfg.model.name}")
    print(f"Dataset: {cfg.data.dataset}")
    print(f"Batch Size: {cfg.data.batch_size}")
    print(f"Optimizer LR: {cfg.optimizer.lr}")
    print(f"Epochs: {cfg.epochs}")

    best_acc = 0.0
    for epoch in range(min(cfg.epochs, 3)):
        acc = 0.5 + (epoch * 0.1) + (cfg.optimizer.lr * 10)
        best_acc = max(best_acc, acc)
        print(f"Epoch {epoch+1}/{cfg.epochs}: Accuracy = {acc:.4f}")

    return best_acc

We implement a training function that leverages Hydra’s configuration system to print, access, and use nested config values. By simulating a simple training loop, we showcase how Hydra cleanly integrates experiment control into real workflows. Check out the FULL CODES here.

def demo_basic_usage():
    print("\n" + " DEMO 1: Basic Configuration\n")
    config_dir = setup_config_dir()
    with initialize_config_dir(version_base=None, config_dir=config_dir):
        cfg = compose(config_name="config")
        print(OmegaConf.to_yaml(cfg))

def demo_config_override():
    print("\n" + " DEMO 2: Configuration Overrides\n")
    config_dir = setup_config_dir()
    with initialize_config_dir(version_base=None, config_dir=config_dir):
        cfg = compose(
            config_name="config",
            overrides=[
                "model=vit",
                "data=imagenet",
                "optimizer=sgd",
                "optimizer.lr=0.1",
                "epochs=50"
            ]
        )
        print(OmegaConf.to_yaml(cfg))

def demo_structured_config():
    print("\n" + " DEMO 3: Structured Config Validation\n")
    from hydra.core.config_store import ConfigStore
    cs = ConfigStore.instance()
    cs.store(name="training_config", node=TrainingConfig)
    with initialize_config_dir(version_base=None, config_dir=setup_config_dir()):
        cfg = compose(config_name="config")
        print(f"Config type: {type(cfg)}")
        print(f"Epochs (validated as int): {cfg.epochs}")

def demo_multirun_simulation():
    print("\n" + " DEMO 4: Multirun Simulation\n")
    config_dir = setup_config_dir()
    experiments = [
        ["model=resnet", "optimizer=adam", "optimizer.lr=0.001"],
        ["model=resnet", "optimizer=sgd", "optimizer.lr=0.01"],
        ["model=vit", "optimizer=adam", "optimizer.lr=0.0001"],
    ]
    results = {}
    for i, overrides in enumerate(experiments):
        print(f"\n--- Experiment {i+1} ---")
        with initialize_config_dir(version_base=None, config_dir=config_dir):
            cfg = compose(config_name="config", overrides=overrides)
            print(f"Model: {cfg.model.name}, Optimizer: {cfg.optimizer._target_}")
            print(f"Learning Rate: {cfg.optimizer.lr}")
            results[f"exp_{i+1}"] = cfg
    return results

def demo_interpolation():
    print("\n" + " DEMO 5: Variable Interpolation\n")
    cfg = OmegaConf.create({
        "model": {"name": "resnet", "layers": 50},
        "experiment": "${model.name}_${model.layers}",
        "output_dir": "/outputs/${experiment}",
        "checkpoint": "${output_dir}/best.ckpt"
    })
    print(OmegaConf.to_yaml(cfg))
    print(f"\nResolved experiment name: {cfg.experiment}")
    print(f"Resolved checkpoint path: {cfg.checkpoint}")

We demonstrate Hydra’s advanced capabilities, including config overrides, structured config validation, multi-run simulations, and variable interpolation. Each demo showcases how Hydra accelerates experimentation speed, streamlines manual setup, and fosters reproducibility in research. Check out the FULL CODES here.

if __name__ == "__main__":
    demo_basic_usage()
    demo_config_override()
    demo_structured_config()
    demo_multirun_simulation()
    demo_interpolation()
    print("\n" + "=" * 80)
    print("Tutorial complete! Key takeaways:")
    print("✓ Config composition with defaults")
    print("✓ Runtime overrides via command line")
    print("✓ Structured configs with type safety")
    print("✓ Multirun for hyperparameter sweeps")
    print("✓ Variable interpolation")
    print("=" * 80)

We execute all demonstrations in sequence to observe Hydra in action, from loading configs to performing multiruns. By the end, we summarize key takeaways, reinforcing how Hydra enables scalable and elegant experiment management.
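Outside of a notebook, the same kind of sweep can be launched with Hydra's built-in multirun flag, assuming the train() function and the config directory created above are saved alongside a train.py entry point:

python train.py --multirun model=resnet,vit optimizer=adam,sgd optimizer.lr=0.001,0.01

Hydra then composes one configuration per combination of overrides and, with the default launcher, runs them one after another.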

In conclusion, we grasp how Hydra, pioneered by Meta Research, simplifies and enhances experiment management through its powerful composition system. We explore structured configs, interpolation, and multirun capabilities that make large-scale machine learning workflows more flexible and maintainable. With this knowledge, you are now equipped to integrate Hydra into your own research or development pipelines, ensuring reproducibility, efficiency, and clarity in every experiment you run.

Check out the FULL CODES here.
The post How Can We Build Scalable and Reproducible Machine Learning Experiment Pipelines Using Meta Research Hydra? appeared first on MarkTechPost.

Comparing the Top 7 Large Language Models LLMs/Systems for Coding in 2025

Code-oriented large language models moved from autocomplete to software engineering systems. In 2025, leading models must fix real GitHub issues, refactor multi-repo backends, write tests, and run as agents over long context windows. The main question for teams is not “can it code” but which model fits which constraints.

Here are seven models (and systems around them) that cover most real coding workloads today:

OpenAI GPT-5 / GPT-5-Codex

Anthropic Claude 3.5 Sonnet / Claude 4.x Sonnet with Claude Code

Google Gemini 2.5 Pro

Meta Llama 3.1 405B Instruct

DeepSeek-V2.5-1210 (with DeepSeek-V3 as the successor)

Alibaba Qwen2.5-Coder-32B-Instruct

Mistral Codestral 25.01

The goal of this comparison is not to rank them on a single score. The goal is to show which system to pick for a given benchmark target, deployment model, governance requirement, and IDE or agent stack.

Evaluation dimensions

We compare on six stable dimensions:

Core coding quality: HumanEval, MBPP / MBPP EvalPlus, code generation and repair quality on standard Python tasks.

Repo and bug-fix performance: SWE-bench Verified (real GitHub issues), Aider Polyglot (whole-file edits), RepoBench, LiveCodeBench.

Context and long-context behavior: Documented context limits and practical behavior in long sessions.

Deployment model: Closed API, cloud service, containers, on-premises or fully self-hosted open weights.

Tooling and ecosystem: Native agents, IDE extensions, cloud integration, GitHub and CI/CD support.

Cost and scaling pattern: Token pricing for closed models, hardware footprint and inference pattern for open models.

Image source: marktechpost.com

1. OpenAI GPT-5 / GPT-5-Codex

OpenAI’s GPT-5 is the flagship reasoning and coding model and the default in ChatGPT. For real-world code, OpenAI reports:

SWE-bench Verified: 74.9%

Aider Polyglot: 88%

Both benchmarks simulate real engineering: SWE-bench Verified runs against upstream repos and tests; Aider Polyglot measures whole-file multi-language edits.

Context and variants

gpt-5 (chat) API: 128k token context.

gpt-5-pro / gpt-5-codex: up to 400k combined context in the model card, with typical production limits around ≈272k input + 128k output for reliability.

GPT-5 and GPT-5-Codex are available in ChatGPT (Plus / Pro / Team / Enterprise) and via the OpenAI API; they are closed-weight, cloud-hosted only.

Strengths

Highest published SWE-bench Verified and Aider Polyglot scores among widely available models.

Very strong at multi-step bug fixing with “thinking” (chain-of-thought) enabled.

Deep ecosystem: ChatGPT, Copilot, and many third-party IDE and agent platforms use GPT-5 backends.

Limits

No self-hosting; all traffic must go through OpenAI or partners.

Long-context calls are expensive if you stream full monorepos, so you need retrieval and diff-only patterns.

Use when you want maximum repo-level benchmark performance and are comfortable with a closed, cloud API.

2. Anthropic Claude 3.5 Sonnet / Claude 4.x + Claude Code

Claude 3.5 Sonnet was Anthropic’s main coding workhorse before the Claude 4 line. Anthropic highlights it as SOTA on HumanEval, and independent comparisons report:

HumanEval: ≈ 92%

MBPP EvalPlus: ≈ 91%

In 2025, Anthropic released Claude 4 Opus, Sonnet, and Sonnet 4.5, positioning Sonnet 4.5 as its best coding and agent model so far.

Claude Code stack

Claude Code is a repo-aware coding system:

Managed VM connected to your GitHub repo.

File browsing, editing, tests, and PR creation.

SDK for building custom agents that use Claude as a coding backend.

Strengths

Very strong HumanEval / MBPP, good empirical behavior on debugging and code review.

Production-grade coding agent environment with persistent VM and GitHub workflows.

Limits

Closed and cloud-hosted, similar to GPT-5 in governance terms.

Published SWE-bench Verified numbers for Claude 3.5 Sonnet are below GPT-5, though Claude 4.x is likely closer.

Use when you need explainable debugging, code review, and a managed repo-level agent and can accept a closed deployment.

3. Google Gemini 2.5 Pro

Gemini 2.5 Pro is Google DeepMind’s main coding and reasoning model for developers. It reports the following results:

LiveCodeBench v5: 70.4%

Aider Polyglot (whole-file editing): 74.0%

SWE-bench Verified: 63.8%

These results place Gemini 2.5 Pro above many earlier models and only behind Claude 3.7 and GPT-5 on SWE-bench Verified.

Context and platform

Long-context capability marketed up to 1M tokens across the Gemini family; 2.5 Pro is the stable tier used in Gemini Apps, Google AI Studio, and Vertex AI.

Tight integration with GCP services, BigQuery, Cloud Run, and Google Workspace.

Strengths

Good combination of LiveCodeBench, Aider, SWE-bench scores plus first-class GCP integration.

Strong choice for “data plus application code” when you want the same model for SQL, analytics helpers, and backend code.

Limits

Closed and tied to Google Cloud.

For pure SWE-bench Verified, GPT-5 and the newest Claude Sonnet 4.x are stronger.

Use when your workloads already run on GCP / Vertex AI and you want a long-context coding model inside that stack.

4. Meta Llama 3.1 405B Instruct

Meta’s Llama 3.1 family (8B, 70B, 405B) is open-weight. The 405B Instruct variant is the high-end option for coding and general reasoning. It reports the following results:

HumanEval (Python): 89.0

MBPP (base or EvalPlus): ≈ 88.6

These scores put Llama 3.1 405B among the strongest open models on classic code benchmarks.

The official model card states that Llama 3.1 models outperform many open and closed chat models on common benchmarks and are optimized for multilingual dialogue and reasoning.

Strengths

High HumanEval / MBPP scores with open weights and permissive licensing.

Strong general performance (MMLU, MMLU-Pro, etc.), so one model can serve both product features and coding agents.

Limits

405B parameters mean high serving cost and latency unless you have a large GPU cluster.

For strictly code benchmarks at a fixed compute budget, specialized models such as Qwen2.5-Coder-32B and Codestral 25.01 are more cost-efficient.

Use when you want a single open foundation model with strong coding and general reasoning, and you control your own GPU infrastructure.

5. DeepSeek-V2.5-1210 (and DeepSeek-V3)

DeepSeek-V2.5-1210 is an upgraded Mixture-of-Experts model that merges the chat and coder lines. The model card reports:

LiveCodeBench (08.01–12.01): improved from 29.2% to 34.38%

MATH-500: 74.8% → 82.8%

DeepSeek has since released DeepSeek-V3, a 671B-parameter MoE with 37B active per token, trained on 14.8T tokens. The performance is comparable to leading closed models on many reasoning and coding benchmarks, and public dashboards show V3 ahead of V2.5 on key tasks.

Strengths

Open MoE model with solid LiveCodeBench results and good math performance for its size.

Efficient active-parameter count vs total parameters.

Limits

V2.5 is no longer the flagship; DeepSeek-V3 is now the reference model.

Ecosystem is lighter than OpenAI / Google / Anthropic; teams must assemble their own IDE and agent integrations.

Use when you want a self-hosted MoE coder with open weights and are ready to move to DeepSeek-V3 as it matures.

6. Qwen2.5-Coder-32B-Instruct

Qwen2.5-Coder is Alibaba’s code-specific LLM family. The technical report and model card describe six sizes (0.5B to 32B) and continued pretraining on over 5.5T tokens of code-heavy data.

The official benchmarks for Qwen2.5-Coder-32B-Instruct list:

HumanEval: 92.7%

MBPP: 90.2%

LiveCodeBench: 31.4%

Aider Polyglot: 73.7%

Spider: 85.1%

CodeArena: 68.9%

Strengths

Very strong HumanEval / MBPP / Spider results for an open model; often competitive with closed models in pure code tasks.

Multiple parameter sizes make it adaptable to different hardware budgets.

Limits

Less suited for broad general reasoning than a generalist like Llama 3.1 405B or DeepSeek-V3.

Documentation and ecosystem are catching up in English-language tooling.

Use when you need a self-hosted, high-accuracy code model and can pair it with a general LLM for non-code tasks.

7. Mistral Codestral 25.01

Codestral 25.01 is Mistral’s updated code generation model. Mistral’s announcement and follow-up posts state that 25.01 uses a more efficient architecture and tokenizer and generates code roughly 2× faster than the base Codestral model.

Benchmark reports:

HumanEval: 86.6%

MBPP: 80.2%

Spider: 66.5%

RepoBench: 38.0%

LiveCodeBench: 37.9%

Codestral 25.01 supports over 80 programming languages and a 256k token context window, and is optimized for low-latency, high-frequency tasks such as completion and FIM.

Strengths

Very good RepoBench / LiveCodeBench scores for a mid-size open model.

Designed for fast interactive use in IDEs and SaaS, with open weights and a 256k context.

Limits

Absolute HumanEval / MBPP scores sit below Qwen2.5-Coder-32B, which is expected at this parameter class.

Use when you need a compact, fast open code model for completions and FIM at scale.

Head to head comparison

| Feature | GPT-5 / GPT-5-Codex | Claude 3.5 / 4.x + Claude Code | Gemini 2.5 Pro | Llama 3.1 405B Instruct | DeepSeek-V2.5-1210 / V3 | Qwen2.5-Coder-32B | Codestral 25.01 |
|---|---|---|---|---|---|---|---|
| Core task | Hosted general model with strong coding and agents | Hosted models plus repo-level coding VM | Hosted coding and reasoning model on GCP | Open generalist foundation with strong coding | Open MoE coder and chat model | Open code-specialized model | Open mid-size code model |
| Context | 128k (chat), up to 400k Pro / Codex | 200k-class (varies by tier) | Long-context, million-class across Gemini line | Up to 128k in many deployments | Tens of k, MoE scaling | 32B with typical 32k–128k contexts depending on host | 256k context |
| Code benchmarks (examples) | 74.9 SWE-bench, 88 Aider | ≈92 HumanEval, ≈91 MBPP, 49 SWE-bench (3.5); 4.x stronger but less published | 70.4 LiveCodeBench, 74 Aider, 63.8 SWE-bench | 89 HumanEval, ≈88.6 MBPP | 34.38 LiveCodeBench; V3 stronger on mixed benchmarks | 92.7 HumanEval, 90.2 MBPP, 31.4 LiveCodeBench, 73.7 Aider | 86.6 HumanEval, 80.2 MBPP, 38 RepoBench, 37.9 LiveCodeBench |
| Deployment | Closed API, OpenAI / Copilot stack | Closed API, Anthropic console, Claude Code | Closed API, Google AI Studio / Vertex AI | Open weights, self-hosted or cloud | Open weights, self-hosted; V3 via providers | Open weights, self-hosted or via providers | Open weights, available on multiple clouds |
| Integration path | ChatGPT, OpenAI API, Copilot | Claude app, Claude Code, SDKs | Gemini Apps, Vertex AI, GCP | Hugging Face, vLLM, cloud marketplaces | Hugging Face, vLLM, custom stacks | Hugging Face, commercial APIs, local runners | Azure, GCP, custom inference, IDE plugins |
| Best fit | Max SWE-bench / Aider performance in hosted setting | Repo-level agents and debugging quality | GCP-centric engineering and data + code | Single open foundation model | Open MoE experiments and Chinese ecosystem | Self-hosted high-accuracy code assistant | Fast open model for IDE and product integration |

What to use when?

You want the strongest hosted repo-level solver: Use GPT-5 / GPT-5-Codex. Claude Sonnet 4.x is the closest competitor, but GPT-5 has the clearest SWE-bench Verified and Aider numbers today.

You want a full coding agent over a VM and GitHub: Use Claude Sonnet + Claude Code for repo-aware workflows and long multi-step debugging sessions.

You are standardized on Google Cloud: Use Gemini 2.5 Pro as the default coding model inside Vertex AI and AI Studio.

You need a single open general foundation: Use Llama 3.1 405B Instruct when you want one open model for application logic, RAG, and code.

You want the strongest open code specialist: Use Qwen2.5-Coder-32B-Instruct, and add a smaller general LLM for non-code tasks if needed.

You want MoE-based open models: Use DeepSeek-V2.5-1210 now and plan for DeepSeek-V3 as you move to the latest upgrade.

You are building IDEs or SaaS products and need a fast open code model: Use Codestral 25.01 for FIM, completion, and mid-size repo work with 256k context.

Editorial comments

GPT-5, Claude Sonnet 4.x, and Gemini 2.5 Pro now define the upper bound of hosted coding performance, especially on SWE-bench Verified and Aider Polyglot. At the same time, open models such as Llama 3.1 405B, Qwen2.5-Coder-32B, DeepSeek-V2.5/V3, and Codestral 25.01 show that it is realistic to run high-quality coding systems on your own infrastructure, with full control over weights and data paths.

For most software engineering teams, the practical answer is a portfolio: one or two hosted frontier models for the hardest multi-service refactors, plus one or two open models for internal tools, regulated code bases, and latency-sensitive IDE integrations.

References

OpenAI – Introducing GPT-5 for developers (SWE-bench Verified, Aider Polyglot) (openai.com)

Vellum, Runbear and other benchmark summaries for GPT-5 coding performance (vellum.ai)

Anthropic – Claude 3.5 Sonnet and Claude 4 announcements (Anthropic)

Kitemetric and other third-party Claude 3.5 Sonnet coding benchmark reviews (Kite Metric)

Google – Gemini 2.5 Pro model page and Google / Datacamp benchmark posts (Google DeepMind)

Meta – Llama 3.1 405B model card and analyses of HumanEval / MBPP scores (Hugging Face)

DeepSeek – DeepSeek-V2.5-1210 model card and update notes; community coverage on V3 (Hugging Face)

Alibaba – Qwen2.5-Coder technical report and Hugging Face model card (arXiv)

Mistral – Codestral 25.01 announcement and benchmark summaries (Mistral AI)

The post Comparing the Top 7 Large Language Models LLMs/Systems for Coding in 2025 appeared first on MarkTechPost.

Cache-to-Cache(C2C): Direct Semantic Communication Between Large Langu …

Can large language models collaborate without sending a single token of text? A team of researchers from Tsinghua University, Infinigence AI, The Chinese University of Hong Kong, Shanghai AI Laboratory, and Shanghai Jiao Tong University say yes. Cache-to-Cache (C2C) is a new communication paradigm where large language models exchange information through their KV-Cache rather than through generated text.

https://arxiv.org/pdf/2510.03215

Text communication is the bottleneck in multi LLM systems

Current multi LLM systems mostly use text to communicate. One model writes an explanation, another model reads it as context.

This design has three practical costs:

Internal activations are compressed into short natural language messages. Much of the semantic signal in the KV-Cache never crosses the interface.

Natural language is ambiguous. Even with structured protocols, a coder model may encode structural signals, such as the role of an HTML <p> tag, that do not survive a vague textual description.

Every communication step requires token by token decoding, which dominates latency in long analytical exchanges.

The C2C work asks a direct question: can we treat the KV-Cache as the communication channel instead?

Oracle experiments, can KV-Cache carry communication

The research team first ran two oracle-style experiments to test whether KV-Cache is a useful medium.

Cache enrichment oracle

They compare three setups on multiple choice benchmarks:

Direct, prefill on the question only.

Few shot, prefill on exemplars plus question, longer cache.

Oracle, prefill on exemplars plus question, then discard the exemplar segment and keep only the question aligned slice of the cache, so cache length is the same as Direct.

https://arxiv.org/pdf/2510.03215

Oracle improves accuracy from 58.42 percent to 62.34 percent at the same cache length, while Few shot reaches 63.39 percent. This demonstrates that enriching the question KV-Cache itself, even without more tokens, improves performance. Layer wise analysis shows that enriching only selected layers is better than enriching all layers, which later motivates a gating mechanism.

Cache transformation oracle

Next, they test whether KV-Cache from one model can be transformed into the space of another model. A three-layer MLP is trained to map KV-Cache from Qwen3 4B to Qwen3 0.6B. t-SNE plots show that the transformed cache lies inside the target cache manifold, but only in a sub-region.

https://arxiv.org/pdf/2510.03215

C2C, direct semantic communication through KV-Cache

Based on these oracles, the research team defines Cache-to-Cache communication between a Sharer and a Receiver model.

During prefill, both models read the same input and produce layer wise KV-Cache. For each Receiver layer, C2C selects a mapped Sharer layer and applies a C2C Fuser to produce a fused cache. During decoding, the Receiver predicts tokens conditioned on this fused cache instead of its original cache.

The C2C Fuser follows a residual integration principle and has three modules:

Projection module concatenates Sharer and Receiver KV-Cache vectors, applies a projection layer, then a feature fusion layer.

Dynamic weighting module modulates heads based on the input so that some attention heads rely more on Sharer information.

Learnable gate adds a per layer gate that decides whether to inject Sharer context into that layer. The gate uses a Gumbel sigmoid during training and becomes binary at inference.

Sharer and Receiver can come from different families and sizes, so C2C also defines:

Token alignment by decoding Receiver tokens to strings and re encoding them with the Sharer tokenizer, then choosing Sharer tokens with maximal string coverage.

Layer alignment using a terminal strategy that pairs top layers first and walks backward until the shallower model is fully covered.

https://arxiv.org/pdf/2510.03215

During training, both LLMs are frozen. Only the C2C module is trained, using a next token prediction loss on Receiver outputs. The main C2C fusers are trained on the first 500k samples of the OpenHermes2.5 dataset, and evaluated on OpenBookQA, ARC Challenge, MMLU Redux and C Eval.
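As an illustration of the residual fusion idea (not the authors' implementation, which also includes the dynamic head-weighting module and the token and layer alignment steps), a minimal per-layer fuser might look like the following PyTorch sketch; all dimensions and names here are assumptions:

import torch
import torch.nn as nn

class C2CFuserSketch(nn.Module):
    """Toy per-layer fuser: concatenate Sharer and Receiver cache vectors,
    project back to the Receiver dimension, and add the result through a
    residual connection scaled by a learnable gate (a soft stand-in for the
    Gumbel-sigmoid gate described in the paper)."""
    def __init__(self, receiver_dim: int, sharer_dim: int):
        super().__init__()
        self.proj = nn.Linear(receiver_dim + sharer_dim, receiver_dim)
        self.fuse = nn.Linear(receiver_dim, receiver_dim)
        self.gate_logit = nn.Parameter(torch.zeros(1))  # one gate per layer

    def forward(self, receiver_kv, sharer_kv):
        # receiver_kv: [batch, seq, receiver_dim], sharer_kv: [batch, seq, sharer_dim]
        fused = self.fuse(torch.relu(self.proj(torch.cat([receiver_kv, sharer_kv], dim=-1))))
        gate = torch.sigmoid(self.gate_logit)
        return receiver_kv + gate * fused  # residual integration

fuser = C2CFuserSketch(receiver_dim=128, sharer_dim=256)
fused_kv = fuser(torch.randn(2, 16, 128), torch.randn(2, 16, 256))
print(fused_kv.shape)  # torch.Size([2, 16, 128])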

Accuracy and latency, C2C versus text communication

Across many Sharer-Receiver combinations built from Qwen2.5, Qwen3, Llama3.2 and Gemma3, C2C consistently improves Receiver accuracy and reduces latency. Key results:

C2C achieves about 8.5 to 10.5 percent higher average accuracy than individual models.

C2C outperforms text communication by about 3.0 to 5.0 percent on average.

C2C delivers around 2x average speedup in latency compared with text based collaboration, and in some configurations the speedup is larger.

A concrete example uses Qwen3 0.6B as Receiver and Qwen2.5 0.5B as Sharer. On MMLU Redux, the Receiver alone reaches 35.53 percent, text to text reaches 41.03 percent, and C2C reaches 42.92 percent. Average time per query for text to text is 1.52 units, while C2C stays close to the single model at 0.40. Similar patterns appear on OpenBookQA, ARC Challenge and C Eval.

On LongBenchV1, with the same pair, C2C outperforms text communication across all sequence length buckets. For sequences of 0 to 4k tokens, text communication reaches 29.47 while C2C reaches 36.64. Gains remain for 4k to 8k and for longer contexts.

https://arxiv.org/pdf/2510.03215

Key Takeaways

Cache-to-Cache communication lets a Sharer model send information to a Receiver model directly via KV-Cache, so collaboration does not need intermediate text messages, which removes the token bottleneck and reduces semantic loss in multi model systems.

Two oracle studies show that enriching only the question aligned slice of the cache improves accuracy at constant sequence length, and that KV-Cache from a larger model can be mapped into a smaller model’s cache space through a learned projector, confirming cache as a viable communication medium.

C2C Fuser architecture combines Sharer and Receiver caches with a projection module, dynamic head weighting and a learnable per layer gate, and integrates everything in a residual way, which allows the Receiver to selectively absorb Sharer semantics without destabilizing its own representation.

Consistent accuracy and latency gains are observed across Qwen2.5, Qwen3, Llama3.2 and Gemma3 model pairs, with about 8.5 to 10.5 percent average accuracy improvement over a single model, 3 to 5 percent gains over text to text communication, and around 2x faster responses because unnecessary decoding is removed.

Editorial Comments

Cache-to-Cache reframes multi LLM communication as a direct semantic transfer problem, not a prompt engineering problem. By projecting and fusing KV-Cache between Sharer and Receiver with a neural fuser and learnable gating, C2C uses the deep specialized semantics of both models while avoiding explicit intermediate text generation, which is an information bottleneck and a latency cost. With 8.5 to 10.5 percent higher accuracy and about 2x lower latency than text communication, C2C is a strong systems level step toward KV native collaboration between models.

Check out the Paper and Codes.
The post Cache-to-Cache(C2C): Direct Semantic Communication Between Large Language Models via KV-Cache Fusion appeared first on MarkTechPost.

Iterate faster with Amazon Bedrock AgentCore Runtime direct code deployment

Amazon Bedrock AgentCore is an agentic platform for building, deploying, and operating effective agents securely at scale. Amazon Bedrock AgentCore Runtime is a fully managed service of Bedrock AgentCore, which provides low latency serverless environments to deploy agents and tools. It provides session isolation, supports multiple agent frameworks including popular open-source frameworks, and handles multimodal workloads and long-running agents.
AgentCore Runtime supports container based deployments where the container definition is provided in a Dockerfile, and the agent is built as a container image. Customers who have container build and deploy pipelines benefit from this method, where agent deployment can be integrated into existing pipelines. 
Today, AgentCore Runtime has launched a second method to deploy agents: direct code deployment (for Python). Agent code and its dependencies can be packaged as a zip archive, removing the need for a Dockerfile and Amazon ECR dependencies. This makes it straightforward for developers to prototype and iterate faster, and it is a good fit for customers who prefer not to manage Docker tooling and container infrastructure when deploying agents.
In this post, we’ll demonstrate how to use direct code deployment (for Python).
Introducing AgentCore Runtime direct code deployment
With the container deployment method, developers create a Dockerfile, build ARM-compatible containers, manage ECR repositories, and upload containers for code changes. This works well where container DevOps pipelines have already been established to automate deployments. 
However, customers looking for fully managed deployments can benefit from direct code deployment, which can significantly reduce deployment time and improve developer productivity. Direct code deployment provides a secure and scalable path from rapidly prototyping agent capabilities to deploying production workloads at scale.
We’ll discuss the strengths of each deployment option to help you choose the right approach for your use case. 

With direct code deployment, developers create a zip archive of code and dependencies, upload it to Amazon S3, and configure the bucket in the agent configuration. When using the AgentCore starter toolkit, the toolkit handles dependency detection, packaging, and upload, which provides a much-simplified developer experience. Direct code deployment is also supported using the API.
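For teams calling the API directly, the packaging and upload steps can be as simple as the following sketch. The folder, bucket, and key names are hypothetical, and the starter toolkit performs this work for you automatically:

import zipfile
from pathlib import Path

import boto3

project_dir = Path("./my_agent")            # hypothetical folder with agent.py and its dependencies
archive_path = Path("agent_package.zip")
bucket = "my-agentcore-artifacts-bucket"    # hypothetical bucket referenced in the agent configuration
key = "agents/my-agent/agent_package.zip"

# Package the code and dependencies into a zip archive
with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
    for path in project_dir.rglob("*"):
        if path.is_file():
            zf.write(path, path.relative_to(project_dir))

# Upload the archive to Amazon S3 so it can be referenced by the agent configuration
boto3.client("s3").upload_file(str(archive_path), bucket, key)
print(f"Uploaded s3://{bucket}/{key}")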
Let’s compare the deployment steps at a high level between the two methods:
Container-based deployment
The container-based deployment method involves the following steps:

Create a Dockerfile
Build ARM-compatible container
Create ECR repository
Upload to ECR
Deploy to AgentCore Runtime

Direct code deployment
The direct code deployment method involves the following steps:

Package your code and dependencies into a zip archive
Upload it to S3
Configure the bucket in agent configuration
Deploy to AgentCore Runtime

How to use direct code deployment
Let’s illustrate how direct code deployment works with an agent created with Strands Agents SDK and using the AgentCore starter-toolkit to deploy the agent.
Prerequisites
Before you begin, make sure you have the following:

Python 3.10, 3.11, 3.12, or 3.13
Your preferred package manager installed (in this example, we use the uv package manager)
An AWS account for creating and deploying agents
Amazon Bedrock model access to Anthropic Claude Sonnet 4.0

Step 1: Initialize your project
Set up a new Python project using the uv package manager, then navigate into the project directory:

uv init <project> --python 3.13
cd <project>

Step 2: Add the dependencies for the project
Install the required Bedrock AgentCore libraries and development tools for your project. In this example, dependencies are added to the project's pyproject.toml file; alternatively, they can be specified in a requirements.txt file:

uv add bedrock-agentcore strands-agents strands-agents-tools
uv add --dev bedrock-agentcore-starter-toolkit
source .venv/bin/activate

Step 3: Create an agent.py file
Create the main agent implementation file that defines your AI agent’s behavior:

from bedrock_agentcore import BedrockAgentCoreApp
from strands import Agent, tool
from strands_tools import calculator
from strands.models import BedrockModel
import logging

app = BedrockAgentCoreApp(debug=True)

# Logging setup
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Create a custom tool
@tool
def weather():
    """ Get weather """
    return "sunny"

model_id = "us.anthropic.claude-sonnet-4-20250514-v1:0"
model = BedrockModel(
    model_id=model_id,
)

agent = Agent(
    model=model,
    tools=[calculator, weather],
    system_prompt="You're a helpful assistant. You can do simple math calculation, and tell the weather."
)

@app.entrypoint
def invoke(payload):
    """Your AI agent function"""
    user_input = payload.get("prompt", "Hello! How can I help you today?")
    logger.info("\n User input: %s", user_input)
    response = agent(user_input)
    logger.info("\n Agent result: %s ", response.message)
    return response.message['content'][0]['text']

if __name__ == "__main__":
    app.run()

Step 4: Deploy to AgentCore Runtime
Configure and deploy your agent to the AgentCore Runtime environment:

agentcore configure --entrypoint agent.py --name <agent-name>

This will launch an interactive session where you configure the S3 bucket to upload the zip deployment package to and choose a deployment configuration type (as shown in the following configuration). To opt for direct code deployment, choose option 1 – Code Zip.
Deployment Configuration
Select deployment type:

Code Zip (recommended) – Simple, serverless, no Docker required
Container – For custom runtimes or complex dependencies

agentcore launch

This command creates a zip deployment package, uploads it to the specified S3 bucket, and launches the agent in the AgentCore Runtime environment, making it ready to receive and process requests.
To test the solution, let’s prompt the agent to see how the weather is:

agentcore invoke '{"prompt":"How is the weather today?"}'

The first deployment takes approximately 30 seconds to complete, but subsequent updates to the agent benefit from the streamlined direct code deployment process and should take less than half the time, supporting faster iteration cycles during development.
When to choose direct code instead of container-based deployment
Let’s look at some of the dimensions and see how the direct code and container-based deployment options are different. This will help you choose the option that’s right for you:

Deployment process: Direct code deploys agents as zip files with no Docker, ECR, or CodeBuild required. Container-based deployment uses Docker and ECR with full Dockerfile control.
Deployment time: Although there is not much difference during first deployment of an agent, subsequent updates to the agent are significantly faster with direct code deployment (from an average of 30 seconds for containers to about 10 seconds for direct code deployment).
Artifact storage: Direct code stores ZIP packages in an S3 bucket. Container-based deployment stores Docker images in Amazon ECR. Direct code deployment incurs storage costs at standard S3 storage rates (starting February 27, 2026) as artifacts are stored in the service account. Container-based deployment incurs Amazon ECR charges in your account.
Customization: Direct code deployment supports custom dependencies through ZIP-based packaging, while container based depends on a Dockerfile.
Package size: Direct code deployment limits the package size to 250MB whereas container-based packages can be up to 2GB in size.
Language Support: Direct code currently supports Python 3.10, 3.11, 3.12, and 3.13. Container-based deployment supports many languages and runtimes.

Our general guidance is:
Container-based deployment is the right choice when your package exceeds 250MB, you have existing container CI/CD pipelines, or you need highly specialized dependencies and custom packaging requirements. Choose containers if you require multi-language support, custom system dependencies or direct control over artifact storage and versioning in your account.
Direct code deployment is the right choice when your package is under 250MB, you use Python 3.10-3.13 with common frameworks like LangGraph, Strands, or CrewAI, and you need rapid prototyping with fast iteration cycles. Choose direct code if your build process is straightforward without complex dependencies, and you want to remove the Docker/ECR/CodeBuild setup.
A hybrid approach works well for many teams: use direct code for rapid prototyping and experimentation, where fast iteration and simple setup accelerate development, then graduate to containers for production when package size, multi-language requirements, or specialized build processes demand it.
Conclusion
Amazon Bedrock AgentCore direct code deployment makes iterative agent development cycles even faster, while still benefiting from enterprise security and scale of deployments. Developers can now rapidly prototype and iterate by deploying their code directly, without having to create a container. To get started with Amazon Bedrock AgentCore direct code deployment, visit the AWS documentation.

About the authors
Chaitra Mathur is a GenAI Specialist Solutions Architect at AWS. She works with customers across industries in building scalable generative AI platforms and operationalizing them. Throughout her career, she has shared her expertise at numerous conferences and has authored several blogs in the Machine Learning and Generative AI domains.
Qingwei Li is a Machine Learning Specialist at Amazon Web Services. He received his Ph.D. in Operations Research after he broke his advisor’s research grant account and failed to deliver the Nobel Prize he promised. Currently he helps customers in the financial service and insurance industry build machine learning solutions on AWS. In his spare time, he likes reading and teaching.
Kosti Vasilakakis is a Principal PM at AWS on the Agentic AI team, where he has led the design and development of several Bedrock AgentCore services from the ground up, including Runtime, Browser, Code Interpreter, and Identity. He previously worked on Amazon SageMaker since its early days, launching AI/ML capabilities now used by thousands of companies worldwide. Earlier in his career, Kosti was a data scientist. Outside of work, he builds personal productivity automations, plays tennis, and enjoys life with his wife and kids.

How to Build Supervised AI Models When You Don’t Have Annotated Data

One of the biggest challenges in real-world machine learning is that supervised models require labeled data—yet in many practical scenarios, the data you start with is almost always unlabeled. Manually annotating thousands of samples isn’t just slow; it’s expensive, tedious, and often impractical.

This is where active learning becomes a game-changer.

Active learning is a subset of machine learning in which the algorithm is not a passive consumer of data—it becomes an active participant. Instead of labeling the entire dataset upfront, the model intelligently selects which data points it wants labeled next. It interactively queries a human or oracle for labels on the most informative samples, allowing it to learn faster using far fewer annotations. Check out the FULL CODES here.

Here’s how the workflow typically looks:

Begin by labeling a small seed portion of the dataset to train an initial, weak model.

Use this model to generate predictions and confidence scores on the unlabeled data.

Compute a confidence metric (e.g., probability gap) for each prediction.

Select only the lowest-confidence samples—the ones the model is most unsure about.

Manually label these uncertain samples and add them to the training set.

Retrain the model and repeat the cycle of predict → rank confidence → label → retrain.

After several iterations, the model can achieve near–fully supervised performance while requiring far fewer manually labeled samples.

In this article, we’ll walk through how to apply this strategy step-by-step and show how active learning can help you build high-quality supervised models with minimal labeling effort. Check out the FULL CODES here.

Installing & Importing the libraries

pip install numpy pandas scikit-learn matplotlib

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

For this tutorial, we will be using the make_classification dataset from the sklearn library

SEED = 42 # For reproducibility
N_SAMPLES = 1000 # Total number of data points
INITIAL_LABELED_PERCENTAGE = 0.10 # Your constraint: Start with 10% labeled data
NUM_QUERIES = 20 # Number of times we ask the “human” to label a confusing sample

NUM_QUERIES = 20 represents the annotation budget in an active learning setup. In a real-world workflow, this would mean the model selects the 20 most confusing samples and sends them to human annotators to label—each annotation costing time and money. In our simulation, we replicate this process automatically: during each iteration, the model selects one uncertain sample, the code instantly retrieves its true label (acting as the human oracle), and the model is retrained with this new information. 

Thus, setting NUM_QUERIES = 20 means we’re simulating the benefit of labeling only 20 strategically chosen samples and observing how much the model improves with that limited but valuable human effort.

Data Generation and Splitting Strategy for Active Learning

This block handles data generation and the initial split that powers the entire Active Learning experiment. It first uses make_classification to create 1,000 synthetic samples for a two-class problem. The dataset is then split into a 10% held-out test set for final evaluation and a 90% pool for training. From this pool, only 10% is kept as the small initial labeled set—matching the constraint of starting with very limited annotations—while the remaining 90% becomes the unlabeled pool. This setup creates the realistic low-label scenario Active Learning is designed for, with a large pool of unlabeled samples ready for strategic querying. Check out the FULL CODES here.

X, y = make_classification(
    n_samples=N_SAMPLES, n_features=10, n_informative=5, n_redundant=0,
    n_classes=2, n_clusters_per_class=1, flip_y=0.1, random_state=SEED
)

# 1. Split into 90% Pool (samples to be queried) and 10% Test (final evaluation)
X_pool, X_test, y_pool, y_test = train_test_split(
    X, y, test_size=0.10, random_state=SEED, stratify=y
)

# 2. Split the 90% Pool into Initial Labeled (10% of the pool) and Unlabeled (90% of the pool)
X_labeled_current, X_unlabeled_full, y_labeled_current, y_unlabeled_full = train_test_split(
    X_pool, y_pool, test_size=1.0 - INITIAL_LABELED_PERCENTAGE,
    random_state=SEED, stratify=y_pool
)

# A set to track indices in the unlabeled pool for efficient querying and removal
unlabeled_indices_set = set(range(X_unlabeled_full.shape[0]))

print(f"Initial Labeled Samples (STARTING N): {len(y_labeled_current)}")
print(f"Unlabeled Pool Samples: {len(unlabeled_indices_set)}")

Initial Training and Baseline Evaluation

This block trains the initial Logistic Regression model using only the small labeled seed set and evaluates its accuracy on the held-out test set. The labeled sample count and baseline accuracy are then stored as the first points in the performance history, establishing a starting benchmark before Active Learning begins. Check out the FULL CODES here.

labeled_size_history = []
accuracy_history = []

# Train the baseline model on the small initial labeled set
baseline_model = LogisticRegression(random_state=SEED, max_iter=2000)
baseline_model.fit(X_labeled_current, y_labeled_current)

# Evaluate performance on the held-out test set
y_pred_init = baseline_model.predict(X_test)
accuracy_init = accuracy_score(y_test, y_pred_init)

# Record the baseline point (x=90, y=0.8800)
labeled_size_history.append(len(y_labeled_current))
accuracy_history.append(accuracy_init)

print(f"INITIAL BASELINE (N={labeled_size_history[0]}): Test Accuracy: {accuracy_history[0]:.4f}")

Active Learning Loop

This block contains the heart of the Active Learning process, where the model iteratively selects the most uncertain sample, receives its true label, retrains, and evaluates performance. In each iteration, the current model predicts probabilities for all unlabeled samples, identifies the one with the highest uncertainty (least confidence), and “queries” its true label—simulating a human annotator. The newly labeled data point is added to the training set, a fresh model is retrained, and accuracy is recorded. Repeating this cycle for 20 queries demonstrates how targeted labeling quickly improves model performance with minimal annotation effort. Check out the FULL CODES here.

current_model = baseline_model  # Start the loop with the baseline model

print(f"\nStarting Active Learning Loop ({NUM_QUERIES} Queries)...")

# -----------------------------------------------
# The Active Learning Loop (Query, Annotate, Retrain, Evaluate)
# Purpose: Run 20 iterations to demonstrate strategic labeling gains.
# -----------------------------------------------
for i in range(NUM_QUERIES):
    if not unlabeled_indices_set:
        print("Unlabeled pool is empty. Stopping.")
        break

    # --- A. QUERY STRATEGY: Find the Least Confident Sample ---
    # 1. Get probability predictions from the CURRENT model for all unlabeled samples
    probabilities = current_model.predict_proba(X_unlabeled_full)
    max_probabilities = np.max(probabilities, axis=1)

    # 2. Calculate Uncertainty Score (1 - Max Confidence)
    uncertainty_scores = 1 - max_probabilities

    # 3. Identify the index of the sample with the MAXIMUM uncertainty score
    current_indices_list = list(unlabeled_indices_set)
    current_uncertainty = uncertainty_scores[current_indices_list]
    most_uncertain_idx_in_subset = np.argmax(current_uncertainty)
    query_index_full = current_indices_list[most_uncertain_idx_in_subset]
    query_uncertainty_score = uncertainty_scores[query_index_full]

    # --- B. HUMAN ANNOTATION SIMULATION ---
    # This is the single critical step where the human annotator intervenes.
    # We look up the true label (y_unlabeled_full) for the sample the model asked for.
    X_query = X_unlabeled_full[query_index_full].reshape(1, -1)
    y_query = np.array([y_unlabeled_full[query_index_full]])

    # Update the Labeled Set: Add the new annotated sample (N becomes N+1)
    X_labeled_current = np.vstack([X_labeled_current, X_query])
    y_labeled_current = np.hstack([y_labeled_current, y_query])
    # Remove the sample from the unlabeled pool
    unlabeled_indices_set.remove(query_index_full)

    # --- C. RETRAIN and EVALUATE ---
    # Train the NEW model on the larger, improved labeled set
    current_model = LogisticRegression(random_state=SEED, max_iter=2000)
    current_model.fit(X_labeled_current, y_labeled_current)

    # Evaluate the new model on the held-out test set
    y_pred = current_model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)

    # Record results for plotting
    labeled_size_history.append(len(y_labeled_current))
    accuracy_history.append(accuracy)

    # Output status
    print(f"\nQUERY {i+1}: Labeled Samples: {len(y_labeled_current)}")
    print(f"  > Test Accuracy: {accuracy:.4f}")
    print(f"  > Uncertainty Score: {query_uncertainty_score:.4f}")

final_accuracy = accuracy_history[-1]

Final Result

The experiment successfully validated the efficiency of Active Learning. By focusing annotation efforts on only 20 strategically selected samples (increasing the labeled set from 90 to 110), the model’s performance on the unseen Test Set improved from 0.8800 (88%) to 0.9100 (91%). 

This 3 percentage point increase in accuracy was achieved with a minimal increase in annotation effort—roughly a 22% increase in the size of the training data resulted in a measurable and meaningful performance boost. 

In essence, the Active Learner acts as an intelligent curator, ensuring that every dollar or minute spent on human labeling provides the maximum possible benefit, proving that smart labeling is far more valuable than random or bulk labeling. Check out the FULL CODES here.

Plotting the results

plt.figure(figsize=(10, 6))
plt.plot(labeled_size_history, accuracy_history, marker='o', linestyle='-', color='#00796b', label='Active Learning (Least Confidence)')
plt.axhline(y=final_accuracy, color='red', linestyle='--', alpha=0.5, label='Final Accuracy')
plt.title('Active Learning: Accuracy vs. Number of Labeled Samples')
plt.xlabel('Number of Labeled Samples')
plt.ylabel('Test Set Accuracy')
plt.grid(True, linestyle='--', alpha=0.7)
plt.legend()
plt.tight_layout()
plt.show()

Check out the FULL CODES here.
The post How to Build Supervised AI Models When You Don’t Have Annotated Data appeared first on MarkTechPost.

How to Design a Persistent Memory and Personalized Agentic AI System w …

In this tutorial, we explore how to build an intelligent agent that remembers, learns, and adapts to us over time. We implement a Persistent Memory & Personalisation system using simple, rule-based logic to simulate how modern Agentic AI frameworks store and recall contextual information. As we progress, we see how the agent’s responses evolve with experience, how memory decay helps prevent overload, and how personalisation improves performance. We aim to understand, step by step, how persistence transforms a static chatbot into a context-aware, evolving digital companion. Check out the FULL CODES here.

import math, time, random
from typing import List

class MemoryItem:
    def __init__(self, kind: str, content: str, score: float = 1.0):
        self.kind = kind
        self.content = content
        self.score = score
        self.t = time.time()

class MemoryStore:
    def __init__(self, decay_half_life=1800):
        self.items: List[MemoryItem] = []
        self.decay_half_life = decay_half_life

    def _decay_factor(self, item: MemoryItem):
        dt = time.time() - item.t
        return 0.5 ** (dt / self.decay_half_life)

We establish the foundation of our agent’s long-term memory. We define the MemoryItem class to hold each piece of information and build a MemoryStore with an exponential decay mechanism, so stored information ages much like human memory. Check out the FULL CODES here.

    # MemoryStore methods (continued)
    def add(self, kind: str, content: str, score: float = 1.0):
        self.items.append(MemoryItem(kind, content, score))

    def search(self, query: str, topk=3):
        scored = []
        for it in self.items:
            decay = self._decay_factor(it)
            sim = len(set(query.lower().split()) & set(it.content.lower().split()))
            final = (it.score * decay) + sim
            scored.append((final, it))
        scored.sort(key=lambda x: x[0], reverse=True)
        return [it for score, it in scored[:topk] if score > 0]

    def cleanup(self, min_score=0.1):
        new = []
        for it in self.items:
            if it.score * self._decay_factor(it) > min_score:
                new.append(it)
        self.items = new

We expand the memory system by adding methods to insert, search, and clean old memories. We implement a simple similarity function and a decay-based cleanup routine, enabling the agent to remember relevant facts while automatically forgetting weak or outdated ones. Check out the FULL CODES here.

class Agent:
    def __init__(self, memory: MemoryStore, name="PersonalAgent"):
        self.memory = memory
        self.name = name

    def _llm_sim(self, prompt: str, context: List[str]):
        base = "OK. "
        if any("prefers short" in c for c in context):
            base = ""
        reply = base + f"I considered {len(context)} past notes. "
        if "summarize" in prompt.lower():
            return reply + "Summary: " + " | ".join(context[:2])
        if "recommend" in prompt.lower():
            if any("cybersecurity" in c for c in context):
                return reply + "Recommended: write more cybersecurity articles."
            if any("rag" in c for c in context):
                return reply + "Recommended: build an agentic RAG demo next."
            return reply + "Recommended: continue with your last topic."
        return reply + "Here's my response to: " + prompt

    def perceive(self, user_input: str):
        ui = user_input.lower()
        if "i like" in ui or "i prefer" in ui:
            self.memory.add("preference", user_input, 1.5)
        if "topic:" in ui:
            self.memory.add("topic", user_input, 1.2)
        if "project" in ui:
            self.memory.add("project", user_input, 1.0)

    def act(self, user_input: str):
        mems = self.memory.search(user_input, topk=4)
        ctx = [m.content for m in mems]
        answer = self._llm_sim(user_input, ctx)
        self.memory.add("dialog", f"user said: {user_input}", 0.6)
        self.memory.cleanup()
        return answer, ctx

We design an intelligent agent that utilizes memory to inform its responses. We create a mock language model simulator that adapts replies based on stored preferences and topics. At the same time, the perception function enables the agent to dynamically capture new insights about the user. Check out the FULL CODES here.

def evaluate_personalisation(agent: Agent):
    agent.memory.add("preference", "User likes cybersecurity articles", 1.6)
    q = "Recommend what to write next"
    ans_personal, _ = agent.act(q)
    empty_mem = MemoryStore()
    cold_agent = Agent(empty_mem)
    ans_cold, _ = cold_agent.act(q)
    gain = len(ans_personal) - len(ans_cold)
    return ans_personal, ans_cold, gain

Now we give our agent the ability to act and evaluate itself. We allow it to recall memories to shape contextual answers and add a small evaluation loop to compare personalised responses versus a memory-less baseline, quantifying how much the memory helps. Check out the FULL CODES here.

mem = MemoryStore(decay_half_life=60)
agent = Agent(mem)

print("=== Demo: teaching the agent about yourself ===")
inputs = [
    "I prefer short answers.",
    "I like writing about RAG and agentic AI.",
    "Topic: cybersecurity, phishing, APTs.",
    "My current project is to build an agentic RAG Q&A system."
]
for inp in inputs:
    agent.perceive(inp)

print("\n=== Now ask the agent something ===")
user_q = "Recommend what to write next in my blog"
ans, ctx = agent.act(user_q)
print("USER:", user_q)
print("AGENT:", ans)
print("USED MEMORY:", ctx)

print("\n=== Evaluate personalisation benefit ===")
p, c, g = evaluate_personalisation(agent)
print("With memory :", p)
print("Cold start  :", c)
print("Personalisation gain (chars):", g)

print("\n=== Current memory snapshot ===")
for it in agent.memory.items:
    print(f"- {it.kind} | {it.content[:60]}... | score~{round(it.score, 2)}")

Finally, we run the full demo to see our agent in action. We feed it user inputs, observe how it recommends personalised actions, and check its memory snapshot. We witness the emergence of adaptive behaviour, proof that persistent memory transforms a static script into a learning companion.

In conclusion, we demonstrate how adding memory and personalisation makes our agent more human-like, capable of remembering preferences, adapting plans, and forgetting outdated details naturally. We observe that even simple mechanisms such as decay and retrieval significantly improve the agent’s relevance and response quality. By the end, we realize that persistent memory is the foundation of next-generation Agentic AI, one that learns continuously, tailors experiences intelligently, and maintains context dynamically in a fully local, offline setup.

Check out the FULL CODES here.
The post How to Design a Persistent Memory and Personalized Agentic AI System with Decay and Self-Evaluation? appeared first on MarkTechPost.

Anyscale and NovaSky Team Releases SkyRL tx v0.1.0: Bringing Tinker Co …

How can AI teams run Tinker style reinforcement learning on large language models using their own infrastructure with a single unified engine? The Anyscale and NovaSky (UC Berkeley) team has released SkyRL tx v0.1.0, which gives developers a way to run a Tinker compatible training and inference engine directly on their own hardware, while keeping the same minimal API that Tinker exposes in the managed service.

The research team describes SkyRL tx as a unified training and inference engine that implements the Tinker API and allows people to run a Tinker like service on their own infrastructure. This v0.1.0 version is the first of its series that supports reinforcement learning end to end, and it also makes sampling significantly faster.

Tinker API in brief

Tinker from Thinking Machines is a training API built around four core functions. forward_backward performs a forward pass and a backward pass and accumulates gradients. optim_step updates model weights based on those gradients. sample generates tokens for interaction, evaluation or RL actions. save_state writes checkpoints for resuming training.

Instead of a full task specific fine tuning abstraction, Tinker exposes these low level primitives so that users can implement their own supervised or reinforcement learning loops in regular Python code, while the service handles GPU scheduling and distributed execution.
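
To make this concrete, here is a minimal, hypothetical sketch of a user-written loop over these four primitives. Only the primitive names come from the Tinker API description above; the client object, its construction, and the argument names are illustrative assumptions rather than the actual SDK surface.

# Hypothetical sketch of a custom training loop over the four Tinker primitives.
# The `client` object and all argument names are assumptions for illustration only.
def train_loop(client, batches, num_steps, checkpoint_every=100):
    for step in range(num_steps):
        batch = next(batches)

        # forward_backward: forward pass plus backward pass, gradients are accumulated
        client.forward_backward(batch)

        # optim_step: apply the accumulated gradients to the model weights
        client.optim_step()

        # sample: generate tokens, e.g. for evaluation or to collect RL actions
        if step % 50 == 0:
            print(step, client.sample("Probe prompt:", max_tokens=16))

        # save_state: write a checkpoint so the run can resume later
        if step % checkpoint_every == 0:
            client.save_state(f"ckpt-{step}")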

SkyRL tx targets this exact API and implements an open backend that users can deploy locally. It keeps the Tinker programming model, while removing the need to rely only on the hosted environment.

Where SkyRL tx fits inside SkyRL

SkyRL is a full stack reinforcement learning library for large language models that includes skyrl-agent for long horizon agents, skyrl-train for training, and skyrl-gym for tool use environments such as math, coding, search and SQL.

Within this stack, skyrl-tx is marked as an experimental cross platform library that exposes a local Tinker like REST API for model post training. SkyRL tx therefore becomes the system layer that connects RL logic, environments and training code to concrete GPU resources through the Tinker interface.

Architecture, inference engine that also trains

The SkyRL tx architecture is described as an inference engine that also supports backward passes. It has four main components:

REST API server that processes incoming requests from different users.

Database that tracks metadata about models, checkpoints, requests and futures, and also acts as a job queue. The current implementation uses SQLite behind an interface that also supports other SQL databases such as Postgres.

Engine that schedules and batches requests across users. Each engine instance serves a single base model and can attach many LoRA adapters.

Worker that executes forward and backward passes and holds model definitions and optimizer states. Multiple workers will enable more advanced multi node sharding in upcoming versions.

What v0.1.0 adds

The v0.1.0 release focuses on reinforcement learning support and performance improvements. The official release highlights several concrete changes:

Sampling is now much faster, since it is jitted and properly batched and sharded in the engine.

Different sampling parameters per request, per request seeds and stop tokens are now supported, which is useful when many experiments share a base model.

After several fixes, the RL loop now runs properly through the engine.

Gradient checkpointing support and micro batching for sampling are implemented.

Postgres is now supported as a database backend, next to SQLite.

Running RL end to end on 8 H100 GPUs

The official release contains a specific code recipe for running reinforcement learning end to end on a cluster with 8 H100 GPUs.

First, users clone the SkyRL repository and in the skyrl-tx folder start the engine with:

uv run --extra gpu --extra tinker -m tx.tinker.api \
  --base-model Qwen/Qwen3-4B \
  --max-lora-adapters 3 \
  --max-lora-rank 1 \
  --tensor-parallel-size 8 \
  --train-micro-batch-size 8 > out.log

Then they clone the Tinker Cookbook from the Thinking Machines team and in the tinker_cookbook/recipes folder run:

export TINKER_API_KEY=dummy
export WANDB_API_KEY=<your key>
uv run --with wandb --with tinker rl_loop.py \
  base_url=http://localhost:8000 \
  model_name="Qwen/Qwen3-4B" \
  lora_rank=1 \
  max_length=1024 \
  save_every=100

This produces a reward curve that confirms the RL loop runs correctly through the local SkyRL tx backend.

Key Takeaways

SkyRL tx v0.1.0 implements a local, Tinker compatible engine that unifies training and inference for LLM post training.

The system exposes Tinker primitives, forward_backward, optim_step, sample and save_state over REST, while handling batching, LoRA adapters and device placement internally.

Architecture is split into API server, SQL database, scheduling engine and workers that execute forward and backward passes for a single base model with multiple LoRA adapters.

v0.1.0 adds end to end reinforcement learning support, faster jitted and sharded sampling, per request sampling parameters, gradient checkpointing, micro batching and Postgres support.

Editorial Comments

SkyRL tx v0.1.0 is a practical step for dev teams that want Tinker style reinforcement learning on their own clusters with a consistent Tinker API surface. The design that treats the system as an inference engine that also runs backward passes is clean and reduces stack divergence. Support for LoRA, gradient checkpointing, micro batching and Postgres is a concrete systems upgrade. Overall, this release turns Tinker compatibility into an actionable local RL backend for LLM post training.

Check out the Repo and Official Release.
The post Anyscale and NovaSky Team Releases SkyRL tx v0.1.0: Bringing Tinker Compatible Reinforcement Learning RL Engine To Local GPU Clusters appeared first on MarkTechPost.

How Switchboard, MD automates real-time call transcription in clinical …

In high-volume healthcare contact centers, every patient conversation carries both clinical and operational significance, making accurate real-time transcription necessary for automated workflows. Accurate, instant transcription enables intelligent automation without sacrificing clarity or care, so teams can automate electronic medical record (EMR) matching, streamline workflows, and eliminate manual data entry. By removing routine process steps, staff can stay fully focused on patient conversations, improving both the experience and the outcome. As healthcare systems seek to balance efficiency with empathy, real-time transcription has become an essential capability for delivering responsive, high-quality care at scale.
Switchboard, MD is a physician-led AI and data science company with a mission to prioritize the human connection in medicine. Its service improves patient engagement and outcomes, while reducing inefficiency and burnout. By designing and deploying clinically relevant solutions, Switchboard, MD helps providers and operators collaborate more effectively to deliver great experiences for both patients and staff. One of its key solutions is streamlining the contact center using AI voice automation, real-time medical record matching, and suggested next steps, which has led to significant reductions in queue times and call abandonment rates.
With more than 20,000 calls handled each month, Switchboard, MD supports healthcare providers in delivering timely, personalized communication at scale. Its AI platform is already helping reduce call queue times, improve patient engagement, and streamline contact center operations for clinics and health systems. Customers using Switchboard have seen outcomes such as:

75% reduction in queue times
59% reduction in call abandonment rate

Despite these early successes, Switchboard faced a critical challenge: their existing transcription approach couldn’t scale economically while maintaining the accuracy required for clinical workflows. Cost and word error rate (WER) weren’t just operational metrics—they were critical enablers for scaling automation and expanding Switchboard’s impact across more patient interactions.
In this post, we examine the specific challenges Switchboard, MD faced with scaling transcription accuracy and cost-effectiveness in clinical environments, their evaluation process for selecting the right transcription solution, and the technical architecture they implemented using Amazon Connect and Amazon Kinesis Video Streams. This post details the impressive results achieved and demonstrates how they were able to use this foundation to automate EMR matching and give healthcare staff more time to focus on patient care. Finally, we’ll look at the broader implications for healthcare AI automation and how other organizations can implement similar solutions using Amazon Bedrock.
Choosing an accurate, scalable, and cost-effective transcription model for contact center automation
Switchboard, MD needed a transcription solution that delivered high accuracy at a sustainable cost. In clinical settings, transcription accuracy is critical because errors can compromise EMR record matching, affect recommended treatment plans, and disrupt automated workflows. At the same time, scaling support for thousands of calls each week meant that inference costs couldn’t be ignored.
Switchboard initially explored multiple paths, including evaluating open source models such as OpenAI’s Whisper hosted locally. But these options presented tradeoffs, either in performance, cost, or integration complexity.
After testing, the team determined that Amazon Nova Sonic provided the right combination of transcription quality and efficiency needed to support their healthcare use case. The model performed reliably across live caller audio, even in noisy or variable conditions. It delivered:

80–90% lower transcription costs
A word error rate of 4% on Switchboard’s proprietary evaluation dataset
Low-latency output that aligned with their need for real-time processing

Equally important, Nova Sonic integrated smoothly into Switchboard’s existing architecture, minimizing engineering lift and accelerating deployment. With this foundation, the team reduced manual transcription steps and scaled accurate, real-time automation across thousands of patient interactions.

“Our vision is to restore the human connection in medicine by removing administrative barriers that get in the way of meaningful interaction. Nova Sonic gave us the speed and accuracy we needed to transcribe calls in real time—so our customers can focus on what truly matters: the patient conversation. By reducing our transcription costs by 80–90%, it’s also made real-time automation sustainable at scale.” – Dr. Blake Anderson, Founder, CEO, and CTO, Switchboard, MD

Architecture and implementation
Switchboard’s architecture uses Amazon Connect to capture live audio from both patients and representatives. Switchboard processes audio streams through Amazon Kinesis Video Streams, which handles the real-time media conversion before routing the data to containerized AWS Lambda functions. Switchboard’s Lambda functions establish bidirectional streaming connections with Amazon Nova Sonic using BedrockRuntimeClient’s InvokeModelWithBidirectionalStream API. This novel architecture creates separate transcription streams for each conversation participant, which Switchboard recombines to create the complete transcription record. The entire processing pipeline runs in a serverless environment, providing scalable operation designed to handle thousands of concurrent calls while using Nova Sonic’s real-time speech-to-text capabilities for immediate transcription processing.
Nova Sonic integration: Real-time speech processing
Harnessing Amazon Nova Sonic’s advanced audio streaming and processing, Switchboard built the capability to separate and recombine speakers’ streams and transcripts. This makes Amazon Nova Sonic particularly effective for Switchboard’s healthcare applications, where accurate transcription and speaker identification are crucial.
Amazon Nova Sonic offers configurable settings that can be optimized for different healthcare use cases, with the flexibility to prioritize either transcription or speech generation based on specific needs. A key cost-optimization feature is the ability to adjust speech output tokens – organizations can set lower token values when primarily focused on transcription, resulting in significant cost savings while maintaining high accuracy. This versatility and cost flexibility makes Amazon Nova Sonic a valuable tool for healthcare organizations like Switchboard looking to implement voice-enabled solutions.
Why serverless: Strategic advantages for healthcare innovation
Switchboard’s choice of a serverless architecture using Amazon Connect, Amazon Kinesis Video Streams, and containerized Lambda functions represents a strategic decision that maximizes operational efficiency while minimizing infrastructure overhead. The serverless approach eliminates the need to provision, manage, and monitor underlying infrastructure, so that Switchboard’s engineering team can focus on developing clinical automation features rather than server management. This architecture provides built-in fault tolerance and high availability for critical healthcare communications without requiring extensive configuration from Switchboard’s team.
Switchboard’s event-driven architecture, shown in the following figure, enables the system to scale from handling dozens to thousands of concurrent calls, with AWS automatically managing capacity provisioning behind the scenes. With the pay-as-you-go billing model, Switchboard pays only for the compute resources used during call processing, optimizing costs while eliminating the risk of over-provisioning servers that would sit idle during low-volume periods.

Conclusion
Switchboard, MD’s implementation of Amazon Nova Sonic demonstrates how the right transcription technology can transform healthcare operations. By achieving 80–90% cost reductions while maintaining clinical-grade accuracy, they’ve created a sustainable foundation for scaling AI-powered patient interactions across the healthcare industry.
By building on Amazon Bedrock, Switchboard now has the flexibility to expand automation across more use cases and provider networks. Their success exemplifies how healthcare innovators can combine accuracy, speed, and efficiency to transform how care teams connect with patients—one conversation at a time.
Get started with Amazon Nova on the Amazon Bedrock console. Learn more about Amazon Nova models at the Amazon Nova product page.

About the authors
Tanner Jones is a Technical Account Manager in AWS Enterprise Support, where he helps customers navigate and optimize their production applications on AWS. He specializes in helping customers develop applications that incorporate AI agents, with a particular focus on building safe multi-agent systems.
Anuj Jauhari is a Sr. Product Marketing Manager at AWS, where he helps customers innovate and drive business impact with generative AI solutions built on Amazon Nova models.
Jonathan Woods is a Solutions Architect at AWS based in Nashville, currently working with SMB customers. He has a passion for communicating AWS technology to businesses in a relevant way, making it easy for customers to innovate. Outside of work, he tries to keep up with his three kids.
Nauman Zulfiqar is a senior account manager based in New York working with SMB clients. He loves building and maintaining strong customer relationships, understanding their business challenges and serving as the customer’s primary business advocate within AWS.

How to Create AI-ready APIs?

Postman recently released a comprehensive checklist and developer guide for building AI-ready APIs, highlighting a simple truth: even the most powerful AI models are only as good as the data they receive—and that data comes through your APIs. If your endpoints are inconsistent, unclear, or unreliable, models waste time fixing bad inputs instead of producing insight. Postman’s playbook distills years of best practices into practical steps that help teams make their APIs predictable, machine-readable, and dependable for AI workloads.

This article summarizes the key ideas from that playbook. As we move into a world where Agents—not humans—will make purchases, compare options, and interact with services, APIs must evolve. Unlike developers, Agents can’t compensate for messy docs or ambiguous behavior. They rely on standardized patterns and automatically generated, machine-consumable documentation that stays in sync with your schema. The goal is simple: create APIs that humans and AI agents can understand instantly, so your systems can scale smarter and unlock their full potential.

Machine consumable metadata

Humans can infer missing details from vague API docs, but AI agents can’t; they rely entirely on explicit, machine-readable metadata. Instead of saying “this endpoint returns user preferences,” an AI-ready API must define everything: request type, parameter schema, response structure, and object definitions. Clear, explicit metadata, as in the illustrative sketch below, removes ambiguity, ensures agents don’t guess, and makes APIs fully understandable to machines.
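
As a hedged illustration, and not a specific Postman or OpenAPI artifact, the sketch below contrasts a vague description with explicit, machine-consumable metadata for a hypothetical preferences endpoint; the endpoint and all field names are invented for this example.

# Illustrative only: explicit, machine-readable metadata for a hypothetical endpoint.
# Vague version an agent cannot use: "this endpoint returns user preferences"
endpoint_metadata = {
    "method": "GET",
    "path": "/users/{user_id}/preferences",
    "parameters": {
        "user_id": {"type": "string", "required": True, "description": "Unique user identifier"}
    },
    "response": {
        "status": 200,
        "schema": {
            "type": "object",
            "properties": {
                "theme": {"type": "string", "enum": ["light", "dark"]},
                "notifications_enabled": {"type": "boolean"}
            },
            "required": ["theme", "notifications_enabled"]
        }
    }
}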

Rich Error Semantics

Developers can interpret vague errors like “Something went wrong,” but AI agents can’t—they need precise, structured guidance. AI-ready APIs must clearly spell out what failed, why it failed, and how to fix it. Rich error metadata with fields like code, message, expected, and received removes guesswork and enables agents to self-correct instead of getting stuck.
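
As a hedged illustration of the fields named above (code, message, expected, received), the sketch below shows a structured error payload an agent could act on without guessing; the exact field set and values are invented for this example.

# Illustrative only: a structured, self-describing error response.
error_response = {
    "error": {
        "code": "INVALID_PARAMETER",
        "message": "Parameter 'start_date' must be an ISO 8601 date.",
        "expected": "YYYY-MM-DD",
        "received": "11/20/2025",
        "how_to_fix": "Resend the request with start_date formatted like 2025-11-20."
    }
}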

Introspection Capabilities

For APIs to be AI-ready, they must move beyond human-centric, vague documentation. Unlike developers who can infer missing details using context and RESTful conventions, AI agents rely entirely on structured data for planning and execution. This means APIs must provide complete introspection through a full schema, explicitly defining all endpoints, parameters, data schemas, and error codes. Without this clarity, AI systems are forced to guess, which inevitably leads to broken workflows and unreliable, hallucinated behavior.

Consistent Naming Patterns

AI systems rely on consistent patterns, so predictable naming conventions make your API far easier for them to understand and navigate. When endpoints and fields follow clear, uniform structures—like proper REST methods and consistent casing—AI can infer relationships and behaviors without guesswork. This reduces ambiguity and enables more accurate automation, reasoning, and integration across your entire API.
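
As a small illustration under the same assumptions as the previous sketches, consistent resource naming and casing lets an agent infer the shape of endpoints it has not seen yet; the routes below are invented for the example.

# Illustrative only: uniform REST naming an agent can pattern-match on.
consistent_endpoints = [
    "GET  /orders",
    "POST /orders",
    "GET  /orders/{order_id}",
    "GET  /orders/{order_id}/items",
    "GET  /customers/{customer_id}/orders",
]
# Inconsistent counterexamples that force guessing: /getOrder, /order_items_list, /FetchCustomerOrders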

Predictable behaviour

AI agents need strict consistency—same inputs should always produce the same structure, format, and fields. Humans can troubleshoot inconsistent responses using intuition, but AI can’t assume or investigate; it only learns from the patterns you provide. If naming, nesting, or errors vary across endpoints, the agent becomes unreliable or breaks entirely. To be AI-ready, your API must enforce predictable responses, uniform naming, consistent error handling, and zero hidden edge cases. In short: inconsistent inputs lead to inconsistent agent behavior.

Proper documentation

Humans can look things up when docs are unclear, but AI agents can’t—they only know what your API explicitly tells them. Without clear, complete documentation, an agent can’t discover endpoints, understand parameters, predict responses, or recover from errors. Good documentation isn’t optional for AI-ready APIs—it’s the only way agents can learn and reliably interact with your system.

Reliable and fast

AI agents act as orchestrators, making rapid and often parallel API calls—so your API’s speed and reliability directly impact their performance. Humans can wait out slow responses or retry manually, but agents will time out, fail, or break entire workflows. In fast, automated environments, an AI system is only as strong as the APIs it relies on. If your API can’t keep up, neither can your AI.

Discoverability

Humans can track down missing APIs through wikis, chats, code, or intuition—but AI agents can’t. If an API isn’t clearly published with structured, searchable metadata, it simply doesn’t exist to them. AI systems depend on standardized, discoverable specs and examples to understand how to use an API. Making your API visible, accessible, and well-indexed—through platforms like the Postman API Network—ensures both developers and agents can reliably find and integrate it.
The post How to Create AI-ready APIs? appeared first on MarkTechPost.

LongCat-Flash-Omni: A SOTA Open-Source Omni-Modal Model with 560B Para …

How do you design a single model that can listen, see, read and respond in real time across text, image, video and audio without losing efficiency? Meituan’s LongCat team has released LongCat Flash Omni, an open source omni modal model with 560 billion parameters and about 27 billion active per token, built on the shortcut connected Mixture of Experts design that LongCat Flash introduced. The model extends the text backbone to vision, video and audio, and it keeps a 128K context so it can run long conversations and document level understanding in one stack.

https://github.com/meituan-longcat/LongCat-Flash-Omni?tab=readme-ov-file

Architecture and Modal Attachments

LongCat Flash Omni keeps the language model unchanged, then adds perception modules. A LongCat ViT encoder processes both images and video frames so there is no separate video tower. An audio encoder together with the LongCat Audio Codec turns speech into discrete tokens, then the decoder can output speech from the same LLM stream, which enables real time audio visual interaction.

Streaming and Feature Interleaving

The research team describes chunk wise audio visual feature interleaving, where audio features, video features and timestamps are packed into 1 second segments. Video is sampled at 2 frames per second by default, then the rate is adjusted according to video length, the report does not tie the sampling rule to user or model speaking phases, so the correct description is duration conditioned sampling. This keeps latency low and still provides spatial context for GUI, OCR and video QA tasks.
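
To make the interleaving idea concrete, here is a small illustrative sketch of packing per-second audio and video features into synchronized 1 second chunks with a default 2 fps video rate; this is not the actual LongCat implementation, and the duration-conditioned adjustment rule shown is an assumption.

# Illustrative only: chunk-wise audio-visual feature interleaving into 1 second segments.
# The duration-based fps adjustment below is an assumed rule, not LongCat's actual policy.
from typing import Dict, List

def fps_for_duration(duration_s: float, default_fps: int = 2) -> int:
    # Assumption: sample long clips more sparsely to bound the token budget
    return 1 if duration_s > 300 else default_fps

def pack_segments(audio_feats: List[list], frame_feats: List[list], duration_s: float) -> List[Dict]:
    fps = fps_for_duration(duration_s)
    segments = []
    for sec in range(int(duration_s)):
        segments.append({
            "timestamp": sec,                 # 1 second granularity
            "audio": audio_feats[sec],        # audio features for this second
            "video": frame_feats[sec][:fps],  # frames kept for this second
        })
    return segments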

Curriculum from Text to Omni

Training follows a staged curriculum. The research team first trains the LongCat Flash text backbone, which activates 18.6B to 31.3B parameters per token, average 27B, then applies text speech continued pretraining, then multimodal continued pretraining with image and video, then context extension to 128K, then audio encoder alignment.

Systems Design, Modality Decoupled Parallelism

Because the encoders and the LLM have different compute patterns, Meituan uses modality decoupled parallelism. Vision and audio encoders run with hybrid sharding and activation recomputation, the LLM runs with pipeline, context and expert parallelism, and a ModalityBridge aligns embeddings and gradients. The research team reports that multimodal supervised fine tuning keeps more than 90 percent of the throughput of text only training, which is the main systems result in this release.

https://github.com/meituan-longcat/LongCat-Flash-Omni?tab=readme-ov-file

Benchmarks and Positioning

LongCat Flash Omni reaches 61.4 on OmniBench, this is higher than Qwen 3 Omni Instruct at 58.5 and Qwen 2.5 Omni at 55.0, but lower than Gemini 2.5 Pro at 66.8. On VideoMME it scores 78.2, which is close to GPT 4o and Gemini 2.5 Flash, and on VoiceBench it reaches 88.7, slightly higher than GPT 4o Audio in the same table.

Key Takeaways

LongCat Flash Omni is an open source omni modal model built on Meituan’s 560B MoE backbone, it activates about 27B parameters per token through shortcut connected MoE with zero computation experts, so it keeps large capacity but inference friendly compute.

The model attaches unified vision video encoding and a streaming audio path to the existing LongCat Flash LLM, using 2 fps default video sampling with duration conditioned adjustment, and packs audio visual features into 1 second chunks for synchronized decoding, which is what enables real time any to any interaction.

LongCat Flash Omni scores 61.4 on OmniBench, above Qwen 3 Omni Instruct at 58.5, but below Gemini 2.5 Pro at 66.8.

Meituan uses modality decoupled parallelism, vision and audio encoders run with hybrid sharding, the LLM runs with pipeline, context and expert parallelism, and the team reports more than 90 percent of text only throughput for multimodal SFT, which is the main systems contribution of the release.

Editorial Comments

This release shows that Meituan is trying to make omni modal interaction practical, not experimental. It keeps the 560B Shortcut connected Mixture of Experts with 27B activated, so the language backbone stays compatible with earlier LongCat releases. It adds streaming audio visual perception with 2 fps default video sampling and duration conditioned adjustment, so latency remains low without losing spatial grounding. It reports over 90 percent text only throughput in multimodal supervised fine tuning through modality decoupled parallelism.

Check out the Paper, Model Weights and GitHub Repo.
The post LongCat-Flash-Omni: A SOTA Open-Source Omni-Modal Model with 560B Parameters with 27B activated, Excelling at Real-Time Audio-Visual Interaction appeared first on MarkTechPost.