Anthropic Proposes Targeted Transparency Framework for Frontier AI Systems

As the development of large-scale AI systems accelerates, concerns about safety, oversight, and risk management are becoming increasingly critical. In response, Anthropic has introduced a targeted transparency framework aimed specifically at frontier AI models—those with the highest potential impact and risk—while deliberately excluding smaller developers and startups to avoid stifling innovation across the broader AI ecosystem.

Why a Targeted Approach?

Anthropic’s framework addresses the need for differentiated regulatory obligations. It argues that universal compliance requirements could overburden early-stage companies and independent researchers. Instead, the proposal focuses on a narrow class of developers: companies building models that surpass specific thresholds for computational power, evaluation performance, R&D expenditure, and annual revenue. This scope ensures that only the most capable—and potentially hazardous—systems are subject to stringent transparency requirements.

Key Components of the Framework

The proposed framework is structured into four major sections: scope, pre-deployment requirements, transparency obligations, and enforcement mechanisms.

I. Scope

The framework applies to organizations developing frontier models—defined not by model size alone, but by a combination of factors including:

Compute scale

Training cost

Evaluation benchmarks

Total R&D investment

Annual revenue

Importantly, startups and small developers are explicitly excluded, using financial thresholds to prevent unnecessary regulatory overhead. This is a deliberate choice to maintain flexibility and support innovation at the early stages of AI development.

II. Pre-Deployment Requirements

Central to the framework is the requirement for companies to implement a Secure Development Framework (SDF) before releasing any qualifying frontier model.

Key SDF requirements include:

Model Identification: Companies must specify which models the SDF applies to.

Catastrophic Risk Mitigation: Plans must be in place to assess and mitigate catastrophic risks—defined broadly to include Chemical, Biological, Radiological, and Nuclear (CBRN) threats, and autonomous actions by models that contradict developer intent.

Standards and Evaluations: Clear evaluation procedures and standards must be outlined.

Governance: A responsible corporate officer must be assigned for oversight.

Whistleblower Protections: Processes must support internal reporting of safety concerns without retaliation.

Certification: Companies must affirm SDF implementation before deployment.

Recordkeeping: SDFs and their updates must be retained for at least five years.

This structure promotes rigorous pre-deployment risk analysis while embedding accountability and institutional memory.

III. Minimum Transparency Requirements

The framework mandates public disclosure of safety processes and results, with allowances for sensitive or proprietary information.

Covered companies must:

Publish SDFs: These must be posted in a publicly accessible format.

Release System Cards: At deployment or upon adding major new capabilities, documentation (akin to model “nutrition labels”) must summarize testing results, evaluation procedures, and mitigations.

Certify Compliance: A public confirmation that the SDF has been followed, including descriptions of any risk mitigations.

Redactions are allowed for trade secrets or public safety concerns, but any omissions must be justified and flagged.

This strikes a balance between transparency and security, ensuring accountability without risking model misuse or competitive disadvantage.

IV. Enforcement

The framework proposes modest but clear enforcement mechanisms:

False Statements Prohibited: Intentionally misleading disclosures regarding SDF compliance are banned.

Civil Penalties: The Attorney General may seek penalties for violations.

30-Day Cure Period: Companies have an opportunity to rectify compliance failures within 30 days.

These provisions emphasize compliance without creating excessive litigation risk, providing a pathway for responsible self-correction.

Strategic and Policy Implications

Anthropic’s targeted transparency framework serves as both a regulatory proposal and a norm-setting initiative. It aims to establish baseline expectations for frontier model development before regulatory regimes are fully in place. By anchoring oversight in structured disclosures and responsible governance—rather than blanket rules or model bans—it provides a blueprint that could be adopted by policymakers and peer companies alike.

The framework’s modular structure could also evolve. As risk signals, deployment scales, or technical capabilities change, the thresholds and compliance requirements can be revised without upending the entire system. This design is particularly valuable in a field as fast-moving as frontier AI.

Conclusion

Anthropic’s proposal for a Targeted Transparency Framework offers a pragmatic middle ground between unchecked AI development and overregulation. It places meaningful obligations on developers of the most powerful AI systems—those with the greatest potential for societal harm—while allowing smaller players to operate without excessive compliance burdens.

As governments, civil society, and the private sector wrestle with how to regulate foundation models and frontier systems, Anthropic’s framework provides a technically grounded, proportionate, and enforceable path forward.


Google AI Just Open-Sourced an MCP Toolbox to Let AI Agents Query Databases Safely and Efficiently

Google has released the MCP Toolbox for Databases, a new open-source module under its GenAI Toolbox aimed at simplifying the integration of SQL databases into AI agents. The release is part of Google’s broader strategy to advance the Model Context Protocol (MCP), a standardized approach that allows language models to interact with external systems—including tools, APIs, and databases—using structured, typed interfaces.

This toolbox addresses a growing need: enabling AI agents to interact with structured data repositories like PostgreSQL and MySQL in a secure, scalable, and efficient manner. Traditionally, building such integrations requires managing authentication, connection handling, schema alignment, and security controls—introducing friction and complexity. The MCP Toolbox removes much of this burden, making integration possible with less than 10 lines of Python and minimal configuration.

Why This Matters for AI Workflows

Databases are essential for storing and querying operational and analytical data. In enterprise and production contexts, AI agents need to access these data sources to perform tasks like reporting, customer support, monitoring, and decision automation. However, connecting large language models (LLMs) directly to SQL databases introduces operational and security concerns such as unsafe query generation, poor connection lifecycle management, and exposure of sensitive credentials.

The MCP Toolbox for Databases solves these problems by providing:

Built-in support for credential-based authentication

Secure and scalable connection pooling

Schema-aware tool interfaces for structured querying

MCP-compliant input/output formats for compatibility with LLM orchestration frameworks

Key Technical Highlights

Minimal Configuration, Maximum Usability

The toolbox allows developers to integrate databases with AI agents using a configuration-driven setup. Instead of dealing with raw credentials or managing individual connections, developers can simply define their database type and environment, and the toolbox handles the rest. This abstraction reduces the boilerplate and risk associated with manual integration.
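As a rough illustration of how little agent-side code this setup can require, the sketch below connects to a locally running toolbox server and loads the tools defined in its configuration. The package and method names (toolbox_core, ToolboxClient, load_toolset) and the server address are illustrative assumptions; check the repository's README for the exact interface before relying on them.

```python
# Minimal sketch (assumed API names): connect to a locally running MCP Toolbox
# server and fetch the tools defined in its configuration file.
import asyncio

from toolbox_core import ToolboxClient  # assumed package and class name


async def main():
    # Assumes a toolbox server is already running at this address, configured
    # with a PostgreSQL or MySQL source and one or more SQL tools.
    async with ToolboxClient("http://127.0.0.1:5000") as client:
        tools = await client.load_toolset()  # assumed method name

        # Each loaded tool is a callable with a typed, schema-described
        # interface, so it can be handed to an agent framework or invoked here.
        for tool in tools:
            print(tool.__name__, tool.__doc__)


asyncio.run(main())
```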

Native Support for MCP-Compliant Tooling

All tools generated through the toolbox conform to the Model Context Protocol, which defines structured input/output formats for tool interactions. This standardization improves interpretability and safety by constraining LLM interactions through schemas rather than free-form text. These tools can be used directly in agent orchestration frameworks such as LangChain or Google’s own agent infrastructure.

The structured nature of MCP-compliant tools also aids in prompt engineering, allowing LLMs to reason more effectively and safely when interacting with external systems.

Connection Pooling and Authentication

The database interface includes native support for connection pooling to handle concurrent queries efficiently—especially important in multi-agent or high-traffic systems. Authentication is handled securely through environment-based configurations, reducing the need to hard-code credentials or expose them during runtime.

This design minimizes risks such as leaking credentials or overwhelming a database with concurrent requests, making it suitable for production-grade deployment.
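The underlying idea can be illustrated with plain SQLAlchemy, on which the toolbox builds (see the open-source section below). This is a generic sketch, not the toolbox's own code: credentials come from environment variables and the engine maintains a bounded connection pool shared across agent calls.

```python
# Generic illustration: environment-based credentials plus a pooled engine,
# so no secret is hard-coded and concurrent agents reuse a bounded set of
# database connections.
import os

from sqlalchemy import create_engine, text

engine = create_engine(
    f"postgresql+psycopg2://{os.environ['DB_USER']}:{os.environ['DB_PASSWORD']}"
    f"@{os.environ.get('DB_HOST', 'localhost')}/{os.environ['DB_NAME']}",
    pool_size=5,         # steady-state connections shared across agent calls
    max_overflow=10,     # temporary headroom for traffic spikes
    pool_pre_ping=True,  # drop dead connections before handing them to an agent
)

with engine.connect() as conn:
    print(conn.execute(text("SELECT 1")).scalar())
```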

Schema-Aware Query Generation

One of the core advantages of this toolbox is its ability to introspect database schemas and make them available to LLMs or agents. This enables safe, schema-validated querying. By mapping out the structure of tables and their relationships, the agent gains situational awareness and can avoid generating invalid or unsafe queries.

This schema grounding also enhances the performance of natural language to SQL pipelines by improving query generation reliability and reducing hallucinations.
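To make schema grounding concrete, here is a simplified, generic sketch (not the toolbox's internal implementation) that uses SQLAlchemy introspection to build a compact schema summary an agent could place in its prompt before drafting SQL. The connection string is a placeholder.

```python
# Simplified illustration of schema grounding: introspect a database with
# SQLAlchemy and build a compact schema summary for an LLM prompt.
from sqlalchemy import create_engine, inspect

engine = create_engine("postgresql+psycopg2://user:password@localhost/appdb")  # placeholder DSN
inspector = inspect(engine)

schema_lines = []
for table in inspector.get_table_names():
    columns = ", ".join(
        f"{col['name']} {col['type']}" for col in inspector.get_columns(table)
    )
    schema_lines.append(f"TABLE {table}({columns})")

# This text block gives the model situational awareness of the available
# tables and columns before it generates a query.
schema_context = "\n".join(schema_lines)
print(schema_context)
```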

Use Cases

The MCP Toolbox for Databases supports a broad range of applications:

Customer service agents that retrieve user information from relational databases in real time

BI assistants that answer business metric questions by querying analytical databases

DevOps bots that monitor database status and report anomalies

Autonomous data agents for ETL, reporting, and compliance verification tasks

Because it’s built on open protocols and popular Python libraries, the toolbox is easily extensible and fits into existing LLM-agent workflows.

Fully Open Source

The module is part of the fully open-source GenAI Toolbox released under the Apache 2.0 license. It builds on established packages such as sqlalchemy to ensure compatibility with a wide range of databases and deployment environments. Developers can fork, customize, or contribute to the module as needed.

Conclusion

The MCP Toolbox for Databases represents an important step in operationalizing AI agents in data-rich environments. By removing integration overhead and embedding best practices for security and performance, Google is enabling developers to bring AI to the heart of enterprise data systems. The combination of structured interfaces, lightweight setup, and open-source flexibility makes this release a compelling foundation for building production-ready AI agents with reliable database access.


Implementing a Tool-Enabled Multi-Agent Workflow with Python, OpenAI API, and PrimisAI Nexus

In this advanced tutorial, we aim to build a multi-agent task automation system using the PrimisAI Nexus framework, which is fully integrated with the OpenAI API. Our primary objective is to demonstrate how hierarchical supervision, intelligent tool utilization, and structured outputs can facilitate the coordination of multiple AI agents to perform complex tasks, ranging from planning and development to quality assurance and data analysis. As we walk through each phase, we don’t just build individual agents; we architect a collaborative ecosystem where each agent has a clear role, responsibilities, and smart tools to accomplish the task.

!pip install primisai openai nest-asyncio

import os
import nest_asyncio
from primisai.nexus.core import AI, Agent, Supervisor
from primisai.nexus.utils.debugger import Debugger
import json

nest_asyncio.apply()

We begin by installing the core dependencies: Primisai for agent orchestration, OpenAI for LLM access, and nest_asyncio to handle Colab’s event loop quirks. After applying nest_asyncio, we ensure the notebook is ready to execute asynchronous tasks seamlessly, a key requirement for multi-agent execution.

print("PrimisAI Nexus Advanced Tutorial with OpenAI API")
print("=" * 55)

os.environ["OPENAI_API_KEY"] = "Use Your Own API Key Here"

# llm_config = {
#     "api_key": os.environ["OPENAI_API_KEY"],
#     "model": "gpt-4o-mini",
#     "base_url": "https://api.openai.com/v1",
#     "temperature": 0.7
# }

llm_config = {
    "api_key": os.environ["OPENAI_API_KEY"],
    "model": "gpt-3.5-turbo",
    "base_url": "https://api.openai.com/v1",
    "temperature": 0.7
}

print("API Configuration:")
print(f"• Model: {llm_config['model']}")
print(f"• Base URL: {llm_config['base_url']}")
print("• Note: OpenAI has limited free tokens through April 2025")
print("• Alternative: Consider Puter.js for unlimited free access")

To power our agents, we connect to OpenAI’s models, starting with gpt-3.5-turbo for cost-efficient tasks. We store our API key in environment variables and construct a configuration dictionary specifying the model, temperature, and base URL. This section allows us to flexibly switch between models, such as gpt-4o-mini or gpt-4o, depending on task complexity and cost.

code_schema = {
    "type": "object",
    "properties": {
        "description": {"type": "string", "description": "Code explanation"},
        "code": {"type": "string", "description": "Python code implementation"},
        "language": {"type": "string", "description": "Programming language"},
        "complexity": {"type": "string", "enum": ["beginner", "intermediate", "advanced"]},
        "test_cases": {"type": "array", "items": {"type": "string"}, "description": "Example usage"}
    },
    "required": ["description", "code", "language"]
}

analysis_schema = {
    "type": "object",
    "properties": {
        "summary": {"type": "string", "description": "Brief analysis summary"},
        "insights": {"type": "array", "items": {"type": "string"}, "description": "Key insights"},
        "recommendations": {"type": "array", "items": {"type": "string"}, "description": "Action items"},
        "confidence": {"type": "number", "minimum": 0, "maximum": 1},
        "methodology": {"type": "string", "description": "Analysis approach used"}
    },
    "required": ["summary", "insights", "confidence"]
}

planning_schema = {
    "type": "object",
    "properties": {
        "tasks": {"type": "array", "items": {"type": "string"}, "description": "List of tasks to complete"},
        "priority": {"type": "string", "enum": ["low", "medium", "high"]},
        "estimated_time": {"type": "string", "description": "Time estimate"},
        "dependencies": {"type": "array", "items": {"type": "string"}, "description": "Task dependencies"}
    },
    "required": ["tasks", "priority"]
}

We define JSON schemas for three agent types: CodeWriter, Data Analyst, and Project Planner. These schemas enforce structure in the agent’s responses, making the output machine-readable and predictable. It helps us ensure that the system returns consistent data, such as code blocks, insights, or project timelines, even when different LLMs are behind the scenes.

def calculate_metrics(data_str):
    """Calculate comprehensive statistics for numerical data"""
    try:
        data = json.loads(data_str) if isinstance(data_str, str) else data_str
        if isinstance(data, list) and all(isinstance(x, (int, float)) for x in data):
            import statistics
            return {
                "mean": statistics.mean(data),
                "median": statistics.median(data),
                "mode": statistics.mode(data) if len(set(data)) < len(data) else "No mode",
                "std_dev": statistics.stdev(data) if len(data) > 1 else 0,
                "max": max(data),
                "min": min(data),
                "count": len(data),
                "sum": sum(data)
            }
        return {"error": "Invalid data format - expecting array of numbers"}
    except Exception as e:
        return {"error": f"Could not parse data: {str(e)}"}

def validate_code(code):
    """Advanced code validation with syntax and basic security checks"""
    try:
        dangerous_imports = ['os', 'subprocess', 'eval', 'exec', '__import__']
        security_warnings = []

        for danger in dangerous_imports:
            if danger in code:
                security_warnings.append(f"Potentially dangerous: {danger}")

        compile(code, '<string>', 'exec')

        return {
            "valid": True,
            "message": "Code syntax is valid",
            "security_warnings": security_warnings,
            "lines": len(code.split('\n'))
        }
    except SyntaxError as e:
        return {
            "valid": False,
            "message": f"Syntax error: {e}",
            "line": getattr(e, 'lineno', 'unknown'),
            "security_warnings": []
        }

def search_documentation(query):
    """Simulate searching documentation (placeholder function)"""
    docs = {
        "python": "Python is a high-level programming language",
        "list": "Lists are ordered, mutable collections in Python",
        "function": "Functions are reusable blocks of code",
        "class": "Classes define objects with attributes and methods"
    }

    results = []
    for key, value in docs.items():
        if query.lower() in key.lower():
            results.append(f"{key}: {value}")

    return {
        "query": query,
        "results": results if results else ["No documentation found"],
        "total_results": len(results)
    }

Next, we add custom tools that agents could call, such as calculate_metrics for statistical summaries, validate_code for syntax and security checks, and search_documentation for simulated programming help. These tools extend the agents’ abilities, turning them from simple chatbots into interactive, utility-driven workers capable of autonomous reasoning and validation.

print("\nSetting up Multi-Agent Hierarchy with OpenAI")

main_supervisor = Supervisor(
    name="ProjectManager",
    llm_config=llm_config,
    system_message="You are a senior project manager coordinating development and analysis tasks. Delegate appropriately, provide clear summaries, and ensure quality delivery. Always consider time estimates and dependencies."
)

dev_supervisor = Supervisor(
    name="DevManager",
    llm_config=llm_config,
    is_assistant=True,
    system_message="You manage development tasks. Coordinate between coding, testing, and code review. Ensure best practices and security."
)

analysis_supervisor = Supervisor(
    name="AnalysisManager",
    llm_config=llm_config,
    is_assistant=True,
    system_message="You manage data analysis and research tasks. Ensure thorough analysis, statistical rigor, and actionable insights."
)

qa_supervisor = Supervisor(
    name="QAManager",
    llm_config=llm_config,
    is_assistant=True,
    system_message="You manage quality assurance and testing. Ensure thorough validation and documentation."
)

To simulate a real-world management structure, we create a multi-tiered hierarchy. A ProjectManager serves as the root supervisor, overseeing three assistant supervisors (DevManager, AnalysisManager, and QAManager), each in charge of domain-specific agents. This modular hierarchy allows tasks to flow down from high-level strategy to granular execution.

code_agent = Agent(
    name="CodeWriter",
    llm_config=llm_config,
    system_message="You are an expert Python developer. Write clean, efficient, well-documented code with proper error handling. Always include test cases and follow PEP 8 standards.",
    output_schema=code_schema,
    tools=[{
        "metadata": {
            "function": {
                "name": "validate_code",
                "description": "Validates Python code syntax and checks for security issues",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "code": {"type": "string", "description": "Python code to validate"}
                    },
                    "required": ["code"]
                }
            }
        },
        "tool": validate_code
    }, {
        "metadata": {
            "function": {
                "name": "search_documentation",
                "description": "Search for programming documentation and examples",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "query": {"type": "string", "description": "Documentation topic to search for"}
                    },
                    "required": ["query"]
                }
            }
        },
        "tool": search_documentation
    }],
    use_tools=True
)

review_agent = Agent(
    name="CodeReviewer",
    llm_config=llm_config,
    system_message="You are a senior code reviewer. Analyze code for best practices, efficiency, security, maintainability, and potential issues. Provide constructive feedback and suggestions.",
    keep_history=True,
    tools=[{
        "metadata": {
            "function": {
                "name": "validate_code",
                "description": "Validates code syntax and security",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "code": {"type": "string", "description": "Code to validate"}
                    },
                    "required": ["code"]
                }
            }
        },
        "tool": validate_code
    }],
    use_tools=True
)

analyst_agent = Agent(
    name="DataAnalyst",
    llm_config=llm_config,
    system_message="You are a data scientist specializing in statistical analysis and insights generation. Provide thorough analysis with confidence metrics and actionable recommendations.",
    output_schema=analysis_schema,
    tools=[{
        "metadata": {
            "function": {
                "name": "calculate_metrics",
                "description": "Calculates comprehensive statistics for numerical data",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "data_str": {"type": "string", "description": "JSON string of numerical data array"}
                    },
                    "required": ["data_str"]
                }
            }
        },
        "tool": calculate_metrics
    }],
    use_tools=True
)

planner_agent = Agent(
    name="ProjectPlanner",
    llm_config=llm_config,
    system_message="You are a project planning specialist. Break down complex projects into manageable tasks with realistic time estimates and clear dependencies.",
    output_schema=planning_schema
)

tester_agent = Agent(
    name="QATester",
    llm_config=llm_config,
    system_message="You are a QA specialist focused on comprehensive testing strategies, edge cases, and quality assurance.",
    tools=[{
        "metadata": {
            "function": {
                "name": "validate_code",
                "description": "Validates code for testing",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "code": {"type": "string", "description": "Code to test"}
                    },
                    "required": ["code"]
                }
            }
        },
        "tool": validate_code
    }],
    use_tools=True
)

We then build a diverse set of specialized agents: CodeWriter for generating Python code, CodeReviewer for reviewing logic and security, DataAnalyst for performing structured data analysis, ProjectPlanner for task breakdown, and QATester for quality checks. Each agent has domain-specific tools, output schemas, and system instructions tailored to their role.

dev_supervisor.register_agent(code_agent)
dev_supervisor.register_agent(review_agent)
analysis_supervisor.register_agent(analyst_agent)
qa_supervisor.register_agent(tester_agent)

main_supervisor.register_agent(dev_supervisor)
main_supervisor.register_agent(analysis_supervisor)
main_supervisor.register_agent(qa_supervisor)
main_supervisor.register_agent(planner_agent)

All agents are registered under their respective supervisors, and the assistant supervisors are, in turn, registered with the main supervisor. This setup creates a fully linked agent ecosystem, where instructions could cascade from the top-level agent to any specialist agent in the network.

print("\nAgent Hierarchy:")
main_supervisor.display_agent_graph()

print("\nTesting Full Multi-Agent Communication")
print("-" * 45)

try:
    test_response = main_supervisor.chat("Hello! Please introduce your team and explain how you coordinate complex projects.")
    print("Supervisor communication test successful!")
    print(f"Response preview: {test_response[:200]}...")
except Exception as e:
    print(f"Supervisor test failed: {str(e)}")
    print("Falling back to direct agent testing...")
We visualize the entire hierarchy using display_agent_graph() to confirm our structure. It offers a clear view of how each agent is connected within the broader task management flow, a helpful diagnostic before deployment.

print("\nComplex Multi-Agent Task Execution")
print("-" * 40)

complex_task = """Create a Python function that implements a binary search algorithm,
have it reviewed for optimization, tested thoroughly, and provide a project plan
for integrating it into a larger search system."""

print(f"Complex Task: {complex_task}")

try:
    complex_response = main_supervisor.chat(complex_task)
    print("Complex task completed")
    print(f"Response: {complex_response[:300]}...")
except Exception as e:
    print(f"Complex task failed: {str(e)}")
We give the full system a real-world task: create a binary search function, review it, test it, and plan its integration into a larger project. The ProjectManager seamlessly coordinates agents across development, QA, and planning, demonstrating the true power of hierarchical, tool-driven agent orchestration.

print("\nTool Integration & Structured Outputs")
print("-" * 43)

print("Testing Code Agent with tools...")
try:
    code_response = code_agent.chat("Create a function to calculate fibonacci numbers with memoization")
    print("Code Agent with tools: Working")
    print(f"Response type: {type(code_response)}")

    if isinstance(code_response, str) and code_response.strip().startswith('{'):
        code_data = json.loads(code_response)
        print(f" - Description: {code_data.get('description', 'N/A')[:50]}...")
        print(f" - Language: {code_data.get('language', 'N/A')}")
        print(f" - Complexity: {code_data.get('complexity', 'N/A')}")
    else:
        print(f" - Raw response: {code_response[:100]}...")

except Exception as e:
    print(f"Code Agent error: {str(e)}")

print("\nTesting Analyst Agent with tools...")
try:
    analysis_response = analyst_agent.chat("Analyze this sales data: [100, 150, 120, 180, 200, 175, 160, 190, 220, 185]. What trends do you see?")
    print("Analyst Agent with tools: Working")

    if isinstance(analysis_response, str) and analysis_response.strip().startswith('{'):
        analysis_data = json.loads(analysis_response)
        print(f" - Summary: {analysis_data.get('summary', 'N/A')[:50]}...")
        print(f" - Confidence: {analysis_data.get('confidence', 'N/A')}")
        print(f" - Insights count: {len(analysis_data.get('insights', []))}")
    else:
        print(f" - Raw response: {analysis_response[:100]}...")

except Exception as e:
    print(f"Analyst Agent error: {str(e)}")

We directly test the capabilities of two specialized agents using real prompts. We first ask the CodeWriter agent to generate a Fibonacci function with memoization and validate that it returns structured output containing a code description, language, and complexity level. Then, we evaluate the DataAnalyst agent by feeding it sample sales data to extract trends.

print("\nManual Tool Usage")
print("-" * 22)

# Test all tools manually
sample_data = "[95, 87, 92, 88, 91, 89, 94, 90, 86, 93]"
metrics_result = calculate_metrics(sample_data)
print(f"Statistics for {sample_data}:")
for key, value in metrics_result.items():
    print(f" {key}: {value}")

print("\nCode validation test:")
test_code = """
def binary_search(arr, target):
    left, right = 0, len(arr) - 1
    while left <= right:
        mid = (left + right) // 2
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            left = mid + 1
        else:
            right = mid - 1
    return -1
"""
validation_result = validate_code(test_code)
print(f"Validation result: {validation_result}")

print("\nDocumentation search test:")
doc_result = search_documentation("python function")
print(f"Search results: {doc_result}")

We step outside the agent framework to test each tool directly. First, we use the calculate_metrics tool on a dataset of ten numbers, confirming it correctly returned statistics such as mean, median, mode, and standard deviation. Next, we run the validate_code tool on a sample binary search function, which confirms both syntactic correctness and flags no security warnings. Finally, we test the search_documentation tool with the query “python function” and receive relevant documentation snippets, verifying its ability to efficiently simulate contextual lookup.

print("\nAdvanced Multi-Agent Workflow")
print("-" * 35)

workflow_stages = [
    ("Planning", "Create a project plan for building a web scraper for news articles"),
    ("Development", "Implement the web scraper with error handling and rate limiting"),
    ("Review", "Review the web scraper code for security and efficiency"),
    ("Testing", "Create comprehensive test cases for the web scraper"),
    ("Analysis", "Analyze sample scraped data: [45, 67, 23, 89, 12, 56, 78, 34, 91, 43]")
]

workflow_results = {}

for stage, task in workflow_stages:
    print(f"\n{stage} Stage: {task}")
    try:
        if stage == "Planning":
            response = planner_agent.chat(task)
        elif stage == "Development":
            response = code_agent.chat(task)
        elif stage == "Review":
            response = review_agent.chat(task)
        elif stage == "Testing":
            response = tester_agent.chat(task)
        elif stage == "Analysis":
            response = analyst_agent.chat(task)

        workflow_results[stage] = response
        print(f"{stage} completed: {response[:80]}...")

    except Exception as e:
        print(f"{stage} failed: {str(e)}")
        workflow_results[stage] = f"Error: {str(e)}"

We simulate a five-stage project lifecycle: planning, development, review, testing, and analysis. Each task is passed to the most relevant agent, and responses are collected to evaluate performance. This demonstrates the framework’s capability to manage end-to-end workflows without manual intervention.

print("\nSystem Monitoring & Performance")
print("-" * 37)

debugger = Debugger(name="OpenAITutorialDebugger")
debugger.log("Advanced OpenAI tutorial execution completed successfully")

print(f"Main Supervisor ID: {main_supervisor.workflow_id}")

We activate the Debugger tool to track the performance of our session and log system events. We also print the main supervisor’s workflow_id as a traceable identifier, useful when managing multiple workflows in production.

In conclusion, we have successfully built a fully automated, OpenAI-compatible multi-agent system using PrimisAI Nexus. Each agent operates with clarity, precision, and autonomy, whether writing code, validating logic, analyzing data, or breaking down complex workflows. Our hierarchical structure allows for seamless task delegation and modular scalability. PrimisAI Nexus framework establishes a robust foundation for automating real-world tasks, whether in software development, research, planning, or data operations, through intelligent collaboration between specialized agents.


How INRIX accelerates transportation planning with Amazon Bedrock

This post is co-written with Shashank Saraogi, Nat Gale, and Durran Kelly from INRIX.
The complexity of modern traffic management extends far beyond mere road monitoring, encompassing massive amounts of data collected worldwide from connected cars, mobile devices, roadway sensors, and major event monitoring systems. For transportation authorities managing urban, suburban, and rural traffic flow, the challenge lies in effectively processing and acting upon this vast network of information. The task requires balancing immediate operational needs, such as real-time traffic redirection during incidents, with strategic long-term planning for improved mobility and safety.
Traditionally, analyzing these complex data patterns and producing actionable insights has been a resource-intensive process requiring extensive collaboration. With recent advances in generative AI, there is an opportunity to transform how we process, understand, and act upon transportation data, enabling more efficient and responsive traffic management systems.
In this post, we partnered with Amazon Web Services (AWS) customer INRIX to demonstrate how Amazon Bedrock can be used to determine the best countermeasures for specific city locations using rich transportation data and how such countermeasures can be automatically visualized in street view images. This approach allows for significant planning acceleration compared to traditional approaches using conceptual drawings.
INRIX pioneered the use of GPS data from connected vehicles for transportation intelligence. For over 20 years, INRIX has been a leader for probe-based connected vehicle and device data and insights, powering automotive, enterprise, and public sector use cases. INRIX’s products range from tickerized datasets that inform investment decisions for the financial services sector to digital twins for the public rights-of-way in the cities of Philadelphia and San Francisco. INRIX was the first company to develop a crowd-sourced traffic network, and they continue to lead in real-time mobility operations.
In June 2024, the State of California’s Department of Transportation (Caltrans) selected INRIX for a proof of concept for a generative AI-powered solution to improve safety for vulnerable road users (VRUs). The problem statement sought to harness the combination of Caltrans’ asset, crash, and points-of-interest (POI) data and INRIX’s 50 petabyte (PB) data lake to anticipate high-risk locations and quickly generate empirically validated safety measures to mitigate the potential for crashes. Trained on real-time and historical data and industry research and manuals, the solution provides a new systemic, safety-based methodology for risk assessment, location prioritization, and project implementation.
Solution overview
INRIX announced INRIX Compass in November 2023. INRIX Compass is an application that harnesses generative AI and INRIX’s 50 PB data lake to solve transportation challenges. This solution uses INRIX Compass countermeasures as the input, AWS serverless architecture, and Amazon Nova Canvas as the image visualizer. Key components include:

Countermeasures generation:

INRIX Compass generates the countermeasures for a selected location
Amazon API Gateway and Amazon Elastic Kubernetes Service (Amazon EKS) manage API requests and responses
Amazon Bedrock Knowledge Bases and Anthropic’s Claude Models provide Retrieval Augmented Generation (RAG) implementation

Image visualization

API Gateway and AWS Lambda process requests from API Gateway and Amazon Bedrock
Amazon Bedrock with model access to Amazon Nova Canvas provide image generation and in-painting

The following diagram shows the architecture of INRIX Compass.

INRIX Compass for countermeasures
By using INRIX Compass, users can ask natural language queries such as "Where are the top five locations with the highest risk for vulnerable road users?" and "Can you recommend a suite of proven safety countermeasures at each of these locations?" Furthermore, users can probe deeper into the roadway characteristics that contribute to risk factors, and find similar locations in the roadway network that meet those conditions. Behind the scenes, Compass AI uses RAG and Amazon Bedrock-powered foundation models (FMs) to query the roadway network to identify and prioritize locations with systemic risk factors and anomalous safety patterns. The solution provides prioritized recommendations for operational and design solutions and countermeasures based on industry knowledge.
The following image shows the interface of INRIX Compass.

Image visualization for countermeasures
The generation of countermeasure suggestions represents the initial phase in transportation planning. Image visualization requires the crucial next step of preparing conceptual drawings. This process has traditionally been time-consuming due to the involvement of multiple specialized teams, including:

Transportation engineers who assess technical feasibility and safety standards
Urban planners who verify alignment with city development goals
Landscape architects who integrate environmental and aesthetic elements
CAD or visualization specialists who create detailed technical drawings
Safety analysts who evaluate the potential impact on road safety
Public works departments who oversee implementation feasibility
Traffic operations teams who assess impact on traffic flow and management

These teams work collaboratively, creating and iteratively refining various visualizations based on feedback from urban designers and other stakeholders. Each iteration cycle typically involves multiple rounds of reviews, adjustments, and approvals, often extending the timeline significantly. The complexity is further amplified by city-specific rules and design requirements, which often necessitate significant customization. Additionally, local regulations, environmental considerations, and community feedback must be incorporated into the design process. Consequently, this lengthy and costly process frequently leads to delays in implementing safety countermeasures. To address this challenge, INRIX has pioneered an innovative approach to the visualization phase by using generative AI technology. This prototyped solution enables rapid iteration of conceptual drawings that can be efficiently reviewed by various teams, potentially reducing the design cycle from weeks to days. Moreover, the system incorporates a few-shot learning approach with reference images and carefully crafted prompts, allowing for seamless integration of city-specific requirements into the generated outputs. This approach not only accelerates the design process but also supports consistency across different projects while maintaining compliance with local standards.
The following image shows the congestion insights by INRIX Compass.

Amazon Nova Canvas for conceptual visualizations
INRIX developed and prototyped this solution using Amazon Nova models. Amazon Nova Canvas delivers advanced image processing through text-to-image generation and image-to-image transformation capabilities. The model provides sophisticated controls for adjusting color schemes and manipulating layouts to achieve desired visual outcomes. To promote responsible AI implementation, Amazon Nova Canvas incorporates built-in safety measures, including watermarking and content moderation systems.
The model supports a comprehensive range of image editing operations. These operations encompass basic image generation, object removal from existing images, object replacement within scenes, creation of image variations, and modification of image backgrounds. This versatility makes Amazon Nova Canvas suitable for a wide range of professional applications requiring sophisticated image editing.
The following sample images show an example of countermeasures visualization.

In-painting implementation in Compass AI
Amazon Nova Canvas integrates with INRIX Compass’s existing natural language analytics capabilities. The original Compass system generated text-based countermeasure recommendations based on:

Historical transportation data analysis
Current environmental conditions
User-specified requirements

The INRIX Compass visualization feature specifically uses the image generation and in-painting capabilities of Amazon Nova Canvas. In-painting enables object replacement through two distinct approaches:

A binary mask precisely defines the areas targeted for replacement.
Text prompts identify objects for replacement, allowing the model to interpret and modify the specified elements while maintaining visual coherence with the surrounding image context. This functionality provides seamless integration of new elements while preserving the overall image composition and contextual relevance. The developed interface accommodates both image generation and in-painting approaches, providing comprehensive image editing capabilities.

The implementation follows a two-stage process for visualizing transportation countermeasures. Initially, the system employs image generation functionality to create street-view representations corresponding to specific longitude and latitude coordinates where interventions are proposed. Following the initial image creation, the in-painting capability enables precise placement of countermeasures within the generated street view scene. This sequential approach provides accurate visualization of proposed modifications within the actual geographical context.
An Amazon Bedrock API facilitates image editing and generation through the Amazon Nova Canvas model. The responses contain the generated or modified images in base64 format, which can be decoded and processed for further use in the application. The generative AI capabilities of Amazon Bedrock enable rapid iteration and simultaneous visualization of multiple countermeasures within a single image. RAG implementation can further extend the pipeline’s capabilities by incorporating county-specific regulations, standardized design patterns, and contextual requirements. The integration of these technologies significantly streamlines the countermeasure deployment workflow. Traditional manual visualization processes that previously required extensive time and resources can now be executed efficiently through automated generation and modification. This automation delivers substantial improvements in both time-to-deployment and cost-effectiveness.
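For readers who want a feel for the mechanics, the following condensed sketch shows what an in-painting request to Amazon Nova Canvas through the Amazon Bedrock InvokeModel API can look like. The field names follow the Nova Canvas image-generation request schema, but the Region, prompts, and file names are placeholders rather than INRIX's production code; verify the details against the Amazon Nova documentation.

```python
# Sketch of a Nova Canvas in-painting call via the Bedrock InvokeModel API.
# Prompts, file names, and the Region are placeholders for illustration.
import base64
import json

import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

# Load the street-view image that will receive the countermeasure.
with open("street_view.png", "rb") as f:
    source_image = base64.b64encode(f.read()).decode("utf-8")

request = {
    "taskType": "INPAINTING",
    "inPaintingParams": {
        "image": source_image,
        # A mask prompt identifies what to replace; a binary mask image can be
        # supplied instead for pixel-precise control.
        "maskPrompt": "the curbside parking lane",
        "text": "a protected bike lane with green paint and flexible bollards",
    },
    "imageGenerationConfig": {"numberOfImages": 1, "quality": "standard", "cfgScale": 8.0},
}

response = bedrock.invoke_model(
    modelId="amazon.nova-canvas-v1:0",
    body=json.dumps(request),
)
payload = json.loads(response["body"].read())

# The service returns images as base64 strings, which we decode and save.
with open("street_view_with_countermeasure.png", "wb") as f:
    f.write(base64.b64decode(payload["images"][0]))
```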
Conclusion
The partnership between INRIX and AWS showcases the transformative potential of AI in solving complex transportation challenges. By using Amazon Bedrock FMs, INRIX has turned their massive 50 PB data lake into actionable insights through effective visualization solutions. This post highlighted a single specific transportation use case, but Amazon Bedrock and Amazon Nova power a wide spectrum of applications, from text generation to video creation. The combination of extensive data and advanced AI capabilities continues to pave the way for smarter, more efficient transportation systems worldwide.
For more information, check out the documentation for Amazon Nova Foundation Models, Amazon Bedrock, and INRIX Compass.

About the authors
Arun is a Senior Solutions Architect at AWS, supporting enterprise customers in the Pacific Northwest. He’s passionate about solving business and technology challenges as an AWS customer advocate, with his recent interest being AI strategy. When not at work, Arun enjoys listening to podcasts, going for short trail runs, and spending quality time with his family.
Alicja Kwasniewska, PhD, is an AI leader driving generative AI innovations in enterprise solutions and decision intelligence for customer engagements in North America, advertisement and marketing verticals at AWS. She is recognized among the top 10 women in AI and 100 women in data science. Alicja published in more than 40 peer-reviewed publications. She also serves as a reviewer for top-tier conferences, including ICML, NeurIPS, and ICCV. She advises organizations on AI adoption, bridging research and industry to accelerate real-world AI applications.
Shashank is the VP of Engineering at INRIX, where he leads multiple verticals, including generative AI and traffic. He is passionate about using technology to make roads safer for drivers, bikers, and pedestrians every day. Prior to working at INRIX, he held engineering leadership roles at Amazon and Lyft. Shashank brings deep experience in building impactful products and high-performing teams at scale. Outside of work, he enjoys traveling, listening to music, and spending time with his family.
Nat Gale is the Head of Product at INRIX, where he manages the Safety and Traffic product verticals. Nat leads the development of data products and software that help transportation professionals make smart, more informed decisions. He previously ran the City of Los Angeles’ Vision Zero program and was the Director of Capital Projects and Operations for the City of Hartford, CT.
Durran is a Lead Software Engineer at INRIX, where he designs scalable backend systems and mentors engineers across multiple product lines. With over a decade of experience in software development, he specializes in distributed systems, generative AI, and cloud infrastructure. Durran is passionate about writing clean, maintainable code and sharing best practices with the developer community. Outside of work, he enjoys spending quality time with his family and deepening his Japanese language skills.

Qwen3 family of reasoning models now available in Amazon Bedrock Marketplace and Amazon SageMaker JumpStart

Today, we are excited to announce that Qwen3, the latest generation of large language models (LLMs) in the Qwen family, is available through Amazon Bedrock Marketplace and Amazon SageMaker JumpStart. With this launch, you can deploy the Qwen3 models—available in 0.6B, 4B, 8B, and 32B parameter sizes—to build, experiment, and responsibly scale your generative AI applications on AWS.
In this post, we demonstrate how to get started with Qwen3 on Amazon Bedrock Marketplace and SageMaker JumpStart. You can follow similar steps to deploy the other model sizes as well.
Solution overview
Qwen3 is the latest generation of LLMs in the Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models. Qwen3 delivers groundbreaking advancements in reasoning, instruction-following, agent capabilities, and multilingual support, with the following key features:

Unique support of seamless switching between thinking mode and non-thinking mode within a single model, providing optimal performance across various scenarios.
Significantly enhanced in its reasoning capabilities, surpassing previous QwQ (in thinking mode) and Qwen2.5 instruct models (in non-thinking mode) on mathematics, code generation, and commonsense logical reasoning.
Good human preference alignment, excelling in creative writing, role-playing, multi-turn dialogues, and instruction following, to deliver a more natural, engaging, and immersive conversational experience.
Expertise in agent capabilities, enabling precise integration with external tools in both thinking and non-thinking modes and achieving leading performance among open source models in complex agent-based tasks.
Support for over 100 languages and dialects with strong capabilities for multilingual instruction following and translation.

Prerequisites
To deploy Qwen3 models, make sure you have access to the recommended instance types based on the model size. You can find these instance recommendations on Amazon Bedrock Marketplace or the SageMaker JumpStart console. To verify you have the necessary resources, complete the following steps:

Open the Service Quotas console.
Under AWS Services, select Amazon SageMaker.
Check that you have sufficient quota for the required instance type for endpoint deployment.
Make sure at least one of these instance types is available in your target AWS Region.

If needed, request a quota increase and contact your AWS account team for support.
Deploy Qwen3 in Amazon Bedrock Marketplace
Amazon Bedrock Marketplace gives you access to over 100 popular, emerging, and specialized foundation models (FMs) through Amazon Bedrock. To access Qwen3 in Amazon Bedrock, complete the following steps:

On the Amazon Bedrock console, in the navigation pane under Foundation models, choose Model catalog.
Filter for Hugging Face as a provider and choose a Qwen3 model. For this example, we use the Qwen3-32B model.

The model detail page provides essential information about the model’s capabilities, pricing structure, and implementation guidelines. You can find detailed usage instructions, including sample API calls and code snippets for integration.
The page also includes deployment options and licensing information to help you get started with Qwen3-32B in your applications.

To begin using Qwen3-32B, choose Deploy.

You will be prompted to configure the deployment details for Qwen3-32B. The model ID will be pre-populated.

For Endpoint name, enter an endpoint name (between 1–50 alphanumeric characters).
For Number of instances, enter a number of instances (between 1–100).
For Instance type, choose your instance type. For optimal performance with Qwen3-32B, a GPU-based instance type like ml.g5.12xlarge is recommended.
To deploy the model, choose Deploy.

When the deployment is complete, you can test Qwen3-32B’s capabilities directly in the Amazon Bedrock playground.

Choose Open in playground to access an interactive interface where you can experiment with different prompts and adjust model parameters like temperature and maximum length.

This is an excellent way to explore the model's reasoning and text generation abilities before integrating it into your applications. The playground provides immediate feedback, helping you understand how the model responds to various inputs and letting you fine-tune your prompts for optimal results. You can quickly test the model in the playground through the UI. However, to invoke the deployed model programmatically with any Amazon Bedrock APIs, you must have the endpoint Amazon Resource Name (ARN).
Enable reasoning and non-reasoning responses with Converse API
The following code shows how to turn reasoning on and off with Qwen3 models using the Converse API, depending on your use case. By default, reasoning is left on for Qwen3 models, but you can streamline interactions by using the /no_think command within your prompt. When you add this to the end of your query, reasoning is turned off and the models will provide just the direct answer. This is particularly useful when you need quick information without explanations, are familiar with the topic, or want to maintain a faster conversational flow. At the time of writing, the Converse API doesn’t support tool use for Qwen3 models. Refer to the Invoke_Model API example later in this post to learn how to use reasoning and tools in the same completion.

import boto3
from botocore.exceptions import ClientError

# Create a Bedrock Runtime client in the AWS Region you want to use.
client = boto3.client("bedrock-runtime", region_name="us-west-2")

# Configuration
model_id = ""  # Replace with Bedrock Marketplace endpoint arn

# Start a conversation with the user message.
user_message = "hello, what is 1+1 /no_think"  # remove /no_think to leave default reasoning on
conversation = [
    {
        "role": "user",
        "content": [{"text": user_message}],
    }
]

try:
    # Send the message to the model, using a basic inference configuration.
    response = client.converse(
        modelId=model_id,
        messages=conversation,
        inferenceConfig={"maxTokens": 512, "temperature": 0.5, "topP": 0.9},
    )

    # Extract and print the response text.
    # response_text = response["output"]["message"]["content"][0]["text"]
    # reasoning_content = response["output"]["message"]["reasoning_content"][0]["text"]
    # print(response_text, reasoning_content)
    print(response)

except (ClientError, Exception) as e:
    print(f"ERROR: Can't invoke '{model_id}'. Reason: {e}")
    exit(1)

The following is a response using the Converse API, without default thinking:

{'ResponseMetadata': {'RequestId': 'f7f3953a-5747-4866-9075-fd4bd1cf49c4', 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Tue, 17 Jun 2025 18:34:47 GMT', 'content-type': 'application/json', 'content-length': '282', 'connection': 'keep-alive', 'x-amzn-requestid': 'f7f3953a-5747-4866-9075-fd4bd1cf49c4'}, 'RetryAttempts': 0}, 'output': {'message': {'role': 'assistant', 'content': [{'text': '\n\nHello! The result of 1 + 1 is **2**. 😊'}, {'reasoningContent': {'reasoningText': {'text': '\n\n'}}}]}}, 'stopReason': 'end_turn', 'usage': {'inputTokens': 20, 'outputTokens': 22, 'totalTokens': 42}, 'metrics': {'latencyMs': 1125}}

The following is an example with default thinking on; the <think> tokens are automatically parsed into the reasoningContent field for the Converse API:

{'ResponseMetadata': {'RequestId': 'b6d2ebbe-89da-4edc-9a3a-7cb3e7ecf066', 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Tue, 17 Jun 2025 18:32:28 GMT', 'content-type': 'application/json', 'content-length': '1019', 'connection': 'keep-alive', 'x-amzn-requestid': 'b6d2ebbe-89da-4edc-9a3a-7cb3e7ecf066'}, 'RetryAttempts': 0}, 'output': {'message': {'role': 'assistant', 'content': [{'text': '\n\nHello! The sum of 1 + 1 is **2**. Let me know if you have any other questions or need further clarification! 😊'}, {'reasoningContent': {'reasoningText': {'text': '\nOkay, the user asked "hello, what is 1+1". Let me start by acknowledging their greeting. They might just be testing the water or actually need help with a basic math problem. Since it\'s 1+1, it\'s a very simple question, but I should make sure to answer clearly. Maybe they\'re a child learning math for the first time, or someone who\'s not confident in their math skills. I should provide the answer in a friendly and encouraging way. Let me confirm that 1+1 equals 2, and maybe add a brief explanation to reinforce their understanding. I can also offer further assistance in case they have more questions. Keeping it conversational and approachable is key here.\n'}}}]}}, 'stopReason': 'end_turn', 'usage': {'inputTokens': 16, 'outputTokens': 182, 'totalTokens': 198}, 'metrics': {'latencyMs': 7805}}

Perform reasoning and function calls in the same completion using the Invoke_Model API
With Qwen3, you can stream an explicit trace and the exact JSON tool call in the same completion. Up until now, reasoning models have forced the choice to either show the chain of thought or call tools deterministically. The following code shows an example:

import json

# Reuses the Bedrock Runtime client and model_id defined in the previous example.
messages = json.dumps({
    "messages": [
        {
            "role": "user",
            "content": "Hi! How are you doing today?"
        },
        {
            "role": "assistant",
            "content": "I'm doing well! How can I help you?"
        },
        {
            "role": "user",
            "content": "Can you tell me what the temperature will be in Dallas, in fahrenheit?"
        }
    ],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather in a given location",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {
                        "type": "string",
                        "description": "The city to find the weather for, e.g. 'San Francisco'"
                    },
                    "state": {
                        "type": "string",
                        "description": "the two-letter abbreviation for the state that the city is in, e.g. 'CA' which would mean 'California'"
                    },
                    "unit": {
                        "type": "string",
                        "description": "The unit to fetch the temperature in",
                        "enum": ["celsius", "fahrenheit"]
                    }
                },
                "required": ["city", "state", "unit"]
            }
        }
    }],
    "tool_choice": "auto"
})

response = client.invoke_model(
    modelId=model_id,
    body=messages
)
print(response)
model_output = json.loads(response['body'].read())
print(json.dumps(model_output, indent=2))

Response:

{'ResponseMetadata': {'RequestId': '5da8365d-f4bf-411d-a783-d85eb3966542', 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Tue, 17 Jun 2025 18:57:38 GMT', 'content-type': 'application/json', 'content-length': '1148', 'connection': 'keep-alive', 'x-amzn-requestid': '5da8365d-f4bf-411d-a783-d85eb3966542', 'x-amzn-bedrock-invocation-latency': '6396', 'x-amzn-bedrock-output-token-count': '148', 'x-amzn-bedrock-input-token-count': '198'}, 'RetryAttempts': 0}, 'contentType': 'application/json', 'body': <botocore.response.StreamingBody object at 0x7f7d4a598dc0>}
{
  "id": "chatcmpl-bc60b482436542978d233b13dc347634",
  "object": "chat.completion",
  "created": 1750186651,
  "model": "lmi",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "reasoning_content": "\nOkay, the user is asking about the weather in San Francisco. Let me check the tools available. There's a get_weather function that requires location and unit. The user didn't specify the unit, so I should ask them if they want Celsius or Fahrenheit. Alternatively, maybe I can assume a default, but since the function requires it, I need to include it. I'll have to prompt the user for the unit they prefer.\n",
        "content": "\n\nThe user hasn't specified whether they want the temperature in Celsius or Fahrenheit. I need to ask them to clarify which unit they prefer.\n\n",
        "tool_calls": [
          {
            "id": "chatcmpl-tool-fb2f93f691ed4d8ba94cadc52b57414e",
            "type": "function",
            "function": {
              "name": "get_weather",
              "arguments": "{\"location\": \"San Francisco, CA\", \"unit\": \"celsius\"}"
            }
          }
        ]
      },
      "logprobs": null,
      "finish_reason": "tool_calls",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 198,
    "total_tokens": 346,
    "completion_tokens": 148,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null
}

Deploy Qwen3-32B with SageMaker JumpStart
SageMaker JumpStart is a machine learning (ML) hub with FMs, built-in algorithms, and prebuilt ML solutions that you can deploy with just a few clicks. With SageMaker JumpStart, you can customize pre-trained models to your use case, with your data, and deploy them into production using either the UI or SDK. Deploying the Qwen3-32B model through SageMaker JumpStart offers two convenient approaches: using the intuitive SageMaker JumpStart UI or deploying programmatically through the SageMaker Python SDK. Let's explore both methods so you can choose the approach that best suits your needs.
Deploy Qwen3-32B through SageMaker JumpStart UI
Complete the following steps to deploy Qwen3-32B using SageMaker JumpStart:

On the SageMaker console, choose Studio in the navigation pane.
First-time users will be prompted to create a domain.
On the SageMaker Studio console, choose JumpStart in the navigation pane.

The model browser displays available models, with details like the provider name and model capabilities.

Search for Qwen3 to view the Qwen3-32B model card.

Each model card shows key information, including:

Model name
Provider name
Task category (for example, Text Generation)
Bedrock Ready badge (if applicable), indicating that this model can be registered with Amazon Bedrock, so you can use Amazon Bedrock APIs to invoke the model

Choose the model card to view the model details page.

The model details page includes the following information:

The model name and provider information
A Deploy button to deploy the model
About and Notebooks tabs with detailed information

The About tab includes important details, such as:

Model description
License information
Technical specifications
Usage guidelines

Before you deploy the model, it’s recommended to review the model details and license terms to confirm compatibility with your use case.

Choose Deploy to proceed with deployment.
For Endpoint name, use the automatically generated name or create a custom one.
For Instance type, choose an instance type (default: ml.g6.12xlarge).
For Initial instance count, enter the number of instances (default: 1).

Selecting appropriate instance types and counts is crucial for cost and performance optimization. Monitor your deployment to adjust these settings as needed. Under Inference type, Real-time inference is selected by default. This is optimized for sustained traffic and low latency.

Review all configurations for accuracy. For this model, we strongly recommend adhering to SageMaker JumpStart default settings and making sure that network isolation remains in place.
Choose Deploy to deploy the model.

The deployment process can take several minutes to complete.
When deployment is complete, your endpoint status will change to InService. At this point, the model is ready to accept inference requests through the endpoint. You can monitor the deployment progress on the SageMaker console Endpoints page, which will display relevant metrics and status information. When the deployment is complete, you can invoke the model using a SageMaker runtime client and integrate it with your applications.
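For example, once the endpoint shows InService, a minimal sketch like the following invokes it with the SageMaker runtime client. The endpoint name here is a hypothetical placeholder; use the name shown on the Endpoints page, and note that the payload shape mirrors the sample inputs used later in this post.

import boto3
import json

# Hypothetical endpoint name; replace with the name shown on the SageMaker Endpoints page
endpoint_name = "jumpstart-dft-qwen3-32b-endpoint"

runtime = boto3.client("sagemaker-runtime")

payload = {
    "inputs": "Briefly explain what Qwen3 is.",
    "parameters": {"max_new_tokens": 128, "top_p": 0.9, "temperature": 0.6}
}

response = runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/json",
    Body=json.dumps(payload),
)

print(json.loads(response["Body"].read()))
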
Deploy Qwen3-32B using the SageMaker Python SDK
To get started with Qwen3-32B using the SageMaker Python SDK, you must install the SageMaker Python SDK and make sure you have the necessary AWS permissions and environment set up. The following is a step-by-step code example that demonstrates how to deploy and use Qwen3-32B for inference programmatically:

!pip install --force-reinstall --no-cache-dir sagemaker==2.235.2

from sagemaker.serve.builder.model_builder import ModelBuilder
from sagemaker.serve.builder.schema_builder import SchemaBuilder
from sagemaker.jumpstart.model import ModelAccessConfig
from sagemaker.session import Session
import logging

sagemaker_session = Session()
artifacts_bucket_name = sagemaker_session.default_bucket()
execution_role_arn = sagemaker_session.get_caller_identity_arn()

# Changed to Qwen3-32B model
js_model_id = "huggingface-reasoning-qwen3-32b"
gpu_instance_type = "ml.g5.12xlarge"

response = "Hello, I'm a language model, and I'm here to help you with your English."

sample_input = {
    "inputs": "Hello, I'm a language model,",
    "parameters": {
        "max_new_tokens": 128,
        "top_p": 0.9,
        "temperature": 0.6
    }
}

sample_output = [{"generated_text": response}]

schema_builder = SchemaBuilder(sample_input, sample_output)

model_builder = ModelBuilder(
    model=js_model_id,
    schema_builder=schema_builder,
    sagemaker_session=sagemaker_session,
    role_arn=execution_role_arn,
    log_level=logging.ERROR
)

model = model_builder.build()

predictor = model.deploy(
    model_access_configs={js_model_id: ModelAccessConfig(accept_eula=True)},
    accept_eula=True
)

predictor.predict(sample_input)

You can run additional requests against the predictor:

new_input = {
    "inputs": "What is Amazon doing in Generative AI?",
    "parameters": {"max_new_tokens": 64, "top_p": 0.8, "temperature": 0.7},
}

prediction = predictor.predict(new_input)
print(prediction)

The following example adds error handling and retry best practices to the deployment code:

# Enhanced deployment code with error handling
import backoff
import botocore
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

@backoff.on_exception(backoff.expo,
                     (botocore.exceptions.ClientError,),
                     max_tries=3)
def deploy_model_with_retries(model_builder, model_id):
    try:
        model = model_builder.build()
        predictor = model.deploy(
            model_access_configs={model_id:ModelAccessConfig(accept_eula=True)},
            accept_eula=True
        )
        return predictor
    except Exception as e:
        logger.error(f"Deployment failed: {str(e)}")
        raise

def safe_predict(predictor, input_data):
    try:
        return predictor.predict(input_data)
    except Exception as e:
        logger.error(f"Prediction failed: {str(e)}")
        return None

Clean up
To avoid unwanted charges, complete the steps in this section to clean up your resources.
Delete the Amazon Bedrock Marketplace deployment
If you deployed the model using Amazon Bedrock Marketplace, complete the following steps:

On the Amazon Bedrock console, under Foundation models in the navigation pane, choose Marketplace deployments.
In the Managed deployments section, locate the endpoint you want to delete.
Select the endpoint, and on the Actions menu, choose Delete.
Verify the endpoint details to make sure you’re deleting the correct deployment:

Endpoint name
Model name
Endpoint status

Choose Delete to delete the endpoint.
In the deletion confirmation dialog, review the warning message, enter confirm, and choose Delete to permanently remove the endpoint.

Delete the SageMaker JumpStart predictor
The SageMaker JumpStart model you deployed will incur costs if you leave it running. Use the following code to delete the endpoint if you want to stop incurring charges. For more details, see Delete Endpoints and Resources.

predictor.delete_model()
predictor.delete_endpoint()

Conclusion
In this post, we explored how you can access and deploy the Qwen3 models using Amazon Bedrock Marketplace and SageMaker JumpStart. With support for both the full-parameter models and their distilled versions, you can choose the optimal model size for your specific use case. Visit SageMaker JumpStart in Amazon SageMaker Studio or Amazon Bedrock Marketplace to get started. For more information, refer to Use Amazon Bedrock tooling with Amazon SageMaker JumpStart models, SageMaker JumpStart pretrained models, Amazon SageMaker JumpStart Foundation Models, Amazon Bedrock Marketplace, and Getting started with Amazon SageMaker JumpStart.
The Qwen3 family of LLMs offers exceptional versatility and performance, making it a valuable addition to the AWS foundation model offerings. Whether you’re building applications for content generation, analysis, or complex reasoning tasks, Qwen3’s advanced architecture and extensive context window make it a powerful choice for your generative AI needs.

About the authors
Niithiyn Vijeaswaran is a Generative AI Specialist Solutions Architect with the Third-Party Model Science team at AWS. His area of focus is AWS AI accelerators (AWS Neuron). He holds a Bachelor’s degree in Computer Science and Bioinformatics.
Avan Bala is a Solutions Architect at AWS. His area of focus is AI for DevOps and machine learning. He holds a bachelor’s degree in Computer Science with a minor in Mathematics and Statistics from the University of Maryland. Avan is currently working with the Enterprise Engaged East Team and likes to specialize in projects about emerging AI technologies.
Mohhid Kidwai is a Solutions Architect at AWS. His area of focus is generative AI and machine learning solutions for small-medium businesses. He holds a bachelor’s degree in Computer Science with a minor in Biological Science from North Carolina State University. Mohhid is currently working with the SMB Engaged East Team at AWS.
Yousuf Athar is a Solutions Architect at AWS specializing in generative AI and AI/ML. With a Bachelor’s degree in Information Technology and a concentration in Cloud Computing, he helps customers integrate advanced generative AI capabilities into their systems, driving innovation and competitive edge. Outside of work, Yousuf loves to travel, watch sports, and play football.
John Liu has 15 years of experience as a product executive and 9 years of experience as a portfolio manager. At AWS, John is a Principal Product Manager for Amazon Bedrock. Previously, he was the Head of Product for AWS Web3 / Blockchain. Prior to AWS, John held various product leadership roles at public blockchain protocols, fintech companies and also spent 9 years as a portfolio manager at various hedge funds.
Rohit Talluri is a Generative AI GTM Specialist at Amazon Web Services (AWS). He is partnering with top generative AI model builders, strategic customers, key AI/ML partners, and AWS Service Teams to enable the next generation of artificial intelligence, machine learning, and accelerated computing on AWS. He was previously an Enterprise Solutions Architect and the Global Solutions Lead for AWS Mergers & Acquisitions Advisory.
Varun Morishetty is a Software Engineer with Amazon SageMaker JumpStart and Bedrock Marketplace. Varun received his Bachelor’s degree in Computer Science from Northeastern University. In his free time, he enjoys cooking, baking and exploring New York City.

Build a just-in-time knowledge base with Amazon Bedrock

Software as a service (SaaS) companies managing multiple tenants face a critical challenge: efficiently extracting meaningful insights from vast document collections while controlling costs. Traditional approaches often lead to unnecessary spending on unused storage and processing resources, impacting both operational efficiency and profitability. Organizations need solutions that intelligently scale processing and storage resources based on actual tenant usage patterns while maintaining data isolation. Traditional Retrieval Augmented Generation (RAG) systems consume valuable resources by ingesting and maintaining embeddings for documents that might never be queried, resulting in unnecessary storage costs and reduced system efficiency. Systems designed to handle large numbers of small to mid-sized tenants can exceed cost and infrastructure limits or might need to use silo-style deployments to keep each tenant's information and usage separate. Adding to this complexity, many projects are transitory in nature, with work completed on an intermittent basis, leaving data occupying space in knowledge base systems that could be used by other active tenants.
To address these challenges, this post presents a just-in-time knowledge base solution that reduces unused consumption through intelligent document processing. The solution processes documents only when needed and automatically removes unused resources, so organizations can scale their document repositories without proportionally increasing infrastructure costs.
With a multi-tenant architecture with configurable limits per tenant, service providers can offer tiered pricing models while maintaining strict data isolation, making it ideal for SaaS applications serving multiple clients with varying needs. Automatic document expiration through Time-to-Live (TTL) makes sure the system remains lean and focused on relevant content, while refreshing the TTL for frequently accessed documents maintains optimal performance for information that matters. This architecture also makes it possible to limit the number of files each tenant can ingest at a specific time and the rate at which tenants can query a set of files. This solution uses serverless technologies to alleviate operational overhead and provide automatic scaling, so teams can focus on business logic rather than infrastructure management. By organizing documents into groups with metadata-based filtering, the system enables contextual querying that delivers more relevant results while maintaining security boundaries between tenants. The architecture's flexibility supports customization of tenant configurations, query rates, and document retention policies, making it adaptable to evolving business requirements without significant rearchitecting.
Solution overview
This architecture combines several AWS services to create a cost-effective, multi-tenant knowledge base solution that processes documents on demand. The key components include:

Vector-based knowledge base – Uses Amazon Bedrock and Amazon OpenSearch Serverless for efficient document processing and querying
On-demand document ingestion – Implements just-in-time processing using the Amazon Bedrock CUSTOM data source type
TTL management – Provides automatic cleanup of unused documents using the TTL feature in Amazon DynamoDB
Multi-tenant isolation – Enforces secure data separation between users and organizations with configurable resource limits

The solution enables granular control through metadata-based filtering at the user, tenant, and file level. The DynamoDB TTL tracking system supports tiered pricing structures, where tenants can pay for different TTL durations, document ingestion limits, and query rates.
The following diagram illustrates the key components and workflow of the solution.

The workflow consists of the following steps:

The user logs in to the system, which attaches a tenant ID to the current user for calls to the Amazon Bedrock knowledge base. This authentication step is crucial because it establishes the security context and makes sure subsequent interactions are properly associated with the correct tenant. The tenant ID becomes the foundational piece of metadata that enables proper multi-tenant isolation and resource management throughout the entire workflow.
After authentication, the user creates a project that will serve as a container for the files they want to query. This project creation step establishes the organizational structure needed to manage related documents together. The system generates appropriate metadata and creates the necessary database entries to track the project’s association with the specific tenant, enabling proper access control and resource management at the project level.
With a project established, the user can begin uploading files. The system manages this process by generating pre-signed URLs for secure file upload. As files are uploaded, they are stored in Amazon Simple Storage Service (Amazon S3), and the system automatically creates entries in DynamoDB that associate each file with both the project and the tenant. This three-way relationship (file-project-tenant) is essential for maintaining proper data isolation and enabling efficient querying later.
When a user requests to create a chat with a knowledge base for a specific project, the system begins ingesting the project files using the CUSTOM data source. This is where the just-in-time processing begins. During ingestion, the system applies a TTL value based on the tenant’s tier-specific TTL interval. The TTL makes sure project files remain available during the chat session while setting up the framework for automatic cleanup later. This step represents the core of the on-demand processing strategy, because files are only processed when they are needed.
Each chat session actively updates the TTL for the project files being used. This dynamic TTL management makes sure frequently accessed files remain in the knowledge base while allowing rarely used files to expire naturally. The system continually refreshes the TTL values based on actual usage, creating an efficient balance between resource availability and cost optimization. This approach maintains optimal performance for actively used content while helping to prevent resource waste on unused documents.
After the chat session ends and the TTL value expires, the system automatically removes files from the knowledge base. This cleanup process is triggered by Amazon DynamoDB Streams monitoring TTL expiration events, which activate an AWS Lambda function to remove the expired documents. This final step reduces the load on the underlying OpenSearch Serverless cluster and optimizes system resources, making sure the knowledge base remains lean and efficient.

Prerequisites
You need the following prerequisites before you can proceed with the solution. For this post, we use the us-east-1 AWS Region.

An active AWS account with permissions to create resources in us-east-1
The AWS Command Line Interface (AWS CLI) installed
The AWS Cloud Development Kit (AWS CDK) installed
Git installed to clone the repository

Deploy the solution
Complete the following steps to deploy the solution:

Download the AWS CDK project from the GitHub repo.
Install the project dependencies:

npm run install:all

Deploy the solution:

npm run deploy

Create a user and log in to the system after validating your email.

Validate the knowledge base and run a query
Before allowing users to chat with their documents, the system performs the following steps:

Performs a validation check to determine if documents need to be ingested. This process happens transparently to the user and includes checking document status in DynamoDB and the knowledge base.
Validates that the required documents are successfully ingested and properly indexed before allowing queries.
Returns both the AI-generated answers and relevant citations to source documents, maintaining traceability and empowering users to verify the accuracy of responses.

The following screenshot illustrates an example of chatting with the documents.

Looking at the following example method for file ingestion, note how file information is stored in DynamoDB with a TTL value for automatic expiration. The ingest knowledge base documents call includes essential metadata (user ID, tenant ID, and project), enabling precise filtering of this tenant’s files in subsequent operations.

# Ingesting files with tenant-specific TTL values
def ingest_files(user_id, tenant_id, project_id, files):
    # Get tenant configuration and calculate TTL
    tenants = json.loads(os.environ.get('TENANTS'))['Tenants']
    tenant = find_tenant(tenant_id, tenants)
    ttl = int(time.time()) + (int(tenant['FilesTTLHours']) * 3600)

    # For each file, create a record with TTL and start ingestion
    for file in files:
        file_id = file['id']
        s3_key = file.get('s3Key')
        bucket = file.get('bucket')

        # Create a record in the knowledge base files table with TTL
        knowledge_base_files_table.put_item(
            Item={
                'id': file_id,
                'userId': user_id,
                'tenantId': tenant_id,
                'projectId': project_id,
                'documentStatus': 'ready',
                'createdAt': int(time.time()),
                'ttl': ttl  # TTL value for automatic expiration
            }
        )

        # Start the ingestion job with tenant, user, and project metadata for filtering
        bedrock_agent.ingest_knowledge_base_documents(
            knowledgeBaseId=KNOWLEDGE_BASE_ID,
            dataSourceId=DATA_SOURCE_ID,
            clientToken=str(uuid.uuid4()),
            documents=[
                {
                    'content': {
                        'dataSourceType': 'CUSTOM',
                        'custom': {
                            'customDocumentIdentifier': {
                                'id': file_id
                            },
                            's3Location': {
                                'uri': f"s3://{bucket}/{s3_key}"
                            },
                            'sourceType': 'S3_LOCATION'
                        }
                    },
                    'metadata': {
                        'type': 'IN_LINE_ATTRIBUTE',
                        'inlineAttributes': [
                            {'key': 'userId', 'value': {'stringValue': user_id, 'type': 'STRING'}},
                            {'key': 'tenantId', 'value': {'stringValue': tenant_id, 'type': 'STRING'}},
                            {'key': 'projectId', 'value': {'stringValue': project_id, 'type': 'STRING'}},
                            {'key': 'fileId', 'value': {'stringValue': file_id, 'type': 'STRING'}}
                        ]
                    }
                }
            ]
        )

During a query, you can use the associated metadata to construct parameters that make sure you only retrieve files belonging to this specific tenant. For example:

    filter_expression = {
        "andAll": [
            {
                "equals": {
                    "key": "tenantId",
                    "value": tenant_id
                }
            },
            {
                "equals": {
                    "key": "projectId",
                    "value": project_id
                }
            },
            {
                "in": {
                    "key": "fileId",
                    "value": file_ids
                }
            }
        ]
    }

    # Create base parameters for the API call
    retrieve_params = {
        'input': {
            'text': query
        },
        'retrieveAndGenerateConfiguration': {
            'type': 'KNOWLEDGE_BASE',
            'knowledgeBaseConfiguration': {
                'knowledgeBaseId': knowledge_base_id,
                'modelArn': 'arn:aws:bedrock:us-east-1::foundation-model/amazon.nova-pro-v1:0',
                'retrievalConfiguration': {
                    'vectorSearchConfiguration': {
                        'numberOfResults': limit,
                        'filter': filter_expression
                    }
                }
            }
        }
    }
    response = bedrock_agent_runtime.retrieve_and_generate(**retrieve_params)

Manage the document lifecycle with TTL
To further optimize resource usage and costs, you can implement an intelligent document lifecycle management system using the DynamoDB Time-to-Live (TTL) feature. This consists of the following steps:

When a document is ingested into the knowledge base, a record is created with a configurable TTL value.
This TTL is refreshed when the document is accessed; a refresh sketch follows the cleanup code below.
DynamoDB Streams with specific filters for TTL expiration events is used to trigger a cleanup Lambda function.
The Lambda function removes expired documents from the knowledge base.

See the following code:

# Lambda function triggered by DynamoDB Streams when TTL expires items
def lambda_handler(event, context):
    """
    This function is triggered by DynamoDB Streams when TTL expires items.
    It removes expired documents from the knowledge base.
    """

    # Process each record in the event
    for record in event.get('Records', []):
        # Check if this is a TTL expiration event (REMOVE event from DynamoDB Stream)
        if record.get('eventName') == 'REMOVE':
            # Check if this is a TTL expiration
            user_identity = record.get('userIdentity', {})
            if user_identity.get('type') == 'Service' and user_identity.get('principalId') == 'dynamodb.amazonaws.com':
                # Extract the file ID from the record keys
                keys = record.get('dynamodb', {}).get('Keys', {})
                file_id = keys.get('id', {}).get('S')

                # Delete the document from the knowledge base
                bedrock_agent.delete_knowledge_base_documents(
                    clientToken=str(uuid.uuid4()),
                    knowledgeBaseId=knowledge_base_id,
                    dataSourceId=data_source_id,
                    documentIdentifiers=[
                        {
                            'custom': {
                                'id': file_id
                            },
                            'dataSourceType': 'CUSTOM'
                        }
                    ]
                )
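
As noted in step 2, the TTL is extended whenever a document is accessed during a chat session. The following is a minimal sketch of that refresh, reusing the knowledge base files table and the tenant's FilesTTLHours value from the ingestion code; the helper name itself is illustrative, not part of the sample repository.

# Refresh the TTL for a file each time it is used in a chat session
def refresh_file_ttl(file_id, tenant):
    # Extend expiration by the tenant's tier-specific TTL interval
    new_ttl = int(time.time()) + (int(tenant['FilesTTLHours']) * 3600)

    knowledge_base_files_table.update_item(
        Key={'id': file_id},
        UpdateExpression='SET #ttl = :ttl',
        ExpressionAttributeNames={'#ttl': 'ttl'},
        ExpressionAttributeValues={':ttl': new_ttl}
    )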

Multi-tenant isolation with tiered service levels
Our architecture enables sophisticated multi-tenant isolation with tiered service levels:

Tenant-specific document filtering – Each query includes user, tenant, and file-specific filters, allowing the system to reduce the number of documents being queried.
Configurable TTL values – Different tenant tiers can have different TTL configurations. For example:

Free tier: Five documents ingested with a 7-day TTL and five queries per minute.
Standard tier: 100 documents ingested with a 30-day TTL and 10 queries per minute.
Premium tier: 1,000 documents ingested with a 90-day TTL and 50 queries per minute.
You can configure additional limits, such as total queries per month or total ingested files per day or month.
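
These tier limits map naturally onto the tenant configuration that the ingestion code reads from the TENANTS environment variable. The sample repository defines its own schema; the sketch below only assumes the FilesTTLHours field used earlier, and the other field names are illustrative.

import json
import os

# Hypothetical tenant tiers; only FilesTTLHours is referenced by the ingestion code shown earlier
tenants_config = {
    "Tenants": [
        {"Id": "free",     "FilesTTLHours": 7 * 24,  "MaxFiles": 5,    "QueriesPerMinute": 5},
        {"Id": "standard", "FilesTTLHours": 30 * 24, "MaxFiles": 100,  "QueriesPerMinute": 10},
        {"Id": "premium",  "FilesTTLHours": 90 * 24, "MaxFiles": 1000, "QueriesPerMinute": 50},
    ]
}

os.environ["TENANTS"] = json.dumps(tenants_config)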

Clean up
To clean up the resources created in this post, run the following command from the same location where you performed the deploy step:

npm run destroy

Conclusion
The just-in-time knowledge base architecture presented in this post transforms document management across multiple tenants by processing documents only when queried, reducing the unused consumption of traditional RAG systems. This serverless implementation uses Amazon Bedrock, OpenSearch Serverless, and the DynamoDB TTL feature to create a lean system with intelligent document lifecycle management, configurable tenant limits, and strict data isolation, which is essential for SaaS providers offering tiered pricing models.
This solution directly addresses cost structure and infrastructure limitations of traditional systems, particularly for deployments handling numerous small to mid-sized tenants with transitory projects. This architecture combines on-demand document processing with automated lifecycle management, delivering a cost-effective, scalable resource that empowers organizations to focus on extracting insights rather than managing infrastructure, while maintaining security boundaries between tenants.
Ready to implement this architecture? The full sample code is available in the GitHub repository.

About the author
Steven Warwick is a Senior Solutions Architect at AWS, where he leads customer engagements to drive successful cloud adoption and specializes in SaaS architectures and Generative AI solutions. He produces educational content including blog posts and sample code to help customers implement best practices, and has led programs on GenAI topics for solution architects. Steven brings decades of technology experience to his role, helping customers with architectural reviews, cost optimization, and proof-of-concept development.

Getting Started with Agent Communication Protocol (ACP): Build a Weather Agent with Python

The Agent Communication Protocol (ACP) is an open standard designed to enable seamless communication between AI agents, applications, and humans. As AI systems are often developed using diverse frameworks and infrastructures, they can end up isolated and incompatible, limiting their ability to collaborate. ACP addresses this fragmentation by offering a unified RESTful API that facilitates:

Multimodal communication

Both synchronous and asynchronous messaging

Real-time streaming

Support for stateful and stateless agent interactions

Discovery of agents, whether online or offline

Execution of long-running tasks

In this tutorial, we’ll take our first steps with ACP by building a basic server that provides London’s weather information and a simple client that can interact with it.

Setting up the dependencies

Installing the libraries

pip install acp acp-sdk beeai-framework httpx

Creating the ACP Server

We’ll begin by setting up the ACP server, starting with the creation of an agent.py file.

We’ll begin by importing the necessary libraries. To fetch London’s weather data, we’ll use the httpx library to make a request to the Open‑Meteo API.

import asyncio
from collections.abc import AsyncGenerator
import httpx

from acp_sdk.models import Message, MessagePart
from acp_sdk.server import Context, RunYield, RunYieldResume, Server

server = Server()

Next, we’ll define an asynchronous helper function called get_london_weather that retrieves the current weather in London using the Open‑Meteo API. This function sends a request with London’s coordinates and returns a formatted weather summary including temperature, wind speed, and weather condition code.

async def get_london_weather() -> str:
    """Fetch current London weather from the free Open-Meteo API."""
    params = {
        "latitude": 51.5072,  # London coordinates
        "longitude": -0.1276,
        "current_weather": True,
        "timezone": "Europe/London"
    }
    url = "https://api.open-meteo.com/v1/forecast"

    async with httpx.AsyncClient(timeout=10) as client:
        resp = await client.get(url, params=params)
        resp.raise_for_status()
        cw = resp.json()["current_weather"]

    return (
        f"Weather in London: {cw['temperature']} °C, "
        f"wind {cw['windspeed']} km/h, code {cw['weathercode']}."
    )

This code defines an ACP-compatible agent using the @server.agent() decorator. The london_weather_agent function handles incoming messages by first yielding a thought message, then asynchronously fetching the current weather in London using the get_london_weather() helper. The weather data is then returned as a plain text message. Finally, server.run() starts the ACP server and makes the agent available to handle requests.

@server.agent()
async def london_weather_agent(
    input: list[Message], context: Context
) -> AsyncGenerator[RunYield, RunYieldResume]:
    """Returns current London weather."""
    for _ in input:
        yield {"thought": "Fetching London weather..."}
        weather = await get_london_weather()
        yield Message(
            role="agent",
            parts=[MessagePart(content=weather, content_type="text/plain")]
        )

server.run()

Running the server

Next, we’ll run the agent.py file to start the server. Once running, the ACP agent will be available to handle requests at http://localhost:8000

python agent.py

To verify that your agent is up and running, open a new terminal and execute the following curl command:

curl http://localhost:8000/agents

If everything is working correctly, you’ll receive a JSON response listing your agent, confirming that it’s available and ready to handle requests.

{
  "agents": [
    {
      "name": "london_weather_agent",
      "description": "Returns current London weather.",
      "metadata": {
        "annotations": null,
        "documentation": null,
        "license": null,
        "programming_language": null,
        "natural_languages": null,
        "framework": null,
        "capabilities": null,
        "domains": null,
        "tags": null,
        "created_at": null,
        "updated_at": null,
        "author": null,
        "contributors": null,
        "links": null,
        "dependencies": null,
        "recommended_models": null
      },
      "input_content_types": ["*/*"],
      "output_content_types": ["*/*"]
    }
  ]
}

Creating the ACP Client

We will now create an ACP client (client.py) to interact with our server. 

This client script uses the ACP SDK to connect to the locally running london_weather_agent via the ACP server at http://localhost:8000. It sends a synchronous message asking for the weather using the run_sync method. Once the agent responds, the script prints out the returned weather details.

import asyncio

from acp_sdk.client import Client
from acp_sdk.models import Message, MessagePart

async def call_london_weather_agent() -> None:
    async with Client(base_url="http://localhost:8000") as client:
        run = await client.run_sync(
            agent="london_weather_agent",
            input=[
                Message(
                    parts=[MessagePart(content="Tell me the weather", content_type="text/plain")]
                )
            ],
        )

        print("Response from london_weather_agent:")
        for message in run.output:
            for part in message.parts:
                print("-", part.content)

if __name__ == "__main__":
    asyncio.run(call_london_weather_agent())

Running the Client

In another terminal, run the following command to send a request to the ACP server:

python client.py

You should see a response from the server containing the current weather in London, returned by the london_weather_agent.

Response from london_weather_agent:
– Weather in London: 20.8 °C, wind 10.1 km/h, code 3.

Check out the full code. All credit for this research goes to the researchers of this project.
The post Getting Started with Agent Communication Protocol (ACP): Build a Weather Agent with Python appeared first on MarkTechPost.

SynPref-40M and Skywork-Reward-V2: Scalable Human-AI Alignment for State-of-the-Art Reward Models

Understanding Limitations of Current Reward Models

Although reward models play a crucial role in Reinforcement Learning from Human Feedback (RLHF), many of today’s top-performing open models still struggle to reflect the full range of complex human preferences. Even with sophisticated training techniques, meaningful progress has been limited. A major reason appears to be the shortcomings in current preference datasets, which are often too narrow, artificially generated, or poorly vetted. While some rule-based systems are effective for clear tasks like math or coding, they usually fail to capture nuanced human judgment. Moreover, common benchmarks like RewardBench are becoming less reliable indicators of real-world RM performance, showing poor correlation with downstream task success.

Challenges in Preference Data Creation and New Approaches

Creating high-quality preference data has traditionally relied on human annotators, but this method is time-consuming, costly, and sometimes inconsistent. To address this, recent techniques like RLAIF use LLMs to automate annotations, sometimes even outperforming humans. Newer approaches aim to combine the strengths of both by integrating LLM-generated data with human-verified labels. Meanwhile, reward models have evolved from simple scoring systems, such as the Bradley-Terry model, to more complex frameworks, including generative and direct optimization methods. Despite the availability of numerous robust open models and datasets, challenges persist in accurately capturing nuanced human preferences across diverse tasks and languages.

Introducing SynPref-40M: Large-Scale Human-AI Preference Dataset

Researchers from 2050 Research and Skywork AI introduce SynPref-40M, a massive dataset of 40 million preference pairs curated through a two-stage human-AI pipeline. Human annotators ensure quality through strict verification, while LLMs scale up data curation using human guidance. From this, they develop Skywork-Reward-V2, a family of eight reward models (0.6B–8B parameters) trained on a high-quality subset of 26 million preference pairs. These models achieve state-of-the-art results across seven leading benchmarks, excelling in alignment, safety, objectivity, and robustness. The study highlights that success comes not just from data volume, but from careful, iterative curation that blends human expertise with AI scalability.

Scalable Two-Stage Human-AI Curation Pipeline

Current open reward models often suffer from overfitting to narrow benchmarks, such as RewardBench, which limits their real-world usefulness. To address this, the researchers introduce a two-stage, human-AI pipeline for curating large-scale preference data. Stage 1 starts with human-verified annotations to guide LLMs in labeling diverse preference attributes, followed by iterative training and error analysis to refine the reward model. Stage 2 scales this process using consistency checks between the current best reward model and a human-trained “gold” reward model, filtering reliable samples without further human input. This approach strikes a balance between quality and scalability, ultimately enabling the creation of tens of millions of high-quality preference pairs.
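
A minimal sketch of that Stage 2 consistency filter, assuming we can score candidate responses with both the current best reward model and the gold reward model; the function and method names here are illustrative, not the authors' implementation.

def filter_by_consistency(pairs, candidate_rm, gold_rm):
    """Keep preference pairs on which the candidate and gold reward models agree."""
    kept = []
    for prompt, chosen, rejected in pairs:
        candidate_agrees = candidate_rm.score(prompt, chosen) > candidate_rm.score(prompt, rejected)
        gold_agrees = gold_rm.score(prompt, chosen) > gold_rm.score(prompt, rejected)
        if candidate_agrees and gold_agrees:
            kept.append((prompt, chosen, rejected))
    return kept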

Benchmarking Skywork-Reward-V2: Compact Yet Powerful Models

The Skywork-Reward-V2 series demonstrates strong performance across multiple benchmarks, outperforming both larger models (e.g., 70B parameters) and emerging generative reward models. Trained using Qwen3 (0.6B–8B) and Llama 3.1/3.2 (1B–8B) backbones, these models achieve high scores on RewardBench, PPE, RM-Bench, and JudgeBench, with the best-performing variant (Llama-3.1-8B-40M) surpassing all others with an average score of 88.6. Despite smaller model sizes, Skywork-Reward-V2 models benefit from high-quality preference data (SynPref-40M) and efficient training setups, enabling them to generalize better in real-world RLHF scenarios. Notably, even mid-sized models like the Qwen3-1.7B outperform some 70B models, emphasizing the impact of training data quality and methodology over sheer parameter count.

Conclusion and Future Outlook: Scaling with Precision

In conclusion, SynPref-40M is a large-scale preference dataset built through a two-stage human-AI collaboration that combines human judgment with LLM-based scalability. Using a curated subset of 26 million preference pairs, the team developed Skywork-Reward-V2, a suite of eight reward models (0.6B–8B parameters) that outperform existing models across seven key benchmarks. These models show strong generalization in aligning with human values, ensuring correctness, safety, and robustness to bias. Extensive studies confirm that both the data quality and the curation method are key drivers of performance. Looking forward, the researchers aim to explore new training strategies, as reward models become central to LLM development and alignment.

Check out the Paper, Model on Hugging Face, and GitHub Page. All credit for this research goes to the researchers of this project.
The post SynPref-40M and Skywork-Reward-V2: Scalable Human-AI Alignment for State-of-the-Art Reward Models appeared first on MarkTechPost.

New AI Method From Meta and NYU Boosts LLM Alignment Using Semi-Online Reinforcement Learning

Optimizing LLMs for Human Alignment Using Reinforcement Learning

Large language models often require a further alignment phase to optimize them for human use. In this phase, reinforcement learning plays a central role by enabling models to make decisions based on human feedback or task-based correctness. This fine-tuning allows the models to align more closely with user expectations, making them more suitable for instruction-based applications or precise mathematical tasks.

Challenges in Choosing Offline vs. Online Reinforcement Learning Strategies

A major difficulty arises when choosing the most effective way to conduct this fine-tuning. Training methods fall into two extremes—offline approaches that depend on static, pre-generated data and fully online approaches that continuously update with each new interaction. Each method has distinct challenges. Offline models can’t adapt during training, which limits performance, while online models often demand more computational resources. Moreover, ensuring that models perform well across both mathematical (verifiable) and open-ended (non-verifiable) tasks adds further complexity to this choice.

Overview of Alignment Algorithms: DPO and GRPO

Historically, tools like Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO) have been employed for model alignment. DPO operates offline and is designed to work with preference-based data pairs. It is valued for its simplicity and data efficiency but lacks the adaptability of online methods. GRPO is based on the PPO algorithm and handles online fine-tuning by comparing groups of outputs to compute relative advantages. While GRPO adapts in real-time and suits dynamic reward systems, its on-policy nature increases computational load and makes experimentation more demanding.
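
For reference, the core of DPO is a simple contrastive objective over chosen and rejected responses. The following is a minimal sketch, not the implementation used in the paper, assuming the per-response log-probabilities under the policy and a frozen reference model have already been computed as batched tensors.

import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Log-ratios of the policy against the frozen reference model
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Push the chosen response's log-ratio above the rejected one's, scaled by beta
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()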

A Balanced Alternative for LLM Alignment

Research introduced by Meta and NYU explored a method to overcome these limitations through a semi-online training setup. This technique modulates how frequently the model’s generation and training components are synchronized, rather than updating at every training step, as in fully online methods, or not at all, as in offline setups. The semi-online method strikes a middle ground by adjusting the synchronization rate. Researchers designed this approach to reduce training time and maintain high model adaptability. The modular setup also allowed them to apply either DPO or GRPO with task-specific reward models in a flexible manner.

Instruction Following and Mathematical Reasoning

The methodology involved fine-tuning the Llama-3.1-8B-Instruct model on two types of tasks: open-ended instruction following and math problem solving. For non-verifiable tasks, user prompts were sampled from the WildChat-1M dataset and evaluated using the Athene-RM-8B reward model, which assigns scalar scores to each prompt. For verifiable tasks, the team used the NuminaMath dataset with the Math-Verify toolkit, which checks whether generated answers match the expected outputs. Experiments ran on 32 NVIDIA H200 GPUs for training and 8 GPUs for inference, with different setups comparing offline, semi-online, and online synchronization intervals.

Performance Gains Across Both Verifiable and Non-Verifiable Tasks

Clear performance differences were observed. On Math500, offline DPO reached 53.7% accuracy, whereas semi-online DPO with a synchronization interval of s = 100 achieved 58.9%. Online DPO and GRPO showed similar results at 58.7% and 58.1%, respectively. Similar trends held on the NuminaMath benchmark, where offline DPO achieved 36.4%, and semi-online variants increased this to 39.4% (s = 10). The gains were not limited to math tasks. When non-verifiable tasks were evaluated with the AlpacaEval 2.0 and Arena-Hard benchmarks, models trained with mixed reward types performed consistently better. Combining verifiable and non-verifiable rewards in a single training setup resulted in stronger average scores, indicating that the method generalizes effectively.

A Flexible, Scalable Approach for Reinforcement Learning in LLMs

This study demonstrates that fine-tuning large language models does not require strict adherence to either offline or online setups. By introducing a flexible synchronization scheme, the research team from Meta and NYU effectively increased training efficiency while maintaining or improving performance. The results show that carefully balancing reward types and training synchronization frequency leads to models that perform well across task types without incurring high computational costs.

Check out the Paper. All credit for this research goes to the researchers of this project.
The post New AI Method From Meta and NYU Boosts LLM Alignment Using Semi-Online Reinforcement Learning appeared first on MarkTechPost.

AbstRaL: Teaching LLMs Abstract Reasoning via Reinforcement to Boost Robustness on GSM Benchmarks

Recent research indicates that LLMs, particularly smaller ones, frequently struggle with robust reasoning. They tend to perform well on familiar questions but falter when those same problems are slightly altered, such as changing names or numbers, or adding irrelevant but related information. This weakness, known as poor out-of-distribution (OOD) generalization, results in notable accuracy drops, even in simple math tasks. One promising solution is to create synthetic variations of reasoning problems, helping models learn to focus on the underlying logic rather than surface details. Strengthening reasoning in this manner is crucial for developing more general and reliable AI systems.

Abstracting the Core Logic of LLM Reasoning Failures

LLMs have demonstrated impressive reasoning capabilities, yet they often falter when exposed to distribution shifts, such as changes in phrasing, numerical values, or the introduction of distractions. This vulnerability is evident across benchmarks in logic, mathematics, and commonsense reasoning. Prior solutions have relied on data augmentation to expose models to a broader variety of inputs, improving robustness but increasing computational demands. Researchers have also explored formats such as abstraction-of-thought and chain-of-abstraction to teach abstract reasoning, while planning techniques like chain-of-thought and tree-of-thought aid step-by-step problem-solving. Reinforcement learning and preference-based methods provide additional support for reasoning skill development beyond pattern memorization.

AbstRaL’s Symbolic Learning Method to Improve Reasoning Consistency

Researchers from Apple and EPFL propose AbstRaL, a method that teaches LLMs to understand abstract reasoning patterns rather than memorizing surface details. Instead of generating many varied training examples, which is computationally costly, AbstRaL helps LLMs learn the underlying structure of reasoning problems using reinforcement learning. This method connects these abstract patterns to symbolic tools, enabling more reliable problem-solving. Tested on GSM benchmarks, AbstRaL significantly improves LLM performance, especially when faced with input changes or distracting information. It outperforms models trained only with supervised learning by promoting more consistent and context-independent reasoning.

Four Steps to Abstract Symbolic Reasoning via AbstRaL

AbstRaL is a four-step framework designed to teach LLMs to reason abstractly rather than rely on surface patterns. First, it identifies key variables in a question and replaces them with symbolic placeholders. Then, using specially crafted data (GranulAR), the model learns to reason step-by-step with these abstract symbols. Next, it retrieves the general reasoning structure (abstraction) from the symbolic answer. Finally, it uses this abstraction with the original values to compute the correct answer. Reinforcement learning with two rewards, one for correctness and another for symbolic similarity, further improves the model’s ability to generate accurate, context-independent reasoning patterns.
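
To make the abstraction idea concrete, here is a toy sketch of the first and last steps: replacing numbers with symbolic placeholders, then grounding a model-produced abstraction with the original values. The GranulAR data format and the reinforcement rewards are not shown, and the abstraction string below is hypothetical.

import re

def abstract_question(question):
    """Replace concrete numbers with symbolic placeholders (illustrative only)."""
    values = {}
    def replace_number(match):
        symbol = f"x{len(values) + 1}"
        values[symbol] = float(match.group())
        return symbol
    return re.sub(r"\d+(\.\d+)?", replace_number, question), values

abstract_q, values = abstract_question("A shirt costs 12 dollars. How much do 3 shirts cost?")
# abstract_q == "A shirt costs x1 dollars. How much do x2 shirts cost?"

# Suppose the model reasons over the abstract question and emits this abstraction
abstraction = "x1 * x2"                  # hypothetical model output
answer = eval(abstraction, {}, values)   # ground the symbols with the original values -> 36.0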

GSM8K Variations Reveal AbstRaL’s Robustness Across LLM Sizes

The researchers evaluate AbstRaL on math reasoning tasks using models such as Llama-3 and Qwen2, training them with a dataset called GranulAR that rewrites math problems in an abstract symbolic form. This helps models focus on structure rather than surface details. They test robustness using altered versions of GSM8K problems, changing numbers, names, and phrasing. Compared to baselines like standard Chain-of-Thought prompting, AbstRaL shows stronger consistency and less accuracy drop on these variations. Especially for smaller models, it improves reliability across reworded inputs. The results suggest that teaching models to reason abstractly makes them more adaptable and less reliant on memorized patterns.

Teaching LLMs Abstract Thinking through Reinforcement Yields Robust Reasoning

In conclusion, AbstRaL is a method designed to enhance abstract reasoning in LLMs, making them more resilient to superficial changes in problems. Unlike traditional fine-tuning or data augmentation, AbstRaL uses reinforcement learning to train models on GranulAR rationales that mix Socratic chain-of-thought with detailed abstraction. This approach helps models strip away surface-level distractions and better connect with symbolic tools. Tested on challenging GSM8K perturbation benchmarks, AbstRaL notably reduces performance drops under distribution shifts, particularly in smaller models. The study shows that learning to abstract improves reasoning robustness more effectively than relying solely on direct supervision.

Check out the Paper. All credit for this research goes to the researchers of this project.
The post AbstRaL: Teaching LLMs Abstract Reasoning via Reinforcement to Boost Robustness on GSM Benchmarks appeared first on MarkTechPost.

Kyutai Releases 2B Parameter Streaming Text-to-Speech TTS with 220ms Latency and 2.5M Hours of Training

Kyutai, an open AI research lab, has released a groundbreaking streaming Text-to-Speech (TTS) model with ~2 billion parameters. Designed for real-time responsiveness, this model delivers ultra-low latency audio generation (220 milliseconds) while maintaining high fidelity. It’s trained on an unprecedented 2.5 million hours of audio and is licensed under the permissive CC-BY-4.0, reinforcing Kyutai’s commitment to openness and reproducibility. This advancement redefines the efficiency and accessibility of large-scale speech generation models, particularly for edge deployment and agentic AI.

Unpacking the Performance: Sub-350ms Latency for 32 Concurrent Users on a Single L40 GPU

The model’s streaming capability is its most distinctive feature. On a single NVIDIA L40 GPU, the system can serve up to 32 concurrent users while keeping the latency under 350ms. For individual use, the model maintains a generation latency as low as 220ms, enabling nearly real-time applications such as conversational agents, voice assistants, and live narration systems. This performance is enabled through Kyutai’s novel Delayed Streams Modeling approach, which allows the model to generate speech incrementally as text arrives.

Key Technical Metrics:

Model size: ~2B parameters

Training data: 2.5 million hours of speech

Latency: 220ms single-user, <350ms with 32 users on one L40 GPU

Language support: English and French

License: CC-BY-4.0 (open source)

Delayed Streams Modeling: Architecting Real-Time Responsiveness

Kyutai’s innovation is anchored in Delayed Streams Modeling, a technique that allows speech synthesis to begin before the full input text is available. This approach is specifically designed to balance prediction quality with response speed, enabling high-throughput streaming TTS. Unlike conventional autoregressive models that suffer from response lag, this architecture maintains temporal coherence while achieving faster-than-real-time synthesis.

The codebase and training recipe for this architecture are available at Kyutai’s GitHub repository, supporting full reproducibility and community contributions.

Model Availability and Open Research Commitment

Kyutai has released the model weights and inference scripts on Hugging Face, making it accessible for researchers, developers, and commercial teams. The permissive CC-BY-4.0 license encourages unrestricted adaptation and integration into applications, provided proper attribution is maintained.

This release supports both batch and streaming inference, making it a versatile foundation for voice cloning, real-time chatbots, accessibility tools, and more. With pretrained models in both English and French, Kyutai sets the stage for multilingual TTS pipelines.

Implications for Real-Time AI Applications

By reducing the speech generation latency to the 200ms range, Kyutai’s model narrows the human-perceptible delay between intent and speech, making it viable for:

Conversational AI: Human-like voice interfaces with low turnaround

Assistive Tech: Faster screen readers and voice feedback systems

Media Production: Voiceovers with rapid iteration cycles

Edge Devices: Optimized inference for low-power or on-device environments

The ability to serve 32 users on a single L40 GPU without quality degradation also makes it attractive for scaling speech services efficiently in cloud environments.

Conclusion: Open, Fast, and Ready for Deployment

Kyutai’s streaming TTS release is a milestone in speech AI. With high-quality synthesis, real-time latency, and generous licensing, it addresses critical needs for both researchers and real-world product teams. The model’s reproducibility, multilingual support, and scalable performance make it a standout alternative to proprietary solutions.

For more details, you can explore the official model card on Hugging Face, technical explanation on Kyutai’s site, and implementation specifics on GitHub.
The post Kyutai Releases 2B Parameter Streaming Text-to-Speech TTS with 220ms Latency and 2.5M Hours of Training appeared first on MarkTechPost.

Can We Improve Llama 3’s Reasoning Through Post-Training Alone? ASTRO Shows +16% to +20% Benchmark Gains

Improving the reasoning capabilities of large language models (LLMs) without architectural changes is a core challenge in advancing AI alignment and usability. Researchers at Meta AI and the University of Washington have introduced ASTRO—Autoregressive Search-Taught Reasoner—a novel post-training framework designed to enhance reasoning in Llama-3.1-70B-Instruct. ASTRO is unique in teaching models to perform in-context search, self-reflection, and backtracking, mechanisms often associated with human problem-solving and traditional symbolic search algorithms. Through this approach, ASTRO boosts Llama 3’s math performance on several competitive benchmarks with significant improvements:

MATH 500: 65.8% ➝ 81.8%

AMC 2023: 37.5% ➝ 64.4%

AIME 2024: 10.0% ➝ 30.0%

Search-Guided Chain-of-Thought Generation

ASTRO’s methodology begins with a Monte Carlo Tree Search (MCTS) over mathematical problem-solving trajectories. This search explores both correct and incorrect reasoning paths. The key innovation is procedure cloning: entire search trees are linearized into long chain-of-thoughts (CoT) that naturally encode both failures and recoveries via self-reflection and backtracking. These linearized traces are rewritten in natural language and used as the basis for supervised fine-tuning (SFT).

This results in a model that doesn’t just solve problems step-by-step but reevaluates its trajectory—often backtracking after self-assessment to correct intermediate reasoning mistakes. For instance, the model may interject with phrases like “Let’s go back to where we set up the equation” when its internal confidence drops.
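
Conceptually, procedure cloning flattens the search tree into one long trace that keeps failed branches and inserts explicit backtracking text. The following is a toy sketch of that linearization, not the authors' implementation; the node fields and wording are illustrative.

def linearize(node, trace):
    """Depth-first walk of a search tree, emitting one chain of thought
    that keeps failed branches and adds explicit backtracking text."""
    trace.append(node["step"])
    for child in node.get("children", []):
        linearize(child, trace)
        if not child.get("correct", False):
            # Failed branch: reflect and return to the parent step
            trace.append(f"That doesn't work. Let's go back to: {node['step']}")
    return trace

tree = {
    "step": "Set up the equation for the problem",
    "children": [
        {"step": "Try substituting x = 2", "correct": False},
        {"step": "Solve the system directly", "correct": True,
         "children": [{"step": "Compute the final answer", "correct": True}]},
    ],
}
print("\n".join(linearize(tree, [])))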

Supervised Fine-Tuning: Injecting Search Priors

ASTRO fine-tunes Llama-3.1-70B-Instruct on 36.1K curated CoT solutions from MATH, AMC/AIME, and AoPS-style datasets. The model trained with ASTRO-SFT achieves:

MATH 500: 69.6%

AMC 2023: 51.9%

AIME 2024: 16.3%

These scores are competitive with or exceed those of baseline and SPOC/Step-KTO variants trained without explicit search priors. Importantly, even SFT alone—without reinforcement learning—yields performance boosts by exposing the model to search-structured reasoning data.

Reinforcement Learning with Search-Aware Initialization

ASTRO proceeds to reinforcement learning (RL) by initializing with the SFT checkpoint and running an RL loop using a modified Group Relative Policy Optimization (GRPO). Unlike standard preference-based RL, ASTRO employs verifiable reward signals (+1 for correct, -1 for incorrect) on 8.7K moderately difficult prompts. During training, the model’s CoT generation grows longer—from ~1.8K to ~6K tokens—demonstrating deeper internal exploration.
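
The exact GRPO modifications are detailed in the paper, but the group-relative part of the update is straightforward: rewards for a group of sampled solutions to the same prompt are normalized into advantages. A minimal sketch with the +1/-1 verifiable rewards described above:

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize a group's rewards into zero-mean, unit-variance advantages."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Eight sampled solutions for one prompt: +1 if the final answer verifies, else -1
rewards = [1, -1, -1, 1, 1, -1, -1, -1]
print(group_relative_advantages(rewards))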

The resulting ASTRO-RL model achieves:

MATH 500: 81.8%

AMC 2023: 64.4%

AIME 2024: 30.0%

These results rival or exceed models with larger parameter counts and confirm the importance of ASTRO’s search-aware initialization.

Backtracking Behavior Correlates with Reasoning Success

A striking empirical observation is the positive correlation between backtracking frequency and performance. As training progresses, ASTRO-RL exhibits more self-corrective actions and deeper exploration. Pearson correlation coefficients across benchmarks exceed 0.8, indicating that self-reflection and backtracking are not merely cosmetic behaviors but functionally tied to better accuracy.
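For readers who want to run this kind of analysis themselves, the correlation is a one-line computation; the snippet below uses made-up numbers purely to illustrate it, not values from the paper.

```python
# Pearson correlation between backtracking frequency and accuracy.
# The data points here are fabricated for illustration only.

import numpy as np

backtracks_per_solution = np.array([0, 1, 1, 2, 3, 4, 5, 6])
accuracy = np.array([0.20, 0.30, 0.35, 0.45, 0.55, 0.62, 0.70, 0.78])

r = np.corrcoef(backtracks_per_solution, accuracy)[0, 1]
print(f"Pearson r = {r:.2f}")  # a strong positive correlation, as reported
```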

Comparative Insights and Broader Impact

Control experiments comparing ASTRO with models trained on direct CoT solutions (no search priors) reveal that even when trained on the same problem sets and search trees, ASTRO consistently outperforms them. For instance, ASTRO-RL beats Direct-RL by:

+2% on MATH 500

+3.9% on AMC 2023

+2.9% on AIME 2024

Moreover, ASTRO’s outputs can be visualized as directed graphs, with nodes as reasoning steps and edges capturing transitions, reflections, and corrections—facilitating better interpretability.


Conclusion

ASTRO demonstrates that LLMs like Llama 3 can learn to reason more effectively—not through larger models or longer pretraining, but via principled post-training techniques. By mimicking search algorithms in natural language, ASTRO enables models to think before answering, doubt their own steps, and correct themselves mid-reasoning. This framework sets a new benchmark for fine-tuning open LLMs to approach human-like reasoning through search-inspired behaviors.

Check out the Paper. All credit for this research goes to the researchers of this project.

A Tutorial on Using OpenAI Codex with GitHub Repositories for Seamless AI-Powered Development

When we first land in the Codex environment, it feels like stepping into a co-pilot’s seat for coding. Codex is designed to take over much of the routine or overwhelming parts of software engineering, like understanding massive codebases, drafting PRs, and finding bugs, and help us focus on higher-level thinking. In this guided setup, we explore how to connect a GitHub repository, configure a smart environment, and utilize Codex to kick-start useful engineering tasks.

As we begin, we start with this blank workspace. At this point, we haven’t linked any code or given the assistant any instructions, so it’s patiently waiting for us to define the first step. It feels clean, open, and ready for us to steer the direction of our development work.

We then proceed to select the GitHub organization and repository with which Codex will work. In this case, we chose the “teammmtp” organization and linked it to the private `ai-scribe-stories` repo. Codex smartly filters only the repositories we have access to, ensuring we don’t accidentally link the wrong one. We’re also asked whether we want to allow the agent to use the internet. We chose to leave it off for now, meaning Codex will rely solely on local dependencies and scripts. This setting is ideal when we want to maintain a secure and fully deterministic environment.

Now, we get introduced to the actual powers of Codex as a software engineering agent. It outlines four main capabilities: drafting GitHub pull requests automatically, navigating our codebase to identify bugs and suggest improvements, running lint and tests to ensure code quality, and being powered by a fine-tuned model specifically designed for understanding large repositories. At this point, we also have access to the GitHub push menu where we can choose between actions like creating PRs, copying patch code, or applying git commands, just by clicking a dropdown. This interface makes our workflow seamless and gives us fine control over how we want to ship code.

With our repo and features ready, Codex recommends a set of initial tasks to get us started. We select suggestions that include explaining the overall code structure, identifying and fixing bugs, and reviewing for minor issues such as typos or broken tests. What’s great here is that Codex helps break the ice for us, even if we’re unfamiliar with the project. These cards serve as bite-sized onboarding challenges, enabling us to quickly understand and improve the codebase while seeing Codex in action. We checked all three, signaling that we’re ready for the assistant to begin analyzing and working alongside us.

In this task dashboard, we’re asked, “What are we coding next?”, a gentle nudge that we’re now in control of what the AI focuses on. We can either create a completely custom task or select from one of the three predefined options. We notice that Codex has also enabled “Best-of-N,” a feature that generates multiple implementation suggestions for a task, allowing us to pick the one we like most. We’ve linked the agent to the `main` branch of our repository and configured the task to run in a 1x container. It’s like telling a teammate, “Here’s the branch, here’s the task, go to work.”

Now Codex starts digging into the codebase. We see a command running in the terminal that’s grepping for the word “react” in `vite.config.ts`. This step demonstrates how Codex doesn’t just make blind assumptions; it actively searches through our files, identifies references to libraries and components, and builds a picture of the tools our project is using. Watching this in real time makes the experience feel dynamic, like having an assistant that’s not just smart but also curious and methodical in its approach.

Finally, Codex delivers a detailed breakdown of the codebase and some well-thought-out suggestions for improvement. We learn that the project is built using Vite, React, TypeScript, Tailwind CSS, and shadcn-ui. It identifies our routing, styling configurations, and toast logic. It also tells us what’s missing, such as automated testing and realistic data fetching. These insights go beyond basic code reading; they help us prioritize tasks that matter and create a roadmap for evolving the project. Codex also utilizes specific file names and components in its report, demonstrating that it truly understands our structure, not just superficially, but functionally.

In conclusion, we’ve connected a GitHub repository and also unlocked an AI-powered engineering assistant that reads our code, interprets its design, and proactively suggests ways to improve it. We experienced Codex transitioning from a passive helper to an active co-developer, offering guidance, running commands, and generating summaries just like a skilled teammate would. Whether we’re improving tests, documenting logic, or cleaning up structure, Codex provides the clarity and momentum we often need when diving into unfamiliar code. With this setup, we’re now ready to build faster, debug smarter, and collaborate more efficiently with AI as our coding partner.

Crome: Google DeepMind’s Causal Framework for Robust Reward Modeling in LLM Alignment

Reward models are fundamental components for aligning LLMs with human feedback, yet they are prone to reward hacking. They often latch onto superficial attributes such as response length or formatting rather than true quality indicators like factuality and relevance. This problem arises because standard training objectives fail to differentiate between spurious correlations in the training data and the genuine causal drivers of response quality, and the failure to separate these factors leads to brittle reward models (RMs) that produce misaligned policies. What is needed is a method that uses a causal understanding of preference formation to train RMs that are sensitive to genuine quality attributes and invariant to spurious cues.

Limitations of Existing RM Approaches and the Need for Causal Robustness

Existing methods try to counter reward hacking in standard RLHF systems that rely on Bradley-Terry or pairwise ranking objectives. These include architectural modifications such as Odin, policy-level adjustments, and data-centric methods involving ensembles or consistency checks. More recent causal-inspired methods use MMD regularization against pre-specified spurious factors or estimate causal effects through corrected rewrites. However, these approaches target only predetermined spurious factors and miss unknown correlates; augmentation strategies remain coarse, and evaluation-focused methods fail to equip reward models with training mechanisms that are robust to diverse spurious variations.

Introducing Crome: Causally Robust Reward Modeling for LLMs

Researchers from Google DeepMind, McGill University, and MILA – Quebec AI Institute have proposed Crome (Causally Robust Reward Modeling), a framework built on an explicit causal model of answer generation. Crome trains RMs to differentiate genuine quality drivers from superficial cues by augmenting preference datasets with targeted, LLM-generated counterfactual examples. Specifically, it creates two types of synthetic training pairs: (a) Causal Augmentations, which introduce changes along specific causal attributes such as factuality to enforce sensitivity to true quality shifts, and (b) Neutral Augmentations, which enforce invariance along spurious attributes such as style by using tie-labels. Crome improves robustness, increasing RewardBench accuracy by up to 4.5%, with notable gains in safety and reasoning.
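The sketch below illustrates what these two pair types might look like as training records. The prompt, responses, and field names are hypothetical; in the paper the counterfactuals are generated by an LLM rather than written by hand.

```python
# Illustrative sketch of the two kinds of synthetic preference pairs Crome
# adds to the training set. The prompt templates and record layout are
# assumptions for illustration; the paper generates counterfactuals with an
# LLM rather than with hand-written strings.

from typing import Dict

def causal_pair(prompt: str, answer: str, degraded_answer: str) -> Dict:
    """Degrade a causal attribute (e.g., factuality) -> strict preference."""
    return {"prompt": prompt, "chosen": answer, "rejected": degraded_answer,
            "label": "preference"}

def neutral_pair(prompt: str, answer: str, restyled_answer: str) -> Dict:
    """Change only a spurious attribute (e.g., style) -> tie label."""
    return {"prompt": prompt, "response_a": answer, "response_b": restyled_answer,
            "label": "tie"}

prompt = "When did the Apollo 11 moon landing happen?"
good = "Apollo 11 landed on the Moon on July 20, 1969."
wrong_fact = "Apollo 11 landed on the Moon on July 20, 1972."             # factuality degraded
restyled = "The Apollo 11 Moon landing took place on **July 20, 1969**."  # style changed only

dataset = [causal_pair(prompt, good, wrong_fact),
           neutral_pair(prompt, good, restyled)]
```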

Technical Approach: Counterfactual Augmentation and Composite Loss Optimization

Crome operates in two main phases: generating attribute-aware counterfactual data based on a causal model, and training the reward model with a specialized loss on the combined data. The paper also provides a theoretical analysis of how causal augmentation isolates true reward drivers from spurious correlates under an idealized model. Crome uses the UltraFeedback dataset with counterfactuals generated by Gemini 2.0 Flash, and evaluates performance on RewardBench and reWordBench. The researchers experiment with diverse base LLMs, including Gemma-2-9B-IT, Qwen2.5-7B, and Gemma-2-2B, for both Pairwise Preference and Bradley-Terry reward models, and measure downstream alignment impact through Best-of-N selection on multiple tasks.
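As a rough illustration of the "specialized loss on the combined data," here is one plausible composite objective: a Bradley-Terry term on strict preference pairs plus a tie term that pulls rewards together for neutral pairs. This is an assumed simplification, not Crome’s exact loss.

```python
# A minimal sketch of a composite reward-model loss over the combined data:
# a Bradley-Terry term for strict preference pairs and a tie term that keeps
# reward scores close for neutral (spurious-only) pairs. This is an assumed
# simplification of Crome's training objective, not the paper's exact loss.

import torch
import torch.nn.functional as F

def composite_loss(r_chosen, r_rejected, r_tie_a, r_tie_b, tie_weight=1.0):
    # Bradley-Terry: the chosen response should score higher than the rejected one.
    bt_loss = -F.logsigmoid(r_chosen - r_rejected).mean()
    # Tie term: responses differing only in spurious attributes should
    # receive (near-)equal rewards.
    tie_loss = (r_tie_a - r_tie_b).pow(2).mean()
    return bt_loss + tie_weight * tie_loss

# Example with dummy reward scores from a batch.
r_chosen   = torch.tensor([1.2, 0.8])
r_rejected = torch.tensor([0.3, 0.9])
r_tie_a    = torch.tensor([0.5, 1.1])
r_tie_b    = torch.tensor([0.6, 1.0])
print(composite_loss(r_chosen, r_rejected, r_tie_a, r_tie_b))
```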

Performance Gains: From RewardBench to WildGuardTest

On RewardBench, Crome achieves improvements in ranking accuracy over RRM across diverse base models, with significant gains in the Safety (up to 13.18%) and Reasoning (up to 7.19%) categories. It posts aggregate accuracy gains of up to 9.1% on reWordBench with Gemma-2-9B-IT in PairPM settings and superior performance on 21 out of 23 transformations. Moreover, its ranking accuracy drops less from RewardBench to reWordBench than RRM’s does (19.78% versus 21.54%). On WildGuardTest with Best-of-N selection, Crome delivers strong safety improvements, achieving a lower attack success rate on harmful prompts while maintaining similar refusal rates on benign prompts.

Conclusion and Future Directions in Causal Data Augmentation

In conclusion, the researchers introduced Crome, a causal framework that addresses reward hacking during RM training. It employs two targeted synthetic data augmentation strategies: Causal Augmentations and Neutral Augmentations. Crome outperforms strong baselines across multiple base models and reward modeling techniques on RewardBench, and shows superior robustness to spurious correlations on reWordBench. This dataset-curation-centered training method (i.e., Crome) opens new research directions in synthetic data generation for base model training, where causal attribute verification could prove highly beneficial for future developments in robust language model alignment.

Check out the Paper. All credit for this research goes to the researchers of this project.

Thought Anchors: A Machine Learning Framework for Identifying and Measuring Key Reasoning Steps in Large Language Models with Precision

Understanding the Limits of Current Interpretability Tools in LLMs

AI models, such as DeepSeek and GPT variants, rely on billions of parameters working together to handle complex reasoning tasks. Despite their capabilities, one major challenge is understanding which parts of their reasoning have the greatest influence on the final output. This is especially crucial for ensuring the reliability of AI in critical areas, such as healthcare or finance. Current interpretability tools, such as token-level importance or gradient-based methods, offer only a limited view. These approaches often focus on isolated components and fail to capture how different reasoning steps connect and impact decisions, leaving key aspects of the model’s logic hidden.

Thought Anchors: Sentence-Level Interpretability for Reasoning Paths

Researchers from Duke University and Aiphabet introduced a novel interpretability framework called “Thought Anchors.” This methodology specifically investigates sentence-level reasoning contributions within large language models. To facilitate widespread use, the researchers also developed an accessible, detailed open-source interface at thought-anchors.com, supporting visualization and comparative analysis of internal model reasoning. The framework comprises three primary interpretability components: black-box measurement, white-box receiver-head analysis, and causal attribution. These approaches target different aspects of reasoning, providing comprehensive coverage of model interpretability. Thought Anchors explicitly measure how each reasoning step affects model responses, thus delineating meaningful reasoning flows throughout the internal processes of an LLM.

Evaluation Methodology: Benchmarking on DeepSeek and the MATH Dataset

The research team detailed three interpretability methods in their evaluation. The first approach, black-box measurement, employs counterfactual analysis by systematically removing sentences within reasoning traces and quantifying their impact. For instance, the study demonstrated sentence-level accuracy assessments by running analyses over a substantial evaluation dataset, encompassing 2,000 reasoning tasks, each producing 19 responses. They utilized the DeepSeek Q&A model, which features approximately 67 billion parameters, and tested it on a specifically designed MATH dataset comprising around 12,500 challenging mathematical problems. Second, receiver head analysis measures attention patterns between sentence pairs, revealing how previous reasoning steps influence subsequent information processing. The study found significant directional attention, indicating that certain anchor sentences strongly guide subsequent reasoning steps. Third, the causal attribution method assesses how suppressing the influence of specific reasoning steps impacts subsequent outputs, thereby clarifying the precise contribution of internal reasoning elements. Combined, these techniques produced precise analytical outputs, uncovering explicit relationships between reasoning components.
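A minimal sketch of the black-box measurement is shown below: a sentence’s importance is estimated by resampling answers with and without that sentence in the reasoning trace and comparing accuracy. The `generate` and `is_correct` callables are placeholders for whatever model API and answer checker you use; they are assumptions, not part of the paper’s released code.

```python
# Sketch of the black-box, counterfactual measurement: estimate each
# sentence's importance by removing it from the reasoning trace, resampling
# completions, and comparing accuracy with vs. without that sentence.
# `generate` and `is_correct` are placeholder callables supplied by the user.

from typing import Callable, List

def sentence_importance(prompt: str,
                        sentences: List[str],
                        generate: Callable[[str], str],
                        is_correct: Callable[[str], bool],
                        n_samples: int = 19) -> List[float]:
    """Return, for each sentence, the drop in accuracy when it is removed."""
    def accuracy(trace: List[str]) -> float:
        context = prompt + "\n" + " ".join(trace)
        hits = sum(is_correct(generate(context)) for _ in range(n_samples))
        return hits / n_samples

    base = accuracy(sentences)
    scores = []
    for i in range(len(sentences)):
        ablated = sentences[:i] + sentences[i + 1:]
        scores.append(base - accuracy(ablated))  # larger drop => more important
    return scores
```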

Quantitative Gains: High Accuracy and Clear Causal Linkages

Applying Thought Anchors, the research group demonstrated notable improvements in interpretability. Black-box analysis achieved robust performance metrics: for each reasoning step within the evaluation tasks, the research team observed clear variations in impact on model accuracy. Specifically, correct reasoning paths consistently achieved accuracy levels above 90%, significantly outperforming incorrect paths. Receiver head analysis provided evidence of strong directional relationships, measured through attention distributions across all layers and attention heads within DeepSeek. These directional attention patterns consistently guided subsequent reasoning, with receiver heads demonstrating correlation scores averaging around 0.59 across layers, confirming the interpretability method’s ability to effectively pinpoint influential reasoning steps. Moreover, causal attribution experiments explicitly quantified how reasoning steps propagated their influence forward. Analysis revealed that causal influences exerted by initial reasoning sentences resulted in observable impacts on subsequent sentences, with a mean causal influence metric of approximately 0.34, further solidifying the precision of Thought Anchors.

Also, the research addressed another critical dimension of interpretability: attention aggregation. Specifically, the study analyzed 250 distinct attention heads within the DeepSeek model across multiple reasoning tasks. Among these heads, the research identified that certain receiver heads consistently directed significant attention toward particular reasoning steps, especially during mathematically intensive queries. In contrast, other attention heads exhibited more distributed or ambiguous attention patterns. The explicit categorization of receiver heads by their interpretability provided further granularity in understanding the internal decision-making structure of LLMs, potentially guiding future model architecture optimizations.
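To illustrate the receiver-head idea, the sketch below aggregates token-level attention into a per-sentence "received attention" score from later sentences. The shapes and the aggregation rule are simplified assumptions for illustration, not the paper’s exact procedure.

```python
# Sketch of receiver-head style aggregation: given a token-level attention
# matrix and a token-to-sentence mapping, measure how much attention later
# sentences direct back at each earlier sentence. Simplified for illustration.

import numpy as np

def sentence_received_attention(attn: np.ndarray, sent_ids: np.ndarray) -> np.ndarray:
    """attn: (seq, seq) attention weights for one head (rows = queries).
    sent_ids: (seq,) sentence index of each token.
    Returns, per sentence, the mean attention it receives from tokens of
    *later* sentences."""
    n_sent = int(sent_ids.max()) + 1
    received = np.zeros(n_sent)
    for s in range(n_sent):
        key_mask = sent_ids == s      # tokens belonging to sentence s
        query_mask = sent_ids > s     # tokens from later sentences
        if query_mask.any() and key_mask.any():
            received[s] = attn[np.ix_(query_mask, key_mask)].mean()
    return received

# Toy example: 6 tokens spread over 3 sentences, uniform attention weights.
attn = np.full((6, 6), 1.0 / 6)
sent_ids = np.array([0, 0, 1, 1, 2, 2])
print(sentence_received_attention(attn, sent_ids))
```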

Key Takeaways: Precision Reasoning Analysis and Practical Benefits

Thought Anchors enhance interpretability by focusing specifically on internal reasoning processes at the sentence level, substantially outperforming conventional activation-based methods.

Combining black-box measurement, receiver head analysis, and causal attribution, Thought Anchors deliver comprehensive and precise insights into model behaviors and reasoning flows.

The application of the Thought Anchors method to the DeepSeek Q&A model (with 67 billion parameters) yielded compelling empirical evidence, characterized by a strong correlation (mean attention score of 0.59) and a causal influence (mean metric of 0.34).

The open-source visualization tool at thought-anchors.com provides significant usability benefits, fostering collaborative exploration and improvement of interpretability methods.

The study’s extensive attention head analysis (250 heads) further refined the understanding of how attention mechanisms contribute to reasoning, offering potential avenues for improving future model architectures.

Thought Anchors’ demonstrated capabilities establish strong foundations for utilizing sophisticated language models safely in sensitive, high-stakes domains such as healthcare, finance, and critical infrastructure.

The framework proposes opportunities for future research in advanced interpretability methods, aiming to refine the transparency and robustness of AI further.

Check out the Paper and Interaction. All credit for this research goes to the researchers of this project.