How CBRE powers unified property management search and digital assista …

This post was written with Lokesha Thimmegowda, Muppirala Venkata Krishna Kumar, and Maraka Vishwadev of CBRE.
CBRE is the world’s largest commercial real estate services and investment firm. The company serves clients in more than 100 countries and offers services ranging from capital markets and leasing advisory to investment management, project management and facilities management.
CBRE uses AI to improve commercial real estate solutions with advanced analytics, automated workflows, and predictive insights. The chance to unlock value with AI in the commercial real estate lifecycle begins with data at scale. With the industry’s largest dataset and a comprehensive suite of enterprise-grade technology, the company has implemented a range of AI solutions to boost individual productivity and support broad-scale transformation.
This blog post describes how CBRE and AWS partnered to transform how property management professionals access information, creating a next-generation search and digital assistant experience that unifies access across many types of property data using Amazon Bedrock, Amazon OpenSearch Service, Amazon Relational Database Service, Amazon Elastic Container Service, and AWS Lambda.
Unified property management search challenges
CBRE’s proprietary PULSE system consolidates a wide range of essential property data—covering structured data from relational databases that record transactions and unstructured data stored in document repositories containing everything from lease agreements to property inspections. In the past, property management professionals had to sift through millions of documents and switch between multiple different systems to locate property maintenance details. Data was scattered across 10 distinct sources and four separate databases, which made it hard to get complete answers. This fragmented setup reduced productivity and made it difficult to uncover key insights about property operations.
Property management professionals, who are experts in their domain rather than database syntax, needed to ask complex questions in natural language, quickly synthesize disparate information, and avoid manually reviewing lengthy documents.
The challenge: deliver an intuitive, unified search solution bridging structured and unstructured content, with robust security, enterprise-grade performance and reliability.
Solution architecture
CBRE implemented a global search solution within PULSE, powered by Amazon Bedrock, to address these challenges. The search architecture is designed for a seamless, intelligent, and secure information retrieval experience across diverse data types. It orchestrates an interplay of user interaction, AI-driven processing, and robust data storage.
CBRE’s PULSE search solution uses Amazon Bedrock for the rapid deployment of generative AI capabilities by using multiple foundation models through a single API. CBRE’s implementation uses Amazon Nova Pro for SQL query generation, achieving a 67% reduction in processing time, while Claude Haiku powers intelligent document interactions. The solution maintains enterprise-grade security for all property data. By combining Amazon Bedrock capabilities with Retrieval Augmented Generation (RAG) and Amazon OpenSearch Service, CBRE created a unified search experience across more than eight million documents and multiple databases, fundamentally transforming how property professionals access and analyze business-critical information.
The following diagram illustrates the architecture for the solution that CBRE implemented in AWS:

Let's walk through the solution flow:

Property Manager and PULSE UI: Property managers interact through the intuitive PULSE user interface, which serves as the gateway for both traditional keyword searches and natural language queries (NLQ). The UI displays search results, supports document conversations, and presents intelligent summaries on desktop and mobile.
Dynamic search execution: When users submit requests, the system first retrieves user-specific permissions from Amazon ElastiCache for Redis, chosen for its low latency and high throughput. Search operations across Amazon OpenSearch and transactional databases are then constrained by these user-specific permissions, making sure users only access authorized results with real-time granular control.
Orchestration layer: This central control hub serves as the application’s brain, receiving user requests from PULSE UI and intelligently routing them to appropriate backend services. Key responsibilities include:

Routing queries to relevant data systems (structured databases, unstructured documents, or both for deep search).
Initiating parallel searches across SQL Interact and Doc Interact components.
Merging, de-duplicating, and ranking results from disparate sources for unified outcomes.
Managing conversation history through Amazon DynamoDB integration.

SQL interact component (structured data search): This pathway manages interactions with structured relational databases (RDBMS) through these key steps:

4.1 Database metadata retrieval: Dynamically fetches schema details (for example, table names, column names, data types, relationships, constraints) for entities like property, contacts, and tenants from an Amazon OpenSearch index.
4.2 Amazon Bedrock LLM (Amazon Nova Pro): Interprets the user's natural language query alongside schema metadata, translating it into accurate, optimized SQL queries tailored to the database. The solution reduced average SQL query generation time from 12 seconds to 4 seconds using Amazon Nova Pro.
4.3 RDBMS systems (PostgreSQL, MS SQL): Actual transactional databases, such as PostgreSQL and MS SQL, which house the core structured property management data (for example, properties, contacts, tenants, K2 forms). They execute the LLM-generated SQL queries and return the structured tabular results back to the SQL Interact component.

DocInteract Component (Unstructured Document Search): This pathway is specifically designed for intelligent search and interaction with unstructured documents.

5.1 Vector Store (OpenSearch Cluster): Stores documents, including those from OpenText, as high-dimensional vectors for efficient semantic search using techniques like k-Nearest Neighbors while prioritizing speed and accuracy with metadata filtering.
5.2 Amazon Bedrock LLM (Claude Haiku): Interprets NLQs and translates them into optimized OpenSearch DSL queries, while powering the “Chat With AI” feature for direct document interaction, generating concise, conversational responses including answers, summaries, and natural dialogue.

Having established the core architecture with both SQL Interact and DocInteract components, the following sections explore the specific optimizations and innovations implemented for each data type, beginning with structured data search enhancements.
Structured data search
Building on the SQL interact component outlined in the architecture, the PULSE Search application offers two search methods for accessing structured data in PostgreSQL and MS SQL. Keyword Search scans fields and schemas for specific terms, providing comprehensive coverage of the entire data system. Natural Language Query (NLQ) Search lets users interact with the databases using everyday language, translating their questions into SQL queries. Both methods help property managers efficiently locate and retrieve information across the database modules.
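To illustrate the NLQ path, the following minimal sketch shows how a natural language question and schema metadata might be sent to Amazon Nova Pro through the Amazon Bedrock Converse API. The model ID, Region, prompt wording, and example schema are illustrative assumptions rather than CBRE's actual implementation.

import boto3

# Hedged sketch: translate a natural language question into SQL with Amazon Nova Pro.
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

schema_metadata = """
Table: tenants (tenant_id INT, tenant_name VARCHAR, property_id INT, lease_end DATE)
Table: properties (property_id INT, property_name VARCHAR, city VARCHAR)
"""  # assumed example schema, not the PULSE schema

question = "List tenants whose leases expire in the next 90 days for properties in Dallas"

response = bedrock.converse(
    modelId="amazon.nova-pro-v1:0",  # assumed model ID; confirm availability in your Region
    system=[{"text": "You translate questions into valid PostgreSQL. Return only the SQL."}],
    messages=[{
        "role": "user",
        "content": [{"text": f"Schema:\n{schema_metadata}\n\nQuestion: {question}"}],
    }],
    inferenceConfig={"maxTokens": 512, "temperature": 0},
)

generated_sql = response["output"]["message"]["content"][0]["text"]
print(generated_sql)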
Database layer search performance enhancement at the SQL level
Our unique challenge involved implementing application-wide keyword searches that needed to scan across all columns in database tables, an unconventional requirement compared to traditional indexed, column-specific searches in RDBMS systems. This universal search capability was essential for user experience, allowing information discovery without knowing specific column names or data structures.
We leveraged native full-text search capabilities in both PostgreSQL and MS SQL Server databases:

PostgreSQL Implementation:

SELECT * FROM dbo.pg_db_view_name bd WHERE textsearchable_all_col @@ to_tsquery('english', 'keyword')

Microsoft SQL Server Implementation:

SELECT * FROM [dbo].ms_db_view_name WHERE CONTAINS(*, '8384F')

Note: Our implementation uses a specialized text search column (textsearchable_all_col) that concatenates the searchable fields of the view pg_db_view_name, while ms_db_view_name represents a view created with full-text search indexing.
This optimization delivered an 80% improvement in query performance by harnessing native database capabilities while balancing comprehensive search coverage with optimal database performance through specialized indexing algorithms.
Database layer search performance enhancement at the SQL interact API level
We implemented several optimizations in database search functionality targeting three key performance indicators (KPIs): Accuracy (precision of results), Consistency (reproducible outcomes), and Relevancy (alignment of results with user intent). The enhancements reduced response latency while simultaneously boosting these ACR metrics, resulting in faster and more dependable search results.
Prompt Engineering Changes: We implemented a comprehensive approach to prompt management and optimization, focusing on the following factors.

Configurability: We implemented modular prompt templates stored in external files to enable version control, simplified management, and reduced prompt size, improving performance and maintainability.
Dynamic field selection for context window reduction: The system uses KNN-based similarity search to filter and select only the most relevant schema fields aligned with user intent, reducing context window size and optimizing prompt effectiveness.
Dynamic few-shot examples: The system intelligently selects the most relevant few-shot example from a configuration file using KNN-based similarity search for SQL generation. This context-aware approach makes sure that only the most pertinent example is included in the prompt, minimizing unnecessary data overhead and helping the LLM generate consistent, accurate SQL.
Business rule integration: The system maintains a centralized repository of business rules in a dedicated, schema-wise configuration file, making rule management and updates streamlined and efficient. During prompt generation, relevant business rules are dynamically integrated into prompts, facilitating consistency in rule application while providing flexibility for updates and maintenance.
LLM score-based relevancy: We added a fourth LLM call to evaluate and reorder schema relevance after the initial KNN retrieval, addressing cases where vector search returned irrelevant or poorly ordered schemas. For example, when processing a user query about property or contact information, the vector search might return three schemas, but:

The third schema might be irrelevant to the query.
The ordering of the two relevant schemas might not reflect their true relevancy to the query.
To address these challenges, we introduced an additional LLM processing step (a fourth, parallel LLM call) that:

Evaluates the relevance of each schema to the user query.
Assigns relevancy scores to determine schema importance.
Reorders schemas based on their actual relevance to the query.
This enhancement improved our schema selection process by:

Making sure only truly relevant schemas are selected.
Maintaining proper relevancy ordering.
Providing more accurate context for subsequent query processing.

The result was more precise, contextually appropriate responses and improved overall application performance.
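The following is a minimal sketch of this scoring-and-reordering step. It assumes a hypothetical score_with_llm helper that asks an Amazon Bedrock model to return a numeric relevance score; the 0-10 scale and the cutoff threshold are illustrative assumptions.

from typing import Callable, Dict, List

def rerank_schemas(user_query: str, candidate_schemas: List[Dict], score_with_llm: Callable[[str], str]) -> List[Dict]:
    """Score KNN-retrieved schemas against the user query with an LLM, then filter and reorder them."""
    scored = []
    for schema in candidate_schemas:
        prompt = (
            f"User query: {user_query}\n"
            f"Schema name: {schema['name']}\n"
            f"Schema description: {schema['description']}\n"
            "On a scale of 0-10, how relevant is this schema to the query? Return only the number."
        )
        score = float(score_with_llm(prompt))  # one LLM call per candidate schema
        scored.append((score, schema))

    # Keep only truly relevant schemas and order them by the LLM-assigned score.
    relevant = [(score, schema) for score, schema in scored if score >= 5.0]
    relevant.sort(key=lambda pair: pair[0], reverse=True)
    return [schema for _, schema in relevant]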
Parallel LLM inference for SQL generation with Amazon Nova Pro
We implemented a comprehensive parallel processing architecture for NLQ to SQL conversion, enhancing system performance and efficiency. The solution introduces concurrent schema-based API calls to the LLM inference engine, with asynchronous processing for multiple schema evaluations. Our security-first approach authenticates and validates user entitlements while performing context-aware schema identification that incorporates similarity search and enforces access permissions. The system only processes schemas for which the user has explicit authorization, facilitating foundational data security. Following authentication, the system dynamically generates prompts (as detailed in our prompt engineering framework) and initiates concurrent processing of the most relevant schemas through parallel LLM inference calls. Before execution, it enhances the generated SQL queries with mandatory security joins that enforce building-level access controls, restricting users to their authorized buildings only.
Finalized SQL queries are executed on respective database systems (PostgreSQL or SQL Server). The system processes the query results and returns them as a structured API response, maintaining security and data integrity throughout the entire workflow. This architecture facilitates both optimal performance through parallel processing and comprehensive security through multi-layered access controls.
This integrated approach incorporates concurrent validation of generated SQL queries, reducing processing time and improving system throughput, while Amazon Nova Pro significantly reduced inference latency. The framework's architecture facilitates efficient resource utilization while maintaining high accuracy in SQL query generation, making it particularly effective for complex database operations and high-volume query processing requirements.
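As a rough illustration of the concurrency pattern (not the production code), the sketch below fans one SQL-generation call per authorized schema out to a thread pool. The generate_sql_for_schema callable is a hypothetical wrapper around the Bedrock call shown earlier, and the security-join step is reduced to a comment.

from concurrent.futures import ThreadPoolExecutor, as_completed

def generate_sql_in_parallel(user_query, authorized_schemas, generate_sql_for_schema):
    """Run one NLQ-to-SQL generation call per authorized schema concurrently."""
    results = {}
    with ThreadPoolExecutor(max_workers=len(authorized_schemas)) as pool:
        futures = {
            pool.submit(generate_sql_for_schema, user_query, schema): schema["name"]
            for schema in authorized_schemas
        }
        for future in as_completed(futures):
            schema_name = futures[future]
            sql = future.result()
            # Before execution, append the mandatory security join that restricts
            # results to the user's authorized buildings.
            results[schema_name] = sql
    return results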

Enhancing unstructured data search
The PULSE document search uses two main methods, enhanced by purpose-built specialized search functions. Users can use the streamlined Keyword Search to precisely locate terms within documents and metadata for fast retrieval when precise search terms are known. This straightforward approach makes sure users can quickly locate exact matches across the entire document landscape. The second method, Natural Language Query (NLQ) Search, supports interaction with documents using everyday language, interpreting intent and converting queries into search parameters—particularly powerful for complex or concept-based queries. Complementing these core search methods, the system offers specialized search capabilities including Favorites and Collections search so users can efficiently navigate their personally curated document sets and shared collections. Additionally, the system provides intelligent document upload search functionality that helps users quickly locate appropriate document categories and upload locations based on document types and property contexts.
The search infrastructure supports comprehensive file formats including PDFs, Microsoft Office documents (Word, Excel, PowerPoint), emails (MSG), images (JPG, PNG), text files, HTML files, and various other document types, facilitating comprehensive coverage across the document categories in the property management environment.
Prompt engineering and management optimization
Our Document Search system incorporates advanced prompt engineering techniques to enhance search accuracy, efficiency, and maintainability. Let’s explore the key features of our prompt management system and the value they bring to the search experience.
Two-stage prompt architecture and modular prompt management:
At the core of our system is a two-stage prompt architecture. This design separates tool selection from task execution for more efficient and accurate query processing.

# Modular prompt loading from configuration
import json

get_doc_detect_prompt = get_prompts("doc_prompts/tool_detect/Get_Document_data_detect")
get_doc_prompt = get_prompts("doc_prompts/prepare_prompt/Get_Document_data_prompt")
keyword_search_detect_prompt = get_prompts("doc_prompts/tool_detect/keyword_search_detect")
# fav_collection_detect_prompt and upload_document_detect_prompt are loaded the same way

def detect_tool(user_prompt):
    # Map each tool name to the prompt that describes when it should be used.
    tool_descriptions = {
        "Get_Document_data": get_doc_detect_prompt,
        "keyword_search": keyword_search_detect_prompt,
        "Get_Favdocs_collections": fav_collection_detect_prompt,
        "upload_documents": upload_document_detect_prompt
    }

    messages = [
        {"role": "system", "content": "You are an AI assistant that determines the most appropriate tool..."},
        {"role": "user", "content": f"Here are the tool descriptions:\n{json.dumps(tool_descriptions, indent=2)}\n\nUser query: {user_prompt}\n\nWhich tool should be used?"}
    ]
    return messages  # sent to the LLM, which replies with the selected tool name

This architecture reduces token usage by up to 60% by loading only necessary prompts per query processing stage. The lightweight initial stage quickly routes queries to appropriate tools, while specialized prompts handle the actual execution with focused context, improving both performance and accuracy in tool selection and query execution.
Our modular prompt management system stores prompts in external configuration files for dynamic loading based on context and supporting personalization. It supports prompt updates without code deployments, cutting update cycles from hours to minutes. This architecture facilitates A/B testing of different prompt variations and quick rollbacks, enhancing system adaptability and reliability.
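For illustration, a get_prompts helper could be as simple as the sketch below, assuming each template lives in its own text file under a prompts/ directory keyed by the path-style names used above; the directory layout and .txt extension are assumptions, not the actual PULSE implementation.

from functools import lru_cache
from pathlib import Path

PROMPT_ROOT = Path("prompts")  # assumed location of the external prompt files

@lru_cache(maxsize=None)
def get_prompts(prompt_key: str) -> str:
    """Load a prompt template from an external file so it can be updated without a code deployment."""
    return (PROMPT_ROOT / f"{prompt_key}.txt").read_text(encoding="utf-8")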

def prepare_tool_prompt(detected_tool, userid):
    tool_prompts = {
        "keyword_search": keyword_search_prompt,
        "Get_Document_data": get_doc_prompt.replace("userid", userid),
        "upload_documents": upload_document_prompt,
        "Get_Favdocs_collections": fav_collection_prompt
    }
    return tool_prompts[detected_tool]

The system implements context-aware prompt selection, adapting to query types, document characteristics, and search contexts. This approach makes sure that the most appropriate prompt and query structure are used for each unique search scenario. For example, the system distinguishes between different question types (such as 'list_question') to tailor processing for various query intents.
Search algorithm optimization
Our document search system implements search algorithms that combine vector-based semantic search with traditional text-based approaches to search across document metadata and content. We use different query strategies optimized for specific search scenarios.
Keyword search:
Keyword search uses a dual strategy combining both metadata and content searches using phrase matching. A fixed query template structure facilitates efficiency and consistency, incorporating predefined metadata, content, permission rules, and building ID constraints, while dynamically integrating user-specific terms and roles. This approach allows for fast and reliable searches while maintaining proper access controls and relevance.
User queries like “lease agreement” or “property tax 2023” are parsed into component words, each requiring a match in the document content for relevancy, facilitating precise results.

"bool": {
    "must": [
        {"match_phrase": {"srccontent": word}} for word in search_words
    ]
}

Similarly, for metadata searches, the system uses phrase searching across metadata fields:

"multi_match": {
    "query": search_words,
    "type": "phrase",
    "fields": ["srcmetadata"]
}

This approach provides exact matching capabilities across document metadata, facilitating precise results when users are searching for specific document properties. The system executes both search types concurrently and results from both searches are then merged and deduplicated, with scoring normalized across both result sets.
Natural language query search:
Our NLQ search combines LLM-generated queries with vector-based semantic search through two main components. The metadata search uses an LLM to generate OpenSearch queries from natural language input. For instance, “Find lease agreements mentioning early termination for tech companies from last year” is transformed into a structured query that searches across document types, dates, property names and other metadata fields.
For content searches, we employ KNN vector search with a K-factor of 5 to identify semantically similar content. The system converts queries into vector embeddings and executes both metadata and content searches simultaneously, combining results while minimizing duplicates.
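A minimal sketch of this content-side search is shown below: the query is embedded with Amazon Titan Text Embeddings V2 and run as a k-NN query (k=5) against OpenSearch. The endpoint, index name, and vector field name are illustrative assumptions.

import json
import boto3
from opensearchpy import OpenSearch

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")
opensearch = OpenSearch(hosts=[{"host": "my-opensearch-endpoint", "port": 443}], use_ssl=True)

query_text = "lease agreements mentioning early termination for tech companies from last year"

# Convert the natural language query into a vector embedding.
embedding_response = bedrock.invoke_model(
    modelId="amazon.titan-embed-text-v2:0",
    body=json.dumps({"inputText": query_text}),
)
query_vector = json.loads(embedding_response["body"].read())["embedding"]

# k-NN content search returning the 5 most semantically similar chunks.
knn_query = {
    "size": 5,
    "query": {
        "knn": {
            "content_embedding": {  # assumed vector field name
                "vector": query_vector,
                "k": 5,
            }
        }
    },
}

results = opensearch.search(index="pulse-documents", body=knn_query)  # assumed index name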
Chat with Document (digital assistant for in-depth document interaction):
The Chat with Document feature supports natural conversation with specific documents after initial search. Users can ask questions, request summaries, or seek specific information from selected documents through a straightforward interaction process.
When engaged, the system retrieves the complete document content using its node identifier and processes user queries through a streamlined pipeline. Each query is handled by an LLM using carefully constructed prompts that combine the user’s question with relevant document context.
With this capability users can extract information from complex documents efficiently. For example, property managers can quickly understand lease terms or payment schedules without manually scanning lengthy agreements. The feature provides instant summaries and explanations for rapid information access and decision-making in document-intensive workflows.
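A simplified sketch of this pipeline follows, assuming the document text has already been retrieved by its node identifier; the Claude 3 Haiku model ID and prompt wording are illustrative, not the production prompts.

import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def chat_with_document(document_text: str, user_question: str) -> str:
    """Answer a question about a single retrieved document with Claude 3 Haiku on Amazon Bedrock."""
    response = bedrock.converse(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",
        messages=[{
            "role": "user",
            "content": [{"text": f"Document:\n{document_text}\n\nQuestion: {user_question}"}],
        }],
        inferenceConfig={"maxTokens": 1024, "temperature": 0.2},
    )
    return response["output"]["message"]["content"][0]["text"]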
Scaling document ingestion
To handle high-throughput document processing and large-scale enterprise ingestion, our ingestion pipeline uses asynchronous Amazon Textract for scalable, parallel text extraction. The architecture efficiently processes diverse file types (PDFs, PPTs, Word documents, Excel files, and images), even those with hundreds of pages or high-resolution content. Once a document is uploaded to an Amazon S3 bucket, a message is sent to an SQS queue, which invokes a Lambda function that initiates an asynchronous Textract job, offloading heavy extraction and OCR tasks without blocking execution.
For text documents, the system reads the file from Amazon S3 and submits it to Amazon Textract's asynchronous API, which processes the document in the background. Once the job completes, the results are retrieved and parsed to extract structured text. This text is then chunked intelligently, based on token count or semantic boundaries, and passed through an Amazon Bedrock embedding model (for example, Amazon Titan Text Embeddings V2). Each chunk is enriched with metadata and indexed into Amazon OpenSearch Service for fast, context-aware search. Once ingested, our intelligent query strategy, driven by user and CBRE market lookups, dynamically directs searches to the relevant OpenSearch indexes.
Image files follow a similar flow but use Amazon Bedrock Claude 3 Haiku for OCR after base64 conversion. Extracted text is then chunked, embedded, and indexed like standard text documents.
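A condensed sketch of the Lambda handler that starts the asynchronous extraction is shown below. The SQS and S3 event shapes follow the standard AWS event formats, while the SNS topic, IAM role ARN, and omission of the downstream chunking and indexing steps are simplifying assumptions.

import json
import boto3

textract = boto3.client("textract")

def handler(event, context):
    """Triggered by the SQS queue; starts one asynchronous Textract job per uploaded document."""
    for record in event["Records"]:
        body = json.loads(record["body"])
        s3_record = body["Records"][0]["s3"]
        bucket = s3_record["bucket"]["name"]
        key = s3_record["object"]["key"]

        textract.start_document_text_detection(
            DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
            NotificationChannel={  # Textract publishes job completion here (assumed ARNs)
                "SNSTopicArn": "arn:aws:sns:us-east-1:123456789012:textract-complete",
                "RoleArn": "arn:aws:iam::123456789012:role/TextractPublishRole",
            },
        )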
Security and access control
User authentication and authorization occurs through a multi-layered security process:

Access token validation: The system verifies the user's identity in Microsoft B2C and validates their access token on each request. The user is also checked for authorization to access the application.
Entitlement verification: Simultaneously, the system checks the user's permissions in a Redis database to verify they have the appropriate access rights to specific application modules and the database schemas (entitlements) they are authorized to query.
Property access validation: The system also retrieves the user's authorized building list from the Redis database (the list of building IDs to which the user is mapped), making sure they can only access data for properties within their business portfolio.

This parallel validation process facilitates secure, appropriate access while maintaining optimal performance through Redis's high-speed data retrieval. Redis is populated during application load from the user entitlement and building mappings maintained in the database. If the user's details are not found in Redis, an API is invoked to replenish the Redis database.
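A minimal sketch of the building-level lookup with a cache-miss fallback is shown below; the key naming scheme, the use of a Redis set, and the replenish_from_api callable are assumptions for illustration.

import redis

cache = redis.Redis(host="my-elasticache-endpoint", port=6379, ssl=True)

def get_authorized_buildings(user_id, replenish_from_api):
    """Return the building IDs a user may query, falling back to the entitlement API on a cache miss."""
    cached = cache.smembers(f"user:{user_id}:buildings")
    if cached:
        return [building.decode() for building in cached]
    buildings = replenish_from_api(user_id)  # repopulate Redis when the entry is missing
    if buildings:
        cache.sadd(f"user:{user_id}:buildings", *buildings)
    return buildings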

Results and impact
CBRE’s experience with this initiative has led to enhanced operational efficiency and data reliability, directly translating into tangible business benefits:

Cost savings and resource optimization: By reducing hours of manual effort annually per user, the business can realize substantial cost savings (for example, in labor costs, reduced overtime, or reallocated personnel). This frees up valuable user time so that the team can focus on more strategic, high-value tasks that drive building performance, innovation and growth rather than repetitive manual processes.
Improved decision-making and risk mitigation: The solution delivers results with 95% accuracy, so business decisions are based on highly reliable data. This minimizes the risk of errors, leading to more informed strategies, fewer costly mistakes, and ultimately better business outcomes.
Increased productivity and throughput: With less time spent on manual tasks and a higher assurance of data quality, workflows can become smoother and faster. This translates to increased overall productivity and potentially higher throughput for related processes, enhancing service delivery.

Lessons learned and best practices
The following are our lessons learned and best practices based on our experience building this solution:

Use prompt modularization: Prompt engineering is essential for optimizing application performance and maintaining consistent results. Breaking prompts into modular components improved prompt management, control, and maintainability through streamlined version control, simplified testing and validation, and better performance tracking. The modular approach to prompt design reduced token usage, which in turn decreased LLM response times and improved overall system performance. It also enhanced SQL generation efficiency through faster troubleshooting, reduced implementation time, and more reliable query generation, resulting in quicker resolution of edge cases and business rule updates.
Provide accurate few-shot examples: For increased accuracy and consistency of SQL generation, use dynamic few-shot examples with modular components so the example repository can be updated seamlessly. In particular:

Include examples covering common use cases and edge scenarios.
Maintain a diverse set of high-quality example pairs covering various business scenarios.
Keep examples concise and focused on specific patterns.
Regularly update examples based on new business requirements. Remove or update outdated examples.
Limit to top-1 or top-2 most relevant examples to manage token usage.
Regularly validate the relevance of selected examples.
Set up feedback loops to continuously improve example matching accuracy.
Fine-tune similarity thresholds for optimal example matching.

Reduce the context window: To reduce the size of the context passed to the model, select only the top-N KNN-matched fields from the schema definition along with key or mandatory fields. Apply dynamic field selection only to schemas with a large number of fields that would otherwise inflate the context window.
Improve relevancy: The LLM scoring mechanism helped us identify the right set of relevant schemas (modules). Applying LLM intelligence to the KNN results produced the most relevant, correctly ordered schemas. Also consider:

Vector similarity alone may not capture true semantic relevance.
Top-K nearest neighbors don’t always guarantee contextual accuracy.
Order of results may not reflect actual relevance to the query.
Use of LLM Scoring provided a more accurate schema relevancy determination.

Conclusion
CBRE Property Management and AWS together demonstrated how innovative cloud AI solutions can unlock real business value at scale. By using AWS services and best practices, enterprises can reimagine how they access, manage, and derive insight from their data and take real action.
To learn how your organization can accelerate digital transformation with AWS, contact your AWS account team or start exploring AWS AI and data analytics services today.
Further reading on AWS services featured in this solution:

Amazon Bedrock: Foundation Model Service
Amazon Nova
Amazon OpenSearch Service documentation

About the authors
Lokesha Thimmegowda is a Senior Principal Software Engineer at CBRE, specializing in artificial intelligence and AWS. With four AWS certifications, including Solutions Architect Professional and AWS AI Practitioner, he excels at guiding teams through complex challenges with innovative solutions. Lokesha is passionate about designing transformative solution architectures that drive efficiency. Outside of work, he enjoys daily tennis with his daughters and weekend cricket.
Muppirala Venkata Krishna Kumar is a Principal Software Engineer at CBRE with over 18 years of experience leading technical teams and designing end-to-end solutions across diverse domains. He is a strategic technical lead with a strong command of both front-end and back-end technologies, cloud architecture on AWS, and AI/ML-driven innovations. He is passionate about staying at the forefront of technology, continuously learning, and implementing modern tools to drive impactful results. Outside of work, he values quality time with family and enjoys spiritual travel experiences that bring balance and inspiration.
Maraka Vishwadev is a Senior Staff Engineer at CBRE with 18 years of experience in enterprise software development, specializing in backend–frontend technologies and AWS Cloud. He leads impactful initiatives in Generative AI, leveraging Large Language Models to drive intelligent automation, enhance user experiences, and unlock new business capabilities. He is deeply involved in architecting and delivering scalable, secure, and cloud-native solutions, aligning technology with business strategy. Vishwa balances his professional life with cooking, movies, and quality family time.
Chanpreet Singh is a Senior Consultant at AWS with 18+ years of industry experience, specializing in Data Analytics and AI/ML solutions. He partners with enterprise customers to architect and implement cutting-edge solutions in Big Data, Machine Learning, and Generative AI using AWS native services, partner solutions and open-source technologies. A passionate technologist and problem solver, he balances his professional life with nature exploration, reading, and quality family time.
Sachin Khanna is a Lead Consultant specializing in Artificial Intelligence and Machine Learning (AI/ML) within the AWS Professional Services team. With a strong background in data management, generative AI, large language models, and machine learning, he brings extensive expertise to projects involving data, databases, and AI-driven solutions. His proficiency in cloud migration and cost optimization has enabled him to guide customers through successful cloud adoption journeys, delivering tailored solutions and strategic insights.
Dwaragha Sivalingam is a Senior Solutions Architect specializing in generative AI at AWS, serving as a trusted advisor to customers on cloud transformation and AI strategy. With seven AWS certifications including ML Specialty, he has helped customers in many industries, including insurance, telecom, utilities, engineering, construction, and real estate. A machine learning enthusiast, he balances his professional life with family time, enjoying road trips, movies, and drone photography.

Managed Tiered KV Cache and Intelligent Routing for Amazon SageMaker H …

Modern AI applications demand fast, cost-effective responses from large language models, especially when handling long documents or extended conversations. However, LLM inference can become prohibitively slow and expensive as context length increases, with latency growing steeply and costs mounting with each interaction.
LLM inference requires recalculating attention mechanisms for the previous tokens when generating each new token. This creates significant computational overhead and high latency for long sequences. Key-value (KV) caching addresses this bottleneck by storing and reusing key-value vectors from previous computations, reducing inference latency and time-to-first-token (TTFT). Intelligent routing in LLMs is a technique that sends requests with shared prompts to the same inference instance to maximize the efficiency of the KV cache. It routes a new request to an instance that has already processed the same prefix, allowing it to reuse the cached KV data to accelerate processing and reduce latency. However, customers have told us that setting up and configuring the right framework for KV caching and intelligent routing at production scale is challenging and takes long experimental cycles.
Today we're excited to announce that Amazon SageMaker HyperPod now supports Managed Tiered KV Cache and Intelligent Routing capabilities through the HyperPod Inference Operator. These new capabilities can deliver significant performance improvements for LLM inference workloads, reducing time to first token (TTFT) by up to 40%, increasing throughput, and lowering compute costs by up to 25% for long-context prompts and multi-turn chat conversations, as measured with our internal tools. These capabilities are available for use with the HyperPod Inference Operator, which automatically manages the routing and distributed KV caching infrastructure, significantly reducing operational overhead while delivering enterprise-grade performance for production LLM deployments. By using the new Managed Tiered KV Cache feature, you can efficiently offload attention caches to CPU memory (L1 cache) and distribute the L2 cache for cross-instance sharing through a tiered storage architecture in HyperPod, for optimal resource utilization and cost efficiency at scale.
Efficient KV caching combined with intelligent routing maximizes cache hits across workers so you can achieve higher throughput and lower costs for your model deployments. These features are particularly beneficial in applications that are processing long documents where the same context or prefix is referenced, or in multi-turn conversations where context from previous exchanges needs to be maintained efficiently across multiple interactions.
For example, legal teams analyzing 200 page contracts can now receive instant answers to follow-up questions instead of waiting 5+ seconds per query, healthcare chatbots maintain natural conversation flow across 20+ turn patient dialogues, and customer service systems process millions of daily requests with both better performance and lower infrastructure costs. These optimizations make document analysis, multi-turn conversations, and high-throughput inference applications economically viable at enterprise scale.
Optimizing LLM inference with Managed Tiered KV Cache and Intelligent Routing
Let’s break down the new features:

Managed Tiered KV Cache: Automatic management of attention states across CPU memory (L1) and distributed tiered storage (L2) with configurable cache sizes and eviction policies. SageMaker HyperPod handles the distributed cache infrastructure through the newly launched tiered storage, alleviating operational overhead for cross node cache sharing across clusters. KV cache entries are accessible cluster-wide (L2) so that a node can benefit from computations performed by other nodes.
Intelligent Routing: Configurable request routing to maximize cache hits using strategies like prefix-aware, KV-aware, and round-robin routing.
Observability: Built-in HyperPod Observability integration for observability of metrics and logs for Managed Tiered KV Cache and Intelligent Routing in Amazon Managed Grafana.

Sample flow for inference requests with KV caching and Intelligent Routing
As a user sends an inference request to the HyperPod Load Balancer, it forwards the request to the Intelligent Router within the HyperPod cluster. The Intelligent Router dynamically distributes requests to the most appropriate model pod (Instance A or Instance B) based on the routing strategy to maximize KV cache hits and minimize inference latency. As the request reaches the model pod, the pod first checks the L1 cache (CPU) for frequently used key-value pairs, then queries the shared L2 cache (Managed Tiered KV Cache) if needed, before performing full computation for the token. Newly generated KV pairs are stored in both cache tiers for future reuse. After computation completes, the inference result flows back through the Intelligent Router and Load Balancer to the user.

Managed Tiered KV Cache
Managed Tiered KV Cache and Intelligent Routing are configurable opt-in features. When Managed Tiered KV Cache is enabled, the L1 cache is on by default, and both the L1 and L2 caches can be individually enabled or disabled. The L1 cache resides locally on each inference node and uses CPU memory. This local cache provides significantly faster access, making it ideal for frequently accessed data within a single model instance. The cache automatically manages memory allocation and eviction policies to optimize for the most valuable cached content. The L2 cache operates as a distributed cache layer spanning the entire cluster, enabling cache sharing across multiple model instances. We support two backend options for the L2 cache, each with the following benefits:

Managed Tiered KV Cache (Recommended): A HyperPod disaggregated memory solution that offers excellent scalability to terabyte-scale pools, low latency, an AWS network-optimized and GPU-aware design with zero-copy support, and cost efficiency at scale.
Redis: Simple to set up, works well for small to medium workloads, and offers a rich ecosystem of tools and integrations.

The two-tier architecture works together seamlessly. When a request arrives, the system first checks the L1 cache for the required KV pairs. If found, they are used immediately with minimal latency. If not found in L1, the system queries the L2 cache. If found there, the data is retrieved and optionally promoted to L1 for faster future access. Only if the data is not present in either cache does the system perform the full computation, storing the results in both L1 and L2 for future reuse.
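The lookup order can be summarized with the following conceptual pseudocode; it is a sketch of the described behavior, not the HyperPod implementation, and the cache interfaces are assumptions.

def get_kv_block(block_id, l1_cache, l2_cache, compute_kv):
    """Conceptual two-tier lookup: L1 (local CPU memory), then L2 (cluster-wide), then recompute."""
    block = l1_cache.get(block_id)      # 1. local CPU-memory cache, lowest latency
    if block is not None:
        return block

    block = l2_cache.get(block_id)      # 2. distributed tiered storage shared across instances
    if block is not None:
        l1_cache.put(block_id, block)   # optionally promote to L1 for faster future access
        return block

    block = compute_kv(block_id)        # 3. cache miss: perform the full attention computation
    l1_cache.put(block_id, block)       # store in both tiers for future reuse
    l2_cache.put(block_id, block)
    return block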
Intelligent Routing
Our Intelligent Routing system offers four configurable strategies to optimize request distribution based on your workload characteristics, with the routing strategy being user-configurable at deployment time to match your application’s specific requirements.

Prefix-aware routing serves as the default strategy, maintaining a tree structure to track which prefixes are cached on which endpoints, delivering strong general-purpose performance for applications with common prompt templates such as multi-turn conversations, customer service bots with standard greetings, and code generation with common imports.
KV-aware routing provides the most sophisticated cache management through a centralized controller that tracks cache locations and handles eviction events in real-time, excelling at long conversation threads, document processing workflows, and extended coding sessions where maximum cache efficiency is critical.
Round-robin routing offers the most straightforward approach, distributing requests evenly across the available workers, best suited for scenarios where requests are independent, such as batch inference jobs, stateless API calls, and load testing scenarios.

The following summarizes each strategy and the workloads it suits best:

Prefix-aware routing (default): Multi-turn conversations, customer service bots, code generation with common headers
KV-aware routing: Long conversations, document processing, extended coding sessions
Round-robin routing: Batch inference, stateless API calls, load testing

Deploying the Managed Tiered KV Cache and Intelligent Routing solution
Prerequisites
Create a HyperPod cluster with Amazon EKS as an orchestrator.

In Amazon SageMaker AI console, navigate to HyperPod Clusters, then Cluster Management.
On the Cluster Management page, select Create HyperPod cluster, then Orchestrated by Amazon EKS.
You can use one-click deployment from the SageMaker AI console. For cluster set up details see Creating a SageMaker HyperPod cluster with Amazon EKS orchestration.
Verify that the HyperPod cluster status is InService.

Verify that the inference operator is up and running. The Inference add-on is installed as a default option when you create the HyperPod cluster from the console. If you want to use an existing EKS cluster, see Setting up your HyperPod clusters for model deployment to manually install the inference operator.

From the command line, run the following command: 

kubectl get pods -n hyperpod-inference-system

Output:

The hyperpod-inference-operator-controller-manager-xxxxxx pod is in the Running state in the hyperpod-inference-system namespace.

Or, verify that the operator is running from the console: navigate to the EKS cluster, choose Resources, then Pods, and pick the hyperpod-inference-system namespace.

Preparing your model deployment manifest files
You can enable these features by adding configurations to your InferenceEndpointConfig custom CRD file.
For the complete example, visit the AWS samples GitHub repository.

export MODEL_NAME="Llama-3.1-8B-Instruct"
export INSTANCE_TYPE="ml.g5.24xlarge"
export MODEL_IMAGE="public.ecr.aws/deep-learning-containers/vllm:0.11.1-gpu-py312-cu129-ubuntu22.04-ec2-v1.0"
export S3_BUCKET="my-model-bucket"
export S3_MODEL_PATH="models/Llama-3.1-8B-Instruct"
export AWS_REGION="us-west-2"
export CERT_S3_URI="s3://my-bucket/certs/"
export NAMESPACE="default"
export NAME="demo"

cat << EOF > inference_endpoint_config.yaml
apiVersion: inference.sagemaker.aws.amazon.com/v1
kind: InferenceEndpointConfig
metadata:
  name: ${NAME}
  namespace: ${NAMESPACE}
spec:
  modelName: ${MODEL_NAME}
  instanceType: ${INSTANCE_TYPE}
  replicas: 1
  invocationEndpoint: v1/chat/completions
  modelSourceConfig:
    modelSourceType: s3
    s3Storage:
      bucketName: ${S3_BUCKET}
      region: ${AWS_REGION}
    modelLocation: ${S3_MODEL_PATH}
    prefetchEnabled: false
  kvCacheSpec:
    enableL1Cache: true
    enableL2Cache: true
    l2CacheSpec:
      l2CacheBackend: "tieredstorage" # can also be "redis"
      # Set l2CacheLocalUrl if selecting "redis"
      # l2CacheLocalUrl: "redis://redis.default.svc.cluster.local:6379"
  intelligentRoutingSpec:
    enabled: true
    routingStrategy: prefixaware
  tlsConfig:
    tlsCertificateOutputS3Uri: ${CERT_S3_URI}
  metrics:
    enabled: true
    modelMetrics:
      port: 8000
  loadBalancer:
    healthCheckPath: /health
  worker:
    resources:
      limits:
        nvidia.com/gpu: "4"
      requests:
        cpu: "6"
        memory: 30Gi
        nvidia.com/gpu: "4"
    image: ${MODEL_IMAGE}
    args:
      - "--model"
      - "/opt/ml/model"
      - "--max-model-len"
      - "20000"
      - "--tensor-parallel-size"
      - "4"
    modelInvocationPort:
      containerPort: 8000
      name: http
    modelVolumeMount:
      name: model-weights
      mountPath: /opt/ml/model
    environmentVariables:
      - name: OPTION_ROLLING_BATCH
        value: "vllm"
      - name: SAGEMAKER_SUBMIT_DIRECTORY
        value: "/opt/ml/model/code"
      - name: MODEL_CACHE_ROOT
        value: "/opt/ml/model"
      - name: SAGEMAKER_MODEL_SERVER_WORKERS
        value: "1"
      - name: SAGEMAKER_MODEL_SERVER_TIMEOUT
        value: "3600"
EOF

kubectl apply -f inference_endpoint_config.yaml

# Check inferenceendpointconfig status
kubectl get inferenceendpointconfig ${NAME} -n ${NAMESPACE}
NAME AGE
demo 8s

# Check pods status – you should see worker pods
kubectl get pods -n ${NAMESPACE}
NAME READY STATUS RESTARTS AGE
demo-675886c7bb-7bhhg 3/3 Running 0 30s

# Router pods are under hyperpod-inference-system namespace
kubectl get pods -n hyperpod-inference-system
NAME READY STATUS RESTARTS AGE
hyperpod-inference-operator-controller-manager-dff64b947-m5nqk 1/1 Running 0 5h49m
demo-default-router-8787cf46c-jmgqd 2/2 Running 0 2m16s

Observability
You can monitor Managed KV Cache and Intelligent Routing metrics through the SageMaker HyperPod Observability features. For more information, see Accelerate foundation model development with one-click observability in Amazon SageMaker HyperPod.
KV Cache Metrics are available in the Inference dashboard.

Benchmarking
We conducted comprehensive benchmarking to validate real-world performance improvements for production LLM deployments. Our benchmarks were run with the Managed Tiered KV Cache and Intelligent Routing features using the Llama-3.1-70B-Instruct model deployed across 7 replicas on p5.48xlarge instances (each equipped with eight NVIDIA GPUs), under a steady-load traffic pattern. The benchmark environment used a dedicated client node group (one c5.12xlarge instance per 100 concurrent requests) to generate a controlled load, and a dedicated server node group so that model servers operated in isolation and avoided resource contention under high concurrency.
Our benchmarks demonstrate that a combination of L1 and L2 Managed Tiered KV Cache and Intelligent Routing delivers substantial performance improvements across multiple dimensions. For medium context scenarios (8k tokens), we observed a 40% reduction in time to first token (TTFT) at P90, 72% reduction at P50, 24% increase in throughput, and 21% cost reduction compared to baseline configurations without optimization. The benefits are even more pronounced for long context workloads (64K tokens), achieving a 35% reduction in TTFT at P90, 94% reduction at P50, 38% throughput increase, and 28% cost savings. The optimization benefits scale dramatically with context length. While 8K token scenarios demonstrate solid improvements across the metrics, 64K token workloads experience transformative gains that fundamentally change the user experience. Our testing also confirmed that AWS-managed tiered storage consistently outperformed Redis-based L2 caching across the scenarios. The tiered storage backend delivered better latency and throughput without requiring the operational overhead of managing separate Redis infrastructure, making it the recommended choice for most deployments. Finally, unlike traditional performance optimizations that require tradeoffs between cost and speed, this solution delivers both simultaneously.
Benchmark result charts: TTFT (P90), TTFT (P50), Throughput (TPS), and Cost per 1,000 tokens ($).

Conclusion
Managed Tiered KV Cache and Intelligent Routing in Amazon SageMaker HyperPod Model Deployment help you optimize LLM inference performance and costs through efficient memory management and smart request routing. You can get started today by adding these configurations to your HyperPod model deployments in the AWS Regions where SageMaker HyperPod is available.
To learn more, visit the Amazon SageMaker HyperPod documentation or follow the model deployment getting started guide.

About the authors
Chaitanya Hazarey is the Software Development Manager for SageMaker HyperPod Inference at Amazon, bringing extensive expertise in full-stack engineering, ML/AI, and data science. As a passionate advocate for responsible AI development, he combines technical leadership with a deep commitment to advancing AI capabilities while maintaining ethical considerations. His comprehensive understanding of modern product development drives innovation in machine learning infrastructure.
Pradeep Cruz is a Senior SDM at Amazon Web Services (AWS), driving AI infrastructure and applications at enterprise scale. Leading cross-functional organizations at Amazon SageMaker AI, he has built and scaled multiple high-impact services for enterprise customers including SageMaker HyperPod-EKS Inference, Task Governance, Feature Store, AIOps, and JumpStart Model Hub at AWS, alongside enterprise AI platforms at T-Mobile and Ericsson. His technical depth spans distributed systems, GenAI/ML, Kubernetes, cloud computing, and full-stack software development.
Vinay Arora is a Specialist Solution Architect for Generative AI at AWS, where he collaborates with customers in designing cutting-edge AI solutions leveraging AWS technologies. Prior to AWS, Vinay has over two decades of experience in finance—including roles at banks and hedge funds—he has built risk models, trading systems, and market data platforms. Vinay holds a master’s degree in computer science and business management.
Piyush Daftary is a Senior Software Engineer at AWS, working on Amazon SageMaker with a focus on building performant, scalable inference systems for large language models. His technical interests span AI/ML, databases, and search technologies, where he specializes in developing production-ready solutions that enable efficient model deployment and inference at scale. His work involves optimizing system performance, implementing intelligent routing mechanisms, and designing architectures that support both research and production workloads, with a passion for solving complex distributed systems challenges and making advanced AI capabilities more accessible to developers and organizations. Outside of work, he enjoys traveling, hiking, and spending time with family.
Ziwen Ning is a Senior Software Development Engineer at AWS, currently working on SageMaker Hyperpod Inference with a focus on building scalable infrastructure for large-scale AI model inference. His technical expertise spans container technologies, Kubernetes orchestration, and ML infrastructure, developed through extensive work across the AWS ecosystem. He has deep experience in container registries and distribution, container runtime development and open source contributions, and containerizing ML workloads with custom resource management and monitoring. Ziwen is passionate about designing production-grade systems that make advanced AI capabilities more accessible. In his free time, he enjoys kickboxing, badminton, and immersing himself in music.
Roman Blagovirnyy is a Sr. User Experience Designer on the SageMaker AI team with 19 years of diverse experience in interactive, workflow, and UI design, working on enterprise and B2B applications and features for the finance, healthcare, security, and HR industries prior to joining Amazon. At AWS Roman was a key contributor to the design of SageMaker AI Studio, SageMaker Studio Lab, data and model governance capabilities, and HyperPod. Roman’s currently works on new features and improvements to the administrator experience for HyperPod. In addition to this, Roman has a keen interest in design operations and process.
Caesar Chen is the Software Development Manager for SageMaker HyperPod at AWS, where he leads the development of cutting-edge machine learning infrastructure. With extensive experience in building production-grade ML systems, he drives technical innovation while fostering team excellence. His work in scalable model hosting infrastructure empowers data scientists and ML engineers to deploy and manage models with greater efficiency and reliability.
Chandra Lohit Reddy Tekulapally is a Software Development Engineer with the Amazon SageMaker HyperPod team. He is passionate about designing and building reliable, high-performance distributed systems that power large-scale AI workloads. Outside of work, he enjoys traveling and exploring new coffee spots.
Kunal Jha is a Principal Product Manager at AWS. He is focused on building Amazon SageMaker Hyperpod as the best-in-class choice for Generative AI model’s training and inference. In his spare time, Kunal enjoys skiing and exploring the Pacific Northwest.
Vivek Gangasani is a Worldwide Lead GenAI Specialist Solutions Architect for SageMaker Inference. He drives Go-to-Market (GTM) and Outbound Product strategy for SageMaker Inference. He also helps enterprises and startups deploy, manage, and scale their GenAI models with SageMaker and GPUs. Currently, he is focused on developing strategies and content for optimizing inference performance and GPU efficiency for hosting Large Language Models. In his free time, Vivek enjoys hiking, watching movies, and trying different cuisines.

Salesforce AI Research Introduces xRouter: A Reinforcement Learning Ro …

When your application can call many different LLMs with very different prices and capabilities, who should decide which one answers each request? The Salesforce AI research team introduces 'xRouter', a tool-calling based routing system that targets this gap: a reinforcement learning based router that learns when to answer locally and when to call external models, while tracking cost at the token level.

What is xRouter?

xRouter is a tool calling based orchestration system built on Qwen2.5-7B-Instruct as the router backbone. The router is an instruction tuned model with tool calling capabilities that decides which downstream model to invoke, how to prompt it, and whether to synthesize or select an answer. The implementation uses DAPO, Distributional Advantage Policy Optimization, inside the Verl reinforcement learning framework, and exposes an OpenAI compatible API.

The router operates over more than 20 LLM tools in the full system. These tools span premium, standard, budget and specialized tiers, including GPT-5, GPT-4.1, GPT-5-Mini, GPT-5-Nano, o3, Kimi K2, DeepSeek-R1, Qwen3-235B variants and GPT-OSS models. The offloading pool is a 12 model subset that includes GPT-5, GPT-5-Mini, GPT-5-Nano, GPT-4o, GPT-4.1, o3, o3-Pro, o4-Mini, GPT-OSS-120B, GPT-OSS-20B and two Gemini-2.5 variants.

https://arxiv.org/pdf/2510.08439

Cost Aware Reward and Success Gating

Routing is framed as a reinforcement learning problem. For each episode, the reward combines a binary success signal and a cost penalty. The research team defines a reward that gives a fixed bonus when the final answer is correct, then subtracts a term proportional to the total normalized cost of all model calls. If the answer is wrong, the reward is zero regardless of how cheap it was.

As per the Model weights page, reward = quality − λ × normalized_cost, where λ is a cost penalty coefficient. Episodes with failures effectively have zero quality. This ‘success gated, cost shaped’ objective forces the router to first achieve correctness, then optimize cost among successful strategies. In practice, training uses 3 cost penalty settings, which produce the xRouter-7B-1, xRouter-7B-2 and xRouter-7B-3 variants.
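A minimal sketch of this success-gated, cost-shaped reward, under the assumption that quality is a fixed bonus of 1.0 for a correct answer and that cost is normalized against a per-episode budget:

def episode_reward(is_correct, total_cost, cost_budget, lam):
    """Success-gated, cost-shaped reward: zero for wrong answers, quality minus a cost penalty otherwise."""
    if not is_correct:
        return 0.0  # failures earn nothing, no matter how cheap they were
    quality = 1.0  # assumed fixed bonus for a correct final answer
    normalized_cost = total_cost / cost_budget if cost_budget > 0 else 0.0
    return quality - lam * normalized_cost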


Training Data and Signal Design

xRouter training data comes from Reasoning360, which includes math, code, and general reasoning tasks with difficulty estimates derived from a strong reference model, Qwen3-32B. The research team stratifies samples into easy, medium, and hard bands, and adds simpler chit-chat, retrieval, and factual questions to teach the router when it can answer directly without delegation. Each sample includes descriptions and prices for models from different tiers. The system also refreshes the model catalog and perturbs costs to avoid overfitting to a static price table.

Failed trajectories, such as wrong answers from expensive models or unnecessary calls when the router could have answered itself, still incur full cost and receive zero reward. This produces a clean learning signal, where correctness gates reward and cost shapes the routing policy.

How the Router Behaves at Inference Time?

The router supports three execution modes. It can answer directly from the backbone without calling tools. It can call one or more downstream models, then synthesize a response using its own reasoning over their outputs. It can also call downstream models and use a special select_response tool to pick one of the replies as the final answer. These modes are implemented through function calls in an OpenAI style interface, which the orchestration engine executes through LiteLLM and SGLang.
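For intuition, the router's execution modes could be surfaced as OpenAI-style function tools along the lines of the sketch below. The select_response name comes from the article; the call_model tool and both parameter schemas are illustrative assumptions, not the released xRouter interface.

# Hypothetical tool definitions in the OpenAI function-calling format.
router_tools = [
    {
        "type": "function",
        "function": {
            "name": "call_model",
            "description": "Invoke a downstream LLM with a prompt crafted by the router.",
            "parameters": {
                "type": "object",
                "properties": {
                    "model": {"type": "string", "description": "e.g. gpt-5-mini, o3, deepseek-r1"},
                    "prompt": {"type": "string"},
                },
                "required": ["model", "prompt"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "select_response",
            "description": "Pick one of the downstream replies as the final answer.",
            "parameters": {
                "type": "object",
                "properties": {"response_index": {"type": "integer"}},
                "required": ["response_index"],
            },
        },
    },
]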

Empirically, trained xRouter instances use a mix of direct and synthesized responses. Off-the-shelf routers such as GPT-4o, GPT-4.1, GPT-5, Qwen2.5-7B, and Qwen3-8B tend to respond directly most of the time, even when instructed to offload when uncertain. This is an important behavioral difference and explains part of the efficiency gain.

Quantitative Results and Cost Utility

On static routing baselines across Minerva, MATH-500, Olympiad Bench, AIME-24, AMC-23, Codeforces, Code-Contests and Human-EvalPlus, xRouter-7B variants consistently improve accuracy compared to using the same base model as an untrained router. xRouter-7B-2, for example, reaches near GPT-5 accuracy on Olympiad Bench while using about one eighth of the GPT-5 evaluation cost.

In the system level comparison on LiveCodeBench v5, GPQA-Diamond, AIME25, MT-Bench, IFEval, and LiveBench, xRouter-7B-3 achieves the highest average accuracy on LiveCodeBench v5 among all tested systems, and does this with moderate cost. Across tasks such as GPQA, xRouter variants reach around 80 to 90 percent of GPT-5 accuracy while consuming less than one fifth of the cost. The research team summarizes that their cost-aware reward can reduce inference cost by up to 80 percent at similar completion rates, while the Hugging Face model card reports up to 60 percent cost reduction for comparable quality under other settings.

The research team also defines ‘cost utility’ as accuracy divided by cost. Open source single models with very low API prices often reach higher cost utility, but with lower absolute accuracy. xRouter sits in the middle, trading some cost utility for stronger task performance, which is usually what production systems care about.

Key Takeaways

xRouter is a tool calling router built on Qwen2.5-7B-Instruct that learns to select among more than 20 external LLMs with a reinforcement learning policy that is explicitly cost aware.

The router uses a success gated reward: tasks only earn positive reward when the final answer is correct, and within successful trajectories a cost penalty of λ times normalized cost applies, which yields three xRouter-7B variants with different cost accuracy trade-offs.

Training on Reasoning360 with difficulty stratification and synthetic easy queries teaches xRouter when to answer directly and when to offload, while perturbing prices and model pools improves robustness to changing provider catalogs.

Across math, coding and reasoning benchmarks, xRouter-7B models achieve near GPT-5 accuracy on hard tasks like Olympiad Bench and around 80 to 90 percent of GPT-5 accuracy on GPQA, while cutting offloading cost by up to 60 to 80 percent depending on the evaluation setup.

Editorial Notes

xRouter is a practical step toward cost aware orchestration for heterogeneous LLM fleets. It shows that a mid size router, trained with DAPO on Reasoning360 using a success gated, cost shaped reward, can consistently approach GPT-5 accuracy while reducing offloading cost by up to 60 to 80 percent.


Agent0: A Fully Autonomous AI Framework that Evolves High-Performing Agents without External Data through Multi-Step Co-Evolution

Large language models need huge human datasets, so what happens if the model must create all its own curriculum and teach itself to use tools? A team of researchers from UNC-Chapel Hill, Salesforce Research and Stanford University introduces 'Agent0', a fully autonomous framework that evolves high-performing agents without external data through multi-step co-evolution and seamless tool integration.

Agent0 targets mathematical and general reasoning. It shows that careful task generation and tool integrated rollouts can push a base model beyond its original capabilities, across ten benchmarks.

https://arxiv.org/pdf/2511.16043

Two agents from one base model

Agent0 starts from a base policy π_base, for example Qwen3 4B Base or Qwen3 8B Base. It clones this policy into:

a Curriculum Agent πθ that generates tasks,

an Executor Agent πϕ that solves those tasks with a Python tool.

Training proceeds in iterations with two stages per iteration:

Curriculum evolution: The curriculum agent generates a batch of tasks. For each task, the executor samples multiple responses. A composite reward measures how uncertain the executor is, how often it uses the tool and how diverse the batch is. πθ is updated with Group Relative Policy Optimization (GRPO) using this reward.

Executor evolution: The trained curriculum agent is frozen. It generates a large pool of tasks. Agent0 filters this pool to keep only tasks near the executor’s capability frontier, then trains the executor on these tasks using an ambiguity aware RL objective called Ambiguity Dynamic Policy Optimization (ADPO).

This loop creates a feedback cycle. As the executor becomes stronger by using the code interpreter, the curriculum must generate more complex, tool reliant problems to keep its reward high.
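
The two stage iteration can be summarized with the skeleton below; every class, method, and callable in it (generate_batch, generate_pool, rollout_with_tool, the GRPO and ADPO update functions, and the reward and self consistency helpers) is an assumed interface for illustration, not code from the Agent0 release.

def agent0_iteration(curriculum, executor, grpo_update, adpo_update,
                     curriculum_reward, self_consistency, band=(0.3, 0.8)):
    """One co-evolution iteration (repeated across iterations); all callables are placeholders."""
    # Stage 1: curriculum evolution; the executor is only used to score the generated tasks.
    tasks = curriculum.generate_batch()
    rewards = [curriculum_reward(task, executor) for task in tasks]
    grpo_update(curriculum, tasks, rewards)

    # Stage 2: executor evolution on frontier tasks; the curriculum is frozen.
    pool = curriculum.generate_pool()
    frontier = [t for t in pool if band[0] <= self_consistency(executor, t) <= band[1]]
    trajectories = [executor.rollout_with_tool(t) for t in frontier]
    adpo_update(executor, trajectories)
    return curriculum, executor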

https://arxiv.org/pdf/2511.16043

How the curriculum agent scores tasks

The curriculum reward combines three signals:

Uncertainty reward: For each generated task x, the executor samples k responses and a majority vote defines a pseudo answer. Self consistency p̂(x) is the fraction of responses that agree with this majority. The reward is maximal when p̂ is close to 0.5 and low when tasks are too easy or too hard. This encourages tasks that are challenging but still solvable for the current executor.

Tool use reward: The executor can trigger a sandboxed code interpreter using python tags and receives results tagged as output. Agent0 counts the number of tool calls in a trajectory and gives a scaled, capped reward, with a cap C set to 4 in experiments. This favors tasks that actually require tool calls rather than pure mental arithmetic.

Repetition penalty: Within each curriculum batch, Agent0 measures pairwise similarity between tasks using a BLEU based distance. Tasks are clustered, and a penalty term increases with cluster size. This discourages the curriculum from generating many near duplicates.

A composite reward multiplies a format check with a weighted sum of uncertainty and tool rewards minus the repetition penalty. This composite value feeds into GRPO to update πθ.
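
A minimal sketch of this composite reward is shown below; the triangular shape of the uncertainty term, the tool reward scale, the weights, and the grouping of the penalty are illustrative assumptions, and the BLEU based repetition penalty is taken as a precomputed input.

def uncertainty_reward(p_hat):
    # Peaks at p_hat = 0.5, falls off for tasks that are too easy or too hard
    # (an assumed triangular shape).
    return 1.0 - 2.0 * abs(p_hat - 0.5)

def tool_use_reward(num_tool_calls, cap=4, scale=0.25):
    # Scaled, capped count of interpreter calls; the paper uses a cap C = 4.
    return scale * min(num_tool_calls, cap)

def curriculum_reward(p_hat, num_tool_calls, repetition_penalty, format_ok,
                      w_unc=1.0, w_tool=1.0):
    # Composite reward: the format check gates a weighted sum of the uncertainty
    # and tool rewards minus the repetition penalty.
    if not format_ok:
        return 0.0
    return w_unc * uncertainty_reward(p_hat) + w_tool * tool_use_reward(num_tool_calls) - repetition_penalty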

How the executor learns from noisy self labels

The executor is also trained with GRPO but on multi turn, tool integrated trajectories and pseudo labels instead of ground truth answers.

Frontier dataset construction: After curriculum training in an iteration, the frozen curriculum generates a large candidate pool. For each task, Agent0 computes self consistency p̂(x) with the current executor and keeps only tasks where p̂ lies in an informative band, for example between 0.3 and 0.8. This defines a challenging frontier dataset that avoids trivial or impossible problems.
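
The following is a small sketch of how self consistency and the frontier band filter could be computed; the sample_answers callable and the value of k stand in for the executor's sampling and answer extraction and are assumptions.

from collections import Counter

def self_consistency(answers):
    """Return the majority-vote pseudo label and the fraction of answers that agree with it."""
    counts = Counter(answers)
    pseudo_label, top = counts.most_common(1)[0]
    return pseudo_label, top / len(answers)

def build_frontier(tasks, sample_answers, k=8, low=0.3, high=0.8):
    """Keep tasks whose self-consistency falls in the informative band.

    sample_answers(task, k) is an assumed callable that returns k extracted
    final answers from the current executor.
    """
    frontier = []
    for task in tasks:
        answers = sample_answers(task, k)
        _, p_hat = self_consistency(answers)
        if low <= p_hat <= high:
            frontier.append(task)
    return frontier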

Multi turn tool integrated rollouts: For each frontier task, the executor generates a trajectory that can interleave:

natural language reasoning tokens,

python code segments,

output tool feedback.

Generation pauses when a tool call appears, executes the code in a sandboxed interpreter built on VeRL Tool, then resumes conditioned on the result. The trajectory terminates when the model produces a final answer inside \boxed{...}.

A majority vote across sampled trajectories defines a pseudo label and a terminal reward for each trajectory.

ADPO, ambiguity aware RL: Standard GRPO treats all samples equally, which is unstable when labels come from majority voting on ambiguous tasks. ADPO modifies GRPO in two ways using p̂ as an ambiguity signal, as sketched after the two points below.

It scales the normalized advantage with a factor that increases with self consistency, so trajectories from low confidence tasks contribute less.

It sets a dynamic upper clipping bound for the importance ratio, which depends on self consistency. Empirical analysis shows that fixed upper clipping mainly affects low probability tokens. ADPO relaxes this bound adaptively, which improves exploration on uncertain tasks, as visualized by the up clipped token probability statistics.
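
Here is that sketch: a simplified GRPO style clipped surrogate with the two ADPO modifications, using assumed linear forms for the confidence weighting and the dynamic upper clip; the exact functional forms in the paper may differ.

import numpy as np

def adpo_surrogate(logp_new, logp_old, advantages, p_hat,
                   clip_low=0.2, clip_base=0.2, clip_gain=0.2):
    """Simplified ADPO style clipped objective for one task (illustrative assumptions).

    logp_new, logp_old: per-token log probabilities under the current and behavior policies
    advantages:         normalized group relative advantages (GRPO style)
    p_hat:              self consistency of the task, in [0, 1]
    """
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    # 1) Scale advantages by a factor that grows with self consistency,
    #    so trajectories from ambiguous tasks contribute less.
    weighted_adv = np.asarray(advantages) * p_hat
    # 2) Relax the upper clipping bound more for uncertain tasks to keep exploration,
    #    using an assumed linear schedule.
    clip_high = clip_base + clip_gain * (1.0 - p_hat)
    clipped_ratio = np.clip(ratio, 1.0 - clip_low, 1.0 + clip_high)
    return np.minimum(ratio * weighted_adv, clipped_ratio * weighted_adv).mean()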

https://arxiv.org/pdf/2511.16043

Results on mathematical and general reasoning

Agent0 is implemented on top of VeRL and evaluated on Qwen3 4B Base and Qwen3 8B Base. It uses a sandboxed Python interpreter as the single external tool.

The research team evaluates on ten benchmarks:

Mathematical reasoning: AMC, Minerva, MATH, GSM8K, Olympiad Bench, AIME24, AIME25.

General reasoning: SuperGPQA, MMLU Pro, BBEH.

They report pass@1 for most datasets and mean@32 for AMC and AIME tasks.

For Qwen3 8B Base, Agent0 reaches:

math average 58.2 versus 49.2 for the base model,

overall general average 42.1 versus 34.5 for the base model.

Agent0 also improves over strong data free baselines such as R Zero, Absolute Zero, SPIRAL and Socratic Zero, both with and without tools. On Qwen3 8B, it surpasses R Zero by 6.4 percentage points and Absolute Zero by 10.6 points on the overall average. It also beats Socratic Zero, which relies on external OpenAI APIs.

Across three co evolution iterations, average math performance on Qwen3 8B increases from 55.1 to 58.2 and general reasoning also improves per iteration. This confirms stable self improvement rather than collapse.

Qualitative examples show that curriculum tasks evolve from basic geometry questions to complex constraint satisfaction problems, while executor trajectories mix reasoning text with Python calls to reach correct answers.

Key Takeaways

Fully data free co evolution: Agent0 eliminates external datasets and human annotations. Two agents, a curriculum agent and an executor agent, are initialized from the same base LLM and co evolve only via reinforcement learning and a Python tool.

Frontier curriculum from self uncertainty: The curriculum agent uses the executor’s self consistency and tool usage to score tasks. It learns to generate frontier tasks that are neither trivial nor impossible, and that explicitly require tool integrated reasoning.

ADPO stabilizes RL with pseudo labels: The executor is trained with Ambiguity Dynamic Policy Optimization. ADPO down weights highly ambiguous tasks and adapts the clipping range based on self consistency, which makes GRPO style updates stable when rewards come from majority vote pseudo labels.

Consistent gains on math and general reasoning: On Qwen3 8B Base, Agent0 improves math benchmarks from 49.2 to 58.2 average and general reasoning from 34.5 to 42.1, which corresponds to relative gains of about 18 percent and 24 percent.

Outperforms prior zero data frameworks: Across ten benchmarks, Agent0 surpasses previous self evolving methods such as R Zero, Absolute Zero, SPIRAL and Socratic Zero, including those that already use tools or external APIs. This shows that the co evolution plus tool integration design is a meaningful step beyond earlier single round self play approaches.

Editorial Notes

Agent0 is an important step toward practical, data free reinforcement learning for tool integrated reasoning. It shows that a base LLM can act as both Curriculum Agent and Executor Agent, and that GRPO with ADPO and VeRL Tool can drive stable improvement from majority vote pseudo labels. The method also demonstrates that tool integrated co evolution can outperform prior zero data frameworks such as R Zero and Absolute Zero on strong Qwen3 baselines. Agent0 makes a strong case that self evolving, tool integrated LLM agents are becoming a realistic training paradigm.


How to Build a Neuro-Symbolic Hybrid Agent that Combines Logical Planning with Neural Perception for Robust Autonomous Decision-Making

In this tutorial, we demonstrate how to combine the strengths of symbolic reasoning with neural learning to build a powerful hybrid agent. We focus on creating a neuro-symbolic architecture that uses classical planning for structure, rules, and goal-directed behavior, while neural networks handle perception and action refinement. As we walk through the code, we see how both layers interact in real time, allowing us to navigate an environment, overcome uncertainty, and adapt intelligently. At last, we understand how neuro-symbolic systems bring interpretability, robustness, and flexibility together in a single agentic framework. Check out the FULL CODES here.

import numpy as np
import matplotlib.pyplot as plt
from dataclasses import dataclass, field
from typing import List, Dict, Tuple, Set, Optional
from collections import deque
import warnings
warnings.filterwarnings('ignore')

@dataclass
class State:
    robot_pos: Tuple[int, int]
    holding: Optional[str] = None
    visited: Set[Tuple[int, int]] = field(default_factory=set)
    objects_collected: Set[str] = field(default_factory=set)
    def __hash__(self):
        return hash((self.robot_pos, self.holding))

class SymbolicPlanner:
    def __init__(self, grid_size: int = 8):
        self.grid_size = grid_size
        self.actions = ['up', 'down', 'left', 'right', 'pickup', 'drop']
    def get_successors(self, state: State, obstacles: Set[Tuple[int, int]], objects: Dict[str, Tuple[int, int]]) -> List[Tuple[str, State]]:
        successors = []
        x, y = state.robot_pos
        moves = {'up': (x, y-1), 'down': (x, y+1), 'left': (x-1, y), 'right': (x+1, y)}
        for action, new_pos in moves.items():
            nx, ny = new_pos
            if (0 <= nx < self.grid_size and 0 <= ny < self.grid_size and new_pos not in obstacles):
                new_state = State(new_pos, state.holding, state.visited | {new_pos}, state.objects_collected.copy())
                successors.append((action, new_state))
        if state.holding is None:
            for obj_name, obj_pos in objects.items():
                if state.robot_pos == obj_pos and obj_name not in state.objects_collected:
                    new_state = State(state.robot_pos, obj_name, state.visited.copy(), state.objects_collected.copy())
                    successors.append(('pickup', new_state))
        if state.holding is not None:
            new_state = State(state.robot_pos, None, state.visited.copy(), state.objects_collected | {state.holding})
            successors.append(('drop', new_state))
        return successors
    def heuristic(self, state: State, goal: Tuple[int, int]) -> float:
        return abs(state.robot_pos[0] - goal[0]) + abs(state.robot_pos[1] - goal[1])
    def a_star_plan(self, start_state: State, goal: Tuple[int, int], obstacles: Set[Tuple[int, int]], objects: Dict[str, Tuple[int, int]]) -> List[str]:
        counter = 0
        frontier = [(self.heuristic(start_state, goal), counter, 0, start_state, [])]
        visited = set()
        while frontier:
            frontier.sort()
            _, _, cost, state, plan = frontier.pop(0)
            counter += 1
            if state.robot_pos == goal and len(state.objects_collected) >= len(objects):
                return plan
            state_key = (state.robot_pos, state.holding)
            if state_key in visited:
                continue
            visited.add(state_key)
            for action, next_state in self.get_successors(state, obstacles, objects):
                new_cost = cost + 1
                new_plan = plan + [action]
                priority = new_cost + self.heuristic(next_state, goal)
                frontier.append((priority, counter, new_cost, next_state, new_plan))
                counter += 1
        return []

We lay the foundation for our symbolic reasoning system and define how states, actions, and transitions work. We implement classical planning logic using A* search to generate goal-directed, interpretable action sequences. As we build this part, we establish the rule-based backbone that guides the agent’s high-level decisions. Check out the FULL CODES here.

class NeuralPerception:
    def __init__(self, grid_size: int = 8):
        self.grid_size = grid_size
        self.W1 = np.random.randn(grid_size * grid_size, 64) * 0.1
        self.b1 = np.zeros(64)
        self.W2 = np.random.randn(64, 32) * 0.1
        self.b2 = np.zeros(32)
        self.W3 = np.random.randn(32, grid_size * grid_size) * 0.1
        self.b3 = np.zeros(grid_size * grid_size)
    def relu(self, x):
        return np.maximum(0, x)
    def sigmoid(self, x):
        return 1 / (1 + np.exp(-np.clip(x, -500, 500)))
    def perceive(self, noisy_grid: np.ndarray) -> np.ndarray:
        x = noisy_grid.flatten()
        h1 = self.relu(x @ self.W1 + self.b1)
        h2 = self.relu(h1 @ self.W2 + self.b2)
        out = self.sigmoid(h2 @ self.W3 + self.b3)
        return out.reshape(self.grid_size, self.grid_size)

class NeuralPolicy:
    def __init__(self, state_dim: int = 4, action_dim: int = 4):
        self.W = np.random.randn(state_dim, action_dim) * 0.1
        self.b = np.zeros(action_dim)
        self.action_map = ['up', 'down', 'left', 'right']
    def softmax(self, x):
        exp_x = np.exp(x - np.max(x))
        return exp_x / exp_x.sum()
    def get_action_probs(self, state_features: np.ndarray) -> np.ndarray:
        logits = state_features @ self.W + self.b
        return self.softmax(logits)
    def select_action(self, state_features: np.ndarray, symbolic_action: str) -> str:
        probs = self.get_action_probs(state_features)
        if symbolic_action in self.action_map:
            sym_idx = self.action_map.index(symbolic_action)
            probs[sym_idx] += 0.7
            probs = probs / probs.sum()
        return np.random.choice(self.action_map, p=probs)

We introduce the neural components that allow our agent to sense and adapt. We design a lightweight neural network to denoise the environment and a simple policy network to refine actions based on features. As we integrate these elements, we ensure that our agent can handle uncertainty and adjust behavior dynamically. Check out the FULL CODES here.

class NeuroSymbolicAgent:
    def __init__(self, grid_size: int = 8):
        self.grid_size = grid_size
        self.planner = SymbolicPlanner(grid_size)
        self.perception = NeuralPerception(grid_size)
        self.policy = NeuralPolicy()
        self.obstacles = {(3, 3), (3, 4), (4, 3), (5, 5), (6, 2)}
        self.objects = {'key': (2, 6), 'gem': (6, 6)}
        self.goal = (7, 7)
    def create_noisy_observation(self, true_grid: np.ndarray) -> np.ndarray:
        noise = np.random.randn(*true_grid.shape) * 0.2
        return np.clip(true_grid + noise, 0, 1)
    def extract_state_features(self, pos: Tuple[int, int], goal: Tuple[int, int]) -> np.ndarray:
        return np.array([pos[0]/self.grid_size, pos[1]/self.grid_size, goal[0]/self.grid_size, goal[1]/self.grid_size])
    def execute_mission(self, verbose: bool = True) -> Tuple[List, List]:
        start_state = State(robot_pos=(0, 0), visited={(0, 0)})
        symbolic_plan = self.planner.a_star_plan(start_state, self.goal, self.obstacles, self.objects)
        if verbose:
            print(f" Symbolic Plan Generated: {len(symbolic_plan)} steps")
            print(f" Plan: {symbolic_plan[:10]}{'...' if len(symbolic_plan) > 10 else ''}\n")
        true_grid = np.zeros((self.grid_size, self.grid_size))
        for obs in self.obstacles:
            true_grid[obs[1], obs[0]] = 1.0
        noisy_obs = self.create_noisy_observation(true_grid)
        perceived_grid = self.perception.perceive(noisy_obs)
        if verbose:
            print(f" Neural Perception: Denoised obstacle map")
            print(f" Perception accuracy: {np.mean((perceived_grid > 0.5) == true_grid):.2%}\n")
        trajectory = [(0, 0)]
        current_pos = (0, 0)
        actions_taken = []
        for i, sym_action in enumerate(symbolic_plan[:30]):
            features = self.extract_state_features(current_pos, self.goal)
            refined_action = self.policy.select_action(features, sym_action) if sym_action in ['up', 'down', 'left', 'right'] else sym_action
            actions_taken.append(refined_action)
            if refined_action == 'up': current_pos = (current_pos[0], max(0, current_pos[1]-1))
            elif refined_action == 'down': current_pos = (current_pos[0], min(self.grid_size-1, current_pos[1]+1))
            elif refined_action == 'left': current_pos = (max(0, current_pos[0]-1), current_pos[1])
            elif refined_action == 'right': current_pos = (min(self.grid_size-1, current_pos[0]+1), current_pos[1])
            if current_pos not in self.obstacles:
                trajectory.append(current_pos)
        return trajectory, actions_taken

We bring the symbolic and neural layers together into a unified agent. We generate a symbolic plan, perceive the environment through neural processing, and refine each planned action using the neural policy. As we execute the mission loop, we observe how both systems interact seamlessly to produce robust behavior. Check out the FULL CODES here.

def visualize_execution(agent: NeuroSymbolicAgent, trajectory: List, title: str = "Neuro-Symbolic Agent Execution"):
    fig, axes = plt.subplots(1, 2, figsize=(14, 6))
    ax = axes[0]
    grid = np.zeros((agent.grid_size, agent.grid_size, 3))
    for obs in agent.obstacles:
        grid[obs[1], obs[0]] = [0.3, 0.3, 0.3]
    for obj_pos in agent.objects.values():
        grid[obj_pos[1], obj_pos[0]] = [1.0, 0.8, 0.0]
    grid[agent.goal[1], agent.goal[0]] = [0.0, 1.0, 0.0]
    for i, pos in enumerate(trajectory):
        intensity = 0.3 + 0.7 * (i / len(trajectory))
        grid[pos[1], pos[0]] = [intensity, 0.0, 1.0]
    if trajectory:
        grid[trajectory[0][1], trajectory[0][0]] = [1.0, 0.0, 0.0]
    ax.imshow(grid)
    ax.set_title("Agent Trajectory in Environment", fontsize=14, fontweight='bold')
    ax.set_xlabel("X Position")
    ax.set_ylabel("Y Position")
    ax.grid(True, alpha=0.3)
    ax = axes[1]
    ax.axis('off')
    ax.text(0.5, 0.95, "Neuro-Symbolic Architecture", ha='center', fontsize=16, fontweight='bold', transform=ax.transAxes)
    layers = [("SYMBOLIC LAYER", 0.75, "Planning • State Logic • Rules"), ("INTEGRATION", 0.60, "Feature Extraction • Action Blending"), ("NEURAL LAYER", 0.45, "Perception • Policy Learning"), ("EXECUTION", 0.30, "Action Refinement • Feedback"), ("ENVIRONMENT", 0.15, "State Transitions • Observations")]
    colors = ['#FF6B6B', '#4ECDC4', '#45B7D1', '#96CEB4', '#FFEAA7']
    for i, (name, y, desc) in enumerate(layers):
        ax.add_patch(plt.Rectangle((0.1, y-0.05), 0.8, 0.08, facecolor=colors[i], alpha=0.7, transform=ax.transAxes))
        ax.text(0.5, y, f"{name}\n{desc}", ha='center', va='center', fontsize=10, fontweight='bold', transform=ax.transAxes)
    plt.tight_layout()
    plt.savefig('neurosymbolic_agent.png', dpi=150, bbox_inches='tight')
    plt.show()
    print(f"\n Execution complete! Trajectory length: {len(trajectory)} steps")

We visualize how the agent moves through the environment and how the architecture is structured. We plot obstacles, objects, the goal, and the full trajectory so that we can clearly see the agent’s decision process. As we render the architecture layers, we understand how the hybrid design flows from planning to perception to action. Check out the FULL CODES here.

if __name__ == "__main__":
    print("=" * 70)
    print("NEURO-SYMBOLIC HYBRID AGENT TUTORIAL")
    print("Combining Classical AI Planning with Modern Neural Networks")
    print("=" * 70)
    print()
    agent = NeuroSymbolicAgent(grid_size=8)
    trajectory, actions = agent.execute_mission(verbose=True)
    visualize_execution(agent, trajectory)
    print("\n" + "=" * 70)
    print("KEY INSIGHTS:")
    print("=" * 70)
    print("✦ Symbolic Layer: Provides interpretable, verifiable plans")
    print("✦ Neural Layer: Handles noisy perception & adapts to uncertainty")
    print("✦ Integration: Combines strengths of both paradigms")
    print("✦ Benefits: Explainability + Flexibility + Robustness")
    print("=" * 70)

We run the complete neuro-symbolic pipeline from planning to execution to visualization. We instantiate the agent, execute the mission, and display key insights to summarize the system’s behavior. As we run this final block, we see the overall hybrid architecture in action and appreciate how each component contributes to the outcome.

In conclusion, we observe how smoothly the symbolic and neural components work together to produce a more capable and reliable agent. We appreciate how the symbolic planner gives us transparent, verifiable steps, while the neural layer adds adaptability and perceptual grounding that pure logic cannot offer. Through this hybrid approach, we can build agents that reason, perceive, and act in ways that are both intelligent and interpretable. We end with a deeper understanding of how neuro-symbolic AI moves us closer to practical, resilient agentic systems.


Amazon SageMaker AI introduces EAGLE based adaptive speculative decoding

Generative AI models continue to expand in scale and capability, increasing the demand for faster and more efficient inference. Applications need low latency and consistent performance without compromising output quality. Amazon SageMaker AI introduces new enhancements to its inference optimization toolkit that bring EAGLE based adaptive speculative decoding to more model architectures. These updates make it easier to accelerate decoding, optimize performance using your own data and deploy higher-throughput models using the familiar SageMaker AI workflow.
EAGLE, short for Extrapolation Algorithm for Greater Language-model Efficiency, is a technique that speeds up large language model decoding by predicting future tokens directly from the hidden layers of the model. When you guide optimization using your own application data, the improvements align with the actual patterns and domains you serve, producing faster inference that reflects your real workloads rather than generic benchmarks. Based on the model architecture, SageMaker AI trains EAGLE 3 or EAGLE 2 heads.
Note that training and optimization are not limited to a one-time operation. You can start with the datasets provided by SageMaker for initial training, and as you continue to gather your own data you can fine-tune with a curated dataset for highly adaptive, workload-specific performance. For example, you can use a tool such as Data Capture to build a dataset over time from the real-time requests hitting your hosted model. This can be an iterative process, with multiple training cycles that continuously improve performance.
In this post we’ll explain how to use EAGLE 2 and EAGLE 3 speculative decoding in Amazon SageMaker AI.
Solution overview
SageMaker AI now offers native support for both EAGLE 2 and EAGLE 3 speculative decoding, enabling each model architecture to apply the technique that best matches its internal design. For your base LLM, you can utilize either SageMaker JumpStart models or bring your own model artifacts to S3 from other model hubs, such as HuggingFace.
Speculative decoding is a widely employed technique for accelerating inference in LLMs without compromising quality. This method involves using a smaller draft model to generate preliminary tokens, which are then verified by the target LLM. The extent of the speedup achieved through speculative decoding is heavily dependent on the selection of the draft model.
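
To make the draft-and-verify loop concrete, here is a heavily simplified greedy-acceptance sketch; the target_next_token and draft_next_token callables are placeholders for real models, and production systems verify all drafted tokens in one batched forward pass with probabilistic acceptance rather than this exact-match rule.

def speculative_decode(target_next_token, draft_next_token, prompt, max_new_tokens=64, k=4):
    """Toy greedy draft-and-verify loop (illustrative only).

    target_next_token(tokens) and draft_next_token(tokens) are assumed callables
    that return the next token under the target and draft model respectively.
    """
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1) The cheap draft model proposes up to k tokens sequentially.
        draft, ctx = [], list(tokens)
        for _ in range(k):
            t = draft_next_token(ctx)
            draft.append(t)
            ctx.append(t)
        # 2) The target model verifies the proposals and keeps the longest agreeing prefix.
        accepted = 0
        for i, t in enumerate(draft):
            if target_next_token(tokens + draft[:i]) == t:
                accepted += 1
            else:
                break
        tokens.extend(draft[:accepted])
        # 3) The target always emits at least one token, so decoding makes progress
        #    even when every drafted token is rejected.
        tokens.append(target_next_token(tokens))
    return tokens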

The sequential nature of modern LLMs makes them expensive and slow, and speculative decoding has proven to be an effective solution to this problem. Methods like EAGLE improve upon this by reusing features from the target model, leading to better results. However, a current trend in the LLM community is to increase training data to boost model intelligence without adding inference costs. Unfortunately, this approach has limited benefits for EAGLE. This limitation is due to EAGLE’s constraints on feature prediction. To address this, EAGLE-3 is introduced, which predicts tokens directly instead of features and combines features from multiple layers using a technique called training-time testing. These changes significantly improve performance and allow the model to fully benefit from increased training data.

To give customers maximum flexibility, SageMaker supports every major workflow for building or refining an EAGLE model. You can train an EAGLE model entirely from scratch using the SageMaker curated open dataset, or train it from scratch with your own data to align speculative behavior with your traffic patterns. You can also start from an existing EAGLE base model: either retraining it with the default open dataset for a fast, high-quality baseline, or fine-tuning that base model with your own dataset for highly adaptive, workload-specific performance. In addition, SageMaker JumpStart provides fully pre-trained EAGLE models so you can begin optimizing immediately without preparing any artifacts.
The solution spans six supported architectures and includes a pre-trained, pre-cached EAGLE base to accelerate experimentation. SageMaker AI also supports widely used training data formats, specifically ShareGPT and the OpenAI chat and completions formats, so existing corpora can be used directly. Customers can also provide data captured from their own SageMaker AI endpoints, provided it is in one of the formats specified above. Whether you rely on the SageMaker open dataset or bring your own, optimization jobs typically deliver around a 2.5x throughput improvement over standard decoding while adapting naturally to the nuances of your specific use case.
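
For reference, a minimal sketch of what one training record could look like in the OpenAI chat format and in the commonly used ShareGPT layout is shown below; the field values are placeholders, and you should confirm the exact schema expected by your optimization job.

import json

# Illustrative OpenAI chat format record; field values are placeholders.
openai_chat_record = {
    "messages": [
        {"role": "user", "content": "What does error code E42 on unit 7 mean?"},
        {"role": "assistant", "content": "Error E42 indicates a blocked condensate drain."},
    ]
}

# Illustrative ShareGPT style record as commonly structured.
sharegpt_record = {
    "conversations": [
        {"from": "human", "value": "What does error code E42 on unit 7 mean?"},
        {"from": "gpt", "value": "Error E42 indicates a blocked condensate drain."},
    ]
}

# Each record is written as one line of the JSONL training file.
print(json.dumps(openai_chat_record))
print(json.dumps(sharegpt_record))
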
All optimization jobs automatically produce benchmark results, giving you clear visibility into latency and throughput improvements. You can run the entire workflow using SageMaker Studio or the AWS CLI, and you deploy the optimized model through the same interface you already use for standard SageMaker AI inference.
SageMaker AI currently supports LlamaForCausalLM, Qwen3ForCausalLM, Qwen3MoeForCausalLM, Qwen2ForCausalLM and GptOssForCausalLM with EAGLE 3, and Qwen3NextForCausalLM with EAGLE 2. You can use one optimization pipeline across a mix of architectures while still gaining the benefits of model-specific behavior.
How EAGLE works inside the model
Speculative decoding can be thought of as a seasoned chief scientist guiding the flow of discovery. In traditional setups, a smaller "assistant" model runs ahead, quickly sketching out several possible token continuations, while the larger model examines and corrects those suggestions. This pairing reduces the number of slow, sequential steps by verifying multiple drafts at once.
EAGLE streamlines this process even further. Instead of depending on an external assistant, the model effectively becomes its own lab partner: it inspects its internal hidden-layer representations to anticipate several future tokens in parallel. Because these predictions arise from the model’s own learned structure, they tend to be more accurate upfront, leading to deeper speculative steps, fewer rejections, and smoother throughput.
By removing the overhead of coordinating a secondary model and enabling highly parallel verification, this approach alleviates memory bandwidth bottlenecks and delivers notable speedups, often around 2.5x, while maintaining the same output quality the baseline model would produce.
Running optimization jobs from the SDK or CLI
You can interface with the optimization toolkit using the AWS Python (Boto3) SDK or the Studio UI. In this section we use the AWS CLI; the same API calls map directly to the Boto3 SDK. The core API calls for endpoint creation remain the same: create_model, create_endpoint_config, and create_endpoint. The workflow we showcase here begins with model registration using the create_model API call, where you specify your serving container and stack. You don't need to create a SageMaker model object; you can instead specify the model data directly in the optimization job API call.
For the EAGLE heads optimization, we specify the model data through the ModelDataSource parameter; at the moment, specifying a Hugging Face Hub model ID is not supported. Pull your artifacts, upload them to an S3 bucket, and reference that location in the ModelDataSource parameter. By default, checks verify that the appropriate files are uploaded, so you need the standard model data expected for LLMs:

# traditional model data needed
model/
config.json
tokenizer.json
tokenizer_config.json
special_tokens_map.json
generation_config.json
vocab.json
model.safetensors
model.safetensors.index.json

Let’s look at a few paths here:

Using your own model data with your own EAGLE curated dataset
Bringing your own trained EAGLE that you may want to train more
Bring your own model data and use SageMaker AI built-in datasets

1. Using your own model data with your own EAGLE curated dataset
We can start an optimization job with the create-optimization-job API call. Here is an example with a Qwen3 32B model. Note that you can bring your own data or also use the built-in SageMaker provided datasets. First we can create a SageMaker Model object that specifies the S3 bucket with our model artifacts:

aws sagemaker --region us-west-2 create-model \
    --model-name <target-model-name> \
    --primary-container '{ "Image": "763104351884.dkr.ecr.{region}.amazonaws.com/djl-inference:{CONTAINER_VERSION}",
        "ModelDataSource": { "S3DataSource": { "S3Uri": "Enter model path",
        "S3DataType": "S3Prefix", "CompressionType": "None" } } }' \
    --execution-role-arn "Enter Execution Role ARN"

Our optimization call then pulls down these model artifacts when you specify the SageMaker model and a TrainingDataSource parameter as follows:

aws sagemaker --region us-west-2 create-optimization-job \
    --optimization-job-name <job-name> \
    --account-id <account-id> \
    --deployment-instance-type ml.p5.48xlarge \
    --max-instance-count 10 \
    --model-source '{
        "SageMakerModel": { "ModelName": "Created Model name" }
    }' \
    --optimization-configs '{
        "ModelSpeculativeDecodingConfig": {
            "Technique": "EAGLE",
            "TrainingDataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": "Enter custom train data location"
            }
        }
    }' \
    --output-config '{
        "S3OutputLocation": "Enter optimization output location"
    }' \
    --stopping-condition '{"MaxRuntimeInSeconds": 432000}' \
    --role-arn "Enter Execution Role ARN"

2. Bringing your own trained EAGLE that you may want to train more
For your own trained EAGLE, you can specify another parameter in the create_model API call that points to your EAGLE artifacts; optionally, you can also specify a SageMaker JumpStart model ID to pull down the packaged model artifacts.

# Enable additional model data source with EAGLE artifacts
aws sagemaker --region us-west-2 create-model \
    --model-name <target-model-name> \
    --primary-container '{ "Image": "763104351884.dkr.ecr.{region}.amazonaws.com/djl-inference:{CONTAINER_VERSION}",
        "ModelDataSource": { "S3DataSource": { "S3Uri": "<model path>",
        "S3DataType": "S3Prefix", "CompressionType": "None" } },
        "AdditionalModelDataSources": [ { "ChannelName": "eagle_model",
        "S3DataSource": { "S3Uri": "<pre-trained EAGLE path>",
        "S3DataType": "S3Prefix", "CompressionType": "None" } } ] }' \
    --execution-role-arn "Enter Execution Role ARN"

Similarly the optimization API then inherits this model object with the necessary model data:

aws sagemaker --region us-west-2 create-optimization-job \
    --account-id <account-id> \
    --optimization-job-name <job-name> \
    --deployment-instance-type ml.p5.48xlarge \
    --max-instance-count 10 \
    --model-source '{
        "SageMakerModel": {
            "ModelName": "Created Model Name"
        }
    }' \
    --optimization-configs '{
        "ModelSpeculativeDecodingConfig": {
            "Technique": "EAGLE",
            "TrainingDataSource": {
                "S3Uri": "Enter training data path",
                "S3DataType": "S3Prefix"
            }
        }
    }' \
    --output-config '{
        "SageMakerModel": {
            "ModelName": "Model Name"
        },
        "S3OutputLocation": "Enter output data location"
    }' \
    --stopping-condition '{"MaxRuntimeInSeconds": 432000}' \
    --role-arn "Enter Execution Role ARN"

3. Bring your own model data and use SageMaker built-in datasets
Optionally, we can utilize the SageMaker provided datasets:

# SageMaker Provided Optimization Datasets
gsm8k_training.jsonl (https://huggingface.co/datasets/openai/gsm8k)
magicoder.jsonl (https://huggingface.co/datasets/ise-uiuc/Magicoder-OSS-Instruct-75K)
opencodeinstruct.jsonl (https://huggingface.co/datasets/nvidia/OpenCodeInstruct)
swebench_oracle_train.jsonl (https://huggingface.co/datasets/nvidia/OpenCodeInstruct)
ultrachat_0_8k_515292.jsonl (https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k)

After completion, SageMaker AI stores evaluation metrics in S3 and records the optimization lineage in Studio. You can deploy the optimized model to an inference endpoint with either the create_endpoint API call or in the UI.
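
As a minimal sketch, deploying the optimized model programmatically could look like the following Boto3 calls; the endpoint, configuration, and model names are placeholders, and the instance type should match your deployment target.

import boto3

sm = boto3.client("sagemaker", region_name="us-west-2")

# Endpoint configuration pointing at the optimized SageMaker model (placeholder names).
sm.create_endpoint_config(
    EndpointConfigName="eagle-optimized-config",
    ProductionVariants=[{
        "VariantName": "AllTraffic",
        "ModelName": "<optimized-model-name>",
        "InstanceType": "ml.p5.48xlarge",
        "InitialInstanceCount": 1,
    }],
)

# Create the real-time endpoint from that configuration.
sm.create_endpoint(
    EndpointName="eagle-optimized-endpoint",
    EndpointConfigName="eagle-optimized-config",
)
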
Benchmarks
To benchmark this further we compared three states:

No EAGLE: Base model without EAGLE as a baseline
Base EAGLE: EAGLE training using built-in datasets provided by SageMaker AI
Trained EAGLE: EAGLE training using built-in datasets provided by SageMaker AI and retraining with your own custom dataset

The numbers displayed below are for Qwen3-32B across metrics such as time to first token (TTFT), time per output token (TPOT), inter-token latency (ITL), and overall throughput.

| Configuration | Concurrency | TTFT (ms) | TPOT (ms) | ITL (ms) | Request Throughput | Output Throughput (tokens/sec) | OTPS per request (tokens/sec) |
|---|---|---|---|---|---|---|---|
| No EAGLE | 4 | 168.04 | 45.95 | 45.95 | 0.04 | 86.76 | 21.76 |
| No EAGLE | 8 | 219.53 | 51.02 | 51.01 | 0.08 | 156.46 | 19.6 |
| Base EAGLE | 1 | 89.76 | 21.71 | 53.01 | 0.02 | 45.87 | 46.07 |
| Base EAGLE | 2 | 132.15 | 20.78 | 50.75 | 0.05 | 95.73 | 48.13 |
| Base EAGLE | 4 | 133.06 | 20.11 | 49.06 | 0.1 | 196.67 | 49.73 |
| Base EAGLE | 8 | 154.44 | 20.58 | 50.15 | 0.19 | 381.86 | 48.59 |
| Trained EAGLE | 1 | 83.6 | 17.32 | 46.37 | 0.03 | 57.63 | 57.73 |
| Trained EAGLE | 2 | 129.07 | 18 | 48.38 | 0.05 | 110.86 | 55.55 |
| Trained EAGLE | 4 | 133.11 | 18.46 | 49.43 | 0.1 | 214.27 | 54.16 |
| Trained EAGLE | 8 | 151.19 | 19.15 | 51.5 | 0.2 | 412.25 | 52.22 |

Pricing considerations
Optimization jobs run on SageMaker AI training instances, so you are billed based on the instance type and job duration. Deployment of the resulting optimized model uses standard SageMaker AI inference pricing.
Conclusion
EAGLE based adaptive speculative decoding gives you a faster and more effective path to improve generative AI inference performance on Amazon SageMaker AI. By working inside the model rather than relying on a separate draft network, EAGLE accelerates decoding, increases throughput and maintains generation quality. When you optimize using your own dataset, the improvements reflect the unique behavior of your applications, resulting in better end-to-end performance. With built-in dataset support, benchmark automation and streamlined deployment, the inference optimization toolkit helps you deliver low-latency generative applications at scale.

About the authors
Kareem Syed-Mohammed is a Product Manager at AWS. He focuses on enabling generative AI model development and governance on SageMaker HyperPod. Prior to this, at Amazon QuickSight, he led embedded analytics and developer experience. In addition to QuickSight, he has been with AWS Marketplace and Amazon retail as a Product Manager. Kareem started his career as a developer for call center technologies, worked on Local Expert and Ads for Expedia, and was a management consultant at McKinsey.
Xu Deng is a Software Engineer Manager with the SageMaker team. He focuses on helping customers build and optimize their AI/ML inference experience on Amazon SageMaker. In his spare time, he loves traveling and snowboarding.
Ram Vegiraju is an ML Architect with the Amazon SageMaker Service team. He focuses on helping customers build and optimize their AI/ML solutions on SageMaker. In his spare time, he loves traveling and writing.
Vinay Arora is a Specialist Solution Architect for Generative AI at AWS, where he collaborates with customers in designing cutting-edge AI solutions leveraging AWS technologies. Before AWS, Vinay gained over two decades of experience in finance, including roles at banks and hedge funds, where he built risk models, trading systems, and market data platforms. Vinay holds a master's degree in computer science and business management.
Siddharth Shah is a Principal Engineer at AWS SageMaker, specializing in large-scale model hosting and optimization for Large Language Models. He previously worked on the launch of Amazon Textract, performance improvements in the model-hosting platform, and expedited retrieval systems for Amazon S3 Glacier. Outside of work, he enjoys hiking, video games, and hobby robotics.
Andy Peng is a builder with curiosity, motivated by scientific research and product innovation. He helped build key initiatives that span AWS SageMaker and Bedrock, Amazon S3, AWS App Runner, AWS Fargate, Alexa Health & Wellness, and AWS Payments, from 0-1 incubation to 10x scaling. Open-source enthusiast.
Johna Liu is a Software Development Engineer on the Amazon SageMaker team, where she builds and explores AI/LLM-powered tools that enhance efficiency and enable new capabilities. Outside of work, she enjoys tennis, basketball and baseball.
Anisha Kolla is a Software Development Engineer on the SageMaker Inference team with over 10 years of industry experience. She is passionate about building scalable and efficient solutions that empower customers to deploy and manage machine learning applications seamlessly. Anisha thrives on tackling complex technical challenges and contributing to innovative AI capabilities. Outside of work, she enjoys exploring new Seattle restaurants, traveling, and spending time with family and friends.

Train custom computer vision defect detection model using Amazon SageMaker AI

On October 10, 2024, Amazon announced the discontinuation of the Amazon Lookout for Vision service, with a scheduled shut down date of October 31, 2025 (see Exploring alternatives and seamlessly migrating data from Amazon Lookout for Vision blog post). As part of our transition guidance for customers, we recommend the use of Amazon SageMaker AI tools to build applications for customers who are interested in AI/ML computer vision models for automated quality inspection use cases. To support that effort, AWS has made a pre-trained computer vision defect detection model available on AWS Marketplace that can be fine-tuned using Amazon SageMaker AI for a customer’s specific use case. If run in the cloud, this model only requires paying for infrastructure costs for training or inference. This approach provides the tools to accelerate solution development while facilitating complete flexibility to build a solution that integrates with any existing hardware and software infrastructure.
In this blog post, you will learn how to migrate your computer vision workloads from Amazon Lookout for Vision to Amazon SageMaker AI by following our step-by-step guidance.
AWS is sharing the main underlying models used for the service to end users in the AWS Marketplace. You can use the two main types of models, binary classification and semantic segmentation, when you train in your own AWS accounts for deployment on AWS or at the edge.
This model helps customers continue to use AWS defect detection technology at their own pace with greater flexibility. For example, you can train your models with larger instance types for faster training times. With access to the hyperparameters, you can also adjust model behavior in ways that were not previously available in the AWS console. For example, you can set the multi-head model for semantic segmentation to disable the binary classifier head. This can make the model more tolerant of changing background and lighting conditions. You can also personalize the maximum training time, which was set to a non-changeable 24-hour limit on Amazon Lookout for Vision (L4V).
The GitHub repository for Amazon Lookout for Vision has been updated with a Jupyter Notebook to help you train datasets with these two model types and package them up. From there you can deploy the models by using a SageMaker endpoint or on edge devices.
To label the images beyond the sample data, you can use Amazon SageMaker Ground Truth to enable crowdsourcing or allow private teams to label the data, or use a partner solution such as Edge Impulse, Roboflow, or SuperbAI to do so. When you have the manifest file of the labeled data, the marketplace models can be used for training. You will lose a thumbnail-based dataset management tool like the Amazon Lookout for Vision console, so consider one of the previously mentioned partner solutions to help manage datasets. You can also export your existing data from the Lookout For Vision service using this guide.
Prerequisites
Before you begin, make sure you have the following components and permissions in place:

Amazon SageMaker Studio or Amazon SageMaker Unified Studio for integrated development environment (IDE)
AWS Identity and Access Management (IAM) role with these permissions to follow the principle of least privilege

Amazon S3

s3:GetObject
s3:PutObject
s3:DeleteObject
s3:ListBucket

SageMaker

sagemaker:CreateTrainingJob
sagemaker:CreateModel
sagemaker:CreateEndpoint
sagemaker:CreateEndpointConfig
sagemaker:CreateTransformJob
sagemaker:DescribeTrainingJob
sagemaker:DescribeModel
sagemaker:DescribeEndpoint
sagemaker:DescribeEndpointConfig
sagemaker:DescribeTransformJob
sagemaker:InvokeEndpoint
sagemaker:DeleteEndpoint
sagemaker:DeleteEndpointConfig
sagemaker:DeleteModel

Model subscription:

An AWS account with a subscription to Computer Vision Defect Detection Model or
An IAM role with these three permissions to make AWS Marketplace subscriptions in the AWS account you use:

aws-marketplace:ViewSubscriptions
aws-marketplace:Unsubscribe
aws-marketplace:Subscribe

Labeled data (you can use the cookie data sample in Github) or label your own data with SageMaker Ground Truth or an AWS Partner tool
Basic knowledge of creating a SageMaker notebook instance and running Jupyter notebook

Architecture overview
The following diagram illustrates the end-to-end flow, from image acquisition to inferencing at the edge. This blog focuses on steps 2 and 3.

Use an edge application to configure cameras or sensors and capture training images.
Use SageMaker GroundTruth or AWS Partner platforms to export and label images.
Use Amazon SageMaker AI for model training.
Use REST, PLC, or digital input for image acquisition and processing.
Run real-time inference using the trained and deployed model.
Publish inference results to analytics and monitoring for alerts and analytics.
Perform automated action on the machine of concern or notify plant personnel of anomalies from inspection station component using OPC-UA or digital output.
Line operators and plant managers receive notifications for action.

Set up the labeling process
This section covers the steps to set up the labeling process using Amazon SageMaker Ground Truth, including creating a private labeling team and configuring the labeling job.

Configure Amazon SageMaker Ground Truth private team:

Select Amazon SageMaker AI, Ground Truth, Labeling workforces.
Select Private, then Create Private Team.
Enter a team name.
Leave other values as their defaults.
Select Create a new Amazon Cognito user group.
Select Create private Team.

On the Workers tab, select Invite New Workers.
Enter your team members’ email addresses to send sign-up invitations.

Label the dataset
After successfully completing the workforce setup for labelling, the next step is to label the dataset. This section explains how to prepare the dataset by uploading the images to an Amazon Simple Storage Service (Amazon S3) bucket, then create and run the SageMaker Ground Truth labeling job to label the images as normal or anomaly.

Upload the image datasets to an Amazon S3 bucket that SageMaker Ground Truth can access. If you don’t have a dataset, you can use either the cookie-dataset or aliens-dataset.

Copy all of the images from the "normal" and "anomaly" folders into a single directory for SageMaker Ground Truth to access, or you will get an error message in the next step.
To use AWS CloudShell, run the following script:

#!/bin/bash
# Clone the repository
git clone https://github.com/aws-samples/amazon-lookout-for-vision.git
cd amazon-lookout-for-vision/aliens-dataset
# Remove the existing all directory if it exists
rm -rf all
# Create a new all directory
mkdir -p all
# Copy normal images to the all directory
cp normal/*.png all/
# We are still in amazon-lookout-for-vision/aliens-dataset, so the loop can run directly
# Copy anomaly images with an .anomaly.png suffix
for file in anomaly/*.png; do
    if [ -f "$file" ]; then
        filename=$(basename "$file")
        cp "$file" "all/${filename}.anomaly.png"
    fi
done
# Count files to verify
echo "Normal images: $(find normal -name "*.png" | wc -l)"
echo "Anomaly images: $(find anomaly -name "*.png" | wc -l)"
echo "Total images in all directory: $(find all -type f | wc -l)"
# Upload to S3
aws s3 cp all/ s3://<BUCKET_NAME>/aliens-dataset-all/ --recursive
# Clean up - remove the cloned repository
cd ../..
rm -rf amazon-lookout-for-vision

Alternatively, if you have the AWS CLI installed, you can copy them with the following commands (See setting up AWS CLI for how to do this):

sh-4.2$ git clone https://github.com/aws-samples/amazon-lookout-for-vision.git
sh-4.2$ cd amazon-lookout-for-vision/aliens-dataset ## keep in mind the filenames here clash; the following commands help fix this
sh-4.2$ mkdir all
sh-4.2$ cp normal/*.png all/
sh-4.2$ aws s3 cp s3://aws-blogs-artifacts-public/artifacts/ML-19308/copy_conflicts.sh .

sh-4.2$ bash copy_conflicts.sh

sh-4.2$ ls -al all/

-rwxrwxr-x 1 ec2-user ec2-user 120035 Feb 17 16:39 59.png
-rwxrwxr-x 1 ec2-user ec2-user 93407 Feb 17 16:39 5.png
-rwxrwxr-x 1 ec2-user ec2-user 125477 Feb 17 16:39 5.png.anomaly.png
-rwxrwxr-x 1 ec2-user ec2-user 123679 Feb 17 16:39 60.png
-rwxrwxr-x 1 ec2-user ec2-user 96330 Feb 17 16:39 6.png
-rwxrwxr-x 1 ec2-user ec2-user 126014 Feb 17 16:39 6.png.anomaly.png
-rwxrwxr-x 1 ec2-user ec2-user 81051 Feb 17 16:39 7.png
-rwxrwxr-x 1 ec2-user ec2-user 128985 Feb 17 16:39 7.png.anomaly.png
-rwxrwxr-x 1 ec2-user ec2-user 94216 Feb 17 16:39 8.png
-rwxrwxr-x 1 ec2-user ec2-user 128002 Feb 17 16:39 8.png.anomaly.png
-rwxrwxr-x 1 ec2-user ec2-user 110814 Feb 17 16:39 9.png
-rwxrwxr-x 1 ec2-user ec2-user 131385 Feb 17 16:39 9.png.anomaly.png

sh-4.2$ aws s3 cp all/ s3://<BUCKET_NAME>/aliens-dataset-all/ --recursive
Note: To prevent filename clashes between the two folders, an .anomaly.png suffix was added. The uploaded files should be in your <BUCKET_NAME>/aliens-dataset-all location for the Ground Truth job.

In the AWS Console, navigate to Amazon SageMaker AI, Ground Truth, Labeling Jobs, Create labeling job.

There are several options here to fill in; the most important fields to fill or select are:

Input data setup: Select Automated data setup
S3 location for input datasets: <Full path where your dataset exists>
S3 location data output datasets: <Same location as input dataset>
Data type: Select Image
IAM Role – Select Create new role if you do not have one set up to allow Ground Truth to interact with SageMaker services.

Choose Complete data setup. An Input data connection successful message displays. If you get an error, check your IAM role to make sure S3 access is enabled, and the directory has image files in it, as it will not recurse through sub-directories.

Select the task type. These models support Image Classification (Single Label), which is binary classification (think good or bad), or Semantic segmentation. You cannot use a bounding box type with these models. You can change your selection later.
Choose Next.
For Worker types, select Private. You can read more about Amazon Mechanical Turks or labeling subscriptions in the Developer Guide.
Under Private teams, select the private team you created in the previous steps.
For Task timeout and Task expiration time, leave the default values.
Leave Enable automated data labeling unselected. You can read more about automated data labeling here; however, it is not compatible with semantic segmentation.
On the Image classification screen, add two new labels: normal and anomaly. You can fill in the rest as needed. Choose Preview to see a preview of what it will look like to the end user.
Choose Create.
Select Ground Truth, and then select the Private tab.

Open the labeling portal sign-in URL in a new tab in your browser and then sign in to see your assigned tasks.
Select an assigned task and choose Start working to label the data.
Select normal or anomaly.

When the job is complete, make note of the output dataset location. You will need this for the training step.

If you need to add workers to the labelling job:

On the Amazon SageMaker AI Ground Truth page, select Labeling workforces.
Select the Private tab.
Click on the private team that was created earlier (CV-team).
Select the Workers tab
Select the desired worker from the list and choose Add workers to team.

You will then be redirected to the Amazon SageMaker AI Labeling workforces page with a confirmation message that the worker has been added.

After you complete the labeling task, the output of the task is used to train the Computer Vision Defect Detection model from the AWS Marketplace.
Train the model
This section discusses training the computer vision model using the AWS Marketplace Computer Vision Defect Detection model and the labeled dataset from the previous step.

Go to the AWS Marketplace to subscribe to the model, https://aws.amazon.com/marketplace/pp/prodview-j72hhmlt6avp6.
Choose Continue to Subscribe.
Choose Continue to configuration.
Select the latest software version, your Region, and make sure Create a training job is selected.

Note: Copy the Product Arn and store it in a text editor or notepad for later use.

Go to SageMaker AI, Notebook instances, Create notebook instance.

Note: A GPU-enabled notebook instance is not required. Amazon SageMaker Training jobs will spin up the GPU instances needed during training, so most basic instances will be sufficient.

Select an ml.m5.2xlarge instance, JupyterLab 4, and a volume size of 128 GB. The default is 5 GB, which is too small.
Select an IAM role to allow the notebook to access resources in your account. You will need access to S3.
In the Git Repositories – optional section, select Clone a public Git repository to this notebook instance only.
Enter the Git repository URL. Leave all the other fields as their default, then choose Create notebook instance to start the instance.
After the instance starts (the status will display as InService), select the Open JupyterLab action for the new notebook instance.

JupyterLab opens:

On the left navigation pane, open the computer-vision-defect-detection folder.

In the AWS Console, go to Marketplace, Manage subscriptions, and then copy the ARN of your model subscription.

In the Jupyter notebook, locate the snippet below and update the placeholder value for algorithm_name variable with the Product Arn you copied in the previous step.

# TODO: change this to use the subscribed SageMaker algorithm
algorithm_name = "<Customer to specify the algorithm name after subscription>"

The bucket used for this step is automatically created and named in the format sagemaker-<REGION>-<ACCOUNT_ID>.

# Imports (these may already be present in an earlier notebook cell)
import sagemaker
from sagemaker import get_execution_role

# Initialize SageMaker session and get execution role
sagemaker_session = sagemaker.Session()
region = sagemaker_session.boto_region_name
# bucket = sagemaker_session.default_bucket()
role = get_execution_role()
# Project name is used as part of the S3 output path
project = "ComputerVisionDefectDetection"

In the AWS Console, navigate to Amazon SageMaker AI, Ground Truth, Labeling jobs and select the job that was completed.
Identify and take note of the output images folder (Output dataset location)

Note: To start the training job, look at the path for the output manifest in <BUCKET NAME>/aliens-dataset/all/aliensv2/manifests/output/output.manifest—this will be the training manifest for the next step.

Set the bucket variable to the images bucket name that you previously used and the object key to the path of your manifest, as in the sketch after this list:

bucket: where to store the manifest file
classification_manifest_key: where the output manifest file is stored (for example, aliens-dataset-all/[job-name]/manifests/output/output.manifest)
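
Here is the sketch referenced above, a minimal way to wire these values together; the bucket and job names are placeholders, and classification_s3_path is the S3 URI consumed by the TrainingInput later in the notebook.

# Placeholder values: substitute your own bucket and labeling job name.
bucket = "<BUCKET_NAME>"
classification_manifest_key = "aliens-dataset-all/<job-name>/manifests/output/output.manifest"

# Full S3 URI of the augmented manifest consumed as training input below.
classification_s3_path = f"s3://{bucket}/{classification_manifest_key}"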

Review the model training configuration in the Classification Model with Algorithm Estimator section.

# Imports (these may already be present in an earlier notebook cell)
import datetime
from sagemaker.algorithm import AlgorithmEstimator
from sagemaker.inputs import TrainingInput

# Create AlgorithmEstimator for classification
classification_estimator = AlgorithmEstimator(
    algorithm_arn=algorithm_name,
    role=role,
    instance_count=1,
    instance_type='ml.g4dn.2xlarge',
    volume_size=20,
    max_run=7200,
    input_mode='Pipe',  # REQUIRED: Algorithm only supports Pipe mode
    sagemaker_session=sagemaker_session,
    enable_network_isolation=True
)

# Set hyperparameters
classification_estimator.set_hyperparameters(
    ModelType='classification',
    TestInputDataAttributeNames='source-ref,anomaly-label-metadata,anomaly-label',
    TrainingInputDataAttributeNames='source-ref,anomaly-label-metadata,anomaly-label'
)

print("Classification estimator configured successfully")

# Define training input using TrainingInput class
classification_training_input = TrainingInput(
    s3_data=classification_s3_path,
    s3_data_type='AugmentedManifestFile',
    attribute_names=[
        'source-ref',
        'anomaly-label-metadata',
        'anomaly-label'
    ],
    record_wrapping='RecordIO',
    input_mode='Pipe'  # Must match the estimator's input_mode
)

# Start training job
classification_job_name = f'defect-detection-classification-{datetime.datetime.now().strftime("%Y-%m-%d-%H-%M-%S")}'
print(f"Starting classification training job: {classification_job_name}")
classification_estimator.fit(
    inputs={'training': classification_training_input},
    job_name=classification_job_name,
    wait=True,
    logs=True
)

Note: The job uses NVIDIA G4DN instances. They can be sized up to a larger instance to decrease training time, but with a dataset of only 118 images the sample training finishes in less than 10 minutes on a g4dn.2xlarge. You can experiment with other instance types; however, results may vary because the models were extensively tested on the G4DN instances.

Validate the values of TestInputDataAttributeNames and TrainingInputDataAttributeNames in the Hyperparameters section, as well as AttributeNames in the TrainingInput section. The labels on all three must match the structure of your manifest file. Here is a sample manifest:

{
    "source-ref": "s3://[bucketname]/getting-started/training-images/anomaly-1.jpg",
    "anomaly-label-metadata": {
        "job-name": "anomaly-label",
        "class-name": "anomaly",
        "human-annotated": "yes",
        "creation-date": "2022-08-22T20:52:51.851Z",
        "type": "groundtruth/image-classification"
    },
    "anomaly-label": 1
}
{
    "source-ref": "s3://[bucketname]/getting-started/training-images/anomaly-2.jpg",
    "anomaly-label-metadata": {
        "job-name": "anomaly-label",
        "class-name": "anomaly",
        "human-annotated": "yes",
        "creation-date": "2022-08-22T21:11:39.545Z",
        "type": "groundtruth/image-classification"
    },
    "anomaly-label": 1
}

Note: Two of the three values include the labelling job name.

response = sagemaker.create_training_job(
    TrainingJobName=classification_training_job_name,
    HyperParameters={
        'ModelType': 'classification',
        'TestInputDataAttributeNames': 'source-ref,aliens-v3,aliens-v3-metadata',
        'TrainingInputDataAttributeNames': 'source-ref,aliens-v3,aliens-v3-metadata'
    }
)

Run all the cells or blocks listed in the Classification Model with Algorithm Estimator section to start the training job.
If you want to train a segmentation model as well, follow the steps in the Segmentation Model with Algorithm Estimator section.

Note: After the training is complete, you are ready to test it. There are a few inference options available:

Real-time inference using Amazon SageMaker endpoints
Amazon SageMaker AI Batch Transform inference
Edge deployment

Deploy the model
Amazon SageMaker AI endpoints and Amazon SageMaker AI Batch Transform inference are both used for inference but serve different purposes.
Amazon SageMaker AI endpoints
Amazon SageMaker AI endpoints are used for real-time inference, providing low-latency predictions suitable for applications requiring immediate responses. Endpoints remain active while they’re deployed, making them better suited for continuous and steady traffic, but potentially more costly due to ongoing resource usage.

In the Jupyter notebook, navigate to the (Optional) Running real-time inference using Amazon SageMaker endpoints section.
Run the following cell blocks to set up and invoke the endpoint:

classification_training_job_name = "<provide training job name here>"

# Create estimator from training job
estimator = AlgorithmEstimator.attach(classification_training_job_name)

# Deploy endpoint using SageMaker v2 SDK
predictor = estimator.deploy(
    initial_instance_count=1,
    instance_type='ml.c5.2xlarge'
)

print(f"Endpoint deployed: {predictor.endpoint_name}")
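
The invocation cell that follows assumes image_data holds the raw bytes of a test image and local_file points to a temporary local copy (the notebook prepares these in an earlier cell). A minimal sketch, with a hypothetical file name, might look like this:

# Hypothetical local copy of a test image (for example, downloaded from S3)
local_file = "test-image.jpg"

# Read the raw bytes to send to the endpoint
with open(local_file, "rb") as f:
    image_data = f.read()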

# Invoke the endpoint using the predictor
result = predictor.predict(image_data)

# Clean up the temporary file
os.remove(local_file)

# Print the result
print("\nEndpoint Response:")
print(json.dumps(result, indent=2))

Validate the inference, then delete the endpoint by running the following block:

# Delete the endpoint
predictor.delete_endpoint()
print("Endpoint deleted")

Note: If you start an endpoint, keep in mind you will be billed while it is running until you turn it off.
Amazon SageMaker AI Batch Transform
Batch Transform is designed for offline inference and making predictions on large datasets stored in S3, and is ideal for bulk processing where low latency is not critical. After the job is complete, the resources are released, making it cost-effective for sporadic workloads.

Navigate to the (Optional) Run Batch Transform Inference using SageMaker SDK v2 section.
Define the s3_input_data and s3_output_path parameters.

# Run batch transform job

#############################################
# Change to your input/output data S3 path  #
#############################################

s3_input_data = "s3://<Specify-s3-path-to-test-images>"
s3_output_path = f"s3://{bucket}/{project}/batch-transform-output"
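
The cells in this section create a transformer from the trained estimator and start the job against those paths. The following is a minimal sketch of that flow, not the notebook's exact cells; the instance type and content type are assumptions you may need to adjust for your dataset:

# Create a transformer from the trained classification estimator
transformer = classification_estimator.transformer(
    instance_count=1,
    instance_type="ml.g4dn.2xlarge",
    output_path=s3_output_path,
)

# Run batch inference over the S3 prefix of test images
transformer.transform(
    data=s3_input_data,
    data_type="S3Prefix",
    content_type="image/jpeg",  # assumption: JPEG test images
)
transformer.wait()
print(f"Batch transform output written to: {s3_output_path}")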

Run all the cells and blocks in the (Optional) Run Batch Transform Inference using SageMaker SDK v2 section to complete the batch inference.
Validate the batch transform job after completion by navigating to the s3_output_path folder. The following is a sample inference output file:

{
    "Source": {
        "Type": "direct"
    },
    "IsAnomalous": true,
    "Confidence": 0.92744799389183
}

Clean up
To avoid incurring unnecessary charges, delete the following resources when you no longer need them:

Delete SageMaker endpoints.

Navigate to the Amazon SageMaker Console.
Select Endpoints.
Select the endpoint you created.
Choose Delete.

Delete SageMaker Notebook instances.

Navigate to the Amazon SageMaker Console.
Select Notebook instances.
Select the notebook instance you created.
Choose Stop if the instance is running.
Once stopped, choose Delete.

Delete S3 objects and buckets.

Navigate to the Amazon S3 Console.
Delete all objects in the buckets you created for this tutorial.
Delete the empty buckets.

Delete the Ground Truth labeling team.

Navigate to Ground Truth.
Select Labeling workforces.
Select the Private tab.
Select the private team you created.
Choose Delete team.

Conclusion
In this blog post, we demonstrated how to transition from Amazon Lookout for Vision to the underlying Computer Vision Detection models available through the AWS Marketplace, walking through the step-by-step process of setting up labeling, training the model, and running inference through batch transformation. The transition gives customers greater flexibility in training options, hyperparameter adjustments, and deployment choices while continuing to use AWS defect detection technology at their own pace. Also be sure to check out our edge-based open source integrated Defect Detection Application on GitHub if you would like to build on what you have learned here.

About the authors
Ryan Vanderwerf is a senior partner solutions architect at Amazon Web Services specializing in smart manufacturing, vision, and machine learning. Ryan previously provided Java virtual machine-focused consulting and project development as a software engineer at OCI on the Grails and Micronaut team. He was chief architect/director of products at ReachForce, with a focus on software and system architecture for AWS Cloud SaaS solutions for marketing data management. Since 1996, Ryan has built several SaaS solutions in domains such as financial, media, telecom, and e-learning.
Lu Min is a Software Development Engineer for AWS Edge ML services, focused on developing machine learning solutions that operate at the edge for AWS customers. With expertise in optimizing ML models for resource-constrained environments, Lu helps customers implement efficient inference capabilities on edge devices and cloud communication, as well as manage model lifecycle using AWS SageMaker.
Tim Westman is the Product Manager and Go-to-Market Lead for Edge Machine Learning, AWS. Tim leads the Product Management and Business Development for the Edge Machine Learning business at Amazon Web Services. In this role, he works with customers to help build computer vision solutions at the edge to solve complex operational challenges. Tim has more than 30 years of experience in sales, business development and product management roles for leading hardware and software companies, with the last 8 years specializing in AI and computer vision for IoT applications.
Kunle Adeleke is an enterprise solutions architect, helping large AWS commercial customers in diverse industries craft their technology strategy. Kunle has led enterprise architecture teams and software development teams in both government and commercial sectors. His deep expertise spans software development, solution architecture, enterprise architecture, security, and data & AI/ML.

Practical implementation considerations to close the AI value gap

Artificial Intelligence (AI) is changing how businesses operate. Gartner® predicts at least 15% of day-to-day work decisions will be made autonomously through agentic AI by 2028. And 92% of companies are boosting their AI spending, according to McKinsey.
But here’s the problem: most companies have yet to realize a positive impact of AI on their profit and loss (P&L). According to analysis from S&P Global Market Intelligence,

“The share of companies abandoning most of their AI initiatives jumped to 42%, up from 17% last year [2024]” in the first half of 2025.

According to Gartner,

“Over 40% of agentic AI projects will be canceled by the end of 2027.”

The gap between spending and results is clear. To make AI work, companies need to stop running scattered experiments and start building enterprise-wide programs. As McKinsey puts it:

“The organizations that are building a genuine and lasting competitive advantage from their AI efforts are the ones that are thinking in terms of holistic transformative change that stands to alter their business models, cost structures, and revenue streams—rather than proceeding incrementally.”

The AWS Customer Success Center of Excellence (CS COE) helps customers get tangible value from their AWS investments. We’ve seen a pattern: customers who build AI strategies that address people, process, and technology together succeed more often.
In this post, we share practical considerations that can help close the AI value gap.
Implementation considerations
The following sections include practical implementation considerations for aligning leadership, redesigning incentives, building governance frameworks, and measuring outcomes—all grounded in real-world examples from organizations that have successfully closed their AI value gap. These practical insights can help you avoid common pitfalls and accelerate your path from AI investment to measurable business impact.
Figure 1: Six considerations for successful AI transformation and sustained value realization
Business leaders — not just tech leaders — need to drive your AI agenda
AI transformation requires translating vision into specific business outcomes with clear tracking mechanisms—and this demands broad cross-functional leadership from day one.
Roles like Chief Revenue Officers and line-of-business leaders need a seat at the decision-making table alongside technology leaders right from the start. These leaders have typically joined digital or cloud transformations much later in the process, but AI is different. The most impactful AI use cases come from two sources: line-of-business leaders who understand customer pain points and industry opportunities intimately, and employees across business functions who are willing to change their mindsets and fundamentally alter their operating models.
Consider a large global institutional investment organization that embarked on an AI transformation program. They started by defining and creating relevant data and AI technical and business professions. Then, the organization designed and implemented the mechanisms and operating model needed to create data and AI products. Ultimately, they launched a new data and AI organization that helps them create new products, better serve customers, and monetize data assets by addressing new business opportunities. While engineering and product management remained at its core, their entire leadership team treated this as a business development initiative and partnered to make it possible.
Redesign incentives to reward AI-first operations
Transform organizational behavior to reward actual AI adoption, not just theoretical interest. Restructure career pathways to create advancement opportunities tied to effective AI use and measurable business outcomes. Critical to success is defining what outcomes matter. AI can generate voluminous output with little business impact, making measurement of outcomes essential.
One organization introduced standardized definitions for business processes and automation levels. They then redesigned their performance management framework to incorporate automation achievement as a key metric for Product Managers. This approach shifted focus from traditional input metrics toward measurable automation outcomes. It encouraged leaders to prioritize AI-augmented structures and intelligent process redesign over manual operations.
This alignment demonstrates how organizations must clearly define and measure desired outcomes—and tie individual rewards directly to tangible AI-driven business results.
Put people first and have HR lead the change as a strategic partner
HR serves as the cornerstone for aligning culture, talent, and incentives with AI transformation goals. Success requires HR to partner with executives in communicating the rationale for AI initiatives, addressing employee concerns, and fostering organizational buy-in through coaching and thought leadership.
Build AI fluency through tailored learning pathways. Provide focused training with practical tools like pre-populated prompt catalogs and quick-start demonstrations. Strengthen employee engagement through continuous feedback loops, celebrate AI learning participation across teams, and invest in retention strategies that value AI-skilled talent. HR champions adoption by collaborating with business and operations teams to develop role-based “What’s in it for me” content and current versus future process comparisons. For example, HR at a global financial institution took a leadership role to accelerate adoption of a reimagined product operating model. After the institution had invested significantly in a bottom-up transformation, HR designed and led—in partnership with AWS—a top-down approach. They empowered business leaders from lines of business, operations, and technology with extensive executive-level training to help them lead product teams, not just operate them. These leaders worked with technology teams to build mechanisms that helped accelerate adoption of their product operating model. The resulting mechanisms enabled them to create AI solutions focused on industry opportunities and customer needs.
HR support is key to transforming resistance into enthusiasm by embedding AI-first behaviors into the cultural DNA.
Set guardrails that help protect—without slowing down
Establish AI governance frameworks from day one that balance centralization and federation. This facilitates compliance alignment and integration while enabling rapid innovation at the edge. Pure centralization offers simpler governance but slows innovation. Complete federation creates integration challenges and compliance gaps.
For both centralized and federated models, create cross-functional AI governance councils with representation from legal, risk, IT, and business units. Define clear guardrails, approval thresholds, and escalation paths. This approach accelerates AI delivery by creating clear paths to production and reducing bureaucratic friction while maintaining enterprise-wide coherence and risk management.
One financial services customer implemented a three-layered AI governance approach. At the enterprise level, they automated security and compliance policies through policy as code. At the line-of-business level, they created data policies that support AI solutions within the value stream. At the solution level, they addressed individual AI model risks and performance thresholds. This approach facilitated necessary guardrails and policy adherence while allowing builders to focus on value-added AI solution features. It unlocked true innovation at the edge while maintaining compliance alignment with critical policies.
Work with the right partners to move faster on AI
According to Gartner,

“Scaling AI solutions across the enterprise is challenging and requires intentional plans to address AI skills, infrastructure, governance policies and forums to facilitate collaboration, integration, and shared best practices.”

Organizations achieve higher success rates when working with partners who provide AI innovation, cloud expertise, and industry-specific knowledge at the right time. Effective AI transformation partners serve three roles: industry advisors who reimagine existing value streams and workflows to uncover high-value use cases, technical experts who bring leading experience building scalable AI solutions and change champions who manage cultural shifts through training and governance frameworks.
A global insurance company engaged an AI transformation partner for a long-term engagement focused on building durable capabilities. The partner established business case frameworks and assets to prioritize use cases and baseline KPIs. They developed detailed adoption strategies using train-the-trainer methodologies. They implemented measurement systems to continuously track productivity impact. Together, they established governance models for ongoing AI agent creation and enterprise-wide deployment. This “teach to fish” model meant the insurance company could independently sustain and expand their AI transformation beyond the partnership engagement.
Track results that matter—not just what AI costs
Traditional cost prediction models struggle with AI’s continuously changing pricing and capabilities. Success requires anchoring to one or two measurable business outcomes that can be baselined and tracked—such as customer conversations handled entirely by AI agents or revenue uplift per recommendation accepted.
Build adaptive ROI frameworks that can be seamlessly adjusted to changes in token pricing, inference efficiency, and model capabilities rather than fixed cost projections. Focus on outcome-based metrics that demonstrate clear business value as use cases scale. With these metrics, executives can make informed investment decisions despite technological uncertainty. This approach transforms AI economics from unpredictable cost centers into measurable value drivers, providing the financial clarity needed for confident scaling decisions.
A marketing team implemented generative AI for long-form content creation and quality assurance. They analyzed their end-to-end process to determine the distribution of their production capacity and identify the costliest failure point: localization errors. They anchored against measurable baselines of 150+ annual localization errors and 300 monthly QA hours across 150 assets. The solution delivered immediate impact by catching errors earlier, minimizing costly localization rework while accelerating production speed. Return on investment was measured through localization cost savings and top-line value from increased content output, providing a clear path to assess the impact of scaling the solution.
Conclusion
Becoming an AI-first organization requires synchronized transformation across seven critical dimensions: Data and AI Vision and Strategy that establishes a data-driven foundation while embedding AI into core business objectives; Business Process Redesign to optimize human-AI collaboration; Culture & Change Management to drive both top-down and bottom-up adoption and change; Infrastructure and Operations for scalable, self-healing systems; AI Skills and Talent development with continuous learning to build core AI capabilities beyond basic awareness; Security, Governance, and Ethics to facilitate responsible AI deployment; and AI Industrialization for seamless integration and automation.

Figure 2: Seven dimensions of AI-First organizational transformation
These dimensions provide a framework for systematically evaluating and implementing AI transformation. But here’s what matters most: technology alone delivers marginal gains. When orchestrated with organizational change and process redesign, it creates measurable business value. Organizations that succeed see dramatically better results than those that do not—45% more in cost savings and 60% more in revenue growth, according to the Boston Consulting Group (BCG).
The AWS Customer Success Center of Excellence collaborates with AWS partners to define programmatic implementation plans that can help customers embed AI into their operations, product development, business processes, and go-to-market strategies. Because becoming AI-first isn’t about isolated technology initiatives—it requires synchronized evolution across people, process, and technology, with comprehensive change management as the enabler.
For more information about becoming an AI-first company, contact your AWS account team. For more information on delivering agents see the AWS Artificial Intelligence blog.

About the authors
Bhargs Srivathsan leads the Customer Success Center of Excellence for Amazon Web Services (AWS), where she is responsible for defining and executing on the strategic vision for customer success across AWS’ services. In this role, she focuses on ensuring AWS customers and partners realize maximum value from their technology investments, particularly as the pace of innovation accelerates with AI and other emerging technologies. She works closely with the field, specialist GTM leaders, and partners across AWS to build and scale customer success capabilities that drive adoption and business outcomes for customers.
Sergio Klarreich is a Senior Manager of Customer Success at AWS, within the Customer Success Center of Excellence. Sergio leads a team focused on enabling enterprises to realize tangible business outcomes from AI investments. He has hands-on experience leading Fortune 500 companies through successful AI-first transformation journeys and over 20 years of experience driving technology innovation across global markets. He specializes in bridging the gap between AI strategy and measurable business results.
Joseph Badalamenti is a Senior Customer Success AI Specialist at AWS, within the Customer Success Center of Excellence. As a Customer Success Specialist, he partners with enterprise customers to accelerate their AI transformation journeys. Joseph specializes in Generative AI and Agentic AI implementations, helping organizations realize measurable business value through strategic AI adoption. Joseph has 20+ years experience supporting customers with Digital, Cloud, and AI Transformation journeys.

Microsoft AI Releases Fara-7B: An Efficient Agentic Model for Computer …

How do we safely let an AI agent handle real web tasks like booking, searching, and form filling directly on our own devices without sending everything to the cloud? Microsoft Research has released Fara-7B, a 7 billion parameter agentic small language model designed specifically for computer use. It is an open weight Computer Use Agent that runs from screenshots, predicts mouse and keyboard actions, and is small enough to execute on a single user device, which reduces latency and keeps browsing data local.

Fara-7B: An Efficient Agentic Model for Computer Use

From Chatbots to Computer Use Agents

Conventional chat oriented LLMs return text. Computer Use Agents such as Fara-7B instead control the browser or desktop user interface to complete tasks like filling forms, booking travel, or comparing prices. They perceive the screen, reason about the page layout, then emit low level actions such as click, scroll, type, web_search, or visit_url.

Many existing systems rely on large multimodal models wrapped in complex scaffolding that parses accessibility trees and orchestrates multiple tools. This increases latency and often requires server side deployment. Fara-7B compresses the behavior of such multi agent systems into a single multimodal decoder only model built on Qwen2.5-VL-7B. It consumes browser screenshots and text context, then directly outputs thought text followed by a tool call with grounded arguments such as coordinates, text, or URLs.

FaraGen, Synthetic Trajectories for Web Interaction

The key bottleneck for Computer Use Agents is data. High quality logs of human web interaction with multi step actions are rare and expensive to collect. The Fara project introduces FaraGen, a synthetic data engine that generates and filters web trajectories on live sites.

FaraGen uses a three stage pipeline. Task Proposal starts from seed URLs drawn from public corpora such as ClueWeb22 and Tranco, which are categorized into domains like e commerce, travel, entertainment, or forums. Large language models convert each URL into realistic tasks that users might attempt on that page, for example booking specific movie tickets or creating a shopping list with constraints on reviews and materials. Tasks must be achievable without login or paywall, fully specified, useful, and automatically verifiable.

Fara-7B: An Efficient Agentic Model for Computer Use

Task Solving runs a multi agent system based on Magentic-One and Magentic-UI. An Orchestrator agent plans the high level strategy and keeps a ledger of task state. A WebSurfer agent receives accessibility trees and Set-of-Marks screenshots, then emits browser actions through Playwright, such as click, type, scroll, visit_url, or web_search. A UserSimulator agent supplies follow up instructions when the task needs clarification.

Trajectory Verification uses three LLM based verifiers. An Alignment Verifier checks that the actions and final answer match the task intent. A Rubric Verifier generates a rubric of subgoals and scores partial completion. A Multimodal Verifier inspects screenshots plus the final answer to catch hallucinations and confirm that visible evidence supports success. These verifiers agree with human labels on 83.3 percent of cases, with reported false positive and false negative rates around 17 to 18 percent.

After filtering, FaraGen yields 145,603 trajectories with 1,010,797 steps over 70,117 unique domains. The trajectories range from 3 to 84 steps, with an average of 6.9 steps and about 0.5 unique domains per trajectory, which indicates that many tasks involve sites not seen elsewhere in the dataset. Generating data with premium models such as GPT-5 and o3 costs roughly 1 dollar per verified trajectory.

https://www.microsoft.com/en-us/research/wp-content/uploads/2025/11/Fara-7B-An-Efficient-Agentic-Model-for-Computer-Use.pdf

Model Architecture

Fara-7B is a multimodal decoder only model that uses Qwen2.5-VL-7B as the base. It takes as input a user goal, the latest screenshots from the browser, and the full history of previous thoughts and actions. The context window is 128,000 tokens. At each step the model first generates a chain of thought describing the current state and the plan, then outputs a tool call that specifies the next action and its arguments.

The tool space matches the Magentic-UI computer_use interface. It includes key, type, mouse_move, left_click, scroll, visit_url, web_search, history_back, pause_and_memorize_fact, wait, and terminate. Coordinates are predicted directly as pixel positions on the screenshot, which allows the model to operate without access to the accessibility tree at inference time.
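
To make this concrete, the following is a purely illustrative sketch of what a single predicted step could look like, a short thought paired with one grounded tool call; the field names are hypothetical and do not reproduce the exact Magentic-UI computer_use schema:

# Hypothetical shape of one Fara-7B step (illustrative only, not the real schema)
step = {
    "thought": "The search box is near the top of the page; click it, then type the query.",
    "tool_call": {
        "name": "left_click",
        "arguments": {"x": 512, "y": 88},  # pixel coordinates on the screenshot
    },
}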

Training uses supervised finetuning over approximately 1.8 million samples that mix multiple data sources. These include the FaraGen trajectories broken into observe think act steps, grounding and UI localization tasks, screenshot based visual question answering and captioning, and safety and refusal datasets.

https://www.microsoft.com/en-us/research/wp-content/uploads/2025/11/Fara-7B-An-Efficient-Agentic-Model-for-Computer-Use.pdf

Benchmarks and Efficiency

Microsoft evaluates Fara-7B on four live web benchmarks: WebVoyager, Online-Mind2Web, DeepShop, and the new WebTailBench, which focuses on under represented segments such as restaurant reservations, job applications, real estate search, comparison shopping, and multi site compositional tasks.

On these benchmarks, Fara-7B achieves 73.5 percent success on WebVoyager, 34.1 percent on Online-Mind2Web, 26.2 percent on DeepShop, and 38.4 percent on WebTailBench. This outperforms the 7B Computer Use Agent baseline UI-TARS-1.5-7B, which scores 66.4, 31.3, 11.6, and 19.5 respectively, and compares favorably to larger systems like OpenAI computer-use-preview and SoM Agent configurations built on GPT-4o.

On WebVoyager, Fara-7B uses on average 124,000 input tokens and 1,100 output tokens per task, with about 16.5 actions. Using market token prices, the research team estimate an average cost of 0.025 dollars per task, versus around 0.30 dollars for SoM agents backed by proprietary reasoning models such as GPT-5 and o3. Fara-7B uses a similar number of input tokens but about one tenth the output tokens of these SoM agents.

Key Takeaways

Fara-7B is a 7B parameter, open weight Computer Use Agent built on Qwen2.5-VL-7B that operates directly from screenshots and text, then outputs grounded actions such as clicks, typing and navigation, without relying on accessibility trees at inference time.

The model is trained with 145,603 verified browser trajectories and 1,010,797 steps generated by the FaraGen pipeline, which uses multi agent task proposal, solving, and LLM based verification on live websites across 70,117 domains.

Fara-7B achieves 73.5 percent success on WebVoyager, 34.1 percent on Online-Mind2Web, 26.2 percent on DeepShop, and 38.4 percent on WebTailBench, improving substantially over the 7B UI-TARS-1.5 baseline on all four benchmarks.

On WebVoyager, Fara-7B uses about 124,000 input tokens and 1,100 output tokens per task, with an average of 16.5 actions, yielding an estimated cost of around 0.025 dollars per task, which is around an order of magnitude cheaper in output token usage than SoM agents backed by GPT 5 class models.

Editorial Notes

Fara-7B is a useful step toward practical Computer Use Agents that can run on local hardware with lower inference cost while preserving privacy. The combination of Qwen2.5 VL 7B, FaraGen synthetic trajectories and WebTailBench gives a clear and well instrumented path from multi agent data generation to a single compact model that matches or exceeds larger systems on key benchmarks while enforcing Critical Point and refusal safeguards.

Check out the Paper, Model weights and technical details.
The post Microsoft AI Releases Fara-7B: An Efficient Agentic Model for Computer Use appeared first on MarkTechPost.

NVIDIA AI Releases Nemotron-Elastic-12B: A Single AI Model that Gives …

Why are AI dev teams still training and storing multiple large language models for different deployment needs when one elastic model can generate several sizes at the same cost? NVIDIA is collapsing the usual ‘model family’ stack into a single training job. NVIDIA AI team releases Nemotron-Elastic-12B, a 12B parameter reasoning model that embeds nested 9B and 6B variants in the same parameter space, so all three sizes come from one elastic checkpoint with no extra distillation runs per size.

Many in one model family

Most production systems need several model sizes, a larger model for server side workloads, a mid size model for strong edge GPUs, and a smaller model for tight latency or power budgets. The usual pipeline trains or distills each size separately, so token cost and checkpoint storage scale with the number of variants.

Nemotron Elastic takes a different route. It starts from the Nemotron Nano V2 12B reasoning model and trains an elastic hybrid Mamba Attention network that exposes multiple nested submodels. The released Nemotron-Elastic-12B checkpoint can be sliced into 9B and 6B variants, Nemotron-Elastic-9B and Nemotron-Elastic-6B, using a provided slicing script, without any extra optimization.

All variants share weights and routing metadata, so training cost and deployment memory are tied to the largest model, not to the number of sizes in the family.

https://arxiv.org/pdf/2511.16664v1

Hybrid Mamba Transformer with elastic masks

Architecturally, Nemotron Elastic is a Mamba-2 Transformer hybrid. The base network follows the Nemotron-H style design, where most layers are Mamba-2 based sequence state space blocks plus MLP, and a small set of attention layers preserve global receptive field.

Elasticity is implemented by turning this hybrid into a dynamic model controlled by masks:

Width: embedding channels, Mamba heads and head channels, attention heads, and FFN intermediate size can be reduced through binary masks.

Depth: layers can be dropped according to a learned importance ordering, with residual paths preserving signal flow.

A router module outputs discrete configuration choices per budget. These choices are converted to masks with Gumbel Softmax, then applied to embeddings, Mamba projections, attention projections, and FFN matrices. The research team adds several details to keep the SSM structure valid:

Group aware SSM elastification that respects Mamba head and channel grouping.

Heterogeneous MLP elastification where different layers can have distinct intermediate sizes.

Normalized MSE based layer importance to decide which layers stay when depth is reduced.

Smaller variants are always prefix selections in the ranked component lists, which makes the 6B and 9B models true nested subnetworks of the 12B parent.
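
As a toy illustration of the prefix-selection idea (this is not NVIDIA's implementation, just a sketch under simplified assumptions), ranking channels by importance and keeping a prefix of that ranking yields submodels whose weights are strict subsets of the parent's:

import numpy as np

# Toy example: rank FFN channels by importance, then keep a prefix per budget
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))                             # full weight, 16 channels
importance_order = np.argsort(-np.abs(W).sum(axis=0))    # most important first

def sliced_weights(W, order, keep):
    """Keep the `keep` most important channels (a prefix of the ranking)."""
    return W[:, order[:keep]]

W_full = sliced_weights(W, importance_order, 16)   # largest budget
W_mid = sliced_weights(W, importance_order, 12)    # mid budget, nested in W_full
W_small = sliced_weights(W, importance_order, 8)   # small budget, nested in both
print(W_full.shape, W_mid.shape, W_small.shape)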

https://arxiv.org/pdf/2511.16664v1

Two stage training for reasoning workloads

Nemotron Elastic is trained as a reasoning model with a frozen teacher. The teacher is the original Nemotron-Nano-V2-12B reasoning model. The elastic-12B student is optimized jointly for all three budgets, 6B, 9B, 12B, using knowledge distillation plus language modeling loss.

Training runs in two stages:

Stage 1: short context, sequence length 8192, batch size 1536, around 65B tokens, with uniform sampling over the three budgets.

Stage 2: extended context, sequence length 49152, batch size 512, around 45B tokens, with non uniform sampling that favors the full 12B budget.

https://arxiv.org/pdf/2511.16664v1

The second stage is important for reasoning tasks. The above table shows that for AIME 2025, the 6B model improves from 56.88 to 68.13, a 19.8 percent relative gain, while the 9B model gains 9.7 percent and the 12B model gains 4.0 percent after extended context training.

Budget sampling is also tuned. In Stage 2, non uniform weights of 0.5, 0.3, 0.2 for 12B, 9B, 6B avoid degradation of the largest model and keep all variants competitive on Math 500, AIME 2025, and GPQA.

Benchmark results

Nemotron Elastic is evaluated on reasoning heavy benchmarks, MATH 500, AIME 2024, AIME 2025, GPQA, LiveCodeBench v5, and MMLU Pro. The below table summarizes pass at 1 accuracy.

https://arxiv.org/pdf/2511.16664v1

The 12B elastic model matches the NanoV2-12B baseline on average, 77.41 versus 77.38, while also providing 9B and 6B variants from the same run. The 9B elastic model tracks the NanoV2-9B baseline closely, 75.95 versus 75.99. The 6B elastic model reaches 70.61, slightly below Qwen3-8B at 72.68 but still strong for its parameter count given that it is not trained separately.

Training token and memory savings

Nemotron Elastic targets the cost problem directly. The below table compares the token budgets needed to derive 6B and 9B models from a 12B parent:

NanoV2 pretraining for 6B and 9B, 40T tokens total.

NanoV2 Compression with Minitron SSM, 480B exploratory plus 270B final, 750B tokens.

Nemotron Elastic, 110B tokens in a single elastic distillation run.

https://arxiv.org/pdf/2511.16664v1

The research team reports that this gives around 360 times reduction versus training the two extra models from scratch, and around 7 times reduction versus the compression baseline.

Deployment memory is reduced as well. The below table states that storing Nemotron Elastic 6B, 9B, and 12B together requires 24GB of BF16 weights, while storing NanoV2 9B plus 12B requires 42GB. This is a 43 percent memory reduction while also exposing an extra 6B size.

https://arxiv.org/pdf/2511.16664v1

Comparison

System | Sizes (B) | Avg reasoning score* | Tokens for 6B + 9B | BF16 memory
Nemotron Elastic | 6, 9, 12 | 70.61 / 75.95 / 77.41 | 110B | 24GB
NanoV2 Compression | 9, 12 | 75.99 / 77.38 | 750B | 42GB
Qwen3 | 8 | 72.68 | n/a | n/a

Key Takeaways

Nemotron Elastic trains one 12B reasoning model that contains nested 9B and 6B variants which can be extracted zero shot without extra training.

The elastic family uses a hybrid Mamba-2 and Transformer architecture plus a learned router that applies structured masks over width and depth to define each submodel.

The approach needs 110B training tokens to derive 6B and 9B from the 12B parent which is about 7 times fewer tokens than the 750B token Minitron SSM compression baseline and about 360 times fewer than training extra models from scratch.

On reasoning benchmarks such as MATH 500, AIME 2024 and 2025, GPQA, LiveCodeBench and MMLU Pro the 6B, 9B and 12B elastic models reach average scores of about 70.61, 75.95 and 77.41 which are on par with or close to the NanoV2 baselines and competitive with Qwen3-8B.

All three sizes share one 24GB BF16 checkpoint so deployment memory stays constant for the family compared with around 42GB for separate NanoV2-9B and 12B models which gives about 43 percent memory savings while adding a 6B option.

Editorial Comments

Nemotron-Elastic-12B is a practical step toward making reasoning model families cheaper to build and operate. One elastic checkpoint produces 6B, 9B, and 12B variants with a hybrid Mamba-2 and Transformer architecture, a learned router, and structured masks that preserve reasoning performance. The approach cuts token cost relative to separate compression or pretraining runs and keeps deployment memory at 24GB for all sizes, which simplifies fleet management for multi tier LLM deployments. Overall, Nemotron-Elastic-12B turns multi size reasoning LLMs into a single elastic systems design problem.

Check out the Paper and Model weights.

The post NVIDIA AI Releases Nemotron-Elastic-12B: A Single AI Model that Gives You 6B/9B/12B Variants without Extra Training Cost appeared first on MarkTechPost.

AI Interview Series #3: Explain Federated Learning

Question:

“You’re an ML engineer at a fitness company like Fitbit or Apple Health.

Millions of users generate sensitive sensor data every day — heart rate, sleep cycles, step counts, workout patterns, etc.

You want to build a model that predicts health risk or recommends personalized workouts.

But due to privacy laws (GDPR, HIPAA), none of this raw data can ever leave the user’s device.

How would you train such a model?“

Training a model in this scenario seems impossible at first—after all, you can’t collect or centralize any of the user’s sensor data. But the trick is this: instead of bringing the data to the model, you bring the model to the data.

Using techniques like federated learning, the model is sent to each user’s device, trained locally on their private data, and only the model updates (not the raw data) are sent back. These updates are then securely aggregated to improve the global model while keeping every user’s data fully private.

This approach allows you to leverage massive, real-world datasets without ever violating privacy laws.

What is Federated Learning

Federated Learning is a technique for training machine learning models without ever collecting user data centrally. Instead of uploading private data (like heart rate, sleep cycles, or workout logs), the model is sent to each device, trained locally, and only the model updates are returned. These updates are securely aggregated to improve the global model—ensuring privacy and compliance with laws like GDPR and HIPAA.

There are multiple variants:

Centralized FL: A central server coordinates training and aggregates updates.

Decentralized FL: Devices share updates with each other directly—no single point of failure.

Heterogeneous FL: Designed for devices with different compute capabilities (phones, watches, IoT sensors).

The workflow is simple (a minimal code sketch follows the list):

A global model is sent to user devices.

Each device trains on its private data (e.g., a user’s fitness and health metrics).

Only the model updates—not the data—are encrypted and sent back.

The server aggregates all updates into a new global model.
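
The following is a minimal, framework-free sketch of this loop in the style of federated averaging (FedAvg); the linear model, synthetic data, and device simulation are illustrative assumptions, not production code:

import numpy as np

def local_update(global_w, X, y, lr=0.01, epochs=5):
    """Train locally on one device's private data; return updated weights."""
    w = global_w.copy()
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)   # gradient of the MSE loss
        w -= lr * grad
    return w

def fedavg_round(global_w, devices):
    """One federated round: devices train locally, server averages the updates."""
    updates, sizes = [], []
    for X, y in devices:                     # (X, y) never leaves the device
        updates.append(local_update(global_w, X, y))
        sizes.append(len(y))
    return np.average(np.stack(updates), axis=0, weights=np.array(sizes, dtype=float))

# Simulated devices holding non-IID private data
rng = np.random.default_rng(0)
devices = [(rng.normal(size=(50, 3)), rng.normal(size=50)) for _ in range(4)]
w = np.zeros(3)
for _ in range(10):
    w = fedavg_round(w, devices)
print("Global weights after 10 rounds:", w)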

Challenges in Federated Learning

Device Constraints: User devices (phones, smartwatches, fitness trackers) have limited CPU/GPU power, small RAM, and rely on battery. Training must be lightweight, energy-efficient, and scheduled intelligently so it doesn’t interfere with normal device usage.

Model Aggregation: Even after training locally on thousands or millions of devices, we still need to combine all these model updates into a single global model. Techniques like Federated Averaging (FedAvg) help, but updates can be delayed, incomplete, or inconsistent depending on device participation.

Skewed Local Data (Non-IID Data):

Each user’s fitness data reflects personal habits and lifestyle:

Some users run daily; others never run.

Some have high resting heart rates; others have low.

Sleep cycles vary drastically by age, culture, work pattern.

Workout types differ—yoga, strength training, cycling, HIIT, etc.

This leads to non-uniform, biased local datasets, making it harder for the global model to learn generalized patterns.

Intermittent Client Availability: Many devices may be offline, locked, low on battery, or not connected to Wi-Fi. Training must only happen under safe conditions (charging, idle, Wi-Fi), reducing the number of active participants at any moment.

Communication Efficiency: Sending model updates frequently can drain bandwidth and battery. Updates must be compressed, sparse, or limited to smaller subsets of parameters.

Security & Privacy Guarantees: Even though raw data never leaves the device, updates must be encrypted. Additional protections like differential privacy or secure aggregation may be required to prevent reconstructing sensitive patterns from gradients.
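
As a hedged illustration of that last point, a common pattern (simplified here, and not tied to any particular product) is to clip each device's update and add calibrated noise before it leaves the device:

import numpy as np

def privatize_update(update, clip_norm=1.0, noise_std=0.1, rng=None):
    """Clip an update to a maximum L2 norm and add Gaussian noise.

    A simplified, illustrative version of the clipping-plus-noise recipe used in
    differentially private federated learning; real deployments calibrate
    noise_std from a privacy budget (epsilon, delta).
    """
    rng = rng or np.random.default_rng()
    norm = np.linalg.norm(update)
    clipped = update * min(1.0, clip_norm / (norm + 1e-12))
    return clipped + rng.normal(scale=noise_std, size=update.shape)

# Hypothetical device update
print(privatize_update(np.array([0.8, -1.5, 2.2])))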

AI Interview Series #2: Explain Some of the Common Model Context Protocol (MCP) Security Vulnerabilities

The post AI Interview Series #3: Explain Federated Learning appeared first on MarkTechPost.

Accelerate generative AI innovation in Canada with Amazon Bedrock cros …

Generative AI has created unprecedented opportunities for Canadian organizations to transform their operations and customer experiences. We are excited to announce that customers in Canada can now access advanced foundation models including Anthropic’s Claude Sonnet 4.5 and Claude Haiku 4.5 on Amazon Bedrock through cross-Region inference (CRIS).
This post explores how Canadian organizations can use cross-Region inference profiles from the Canada (Central) Region to access the latest foundation models to accelerate AI initiatives. We will demonstrate how to get started with these new capabilities, provide guidance for migrating from older models, and share recommended practices for quota management.
Canadian cross-Region inference: Your gateway to global AI innovation
To help customers scale their generative AI applications, Amazon Bedrock offers cross-Region inference (CRIS) profiles, a powerful feature that enables organizations to seamlessly distribute inference processing across multiple AWS Regions. This capability provides higher throughput when building at scale, helping to ensure your generative AI applications remain responsive and reliable even under heavy load.
Amazon Bedrock provides two types of cross-Region Inference profiles:

Geographic CRIS: Amazon Bedrock automatically selects the optimal commercial Region within the chosen geography to process your inference request.
Global CRIS: Extends cross-Region inference by routing inference requests to supported commercial Regions worldwide, optimizing available resources and enabling higher model throughput.

Cross-Region Inference operates through the secure AWS network with end-to-end encryption for both data in transit and at rest. When a customer submits an inference request from the Canada (Central) Region, CRIS intelligently routes the request to one of the destination regions configured for the inference profile (US or Global profiles).
The key distinction is that while inference processing (the transient computation) may occur in another Region, all data at rest—including logs, knowledge bases, and any stored configurations—remains exclusively within the Canada (Central) Region. The inference request travels over the AWS Global Network, never traversing the public internet, and responses are returned encrypted to your application in Canada.

Cross-Region inference configuration for Canada
With CRIS, Canadian organizations gain earlier access to foundation models, including cutting-edge models like Claude Sonnet 4.5 with enhanced reasoning capabilities, providing a faster path to innovation. CRIS also delivers enhanced capacity and performance by providing access to capacity across multiple Regions. This enables higher throughput during peak periods such as tax season, Black Friday, and holiday shopping, automatic burst handling without manual intervention, and greater resiliency by serving requests from a larger pool of resources.
Canadian customers can choose between two inference profile types based on their requirements:

CRIS profile | Source Region | Destination Regions | Description
US cross-Region inference | ca-central-1 | Multiple US Regions | Requests from Canada (Central) can be routed to supported US Regions with capacity.
Global inference | ca-central-1 | Global AWS Regions | Requests from Canada (Central) can be routed to a Region in the AWS global CRIS profile.

Getting started with CRIS from Canada
To begin using cross-Region Inference from Canada, follow these steps:
Configure AWS Identity and Access Management (IAM) permissions
First, verify your IAM role or user has the necessary permissions to invoke Amazon Bedrock models using cross-Region inference profiles.
Here’s an example of a policy for US cross-Region inference:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "bedrock:InvokeModel*"
            ],
            "Resource": [
                "arn:aws:bedrock:ca-central-1::inference-profile/us.anthropic.claude-sonnet-4-5-20250929-v1:0"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "bedrock:InvokeModel*"
            ],
            "Resource": [
                "arn:aws:bedrock:*::foundation-model/anthropic.claude-sonnet-4-5-20250929-v1:0"
            ],
            "Condition": {
                "StringLike": {
                    "bedrock:InferenceProfileArn": "arn:aws:bedrock:ca-central-1::inference-profile/us.anthropic.claude-sonnet-4-5-20250929-v1:0"
                }
            }
        }
    ]
}

For global CRIS refer to the blog post, Unlock global AI inference scalability using new global cross-Region inference on Amazon Bedrock with Anthropic’s Claude Sonnet 4.5.
Use cross-Region inference profiles
Configure your application to use the relevant inference profile ID. The profiles use prefixes to indicate their routing scope:

Model | Routing scope | Inference profile ID
Claude Sonnet 4.5 | US Regions | us.anthropic.claude-sonnet-4-5-20250929-v1:0
Claude Sonnet 4.5 | Global | global.anthropic.claude-sonnet-4-5-20250929-v1:0
Claude Haiku 4.5 | US Regions | us.anthropic.claude-haiku-4-5-20251001-v1:0
Claude Haiku 4.5 | Global | global.anthropic.claude-haiku-4-5-20251001-v1:0

Example code
Here’s how to use the Amazon Bedrock Converse API with a US CRIS inference profile from Canada:

import boto3

# Initialize Bedrock Runtime client
bedrock_runtime = boto3.client(
    service_name="bedrock-runtime",
    region_name="ca-central-1"  # Canada (Central) Region
)

# Define the inference profile ID
inference_profile_id = "us.anthropic.claude-sonnet-4-5-20250929-v1:0"

# Prepare the conversation
response = bedrock_runtime.converse(
    modelId=inference_profile_id,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "text": "What are the benefits of using Amazon Bedrock for Canadian organizations?"
                }
            ]
        }
    ],
    inferenceConfig={
        "maxTokens": 512,
        "temperature": 0.7
    }
)

# Print the response
print(f"Response: {response['output']['message']['content'][0]['text']}")

Quota management for Canadian workloads
When using CRIS from Canada, quota management is performed at the source Region level (ca-central-1). This means quota increases requested for the Canada (Central) Region apply to all inference requests originating from Canada, regardless of where they’re processed.
Understanding quota calculations
Important: When calculating your required quota increases, you need to take into account the burndown rate, defined as the rate at which input and output tokens are converted into token quota usage for the throttling system. The following models have a 5x burn down rate for output tokens (1 output token consumes 5 tokens from your quotas):

Anthropic Claude Opus 4
Anthropic Claude Sonnet 4.5
Anthropic Claude Sonnet 4
Anthropic Claude 3.7 Sonnet

For other models, the burndown rate is 1:1 (1 output token consumes 1 token from your quota). For input tokens, the token to quota ratio is 1:1. The calculation for the total number of tokens per request is as follows:
Input token count + Cache write input tokens + (Output token count x Burndown rate)
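As an illustrative example with hypothetical numbers: a Claude Sonnet 4.5 request with 1,000 input tokens, no cache writes, and 500 output tokens consumes 1,000 + 0 + (500 x 5) = 3,500 tokens against the tokens-per-minute quota. A small helper to estimate this might look like the following:

def quota_tokens(input_tokens, cache_write_tokens, output_tokens, burndown_rate):
    """Estimate quota consumption for a single request.

    Use burndown_rate=5 for the Claude models listed above and 1 for other models.
    """
    return input_tokens + cache_write_tokens + output_tokens * burndown_rate

# Hypothetical Claude Sonnet 4.5 request: 1,000 input tokens, 500 output tokens
print(quota_tokens(1000, 0, 500, burndown_rate=5))  # 3500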
Requesting quota increases
To request quota increases for CRIS in Canada:

Navigate to the AWS Service Quotas console in the Canada (Central) Region
Search for the specific model quota (for example, “Claude Sonnet 4.5 tokens per minute”)
Submit an increase request based on your projected usage

Migrating from older Claude models to Claude 4.5
Organizations currently using older Claude models should plan their migration to Claude 4.5 to leverage the latest model capabilities.
To plan your migration strategy, incorporate the following elements:

Benchmark current performance: Establish baseline metrics for your existing models.
Test with representative workloads and optimize prompts: Validate Claude 4.5 performance with your specific use cases, adjust prompts to leverage Claude 4.5’s enhanced capabilities, and make use of the Amazon Bedrock prompt optimization tool.
Implement gradual rollout: Transition traffic progressively.
Monitor and adjust: Track performance metrics and adjust quotas as needed.

Choosing between US and Global inference profiles
When implementing CRIS from Canada, organizations can choose between US and Global inference profiles based on their specific requirements.
US cross-Region inference is recommended for organizations with existing US data processing agreements, high throughput and resilience requirements, and development and testing environments. Global inference provides access to the largest pool of capacity worldwide.
Conclusion
Cross-Region inference for Amazon Bedrock represents an opportunity for Canadian organizations that want to use AI while maintaining data governance. By distinguishing between transient inference processing and persistent data storage, CRIS provides faster access to the latest foundation models without compromising compliance requirements.
With CRIS, Canadian organizations get access to new models within days instead of months. The system scales automatically during peak business periods while maintaining complete audit trails within Canada. This helps you meet compliance requirements and use the same advanced AI capabilities as organizations worldwide. To get started, review your data governance requirements and configure IAM permissions. Then test with the inference profile that matches your needs—US for lower latency to US Regions, or Global for maximum capacity.

About the authors
Daniel Duplessis is a Principal Generative AI Specialist Solutions Architect at Amazon Web Services (AWS), where he guides enterprises in crafting comprehensive AI implementation strategies and establishing the foundational capabilities essential for scaling AI across the enterprise.
Dan MacKay is the Financial Services Compliance Specialist for AWS Canada. He advises customers on recommended practices and practical solutions for cloud-related governance, risk, and compliance. Dan specializes in helping AWS customers navigate financial services and privacy regulations applicable to the use of cloud technology in Canada with a focus on third-party risk management and operational resilience.
Melanie Li, PhD, is a Senior Generative AI Specialist Solutions Architect at AWS based in Sydney, Australia, where her focus is on working with customers to build solutions using state-of-the-art AI/ML tools. She has been actively involved in multiple generative AI initiatives across APJ, harnessing the power of LLMs. Prior to joining AWS, Dr. Li held data science roles in the financial and retail industries.
Serge Malikov is a Senior Solutions Architect Manager based out of Canada. His focus is on the financial services industry.
Saurabh Trikande is a Senior Product Manager for Amazon Bedrock and Amazon SageMaker Inference. He is passionate about working with customers and partners, motivated by the goal of democratizing AI. He focuses on core challenges related to deploying complex AI applications, inference with multi-tenant models, cost optimizations, and making the deployment of generative AI models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch, and spending time with his family.
Sharadha Kandasubramanian is a Senior Technical Program Manager for Amazon Bedrock. She drives cross-functional GenAI programs for Amazon Bedrock, enabling customers to grow and scale their GenAI workloads. Outside of work, she’s an avid runner and biker who loves spending time outdoors in the sun.

Power up your ML workflows with interactive IDEs on SageMaker HyperPod

Amazon SageMaker HyperPod clusters with Amazon Elastic Kubernetes Service (EKS) orchestration now support creating and managing interactive development environments such as JupyterLab and open source Visual Studio Code, streamlining the ML development lifecycle by giving data scientists managed environments for familiar tools. This feature introduces a new add-on called Amazon SageMaker Spaces that AI developers can use to create and manage self-contained environments for running notebooks. Organizations can now maximize their GPU investments by running both interactive workloads and training jobs on the same infrastructure, with support for fractional GPU allocations to improve cost efficiency. This feature reduces the complexity of managing multiple development environments so teams can focus on building and deploying their AI and ML models.
This post shows how HyperPod administrators can configure Spaces for their clusters, and how data scientists can create and connect to these Spaces. You’ll also learn how to connect directly from your local VS Code environment to Spaces created in HyperPod.
Solution overview
The following diagram showcases the different components involved in creating and managing Spaces on HyperPod clusters.

Here’s how the feature works:

Cluster administrator installs the Spaces add-on from the SageMaker AI console. The administrator can either use a Quick install or a Custom install option to install the add-on.
Once the cluster is set up, data scientists and AI developers can create Spaces using HyperPod Command Line Interface, or kubectl.
Once the Space is created, the user can connect to a running Space through one of the following two options:

Access Space Web UI: This requires setting up an AWS Application Load Balancer (ALB) and setting up or registering your own custom Domain Name System (DNS) in Amazon Route 53. Once the custom domain is set up, the user will be able to connect to the JupyterLab or Code Editor space securely using a presigned URL through their web browser.
Remote IDE connection (connect to the Space remotely from local Visual Studio Code): SSH-over-SSM tunneling is used under the hood to securely connect remote IDEs to SageMaker Spaces pods without requiring customers to manage SSH keys or exposing port 22.

Prerequisites
To follow along, you need the following prerequisites:

An AWS account with permissions to create IAM roles, SageMaker resources such as HyperPod, and access to EKS cluster resources. If you are creating a new SageMaker HyperPod cluster, you will also need permissions to create networking and storage resources, see IAM permissions for cluster creation.
A SageMaker HyperPod cluster orchestrated using EKS, running Kubernetes version 1.30 or later. If you do not have one, you can create one by following the instructions in Creating a SageMaker HyperPod cluster with Amazon EKS orchestration. This workflow creates a HyperPod cluster, an EKS cluster, and associated resources such as an Amazon Virtual Private Cloud (VPC) and an Amazon FSx for Lustre volume for storage.
HyperPod CLI installed (or kubectl).
A local IDE such as VS Code, with the AWS Toolkit for VS Code installed, to connect to the Spaces.

Step 1: Install the Spaces add-on
To get started, first install the Spaces add-on to your SageMaker cluster. This add-on allows users to run JupyterLab and Code Editor applications directly on cluster compute. The Quick install option is the fastest way to get started. With a single click, SageMaker AI automatically creates and configures the required AWS resources with optimized defaults. Here’s how to install it:

In the SageMaker AI console, choose Clusters on the left pane and navigate to your HyperPod cluster
Choose the IDE and Notebooks tab
Choose Quick install

Review the dependencies that will be automatically installed and choose Install.

The Quick install will create the associated dependencies for your Spaces add-on with default settings. They are listed below:

IAM roles for SageMaker Spaces:

Controller pod role for AWS API calls and AWS Systems Manager Session Manager (SSM) operations.
In-cluster router role for AWS Key Management Service (KMS) operations and JWT signing.
SSM managed instance role for remote access to Spaces. A list of the IAM roles and the required permissions are available in Set up permissions.

Remote access components:

Enables SSH connectivity to Spaces, including SSM activation and session documents. This activates the Systems Manager Advanced tier, which includes additional per-instance charges.

Dependent EKS add-ons:

Cert-manager for certificate management.
Amazon Elastic Block Store (EBS) CSI driver for persistent storage volumes.
AWS Load Balancer Controller to manage AWS Elastic Load Balancers.

SageMaker Spaces add-on:

Deploys the Spaces controller and in-cluster router for managing Space lifecycle operations.

The Quick install option does not install web UI configurations such as Route 53 DNS records and SSL certificates for accessing Spaces through the web browser. Administrators can either use the Custom install option or configure these properties after installation of the add-on. For instructions on configuring web browser access, see Operator installing – helm/Console.
The installation typically takes 2-5 minutes, depending on whether pre-existing dependencies are available or the Spaces add-on needs to provision completely new resources. After installation completes, administrators can perform the following actions:

View the Spaces created by data scientists in the Spaces table
Configure namespaces to organize Spaces by team or project
Create Space templates with pre-configured settings for common use cases
Edit the configuration as needed to enable or disable Spaces features or change your configuration settings

For production use cases, we recommend using the Custom install option, where admins can set up fine-grained IAM policies that apply the principle of least privilege. For the full set of configurations that can be set up using the Custom install option, including namespaces and default templates, see Installation.
Step 2: Create or update EKS access entries
To let your users create and manage Spaces, grant them access through EKS access entries. The following two access entry policies are required:

AmazonSagemakerHyperpodSpacePolicy
AmazonSagemakerHyperpodSpaceTemplatePolicy

For instructions on creating and editing access entries, see Create access entries and Update access entries.
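As an illustration, the following boto3 sketch shows how an administrator might create an access entry for a data scientist's IAM role and associate a Space policy with it. The access-policy ARN and namespace scope shown here are assumptions based on the policy names above; confirm the exact ARNs and scopes in the linked documentation before using this.

# Illustrative sketch only: grant a data scientist's role access to create Spaces.
# The access-policy ARN below is an assumed name based on the policies listed above;
# verify the exact ARN in the EKS access entry documentation.
import boto3

eks = boto3.client("eks", region_name="us-east-1")
cluster = "<your-eks-cluster-name>"
principal = "arn:aws:iam::<account-id>:role/<data-scientist-role>"

# Register the principal with the EKS cluster
eks.create_access_entry(clusterName=cluster, principalArn=principal)

# Attach the Space policy, scoped to the team's namespace
eks.associate_access_policy(
    clusterName=cluster,
    principalArn=principal,
    policyArn="arn:aws:eks::aws:cluster-access-policy/AmazonSagemakerHyperpodSpacePolicy",
    accessScope={"type": "namespace", "namespaces": ["<team-namespace>"]},
)
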
Step 3: Create and manage Spaces
Data scientists can create JupyterLab and Code Editor Spaces on the cluster using kubectl or the HyperPod CLI. For detailed instructions on creating and managing Spaces, see Hyperpod CLI.
To create a Space, run the following commands:

# set cluster context using hyp CLI
hyp set-cluster-context --cluster-name <your-hyperpod-cluster-name>

# create a space
hyp create hyp-space \
    --name "data-science-space" \
    --display-name "Data Science Workspace" \
    --namespace "default"

The hyp create hyp-space command will create a Space with the default settings. To create a Code Editor space, use the command below:

hyp create hyp-space \
    --name code-editor-demo \
    --display-name "code-editor space" \
    --memory 8Gi \
    --template-ref name=sagemaker-code-editor-template,namespace=jupyter-k8s-system

You can also modify settings when creating the Space; see the following example:

hyp create hyp-space \
    --name test-space \
    --display-name "test space" \
    --memory 8Gi \
    --volume name=vol,mountPath=/home/,persistentVolumeClaimName=pvcname

Once the Space is created, you can access the Space from either the web UI, or from your local VS Code. To open the Space in VS Code, run:

hyp create hyp-space-access \
    --name data-science-space \
    --connection-type vscode-remote

If you have set up the custom domain following our documentation, you can get the Space access URL as shown below. This will open your Space in your browser.

hyp create hyp-space-access \
    --name data-science-space \
    --connection-type web-ui

Alternatively, you can connect to the Space from your local VS Code using the AWS toolkit. From your VS Code IDE, open the AWS toolkit panel. From the toolkit, under SageMaker AI, choose HyperPod. Here, you can list, start, stop, and connect to Spaces.

The Spaces need to be created using the HyperPod CLI or kubectl.
The HyperPod CLI supports additional CRUD operations on Spaces, such as updating, describing, and deleting Spaces. For a list of the operations, see HyperPod CLI on GitHub.
Practitioners familiar with kubectl can also create, update, and delete Spaces with it. For example, you can create a Space using kubectl as shown below:

kubectl apply -f - <<EOF
apiVersion: workspace.jupyter.org/v1alpha1
kind: Workspace
metadata:
  name: training-workspace-1
  namespace: hyperpod-training-team
  labels:
    kueue.x-k8s.io/queue-name: hyperpod-ns-training-team-localqueue
    kueue.x-k8s.io/priority-class: ide-priority
spec:
  displayName: "Training Team Workspace 1"
  image: jupyter/minimal-notebook:latest
  desiredStatus: Running
  resources:
    requests:
      cpu: 3
      memory: 12Gi
    limits:
      cpu: 3
      memory: 12Gi
EOF

Best practices
We recommend the following best practices when using SageMaker Spaces.
User management, RBAC, and collaboration
SageMaker Spaces identifies users through Amazon EKS Access Entries, which are derived from your IAM identity when you interact with a Space using either the HyperPod CLI or kubectl. The captured EKS identity may appear as an IAM user or as an assumed-role session ARN. For assumed roles, the session name can represent the actual user when an administrator applies an IAM policy that enforces assumed-role session names reflecting individual identities. If session names are not enforced or do not uniquely map to users, SageMaker Spaces access control falls back to role-based access control, and all users sharing the same role are treated as the same identity. For more details, see Add users and set up service accounts.
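One common way to enforce individual session names is a condition on the shared role's trust policy, as sketched below. This example is illustrative only and assumes IAM users assuming a shared role; federated identity setups would use a different policy variable, so adapt it to your identity configuration and the linked documentation.

# Illustrative sketch: require assumed-role session names to match the caller's IAM user name,
# so access entries can map sessions back to individuals. Role name and account ID are placeholders.
import boto3
import json

trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": "arn:aws:iam::<account-id>:root"},
        "Action": "sts:AssumeRole",
        # Only allow sessions whose name matches the caller's IAM user name
        "Condition": {"StringLike": {"sts:RoleSessionName": "${aws:username}"}}
    }]
}

iam = boto3.client("iam")
iam.update_assume_role_policy(
    RoleName="<data-scientist-role>",
    PolicyDocument=json.dumps(trust_policy),
)
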
Spaces can either be private, accessible only by the user who created the Space, or public, accessible by any user who has access to the hosting Kubernetes namespace. Spaces are public by default; the creator and the administrator group retain full control, including the ability to update or delete the Space. A Space becomes private only when access is restricted to the creator and the admin group. This model gives teams a flexible foundation: public Spaces support open collaboration within a shared environment, while private Spaces provide isolation.
Multiple users can collaborate on the same Space if it is configured to be shared. When enabled with SageMaker Distribution images for JupyterLab environments, we also support real-time collaboration (RTC), which lets multiple users work together on interactive ML experiments and workloads.
Admin defaults and controls
Templates set up by admins help data scientists quickly use pre-configured Space settings for their use case. SageMaker provides two pre-created system templates, one for JupyterLab and one for Code Editor, so that data scientists can get started without additional configuration. Admins can also set up custom templates with configurations such as image, storage, and compute. Templates can be used by data scientists in the cluster and are flexible depending on the needs of admins, who can create multiple templates based on specific use cases, projects, or dependency requirements.
Customizing Spaces
Administrators and developers can customize their Spaces using custom images and lifecycle scripts. Use lifecycle scripts for minimal customization such as installing additional packages, setting up default variables, or running clean up tasks, while still using the SageMaker Distribution image capabilities. For organizations that have a standardized image for development and training, SageMaker Spaces also supports custom images and entry points for users. For custom image specifications, see Customization.
Shutdown idle compute
Spaces support automatic shutdown of idle workspaces by default to optimize resource usage. When idle shutdown is enabled, the system periodically checks the Space for activity; if the workspace is idle for the specified timeout duration, it automatically stops, freeing up compute resources for other tasks. Administrators can set default timeouts and can optionally disallow overrides of those defaults to enforce idle shutdown.
Integration with other HyperPod add-ons
For guardrails against excess resource usage, set up HyperPod task governance, which provides comprehensive resource management controls. To help prevent workspaces from being evicted due to changes in unrelated workloads, configure task governance to set interactive ML workloads as the highest priority or schedule them in task governance namespaces with eviction turned off.
Set up the HyperPod observability plugin to monitor the resource usage of Spaces running within the cluster. With a one-click install, the observability plugin provides insight into how many resources Spaces are using over time, allowing admins to observe and tune their compute allocations.
Fractional GPU support
SageMaker Spaces supports fractional GPU configurations, specifically the Multi-Instance GPU (MIG) technology provided by NVIDIA GPUs. Fractional GPU support with MIG means users can share GPU instances, optimizing compute usage while still providing isolation between workloads, so experiments running on a fractional GPU profile are unlikely to interfere with other workloads running on the same GPU.
To check if an instance in your cluster supports fractional GPU, run the command:

hyp list-accelerator-partition-type --instance-type <instance type>

If your cluster contains instance groups that support fractional GPU, you can create a space with fractional GPU as shown below:

hyp create hyp-space \
    --name test-space \
    --display-name "mig-testing" \
    --accelerator-partition-type mig-3g.20gb \
    --accelerator-partition-count 1 \
    --memory 8Gi \
    --template-ref sagemaker-code-editor-template

Clean up
To avoid incurring unnecessary charges, clean up the resources you created in this walkthrough.

Delete all spaces you created. Run this command for each space you created:

hyp delete hyp-space \
    --name <space-name>

Remove the SageMaker HyperPod Spaces add-on: From the cluster details page, navigate to the IDE and Notebooks tab, and choose Remove.
If you created a HyperPod cluster for the purposes of this blog, delete the cluster to avoid being charged for unused compute. To delete the cluster, follow the instructions in Deleting a SageMaker HyperPod cluster. Additionally, if you used the console to create the cluster, go to the AWS CloudFormation console and delete the parent stack to remove the additional storage and networking resources created for the cluster. The parent stack will be in the format sagemaker-<your-hyperpod-cluster-name>-<unique-id>.

Conclusion
Spaces in SageMaker HyperPod boost data scientist and AI developer productivity by providing more secure, managed development environments on purpose-built compute. We walked through the setup steps for administrators and data scientists, showing how teams can quickly create and connect to Spaces. With this feature, teams can reduce time spent on environment setup and focus on model development, while also maintaining consistent development environments. By integrating with HyperPod task governance features, administrators can optimize for cost and equitable compute allocations.

About the authors
Durga Sury is a Senior Solutions Architect at Amazon SageMaker, helping enterprise customers build secure and scalable AI/ML systems. When she’s not architecting solutions, you can find her enjoying sunny walks with her dog, immersing herself in murder mystery books, or catching up on her favorite Netflix shows.
Edward Sun is a Senior SDE working for SageMaker Studio at Amazon Web Services. He is focused on building interactive ML solutions and simplifying the customer experience to integrate SageMaker Studio with popular technologies in the data engineering and ML landscape. In his spare time, Edward is a big fan of camping, hiking, and fishing, and enjoys spending time with his family.
Josh Dunne is a Senior UX Designer at SageMaker AI at Amazon Web Services. He has 7+ years of experience across UX and product management, with a focus on ML/AI and cloud computing, creating practical, straightforward workflows for machine learning builders across SageMaker AI, including HyperPod, SageMaker Studio, SageMaker Unified Studio, and interactive IDEs. Outside of work, he enjoys exploring the Pacific Northwest, traveling with his wife and their dog, and trying new restaurants.
Joshua Towner is a Senior SDE working for SageMaker AI at Amazon Web Services, where he is currently working on building and improving interactive ML solutions for SageMaker Studio and HyperPod. Outside of work, he enjoys traveling, skiing, and watching movies.
Khushboo Srivastava is a Product Manager for Amazon SageMaker, AWS. She enjoys building products that simplify machine learning workflows for users. With 7+ years in software engineering and data science and 7+ years in product management, Khushboo has launched several products and services that have helped accelerate the speed of AI/ML development for customers. With her background in generative AI and distributed computing, and her passion for democratizing AI, she is committed to sharing insights and empowering others in their AI and open source journey.
Prayag Singh is a Senior SDE working for SageMaker AI at Amazon Web Services. With 10+ years of software development experience, he focuses on integrating customers’ preferred ML tools and IDEs on SageMaker Studio and HyperPod. Outside of work, Prayag enjoys traveling and all things comedy, from stand-up specials to sitcoms. You can find him on LinkedIn.

Claude Opus 4.5 now in Amazon Bedrock

Anthropic's newest foundation model, Claude Opus 4.5, is now available in Amazon Bedrock, a fully managed service that offers a choice of high-performing foundation models from leading AI companies. Opus 4.5 is a meaningful step forward in what AI systems can do and sets a new standard across coding, agents, computer use, and office tasks. It outperforms both Sonnet 4.5 and Opus 4.1 while providing Opus-level intelligence at one-third the cost.
In this post, I’ll show you what makes this model different, walk through key business applications, and demonstrate how to use Opus 4.5’s new tool use capabilities on Amazon Bedrock. By the end, you’ll understand how to use this model’s capabilities for production agent deployments.
Claude Opus 4.5: What makes this model different
Opus 4.5 is Anthropic’s most advanced model offering in the Opus class, designed for developers building sophisticated AI agents that can reason, plan, and execute complex tasks with minimal oversight. It upgrades Sonnet 4.5 with better performance on existing use cases and adds new capabilities for complex workflows.
The model excels in professional software engineering, achieving 80.9% on SWE-bench Verified, helping to transform multi-day development projects into hours-long tasks. It works more independently, with improved multilingual coding capabilities and enhanced behaviors like more efficient code, better test coverage, and cleaner architecture choices. For office productivity, the model handles complex projects end-to-end. It powers agents that create PowerPoint presentations, Excel spreadsheets, and Word documents with professional polish, including document redlining for contracts and NDAs. The model also produces higher quality React and HTML artifacts. It maintains consistency and accuracy, which is important for finance and other industries where precision matters, and maintains context across files throughout long projects.
This is Anthropic’s best vision model yet, achieving 80.7% on MMMU, for workflows that depend on complex visual interpretation and multi-step navigation—such as analyzing design mockups, processing documents with complex layouts, or automating browser-based tasks—with computer use performance improving further still.
The model introduces two key improvements for agent developers. The tool search tool lets agents work with hundreds of tools by dynamically discovering and loading only what they need instead of loading all definitions upfront—potentially saving tens of thousands of tokens and preventing schema confusion when scaling to large tool libraries. Tool use examples lets you provide sample tool calls directly in the tool definition, improving accuracy for complex schemas with nested objects or arrays.

Opus 4.5 performance benchmarks (source: https://www.anthropic.com/news/claude-opus-4-5)

Business applications and use cases
Opus 4.5 excels in the following use cases:

Software development: Build agents that write and refactor code across entire projects, manage full-stack architectures, or design agentic systems that break down high-level goals into executable steps. This generation of Claude spans the full development lifecycle: Opus 4.5 for production code and sophisticated agents (those using 10+ tools in workflows like end-to-end software engineering, cybersecurity, or financial analysis), Sonnet 4.5 for rapid iteration and scaled user experiences, Haiku 4.5 for sub-agents and free-tier products. Opus 4.5 can analyze technical documentation, plan a software implementation, write the required code, and iteratively refine it—while tracking requirements and architectural context throughout the process.
Enterprise operations and office tasks: Manage complex projects from start to finish. Opus 4.5 uses memory to maintain context and consistency across files, alongside improvements in creating spreadsheets, slides, and documents. The model handles ongoing enterprise projects, automating manual workflows.
Financial analysis: Work across complex information systems—regulatory filings, market reports, internal data—enabling predictive modeling and proactive compliance. The model’s consistency and accuracy make it useful for finance and other industries where precision matters.
Cybersecurity: Bring professional-grade analysis to security workflows, correlating logs, security issue databases, and security intelligence for security event detection and automated incident response.

Integration with Amazon Bedrock AgentCore
Amazon Bedrock provides the enterprise foundation for deploying Opus 4.5 in production. The fully managed service provides a unified API for foundation models with enterprise-grade security, compliance, and governance.
Opus 4.5 integrates with Amazon Bedrock AgentCore, which provides the infrastructure and primitives for building production agents. AgentCore includes persistent memory for maintaining context across sessions, Tool Gateway for converting your APIs and Lambda functions into agent-compatible tools, and built-in identity and access management for secure resource access. You can deploy and monitor agents with complete session isolation, long-running workflow support (up to 8 hours), and observability features—so you can focus on building agents instead of managing infrastructure.
Amazon Bedrock AgentCore provides additional capabilities for production deployments. The Tool Gateway converts your existing APIs and Lambda functions into agent-compatible tools with minimal code—working with the model’s tool search feature to orchestrate hundreds of tools. Built-in observability through Amazon CloudWatch tracks token usage, latency, and error rates across your agent workflows.
Getting started
Access the Opus 4.5 model today through Amazon Bedrock. I’ll demonstrate the model’s tool search capability—a feature that lets agents work with hundreds of tools without loading all definitions into context upfront. First, I import the required modules and set up the Amazon Bedrock client:

# Import required libraries
import boto3
import json

# Create a session and Bedrock client
session = boto3.Session()
bedrock_client = session.client(
    service_name='bedrock-runtime',
    region_name='us-east-1'
)

For this example, I’ll define multiple tools with defer_loading to enable tool search. This lets the model discover and load only the tools it needs instead of loading all definitions upfront:

# Define tools with tool search enabled
tools = [
    # Enable tool search - allows dynamic tool discovery
    {
        "type": "tool_search_tool_regex",
        "name": "tool_search_tool_regex"
    },
    # Tools marked with defer_loading are discovered on-demand
    {
        "name": "get_weather",
        "description": "Get current weather for a location",
        "input_schema": {
            "type": "object",
            "properties": {
                "location": {"type": "string"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
            },
            "required": ["location"]
        },
        "defer_loading": True,
        # Provide example inputs to improve accuracy for complex schemas
        "input_examples": [
            {"location": "San Francisco, CA", "unit": "fahrenheit"},
            {"location": "Tokyo, Japan", "unit": "celsius"}
        ]
    },
    {
        "name": "search_documentation",
        "description": "Search AWS documentation",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string"},
                "service": {"type": "string"}
            },
            "required": ["query"]
        },
        "defer_loading": True,
        "input_examples": [
            {"query": "Lambda pricing", "service": "lambda"},
            {"query": "S3 bucket policies"}
        ]
    },
    {
        "name": "analyze_logs",
        "description": "Analyze application logs for errors",
        "input_schema": {
            "type": "object",
            "properties": {
                "log_file": {"type": "string"},
                "time_range": {"type": "string"}
            },
            "required": ["log_file"]
        },
        "defer_loading": True,
        "input_examples": [
            {"log_file": "/var/log/app.log", "time_range": "last 24 hours"},
            {"log_file": "/var/log/error.log"}
        ]
    }
]

Now I call the model using the invoke_model API with the effort parameter set to medium:

# Construct the request with beta features enabled
request_body = {
    "anthropic_version": "bedrock-2023-05-31",
    # Enable beta features: tool search, tool examples, and effort parameter
    "anthropic_beta": ["tool-search-tool-2025-10-19", "tool-examples-2025-10-29", "effort-2025-11-24"],
    "max_tokens": 4096,
    "temperature": 0.7,
    # Set effort to "medium" for balanced token usage
    "output_config": {
        "effort": "medium"
    },
    "messages": [
        {
            "role": "user",
            "content": "What's the weather in Seattle?"
        }
    ],
    "tools": tools
}

# Invoke the model
response = bedrock_client.invoke_model(
    modelId="global.anthropic.claude-opus-4-5-20251101-v1:0",
    body=json.dumps(request_body)
)

# Parse the response
response_body = json.loads(response['body'].read())

The model uses tool search to find the relevant tool (get_weather) from the library without loading all tool definitions upfront. The effort parameter, available in beta, controls how liberally the model spends tokens across thinking, tool calls, and responses. You can set effort to high for best results, medium for balanced usage, or low for conservative token usage.
Key features for agent development
Opus 4.5 has several capabilities that make it well-suited for building production agents. The model maintains coherence across extended workflows, enabling consistent decision-making for agents that run multi-step processes over hours or days. Better tool handling means agents interact more reliably with external systems, APIs, and software interfaces: the model chooses the right tools and interprets results more accurately. Opus 4.5 also tracks information across conversation turns and maintains context, helping agents accumulate knowledge over time and make decisions based on history.
The effort parameter, available in beta, gives you control over token usage. You can set it to high for best results when quality matters most, medium for balanced performance, or low for conservative token usage. Opus 4.5 adjusts token spending across thinking, tool calls, and responses based on this setting. For production deployments, Amazon Bedrock AgentCore provides monitoring and observability through CloudWatch integration, tracking token usage in real-time (useful when tuning the effort parameter), along with latency metrics, session duration, and error rates to help optimize agent performance and manage costs.
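To see how a given effort setting affects token spend, you can inspect the usage section of the response from the earlier example. The snippet below is a small sketch that assumes the response_body variable from the code above and the usage fields of the Anthropic Messages response format.

# Inspect token usage from the earlier invoke_model call (assumes response_body from above)
usage = response_body.get("usage", {})
print("Input tokens:", usage.get("input_tokens"))
print("Output tokens:", usage.get("output_tokens"))
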
Pricing
The model is priced at $5 per million input tokens and $25 per million output tokens, making Opus-level intelligence accessible at one-third the cost of previous offerings.
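As a quick back-of-the-envelope check of what that pricing means in practice, the following sketch estimates the cost of a single agent session; the token counts are hypothetical.

# Rough cost estimate at the listed prices ($5 / $25 per million input/output tokens)
def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    return input_tokens / 1_000_000 * 5.0 + output_tokens / 1_000_000 * 25.0

# Example: a session consuming 200K input and 40K output tokens
print(f"${estimate_cost(200_000, 40_000):.2f}")   # $1.00 input + $1.00 output = $2.00
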
Availability and access
This model is available today in Amazon Bedrock through cross-Region inference, which automatically routes requests to available capacity across AWS Regions for higher throughput during peak demand.
Use this model for agents that handle long-running tasks, coordinate multiple tools, or maintain context across extended sessions.
For detailed information about availability, pricing, and model specifications, visit the Amazon Bedrock documentation.
Conclusion
This post showed you how to get started with Claude Opus 4.5 in Amazon Bedrock. Opus 4.5 excels at complex, long-running workflows like software development and enterprise operations. Opus 4.5’s capabilities in tool handling, context management, and decision-making make it valuable for building agents that operate reliably in production environments. The model works well for agents in software engineering, research synthesis, and enterprise workflow automation.
I encourage you to experiment with Opus 4.5 for your own agent workflows. Consider how its capabilities could improve manual processes in your organization, or support new types of automation. The combination of Opus 4.5’s capabilities with Amazon Bedrock’s enterprise features provides a foundation for production AI agents.
To get started, try the model in the Amazon Bedrock console, explore the technical documentation, and check out Anthropic’s Claude model detail page for more information about its capabilities. To deploy agents at scale, explore Opus 4.5 in Amazon Bedrock AgentCore for managed infrastructure with tool orchestration and monitoring.
I’d love to hear about what you build with this model—share your experiences and agent use cases in the comments below!

About the authors
Jonathan Evans is a Worldwide Solutions Architect for Generative AI at AWS, where he helps customers leverage cutting-edge AI technologies with Anthropic’s Claude models on Amazon Bedrock, to solve complex business challenges. With a background in AI/ML engineering and hands-on experience supporting machine learning workflows in the cloud, Jonathan is passionate about making advanced AI accessible and impactful for organizations of all sizes.

How to Design a Mini Reinforcement Learning Environment-Acting Agent w …

In this tutorial, we code a mini reinforcement learning setup in which a multi-agent system learns to navigate a grid world through interaction, feedback, and layered decision-making. We build everything from scratch and bring together three agent roles: an Action Agent, a Tool Agent, and a Supervisor, so we can observe how simple heuristics, analysis, and oversight combine to produce more intelligent behavior. Also, we observe how the agents collaborate, refine their strategies, and gradually learn to reach the goal while overcoming obstacles and uncertainty. Check out the FULL CODES here.

import numpy as np
import matplotlib.pyplot as plt
from IPython.display import clear_output
import time
from collections import defaultdict

class GridWorld:
    def __init__(self, size=8):
        self.size = size
        self.agent_pos = [0, 0]
        self.goal_pos = [size-1, size-1]
        self.obstacles = self._generate_obstacles()
        self.visited = set()
        self.step_count = 0
        self.max_steps = size * size * 2

    def _generate_obstacles(self):
        obstacles = set()
        n_obstacles = self.size
        while len(obstacles) < n_obstacles:
            pos = (np.random.randint(1, self.size-1),
                   np.random.randint(1, self.size-1))
            if pos != (0, 0) and pos != (self.size-1, self.size-1):
                obstacles.add(pos)
        return obstacles

    def reset(self):
        self.agent_pos = [0, 0]
        self.visited = {tuple(self.agent_pos)}
        self.step_count = 0
        return self._get_state()

    def _get_state(self):
        return {
            'position': tuple(self.agent_pos),
            'goal': self.goal_pos,
            'distance_to_goal': abs(self.agent_pos[0] - self.goal_pos[0]) +
                                abs(self.agent_pos[1] - self.goal_pos[1]),
            'visited_count': len(self.visited),
            'steps': self.step_count,
            'can_move': self._get_valid_actions()
        }

    def _get_valid_actions(self):
        valid = []
        moves = {'up': [-1, 0], 'down': [1, 0], 'left': [0, -1], 'right': [0, 1]}
        for action, delta in moves.items():
            new_pos = [self.agent_pos[0] + delta[0], self.agent_pos[1] + delta[1]]
            if (0 <= new_pos[0] < self.size and 0 <= new_pos[1] < self.size and
                    tuple(new_pos) not in self.obstacles):
                valid.append(action)
        return valid

We set up the entire GridWorld environment and define how the agent, goal, and obstacles exist in it. We establish the structure for state representation and valid movements, and we prepare the environment so we can interact with it dynamically. As we run this part, we see the world taking shape and becoming ready for the agents to explore. Check out the FULL CODES here.
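As a quick sanity check (assuming the cell above has been run), you can instantiate the environment and inspect its initial state before wiring in the agents:

# Sanity check of the environment construction
env = GridWorld(size=8)
state = env.reset()
print("start:", state['position'], "| goal:", tuple(state['goal']))
print("obstacles:", sorted(env.obstacles))
print("valid first moves:", state['can_move'])
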

class GridWorld(GridWorld):
    def step(self, action):
        self.step_count += 1
        moves = {'up': [-1, 0], 'down': [1, 0], 'left': [0, -1], 'right': [0, 1]}

        if action not in moves:
            return self._get_state(), -1, False, "Invalid action"

        delta = moves[action]
        new_pos = [self.agent_pos[0] + delta[0], self.agent_pos[1] + delta[1]]

        if not (0 <= new_pos[0] < self.size and 0 <= new_pos[1] < self.size):
            return self._get_state(), -1, False, "Hit wall"

        if tuple(new_pos) in self.obstacles:
            return self._get_state(), -1, False, "Hit obstacle"

        self.agent_pos = new_pos
        pos_tuple = tuple(self.agent_pos)
        reward = -0.1
        if pos_tuple not in self.visited:
            reward += 0.5
            self.visited.add(pos_tuple)

        done = False
        info = "Moved"
        if self.agent_pos == self.goal_pos:
            reward += 10
            done = True
            info = "Goal reached!"
        elif self.step_count >= self.max_steps:
            done = True
            info = "Max steps reached"

        return self._get_state(), reward, done, info

    def render(self, agent_thoughts=None):
        grid = np.zeros((self.size, self.size, 3))
        for pos in self.visited:
            grid[pos[0], pos[1]] = [0.7, 0.9, 1.0]
        for obs in self.obstacles:
            grid[obs[0], obs[1]] = [0.2, 0.2, 0.2]
        grid[self.goal_pos[0], self.goal_pos[1]] = [0, 1, 0]
        grid[self.agent_pos[0], self.agent_pos[1]] = [1, 0, 0]

        plt.figure(figsize=(10, 8))
        plt.imshow(grid, interpolation='nearest')
        plt.title(f"Step: {self.step_count} | Visited: {len(self.visited)}/{self.size*self.size}")
        for i in range(self.size + 1):
            plt.axhline(i - 0.5, color='gray', linewidth=0.5)
            plt.axvline(i - 0.5, color='gray', linewidth=0.5)
        if agent_thoughts:
            plt.text(0.5, -1.5, agent_thoughts, ha='center', fontsize=9,
                     bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.8),
                     wrap=True, transform=plt.gca().transData)
        plt.axis('off')
        plt.tight_layout()
        plt.show()

We define how each step in the environment works and how the world is visually rendered. We calculate rewards, detect collisions, track progress, and display everything through a clean grid visualization. As we execute this logic, we watch the agent’s journey unfold in real time with clear feedback. Check out the FULL CODES here.
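Before moving on to the agents, a small interaction check (assuming both GridWorld cells above have been run) confirms that stepping and reward shaping behave as described:

# Quick interaction check: take the first valid move from the start cell
env = GridWorld(size=8)
state = env.reset()
print("valid moves from start:", state['can_move'])
next_state, reward, done, info = env.step(state['can_move'][0])
print(f"{info}: reward={reward:.2f}, new position={next_state['position']}")
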

class ActionAgent:
    def __init__(self):
        self.q_values = defaultdict(lambda: defaultdict(float))
        self.epsilon = 0.3
        self.learning_rate = 0.1
        self.discount = 0.95

    def choose_action(self, state):
        valid_actions = state['can_move']
        if not valid_actions:
            return None
        pos = state['position']
        if np.random.random() < self.epsilon:
            action = np.random.choice(valid_actions)
            reasoning = f"Exploring randomly: chose '{action}'"
        else:
            action_values = {a: self.q_values[pos][a] for a in valid_actions}
            action = max(action_values, key=action_values.get)
            reasoning = f"Exploiting: chose '{action}' (Q={self.q_values[pos][action]:.2f})"
        return action, reasoning

    def learn(self, state, action, reward, next_state):
        pos = state['position']
        next_pos = next_state['position']
        current_q = self.q_values[pos][action]
        next_max_q = max([self.q_values[next_pos][a] for a in next_state['can_move']], default=0)
        new_q = current_q + self.learning_rate * (
            reward + self.discount * next_max_q - current_q)
        self.q_values[pos][action] = new_q

class ToolAgent:
    def analyze(self, state, action_taken, reward, history):
        suggestions = []
        distance = state['distance_to_goal']
        if distance <= 3:
            suggestions.append("Very close to goal! Prioritize direct path.")
        exploration_rate = state['visited_count'] / (state['steps'] + 1)
        if exploration_rate < 0.5 and distance > 5:
            suggestions.append("Low exploration rate. Consider exploring more.")
        if len(history) >= 5:
            recent_rewards = [h[2] for h in history[-5:]]
            avg_reward = np.mean(recent_rewards)
            if avg_reward < -0.5:
                suggestions.append("Negative reward trend. Try different strategy.")
            elif avg_reward > 0.3:
                suggestions.append("Good progress! Current strategy working.")
        if len(state['can_move']) <= 2:
            suggestions.append("Limited movement options. Be careful.")
        return suggestions

We implement the Action Agent and Tool Agent, giving the system both learning capability and analytical feedback. We observe how the Action Agent chooses actions through a balance of exploration and exploitation, while the Tool Agent evaluates performance and suggests improvements. Together, they create a learning loop that evolves with experience. Check out the FULL CODES here.
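To see these two agents in isolation (assuming GridWorld, ActionAgent, and ToolAgent above are defined), you can run a single choose-step-learn cycle and print the analysis:

# Minimal smoke test of the learning pieces
env = GridWorld(size=8)
agent = ActionAgent()
tool = ToolAgent()

state = env.reset()
action, reasoning = agent.choose_action(state)          # propose an action
next_state, reward, done, info = env.step(action)       # act in the environment
agent.learn(state, action, reward, next_state)          # Q-learning update
history = [(state, action, reward, next_state)]

print(reasoning)
print(info, "| reward:", round(reward, 2))
print("Tool Agent suggestions:", tool.analyze(next_state, action, reward, history))
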

class SupervisorAgent:
    def decide(self, state, proposed_action, tool_suggestions):
        if not proposed_action:
            return None, "No valid actions available"

        decision = proposed_action
        reasoning = f"Approved action '{proposed_action}'"

        for suggestion in tool_suggestions:
            if "goal" in suggestion.lower() and "close" in suggestion.lower():
                goal_direction = self._get_goal_direction(state)
                if goal_direction in state['can_move']:
                    decision = goal_direction
                    reasoning = f"Override: Moving '{goal_direction}' toward goal"
                break

        return decision, reasoning

    def _get_goal_direction(self, state):
        pos = state['position']
        goal = state['goal']
        if goal[0] > pos[0]:
            return 'down'
        elif goal[0] < pos[0]:
            return 'up'
        elif goal[1] > pos[1]:
            return 'right'
        else:
            return 'left'

We introduce the Supervisor Agent, which acts as the final decision-maker in the system. We see how it interprets suggestions, overrides risky choices, and ensures that actions remain aligned with overall goals. As we use this component, we experience a coordinated multi-agent decision flow. Check out the FULL CODES here.
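A quick way to see the override behavior (assuming the classes above are defined) is to hand the Supervisor a hand-crafted state that is close to the goal together with a matching suggestion; the state values below are illustrative only:

# Check the Supervisor's override logic with a hand-crafted state
supervisor = SupervisorAgent()
state = {'position': (6, 5), 'goal': [7, 7], 'distance_to_goal': 3,
         'visited_count': 20, 'steps': 25, 'can_move': ['up', 'down', 'right']}
suggestions = ["Very close to goal! Prioritize direct path."]

action, reasoning = supervisor.decide(state, 'up', suggestions)
print(action, '-', reasoning)   # expect 'down', the direction toward the goal
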

def train_multi_agent(episodes=5, visualize=True):
    env = GridWorld(size=8)
    action_agent = ActionAgent()
    tool_agent = ToolAgent()
    supervisor = SupervisorAgent()

    episode_rewards = []
    episode_steps = []

    for episode in range(episodes):
        state = env.reset()
        total_reward = 0
        done = False
        history = []

        print(f"\n{'='*60}")
        print(f"EPISODE {episode + 1}/{episodes}")
        print(f"{'='*60}")

        while not done:
            action_result = action_agent.choose_action(state)
            if action_result is None:
                break
            proposed_action, action_reasoning = action_result

            suggestions = tool_agent.analyze(state, proposed_action, total_reward, history)
            final_action, supervisor_reasoning = supervisor.decide(state, proposed_action, suggestions)

            if final_action is None:
                break

            next_state, reward, done, info = env.step(final_action)
            total_reward += reward
            action_agent.learn(state, final_action, reward, next_state)
            history.append((state, final_action, reward, next_state))

            if visualize:
                clear_output(wait=True)
                thoughts = (f"Action Agent: {action_reasoning}\n"
                            f"Supervisor: {supervisor_reasoning}\n"
                            f"Tool Agent: {', '.join(suggestions[:2]) if suggestions else 'No suggestions'}\n"
                            f"Reward: {reward:.2f} | Total: {total_reward:.2f}")
                env.render(thoughts)
                time.sleep(0.3)

            state = next_state

        episode_rewards.append(total_reward)
        episode_steps.append(env.step_count)

        print(f"\nEpisode {episode+1} Complete!")
        print(f"Total Reward: {total_reward:.2f}")
        print(f"Steps Taken: {env.step_count}")
        print(f"Cells Visited: {len(env.visited)}/{env.size**2}")

    plt.figure(figsize=(12, 4))
    plt.subplot(1, 2, 1)
    plt.plot(episode_rewards, marker='o')
    plt.title('Episode Rewards')
    plt.xlabel('Episode')
    plt.ylabel('Total Reward')
    plt.grid(True, alpha=0.3)

    plt.subplot(1, 2, 2)
    plt.plot(episode_steps, marker='s', color='orange')
    plt.title('Episode Steps')
    plt.xlabel('Episode')
    plt.ylabel('Steps to Complete')
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()

    return action_agent, tool_agent, supervisor

if __name__ == "__main__":
    print("Multi-Agent RL System: Grid World Navigation")
    print("=" * 60)
    print("Components:")
    print(" • Action Agent: Proposes actions using Q-learning")
    print(" • Tool Agent: Analyzes performance and suggests improvements")
    print(" • Supervisor Agent: Makes final decisions")
    print("=" * 60)

    trained_agents = train_multi_agent(episodes=5, visualize=True)

We run the full training loop where all agents collaborate inside the environment across multiple episodes. We track rewards, observe movement patterns, and visualize learning progression with each trial. As we complete this loop, we see the multi-agent system improving and becoming more efficient at navigating the grid world.
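After training, a simple way to gauge what the Action Agent has learned (assuming train_multi_agent has returned trained_agents) is to run one greedy episode with exploration turned off. Note that a new GridWorld re-samples its obstacles, so the result will vary from run to run:

# Greedy evaluation pass: act only on learned Q-values, no exploration
action_agent, tool_agent, supervisor = trained_agents
action_agent.epsilon = 0.0

eval_env = GridWorld(size=8)   # obstacles are re-sampled randomly
state, done, total_reward = eval_env.reset(), False, 0.0
info = "No valid actions"
while not done:
    result = action_agent.choose_action(state)
    if result is None:
        break
    action, _ = result
    state, reward, done, info = eval_env.step(action)
    total_reward += reward

print(f"Greedy episode: {info} | reward {total_reward:.2f} | steps {eval_env.step_count}")
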

In conclusion, we see how a multi-agent RL system emerges from clean components and how each layer contributes to smarter navigation: the Action Agent learns via Q-updates, the Tool Agent guides improvements, and the Supervisor ensures safe, goal-oriented action selection. We appreciate how this simple yet dynamic grid world helps us visualize learning, exploration, and decision-making in real time.

Check out the FULL CODES here. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks.
The post How to Design a Mini Reinforcement Learning Environment-Acting Agent with Intelligent Local Feedback, Adaptive Decision-Making, and Multi-Agent Coordination appeared first on MarkTechPost.