Iterative fine-tuning on Amazon Bedrock for strategic model improvemen …

Organizations often face challenges when implementing single-shot fine-tuning approaches for their generative AI models. The single-shot fine-tuning method involves selecting training data, configuring hyperparameters, and hoping the results meet expectations without the ability to make incremental adjustments. Single-shot fine-tuning frequently leads to suboptimal results and requires starting the entire process from scratch when improvements are needed.
Amazon Bedrock now supports iterative fine-tuning, enabling systematic model refinement through controlled, incremental training rounds. With this capability you can build upon previously customized models, whether they were created through fine-tuning or distillation, providing a foundation for continuous improvement without the risks associated with complete retraining.
In this post, we will explore how to implement the iterative fine-tuning capability of Amazon Bedrock to systematically improve your AI models. We’ll cover the key advantages over single-shot approaches, walk through practical implementation using both the console and SDK, discuss deployment options, and share best practices for maximizing your iterative fine-tuning results.
When to use iterative fine-tuning
Iterative fine-tuning provides several advantages over single-shot approaches that make it valuable for production environments. Risk mitigation becomes possible through incremental improvements, so you can test and validate changes before committing to larger modifications. With this approach, you can make data-driven optimization based on real performance feedback rather than theoretical assumptions about what might work. The methodology also helps developers to apply different training techniques sequentially to refine model behavior. Most importantly, iterative fine-tuning accommodates evolving business requirements driven by continuous live data traffic. As user patterns change over time and new use cases emerge that weren’t present in initial training, you can leverage this fresh data to refine your model’s performance without starting from scratch.
How to implement iterative fine-tuning on Amazon Bedrock
Setting up iterative fine-tuning involves preparing your environment and creating training jobs that build upon your existing custom models, whether through the console interface or programmatically using the SDK.
Prerequisites
Before beginning iterative fine-tuning, you need a previously customized model as your starting point. This base model can originate from either fine-tuning or distillation processes and supports customizable models and variants available on Amazon Bedrock. You’ll also need:

Standard IAM permissions for Amazon Bedrock model customization
Incremental training data focused on addressing specific performance gaps
S3 bucket for training data and job outputs

Your incremental training data should target the specific areas where your current model needs improvement rather than attempting to retrain on all possible scenarios.
Using the AWS Management Console
The Amazon Bedrock console provides a straightforward interface for creating iterative fine-tuning jobs.
Navigate to the Custom Models section and select Create fine-tuning job. The key difference in iterative fine-tuning lies in the base model selection, where you choose your previously customized model instead of a foundation model.
During training, you can visit the Custom models page in the Amazon Bedrock console to track the job status.
Once the job is complete, you can monitor its performance metrics in the console through multiple metric charts on the Training metrics and Validation metrics tabs.
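You can also check the job status programmatically. The following is a minimal sketch using boto3; the job ARN placeholder is an assumption and would normally come from the CreateModelCustomizationJob response shown in the next section.

import time
import boto3

bedrock = boto3.client('bedrock')

# Placeholder ARN (assumption): use the jobArn returned by create_model_customization_job
job_arn = "arn:aws:bedrock:<Region>:<AccountID>:model-customization-job/<job-id>"

# Poll until the customization job reaches a terminal state
while True:
    job = bedrock.get_model_customization_job(jobIdentifier=job_arn)
    status = job["status"]
    print(f"Job status: {status}")
    if status in ("Completed", "Failed", "Stopped"):
        break
    time.sleep(60)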
Using the SDK
Programmatic implementation of iterative fine-tuning follows similar patterns to standard fine-tuning with one critical difference: specifying your previously customized model as the base model identifier. Here’s an example implementation:

import boto3
from datetime import datetime
import uuid

# Initialize Bedrock client
bedrock = boto3.client('bedrock')

# Define job parameters
job_name = f"iterative-finetuning-{datetime.now().strftime('%Y-%m-%d-%H-%M-%S')}"
custom_model_name = f"iterative-model-{str(uuid.uuid4())[:8]}"

# Key difference: Use your previously customized model ARN as base
# This could be from previous fine-tuning or distillation
base_model_id = "arn:aws:bedrock:<Region>:<AccountID>:custom-model/<your-previous-custom-model-id>"

# IAM role with permissions for Amazon Bedrock model customization
role_arn = "arn:aws:iam::<AccountID>:role/<your-bedrock-customization-role>"

# S3 paths for training data and outputs
training_data_uri = "s3://<your-bucket>/<iterative-training-data>"
output_path = "s3://<your-bucket>/<iterative-output-folder>/"

# Hyperparameters adjusted based on previous iteration learnings
hyperparameters = {
    "epochCount": "3"  # Example
}

# Create the iterative fine-tuning job
response = bedrock.create_model_customization_job(
    customizationType="FINE_TUNING",
    jobName=job_name,
    customModelName=custom_model_name,
    roleArn=role_arn,
    baseModelIdentifier=base_model_id,  # Your previously customized model
    hyperParameters=hyperparameters,
    trainingDataConfig={
        "s3Uri": training_data_uri
    },
    outputDataConfig={
        "s3Uri": output_path
    }
)

job_arn = response.get('jobArn')
print(f"Iterative fine-tuning job created with ARN: {job_arn}")

Setting up inference for your iteratively fine-tuned model
Once your iterative fine-tuning job completes, you have two primary options for deploying your model for inference: Provisioned Throughput and on-demand inference, each suited to different usage patterns and requirements.
Provisioned Throughput
Provisioned Throughput offers stable performance for predictable workloads where consistent throughput requirements exist. This option provides dedicated capacity so that the iteratively fine-tuned model maintains performance standards during peak usage periods. Setup involves purchasing model units based on expected traffic patterns and performance requirements.
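As an illustration, the following minimal sketch purchases Provisioned Throughput for the custom model with boto3; the model ARN, name, and unit count are placeholders and should be sized to your expected traffic.

import boto3

bedrock = boto3.client('bedrock')

# Purchase dedicated capacity for the iteratively fine-tuned model
# (model ARN, name, and unit count below are placeholders/assumptions)
response = bedrock.create_provisioned_model_throughput(
    provisionedModelName="iterative-model-provisioned",
    modelId="arn:aws:bedrock:<Region>:<AccountID>:custom-model/<your-custom-model-id>",
    modelUnits=1,  # size based on expected traffic and performance requirements
)
print(response["provisionedModelArn"])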
On-demand inference
On-demand inference provides flexibility for variable workloads and experimentation scenarios. Amazon Bedrock now supports Amazon Nova Micro, Lite, and Pro models as well as Llama 3.3 models for on-demand inference with pay-per-token pricing. This option avoids the need for capacity planning so you can test your iteratively fine-tuned model without upfront commitments. The pricing model scales automatically with usage, making it cost-effective for applications with unpredictable or low-volume inference patterns.
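Once a deployment exists for your custom model, you can invoke it like any other Amazon Bedrock model. The following is a minimal sketch using the Converse API; the ARN is a placeholder, and whether you pass an on-demand custom model deployment ARN or a provisioned model ARN depends on the inference option you chose.

import boto3

bedrock_runtime = boto3.client('bedrock-runtime')

# Placeholder (assumption): ARN of your on-demand custom model deployment
# or provisioned model, depending on the inference option you chose
model_id = "arn:aws:bedrock:<Region>:<AccountID>:custom-model-deployment/<deployment-id>"

response = bedrock_runtime.converse(
    modelId=model_id,
    messages=[{"role": "user", "content": [{"text": "Summarize our refund policy in two sentences."}]}],
    inferenceConfig={"maxTokens": 256, "temperature": 0.2},
)
print(response["output"]["message"]["content"][0]["text"])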
Best practices
Successful iterative fine-tuning requires attention to several key areas. Most importantly, your data strategy should emphasize quality over quantity in incremental datasets. Rather than adding large volumes of new training examples, focus on high-quality data that addresses specific performance gaps identified in previous iterations.
To track progress effectively, evaluation consistency across iterations allows meaningful comparison of improvements. Establish baseline metrics during your first iteration and maintain the same evaluation framework throughout the process. You can use Amazon Bedrock Evaluations to help you systematically identify where gaps exist in your model performance after each customization run. This consistency helps you understand whether changes are producing meaningful improvements.
Finally, recognizing when to stop the iterative process helps to prevent diminishing returns on your investment. Monitor performance improvements between iterations and consider concluding the process when gains become marginal relative to the effort required.
Conclusion
Iterative fine-tuning on Amazon Bedrock provides a systematic approach to model improvement that reduces risks while enabling continuous refinement. With the iterative fine-tuning methodology organizations can build upon existing investments in custom models rather than starting from scratch when adjustments are needed.
To get started with iterative fine-tuning, access the Amazon Bedrock console and navigate to the Custom models section. For detailed implementation guidance, refer to the Amazon Bedrock documentation.

About the authors
Yanyan Zhang is a Senior Generative AI Data Scientist at Amazon Web Services, where she has been working on cutting-edge AI/ML technologies as a Generative AI Specialist, helping customers use generative AI to achieve their desired outcomes. Yanyan graduated from Texas A&M University with a PhD in Electrical Engineering. Outside of work, she loves traveling, working out, and exploring new things.
Gautam Kumar is an Engineering Manager at AWS AI Bedrock, leading model customization initiatives across large-scale foundation models. He specializes in distributed training and fine-tuning. Outside work, he enjoys reading and traveling.
Jesse Manders is a Senior Product Manager on Amazon Bedrock, the AWS Generative AI developer service. He works at the intersection of AI and human interaction with the goal of creating and improving generative AI products and services to meet our needs. Previously, Jesse held engineering team leadership roles at Apple and Lumileds, and was a senior scientist in a Silicon Valley startup. He has an M.S. and Ph.D. from the University of Florida, and an MBA from the University of California, Berkeley, Haas School of Business.

Voice AI-powered drive-thru ordering with Amazon Nova Sonic and dynami …

Artificial Intelligence (AI) is transforming the quick-service restaurant industry, particularly in drive-thru operations where efficiency and customer satisfaction intersect. Traditional systems create significant obstacles in service delivery, from staffing limitations and order accuracy issues to inconsistent customer experiences across locations. These challenges, combined with rising labor costs and demand fluctuations, have pushed the industry to seek innovative solutions.
In this post, we’ll demonstrate how to implement a Quick Service Restaurants (QSRs) drive-thru solution using Amazon Nova Sonic and AWS services. We’ll walk through building an intelligent system that combines voice AI with interactive menu displays, providing technical insights and implementation guidance to help restaurants modernize their drive-thru operations.
For QSRs, the stakes are particularly high during peak hours, when long wait times and miscommunication between customers and staff can significantly impact business performance. Common pain points include order accuracy issues, service quality variations across different shifts, and limited ability to handle sudden spikes in customer demand. Modern consumers expect the same seamless, efficient service they experience with digital ordering systems, creating an unprecedented opportunity for voice AI technology to support 24/7 availability and consistent service quality.
Amazon Nova Sonic is a foundation model (FM) within the Amazon Nova family, designed specifically for voice-enabled applications. Available through Amazon Bedrock, developers can use Nova Sonic to create applications that understand spoken language, process complex conversational interactions, and generate appropriate responses for real-time customer engagement. This innovative speech-to-speech model addresses traditional voice application challenges through:

Accurate recognition of streaming speech across accents, with robustness to background noise
Speech responses that adapt to the user's tone and sentiment
Bidirectional streaming speech I/O with low user-perceived latency
Graceful interruption handling and natural turn-taking in conversations
Industry-leading price-performance

When integrated with AWS serverless services, Nova Sonic delivers natural, human-like voice interactions that help improve the drive-thru experience. The architecture creates a cost-effective system that enhances both service consistency and operational efficiency through intelligent automation.
Solution overview
Our voice AI drive-thru solution creates an intelligent ordering system that combines real-time voice interaction with a robust backend infrastructure, delivering a natural customer experience. The system processes speech in real time, understanding various accents and speaking styles and handling the background noise common in drive-thru environments. Integrating voice commands with interactive menu displays enhances user feedback while streamlining the ordering process by reducing verbal interactions.
The system is built on AWS serverless architecture, integrating key components including Amazon Cognito for authentication with role-based access control, AWS Amplify for the digital menu board, Amazon API Gateway to facilitate access to Amazon DynamoDB tables, AWS Lambda functions with Amazon Nova Canvas for menu image generation, and Amazon Simple Storage Service (Amazon S3) with Amazon CloudFront for image storage and delivery.
The following architecture diagram illustrates how these services interconnect to enable natural conversations between customers and the digital menu board, orchestrating the entire customer journey from drive-thru entry to order completion.

Let’s examine how each component works together to power this intelligent ordering system.
Prerequisites
You must have the following in place to complete the solution in this post:

An AWS account
FM access in Amazon Bedrock for Amazon Nova Sonic and Amazon Nova Canvas in the same AWS Region where you will deploy this solution
The accompanying AWS CloudFormation templates downloaded from the aws-samples GitHub repo

Deploy solution resources using AWS CloudFormation
Deploy the CloudFormation templates in an AWS Region where Amazon Bedrock is available and has support for the following models: Amazon Nova Sonic and Amazon Nova Canvas.
This solution consists of two CloudFormation templates that work together to create a complete restaurant drive-thru ordering system. The nova-sonic-infrastructure-drivethru.yaml template establishes the foundational AWS infrastructure, including Cognito user authentication, S3 storage with CloudFront CDN for menu images, DynamoDB tables for menu items and customer data, and API Gateway endpoints with proper CORS configuration. The nova-sonic-application-drivethru.yaml template builds upon this foundation by deploying a Lambda function that populates the system with a complete embedded drive-thru menu featuring burgers, wings, fries, drinks, sauces, and combo meals. The function uses the Amazon Nova Canvas AI model to automatically generate professional food photography for each menu item and stores the images in the S3 bucket for delivery through CloudFront.
During the deployment of the first CloudFormation template nova-sonic-infrastructure-drivethru.yaml, you will need to specify the following parameters:

Stack name
Environment – Deployment environment: dev, staging, or prod (defaults to dev)
UserEmail – Valid email address for the user account (required)
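If you prefer to script the deployment instead of using the console, the following minimal boto3 sketch creates the infrastructure stack; the stack name and parameter values are example assumptions, and the template file is the one downloaded from the aws-samples repository.

import boto3

cloudformation = boto3.client("cloudformation")

# Read the infrastructure template downloaded from the aws-samples repository
with open("nova-sonic-infrastructure-drivethru.yaml") as f:
    template_body = f.read()

# Stack name and parameter values below are examples (assumptions)
cloudformation.create_stack(
    StackName="nova-sonic-drivethru-infra",
    TemplateBody=template_body,
    Parameters=[
        {"ParameterKey": "Environment", "ParameterValue": "dev"},
        {"ParameterKey": "UserEmail", "ParameterValue": "you@example.com"},
    ],
    Capabilities=["CAPABILITY_NAMED_IAM"],  # required because the template creates IAM roles
)

# Wait until the stack is created before deploying the application template
cloudformation.get_waiter("stack_create_complete").wait(StackName="nova-sonic-drivethru-infra")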

Important: You must enable access to the selected Amazon Nova Sonic model and Amazon Nova Canvas model in the Amazon Bedrock console before deployment.
AWS resource usage will incur costs. When deployment is complete, the stack will have created the following resources:

Amazon Cognito resources:

User pool – CognitoUserPool
App client – AppClient
Identity pool – CognitoIdentityPool
Groups – AppUserGroup
User – AppUser

AWS Identity and Access Management (IAM) resources:

IAM roles:

AuthenticatedRole
DefaultAuthenticatedRole
ApiGatewayDynamoDBRole
LambdaExecutionRole
S3BucketCleanupRole

Amazon DynamoDB tables:

MenuTable – Stores menu items, pricing, and customization options
LoyaltyTable – Stores customer loyalty information and points
CartTable – Stores shopping cart data for active sessions
OrderTable – Stores completed and pending orders
ChatTable – Stores completed chat details

Amazon S3, CloudFront and AWS WAF resources:

MenuImagesBucket – S3 bucket for storing menu item images
MenuImageCloudFrontDistribution – CloudFront distribution for global content delivery
CloudFrontOriginAccessIdentity – Secure access between CloudFront and S3
CloudFrontWebACL – WAF protection for CloudFront distribution with security rules

Amazon API Gateway resources:

REST API – app-api with Cognito authorization
API resources and methods:

/menu (GET, OPTIONS)
/loyalty (GET, OPTIONS)
/cart (POST, DELETE, OPTIONS)
/order (POST, OPTIONS)
/chat (POST, OPTIONS)

API deployment to specified environment stage

AWS Lambda function:

S3BucketCleanupLambda – Cleans up S3 bucket on stack deletion

CloudFormation Custom Resource:

S3BucketCleanup – Triggers S3BucketCleanupLambda

After you deploy the CloudFormation template, copy the following from the Outputs tab on the AWS CloudFormation console to use during the configuration of your frontend application:

cartApiUrl
loyaltyApiUrl
menuApiUrl
orderApiUrl
chatApiUrl
UserPoolClientId
UserPoolId
IdentityPoolId

The following screenshot shows you what the Outputs tab will look like.

These output values are essential for configuring your frontend application (deployed via AWS Amplify) to connect with the backend services. The API URLs will be used for making REST API calls, while the Cognito IDs will be used for user authentication and authorization.
During the deployment of the second CloudFormation template nova-sonic-application-drivethru.yaml you will need to specify the following parameters:

Stack name
InfrastructureStackName – This stack name matches the one you previously deployed using nova-sonic-infrastructure-drivethru.yaml

When deployment is complete, the stack will have created the following resources:

AWS Lambda function:

DriveThruMenuLambda – Populates menu data and generates AI images

CloudFormation Custom Resource:

DriveThruMenuPopulation – Triggers DriveThruMenuLambda

Once both CloudFormation templates are successfully deployed, you’ll have a fully functional restaurant drive-thru ordering system with AI-generated menu images, complete authentication, and ready-to-use API endpoints for your Amplify frontend deployment.
Deploy the Amplify application
You need to manually deploy the Amplify application using the frontend code found on GitHub. Complete the following steps:

Download the frontend code NovaSonic-FrontEnd.zip from GitHub.
Use the .zip file to manually deploy the application in Amplify.
Return to the Amplify page and use the domain it automatically generated to access the application.

User authentication
The solution uses Amazon Cognito user pools and identity pools to implement secure, role-based access control for the restaurant's digital menu board. User pools handle authentication and group management through the AppUserGroup, and identity pools provide temporary AWS credentials mapped to specific IAM roles, including AuthenticatedRole. The system makes sure that only verified digital menu board users can access the application and interact with the menu APIs, cart management, order processing, and loyalty services, while also providing secure access to Amazon Bedrock. This combines robust security with an intuitive ordering experience for both customers and restaurant operations.
Serverless data management
The solution implements a serverless API architecture using Amazon API Gateway to create a single REST API (app-api) that facilitates communication between the frontend interface and backend services. The API includes five resource endpoints (/menu, /loyalty, /cart, /chat, /order) with Cognito-based authentication and direct DynamoDB integration for data operations. The backend uses five DynamoDB tables: MenuTable for menu items and pricing, LoyaltyTable for customer profiles and loyalty points, CartTable for active shopping sessions, ChatTable for capturing chat history, and OrderTable for order tracking and history. This architecture provides fast, consistent performance at scale, with Global Secondary Indexes enabling efficient queries by customer ID and order status for optimal drive-thru operations.
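For illustration only, the following sketch shows the kind of query the order endpoint could issue against the order table; the table and index names are assumptions, since the actual resource names are generated by the CloudFormation template.

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")

# Table and index names are assumptions; the CloudFormation template
# generates the actual resource names
order_table = dynamodb.Table("OrderTable")

# Query orders for a customer using a Global Secondary Index on customer ID
response = order_table.query(
    IndexName="CustomerIdIndex",  # hypothetical GSI name
    KeyConditionExpression=Key("customerId").eq("customer-123"),
)
for item in response["Items"]:
    print(item["orderId"], item.get("status"))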
Menu and image generation and distribution
The solution uses Amazon S3 and CloudFront for secure, global content delivery of menu item images. The CloudFormation template creates a MenuImagesBucket with restricted access through a CloudFront Origin Access Identity, making sure images are served securely using the CloudFront distribution for fast loading times worldwide. AWS Lambda powers the AI-driven content generation through the DriveThruMenuLambda function, which automatically populates sample menu data and generates high-quality menu item images using Amazon Nova Canvas. This serverless function executes during stack deployment to create professional food photography for the menu items, from classic burgers to specialty wings, facilitating consistent visual presentation across the entire menu. The Lambda function integrates with DynamoDB to store generated image URLs and uses S3 for persistent storage, creating a complete automated workflow that scales based on demand while optimizing costs through pay-per-use pricing.
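As a rough illustration of what the DriveThruMenuLambda does, the following sketch generates a single menu image with Amazon Nova Canvas and uploads it to S3. The request body follows the Nova Canvas text-to-image format as we understand it, and the bucket name, key, and prompt are assumptions; verify the field names against the current documentation.

import base64
import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime")
s3 = boto3.client("s3")

# Text-to-image request for Amazon Nova Canvas (field names per our understanding
# of the Nova Canvas API; confirm against the current documentation)
body = {
    "taskType": "TEXT_IMAGE",
    "textToImageParams": {"text": "Professional food photography of a classic cheeseburger on a white plate"},
    "imageGenerationConfig": {"numberOfImages": 1, "width": 1024, "height": 1024},
}

response = bedrock_runtime.invoke_model(
    modelId="amazon.nova-canvas-v1:0",
    body=json.dumps(body),
)
image_b64 = json.loads(response["body"].read())["images"][0]

# Store the generated image in the menu images bucket (bucket name is a placeholder)
s3.put_object(
    Bucket="<menu-images-bucket>",
    Key="menu/classic-cheeseburger.png",
    Body=base64.b64decode(image_b64),
    ContentType="image/png",
)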
Voice AI processing
The solution uses Amazon Nova Sonic as the core voice AI engine. The digital menu board establishes direct integration with Amazon Nova Sonic through secure WebSocket connections, for immediate processing of customer speech input and conversion to structured ordering data. The CloudFormation template configures IAM permissions for the AuthenticatedRole to access the amazon.nova-sonic-v1:0 foundation model, allowing authenticated users to interact with the voice AI service. Nova Sonic handles complex natural language understanding and intent recognition, processing customer requests like menu inquiries, order modifications, and item customizations while maintaining conversation context throughout the ordering process. This direct integration minimizes latency concerns and provides customers with a natural, conversational ordering experience that rivals human interaction while maintaining reliable service across drive-thru locations.
Hosting the digital menu board
AWS Amplify hosts and delivers the digital menu board interface as a scalable frontend application. The interface displays AI-generated menu images through CloudFront, with real-time pricing from DynamoDB, optimized for drive-thru environments. The React-based application automatically scales during peak hours, using the global content delivery network available in CloudFront for fast loading times. It integrates with Amazon Cognito for authentication, establishes WebSocket connections to Amazon Nova Sonic for voice processing, and uses API Gateway endpoints for menu and order management. This serverless solution maintains high availability while providing real-time visual updates as customers interact through voice commands.
WebSocket connection flow
The following sequence diagram illustrates the WebSocket connection setup enabling direct browser-to-Nova Sonic communication. This architecture leverages the AWS SDK update (client-bedrock-runtime v3.842.0), which introduces WebSocketHandler support in browsers, avoiding the need for a server.

This advancement allows frontend applications to establish direct WebSocket connections to Nova Sonic, reducing latency and complexity while enabling real-time conversational AI in the browser. The initialization process includes credential validation, Bedrock client establishment, AI assistant configuration, and audio input setup (16kHz PCM). This direct client-to-service communication represents a shift from traditional architectures, offering more efficient and scalable conversational AI applications.
Voice interaction and dynamic menu
The following sequence diagram illustrates the flow of a customer’s burger query, demonstrating how natural language requests are processed to deliver synchronized audio responses and visual updates.

This diagram shows how a query (“Can you show me what burgers you have?”) is handled. Nova Sonic calls getMenuItems ({category: “burgers”}) to retrieve menu data, while Frontend App components fetch and structure burger items and prices. Nova Sonic generates a contextual response and triggers showCategory ({category: “burgers”}) to highlight the burger section in the UI. This process facilitates real-time synchronization between audio responses and visual menu updates, creating a seamless customer experience throughout the conversation.
Drive-thru solution walkthrough
After deploying your application in AWS Amplify, open the generated URL in your browser. You'll see two setup options: Choose Sample and Manual Setup. Select Choose Sample, pick AI Drive-Thru Experience from the sample list, and then select Load Sample. This automatically imports the system prompt, tools, and tool configurations for the drive-thru solution. We will configure these settings in the following steps.

After selecting Load Sample, you’ll be prompted to configure the connection settings. You’ll need to use the Amazon Cognito and API Gateway information from your CloudFormation stack outputs. These values are required because they connect your digital menu board to backend services.
Enter the configuration values you copied from the CloudFormation outputs (nova-sonic-infrastructure-drivethru.yaml). These are organized into two sections, as demonstrated in the following videos. After you enter the configuration details in each section, select the Save button at the top of the screen.
Amazon Cognito configuration:

UserPoolId
UserPoolClientId
IdentityPoolId

Agent configuration:

Auto-Initiate Conversation – Nova Sonic is initially set to wait for you to start the conversation. However, you can enable automatic conversation initiation by selecting the Enable auto-initiate checkbox, which uses a pre-recorded "Hello" that is stored locally.

Tools global parameters:

menuAPIURL
cartAPIURL
orderAPIURL
loyaltyAPIURL
chatAPIURL

After completing the configuration, click the Save and Exit button at the top of the page. This action will redirect you to a sign-in screen. To access the system, use the username appuser and the temporary password that was automatically generated and emailed to the address you provided during the CloudFormation deployment.
After entering the temporary password, you’ll be asked to verify your account through a temporary code sent to your email.
Upon your initial login attempt, you’ll be required to create a new password to replace the temporary one, as demonstrated in the following video.

Begin your drive-thru experience by clicking the microphone icon. The AI assistant welcomes you and guides you through placing your order while dynamically updating the digital menu board to highlight relevant items. The system intelligently suggests complementary items and adapts its communication style to enhance your ordering experience.

Clean up
If you decide to discontinue using the solution, you can follow these steps to remove it, its associated resources deployed using AWS CloudFormation, and the Amplify deployment:

Delete the CloudFormation stack:

On the AWS CloudFormation console, choose Stacks in the navigation pane.
Locate the stack you created during the deployment process of nova-sonic-application-drivethru.yaml (you assigned a name to it).
Select the stack and choose Delete.
Repeat this for the stack created from nova-sonic-infrastructure-drivethru.yaml.

Delete the Amplify application and its resources. For instructions, refer to Clean Up Resources.

Conclusion
The voice AI-powered drive-thru ordering system using Amazon Nova Sonic provides restaurants with a practical solution to common operational challenges, including staffing constraints, order accuracy issues, and peak-hour bottlenecks. The serverless architecture built on AWS services (Amazon Cognito for authentication, API Gateway for data communication, DynamoDB for storage, and AWS Amplify for hosting) creates a scalable system that handles varying demand while maintaining consistent performance. The system supports essential restaurant operations including menu management, cart functionality, loyalty programs, and order processing through direct API Gateway and DynamoDB integration.
For restaurants looking to modernize their drive-thru operations, this solution offers measurable benefits including reduced wait times, improved order accuracy, and operational efficiency gains. The pay-per-use pricing model and automated scaling help control costs while supporting business growth. As customer expectations shift toward more efficient service experiences, implementing voice AI technology provides restaurants with a competitive advantage and positions them well for future technological developments in the food service industry.
Additional resources
To learn more about Amazon Nova Sonic and additional solutions, refer to the following resources:

Introducing Amazon Nova Sonic: Human-like voice conversations for generative AI applications
Frontend application source code used in this blog is available on GitHub
Voice AI-Powered Hotel In-Room Service with Amazon Nova Sonic

About the Authors

Salman Ahmed
Salman is a Senior Technical Account Manager in AWS Enterprise Support. He specializes in guiding customers through the design, implementation, and support of AWS solutions. Combining his networking expertise with a drive to explore new technologies, he helps organizations successfully navigate their cloud journey. Outside of work, he enjoys photography, traveling, and watching his favorite sports teams.

Sergio Barraza
Sergio is a Senior Technical Account Manager at AWS, helping customers design and optimize cloud solutions. With more than 25 years in software development, he guides customers through AWS services adoption. Outside of work, Sergio is a multi-instrument musician playing guitar, piano, and drums, and he also practices Wing Chun Kung Fu.

Ravi Kumar
Ravi is a Senior Technical Account Manager in AWS Enterprise Support who helps customers in the travel and hospitality industry to streamline their cloud operations on AWS. He is a results-driven IT professional with over 20 years of experience. Ravi is passionate about generative AI and actively explores its applications in cloud computing. In his free time, Ravi enjoys creative activities like painting. He also likes playing cricket and traveling to new places.

Ankush Goyal
Ankush is a Senior Technical Account Manager at AWS Enterprise Support, specializing in helping customers in the travel and hospitality industries optimize their cloud infrastructure. With over 20 years of IT experience, he focuses on leveraging AWS networking services to drive operational efficiency and cloud adoption. Ankush is passionate about delivering impactful solutions and enabling clients to streamline their cloud operations.

Leland Johnson
Leland is a Sr. Solutions Architect for AWS focusing on travel and hospitality. As a Solutions Architect, he plays a crucial role in guiding customers through their cloud journey by designing scalable and secure cloud solutions. Outside of work, he enjoys playing music and flying light aircraft.

Optimizing document AI and structured outputs by fine-tuning Amazon No …

Multimodal fine-tuning represents a powerful approach for customizing vision large language models (LLMs) to excel at specific tasks that involve both visual and textual information. Although base multimodal models offer impressive general capabilities, they often fall short when faced with specialized visual tasks, domain-specific content, or output formatting requirements. Fine-tuning addresses these limitations by adapting models to your specific data and use cases, dramatically improving performance on tasks that matter to your business.
A common use case is document processing, which includes extracting structured information from complex layouts such as invoices, purchase orders, forms, tables, or technical diagrams. Although off-the-shelf LLMs often struggle with specialized documents like tax forms, invoices, and loan applications, fine-tuned models can learn from highly varied data and deliver significantly higher accuracy while reducing processing costs.
This post provides a comprehensive hands-on guide to fine-tune Amazon Nova Lite for document processing tasks, with a focus on tax form data extraction. Using our open-source GitHub repository code sample, we demonstrate the complete workflow from data preparation to model deployment. Since Amazon Bedrock provides on-demand inference with pay-per-token pricing for Amazon Nova, we can benefit from the accuracy improvement from model customization and maintain the pay-as-you-go cost structure.
The document processing challenge
Given a single or multi-page document, the goal is to extract or derive specific structured information from the document so that it can be used for downstream systems or additional insights. The following diagram shows how a vision LLM can be used to derive the structured information based on a combination of text and vision capabilities.

The key challenges for enterprises in workflow automation when processing documents, like invoices or W2 tax forms, are the following:

Complex layouts: Specialized forms contain multiple sections with specific fields arranged in a structured format.
Variability of document types: Many diverse document types exist (invoices, contracts, forms).
Variability within a single document type: Each vendor can send a different invoice format and style or type.
Data quality variations: Scanned documents vary in quality, orientation, and completeness.
Language barriers: Documents can be in multiple languages.
Critical accuracy requirements: Tax-related data extraction demands extremely high accuracy.
Structured output needs: Extracted data must be formatted consistently for downstream processing.
Scalability and integration: Grow with business needs and integrate with existing systems; for example, Enterprise Resource Planning (ERP) systems.

Approaches for intelligent document processing that use LLMs or vision LLMs fall into three main categories:

Zero-shot prompting: An LLM or vision LLM is used to derive the structured information based on the input document, instructions, and the target schema.
Few-shot prompting: A technique used with LLMs or vision LLMs where a few additional examples (document + target output) are provided within the prompt to guide the model in completing a specific task. Unlike zero-shot prompting, which relies solely on natural language instructions, few-shot prompting can improve accuracy and consistency by demonstrating the desired input-output behavior through a set of examples.
Fine-tuning: Customize or fine-tune the weights of a given LLM or vision LLM by providing larger amounts of annotated documents (input/output pairs), to teach the model exactly how to extract or interpret relevant information.

For the first two approaches, refer to the amazon-nova-samples repository, which contains sample code on how to use the Amazon Bedrock Converse API for structured output by using tool calling.
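For reference, the following is a minimal sketch of the structured-output pattern with the Converse API and tool calling; the tool name, schema, and model ID are illustrative assumptions rather than the repository's exact code, so consult amazon-nova-samples for the full examples.

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

# Illustrative extraction schema (an assumption, not the repository's exact schema)
tool_config = {
    "tools": [{
        "toolSpec": {
            "name": "extract_w2_fields",
            "description": "Return the extracted W2 fields as structured JSON.",
            "inputSchema": {"json": {
                "type": "object",
                "properties": {
                    "employee_name": {"type": "string"},
                    "employer_ein": {"type": "string"},
                    "wages": {"type": "number"},
                },
                "required": ["employee_name", "wages"],
            }},
        }
    }]
}

with open("w2_sample.png", "rb") as f:
    image_bytes = f.read()

response = bedrock_runtime.converse(
    modelId="us.amazon.nova-lite-v1:0",  # example model ID; adjust to your Region/profile
    messages=[{"role": "user", "content": [
        {"image": {"format": "png", "source": {"bytes": image_bytes}}},
        {"text": "Extract the W2 fields and return them via the extract_w2_fields tool."},
    ]}],
    toolConfig=tool_config,
)

# The structured output arrives as a toolUse block in the model's response
for block in response["output"]["message"]["content"]:
    if "toolUse" in block:
        print(block["toolUse"]["input"])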
Off-the-shelf LLMs excel at general document understanding, but they might not optimally handle domain-specific challenges. A fine-tuned Nova model can enhance performance by:

Learning document-specific layouts and field relationships
Adapting to common quality variations in your document dataset
Providing consistent, structured outputs
Maintaining high accuracy across different document variations. For example, invoice documents can come from hundreds of different vendors, each with different formats, layouts, or even languages.

Creating the annotated dataset and selecting the customization technique
While there are various methods for customization of Amazon Nova models available, the most relevant for document processing are the following:

Fine-tune for specific tasks: Adapt Nova models for specific tasks using supervised fine-tuning (SFT). Choose between Parameter-Efficient Fine-Tuning (PEFT) for light-weight adaptation with limited data, or full fine-tuning when you have extensive training datasets to update all parameters of the model.
Distill to create smaller, faster models: Use knowledge distillation to transfer knowledge from a larger, more intelligent model, like Nova Premier (teacher) to a smaller, faster, more cost-efficient model (student), ideal for when you don’t have enough annotated training datasets and the teacher model provides the accuracy that meets your requirement.

To learn from previous examples, you need either an annotated dataset the model can learn from or a model that is already good enough at your task to serve as a teacher.

Automated dataset annotation with historic data from Enterprise Resource Planning (ERP) systems, such as SAP: Many customers already have historical documents that were manually processed and consumed by downstream systems, like ERP or customer relationship management (CRM) systems. Explore existing downstream systems like SAP and the data they contain. This data can often be mapped back to the original source document it was derived from and helps you bootstrap an annotated dataset very quickly.
Manual dataset annotation: Identify the most relevant documents and formats, and annotate them using human annotators, so that you have document/JSON pairs where the JSON contains the target information that you want to extract or derive from your source documents.
Annotate with the teacher model: Explore if a larger model like Nova Premier can provide accurate enough results using prompt engineering. If that is the case, you can also use distillation.

For the first and second options, we recommend supervised model fine-tuning. For the third, model distillation is the right approach.
Amazon Bedrock currently provides both fine-tuning and distillation techniques, so anyone with a basic data science skillset can easily submit jobs. They run on compute completely managed by Amazon, so you don't have to worry about instance sizes or capacity limits.
Nova customization is also available on Amazon SageMaker, with more options and controls. For example, if you have sufficient high-quality labeled data and want deeper customization for your use case, full-rank fine-tuning might produce higher accuracy. Full-rank fine-tuning is supported with SageMaker training jobs and SageMaker HyperPod.
Data preparation best practices
The quality and structure of your training data fundamentally determine the success of fine-tuning. Here are key steps and considerations for preparing effective multimodal datasets and configuring your fine-tuning job:
Dataset analysis and base model evaluation
Our demonstration uses a synthetic dataset of W2 tax forms: the Fake W-2 (US Tax Form) Dataset. This public dataset comprises simulated US tax returns (W-2 statements for years 2016-19), including noisy images that mimic low-quality scanned W2 tax forms.
Before fine-tuning, it’s crucial to:

Analyze dataset characteristics (image quality, field completeness, class distribution), define use-case-specific evaluation metrics, and establish baseline model performance.
Compare each predicted field value against the ground truth, calculating precision, recall, and F1 scores for individual fields and overall performance (a minimal sketch of this comparison appears after this list).
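The following is a minimal sketch of that field-level comparison; the normalization and matching rules are simplified assumptions, and a real evaluation may need fuzzy or numeric matching per field.

def field_metrics(predictions, ground_truths):
    """Compare predicted field dicts to ground-truth dicts and compute precision/recall/F1.

    A field counts as a true positive when the predicted value exactly matches the
    ground truth after simple normalization (a simplifying assumption).
    """
    norm = lambda v: str(v).strip().lower()
    tp = fp = fn = 0
    for pred, gold in zip(predictions, ground_truths):
        for field, gold_value in gold.items():
            pred_value = pred.get(field)
            if pred_value is None:
                fn += 1                      # field missing from the prediction
            elif norm(pred_value) == norm(gold_value):
                tp += 1                      # exact match after normalization
            else:
                fp += 1                      # field present but wrong
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Example usage with a single document
print(field_metrics(
    predictions=[{"wages": "55000.00", "employee_name": "Jane Doe"}],
    ground_truths=[{"wages": "55000.00", "employee_name": "Jane R. Doe"}],
))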

Prompt optimization
Crafting an effective prompt is essential for aligning the model with task requirements. Our system comprises two key components:

System prompt: Defines the task, provides detailed instructions for each field to be extracted, and specifies the output format.
User prompt: Follows Nova vision understanding best practices, utilizing the {media_file}-then-{text} structure as outlined in the Amazon Nova model user guide.

Iterate on your prompts using the base model to optimize performance before fine-tuning.
Dataset preparation
Prepare your dataset in JSONL format and split it into training, validation, and test sets:

Training set: 70-80% of data
Validation set: 10-20% of data
Test set: 10-20% of data
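To make this concrete, here is a minimal sketch that writes conversation-style JSONL records and performs a random split. The schema version string, field names, prompts, and S3 paths reflect our understanding of the Nova multimodal fine-tuning format and are assumptions; verify them against the Amazon Nova customization documentation and adapt them to your own annotations.

import json
import random

# Example annotated records: image location plus target JSON (paths and fields are placeholders)
annotations = [
    {"image_s3_uri": "s3://<your-bucket>/w2/0001.png", "target_json": {"employee_name": "Jane Doe", "wages": 55000.0}},
    # ... more annotated documents
]

SYSTEM_PROMPT = "You extract W2 fields and return valid JSON."
USER_PROMPT = "Extract all W2 fields from this form and return them as JSON."

def to_record(ann):
    # Conversation-style training record; schema version per our understanding of Nova fine-tuning
    return {
        "schemaVersion": "bedrock-conversation-2024",
        "system": [{"text": SYSTEM_PROMPT}],
        "messages": [
            {"role": "user", "content": [
                {"image": {"format": "png", "source": {"s3Location": {"uri": ann["image_s3_uri"]}}}},
                {"text": USER_PROMPT},
            ]},
            {"role": "assistant", "content": [{"text": json.dumps(ann["target_json"])}]},
        ],
    }

random.seed(42)
random.shuffle(annotations)
n = len(annotations)
splits = {
    "train.jsonl": annotations[: int(0.8 * n)],
    "validation.jsonl": annotations[int(0.8 * n): int(0.9 * n)],
    "test.jsonl": annotations[int(0.9 * n):],
}
for filename, records in splits.items():
    with open(filename, "w") as f:
        for ann in records:
            f.write(json.dumps(to_record(ann)) + "\n")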

Fine-tuning job configuration and monitoring
Once the dataset is prepared and uploaded to an Amazon Simple Storage Service (Amazon S3) bucket, we can configure and submit the fine-tuning job on Amazon Bedrock. Key parameters include:

| Parameter | Definition | Purpose |
|---|---|---|
| Epochs | Number of complete passes through the training dataset | Determines how many times the model sees the entire dataset during training |
| Learning rate | Step size for gradient descent optimization | Controls how much model weights are adjusted in response to estimated error |
| Learning rate warmup steps | Number of steps to gradually increase the learning rate | Prevents instability by slowly ramping up the learning rate from a small value to the target rate |
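Putting these parameters together, a job submission could look like the following sketch. The role, bucket paths, base model identifier, and hyperparameter values are illustrative assumptions, and the exact hyperparameter key names accepted depend on the model, so check the Nova customization documentation.

import boto3

bedrock = boto3.client("bedrock")

# Placeholders (assumptions): role, bucket paths, base model ID, and hyperparameter values
response = bedrock.create_model_customization_job(
    jobName="nova-lite-w2-finetune",
    customModelName="nova-lite-w2-extractor",
    customizationType="FINE_TUNING",
    roleArn="arn:aws:iam::<AccountID>:role/<bedrock-customization-role>",
    baseModelIdentifier="amazon.nova-lite-v1:0",
    hyperParameters={
        "epochCount": "2",
        "learningRate": "0.00001",
        # the warmup-steps key name may differ by model; see the Nova customization docs
    },
    trainingDataConfig={"s3Uri": "s3://<your-bucket>/train.jsonl"},
    validationDataConfig={"validators": [{"s3Uri": "s3://<your-bucket>/validation.jsonl"}]},
    outputDataConfig={"s3Uri": "s3://<your-bucket>/output/"},
)
print(response["jobArn"])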

Amazon Bedrock customization provides validation loss metrics throughout the training process. Monitor these metrics to:

Assess model convergence
Detect potential overfitting
Gain early insights into model performance on unseen data

The following graph shows an example metric analysis:

When analyzing the training and validation loss curves, the relative behavior between these metrics provides crucial insights into the model’s learning dynamics. Optimal learning patterns can be observed as:

Both training and validation losses decrease steadily over time
The curves maintain relatively parallel trajectories
The gap between training and validation loss remains stable
Final loss values converge to similar ranges

Model inference options for customized models
Once your custom model has been created in Amazon Bedrock, you have two main ways to run inference against it: on-demand custom model inference (ODI) deployments, or Provisioned Throughput endpoints. Let's talk about why and when to choose one over the other.
On-demand custom model deployments provide a flexible and cost-effective way to leverage your custom Bedrock models. With on-demand deployments, you only pay for the compute resources you use, based on the number of tokens processed during inference. This makes on-demand a great choice for workloads with variable or unpredictable usage patterns, where you want to avoid over-provisioning resources. The on-demand approach also offers automatic scaling, so you don’t have to worry about managing infrastructure capacity. Bedrock will automatically provision the necessary compute power to handle your requests in near real time. This self-service, serverless experience can simplify your operations and deployment workflows.
Alternatively, Provisioned Throughput endpoints are recommended for workloads with steady traffic patterns and consistent high-volume requirements, offering predictable performance and cost benefits over on-demand scaling.
This example uses the ODI option to take advantage of per-token pricing; the following code snippet shows how you can create an ODI deployment for your custom model:

import time
import boto3

# Amazon Bedrock control-plane client
bedrock = boto3.client('bedrock')

# Function to create on-demand inferencing deployment for custom model
def create_model_deployment(custom_model_arn):
    """
    Create an on-demand inferencing deployment for the custom model

    Parameters:
    -----------
    custom_model_arn : str
        ARN of the custom model to deploy

    Returns:
    --------
    deployment_arn : str
        ARN of the created deployment
    """
    try:
        print(f"Creating on-demand inferencing deployment for model: {custom_model_arn}")

        # Generate a unique name for the deployment
        deployment_name = f"nova-ocr-deployment-{time.strftime('%Y%m%d-%H%M%S')}"

        # Create the deployment
        response = bedrock.create_custom_model_deployment(
            modelArn=custom_model_arn,
            modelDeploymentName=deployment_name,
            description=f"on-demand inferencing deployment for model: {custom_model_arn}",
        )

        # Get the deployment ARN
        deployment_arn = response.get('customModelDeploymentArn')

        print(f"Deployment request submitted. Deployment ARN: {deployment_arn}")
        return deployment_arn

    except Exception as e:
        print(f"Error creating deployment: {e}")
        return None

Evaluation: Accuracy improvement with fine-tuning
Our evaluation of the base model and the fine-tuned Nova model shows significant improvements across all field categories. Let’s break down the performance gains:

| Field category | Metric | Base model | Fine-tuned model | Improvement |
|---|---|---|---|---|
| Employee information | Accuracy | 58% | 82.33% | 24.33% |
| | Precision | 57.05% | 82.33% | 25.28% |
| | Recall | 100% | 100% | 0% |
| | F1 score | 72.65% | 90.31% | 17.66% |
| Employer information | Accuracy | 58.67% | 92.67% | 34% |
| | Precision | 53.66% | 92.67% | 39.01% |
| | Recall | 100% | 100% | 0% |
| | F1 score | 69.84% | 96.19% | 26.35% |
| Earnings | Accuracy | 62.71% | 85.57% | 22.86% |
| | Precision | 60.97% | 85.57% | 24.60% |
| | Recall | 99.55% | 100% | 0.45% |
| | F1 score | 75.62% | 92.22% | 16.60% |
| Benefits | Accuracy | 45.50% | 60% | 14.50% |
| | Precision | 45.50% | 60% | 14.50% |
| | Recall | 93.81% | 100% | 6.19% |
| | F1 score | 61.28% | 75% | 13.72% |
| Multi-state employment | Accuracy | 58.29% | 94.19% | 35.90% |
| | Precision | 52.14% | 91.83% | 39.69% |
| | Recall | 99.42% | 100% | 0.58% |
| | F1 score | 68.41% | 95.74% | 27.33% |

The following graphic shows a bar chart comparing the F1 scores of the base model and fine-tuned model for each field category, with the improvement percentage shown in the previous table:

Key observations:

Substantial improvements across all categories, with the most significant gains in employer information and multi-state employment
Consistent 100% recall maintained or achieved in the fine-tuned model, indicating comprehensive field extraction
Notable precision improvements, particularly in categories that were challenging for the base model

Clean up
To avoid incurring unnecessary costs when you’re no longer using your custom model, it’s important to properly clean up the resources. Follow these steps to remove both the deployment and the custom model:

Delete the custom model deployment
Delete the custom model

Cost analysis
In our example, we used an Amazon Bedrock fine-tuning job, which performs PEFT, and on-demand inference (ODI) is available for the resulting model. PEFT fine-tuning of Nova Lite paired with on-demand inference offers a cost-effective and scalable solution for enhanced document processing. The cost structure is straightforward and transparent:
One-time cost:

Model training: $0.002 per 1,000 tokens × number of epochs

Ongoing costs:

Storage: $1.95 per month per custom model
On-demand Inference: Same per-token pricing as the base model

Example cost for one page from the dataset above: 1,895 input tokens / 1,000 × $0.00006 + 411 output tokens / 1,000 × $0.00024 ≈ $0.00021
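The same calculation expressed as a small helper, using the Nova Lite on-demand per-1,000-token prices quoted above and the token counts from this example:

def page_inference_cost(input_tokens, output_tokens,
                        input_price_per_1k=0.00006, output_price_per_1k=0.00024):
    # Per-page on-demand inference cost from per-1,000-token prices
    return input_tokens / 1000 * input_price_per_1k + output_tokens / 1000 * output_price_per_1k

print(round(page_inference_cost(1895, 411), 5))  # ~0.00021 USD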

On-demand inference allows you to run your custom Nova models without maintaining provisioned endpoints, enabling pay-as-you-go pricing based on actual token usage. This approach eliminates the need for capacity planning while ensuring cost-efficient scaling.
Conclusion
In this post, we’ve demonstrated how fine-tuning Amazon Nova Lite can transform document processing accuracy while maintaining cost efficiency. Our evaluation shows significant performance gains, with up to 39% improvement in precision for critical fields and perfect recall across key document categories. While our implementation did not require constrained decoding, tool calling with Nova can provide additional reliability for more complex structured outputs, especially when working with intricate JSON schemas. Please refer to the resource on structured output with tool calling for further information.
The flexible deployment options, including on-demand inference with pay-per-use pricing, eliminate infrastructure overhead while maintaining the same inference costs as the base model. With the dataset we used for this example, runtime inference per page cost was $0.00021, making it a cost-effective solution. Through practical examples and step-by-step guides, we’ve shown how to prepare training data, fine-tune models, and evaluate performance with clear metrics.
To get started with your own implementation, visit our GitHub repository for complete code samples and detailed documentation.

About the authors
Sharon Li is an AI/ML Specialist Solutions Architect at Amazon Web Services (AWS) based in Boston, Massachusetts. With a passion for leveraging cutting-edge technology, Sharon is at the forefront of developing and deploying innovative generative AI solutions on the AWS cloud platform.
Arlind Nocaj is a GTM Specialist Solutions Architect for AI/ML and Generative AI for europe central based in AWS Zurich Office, who guides enterprise customers through their digital transformation journeys. With a PhD in network analytics and visualization (Graph Drawing) and over a decade of experience as a research scientist and software engineer, he brings a unique blend of academic rigor and practical expertise to his role. His primary focus lies in using the full potential of data, algorithms, and cloud technologies to drive innovation and efficiency. His areas of expertise include Machine Learning, Generative AI and in particular Agentic systems with Multi-modal LLMs for document processing and structured insights.
Pat Reilly is a Sr. Specialist Solutions Architect on the Amazon Bedrock Go-to-Market team. Pat has spent the last 15 years in analytics and machine learning as a consultant. When he’s not building on AWS, you can find him fumbling around with wood projects.
Malte Reimann is a Solutions Architect based in Zurich, working with customers across Switzerland and Austria on their cloud initiatives. His focus lies in practical machine learning applications, from prompt optimization to fine-tuning vision language models for document processing. The most recent example is working in a small team to provide deployment options for Apertus on AWS. An active member of the ML community, Malte balances his technical work with a disciplined approach to fitness, preferring early morning gym sessions when the gym is empty. During summer weekends, he explores the Swiss Alps on foot and enjoys time in nature. His approach to both technology and life is straightforward: consistent improvement through deliberate practice, whether that's optimizing a customer's cloud deployment or preparing for the next hike in the clouds.

Building a Context-Folding LLM Agent for Long-Horizon Reasoning with M …

In this tutorial, we explore how to build a Context-Folding LLM Agent that efficiently solves long, complex tasks by intelligently managing limited context. We design the agent to break down a large task into smaller subtasks, perform reasoning or calculations when needed, and then fold each completed sub-trajectory into concise summaries. By doing this, we preserve essential knowledge while keeping the active memory small. Check out the FULL CODES here.

import os, re, sys, math, random, json, textwrap, subprocess, shutil, time
from typing import List, Dict, Tuple

# Install transformers (and friends) on the fly if it's missing, e.g. on a fresh Colab runtime
try:
    import transformers
except ImportError:
    subprocess.run([sys.executable, "-m", "pip", "install", "-q", "transformers", "accelerate", "sentencepiece"], check=True)

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline

MODEL_NAME = os.environ.get("CF_MODEL", "google/flan-t5-small")
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)
llm = pipeline("text2text-generation", model=model, tokenizer=tokenizer, device_map="auto")

def llm_gen(prompt: str, max_new_tokens=160, temperature=0.0) -> str:
    # Single entry point for all LLM calls; greedy decoding unless a temperature is given
    out = llm(prompt, max_new_tokens=max_new_tokens, do_sample=temperature > 0.0, temperature=temperature)[0]["generated_text"]
    return out.strip()

We begin by setting up our environment and loading a lightweight Hugging Face model. We use this model to generate and process text locally, ensuring the agent runs smoothly on Google Colab without any API dependencies. Check out the FULL CODES here.

import ast, operator as op

# Map AST operator nodes to safe arithmetic functions for the calculator tool
OPS = {ast.Add: op.add, ast.Sub: op.sub, ast.Mult: op.mul, ast.Div: op.truediv,
       ast.Pow: op.pow, ast.USub: op.neg, ast.FloorDiv: op.floordiv, ast.Mod: op.mod}

def _eval_node(n):
    if isinstance(n, ast.Num):
        return n.n
    if isinstance(n, ast.UnaryOp) and type(n.op) in OPS:
        return OPS[type(n.op)](_eval_node(n.operand))
    if isinstance(n, ast.BinOp) and type(n.op) in OPS:
        return OPS[type(n.op)](_eval_node(n.left), _eval_node(n.right))
    raise ValueError("Unsafe expression")

def calc(expr: str):
    # Safely evaluate a basic arithmetic expression
    node = ast.parse(expr, mode='eval').body
    return _eval_node(node)

class FoldingMemory:
    def __init__(self, max_chars: int = 800):
        self.active = []
        self.folds = []
        self.max_chars = max_chars

    def add(self, text: str):
        # Append to active memory; fold the oldest entries when over budget
        self.active.append(text.strip())
        while len(self.active_text()) > self.max_chars and len(self.active) > 1:
            popped = self.active.pop(0)
            fold = f"- Folded: {popped[:120]}..."
            self.folds.append(fold)

    def fold_in(self, summary: str):
        self.folds.append(summary.strip())

    def active_text(self) -> str:
        return "\n".join(self.active)

    def folded_text(self) -> str:
        return "\n".join(self.folds)

    def snapshot(self) -> Dict:
        return {"active_chars": len(self.active_text()), "n_folds": len(self.folds)}

We define a simple calculator tool for basic arithmetic and create a memory system that dynamically folds past context into concise summaries. This helps us maintain a manageable active memory while retaining essential information. Check out the FULL CODES here.

SUBTASK_DECOMP_PROMPT = """You are an expert planner. Decompose the task below into 2-4 crisp subtasks.
Return each subtask as a bullet starting with '- ' in priority order.
Task: "{task}" """

SUBTASK_SOLVER_PROMPT = """You are a precise problem solver with minimal steps.
If a calculation is needed, write one line 'CALC(expr)'.
Otherwise write 'ANSWER: <final>'.
Think briefly; avoid chit-chat.

Task: {task}
Subtask: {subtask}
Notes (folded context):
{notes}

Now respond with either CALC(...) or ANSWER: ..."""

SUBTASK_SUMMARY_PROMPT = """Summarize the subtask outcome in <=3 bullets, total <=50 tokens.
Subtask: {name}
Steps:
{trace}
Final: {final}
Return only bullets starting with '- '."""

FINAL_SYNTH_PROMPT = """You are a senior agent. Synthesize a final, coherent solution using ONLY:
- The original task
- Folded summaries (below)
Avoid repeating steps. Be concise and actionable.

Task: {task}
Folded summaries:
{folds}

Final answer:"""

def parse_bullets(text: str) -> List[str]:
    # Extract lines that start with '- ' and strip the bullet prefix
    return [ln[2:].strip() for ln in text.splitlines() if ln.strip().startswith("- ")]

We design prompt templates that guide the agent in decomposing tasks, solving subtasks, and summarizing outcomes. These structured prompts enable clear communication between reasoning steps and the model’s responses. Check out the FULL CODES here.

def run_subtask(task: str, subtask: str, memory: FoldingMemory, max_tool_iters: int = 3) -> Tuple[str, str, List[str]]:
    notes = memory.folded_text() or "(none)"
    trace = []
    final = ""
    for _ in range(max_tool_iters):
        prompt = SUBTASK_SOLVER_PROMPT.format(task=task, subtask=subtask, notes=notes)
        out = llm_gen(prompt, max_new_tokens=96)
        trace.append(out)
        # If the model requested a calculation, run the calculator tool and ask for a final answer
        m = re.search(r"CALC\((.+?)\)", out)
        if m:
            try:
                val = calc(m.group(1))
                trace.append(f"TOOL:CALC -> {val}")
                out2 = llm_gen(prompt + f"\nTool result: {val}\nNow produce 'ANSWER: ...' only.", max_new_tokens=64)
                trace.append(out2)
                if out2.strip().startswith("ANSWER:"):
                    final = out2.split("ANSWER:", 1)[1].strip()
                    break
            except Exception as e:
                trace.append(f"TOOL:CALC ERROR -> {e}")
        if out.strip().startswith("ANSWER:"):
            final = out.split("ANSWER:", 1)[1].strip()
            break
    if not final:
        final = "No definitive answer; partial reasoning:\n" + "\n".join(trace[-2:])
    # Fold the subtask trajectory into a compact summary
    summ = llm_gen(SUBTASK_SUMMARY_PROMPT.format(name=subtask, trace="\n".join(trace), final=final), max_new_tokens=80)
    summary_bullets = "\n".join(parse_bullets(summ)[:3]) or f"- {subtask}: {final[:60]}..."
    return final, summary_bullets, trace

class ContextFoldingAgent:
    def __init__(self, max_active_chars: int = 800):
        self.memory = FoldingMemory(max_chars=max_active_chars)
        self.metrics = {"subtasks": 0, "tool_calls": 0, "chars_saved_est": 0}

    def decompose(self, task: str) -> List[str]:
        plan = llm_gen(SUBTASK_DECOMP_PROMPT.format(task=task), max_new_tokens=96)
        subs = parse_bullets(plan)
        return subs[:4] if subs else ["Main solution"]

    def run(self, task: str) -> Dict:
        t0 = time.time()
        self.memory.add(f"TASK: {task}")
        subtasks = self.decompose(task)
        self.metrics["subtasks"] = len(subtasks)
        folded = []
        for st in subtasks:
            self.memory.add(f"SUBTASK: {st}")
            final, fold_summary, trace = run_subtask(task, st, self.memory)
            self.memory.fold_in(fold_summary)
            folded.append(f"- {st}: {final}")
            self.memory.add(f"SUBTASK_DONE: {st}")
        final = llm_gen(FINAL_SYNTH_PROMPT.format(task=task, folds=self.memory.folded_text()), max_new_tokens=200)
        t1 = time.time()
        return {"task": task,
                "final": final.strip(),
                "folded_summaries": self.memory.folded_text(),
                "active_context_chars": len(self.memory.active_text()),
                "subtask_finals": folded,
                "runtime_sec": round(t1 - t0, 2)}

We implement the agent’s core logic, in which each subtask is executed, summarized, and folded back into memory. This step demonstrates how context folding enables the agent to reason iteratively without losing track of prior reasoning. Check out the FULL CODES here.

DEMO_TASKS=[
    "Plan a 3-day study schedule for ML with daily workouts and simple meals; include time blocks.",
    "Compute a small project budget with 3 items (laptop 799.99, course 149.5, snacks 23.75), add 8% tax and 5% buffer, and present a one-paragraph recommendation.",
]

def pretty(d): return json.dumps(d, indent=2, ensure_ascii=False)

if __name__=="__main__":
    agent=ContextFoldingAgent(max_active_chars=700)
    for i,task in enumerate(DEMO_TASKS,1):
        print("="*70)
        print(f"DEMO #{i}: {task}")
        res=agent.run(task)
        print("\n--- Folded Summaries ---\n"+(res["folded_summaries"] or "(none)"))
        print("\n--- Final Answer ---\n"+res["final"])
        print("\n--- Diagnostics ---")
        diag={k:res[k] for k in ["active_context_chars","runtime_sec"]}
        diag["n_subtasks"]=len(agent.decompose(task))
        print(pretty(diag))

We run the agent on sample tasks to observe how it plans, executes, and synthesizes final results. Through these examples, we see the complete context-folding process in action, producing concise and coherent outputs.

In conclusion, we demonstrate how context folding enables long-horizon reasoning while avoiding memory overload. We see how each subtask is planned, executed, summarized, and distilled into compact knowledge, mimicking how an intelligent agent would handle complex workflows over time. By combining decomposition, tool use, and context compression, we create a lightweight yet powerful agentic system that scales reasoning efficiently.

Check out the FULL CODES here and the Paper.
The post Building a Context-Folding LLM Agent for Long-Horizon Reasoning with Memory Compression and Tool Use appeared first on MarkTechPost.

Anthropic Launches Claude Haiku 4.5: Small AI Model that Delivers Sonnet-4-Level Coding Performance at One-Third the Cost and more than Twice the Speed

Anthropic released Claude Haiku 4.5, a latency-optimized “small” model that delivers similar levels of coding performance to Claude Sonnet 4 while running more than twice as fast at one-third the cost. The model is immediately available via Anthropic’s API and in partner catalogs on Amazon Bedrock and Google Cloud Vertex AI. Pricing is $1/MTok input and $5/MTok output. Anthropic positions Haiku 4.5 as a drop-in replacement for Haiku 3.5 and Sonnet 4 in cost-sensitive, interactive workloads.

Positioning and lineup

Haiku 4.5 targets real-time assistants, customer-support automations, and pair-programming where tight latency budgets and throughput dominate. It surpasses Sonnet 4 on “computer use” tasks—the GUI/browser manipulation underpinning products like Claude for Chrome—and is described as materially improving responsiveness in Claude Code for multi-agent projects and rapid prototyping. Anthropic makes clear that Sonnet 4.5 remains the frontier model and “the best coding model in the world,” while Haiku 4.5 offers near-frontier performance with greater cost-efficiency. A recommended pattern is Sonnet 4.5 for multi-step planning and parallel execution by a pool of Haiku 4.5 workers.

Availability, identifiers, and pricing

From day one, developers can call the model (claude-haiku-4-5) on Anthropic’s API. Anthropic also states availability on Amazon Bedrock and Vertex AI; model catalogs may update region coverage and IDs over time, but the company confirms cloud availability in the launch post. The API price for Haiku 4.5 is $1/MTok (input) and $5/MTok (output), with prompt-caching listed at $1.25/MTok write and $0.10/MTok read.
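
The model can be called with Anthropic's Python SDK using the launch identifier. The snippet below is a minimal sketch: the prompt and token limit are placeholders, and the client reads the API key from the ANTHROPIC_API_KEY environment variable.

import anthropic

# Minimal sketch: prompt and max_tokens are placeholders.
client = anthropic.Anthropic()  # uses the ANTHROPIC_API_KEY environment variable

message = client.messages.create(
    model="claude-haiku-4-5",
    max_tokens=512,
    messages=[{"role": "user", "content": "Summarize this stack trace and suggest a fix: ..."}],
)
print(message.content[0].text)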

Benchmarks

Anthropic summarizes results across standard and agentic suites and includes methodology details to qualify the numbers:

SWE-bench Verified: simple scaffold with two tools (bash, file edits), 73.3% averaged over 50 trials, no test-time compute, 128K thinking budget, default sampling. Includes a minor prompt addendum encouraging extensive tool use and writing tests first.

Terminal-Bench: Terminus-2 agent, average over 11 runs (6 without thinking, 5 with 32K thinking budget).

OSWorld-Verified: 100 max steps, averaged across 4 runs with a 128K total thinking budget and 2K per-step configuration.

AIME / MMMLU: averages over multiple runs using default sampling and 128K thinking budgets.

https://www.anthropic.com/news/claude-haiku-4-5

The post emphasizes coding parity with Sonnet 4 and computer-use gains relative to Sonnet 4 under these scaffolds. Users should replicate with their own orchestration, tool stacks, and thinking budgets before generalizing.

Key Takeaways

Haiku 4.5 delivers Sonnet-4-level coding performance at one-third the cost and more than twice the speed.

It surpasses Sonnet 4 on computer-use tasks, improving responsiveness in Claude for Chrome and multi-agent flows in Claude Code.

Recommended orchestration: use Sonnet 4.5 for multi-step planning and parallelize execution with multiple Haiku 4.5 workers.

Pricing is $1/$5 per million input/output tokens; available via Claude API, Amazon Bedrock, and Google Cloud Vertex AI.

Released under ASL-2 with a lower measured misalignment rate than Sonnet 4.5 and Opus 4.1 in Anthropic’s tests.

Editorial Comments

Anthropic’s positioning of Claude Haiku 4.5 is strategically sound: by delivering similar levels of coding performance to Claude Sonnet 4 at one-third the cost and more than twice the speed, while surpassing Sonnet 4 on computer use, the company gives devs a clean planner–executor split—Sonnet 4.5 for multi-step planning and a pool of Haiku 4.5 workers for parallel execution—without forcing architectural changes (“drop-in replacement” across API, Amazon Bedrock, Vertex AI). The ASL-2 release, coupled with a documented lower misalignment rate than Sonnet 4.5 and Opus 4.1, lowers the friction for enterprise rollout where safety gates and cost envelopes dominate deployment math.

Check out the Technical details, system card, model page, and documentation.
The post Anthropic Launches Claude Haiku 4.5: Small AI Model that Delivers Sonnet-4-Level Coding Performance at One-Third the Cost and more than Twice the Speed appeared first on MarkTechPost.

Meta AI’s ‘Early Experience’ Trains Language Agents without Rewards—and Outperforms Imitation Learning

How would your agent stack change if a policy could train purely from its own outcome-grounded rollouts—no rewards, no demos—yet beat imitation learning across eight benchmarks? Meta Superintelligence Labs proposes ‘Early Experience’, a reward-free training approach that improves policy learning in language agents without large human demonstration sets and without reinforcement learning (RL) in the main loop. The core idea is simple: let the agent branch from expert states, take its own actions, collect the resulting future states, and convert those consequences into supervision. The research team instantiates this with two concrete strategies—Implicit World Modeling (IWM) and Self-Reflection (SR)—and reports consistent gains across eight environments and multiple base models.

https://arxiv.org/pdf/2510.08558

What Early Experience changes

Traditional pipelines lean on imitation learning (IL) over expert trajectories, which is cheap to optimize but hard to scale and brittle out-of-distribution; reinforcement learning (RL) promises learning from experience but needs verifiable rewards and stable infrastructure—often missing in web and multi-tool settings. Early Experience sits between them: it is reward-free like imitation learning (IL), but the supervision is grounded in consequences of the agent’s own actions, not just expert actions. In short, the agent proposes, acts, and learns from what actually happens next—no reward function required.

Implicit World Modeling (IWM): Train the model to predict the next observation given the state and chosen action, tightening the agent’s internal model of environment dynamics and reducing off-policy drift.

Self-Reflection (SR): Present expert and alternative actions at the same state; have the model explain why the expert action is better using the observed outcomes, then fine-tune the policy from this contrastive signal.

Both strategies use the same budgets and decoding settings as IL; only the data source differs (agent-generated branches rather than more expert trajectories).

Understanding the Benchmarks

The research team evaluates on eight language-agent environments spanning web navigation, long-horizon planning, scientific/embodied tasks, and multi-domain API workflows—e.g., WebShop (transactional browsing), TravelPlanner (constraint-rich planning), ScienceWorld, ALFWorld, Tau-Bench, and others. Early Experience yields average absolute gains of +9.6 in success rate and +9.4 in out-of-domain (OOD) performance over IL across the full matrix of tasks and models. These gains persist when the same checkpoints are used to initialize RL (GRPO), improving post-RL ceilings by up to +6.4 compared to RL started from IL.

Efficiency: less expert data, same optimization budget

A key practical win is demo efficiency. With a fixed optimization budget, Early Experience matches or beats IL using a fraction of expert data. On WebShop, 1/8 of the demonstrations with Early Experience already exceeds IL trained on the full demo set; on ALFWorld, parity is hit at 1/2 the demos. The advantage grows with more demonstrations, indicating the agent-generated future states provide supervision signals that demonstrations alone do not capture.

How the data is built

The pipeline seeds from a limited set of expert rollouts to obtain representative states. At selected states, the agent proposes alternative actions, executes them, and records the next observations.

For IWM, the training data are triplets ⟨state, action, next-state⟩ and the objective is next-state prediction.

For SR, the prompts include the expert action and several alternatives plus their observed outcomes; the model produces a grounded rationale explaining why the expert action is preferable, and this supervision is then used to improve the policy.
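
To make the two data formats concrete, here is a small illustrative sketch (not the paper's code) of how IWM triplets and SR prompts could be assembled from a single agent branch; the Branch structure and its field names are assumptions introduced for illustration.

from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Branch:
    state: str                   # observation at an expert-visited state
    expert_action: str
    expert_outcome: str          # next observation after the expert action
    alt_actions: List[str]       # alternative actions proposed by the agent
    alt_outcomes: List[str]      # observed next states after executing the alternatives

def build_iwm_examples(branch: Branch) -> List[Dict[str, str]]:
    # Implicit World Modeling: predict the next observation given (state, action).
    actions = [branch.expert_action] + branch.alt_actions
    outcomes = [branch.expert_outcome] + branch.alt_outcomes
    return [{"input": f"State: {branch.state}\nAction: {a}", "target": o}
            for a, o in zip(actions, outcomes)]

def build_sr_prompt(branch: Branch) -> str:
    # Self-Reflection: ask the model to justify the expert action using observed outcomes.
    alts = "\n".join(f"- {a} -> {o}" for a, o in zip(branch.alt_actions, branch.alt_outcomes))
    return (f"State: {branch.state}\n"
            f"Expert action: {branch.expert_action} -> {branch.expert_outcome}\n"
            f"Alternatives and their observed outcomes:\n{alts}\n"
            "Explain, grounded in these outcomes, why the expert action is preferable.")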

Where reinforcement learning (RL) fits

Early Experience is not “RL without rewards.” It is a supervised recipe that uses agent-experienced outcomes as labels. In environments with verifiable rewards, the research team simply add RL after Early Experience. Because the initialization is better than IL, the same RL schedule climbs higher and faster, with up to +6.4 final success over IL-initialized RL across tested domains. This positions Early Experience as a bridge: reward-free pre-training from consequences, followed (where possible) by standard reinforcement learning (RL).

Key Takeaways

Reward-free training via agent-generated future states (not rewards) using Implicit World Modeling and Self-Reflection outperforms imitation learning across eight environments.

Reported absolute gains over IL: +18.4 (WebShop), +15.0 (TravelPlanner), +13.3 (ScienceWorld) under matched budgets and settings.

Demo efficiency: exceeds IL on WebShop with 1/8 of demonstrations; reaches ALFWorld parity with 1/2—at fixed optimization cost.

As an initializer, Early Experience boosts subsequent RL (GRPO) endpoints by up to +6.4 versus RL started from IL.

Validated on multiple backbone families (3B–8B) with consistent in-domain and out-of-domain improvements; positioned as a bridge between imitation learning (IL) and reinforcement learning (RL).

Editorial Comments

Early Experience is a pragmatic contribution: it replaces brittle rationale-only augmentation with outcome-grounded supervision that an agent can generate at scale, without reward functions. The two variants—Implicit World Modeling (next-observation prediction to anchor environment dynamics) and Self-Reflection (contrastive, outcome-verified rationales against expert actions)—directly attack off-policy drift and long-horizon error accumulation, explaining the consistent gains over imitation learning across eight environments and the stronger RL ceilings when used as an initializer for GRPO. In web and tool-use settings where verifiable rewards are scarce, this reward-free supervision is the missing middle between IL and RL and is immediately actionable for production agent stacks.

Check out the PAPER here.
The post Meta AI’s ‘Early Experience’ Trains Language Agents without Rewards—and Outperforms Imitation Learning appeared first on MarkTechPost.

Transforming enterprise operations: Four high-impact use cases with Amazon Nova

Since the launch of Amazon Nova at AWS re:Invent 2024, we have seen adoption trends across industries, with notable gains in operational efficiency, compliance, and customer satisfaction. With its capabilities in secure, multimodal AI and domain customization, Nova is enhancing workflows and enabling cost efficiencies across core use cases.
In this post, we share four high-impact, widely adopted use cases built with Nova in Amazon Bedrock, supported by real-world customer deployments, offerings available from AWS Partners, and firsthand experiences. These examples are ideal for organizations researching their own AI adoption strategies and use cases across industries.
Customer service
Traditional chatbots often frustrate users with scripted, inflexible responses that fail to understand context or intent. For enterprises, these are missed opportunities to resolve issues quickly, lower support costs, and drive customer loyalty. AI-powered applications can understand natural language, adapt to individual customer needs, and integrate with backend systems in real time. Organizations are transforming support from a cost center into a strategic driver of satisfaction and retention. These are often high-volume and interactive scenarios, so the balance of cost, speed, and intelligence is critical.
Customer service applications built with Nova in Amazon Bedrock can seamlessly integrate with business data stored with AWS, and offer the security, privacy, and reliability for production use in enterprise environments.
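
As a minimal illustration of this pattern, the sketch below sends a support query to Nova Micro through the Amazon Bedrock Converse API. The model identifier and prompt are placeholders to verify for your Region; a production assistant would also wire in business data and guardrails.

import boto3

# Minimal sketch: model ID and prompt are placeholders.
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

response = bedrock.converse(
    modelId="amazon.nova-micro-v1:0",   # assumed Nova Micro identifier; confirm for your Region
    messages=[{
        "role": "user",
        "content": [{"text": "My order #1234 hasn't arrived. What are my options?"}],
    }],
    inferenceConfig={"maxTokens": 300, "temperature": 0.2},
)
print(response["output"]["message"]["content"][0]["text"])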

Infosys, a leading global IT services and consulting organization, developed Infosys Event AI for real-time transcription, multilingual translation, and intelligent summarization of live event content. Infosys Event AI is built with Amazon Nova Pro in Amazon Bedrock. During a recent event in Bangalore, the AI assistant handled around 230 users per minute and was queried an average of 57 times per minute, generating more than 9,000 session summaries. This solution enhanced knowledge retention, engagement, and inclusivity by making event insights instantly accessible in multiple languages and formats for hearing-impaired and remote participants. By transforming event content into a persistent, searchable multilingual knowledge asset, Infosys Event AI accelerates learning and collaboration.
Fortinet, an AWS Partner and cybersecurity company, uses Amazon Nova Micro to power its AI support assistant, delivering significant performance improvements at a fraction of the cost. By switching to Nova Micro in Amazon Bedrock, Fortinet achieved an 85 times reduction in inference costs, dramatically lowering TCO while maintaining rapid response times. The assistant now helps users quickly navigate complex documentation across more than 60 products, improving support efficiency and elevating customer satisfaction.
Amazon Customer Service uses Nova with its AI-driven issue resolution system. The system is a two-step approach combining intent detection and issue resolution. Amazon Customer Service customized Nova Micro, resulting in 76.9% accuracy for in-domain issues and 69.2% in generalization testing, surpassing current baselines by 5.4% and 7.3%, respectively. Additionally, Nova Lite is used for tool selection, achieving 86.1% accuracy and 4.8% improvement over existing systems.
AWS Summit New York City 2025 was attended by 18,000 participants, featuring the AI assistant Diana for customer service developed with Nova Sonic. By dialing a phone number, the Sonic-powered voice assistant answered hundreds of queries about the event, including session details, location, and FAQs.

Search
Large enterprises face slow, siloed, and inefficient search across vast stores of structured and unstructured data, costing time, productivity, and customer responsiveness. By adopting AI-powered, multimodal search that understands natural language and enforces secure access, organizations can deliver instant, relevant answers from documents, images, and technical files. This accelerates decision-making, shortens deal cycles, improves customer satisfaction, and reduces the cost of knowledge discovery at scale. Search applications increasingly rely on a mix of information across modalities, including text, documents, images, and video.
Nova is among the fastest and most cost-effective multimodal models, offering vision fine-tuning capabilities. Nova also integrates with other Amazon models, such as Amazon Titan Multimodal Embeddings, and with data services such as Amazon OpenSearch Service for more robust search capabilities and performance.
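
For example, a multimodal search index can be populated by embedding product text and images and storing the vectors in OpenSearch Service. The sketch below assumes the Titan Multimodal Embeddings G1 model ID and request shape; verify both against the current Amazon Bedrock documentation before use.

import base64, json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

with open("product-photo.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

resp = bedrock.invoke_model(
    modelId="amazon.titan-embed-image-v1",          # assumed model identifier
    body=json.dumps({"inputText": "red trail running shoe", "inputImage": image_b64}),
)
embedding = json.loads(resp["body"].read())["embedding"]
# Index `embedding` into an OpenSearch k-NN field alongside product metadata.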

Siemens faced growing performance bottlenecks as its massive datasets strained traditional search systems, slowing retrieval speeds and impacting productivity across its global operations. To address this, Siemens integrated Amazon Nova, achieving a threefold boost in search performance that dramatically accelerated data retrieval and improved workflow efficiency. Amazon Nova delivers high-speed, scalable search capabilities, and Siemens’s implementation facilitates seamless integration with existing systems, maintaining business continuity with minimal disruption. This enhanced user experience and positioned Siemens to handle future data growth with ease, supported by continuous performance monitoring and tight infrastructure alignment.
CBRE Global Pulse System (GPS)—built with Amazon Nova Pro in Amazon Bedrock and OpenSearch Service—transforms property search across thousands of users worldwide. Built in partnership with AWS Professional Services and GenAI Specialists, GPS replaces slow, fragmented legacy systems with an AI-driven, multimodal search platform capable of handling complex queries, massive PDFs, and strict permission controls. Key results include 75% faster document ingestion, 70% lower database latency, 87% faster keyword searches, and 51% faster natural language queries. When fully deployed to over 6,000 users later in 2025, GPS is projected to save over 98,000 employee workdays annually, unlocking $320,000 ARR and significant operational efficiency. By shifting from Anthropic’s Claude Sonnet to Nova Pro and Anthropic’s Claude Haiku 3, CBRE also cut AI inference costs by 3.5 times and 12 times, respectively, without sacrificing accuracy.

Video understanding and analysis
Organizations are adopting video understanding applications to drive business value across multiple fronts, including customer behavior analysis, traffic patterns, and manufacturing quality control. Security and safety benefits are realized through real-time threat detection and workplace safety monitoring, and customer experience is enhanced through personalized content recommendations and improved content searchability. Organizations gain competitive advantage through data-driven decision-making and innovation in service delivery, while reducing costs by minimizing manual review processes and decreasing security incidents. This comprehensive approach to video analysis helps companies extract insights from their video data, ultimately leading to improved operations, better decision-making, and enhanced customer experiences. As developers build, iterate, and evolve these applications, there is a growing demand to natively understand video as opposed to dealing with the overhead of frames, time stamps, and synchronization.
Amazon Nova models can analyze, classify, and summarize information in the video based on provided instructions. Applications built with Nova understanding models in Amazon Bedrock offer comprehensive analysis of multiple video formats through flexible input methods, with the ability to analyze, classify, and summarize video content while handling files up to 1 GB through Amazon Simple Storage Service (Amazon S3) integration.
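
As a rough sketch of this workflow, the example below asks Nova Pro to classify and summarize a clip stored in Amazon S3 via the Converse API. The S3 URI and model ID are placeholders, and the video content-block shape is an assumption to confirm against the Bedrock documentation for Nova.

import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

response = bedrock.converse(
    modelId="amazon.nova-pro-v1:0",   # assumed Nova Pro identifier
    messages=[{
        "role": "user",
        "content": [
            {"video": {"format": "mp4",
                       "source": {"s3Location": {"uri": "s3://my-bucket/clips/incident-001.mp4"}}}},
            {"text": "Classify this clip as routine or a safety incident and summarize it in two sentences."},
        ],
    }],
)
print(response["output"]["message"]["content"][0]["text"])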

Bitcentral partnered with Caylent to transform how archived content is discovered, accessed, and reused. Using Nova Pro in Amazon Bedrock, Caylent deployed a solution that aligned with the needs of journalists, producers, and broadcasters across more than 1,600 client sites. By embedding semantic video search, contextual metadata generation, and AI-powered content analysis into its workflows, Bitcentral redefined how archived footage is indexed, discovered, and reused. Journalists and producers can now surface high-value content in real time and unlock new revenue streams.
Loka, an AWS Premier Partner, built a video surveillance offering to automatically identify and classify millions of visual events in video footage. This system effectively distinguishes between routine events and critical incidents, helping filter out non-essential activities and alerts. The solution proved highly successful, reducing irrelevant alerts by 55% while maintaining a threat detection rate above 97%. By implementing this automated filtering system, Loka doubled video monitoring efficiency for their client. The tool, built on Amazon Bedrock using Amazon Nova Pro, significantly reduced the workload for human operators while improving overall threat detection capabilities.
Accenture Spotlight can analyze long-form videos and automatically generate personalized short-form clips and highlights, which are particularly useful for sports content like soccer, Formula 1, and rugby. Spotlight is capable of matching content to specific audience demographics and can process real-time CCTV footage in retail settings to create personalized offers. The system is built with Amazon Nova in Amazon Bedrock and operates through three specialized super agents working under a central orchestrator. Spotlight can process videos in minutes rather than the traditional hours or days, while achieving cost savings that are 10 times better than conventional methods. The solution is versatile enough to be used across different industries, from media and entertainment to retail, while maintaining high quality standards and brand alignment through its human-in-the-loop quality assurance option.

Creative content generation
Organizations are seeking ways to revolutionize creative content generation, including stock imagery, marketing campaign assets, and product visualizations. This work is often slowed by fragmented workflows, high production costs, and the need to continuously balance scale with personalization. Marketing teams struggle to keep up with the demand for fresh, high-quality assets across multiple channels, while creative fatigue and long lead times limit their agility.
Amazon Nova addresses these challenges with Nova Canvas and Nova Reel: high-quality creative models that transform text and image inputs into professional-grade images and videos. Nova creative models are designed to deliver customizable visual content with control features, making creative content generation accessible and efficient for media, entertainment, retail, marketing, and advertising industries.
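
As an illustration, the sketch below generates a single campaign image with Nova Canvas through the Bedrock InvokeModel API. The request schema mirrors the documented text-to-image task but should be treated as an assumption to verify, as should the model identifier.

import base64, json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

resp = bedrock.invoke_model(
    modelId="amazon.nova-canvas-v1:0",   # assumed Nova Canvas identifier
    body=json.dumps({
        "taskType": "TEXT_IMAGE",
        "textToImageParams": {"text": "A cozy autumn coffee-shop scene for a seasonal latte campaign"},
        "imageGenerationConfig": {"numberOfImages": 1, "width": 1024, "height": 1024},
    }),
)
image_b64 = json.loads(resp["body"].read())["images"][0]
with open("campaign.png", "wb") as f:
    f.write(base64.b64decode(image_b64))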

Dentsu is reimagining how ads come to life with Amazon Nova creative generation models. What used to take weeks of brainstorming, filming, and editing now happens in days. Their creative teams can sketch out an idea in plain language and watch it turn into polished videos and custom images, ready for markets across the globe in over 200 languages. Built-in safeguards like moderation, watermarking, and IP indemnity mean every piece stays brand safe. For Co-op, Dentsu went a step further—pairing Nova with Amazon Ads to design custom audience profiles that delivered a +4-point lift in brand preference among 25–34-year-olds and a +5-point lift in favorability among affluent shoppers.
Quantiphi, an AWS Premier Global Consulting Partner, developed Qreator, a generative AI-powered marketing content creation service built on AWS. Their service helps marketers create content through natural language prompts while maintaining brand consistency and cross-channel adaptability. With Qreator, businesses can achieve an approximate 30% reduction in content creation time and get to market approximately 40% faster, automating what was a manual process and improving consistency across formats and channels.
The Fragrance Lab is a unique AWS activation that was showcased at the Cannes Lions International Festival of Creativity. It demonstrates how to build personalized products and campaign assets using Amazon Nova foundation models in Amazon Bedrock. Although our activation at Cannes Lions focused on personalized fragrance development and ad campaign creation, the underlying architecture and methodology can be adapted across diverse categories, such as fashion, food, and beverage, opening endless possibilities for customized customer experiences. The Fragrance Lab activation won two International Business Awards: Gold for Exhibition Event Experience and Silver for Experiential Event.

Conclusion
The four use cases presented in this post demonstrate the utility of Amazon Nova across industries and applications. From Infosys’s Event AI improving accessibility and engagement, to CBRE’s revolutionary property search system, to Loka’s intelligent video surveillance, and Dentsu’s creative content generation, each implementation showcases significant, measurable improvements in efficiency, cost reduction, and customer satisfaction.
Organizations using Amazon Nova are achieving tangible business outcomes through evidence-based adoption strategies. By partnering with Amazon and AWS Partners, organizations are accelerating their AI transformation while maintaining strong foundations in security, compliance, and privacy-by-design principles.
To get started building with Nova, visit the Amazon Nova user guide or the AWS console.

About the Authors
Abhinav Bhargava is a Sr Product Marketing Manager at AWS on the Amazon Nova team, where he focuses on scaling generative AI adoption through customer-centric solutions. With a background in design and sustainability, he brings a unique perspective to connecting technology and creativity to drive enterprise innovation. Based in Seattle, Abhinav enjoys playing volleyball, traveling, and learning about new cultures.
Raechel Frick is a Sr Product Marketing Manager at AWS. With over 20 years of experience in the tech industry, she brings a customer-first approach and growth mindset to building integrated marketing programs. Based in the greater Seattle area, Raechel balances her professional life with being a soccer mom and after-school carpool manager, demonstrating her ability to excel both in the corporate world and family life.

Building smarter AI agents: AgentCore long-term memory deep dive

Building AI agents that remember user interactions requires more than just storing raw conversations. While Amazon Bedrock AgentCore short-term memory captures immediate context, the real challenge lies in transforming these interactions into persistent, actionable knowledge that spans across sessions. This is the information that transforms fleeting interactions into meaningful, continuous relationships between users and AI agents. In this post, we’re pulling back the curtain on how the Amazon Bedrock AgentCore Memory long-term memory system works.
If you’re new to AgentCore Memory, we recommend reading our introductory blog post first: Amazon Bedrock AgentCore Memory: Building context-aware agents. In brief, AgentCore Memory is a fully managed service that enables developers to build context-aware AI agents by providing both short-term working memory and long-term intelligent memory capabilities.
The challenge of persistent memory
When humans interact, we don’t just remember exact conversations—we extract meaning, identify patterns, and build understanding over time. Teaching AI agents to do the same requires solving several complex challenges:

Agent memory systems must distinguish between meaningful insights and routine chatter, determining which utterances deserve long-term storage versus temporary processing. A user saying “I’m vegetarian” should be remembered, but “hmm, let me think” should not.
Memory systems need to recognize related information across time and merge it without creating duplicates or contradictions. When a user mentions they’re allergic to shellfish in January and says they “can’t eat shrimp” in March, these statements need to be recognized as related facts and consolidated with existing knowledge.
Memories must be processed in order of temporal context. Preferences that change over time (for example, the user loved spicy chicken in a restaurant last year, but today, they prefer mild flavors) require careful handling to make sure the most recent preference is respected while maintaining historical context.
As memory stores grow to contain thousands or millions of records, finding relevant memories quickly becomes a significant challenge. The system must balance comprehensive memory retention with efficient retrieval.

Solving these problems requires sophisticated extraction, consolidation, and retrieval mechanisms that go beyond simple storage. Amazon Bedrock AgentCore Memory tackles these complexities by implementing a research-backed long-term memory pipeline that mirrors human cognitive processes while maintaining the precision and scale required for enterprise applications.
How AgentCore long-term memory works
When the agentic application sends conversational events to AgentCore Memory, it initiates a pipeline to transform raw conversational data into structured, searchable knowledge through a multi-stage process. Let’s explore each component of this system. 
1. Memory extraction: From conversation to insights
When new events are stored in short-term memory, an asynchronous extraction process analyzes the conversational content to identify meaningful information. This process leverages large language models (LLMs) to understand context and extract relevant details that should be preserved in long-term memory. The extraction engine processes incoming messages alongside prior context to generate memory records in a predefined schema. As a developer, you can configure one or more Memory strategies to extract only the information types relevant to your application needs. The extraction process supports three built-in memory strategies:

Semantic memory: Extracts facts and knowledge. Example:

“The customer’s company has 500 employees across Seattle, Austin, and Boston”

User preferences: Captures explicit and implicit preferences given context. Example:

{"preference": "Prefers Python for development work", "categories": ["programming", "code-style"], "context": "User wants to write a student enrollment website"}

Summary memory: Creates running narratives of conversations under different topics scoped to sessions and preserves the key information in a structured XML format. Example:

<topic="Material-UI TextareaAutosize inputRef Warning Fix Implementation"> A developer successfully implemented a fix for the issue in Material-UI where the TextareaAutosize component gives a "Does not recognize the 'inputRef' prop" warning when provided to OutlinedInput through the 'inputComponent' prop. </topic>

For each strategy, the system processes events with timestamps for maintaining the continuity of context and conflict resolution. Multiple memories can be extracted from a single event, and each memory strategy operates independently, allowing parallel processing.
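
As a rough sketch of configuring these strategies, the example below uses the AgentCore Memory SDK to create a memory resource with all three built-in strategies. The client class, method name, strategy keys, and namespace templates follow publicly available AgentCore samples and are assumptions to verify against the current SDK.

from bedrock_agentcore.memory import MemoryClient   # assumed import path

client = MemoryClient(region_name="us-east-1")

# Hypothetical configuration: strategy keys and namespace templates are assumptions.
memory = client.create_memory_and_wait(
    name="SupportAgentMemory",
    strategies=[
        {"semanticMemoryStrategy": {"name": "facts", "namespaces": ["support/{actorId}/facts"]}},
        {"userPreferenceMemoryStrategy": {"name": "preferences", "namespaces": ["support/{actorId}/preferences"]}},
        {"summaryMemoryStrategy": {"name": "session-summaries", "namespaces": ["support/{actorId}/{sessionId}"]}},
    ],
)
print(memory["id"])
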
2. Memory consolidation
Rather than simply adding new memories to existing storage, the system performs intelligent consolidation to merge related information, resolve conflicts, and minimize redundancies. This consolidation makes sure the agent’s memory remains coherent and up to date as new information arrives.
The consolidation process works as follows:

Retrieval: For each newly extracted memory, the system retrieves the most semantically similar existing memories from the same namespace and strategy.
Intelligent processing: The new memory and retrieved memories are sent to the LLM with a consolidation prompt. The prompt preserves the semantic context, thus avoiding unnecessary updates (for example, “loves pizza” and “likes pizza” are considered essentially the same information). Preserving these core principles, the prompt is designed to handle various scenarios:

You are an expert in managing data. Your job is to manage memory store. 
Whenever a new input is given, your job is to decide which operation to perform.

Here is the new input text.
TEXT: {query}

Here is the relevant and existing memories
MEMORY: {memory}

You can call multiple tools to manage the memory stores…
Based on this prompt, the LLM determines the appropriate action:

ADD: When the new information is distinct from existing memories
UPDATE: Enhance existing memories when the new knowledge complements or updates the existing memories
NO-OP: When the information is redundant

Vector store updates: The system applies the determined actions, maintaining an immutable audit trail by marking the outdated memories as INVALID instead of instantly deleting them.

This approach makes sure that contradictory information is resolved (prioritizing recent information), duplicates are minimized, and related memories are appropriately merged.
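
To make the ADD, UPDATE, and NO-OP flow concrete, here is a minimal illustrative sketch (not AgentCore's internal implementation) of applying a consolidation decision to a memory store while keeping superseded records as an audit trail.

from datetime import datetime, timezone

def apply_consolidation(decision: dict, store: list) -> None:
    # Apply an ADD / UPDATE / NO-OP decision; superseded records are marked INVALID, not deleted.
    now = datetime.now(timezone.utc).isoformat()
    if decision["action"] == "ADD":
        store.append({"text": decision["text"], "status": "ACTIVE", "updated_at": now})
    elif decision["action"] == "UPDATE":
        store[decision["target"]]["status"] = "INVALID"   # keep the audit trail
        store.append({"text": decision["text"], "status": "ACTIVE", "updated_at": now})
    # NO-OP: the new information is redundant, so nothing is written

# Example: the budget memory is superseded by more recent information
store = [{"text": "Customer budget is $500", "status": "ACTIVE", "updated_at": "2025-01-10T00:00:00+00:00"}]
apply_consolidation({"action": "UPDATE", "target": 0, "text": "Customer budget increased to $750"}, store)
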
Handling edge cases
The consolidation process gracefully handles several challenging scenarios:

Out-of-order events: Although the system processes events in temporal order within sessions, it can handle late-arriving events through careful timestamp tracking and consolidation logic.
Conflicting information: When new information contradicts existing memories, the system prioritizes recency while maintaining a record of previous states:

Existing: “Customer budget is $500”
New: “Customer mentioned budget increased to $750”
Result: New active memory with $750, previous memory marked inactive

Memory failures: If consolidation fails for one memory, it doesn’t impact others. The system uses exponential backoff and retry mechanisms to handle transient failures. If consolidation ultimately fails, the memory is added to the system to help prevent potential loss of information.

Advanced custom memory strategy configurations
While built-in memory strategies cover common use cases, AgentCore Memory recognizes that different domains require tailored approaches for memory extraction and consolidation. The system supports built-in strategies with overrides for custom prompts that extend the built-in extraction and consolidation logic, letting teams adapt memory handling to their specific requirements. To maintain system compatibility and focus on criteria and logic rather than output formats, custom prompts help developers customize what information gets extracted or filtered out, how memories should be consolidated, and how to resolve conflicts between contradictory information.
AgentCore Memory also supports custom model selection for memory extraction and consolidation. This flexibility helps developers balance accuracy and latency based on their specific needs. You can define these models via the APIs when you create the memory_resource as a strategy override, or via the console.

Apart from override functionality, we also offer self-managed strategies that provide complete control over your memory processing pipeline. With self-managed strategies, you can implement custom extraction and consolidation algorithms using any models or prompts while leveraging AgentCore Memory for storage and retrieval. Also, using the Batch APIs, you can directly ingest extracted records into AgentCore Memory while maintaining full ownership of the processing logic.
Performance characteristics
We evaluated our built-in memory strategy across three public benchmarking datasets to assess different aspects of long-term conversational memory:

LoCoMo: Multi-session conversations generated through a machine-human pipeline with persona-based interactions and temporal event graphs. Tests long-term memory capabilities across realistic conversation patterns.
LongMemEval: Evaluates memory retention in long conversations across multiple sessions and extended time periods. We randomly sampled 200 QA pairs for evaluation efficiency.
PrefEval: Tests preference memory across 20 topics using 21-session instances to evaluate the system’s ability to remember and consistently apply user preferences over time.
PolyBench-QA: A question-answering dataset containing 807 Question Answer (QA) pairs across 80 trajectories, collected from a coding agent solving tasks in PolyBench.

We use two standard metrics: correctness and compression rate. LLM-based correctness evaluates whether the system can correctly recall and use stored information when needed. Compression rate measures how much smaller the stored memory is than the full conversation context, computed as 1 − (output memory token count / full context token count); for example, folding a 10,000-token history into 500 tokens of memory yields a 95% compression rate. Higher compression rates indicate the system maintains essential information while reducing storage overhead. This compression directly translates to faster inference speeds and lower token consumption, the most critical consideration for deploying agents at scale, because it enables more efficient processing of large conversational histories and reduces operational costs.

Memory Type | Dataset | Correctness | Compression Rate
RAG baseline (full conversation history) | LoCoMo | 77.73% | 0%
RAG baseline (full conversation history) | LongMemEval-S | 75.2% | 0%
RAG baseline (full conversation history) | PrefEval | 51% | 0%
Semantic Memory | LoCoMo | 70.58% | 89%
Semantic Memory | LongMemEval-S | 73.60% | 94%
Preference Memory | PrefEval | 79% | 68%
Summarization | PolyBench-QA | 83.02% | 95%

The retrieval-augmented-generation (RAG) baseline performs well on factual QA tasks due to complete conversation history access, but struggles with preference inference. The memory system achieves strong practical trade-offs: although information compression leads to slightly lower correctness on some factual tasks, it provides 89-95% compression rates for scalable deployment and maintains bounded context sizes, and each memory strategy performs effectively in its specialized use case.
For more complex tasks requiring inference (understanding user preferences or behavioral patterns), memory demonstrates clear advantages in both performance accuracy and storage efficiency—the extracted insights are more valuable than raw conversational data for these use cases.
Beyond accuracy metrics, AgentCore Memory delivers the performance characteristics necessary for production deployment.

Extraction and consolidation operations complete within 20-40 seconds for standard conversations after the extraction is triggered.
Semantic search retrieval (retrieve_memory_records API) returns results in approximately 200 milliseconds.
Parallel processing architecture enables multiple memory strategies to process independently; thus, different memory types can be processed simultaneously without blocking each other.

These latency characteristics, combined with the high compression rates, enable the system to maintain responsive user experiences while managing extensive conversational histories efficiently across large-scale deployments.
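
As an example of the retrieval path mentioned above, the sketch below issues a semantic query against a namespace. The memory ID and namespace are placeholders, and the method name and parameters follow AgentCore Memory SDK samples; treat them as assumptions to verify.

from bedrock_agentcore.memory import MemoryClient   # assumed import path

client = MemoryClient(region_name="us-east-1")

# Hypothetical call: memory ID, namespace, and parameter names are placeholders.
records = client.retrieve_memories(
    memory_id="YOUR_MEMORY_ID",
    namespace="support/user-123/preferences",
    query="Which programming language does the user prefer?",
)
for record in records:
    print(record)
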
Best practices for long-term memory
To maximize the effectiveness of long-term memory in your agents:

Choose the right memory strategies: Select built-in strategies that align with your use case or create custom strategies for domain-specific needs. Semantic memory captures factual knowledge, preference memory captures individual preferences, and summarization memory distills complex information for better context management. For example, a customer support agent might use semantic memory to capture customer transaction history and past issues, while summarization memory creates short narratives of current support conversations and troubleshooting workflows across different topics.
Design meaningful namespaces: Structure your namespaces to reflect your application’s hierarchy. This also enables precise memory isolation and efficient retrieval. For example, use customer-support/user/john-doe for individual agent memories and customer-support/shared/product-knowledge for team-wide information.
Monitor consolidation patterns: Regularly review what memories are being created (using list_memories or retrieve_memory_records API), updated, or skipped. This helps refine your extraction strategies and helps the system capture relevant information that’s better fitted to your use case.
Plan for async processing: Remember that long-term memory extraction is asynchronous. Design your application to handle the delay between event ingestion and memory availability. Consider using short-term memory for immediate retrieval needs while long-term memories are being processed and consolidated in the background. You might also want to implement fallback mechanisms or loading states to manage user expectations during processing delays.

Conclusion
The Amazon Bedrock AgentCore Memory long-term memory system represents a significant advancement in building AI agents. By combining sophisticated extraction algorithms, intelligent consolidation processes, and immutable storage designs, it provides a robust foundation for agents that learn, adapt, and improve over time.
The science behind this system, from research-backed prompts to innovative consolidation workflow, makes sure that your agents don’t just remember, but understand. This transforms one-time interactions into continuous learning experiences, creating AI agents that become more helpful and personalized with every conversation.
Resources: AgentCore Memory Docs, AgentCore Memory code samples, Getting started with AgentCore, and the Workshop.

About the authors
Akarsha Sehwag is a Generative AI Data Scientist for Amazon Bedrock AgentCore GTM team. With over six years of expertise in AI/ML, she has built production-ready enterprise solutions across diverse customer segments in Generative AI, Deep Learning and Computer Vision domains. Outside of work, she likes to hike, bike or play Badminton.
Jiarong Jiang is a Principal Applied Scientist at AWS, driving innovations in Retrieval-Augmented Generation (RAG) and agent memory systems to improve the accuracy and intelligence of enterprise AI. She’s passionate about enabling customers to build context-aware, reasoning-driven applications that leverage their own data effectively.
Jay Lopez-Braus is a Senior Technical Product Manager at AWS. He has over ten years of product management experience. In his free time, he enjoys all things outdoors.
Dani Mitchell is a Generative AI Specialist Solutions Architect at Amazon Web Services (AWS). He is focused on helping accelerate enterprises across the world on their generative AI journeys with Amazon Bedrock and Bedrock AgentCore.
Peng Shi is a Senior Applied Scientist at AWS, where he leads advancements in agent memory systems to enhance the accuracy, adaptability, and reasoning capabilities of AI. His work focuses on creating more intelligent and context-aware applications that bridge cutting-edge research with real-world impact.

Configure and verify a distributed training cluster with AWS Deep Learning Containers

Training state-of-the-art large language models (LLMs) demands massive, distributed compute infrastructure. Meta’s Llama 3, for instance, ran on 16,000 NVIDIA H100 GPUs for over 30.84 million GPU hours. Amazon Elastic Kubernetes Service (Amazon EKS) is a managed service that simplifies the deployment, management, and scaling of Kubernetes clusters that can scale up to the ranges needed to train LLMs. To facilitate the configuration of such large, distributed workloads, AWS Deep Learning Containers (DLCs) provide pre-built, performance-optimized images for popular frameworks like PyTorch, so teams can launch jobs faster and with fewer compatibility issues. However, even with Amazon EKS and DLCs, configuring clusters for large training workloads is not a trivial task.
A source of complexity for the configuration of the training cluster is the configuration of the GPUs in the GPU-powered instances used in distributed training. GPU-powered Amazon Elastic Compute Cloud (Amazon EC2) instances come in two families: the G family (for example, G6 with NVIDIA L4 Tensor Core GPUs) for cost-efficient inference and lighter training, and the P family (for example, P6 with NVIDIA GB200 NVL72) for massive, distributed jobs. A single P5 has 8 H100 GPUs with 640 GB HBM3 and delivers 3,200 Gbps EFA networking, ideal for multi-billion-parameter model training. Although G instances are more affordable, they lack the high-bandwidth, low-latency fabric, and memory throughput needed for extreme scale. P instances, though fast, require precise configuration of networking, storage, and GPU topologies, making them powerful but operationally complex and a potential source of misconfigurations or errors for the distributed job.
Misconfiguration issues in distributed training with Amazon EKS can be prevented following a systematic approach to launch required components and verify their proper configuration. This post walks through the steps to set up and verify an EKS cluster for training large models using DLCs.
Solution overview
The solution consists of the following high-level steps:

Build a Docker image with the required dependencies using a PyTorch Framework DLC.
Launch the required infrastructure in a stable, GPU-ready cluster with Amazon EKS.
Install task-specific plugins required for GPU access, Elastic Fabric Adapter (EFA) support, distributed training frameworks, and persistent file storage.
Run health checks to verify node readiness and the correct configuration of NVIDIA and EFA plugins.
Launch a small training job to verify the whole system.

We walk through these steps using a fleet of two p4d.24xlarge instances that we are consuming from a capacity reservation. The scripts used in this post are available in GitHub. Similar scripts for other GPU-powered instances are available in the following GitHub repository. The overall component setup, including worker nodes with persistent storage, plugins, and drivers, is shown in the following diagram.

Prerequisites
To deploy this solution, you need to have these prerequisites:

An AWS account with billing enabled
Sufficient service quotas for on-demand G instances, or access to a capacity reservation
Hugging Face token with access to Meta Llama 2 7B

Build Docker image from AWS DLC
DLCs are pre-built, performance-optimized Docker images that make it straightforward to run popular frameworks like PyTorch and TensorFlow on AWS. Each DLC ships with a fully integrated stack that includes compatible versions of CUDA, cuDNN, and NCCL, plus optional EFA support for high-throughput, low-latency distributed training. These containers are validated across Amazon EC2, Amazon Elastic Container Service (Amazon ECS), and Amazon EKS, providing consistent performance on G- and P-family GPU instances. This uniform environment is critical for distributed workloads, where even minor version mismatches can trigger throughput degradation, stalled all-reduce operations, or CUDA/NCCL errors. Although it’s possible to build training containers from scratch, doing so at production scale is tedious: GPU drivers, CUDA, NCCL, and networking libraries must be aligned with strict version and hardware requirements. DLCs simplify this by providing secure, regularly updated images that are already optimized for AWS infrastructure.
Most distributed training jobs need additional libraries, launch utilities, or orchestration scripts that the base DLCs don’t include. As a result, teams typically use DLCs as a foundation and extend them with the dependencies required for their workloads. This approach preserves the reliability of AWS optimized images while providing the flexibility to customize for large-scale training.
In this post, we show the process of building a custom Docker container by adding custom Python libraries to the PyTorch 2.7.1 Training DLC to launch a training job with Meta Llama 2 7B. For more details, refer to AWS Deep Learning Containers for PyTorch 2.7 Training on EC2, ECS and EKS. To prevent mismatches with the NVIDIA drivers and CUDA versions, we recommend using an EC2 instance powered by a Deep Learning AMI (DLAMI) to build the image. The DLAMI is used only for building a container image used by the training job referenced in this post. It’s different from an Amazon EKS optimized AMI, which is used to run worker nodes in an EKS cluster to run that training job.
Complete the following steps to build a Docker image:

Launch an EC2 instance using the “Deep Learning Base OSS Nvidia Driver GPU AMI (Ubuntu 24.04)” for 64-bit (x86) architecture. Use at least a c5.4xlarge instance or larger, and enable HTTP/HTTPS traffic from the internet.

Allocate at least 100 GiB for storage.

Connect to the EC2 instance using an SSH client and your private key for authentication.
Clone the GitHub repository to access the scripts for this post:

git clone https://github.com/aws-samples/sample-aws-deep-learning-containers.git
cd training/eks

Install the AWS CLI, kubectl, and eksctl to manage the training clusters from the command line of the EC2 instance:

source ./setup_ec2.sh

Run the following script to authenticate into the DLC registry, build the custom image with the dependencies specified in the Dockerfile, and push the custom image to a private repository:

bash ./build.sh

Launch EKS cluster
In this step, we use a YAML file to launch an EKS cluster that contains the required infrastructure for the distributed training job. We launch two managed node groups in an existing virtual private cloud (VPC) and subnets:

A system node group (c5.2xlarge) for running cluster system pods and auto scaling components
A GPU node group (p4d.24xlarge) with EFA enabled networking and RAID0 local storage, designed for distributed training

The script also installs several Amazon EKS add-ons (for example, an EBS CSI driver, Amazon CloudWatch observability, or a node monitoring agent) for storage provisioning and cluster observability.

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: eks-p4d
  region: PLACEHOLDER_AWS_REGION
  version: "1.33"

# List availability zones where cluster subnets will be created
availabilityZones:
  - PLACEHOLDER_AZ1
  - PLACEHOLDER_AZ2

# Substitute vpc and subnet ids below
# if you want a VPC to be created, comment out vpc related lines
vpc:
  id: PLACEHOLDER_VPC_ID
  subnets:
    private:
      private-one:
        id: PLACEHOLDER_SUBNET_PRIVATE_1
      private-two:
        id: PLACEHOLDER_SUBNET_PRIVATE_2
    public:
      public-one:
        id: PLACEHOLDER_SUBNET_PUBLIC_1
      public-two:
        id: PLACEHOLDER_SUBNET_PUBLIC_2

iam:
  withOIDC: true

# EKS-managed node group(s)
managedNodeGroups:
  # Nodegroup for system pods
  - name: sys
    instanceType: c5.2xlarge
    desiredCapacity: 1
    iam:
      withAddonPolicies:
        autoScaler: true
        cloudWatch: true
    nodeRepairConfig:
      enabled: true

  # GPU nodegroup
  # List availability zones where instances from this nodegroup will be launched
  # Update capacityReservationID with your own if you have a capacity reservation
  # Update desiredCapacity to the number of instances you want to launch
  - name: p4d
    instanceType: p4d.24xlarge
    instancePrefix: p4d
    privateNetworking: true
    efaEnabled: true
    minSize: 0
    desiredCapacity: 2
    maxSize: 4
    volumeSize: 500
    # if you have a Capacity Reservation, the AZ has to be the same
    # if you don't have a CR, nodes will be assigned per availability
    availabilityZones: ["PLACEHOLDER_AZ"]
    capacityReservation:
      capacityReservationTarget:
        capacityReservationID: "cr-xxxxxxxxxx"
    # Utilize the local instance store volume(s)
    overrideBootstrapCommand: |
      apiVersion: node.eks.aws/v1alpha1
      kind: NodeConfig
      spec:
        instance:
          localStorage:
            strategy: RAID0
    iam:
      withAddonPolicies:
        autoScaler: true
        cloudWatch: true
        ebs: true
        fsx: true
    nodeRepairConfig:
      enabled: true

addons:
  # vpc-cni, coredns, and kube-proxy addons are installed by default by EKS
  # we set up additional drivers as addons, including storage plugins
  - name: aws-ebs-csi-driver
    wellKnownPolicies: # add IAM and service account
      ebsCSIController: true
  - name: aws-fsx-csi-driver
    attachPolicyARNs:
      - arn:aws:iam::aws:policy/AmazonFSxFullAccess
  - name: eks-node-monitoring-agent
    resolveConflicts: overwrite
  - name: amazon-cloudwatch-observability
    resolveConflicts: overwrite
    attachPolicyARNs:
      - arn:aws:iam::aws:policy/CloudWatchFullAccess

Other sample configurations for training clusters are available in the GitHub repo:

eks-g4dn-vpc.yaml – G4dn with EFA
eks-p4de-odcr.yaml – P4de with capacity reservation
eks-p5-odcr.yaml – P5 with capacity reservation

You can modify the chosen YAML file with your AWS Region, Kubernetes version, VPC and subnets, and optional capacity reservation details. Managed node groups are recommended because they handle node lifecycle, software, and cluster integration automatically, reducing operational overhead compared to self-managed nodes.
After the YAML file has been updated, launch your cluster:

eksctl create cluster -f ./eks-p4d-odcr.yaml

Provisioning takes 15–30 minutes. You can verify the status of your nodes with the following command:

kubectl get nodes

With a successful deployment, you should see all nodes in Ready status.
Use the following command to see all pods created by installed add-ons in Running status:

kubectl get pods -A

Install training-specific plugins
After you set up a basic EKS cluster, you must install additional plugins to enable critical functionalities for distributed training workloads. These plugins make sure GPUs, high-speed networking, distributed training frameworks, and persistent storage are available and correctly integrated into the cluster:

NVIDIA GPU plugin – The NVIDIA device plugin exposes GPU resources to Kubernetes, enabling pods to request and use GPUs
EFA plugin – The EFA device plugin provides high-performance networking for EFA enabled instances (for example P4 and P5), which is essential for multi-node training
Distributed training plugins – These plugins include services like etcd—for rendezvous in PyTorch—and the Kubeflow Training Operator (with the MPI Operator) to enable large-scale job orchestration
Persistent file storage – The FSx CSI driver and EBS CSI driver enable scalable, high-throughput storage for datasets, model checkpoints, monitoring, and logs in Amazon FSx for Lustre and Amazon Elastic Block Store (Amazon EBS), respectively

By enabling these plugins, the cluster becomes production-ready for large-scale training workloads.
Install the NVIDIA device plugin
Because we’re using an Amazon EKS optimized AMI with GPU support, the NVIDIA device plugin is already included. Verify that the plugin pods are running with the following command:

kubectl get pods -n kube-system | grep nvidia

The expected output is as follows:

nvidia-device-plugin-daemonset-xxxxx 1/1 Running 0 3m48s
nvidia-device-plugin-daemonset-yyyyy 1/1 Running 0 3m48s

If the plugin is missing, install it manually with the following command:

kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.17.3/deployments/static/nvidia-device-plugin.yml

Verify the availability of GPUs in your nodes with the following command:

kubectl get nodes -o json | jq '.items[].status.capacity."nvidia.com/gpu"'

The expected output for nodes with 8 GPUs is as follows:

"8"
"8"

Install the EFA plugin
If you are using EFA enabled instances (such as P4d, P4de, or P5), verify that EFA resources are advertised:

kubectl get nodes -o=custom-columns=NAME:.metadata.name,EFA:.status.allocatable.vpc\.amazonaws\.com/efa

The expected values will depend on your instance type:

P4d or p4de: 4
P5: 32

If EFA is not visible, use the following command to install the plugin:

kubectl apply -f https://raw.githubusercontent.com/aws-samples/aws-do-eks/main/Container-Root/eks/deployment/efa-device-plugin/efa-k8s-device-plugin.yaml

Install distributed training plugins: etcd and Kubeflow Training Operator
In distributed PyTorch training workloads on Kubernetes, etcd serves as the coordination mechanism that enables worker orchestration. This key-value store acts as a central meeting point where training workers perform three critical functions: register their presence in the cluster, discover their peer workers, and achieve synchronized startup across the distributed training job. This coordination pattern is particularly valuable when running large-scale machine learning (ML) workloads on Amazon EKS to enable efficient distributed training.
Create an etcd store with the following command:

kubectl apply -f etcd.yaml

Verify its deployment:

kubectl get pods

The output should look like the following code:

NAME READY STATUS RESTARTS AGE
etcd-xxxxx-xxx 1/1 Running 0 10s

The Kubeflow Training Operator simplifies distributed PyTorch training on Amazon EKS by providing custom resources (such as PyTorchJob) that automate the complex orchestration of multi-node training deployments, including worker pod lifecycle management and fault handling. By using the built-in MPI Operator, it enables efficient inter-node communication patterns critical for distributed deep learning workloads, handling the intricacies of MPI process placement, rank assignment, and network configuration that would otherwise require significant manual setup and expertise.
Deploy Kubeflow Training Operator:

kubectl apply --server-side -k "github.com/kubeflow/training-operator.git/manifests/overlays/standalone?ref=v1.9.3"

Kubeflow Training Operator (v1) is the legacy predecessor of Kubeflow Trainer (v2), which is currently in alpha status; its APIs may change.
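
As described above, the operator exposes a PyTorchJob custom resource. The following sketch shows how such a job could be submitted from Python with the kubernetes client instead of kubectl; the image URI, replica count, and resource values are placeholders, and it assumes the Training Operator CRDs are already installed.

# Sketch: submit a minimal PyTorchJob via the CustomObjectsApi (values are illustrative)
from kubernetes import client, config

config.load_kube_config()
pytorch_job = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "PyTorchJob",
    "metadata": {"name": "fsdp-demo"},
    "spec": {
        "pytorchReplicaSpecs": {
            "Worker": {
                "replicas": 2,
                "template": {
                    "spec": {
                        "containers": [{
                            "name": "pytorch",
                            "image": "<your-dlc-image>",                      # placeholder DLC image URI
                            "resources": {"limits": {"nvidia.com/gpu": 8}},   # request all GPUs per node
                        }]
                    }
                },
            }
        }
    },
}
client.CustomObjectsApi().create_namespaced_custom_object(
    group="kubeflow.org", version="v1", namespace="default",
    plural="pytorchjobs", body=pytorch_job,
)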
Install storage plugins: FSx for Lustre and Amazon EBS
For latency-sensitive and high-bandwidth throughput dynamic workloads, such as distributed training and model serving across multiple GPU compute instances, we recommend FSx for Lustre. It provides a fully managed, high-performance parallel file system that is designed for compute-intensive workloads like high-performance computing (HPC) and ML.
We installed the FSx for Lustre file system CSI driver using the Amazon EKS add-on while creating the cluster to mount FSx for Lustre file systems on Amazon EKS as a persistent volume (PV). In this step, you deploy an FSx for Lustre file system as a standalone high-performance cache or as an Amazon Simple Storage Service (Amazon S3) linked file system to act as a high-performance cache for Amazon S3 data, providing fast I/O and high throughput for data access across your GPU compute instances.
Create the FSx for Lustre file system with the following command:

bash ./fsx_create.sh

Create a PVC object to allow Kubernetes pods to claim storage on the FSx for Lustre file system:

kubectl apply -f ./fsx-pvc-static.yaml

In FSx for Lustre, throughput scales with storage type and provisioned capacity. Optimize your deployment based on your dataset size and checkpointing needs.
The EBS CSI driver gives Amazon EKS the ability to dynamically create and attach block volumes (using Amazon EBS) to pods. When creating node groups, EBS root volumes can be preconfigured (size, type: gp2/gp3/io1/io2). We have already installed the EBS CSI driver through the EKS cluster setup. Verify that the instance role includes the policy arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy, because without it, EBS PVC provisioning will fail.
In summary, by layering these plugins on top of a baseline EKS cluster, you can unlock the following:

GPUs for compute
High-performance networking
Orchestration for distributed training
Persistent storage

Together, these plugins create an environment capable of supporting large-scale, fault-tolerant, high-performance deep learning workloads on Amazon EKS.
Verify plugins for distributed training
When you first launch a distributed GPU training cluster on Amazon EKS (with AWS DLCs), it’s critical to validate that the environment is healthy before starting large-scale jobs. This prevents wasted time and cost due to misconfigurations or hardware issues. The checks discussed in this section cover the most important areas.
GPU driver and NVIDIA-SMI validation
Each GPU node must have a valid driver installation that matches the CUDA version in your AWS DLC. You can verify this either by running a script inside a GPU-enabled pod or by connecting with AWS Systems Manager.
Regardless of the option you chose, confirm the following as part of your validation:

The driver version matches the CUDA version in your DLC
The GPU model, temperature, and utilization look correct
No errors are reported

Option 1: Run inside a GPU-enabled debug pod
The NVIDIA System Management Interface (nvidia-smi) is a command line utility intended to aid in the management and monitoring of NVIDIA GPU devices. This utility makes it possible for administrators to query GPU device state.
Apply an nvidia-smi job manifest using the following code:

kubectl apply -f nvidia_smi.yaml
kubectl logs nvidia-smi

Option 2: Connect directly using Systems Manager
Find the instance ID of your node:

aws ec2 describe-instances \
  --filters "Name=tag:eks:nodegroup-name,Values=eks-p4d" \
  --query "Reservations[].Instances[].InstanceId" \
  --output text

Start a Systems Manager session:

aws ssm start-session --target <instance-id>

Run the nvidia-smi check to query the state of your GPUs:

nvidia-smi

NCCL and multi-node communication
Distributed training depends on fast GPU-to-GPU communication, often using the NVIDIA Collective Communications Library (NCCL).
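
Before looking at the cluster-level benchmark, it can help to see what such a collective exercise looks like at the framework level. The following minimal sketch is not part of the nccl-tests manifest; it simply runs one all-reduce over the NCCL backend and assumes it is launched with torchrun inside a GPU pod built from a DLC.

# Sketch: minimal NCCL sanity check, launched with torchrun (e.g., torchrun --nproc_per_node=8 nccl_check.py)
import os
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")        # NCCL backend for GPU collectives
    local_rank = int(os.environ["LOCAL_RANK"])     # set by torchrun
    torch.cuda.set_device(local_rank)
    x = torch.ones(1024 * 1024, device="cuda")     # one tensor per rank
    dist.all_reduce(x, op=dist.ReduceOp.SUM)       # sum across all ranks
    if dist.get_rank() == 0:
        print("all_reduce ok, value =", x[0].item())   # should equal the world size
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
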
Deploy NCCL tests with the following script:

kubectl apply -f ./nccl-tests.yaml

Verify that the NCCL worker pods are up and running:

kubectl get pods | grep nccl

The results should look like the following code:

nccl-tests-launcher 1/1 Running 0 12s
nccl-tests-worker-0 1/1 Running 0 13s
nccl-tests-worker-1 1/1 Running 0 12s

Validate the following:

All-reduce and communication operations complete without errors
Bandwidth and latency values are within expected ranges
If using EFA, confirm that NCCL is using AWS_OFI_NCCL as the transport layer (optimal for HPC networking)

Validate training environment with sample workload
Finally, validate that your framework (PyTorch), GPUs, and networking all integrate properly by running a small training workload. In this case, we demonstrate this by running supervised fine-tuning on a Meta Llama 2 model.

Get a Hugging Face token. Llama 2 7B is a gated model, so you must request access to the model and then pass your Hugging Face token to the FSDP script. To register and obtain a token, see User access tokens. Then insert the token into your conf file.
Run the validation script to load the environment variables and generate a job YAML manifest from the template:

bash ./fsdp.sh

Start a PyTorch distributed job:

kubectl apply -f ./fsdp.yaml

The expected output is as follows:

pytorchjob.kubeflow.org/fsdp created

Check that the worker pods have been created:

kubectl get pods | grep fsdp

The output should show both FSDP worker pods as Running:

fsdp-worker-0 1/1 Running 0 7m11s
fsdp-worker-1 1/1 Running 0 7m11s

Inspect the job:

kubectl describe -f ./fsdp.yaml

You should see pod events like those in the following screenshot.

After the pod is created, review the logs for errors or failures:

kubectl logs -f fsdp-worker-0

When the job is complete, the pods should move to a Completed state:

fsdp-worker-0 0/1 Completed 0 9m32s
fsdp-worker-1 0/1 Completed 0 9m32s

If the job starts properly, you can stop the job with the following commands:

kubectl delete -f ./fsdp.yaml
kubectl delete -f ./etcd.yaml

Both the worker pods and the etcd pod must be deleted and recreated before launching a new job; otherwise, you might encounter a RendezvousClosedError.
These initial health checks help validate the following:

The cluster and nodes are ready
GPUs are installed, visible, and healthy
Multi-node communication is optimized
The AWS DLC environment can run ML workloads

After these checks pass, you can scale up to large-scale distributed training jobs.
Clean up
Delete the cluster using the following command when it’s no longer needed to prevent incurring cost:

eksctl delete cluster -f ./eks-p4d-odcr.yaml

Conclusion
Distributed training requires an infrastructure foundation that delivers both computing power and predictability. When you integrate the Amazon EKS optimized AMI together with AWS DLCs, the result is a GPU-enabled cluster offering a consistent, validated runtime environment that spans all nodes. The implementation of high-bandwidth, low-latency networking capabilities enhanced with EFA helps distributed workloads execute at maximum efficiency. The addition of GPU plugins, coupled with storage integration and distributed training frameworks, creates a streamlined approach to scaling and orchestration. The final step of executing targeted initial health checks, which include NCCL connectivity testing, confirms the cluster is fully prepared for long-duration training operations. After these components are properly configured, teams can redirect their energy from infrastructure maintenance to achieving breakthrough advances in model performance.
For scripts for running FSDP distributed training on Amazon EKS, refer to the following GitHub repo. For distributed training reference architectures and tests, refer to the following GitHub repo. For a list of available DLC images, refer to the following GitHub repo. For an alternative implementation for running ML training and inference on Amazon EKS using a JARK stack, refer to Deploy Generative AI Models on Amazon EKS.

About the authors
Meryem Ozcelik is a GenAI/ML Specialist Solution Architect at Amazon Web Services. Her work focuses on designing and implementing generative AI and machine learning solutions, specializing in Amazon Bedrock, SageMaker, and AI/ML workload optimization on AWS. She helps accelerate AI adoption through architectural guidance, best practices, and scalable ML infrastructure design. Meryem holds a Master’s Degree in Computer Science from Georgia Institute of Technology.
Pratik Yeole is a solutions architect specializing in container services at AWS. He helps customers adopt modern cloud-native architectures and best practices. He is a tenured Amazonian with expertise in containers and AI/ML. For leisure, he plays cricket and chess, and enjoys game nights, hikes, and restaurants with family and friends.
Felipe Lopez is a Senior AI/ML Specialist Solutions Architect at AWS. Prior to joining AWS, Felipe worked with GE Digital and SLB, where he focused on modeling and optimization products for industrial applications.
Jinyan Li is a Software Development Engineer at Amazon Web Services. Her work focuses on building and improving containerized environments for machine learning workloads on AWS. She holds a Master’s degree in Computer Science from Northeastern University.
Sirut “G” Buasai is a Software Development Engineer at Amazon Web Services, working within the SageMaker AI organization. He specializes in optimizing deep learning containers and developing cloud-native solutions for machine learning workloads. His expertise includes container optimization, Kubernetes development, and ML model performance benchmarking.

Alibaba’s Qwen AI Releases Compact Dense Qwen3-VL 4B/8B (Instruct &a …

Do you actually need a giant VLM when dense Qwen3-VL 4B/8B (Instruct/Thinking) with FP8 runs in low VRAM yet retains 256K→1M context and the full capability surface? Alibaba’s Qwen team has expanded its multimodal lineup with dense Qwen3-VL models at 4B and 8B scales, each shipping in two task profiles—Instruct and Thinking—plus FP8-quantized checkpoints for low-VRAM deployment. The drop arrives as a smaller, edge-friendly complement to the previously released 30B (MoE) and 235B (MoE) tiers and keeps the same capability surface: image/video understanding, OCR, spatial grounding, and GUI/agent control.

https://github.com/QwenLM/Qwen3-VL/tree/main

What’s in the release?

SKUs and variants: The new additions comprise four dense models—Qwen3-VL-4B and Qwen3-VL-8B, each in Instruct and Thinking editions—alongside FP8 versions of the 4B/8B Instruct and Thinking checkpoints. The official announcement explicitly frames these as “compact, dense” models with lower VRAM usage and full Qwen3-VL capabilities retained.

Context length and capability surface: The model cards list native 256K context with expandability to 1M, and document the full feature set: long-document and video comprehension, 32-language OCR, 2D/3D spatial grounding, visual coding, and agentic GUI control on desktop and mobile. These attributes carry over to the new 4B/8B SKUs.

Architecture notes: Qwen3-VL highlights three core updates: Interleaved-MRoPE for robust positional encoding over time/width/height (long-horizon video), DeepStack for fusing multi-level ViT features and sharpening image–text alignment, and Text–Timestamp Alignment beyond T-RoPE for event localization in video. These design details appear in the new cards as well, signaling architectural continuity across sizes.

Project timeline: The Qwen3-VL GitHub “News” section records the publication of Qwen3-VL-4B (Instruct/Thinking) and Qwen3-VL-8B (Instruct/Thinking) on Oct 15, 2025, following earlier releases of the 30B MoE tier and organization-wide FP8 availability.

FP8: deployment-relevant details

Numerics and parity claim: The FP8 repositories state fine-grained FP8 quantization with block size 128, with performance metrics nearly identical to the original BF16 checkpoints. For teams evaluating precision trade-offs on multimodal stacks (vision encoders, cross-modal fusion, long-context attention), having vendor-produced FP8 weights reduces re-quantization and re-validation burden.

Tooling status: The 4B-Instruct-FP8 card notes that Transformers does not yet load these FP8 weights directly, and recommends vLLM or SGLang for serving; the card includes working launch snippets. Separately, the vLLM recipes guide recommends FP8 checkpoints for H100 memory efficiency. Together, these point to immediate, supported paths for low-VRAM inference.
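
For orientation, the following sketch shows what offline inference through vLLM's Python API could look like for one of these checkpoints. The model id, context length, and sampling settings are illustrative assumptions, not values taken from the model card, and a recent vLLM build with Qwen3-VL support is assumed.

# Sketch: offline generation with vLLM against an FP8 checkpoint (model id is illustrative)
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-VL-8B-Instruct-FP8", max_model_len=32768)   # hypothetical repo name
params = SamplingParams(temperature=0.2, max_tokens=256)
outputs = llm.generate(["Summarize what 32-language OCR support means for document workflows."], params)
print(outputs[0].outputs[0].text)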

Key Takeaways

Qwen released dense Qwen3-VL 4B and 8B models, each in Instruct and Thinking variants, with FP8 checkpoints.

FP8 uses fine-grained FP8 (block size 128) with near-BF16 metrics; Transformers loading is not yet supported—use vLLM/SGLang.

Capability surface is preserved: 256K→1M context, 32-language OCR, spatial grounding, video reasoning, and GUI/agent control.

Model Card-reported sizes: Qwen3-VL-4B ≈ 4.83B params; Qwen3-VL-8B-Instruct ≈ 8.77B params.

Editorial Comments

Qwen’s decision to ship dense Qwen3-VL 4B/8B in both Instruct and Thinking forms with FP8 checkpoints is the practical part of the story: lower-VRAM, deployment-ready weights (fine-grained FP8, block size 128) and explicit serving guidance (vLLM/SGLang) make these models easy to deploy. The capability surface—256K context expandable to 1M, 32-language OCR, spatial grounding, video understanding, and agent control—remains intact at these smaller scales, which matters more than leaderboard rhetoric for teams targeting single-GPU or edge budgets.

Check out the Model on Hugging Face and GitHub Repo.
The post Alibaba’s Qwen AI Releases Compact Dense Qwen3-VL 4B/8B (Instruct & Thinking) With FP8 Checkpoints appeared first on MarkTechPost.

Andrej Karpathy Releases ‘nanochat’: A Minimal, End-to-End ChatGPT …

Andrej Karpathy has open-sourced nanochat, a compact, dependency-light codebase that implements a full ChatGPT-style stack—from tokenizer training to web UI inference—aimed at reproducible, hackable LLM training on a single multi-GPU node.

The repo provides a single-script “speedrun” that executes the full loop: tokenization, base pretraining, mid-training on chat/multiple-choice/tool-use data, Supervised Finetuning (SFT), optional RL on GSM8K, evaluation, and serving (CLI + ChatGPT-like web UI). The recommended setup is an 8×H100 node; at ~$24/hour, the 4-hour speedrun lands near $100. A post-run report.md summarizes metrics (CORE, ARC-E/C, MMLU, GSM8K, HumanEval, ChatCORE).

Tokenizer and data path

Tokenizer: custom Rust BPE (built via Maturin), with a 65,536-token vocab; training uses FineWeb-EDU shards (re-packaged/shuffled for simple access). The walkthrough reports ~4.8 characters/token compression and compares against GPT-2/4 tokenizers.

Eval bundle: a curated set for CORE (22 autocompletion datasets like HellaSwag, ARC, BoolQ, etc.), downloaded into ~/.cache/nanochat/eval_bundle.

Model, scaling, and “speedrun” target

The speedrun config trains a depth-20 Transformer (≈560M params with 1280 hidden channels, 10 attention heads of dim 128) for ~11.2B tokens consistent with Chinchilla-style scaling (params × ~20 tokens). The author estimates this run as a ~4e19 FLOPs capability model. Training uses Muon for matmul parameters and AdamW for embeddings/unembeddings; loss is reported in bits-per-byte (bpb) to be tokenizer-invariant.
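
As a quick sanity check on these numbers, the token budget and compute estimate can be reproduced with the common approximation of roughly 6 × parameters × tokens training FLOPs (an estimate, not nanochat's own accounting):

# Back-of-the-envelope check of the speedrun scaling numbers
params = 560e6                 # depth-20 model, ~560M parameters
tokens = 20 * params           # Chinchilla-style params x ~20 tokens
flops = 6 * params * tokens    # rough training-compute estimate
print(f"tokens ~ {tokens/1e9:.1f}B")   # ~11.2B tokens
print(f"FLOPs  ~ {flops:.1e}")         # ~3.8e19, same order as the quoted ~4e19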

Mid-training, SFT, and tool use

After pretraining, mid-training adapts the base model to conversations (SmolTalk) and explicitly teaches multiple-choice behavior (100K MMLU auxiliary-train questions) and tool use by inserting <|python_start|>…<|python_end|> blocks; a small GSM8K slice is included to seed calculator-style usage. The default mixture: SmolTalk (460K), MMLU aux-train (100K), GSM8K main (8K), totaling 568K rows.
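
Purely as an illustration of the idea, a tool-use training row might look something like the following, with the assistant emitting a calculator call between the special tags. The exact schema used by nanochat's mid-training data is not shown here and may differ.

# Illustrative only: rough shape of a tool-use conversation row (schema is assumed, not nanochat's)
example_row = {
    "messages": [
        {"role": "user", "content": "What is 23 * 47?"},
        {"role": "assistant", "content": "<|python_start|>print(23 * 47)<|python_end|> The answer is 1081."},
    ]
}
print(example_row["messages"][1]["content"])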

SFT then fine-tunes on higher-quality conversations while matching test-time formatting (padded, non-concatenated rows) to reduce train/inference mismatch. The repo’s example post-SFT metrics (speedrun tier) report ARC-Easy 0.3876, ARC-Challenge 0.2807, MMLU 0.3151, GSM8K 0.0455, HumanEval 0.0854, ChatCORE 0.0884.

Tool use is wired end-to-end: the custom Engine implements KV cache, prefill/decode inference, and a simple Python interpreter sandbox for tool-augmented runs—used in both training and evaluation flows.

Optional RL on GSM8K via a simplified GRPO loop

The final (optional) stage applies reinforcement learning on GSM8K with a simplified GRPO routine. The walkthrough clarifies what’s omitted relative to canonical PPO-style RLHF: no trust region via a reference model, no KL penalties, on-policy updates (discard PPO ratios/clip), token-level GAPO-style normalization, and mean-shift advantage. Practically, it behaves close to REINFORCE while keeping the group-relative advantage calculation. Scripts scripts.chat_rl and scripts.chat_eval -i rl -a GSM8K demonstrate the loop.
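
To make the idea concrete, here is a minimal sketch (not nanochat's actual code) of the group-relative, mean-shifted advantage at the heart of the simplified loop: sample a group of completions per prompt, score them, subtract the group mean reward, and weight the log-probabilities by that advantage, REINFORCE-style. Token-level normalization and all PPO machinery are intentionally omitted.

# Sketch: group-relative REINFORCE-style loss (assumed shapes; toy data)
import torch

def grpo_style_loss(logprobs, rewards):
    """logprobs: (group, seq) per-token log-probs; rewards: (group,) scalar rewards."""
    advantages = rewards - rewards.mean()             # mean-shift within the group (no KL, no clipping)
    per_sample_logprob = logprobs.sum(dim=-1)         # sum token log-probs per completion
    return -(advantages * per_sample_logprob).mean()  # maximize advantage-weighted log-likelihood

logprobs = torch.randn(8, 16, requires_grad=True)                 # 8 completions, 16 tokens each
rewards = torch.tensor([1., 0., 0., 1., 0., 1., 0., 0.])          # e.g., GSM8K correctness
loss = grpo_style_loss(logprobs, rewards)
loss.backward()                                                   # gradients flow into logprobs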

Cost/quality scaling and bigger models

The README sketches two larger targets beyond the ~$100 speedrun:

~$300 tier: d=26 (~12 hours), slightly surpasses GPT-2 CORE; requires more pretraining shards and batch-size adjustments.

~$1,000 tier: ~41.6 hours, with materially improved coherence and basic reasoning/coding ability.

The repo also notes prior experimental runs where a d=30 model trained for ~24 hours reached the 40s on MMLU, the 70s on ARC-Easy, and the 20s on GSM8K.

Evaluation snapshot (speedrun tier)

An example report.md table for the ~$100/≈4-hour run shows: CORE 0.2219 (base); after mid-training/SFT, ARC-E 0.3561→0.3876, ARC-C ~0.2875→0.2807, MMLU 0.3111→0.3151, GSM8K 0.0250→0.0455, HumanEval 0.0671→0.0854, ChatCORE 0.0730→0.0884; wall-clock 3h51m.

https://github.com/karpathy/nanochat/discussions/1

Key Takeaways

nanochat is a minimal, end-to-end ChatGPT-style stack (~8K LOC) that runs via a single speedrun.sh on one 8×H100 node (~4h ≈ $100).

The pipeline covers tokenizer (Rust BPE), base pretraining, mid-training, SFT, optional RL on GSM8K (simplified GRPO), evaluation, and serving (CLI + Web UI).

Speedrun metrics (example report.md): CORE 0.2219 base; after SFT—ARC-Easy 0.3876, ARC-Challenge 0.2807, MMLU 0.3151, GSM8K 0.0455, HumanEval 0.0854.

Scaling tiers are outlined: ~$300 (d=26, ~12h) “slightly outperforms GPT-2 CORE”; ~$1,000 (~41.6h) for materially better coherence/reasoning.

Editorial Comments

Karpathy’s nanochat lands in a useful middle ground: a single, clean, dependency-light repository that stitches tokenizer training (Rust BPE), pretraining on FineWeb-EDU, mid-training (SmolTalk/MMLU aux/GSM8K with tool use tags), SFT, optional simplified GRPO on GSM8K, and a thin Engine (KV cache, prefill/decode, Python interpreter) into a reproducible speedrun on an 8×H100 node, producing a traceable report.md with CORE/ARC/MMLU/GSM8K/HumanEval and a minimal Web UI.

Check out the Technical details and Codes.

Excited to release new repo: nanochat! (it’s among the most unhinged I’ve written). Unlike my earlier similar repo nanoGPT which only covered pretraining, nanochat is a minimal, from scratch, full-stack training/inference pipeline of a simple ChatGPT clone in a single,… pic.twitter.com/LLhbLCoZFt — Andrej Karpathy (@karpathy) October 13, 2025

The post Andrej Karpathy Releases ‘nanochat’: A Minimal, End-to-End ChatGPT-Style Pipeline You Can Train in ~4 Hours for ~$100 appeared first on MarkTechPost.

A Coding Implementation of Advanced PyTest to Build Customized and Aut …

In this tutorial, we explore the advanced capabilities of PyTest, one of the most powerful testing frameworks in Python. We build a complete mini-project from scratch that demonstrates fixtures, markers, plugins, parameterization, and custom configuration. We focus on showing how PyTest can evolve from a simple test runner into a robust, extensible system for real-world applications. By the end, we understand not just how to write tests, but how to control and customize PyTest’s behavior to fit any project’s needs. Check out the FULL CODES here.

import sys, subprocess, os, textwrap, pathlib, json

subprocess.run([sys.executable, "-m", "pip", "install", "-q", "pytest>=8.0"], check=True)

root = pathlib.Path("pytest_advanced_tutorial").absolute()
if root.exists():
    import shutil; shutil.rmtree(root)
(root / "calc").mkdir(parents=True)
(root / "app").mkdir()
(root / "tests").mkdir()

We begin by setting up our environment, importing essential Python libraries for file handling and subprocess execution. We install the latest version of PyTest to ensure compatibility and then create a clean project structure with folders for our main code, application modules, and tests. This gives us a solid foundation to organize everything neatly before writing any test logic. Check out the FULL CODES here.

(root / "pytest.ini").write_text(textwrap.dedent("""
[pytest]
addopts = -q -ra --maxfail=1 -m "not slow"
testpaths = tests
markers =
    slow: slow tests (use --runslow to run)
    io: tests hitting the file system
    api: tests patching external calls
""").strip()+"\n")

(root / "conftest.py").write_text(textwrap.dedent(r'''
import os, time, pytest, json

def pytest_addoption(parser):
    parser.addoption("--runslow", action="store_true", help="run slow tests")

def pytest_configure(config):
    config.addinivalue_line("markers", "slow: slow tests")
    config._summary = {"passed":0,"failed":0,"skipped":0,"slow_ran":0}

def pytest_collection_modifyitems(config, items):
    if config.getoption("--runslow"):
        return
    skip = pytest.mark.skip(reason="need --runslow to run")
    for item in items:
        if "slow" in item.keywords: item.add_marker(skip)

def pytest_runtest_logreport(report):
    cfg = report.config._summary
    if report.when=="call":
        if report.passed: cfg["passed"]+=1
        elif report.failed: cfg["failed"]+=1
        elif report.skipped: cfg["skipped"]+=1
        if "slow" in report.keywords and report.passed: cfg["slow_ran"]+=1

def pytest_terminal_summary(terminalreporter, exitstatus, config):
    s=config._summary
    terminalreporter.write_sep("=", "SESSION SUMMARY (custom plugin)")
    terminalreporter.write_line(f"Passed: {s['passed']} | Failed: {s['failed']} | Skipped: {s['skipped']}")
    terminalreporter.write_line(f"Slow tests run: {s['slow_ran']}")
    terminalreporter.write_line("PyTest finished successfully" if s["failed"]==0 else "Some tests failed")

@pytest.fixture(scope="session")
def settings(): return {"env":"prod","max_retries":2}

@pytest.fixture(scope="function")
def event_log(): logs=[]; yield logs; print("\nEVENT LOG:", logs)

@pytest.fixture
def temp_json_file(tmp_path):
    p=tmp_path/"data.json"; p.write_text('{"msg":"hi"}'); return p

@pytest.fixture
def fake_clock(monkeypatch):
    t={"now":1000.0}; monkeypatch.setattr(time,"time",lambda: t["now"]); return t
'''))

We now create our PyTest configuration and plugin files. In pytest.ini, we define markers, default options, and test paths to control how tests are discovered and filtered. In conftest.py, we implement a custom plugin that tracks passed, failed, and skipped tests, adds a --runslow option, and provides fixtures for reusable test resources. This helps us extend PyTest’s core behavior while keeping our setup clean and modular. Check out the FULL CODES here.

(root/"calc"/"__init__.py").write_text(textwrap.dedent('''
from .vector import Vector
def add(a,b): return a+b
def div(a,b):
    if b==0: raise ZeroDivisionError("division by zero")
    return a/b
def moving_avg(xs,k):
    if k<=0 or k>len(xs): raise ValueError("bad window")
    out=[]; s=sum(xs[:k]); out.append(s/k)
    for i in range(k,len(xs)):
        s+=xs[i]-xs[i-k]; out.append(s/k)
    return out
'''))

(root/"calc"/"vector.py").write_text(textwrap.dedent('''
class Vector:
    __slots__=("x","y","z")
    def __init__(self,x=0,y=0,z=0): self.x,self.y,self.z=float(x),float(y),float(z)
    def __add__(self,o): return Vector(self.x+o.x,self.y+o.y,self.z+o.z)
    def __sub__(self,o): return Vector(self.x-o.x,self.y-o.y,self.z-o.z)
    def __mul__(self,s): return Vector(self.x*s,self.y*s,self.z*s)
    __rmul__=__mul__
    def norm(self): return (self.x**2+self.y**2+self.z**2)**0.5
    def __eq__(self,o): return abs(self.x-o.x)<1e-9 and abs(self.y-o.y)<1e-9 and abs(self.z-o.z)<1e-9
    def __repr__(self): return f"Vector({self.x:.2f},{self.y:.2f},{self.z:.2f})"
'''))

We now build the core calculation module for our project. In the calc package, we define simple mathematical utilities, including addition, division with error handling, and a moving-average function, to demonstrate logic testing. Alongside this, we create a Vector class that supports arithmetic operations, equality checks, and norm computation, a perfect example for testing custom objects and comparisons using PyTest. Check out the FULL CODES here.

(root/"app"/"io_utils.py").write_text(textwrap.dedent('''
import json, pathlib, time
def save_json(path,obj):
    path=pathlib.Path(path); path.write_text(json.dumps(obj)); return path
def load_json(path): return json.loads(pathlib.Path(path).read_text())
def timed_operation(fn,*a,**kw):
    t0=time.time(); out=fn(*a,**kw); t1=time.time(); return out,t1-t0
'''))
(root/"app"/"api.py").write_text(textwrap.dedent('''
import os, time, random
def fetch_username(uid):
    if os.environ.get("API_MODE")=="offline": return f"cached_{uid}"
    time.sleep(0.001); return f"user_{uid}_{random.randint(100,999)}"
'''))

(root/"tests"/"test_calc.py").write_text(textwrap.dedent('''
import pytest, math
from calc import add,div,moving_avg
from calc.vector import Vector
@pytest.mark.parametrize("a,b,exp",[(1,2,3),(0,0,0),(-1,1,0)])
def test_add(a,b,exp): assert add(a,b)==exp
@pytest.mark.parametrize("a,b,exp",[(6,3,2),(8,2,4)])
def test_div(a,b,exp): assert div(a,b)==exp
@pytest.mark.xfail(raises=ZeroDivisionError)
def test_div_zero(): div(1,0)
def test_avg(): assert moving_avg([1,2,3,4,5],3)==[2,3,4]
def test_vector_ops(): v=Vector(1,2,3)+Vector(4,5,6); assert v==Vector(5,7,9)
'''))

(root/"tests"/"test_io_api.py").write_text(textwrap.dedent('''
import pytest, os
from app.io_utils import save_json,load_json,timed_operation
from app.api import fetch_username
@pytest.mark.io
def test_io(temp_json_file,tmp_path):
    d={"x":5}; p=tmp_path/"a.json"; save_json(p,d); assert load_json(p)==d
    assert load_json(temp_json_file)=={"msg":"hi"}
def test_timed(capsys):
    val,dt=timed_operation(lambda x:x*3,7); print("dt=",dt); out=capsys.readouterr().out
    assert "dt=" in out and val==21
@pytest.mark.api
def test_api(monkeypatch):
    monkeypatch.setenv("API_MODE","offline")
    assert fetch_username(9)=="cached_9"
'''))

(root/"tests"/"test_slow.py").write_text(textwrap.dedent('''
import time, pytest
@pytest.mark.slow
def test_slow(event_log,fake_clock):
    event_log.append(f"start@{fake_clock['now']}")
    fake_clock["now"]+=3.0
    event_log.append(f"end@{fake_clock['now']}")
    assert len(event_log)==2
'''))

We add lightweight app utilities for JSON I/O and a mocked API to exercise real-world behaviors without external services. We write focused tests that use parametrization, xfail, markers, tmp_path, capsys, and monkeypatch to validate logic and side effects. We include a slow test wired to our event_log and fake_clock fixtures to demonstrate controlled timing and session-wide state. Check out the FULL CODES here.

print(" Project created at:", root)
print("\n RUN #1 (default, skips @slow)\n")
r1=subprocess.run([sys.executable,"-m","pytest",str(root)],text=True)
print("\n RUN #2 (--runslow)\n")
r2=subprocess.run([sys.executable,"-m","pytest",str(root),"--runslow"],text=True)

summary_file=root/"summary.json"
summary={
    "total_tests":sum("test_" in str(p) for p in root.rglob("test_*.py")),
    "runs": ["default","--runslow"],
    "results": ["success" if r1.returncode==0 else "fail",
                "success" if r2.returncode==0 else "fail"],
    "contains_slow_tests": True,
    "example_event_log":["start@1000.0","end@1003.0"]
}
summary_file.write_text(json.dumps(summary,indent=2))
print("\n FINAL SUMMARY")
print(json.dumps(summary,indent=2))
print("\n Tutorial completed — all tests & summary generated successfully.")

We now run our test suite twice: first with the default configuration that skips slow tests, and then again with the --runslow flag to include them. After both runs, we generate a JSON summary containing test outcomes, the total number of test files, and a sample event log. This final summary gives us a clear snapshot of our project’s testing health, confirming that all components work flawlessly from start to finish.

In conclusion, we see how PyTest helps us test smarter, not harder. We design a plugin that tracks results, uses fixtures for state management, and controls slow tests with custom options, all while keeping the workflow clean and modular. We conclude with a detailed JSON summary that demonstrates how easily PyTest can integrate with modern CI and analytics pipelines. With this foundation, we are now confident to extend PyTest further, combining coverage, benchmarking, or even parallel execution for large-scale, professional-grade testing.

Check out the FULL CODES here.
The post A Coding Implementation of Advanced PyTest to Build Customized and Automated Testing with Plugins, Fixtures, and JSON Reporting appeared first on MarkTechPost.

Build a device management agent with Amazon Bedrock AgentCore

The proliferation of Internet of Things (IoT) devices has transformed how we interact with our environments, from homes to industrial settings. However, as the number of connected devices grows, so does the complexity of managing them. Traditional device management interfaces often require navigating through multiple applications, each with its own UI and learning curve. This fragmentation creates friction for users trying to monitor and control their IoT environment.
In this post, we explore how to build a conversational device management system using Amazon Bedrock AgentCore. With this solution, users can manage their IoT devices through natural language, using a UI for tasks like checking device status, configuring WiFi networks, and monitoring user activity. To learn more about how Amazon Bedrock AgentCore enables deploying and operating highly effective agents securely at scale using a variety of frameworks and models, refer to Enabling customers to deliver production-ready AI agents at scale.
The challenge of device management
Managing a modern IoT environment involves navigating numerous challenges that can hinder user experience and technology adoption. Interface fragmentation forces users to juggle multiple applications and management tools for different devices, and technical complexity can make even basic configuration tasks intimidating for non-specialists. Adding to these difficulties are visibility limitations that prevent comprehensive monitoring of device status, and inadequate user management capabilities that make it difficult to track device usage patterns.
Together, these pain points create significant friction for users trying to implement and maintain IoT solutions effectively.
Solution overview
The conversational AI solution using agents offers a comprehensive approach to IoT complexity through its unified conversational interface that consolidates device management tasks into a single access point. Users can perform sophisticated operations through natural language interaction instead of navigating technical menus, while gaining comprehensive visibility across connected devices and transforming complex configuration tasks into straightforward conversations. The system delivers essential capabilities, including device management for inventory control and status monitoring, WiFi network management for simplified network configuration, user management for access control, and activity tracking for temporal analysis of user interactions. This seamless management experience minimizes monitoring vulnerabilities and provides valuable insights into usage patterns and potential security concerns, effectively removing the typical barriers to successful IoT implementation while maintaining appropriate system authorization throughout the network.
Architecture overview

The device management system follows a modular architecture that uses several AWS services. The architecture consists of the following components:

User and application interface – Users interact with the system through a web application that serves as the frontend interface.
Foundation models – This system uses various foundation models (FMs) in Amazon Bedrock to power natural language understanding and generation capabilities.
Amazon Bedrock AgentCore Gateway – This feature acts as the secure entry point for authenticated requests, validating bearer tokens before routing requests to the appropriate target.
Amazon Bedrock AgentCore Identity – This feature manages agent identity and permissions, controlling what actions the agent can perform on behalf of users.
Amazon Bedrock AgentCore Memory – This feature supports both short-term and long-term memory, maintaining immediate conversation context within a session and storing persistent insights and preferences across sessions. This enables agents to provide consistent, context-aware responses without developers needing to manage complex memory infrastructure.
Amazon Bedrock AgentCore Observability – This feature monitors agent performance, tracks metrics, and provides insights into system usage and behavior for debugging and optimization.
Amazon Bedrock AgentCore Runtime – This secure, serverless environment supports AI agents built with open source frameworks. It maintains complete session isolation by dedicating isolated containers per user session, enabling scalable and secure management of long-running, stateful interactions.
Amazon Cognito – Amazon Cognito handles user authentication through bearer token generation and validation, facilitating secure access to the system.
Amazon DynamoDB – Amazon DynamoDB stores system data across five tables.
AWS Lambda – The solution connects the gateway to AWS Lambda functions that execute specific device management operations. Lambda contains the business logic for device management, implementing seven core tools.

This architecture enables a seamless flow from user query to response: the user submits a natural language request through the application, which is authenticated through Amazon Cognito and processed by Amazon Bedrock AgentCore Runtime. The runtime determines the appropriate tool to invoke and sends the request through the gateway to the Lambda function, which queries or updates DynamoDB as needed. The result flows back through the same path, with the runtime generating a natural language response based on the data retrieved.
Refer to the GitHub repository for detailed deployment instructions.
Key functionalities of the device management agent
The device management system uses Lambda to implement seven essential tools for device management, including listing devices, retrieving settings, managing WiFi networks, and monitoring user activity, all invoked by the agent as needed. This functionality is supported by our flexible NoSQL database architecture in DynamoDB, which comprises five distinct tables—Devices, DeviceSettings, WifiNetworks, Users, and UserActivities—storing specialized data to maintain comprehensive system records. Together, these components create a robust foundation that enables efficient device management while maintaining detailed audit trails of system activities.
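The post does not reproduce the Lambda source here, but a minimal sketch of one such tool handler might look like the following; the event shape and access pattern are illustrative assumptions, and the actual implementation lives in the GitHub repository.

# Sketch: one Lambda-backed tool that lists devices from the DynamoDB Devices table (names illustrative)
import json
import boto3

dynamodb = boto3.resource("dynamodb")
devices_table = dynamodb.Table("Devices")            # one of the five tables described in this post

def lambda_handler(event, context):
    tool_name = event.get("tool_name", "list_devices")   # assumed event shape from the gateway
    if tool_name == "list_devices":
        items = devices_table.scan(Limit=50).get("Items", [])
        return {"statusCode": 200, "body": json.dumps({"devices": items}, default=str)}
    return {"statusCode": 400, "body": json.dumps({"error": f"unknown tool {tool_name}"})}
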
Key features showcase

Performance and security considerations
The solution balances robust concurrent processing capabilities with comprehensive protection measures. The device management system efficiently handles multiple simultaneous requests through automatically scaling Lambda functions, consistent DynamoDB performance regardless of data volume, and intelligent retry logic with exponential backoff when encountering rate limitations. To scale across hundreds of tools, the semantic search capability in Amazon Bedrock AgentCore Gateway enables efficient and relevant discovery of tools by meaning, facilitating quick and accurate responses even at large scale.
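A minimal sketch of that retry-with-exponential-backoff pattern is shown below; the function and parameter names are illustrative, and in practice you would catch the SDK's specific throttling exception rather than a bare Exception.

# Sketch: retry a callable with exponential backoff and jitter
import random
import time

def call_with_backoff(fn, max_attempts=5, base_delay=0.5):
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:                       # narrow this to the throttling error in real code
            if attempt == max_attempts - 1:
                raise
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)   # exponential backoff + jitter
            time.sleep(delay)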
The system implements industry-leading security practices, including Amazon Cognito authentication, Amazon Bedrock AgentCore Identity, layered access control through gateway and Lambda level permission verification, comprehensive data encryption at rest and in transit, and Amazon Bedrock Guardrails to help prevent prompt injection attacks while maintaining interaction safety.
Conclusion
The device management system presented in this post uses Amazon Bedrock AgentCore to transform IoT management through conversational AI, creating an intuitive interface where complex device operations become simple dialogue. Its composable, reusable, and decoupled agentic architecture alleviates undifferentiated heavy lifting by providing built-in features for secure, scalable deployment and seamless integration. By combining large language models with an AWS infrastructure, the solution provides enterprise-grade capabilities without burdening developers with infrastructure management. Key benefits include simplified user experiences through natural language interaction, operational efficiency with unified interfaces, comprehensive device visibility, and future-proof architecture that evolves with AI advancements. The system’s model-agnostic approach supports continuous improvement as new FMs emerge, and robust security and observability features help organizations confidently deploy scalable, next-generation device management solutions tailored to their specific IoT environments.
To implement this solution, refer to the GitHub repository.

About the Author
Godwin Sahayaraj Vincent is an Enterprise Solutions Architect at AWS who is passionate about Machine Learning and providing guidance to customers to design, deploy and manage their AWS workloads and architectures. In his spare time, he loves to play cricket with his friends and tennis with his three kids.
Ramesh Kumar Venkatraman is a Senior Solutions Architect at AWS who is passionate about Generative AI, Containers and Databases. He works with AWS customers to design, deploy and manage their AWS workloads and architectures. In his spare time, he loves to play with his two kids and follows cricket.
Chhavi Kaushik is an AWS Solutions Architect specializing in cloud-native architectures and digital transformation. She is passionate about helping customers harness the power of Generative AI, designing and implementing enterprise-scale solutions that combine AWS’s cutting-edge AI/ML services. Outside of her professional life, Chhavi enjoys exploring the California outdoors, making the most of the Bay Area’s beautiful weather and lifestyle.

How Amazon Bedrock Custom Model Import streamlined LLM deployment for …

This post is cowritten by Salesforce’s AI Platform team members Srikanta Prasad, Utkarsh Arora, Raghav Tanaji, Nitin Surya, Gokulakrishnan Gopalakrishnan, and Akhilesh Deepak Gotmare.
Salesforce’s Artificial Intelligence (AI) platform team runs customized large language models (LLMs)—fine-tuned versions of Llama, Qwen, and Mistral—for agentic AI applications like Agentforce. Deploying these models creates operational overheads: teams spend months optimizing instance families, serving engines, and configurations. This process is time-consuming, hard to maintain with frequent releases, and expensive due to GPU capacity reservations for peak usage.
Salesforce solved this by adopting Amazon Bedrock Custom Model Import. With Amazon Bedrock Custom Model Import, teams can import and deploy customized models through a unified API, minimizing infrastructure management while integrating with Amazon Bedrock features like Amazon Bedrock Knowledge Bases, Amazon Bedrock Guardrails, and Amazon Bedrock Agents. This shift lets Salesforce focus on models and business logic instead of infrastructure.
This post shows how Salesforce integrated Amazon Bedrock Custom Model Import into their machine learning operations (MLOps) workflow, reused existing endpoints without application changes, and benchmarked scalability. We share key metrics on operational efficiency and cost optimization gains, and offer practical insights for simplifying your deployment strategy.
Integration approach
Salesforce’s transition from Amazon SageMaker Inference to Amazon Bedrock Custom Model Import required careful integration with their existing MLOps pipeline to avoid disrupting production workloads. The team’s primary goal was to maintain their current API endpoints and model serving interfaces, keeping zero downtime and no required changes to downstream applications. With this approach, they could use the serverless capabilities of Amazon Bedrock while preserving the investment in their existing infrastructure and tooling. The integration strategy focused on creating a seamless bridge between their current deployment workflows and Amazon Bedrock managed services, enabling gradual migration without additional operational risk.
As shown in the following deployment flow diagram, Salesforce enhanced their existing model delivery pipeline with a single additional step to use Amazon Bedrock Custom Model Import. After their continuous integration and continuous delivery (CI/CD) process saves model artifacts to their model store (an Amazon Simple Storage Service (Amazon S3) bucket), they now call the Amazon Bedrock Custom Model Import API to register the model with Amazon Bedrock. This control plane operation is lightweight because Amazon Bedrock pulls the model directly from Amazon S3, adding minimal overhead (5–7 mins, depending on model size) to their deployment timeline—their overall model release process remains at approximately 1 hour. The integration delivered an immediate performance benefit: SageMaker no longer needs to download weights at container startup because Amazon Bedrock preloads the model. The main configuration changes involved granting Amazon Bedrock permissions to allow cross-account access to their S3 model bucket and updating AWS Identity and Access Management (IAM) policies to allow inference clients to invoke Amazon Bedrock endpoints.
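A sketch of that control-plane step with boto3 is shown below; the job name, model name, role ARN, and S3 URI are placeholders rather than Salesforce's actual values.

# Sketch: register a fine-tuned model with Bedrock Custom Model Import (all identifiers are placeholders)
import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")

response = bedrock.create_model_import_job(
    jobName="apexguru-import-001",                                   # illustrative
    importedModelName="apexguru-qwen2-5-13b",                        # illustrative
    roleArn="arn:aws:iam::111122223333:role/BedrockImportRole",      # role with access to the S3 model store
    modelDataSource={"s3DataSource": {"s3Uri": "s3://my-model-store/apexguru/"}},
)
print(response["jobArn"])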

The following inference flow diagram illustrates how Salesforce maintained their existing application interfaces while using Amazon Bedrock serverless capabilities. Client requests flow through their established preprocessing layer for business logic like prompt formatting before reaching Amazon Bedrock, with postprocessing applied to the raw model output. To handle complex processing requirements, they deployed lightweight SageMaker CPU containers that act as intelligent proxies—running their custom model.py logic while forwarding the actual inference to Amazon Bedrock endpoints. This hybrid architecture preserves their existing tooling framework: their prediction service continues calling SageMaker endpoints without routing changes, and they retain mature SageMaker monitoring and logging for preprocessing and postprocessing logic. The trade-off involves an additional network hop adding 5–10 millisecond latency and the cost of always-on CPU instances, but this approach delivers backward-compatibility with existing integrations while keeping the GPU-intensive inference fully serverless through Amazon Bedrock.
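To make the proxy idea concrete, the following sketch shows what the forwarding step could look like: preprocess the request, call Bedrock Runtime InvokeModel against the imported model, then postprocess the raw output. The prompt template, request body keys, and model ARN are assumptions for illustration, not Salesforce's actual model.py.

# Sketch: CPU proxy forwarding inference to an imported Bedrock model (shapes and ARN are placeholders)
import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")
IMPORTED_MODEL_ARN = "arn:aws:bedrock:us-east-1:111122223333:imported-model/abc123"   # placeholder

def predict(raw_request: dict) -> dict:
    prompt = f"### Instruction:\n{raw_request['text']}\n### Response:\n"   # assumed prompt template
    response = bedrock_runtime.invoke_model(
        modelId=IMPORTED_MODEL_ARN,
        body=json.dumps({"prompt": prompt, "max_tokens": 256, "temperature": 0.2}),   # assumed body keys
    )
    output = json.loads(response["body"].read())
    return {"completion": output.get("generation", output)}   # postprocess raw model output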

Scalability benchmarking
To validate the performance capabilities of Amazon Bedrock Custom Model Import, Salesforce conducted comprehensive load testing across various concurrency scenarios. Their testing methodology focused on measuring how the transparent auto scaling behavior of Amazon Bedrock—where the service automatically spins up model copies on-demand and scales out under heavy load—would impact real-world performance. Each test involved sending standardized payloads containing model IDs and input data through their proxy containers to Amazon Bedrock endpoints, measuring latency and throughput under different load patterns. Results (see the following table) show that at low concurrency, Amazon Bedrock achieved 44% lower latency than the ml.g6e.xlarge baseline (bf16 precision). Under higher loads, Amazon Bedrock Custom Model Import maintained consistent throughput with acceptable latency (P95 of roughly 10 seconds at the highest tested concurrency), demonstrating the serverless architecture’s ability to handle production workloads without manual scaling.

Concurrency (Count) | P95 Latency (in Seconds) | Throughput (Requests per Minute)
1                   | 7.2                      | 11
4                   | 7.96                     | 41
16                  | 9.35                     | 133
32                  | 10.44                    | 232

The results show P95 latency and throughput performance of the ApexGuru model (fine-tuned QWEN-2.5 13B) at varying concurrency levels. Amazon Bedrock Custom Model Import auto scaled from one to three copies as concurrency reached 32. Each model copy used 1 model unit.
Results and metrics
Beyond scalability improvements, Salesforce evaluated Amazon Bedrock Custom Model Import across two critical business dimensions: operational efficiency and cost optimization. The operational efficiency gains were substantial—the team achieved a 30% reduction in time to iterate and deploy models to production. This improvement stemmed from removing complex decision-making around instance selection, parameter tuning, and choosing between serving engines like vLLM vs. TensorRT-LLM. The streamlined deployment process allowed developers to focus on model performance rather than infrastructure configuration.
Cost optimization delivered even more dramatic results, with Salesforce achieving up to 40% cost reduction through Amazon Bedrock. This savings was primarily driven by their diverse traffic patterns across generative AI applications—ranging from low to high production traffic—where they previously had to reserve GPU capacity for peak workloads. The pay-per-use model proved especially beneficial for development, performance testing, and staging environments that only required GPU resources during active development cycles, avoiding the need for round-the-clock reserved capacity that often sat idle.
Lessons learned
Salesforce’s journey with Amazon Bedrock Custom Model Import revealed several key insights that can guide other organizations considering a similar approach. First, although Amazon Bedrock Custom Model Import supports popular open source model architectures (Qwen, Mistral, Llama) and expands its portfolio frequently based on demand, teams working with cutting-edge architectures might need to wait for support. Organizations fine-tuning the latest model architectures should therefore verify compatibility before committing to a deployment timeline.
For pre- and post-inference processing, Salesforce evaluated alternative approaches using Amazon API Gateway and AWS Lambda functions, which offer complete serverless scaling and pay-per-use pricing down to milliseconds of execution. However, they found this approach less backward-compatible with existing integrations and observed cold start impacts when using larger libraries in their processing logic.
Cold start latency emerged as a critical consideration, particularly for larger (over 7B parameter) models. Salesforce observed cold start delays of a couple of minutes with 26B parameter models, with latency varying based on model size. For latency-sensitive applications that can’t tolerate such delays, they recommend keeping endpoints warm by maintaining at least one model copy active through health check invocations every 14 minutes. This approach balances cost-efficiency with performance requirements for production workloads.
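A sketch of the keep-warm approach mentioned above is shown below: a scheduled function (for example, triggered every 14 minutes) sends a tiny request so at least one model copy stays active. The model ARN and request body are placeholders.

# Sketch: scheduled warm-up invocation to keep one model copy active (identifiers are placeholders)
import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

def warmup_handler(event, context):
    bedrock_runtime.invoke_model(
        modelId="arn:aws:bedrock:us-east-1:111122223333:imported-model/abc123",   # placeholder ARN
        body=json.dumps({"prompt": "ping", "max_tokens": 1}),                     # minimal health-check payload
    )
    return {"status": "warm"}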
Conclusion
Salesforce’s adoption of Amazon Bedrock Custom Model Import shows how to simplify LLM deployment without sacrificing scalability or performance. They achieved 30% faster deployments and 40% cost savings while maintaining backward-compatibility through their hybrid architecture using SageMaker proxy containers alongside Amazon Bedrock serverless inference. For highly customized models or unsupported architectures, Salesforce continues using SageMaker AI as a managed ML solution.
Their success came from methodical execution: thorough load testing and gradual migration starting with non-critical workloads. The results prove serverless AI deployment works for production, especially with variable traffic patterns. ApexGuru is now deployed in their production environment.
For teams managing LLMs at scale, this case study provides a clear blueprint. Check your model architecture compatibility, plan for cold starts with larger models, and preserve existing interfaces. Amazon Bedrock Custom Model Import offers a proven path to serverless AI that can reduce overhead, speed deployment, and cut costs while meeting performance requirements.
To learn more about pricing for Amazon Bedrock, refer to Optimizing cost for using foundational models with Amazon Bedrock and Amazon Bedrock pricing.
For help choosing between Amazon Bedrock and SageMaker AI, see Amazon Bedrock or Amazon SageMaker AI?
For more information about Amazon Bedrock Custom Model Import, see How to configure cross-account model deployment using Amazon Bedrock Custom Model Import.
For more details about ApexGuru, refer to Get AI-Powered Insights for Your Apex Code with ApexGuru.

About the authors
Srikanta Prasad is a Senior Manager in Product Management specializing in generative AI solutions at Salesforce. He leads Model Hosting and Inference initiatives, focusing on LLM inference serving, LLMOps, and scalable AI deployments.
Utkarsh Arora is an Associate Member of Technical Staff at Salesforce, combining strong academic grounding from IIIT Delhi with early career contributions in ML engineering and research. 
Raghav Tanaji is a Lead Member of Technical Staff at Salesforce, specializing in machine learning, pattern recognition, and statistical learning. He holds an M.Tech from IISc Bangalore.
Akhilesh Deepak Gotmare is a Senior Research Staff Member at Salesforce Research, based in Singapore. He is an AI Researcher focusing on deep learning, natural language processing, and code-related applications.
Gokulakrishnan Gopalakrishnan is a Principal Software Engineer at Salesforce, where he leads engineering efforts on ApexGuru. With 15+ years of experience, including at Microsoft, he specializes in building scalable software systems.
Nitin Surya is a Lead Member of Technical Staff at Salesforce with 8+ years in software/ML engineering. He holds a B.Tech in CS from VIT University and MS in CS (AI/ML focus) from University of Illinois Chicago.
Hrushikesh Gangur is a Principal Solutions Architect at AWS based in San Francisco, California. He specializes in generative and agentic AI, helping startups and ISVs build and deploy AI applications.

Ivy Framework Agnostic Machine Learning Build, Transpile, and Benchmar …

In this tutorial, we explore Ivy’s remarkable ability to unify machine learning development across frameworks. We begin by writing a fully framework-agnostic neural network that runs seamlessly on NumPy, PyTorch, TensorFlow, and JAX. We then dive into code transpilation, unified APIs, and advanced features like Ivy Containers and graph tracing, all designed to make deep learning code portable, efficient, and backend-independent. As we progress, we witness how Ivy simplifies model creation, optimization, and benchmarking without locking us into any single ecosystem. Check out the FULL CODES here.

!pip install -q ivy tensorflow torch jax jaxlib

import ivy
import numpy as np
import time

print(f"Ivy version: {ivy.__version__}")

class IvyNeuralNetwork:
    """A simple neural network written purely in Ivy that works with any backend."""

    def __init__(self, input_dim=4, hidden_dim=8, output_dim=3):
        self.w1 = ivy.random_uniform(shape=(input_dim, hidden_dim), low=-0.5, high=0.5)
        self.b1 = ivy.zeros((hidden_dim,))
        self.w2 = ivy.random_uniform(shape=(hidden_dim, output_dim), low=-0.5, high=0.5)
        self.b2 = ivy.zeros((output_dim,))

    def forward(self, x):
        """Forward pass using pure Ivy operations."""
        h = ivy.matmul(x, self.w1) + self.b1
        h = ivy.relu(h)

        out = ivy.matmul(h, self.w2) + self.b2
        return ivy.softmax(out)

    def train_step(self, x, y, lr=0.01):
        """Simple training step with manual gradients."""
        pred = self.forward(x)

        loss = -ivy.mean(ivy.sum(y * ivy.log(pred + 1e-8), axis=-1))

        pred_error = pred - y

        h_activated = ivy.relu(ivy.matmul(x, self.w1) + self.b1)
        h_t = ivy.permute_dims(h_activated, axes=(1, 0))
        dw2 = ivy.matmul(h_t, pred_error) / x.shape[0]
        db2 = ivy.mean(pred_error, axis=0)

        self.w2 = self.w2 - lr * dw2
        self.b2 = self.b2 - lr * db2

        return loss

def demo_framework_agnostic_network():
    """Demonstrate the same network running on different backends."""
    print("\n" + "="*70)
    print("PART 1: Framework-Agnostic Neural Network")
    print("="*70)

    X = np.random.randn(100, 4).astype(np.float32)
    y = np.eye(3)[np.random.randint(0, 3, 100)].astype(np.float32)

    backends = ['numpy', 'torch', 'tensorflow', 'jax']
    results = {}

    for backend in backends:
        try:
            ivy.set_backend(backend)

            if backend == 'jax':
                import jax
                jax.config.update('jax_enable_x64', True)

            print(f"\n Running with {backend.upper()} backend...")

            X_ivy = ivy.array(X)
            y_ivy = ivy.array(y)

            net = IvyNeuralNetwork()

            start_time = time.time()
            for epoch in range(50):
                loss = net.train_step(X_ivy, y_ivy, lr=0.1)

            elapsed = time.time() - start_time

            predictions = net.forward(X_ivy)
            accuracy = ivy.mean(
                ivy.astype(ivy.argmax(predictions, axis=-1) == ivy.argmax(y_ivy, axis=-1), 'float32')
            )

            results[backend] = {
                'loss': float(ivy.to_numpy(loss)),
                'accuracy': float(ivy.to_numpy(accuracy)),
                'time': elapsed
            }

            print(f" Final Loss: {results[backend]['loss']:.4f}")
            print(f" Accuracy: {results[backend]['accuracy']:.2%}")
            print(f" Time: {results[backend]['time']:.3f}s")

        except Exception as e:
            print(f" {backend} error: {str(e)[:80]}")
            results[backend] = None

    ivy.unset_backend()
    return results

We build and train a simple neural network entirely with Ivy to demonstrate true framework-agnostic design. We run the same model seamlessly across NumPy, PyTorch, TensorFlow, and JAX backends, observing consistent behavior and performance. Through this, we experience how Ivy abstracts away framework differences while maintaining efficiency and accuracy. Check out the FULL CODES here.

def demo_transpilation():
    """Demonstrate transpiling code from PyTorch to TensorFlow and JAX."""
    print("\n" + "="*70)
    print("PART 2: Framework Transpilation")
    print("="*70)

    try:
        import torch
        import tensorflow as tf

        def pytorch_computation(x):
            """A simple PyTorch computation."""
            return torch.mean(torch.relu(x * 2.0 + 1.0))

        x_torch = torch.randn(10, 5)

        print("\n Original PyTorch function:")
        result_torch = pytorch_computation(x_torch)
        print(f" PyTorch result: {result_torch.item():.6f}")

        print("\n Transpilation Demo:")
        print(" Note: ivy.transpile() is powerful but complex.")
        print(" It works best with traced/compiled functions.")
        print(" For simple demonstrations, we'll show the unified API instead.")

        print("\n Equivalent computation across frameworks:")
        x_np = x_torch.numpy()

        ivy.set_backend('numpy')
        x_ivy = ivy.array(x_np)
        result_np = ivy.mean(ivy.relu(x_ivy * 2.0 + 1.0))
        print(f" NumPy result: {float(ivy.to_numpy(result_np)):.6f}")

        ivy.set_backend('tensorflow')
        x_ivy = ivy.array(x_np)
        result_tf = ivy.mean(ivy.relu(x_ivy * 2.0 + 1.0))
        print(f" TensorFlow result: {float(ivy.to_numpy(result_tf)):.6f}")

        ivy.set_backend('jax')
        import jax
        jax.config.update('jax_enable_x64', True)
        x_ivy = ivy.array(x_np)
        result_jax = ivy.mean(ivy.relu(x_ivy * 2.0 + 1.0))
        print(f" JAX result: {float(ivy.to_numpy(result_jax)):.6f}")

        print("\n All results match within numerical precision!")

        ivy.unset_backend()

    except Exception as e:
        print(f" Demo error: {str(e)[:80]}")

In this part, we explore how Ivy enables smooth transpilation and interoperability between frameworks. We take a simple PyTorch computation and reproduce it identically in TensorFlow, NumPy, and JAX using Ivy’s unified API. Through this, we see how Ivy bridges framework boundaries, enabling consistent results across different deep learning ecosystems. Check out the FULL CODES here.
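The demo deliberately sidesteps ivy.transpile, so the following is only a hedged sketch of what a direct call might look like. ivy.transpile is a real Ivy entry point, but its keyword arguments (shown here as source and target) have changed across releases, so treat this as an assumption to verify against your installed version rather than a definitive recipe.

import ivy
import torch
import tensorflow as tf

def pytorch_computation(x):
    return torch.mean(torch.relu(x * 2.0 + 1.0))

# Assumption: keyword names follow recent Ivy docs (source/target);
# older releases used different arguments, so check ivy.transpile's signature.
tf_fn = ivy.transpile(pytorch_computation, source="torch", target="tensorflow")

x_tf = tf.random.uniform((10, 5))
print(tf_fn(x_tf))  # the transpiled function now runs natively in TensorFlow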

def demo_unified_api():
    """Show how Ivy's unified API works across different operations."""
    print("\n" + "=" * 70)
    print("PART 3: Unified API Across Frameworks")
    print("=" * 70)

    operations = [
        ("Matrix Multiplication", lambda x: ivy.matmul(x, ivy.permute_dims(x, axes=(1, 0)))),
        ("Element-wise Operations", lambda x: ivy.add(ivy.multiply(x, x), 2)),
        ("Reductions", lambda x: ivy.mean(ivy.sum(x, axis=0))),
        ("Neural Net Ops", lambda x: ivy.mean(ivy.relu(x))),
        ("Statistical Ops", lambda x: ivy.std(x)),
        ("Broadcasting", lambda x: ivy.multiply(x, ivy.array([1.0, 2.0, 3.0, 4.0]))),
    ]

    X = np.random.randn(5, 4).astype(np.float32)

    for op_name, op_func in operations:
        print(f"\n {op_name}:")

        for backend in ['numpy', 'torch', 'tensorflow', 'jax']:
            try:
                ivy.set_backend(backend)

                if backend == 'jax':
                    import jax
                    jax.config.update('jax_enable_x64', True)

                x_ivy = ivy.array(X)
                result = op_func(x_ivy)
                result_np = ivy.to_numpy(result)

                if result_np.shape == ():
                    print(f" {backend:12s}: scalar value = {float(result_np):.4f}")
                else:
                    print(f" {backend:12s}: shape={result_np.shape}, mean={np.mean(result_np):.4f}")

            except Exception as e:
                print(f" {backend:12s}: {str(e)[:60]}")

    ivy.unset_backend()

In this section, we test Ivy’s unified API by performing various mathematical, neural, and statistical operations across multiple backends. We seamlessly execute the same code on NumPy, PyTorch, TensorFlow, and JAX, confirming consistent results and syntax. Through this, we realize how Ivy simplifies multi-framework coding into a single, coherent interface that just works everywhere. Check out the FULL CODES here.
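To make the "consistent results" claim checkable rather than eyeballed, here is a small helper, our own addition rather than part of the tutorial, that runs one Ivy function on several backends and compares the outputs with np.allclose.

import ivy
import numpy as np

def check_consistency(fn, x_np, backends=("numpy", "torch", "tensorflow"), atol=1e-5):
    """Run fn on each backend and report whether its output matches the first backend's."""
    outputs = {}
    for backend in backends:
        ivy.set_backend(backend)
        outputs[backend] = ivy.to_numpy(fn(ivy.array(x_np)))
        ivy.unset_backend()
    reference = outputs[backends[0]]
    return {b: bool(np.allclose(out, reference, atol=atol)) for b, out in outputs.items()}

X = np.random.randn(5, 4).astype(np.float32)
print(check_consistency(lambda x: ivy.mean(ivy.relu(x)), X))
# expected: every backend reports True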

def demo_advanced_features():
    """Demonstrate advanced Ivy features."""
    print("\n" + "=" * 70)
    print("PART 4: Advanced Ivy Features")
    print("=" * 70)

    print("\n Ivy Containers - Nested Data Structures:")
    try:
        ivy.set_backend('torch')

        container = ivy.Container({
            'layer1': {'weights': ivy.random_uniform(shape=(4, 8)), 'bias': ivy.zeros((8,))},
            'layer2': {'weights': ivy.random_uniform(shape=(8, 3)), 'bias': ivy.zeros((3,))}
        })

        print(f" Container keys: {list(container.keys())}")
        print(f" Layer1 weight shape: {container['layer1']['weights'].shape}")
        print(f" Layer2 bias shape: {container['layer2']['bias'].shape}")

        def scale_fn(x, _):
            return x * 2.0

        scaled_container = container.cont_map(scale_fn)
        print(" Applied scaling to all tensors in container")

    except Exception as e:
        print(f" Container demo: {str(e)[:80]}")

    print("\n Array API Standard Compliance:")
    backends_tested = []
    for backend in ['numpy', 'torch', 'tensorflow', 'jax']:
        try:
            ivy.set_backend(backend)

            if backend == 'jax':
                import jax
                jax.config.update('jax_enable_x64', True)

            x = ivy.array([1.0, 2.0, 3.0])
            y = ivy.array([4.0, 5.0, 6.0])

            result = ivy.sqrt(ivy.square(x) + ivy.square(y))
            print(f" {backend:12s}: L2 norm operations work")
            backends_tested.append(backend)
        except Exception as e:
            print(f" {backend:12s}: {str(e)[:50]}")

    print(f"\n Successfully tested {len(backends_tested)} backends")

    print("\n Complex Multi-step Operations:")
    try:
        ivy.set_backend('torch')

        x = ivy.random_uniform(shape=(10, 5), low=0, high=1)

        result = ivy.mean(
            ivy.relu(
                ivy.matmul(x, ivy.permute_dims(x, axes=(1, 0)))
            ),
            axis=0
        )

        print(" Chained operations (matmul → relu → mean)")
        print(f" Input shape: (10, 5), Output shape: {result.shape}")
        print(" Complex operation graph executed successfully")

    except Exception as e:
        print(f" {str(e)[:80]}")

    ivy.unset_backend()

We dive into Ivy’s power features beyond the basics. We organize parameters with ivy.Container, validate Array API–style ops across NumPy, PyTorch, TensorFlow, and JAX, and chain complex steps (matmul → ReLU → mean) to see graph-like execution flow. We come away confident that Ivy scales from neat data structures to robust multi-backend computation. Check out the FULL CODES here.
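As a concrete illustration of why key-chain mapping is useful, the sketch below (an assumption, not taken from the tutorial) applies a different rule to weights and biases inside a nested ivy.Container, using the same two-argument cont_map pattern the demo uses for uniform scaling.

import ivy

ivy.set_backend("torch")

params = ivy.Container({
    "layer1": {"weights": ivy.random_uniform(shape=(4, 8)), "bias": ivy.zeros((8,))},
    "layer2": {"weights": ivy.random_uniform(shape=(8, 3)), "bias": ivy.zeros((3,))},
})

def update_fn(x, key_chain):
    # cont_map passes the leaf value and its key chain, e.g. "layer1/weights"
    if "weights" in key_chain:
        return x * 0.9   # illustrative weight decay on weight matrices only
    return x             # leave biases untouched

decayed = params.cont_map(update_fn)
print(list(decayed.keys()))
ivy.unset_backend()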

def benchmark_operation(op_func, x, iterations=50):
    """Benchmark an operation."""
    start = time.time()
    for _ in range(iterations):
        result = op_func(x)
    return time.time() - start

def demo_performance():
    """Compare performance across backends."""
    print("\n" + "=" * 70)
    print("PART 5: Performance Benchmarking")
    print("=" * 70)

    X = np.random.randn(100, 100).astype(np.float32)

    def complex_operation(x):
        """A more complex computation."""
        z = ivy.matmul(x, ivy.permute_dims(x, axes=(1, 0)))
        z = ivy.relu(z)
        z = ivy.mean(z, axis=0)
        return ivy.sum(z)

    print("\n Benchmarking matrix operations (50 iterations):")
    print(" Operation: matmul → relu → mean → sum")

    for backend in ['numpy', 'torch', 'tensorflow', 'jax']:
        try:
            ivy.set_backend(backend)

            if backend == 'jax':
                import jax
                jax.config.update('jax_enable_x64', True)

            x_ivy = ivy.array(X)

            # Warm-up run before timing
            _ = complex_operation(x_ivy)

            elapsed = benchmark_operation(complex_operation, x_ivy, iterations=50)

            print(f" {backend:12s}: {elapsed:.4f}s ({elapsed / 50 * 1000:.2f}ms per op)")

        except Exception as e:
            print(f" {backend:12s}: {str(e)[:60]}")

    ivy.unset_backend()

if __name__ == "__main__":
    print("""
    ╔════════════════════════════════════════════════════════════════════╗
    ║           Advanced Ivy Tutorial - Framework-Agnostic ML             ║
    ║                   Write Once, Run Everywhere!                       ║
    ╚════════════════════════════════════════════════════════════════════╝
    """)

    results = demo_framework_agnostic_network()
    demo_transpilation()
    demo_unified_api()
    demo_advanced_features()
    demo_performance()

    print("\n" + "=" * 70)
    print(" Tutorial Complete!")
    print("=" * 70)
    print("\n Key Takeaways:")
    print(" 1. Ivy enables writing ML code once that runs on any framework")
    print(" 2. Same operations work identically across NumPy, PyTorch, TF, JAX")
    print(" 3. Unified API provides consistent operations across backends")
    print(" 4. Switch backends dynamically for optimal performance")
    print(" 5. Containers help manage complex nested model structures")
    print("\n Next Steps:")
    print(" - Build your own framework-agnostic models")
    print(" - Use ivy.Container for managing model parameters")
    print(" - Explore ivy.trace_graph() for computation graph optimization")
    print(" - Try different backends to find optimal performance")
    print(" - Check docs at: https://docs.ivy.dev/")
    print("=" * 70)

We benchmark the same complex operation across NumPy, PyTorch, TensorFlow, and JAX to compare real-world throughput. We warm up each backend, run 50 iterations, and log total time and per-op latency so we can choose the fastest stack for our workload.
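If you want slightly more trustworthy numbers than a single 50-iteration total, the sketch below (our own variation, not part of the tutorial) times each iteration with time.perf_counter, reports the median, and forces materialization through ivy.to_numpy so lazy or asynchronous backends finish executing before the clock stops.

import time
import statistics
import ivy
import numpy as np

def benchmark_median(op_func, x, iterations=50, warmup=3):
    """Median per-iteration latency, with warm-up and forced result materialization."""
    for _ in range(warmup):
        ivy.to_numpy(op_func(x))                 # warm-up also absorbs any one-off tracing cost
    timings = []
    for _ in range(iterations):
        start = time.perf_counter()
        ivy.to_numpy(op_func(x))                 # to_numpy blocks until the result exists
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)

X = np.random.randn(100, 100).astype(np.float32)
ivy.set_backend("numpy")
op = lambda x: ivy.sum(ivy.relu(ivy.matmul(x, ivy.permute_dims(x, axes=(1, 0)))))
median_s = benchmark_median(op, ivy.array(X))
print(f"median per-op latency: {median_s * 1000:.2f} ms")
ivy.unset_backend()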

In conclusion, we experience firsthand how Ivy empowers us to “write once and run everywhere.” We observe identical model behavior, seamless backend switching, and consistent performance across multiple frameworks. By unifying APIs, simplifying interoperability, and offering advanced graph optimization and container features, Ivy paves the way for a more flexible, modular, and efficient future of machine learning development. We now stand equipped to build and deploy models effortlessly across diverse environments, all using the same elegant Ivy codebase.
