How TP ICAP transformed CRM data into real-time insights with Amazon Bedrock

This post is co-written with Ross Ashworth at TP ICAP.
The ability to quickly extract insights from customer relationship management systems (CRMs) and vast amounts of meeting notes can mean the difference between seizing opportunities and missing them entirely. TP ICAP faced this challenge, having thousands of vendor meeting records stored in their CRM. Using Amazon Bedrock, their Innovation Lab built a production-ready solution that transforms hours of manual analysis into seconds by providing AI-powered insights, using a combination of Retrieval Augmented Generation (RAG) and text-to-SQL approaches.
This post shows how TP ICAP used Amazon Bedrock Knowledge Bases and Amazon Bedrock Evaluations to build ClientIQ, an enterprise-grade solution with enhanced security features for extracting CRM insights using AI, delivering immediate business value.
The challenge
TP ICAP had accumulated tens of thousands of vendor meeting notes in their CRM system over many years. These notes contained rich, qualitative information and details about product offerings, integration discussions, relationship insights, and strategic direction. However, this data was being underutilized and business users were spending hours manually searching through records, knowing the information existed but unable to efficiently locate it. The TP ICAP Innovation Lab set out to make the information more accessible, actionable, and quickly summarized for their internal stakeholders. Their solution needed to surface relevant information quickly, be accurate, and maintain proper context.
ClientIQ: TP ICAP’s custom CRM assistant
With ClientIQ, users can interact with their Salesforce meeting data through natural language queries. For example, they can:

Ask questions about meeting data in plain English, such as “How can we improve our relationship with customers?”, “What do our clients think about our solution?”, or “How were our clients impacted by Brexit?”
Refine their queries through follow-up questions.
Apply filters to restrict model answers to a particular time period.
Access source documents directly through links to specific Salesforce records.

ClientIQ provides comprehensive responses while maintaining full traceability by including references to the source data and direct links to the original Salesforce records. The conversational interface supports natural dialogue flow, so users can refine and explore their queries without starting over. The following screenshot shows an example interaction (examples in this post use fictitious data and AnyCompany, a fictitious company, for demonstration purposes).

ClientIQ performs multiple tasks to fulfill a user’s request:

It uses a large language model (LLM) to analyze each user query to determine the optimal processing path.
It routes requests to one of two workflows:

The RAG workflow for getting insights from unstructured meeting notes. For example, “Was topic A discussed with AnyCompany in the last 14 days?”
The SQL generation workflow for answering analytical queries by querying structured data. For example, “Get me a report on meeting count per region for the last 4 weeks.”

It then generates the responses in natural language.
ClientIQ respects existing permission boundaries and access controls, helping to make sure users only access the data they're authorized to see. For example, if a user only has access to their regional accounts in the CRM system, ClientIQ only returns information from those accounts.
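To make the routing step concrete, the following is a minimal sketch of how the query classification could be implemented with the Amazon Bedrock Converse API. The prompt, model ID, and two-way split between RAG and SQL are illustrative assumptions rather than TP ICAP's production code.

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

ROUTER_SYSTEM_PROMPT = (
    "Classify the user question as either RAG (insights from unstructured meeting notes) "
    "or SQL (analytical reporting over structured meeting data). Reply with one word: RAG or SQL."
)

def route_query(question: str, model_id: str = "anthropic.claude-3-5-sonnet-20240620-v1:0") -> str:
    # Ask a foundation model to pick the processing path for this question.
    response = bedrock_runtime.converse(
        modelId=model_id,
        system=[{"text": ROUTER_SYSTEM_PROMPT}],
        messages=[{"role": "user", "content": [{"text": question}]}],
        inferenceConfig={"maxTokens": 10, "temperature": 0},
    )
    answer = response["output"]["message"]["content"][0]["text"].strip().upper()
    return "SQL" if "SQL" in answer else "RAG"

A low temperature and a tightly constrained output keep the classification deterministic and inexpensive relative to the downstream RAG or text-to-SQL call.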

Solution overview
Although the team considered using their CRM’s built-in AI assistant, they opted to develop a more customized, cost-effective solution that would precisely match their requirements. They partnered with AWS and built an enterprise-grade solution powered by Amazon Bedrock. With Amazon Bedrock, TP ICAP evaluated and selected the best models for their use case and built a production-ready RAG solution in weeks rather than months, without having to manage the underlying infrastructure. They specifically used the following Amazon Bedrock managed capabilities:

Amazon Bedrock foundation models – Amazon Bedrock provides a range of foundation models (FMs) from providers, including Anthropic, Meta, Mistral AI, and Amazon, accessible through a single API. TP ICAP experimented with different models for various tasks and selected the best model for each task, balancing latency, performance, and cost. For instance, they used Anthropic’s Claude 3.5 Sonnet for classification tasks and Amazon Nova Pro for text-to-SQL generation. Because Amazon Bedrock is fully managed, they didn’t need to spend time setting up infrastructure for hosting these models, reducing the time to delivery.
Amazon Bedrock Knowledge Bases – The FMs needed access to the information in TP ICAP’s Salesforce system to provide accurate, relevant responses. TP ICAP used Amazon Bedrock Knowledge Bases to implement RAG, a technique that enhances generative AI responses by incorporating relevant data from your organization’s knowledge sources. Amazon Bedrock Knowledge Bases is a fully managed RAG capability with built-in session context management and source attribution. The final implementation delivers precise, contextually relevant responses while maintaining traceability to source documents.
Amazon Bedrock Evaluations – For consistent quality and performance, the team wanted to implement automated evaluations. By using Amazon Bedrock Evaluations and the RAG evaluation tool for Amazon Bedrock Knowledge Bases in their development environment and CI/CD pipeline, they were able to evaluate and compare FMs with human-like quality. They evaluated different dimensions, including response accuracy, relevance, and completeness, and quality of RAG retrieval.

Since launch, their approach has scaled efficiently to analyze thousands of responses and facilitates data-driven decision-making about model and inference parameter selection and RAG configuration. The following diagram showcases the architecture of the solution.

The user query workflow consists of the following steps:

The user logs in through a frontend React application, hosted in an Amazon Simple Storage Service (Amazon S3) bucket and accessible only within the organization’s network through an internal-only Application Load Balancer.
After logging in, a WebSocket connection is opened between the client and Amazon API Gateway to enable real-time, bi-directional communication.
After the connection is established, an AWS Lambda function (connection handler) is invoked, which processes the payload, logs tracking data to Amazon DynamoDB, and publishes request data to an Amazon Simple Notification Service (Amazon SNS) topic for downstream processing (a sketch of this handler follows these steps).
Lambda functions for different types of tasks consume messages from Amazon Simple Queue Service (Amazon SQS) for scalable and event-driven processing.
The Lambda functions use Amazon Bedrock FMs to determine whether a question is best answered by querying structured data in Amazon Athena or by retrieving information from an Amazon Bedrock knowledge base.
After processing, the answer is returned to the user in real time using the existing WebSocket connection through API Gateway.
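The connection handler described above can look roughly like the following. Table and topic names, environment variables, and payload fields are placeholders, not the actual ClientIQ implementation.

import json
import os
import time
import boto3

dynamodb = boto3.client("dynamodb")
sns = boto3.client("sns")

TRACKING_TABLE = os.environ.get("TRACKING_TABLE", "clientiq-tracking")  # placeholder name
REQUEST_TOPIC_ARN = os.environ["REQUEST_TOPIC_ARN"]  # placeholder environment variable

def handler(event, context):
    # The WebSocket connection ID lets the response be pushed back to the right client later.
    connection_id = event["requestContext"]["connectionId"]
    body = json.loads(event.get("body") or "{}")

    # Log tracking data for the request.
    dynamodb.put_item(
        TableName=TRACKING_TABLE,
        Item={
            "connectionId": {"S": connection_id},
            "receivedAt": {"N": str(int(time.time()))},
            "query": {"S": body.get("query", "")},
        },
    )

    # Hand the request off to downstream workers via SNS (fanned out to SQS consumers).
    sns.publish(
        TopicArn=REQUEST_TOPIC_ARN,
        Message=json.dumps({"connectionId": connection_id, "query": body.get("query", "")}),
    )
    return {"statusCode": 200}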

Data ingestion
ClientIQ needs to be regularly updated with the latest Salesforce data. Rather than using an off-the-shelf option, TP ICAP developed a custom connector to interface with their highly tailored Salesforce implementation and ingest the latest data to Amazon S3. This bespoke approach provided the flexibility needed to handle their specific data structures while remaining simple to configure and maintain. The connector, which employs Salesforce Object Query Language (SOQL) queries to retrieve the data, runs daily and has proven to be fast and reliable. To optimize the quality of the results during the RAG retrieval workflow, TP ICAP opted for a custom chunking approach in their Amazon Bedrock knowledge base. The custom chunking happens as part of the ingestion process, where the connector splits the data into individual CSV files, one per meeting. These files are also automatically tagged with relevant topics from a predefined list, using Amazon Nova Pro, to further increase the quality of the retrieval results. The final outputs in Amazon S3 contain a CSV file per meeting and a matching JSON metadata file containing tags such as date, division, brand, and region. The following is an example of the associated metadata file:

{
  "metadataAttributes": {
    "Tier": "Bronze",
    "Number_Date_of_Visit": 20171130,
    "Author_Region_C": "AMER",
    "Brand_C": "Credit",
    "Division_C": "Credit",
    "Visiting_City_C": "Chicago",
    "Client_Name": "AnyCompany"
  }
}
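To illustrate the custom chunking output described above, the following sketch writes one CSV per meeting along with the matching metadata file that Amazon Bedrock Knowledge Bases expects next to each document (a <document>.metadata.json file). The CSV columns and field mapping are illustrative; the connector's actual logic and schema are not shown in this post.

import csv
import json
from pathlib import Path

def write_meeting_files(meeting: dict, out_dir: str = "staging") -> None:
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # One CSV per meeting keeps each chunk aligned with a single Salesforce record.
    csv_path = out / f"{meeting['record_id']}.csv"
    fields = ["Client_Name", "Date_of_Visit", "Meeting_Notes"]  # illustrative columns
    with csv_path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        writer.writerow({k: meeting.get(k, "") for k in fields})

    # The matching metadata file enables filtering during retrieval.
    metadata = {"metadataAttributes": {
        "Tier": meeting.get("Tier"),
        "Author_Region_C": meeting.get("Author_Region_C"),
        "Brand_C": meeting.get("Brand_C"),
        "Division_C": meeting.get("Division_C"),
        "Visiting_City_C": meeting.get("Visiting_City_C"),
        "Client_Name": meeting.get("Client_Name"),
    }}
    (out / f"{csv_path.name}.metadata.json").write_text(json.dumps(metadata, indent=2))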

As soon as the data is available in Amazon S3, an AWS Glue job is triggered to populate the AWS Glue Data Catalog. This is later used by Athena when querying the Amazon S3 data.
The Amazon Bedrock knowledge base is also synced with Amazon S3. As part of this process, each CSV file is converted into embeddings using Amazon Titan v1 and indexed in the vector store, Amazon OpenSearch Serverless. The metadata is also ingested and available for filtering the vector store results during retrieval, as described in the following section.
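Syncing the knowledge base after each daily load can be automated with the StartIngestionJob API. The following is a minimal sketch; the knowledge base and data source IDs are placeholders.

import boto3

bedrock_agent = boto3.client("bedrock-agent")

def sync_knowledge_base(knowledge_base_id: str, data_source_id: str) -> str:
    # Kick off an ingestion job so new S3 objects are embedded and indexed in the vector store.
    response = bedrock_agent.start_ingestion_job(
        knowledgeBaseId=knowledge_base_id,
        dataSourceId=data_source_id,
    )
    return response["ingestionJob"]["ingestionJobId"]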
Boosting RAG retrieval quality
In a RAG query workflow, the first step is to retrieve the documents that are relevant to the user’s query from the vector store and append them to the query as context. Common ways to find the relevant documents include semantic search, keyword search, or a combination of both, referred to as hybrid search. ClientIQ uses hybrid search to first filter documents based on their metadata and then perform semantic search within the filtered results. This pre-filtering provides more control over the retrieved documents and helps disambiguate queries. For example, a question such as “find notes from executive meetings with AnyCompany in Chicago” can mean meetings with any AnyCompany division that took place in Chicago or meetings with AnyCompany’s division headquartered in Chicago.
TP ICAP used the manual metadata filtering capability in Amazon Bedrock Knowledge Bases to implement hybrid search in their vector store, OpenSearch Serverless. With this approach, in the preceding example, the documents are first pre-filtered for “Chicago” as Visiting_City_C. After that, a semantic search is performed to find the documents that contain executive meeting notes for AnyCompany. The final output contains notes from meetings in Chicago, which is what is expected in this case. The team enhanced this functionality further by using the implicit metadata filtering of Amazon Bedrock Knowledge Bases. This capability relies on Amazon Bedrock FMs to automatically analyze the query, understand which values can be mapped to metadata fields, and rewrite the query accordingly before performing the retrieval.
Finally, for additional precision, users can manually specify filters through the application UI, giving them greater control over their search results. This multi-layered filtering approach significantly improves context and final response accuracy while maintaining fast retrieval speeds.
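The following sketch shows what a filtered retrieval call can look like with the Amazon Bedrock Knowledge Bases Retrieve API, pre-filtering on the Visiting_City_C metadata field before the semantic search. The IDs and result count are placeholders.

import boto3

agent_runtime = boto3.client("bedrock-agent-runtime")

def retrieve_meeting_notes(kb_id: str, query: str, city: str, top_k: int = 5) -> list:
    # Hybrid search: metadata pre-filter plus semantic search over the filtered set.
    response = agent_runtime.retrieve(
        knowledgeBaseId=kb_id,
        retrievalQuery={"text": query},
        retrievalConfiguration={
            "vectorSearchConfiguration": {
                "numberOfResults": top_k,
                "filter": {"equals": {"key": "Visiting_City_C", "value": city}},
            }
        },
    )
    return response["retrievalResults"]

For the earlier example, retrieve_meeting_notes(kb_id, "executive meetings with AnyCompany", "Chicago") returns only notes from meetings that took place in Chicago.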
Security and access control
To maintain Salesforce’s granular permissions model in the ClientIQ solution, TP ICAP implemented a security framework using Okta group claims mapped to specific divisions and regions. When a user signs in, their group claims are attached to their session. When the user asks a question, these claims are automatically matched against metadata fields in Athena or OpenSearch Serverless, depending on the path followed.
For example, if a user has access to see information for EMEA only, then the documents are automatically filtered by the EMEA region. In Athena, this is done by automatically adjusting the query to include this filter. In Amazon Bedrock Knowledge Bases, this is done by introducing an additional metadata field filter for region=EMEA in the hybrid search. This is highlighted in the following diagram.

Results that don’t match the user’s permission tags are filtered out, so that users can only access data they’re authorized to see. This unified security model maintains consistency between Salesforce permissions and ClientIQ access controls, preserving data governance across solutions.
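One way to express this on the retrieval path is to wrap whatever filter the query produced with the user's region claims. The sketch below uses the Knowledge Bases retrieval filter grammar (equals, in, andAll); the claim-to-metadata-field mapping is an assumption for illustration.

def build_secure_filter(user_regions: list, query_filter: dict = None) -> dict:
    # Restrict results to the regions present in the user's Okta group claims.
    if len(user_regions) == 1:
        region_filter = {"equals": {"key": "Author_Region_C", "value": user_regions[0]}}
    else:
        region_filter = {"in": {"key": "Author_Region_C", "value": user_regions}}

    # Combine with any filter derived from the query itself (for example, a city filter).
    if query_filter is None:
        return region_filter
    return {"andAll": [region_filter, query_filter]}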
The team also developed a custom administrative interface for admins who manage permissions in Salesforce to add or remove users from groups using Okta's APIs.
Automated evaluation
The Innovation Lab team faced a common challenge in building their RAG application: how to scientifically measure and improve its performance. To address that, they developed an evaluation strategy using Amazon Bedrock Evaluations that involves three phases:

Ground truth creation – They worked closely with stakeholders and testing teams to develop a comprehensive set of 100 representative question-answer pairs that mirrored real-world interactions.
RAG evaluation – In their development environment, they programmatically triggered RAG evaluations in Amazon Bedrock Evaluations to process the ground truth data in Amazon S3 and run comprehensive assessments. They evaluated different chunking strategies, including default and custom chunking, tested different embedding models for retrieval, and compared FMs for generation using a range of inference parameters.
Metric-driven optimization – Amazon Bedrock generates evaluation reports containing metrics, scores, and insights upon completion of an evaluation job. The team tracked content relevance and content coverage for retrieval, and quality and responsible AI metrics, such as response relevance, factual accuracy, retrieval precision, and contextual comprehension, for generation. They used the evaluation reports to make optimizations until they reached their performance goals.

The following diagram illustrates this approach.

In addition, they integrated RAG evaluation directly into their continuous integration and continuous delivery (CI/CD) pipeline, so every deployment automatically validates that changes don’t degrade response quality. The automated testing approach gives the team confidence to iterate quickly while maintaining consistently high standards for the production solution.
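A deployment gate of that kind can be as simple as polling the evaluation job and failing the pipeline when scores fall below agreed thresholds. The sketch below assumes the metric scores have already been parsed from the report the job writes to Amazon S3; the report layout, status values, and threshold values are illustrative.

import sys
import time
import boto3

bedrock = boto3.client("bedrock")

def wait_for_evaluation(job_arn: str, poll_seconds: int = 60) -> str:
    # Poll the evaluation job until it leaves the in-progress state.
    while True:
        status = bedrock.get_evaluation_job(jobIdentifier=job_arn)["status"]
        if status != "InProgress":
            return status
        time.sleep(poll_seconds)

def gate_deployment(job_arn: str, scores: dict, thresholds: dict) -> None:
    # Fail the CI/CD run if the job failed or any tracked metric is below its threshold.
    if wait_for_evaluation(job_arn) != "Completed":
        sys.exit("Evaluation job did not complete successfully")
    failing = {name: score for name, score in scores.items() if score < thresholds.get(name, 0.0)}
    if failing:
        sys.exit(f"Evaluation metrics below threshold: {failing}")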
Business outcomes
ClientIQ has transformed how TP ICAP extracts value from their CRM data. Following the initial launch with 20 users, the results showed that the solution has driven a 75% reduction in time spent on research tasks. Stakeholders also reported an improvement in insight quality, with more comprehensive and contextual information being surfaced. Building on this success, the TP ICAP Innovation Lab plans to evolve ClientIQ into a more intelligent virtual assistant capable of handling broader, more complex tasks across multiple enterprise systems. Their mission remains consistent: to help technical and non-technical teams across the business to unlock business benefits with generative AI.
Conclusion
In this post, we explored how the TP ICAP Innovation Lab team used Amazon Bedrock FMs, Amazon Bedrock Knowledge Bases, and Amazon Bedrock Evaluations to transform thousands of meeting records from an underutilized resource into a valuable asset and accelerate time to insights while maintaining enterprise-grade security and governance. Their success demonstrates that with the right approach, businesses can implement production-ready AI solutions and deliver business value in weeks. To learn more about building similar solutions with Amazon Bedrock, visit the Amazon Bedrock documentation or discover real-world success stories and implementations on the AWS Financial Services Blog.

About the authors
Ross Ashworth works in TP ICAP’s AI Innovation Lab, where he focuses on enabling the business to harness Generative AI across a range of projects. With over a decade of experience working with AWS technologies, Ross brings deep technical expertise to designing and delivering innovative, practical solutions that drive business value. Outside of work, Ross is a keen cricket fan and former amateur player. He is now a member at The Oval, where he enjoys attending matches with his family, who also share his passion for the sport.
Anastasia Tzeveleka is a Senior Generative AI/ML Specialist Solutions Architect at AWS. Her experience spans the entire AI lifecycle, from collaborating with organizations training cutting-edge Large Language Models (LLMs) to guiding enterprises in deploying and scaling these models for real-world applications. In her spare time, she explores new worlds through fiction.

Principal Financial Group accelerates build, test, and deployment of A …

This guest post was written by Mulay Ahmed and Caroline Lima-Lane of Principal Financial Group. The content and opinions in this post are those of the third-party authors and AWS is not responsible for the content or accuracy of this post.
With US contact centers that handle millions of customer calls annually, Principal Financial Group® wanted to modernize their customer call experience. In the post Principal Financial Group increases Voice Virtual Assistant performance using Genesys, Amazon Lex, and Amazon QuickSight, we discussed the overall Principal Virtual Assistant solution using Genesys Cloud, Amazon Lex V2, multiple AWS services, and a custom reporting and analytics solution using Amazon QuickSight.
This post focuses on the acceleration of the Virtual Assistant (VA) platform delivery processes through automated build, testing, and deployment of an Amazon Lex V2 bot (including other database and analytics resources described later in this post) using a GitHub continuous integration and delivery (CI/CD) pipeline with automated execution of the Amazon Lex V2 Test Workbench for quality assurance. This solution helps Principal® scale and maintain VA implementations with confidence and speed using infrastructure as code (IaC), configuration as code (CaC), and an automated CI/CD approach instead of testing and deploying the Amazon Lex V2 bot on the AWS Management Console.
Principal is a global financial company with nearly 20,000 employees passionate about improving the wealth and well-being of people and businesses. In business for 145 years, Principal is helping approximately 70 million customers (as of Q4 2024) plan, protect, invest, and retire, while working to support the communities where it does business. The enterprise virtual assistant engineering team at Principal, in collaboration with AWS, used Amazon Lex V2 to implement a voice virtual assistant to provide self-service and routing capabilities for contact center customers. The following engineering opportunities were recognized and prioritized:

Elimination of console-driven configuration, testing, and deployment of an Amazon Lex V2 bot
Collaboration through structured version control and parallel development workflows for multiple team members
Acceleration of development cycles with automated build, test, and deployment processes for Amazon Lex bot creation and optimization
Enhanced quality assurance controls through automated testing gates and coding standard validation for reliable releases

With the automation solutions described in this post, as of September 2024, Principal has accelerated development efforts by 50% across all environments (development, pilot, and production) through streamlined implementation and deployment processes. This solution also enhances deployment reliability through automated workflows, providing consistent updates while minimizing errors across development, pilot, and production environments, and maximizes development efficiency by integrating the Test Workbench with GitHub, enabling version control and automated testing. With the automation of the Test Workbench and its integration with GitHub, the solution strengthens the CI/CD pipeline by maintaining alignment between test files and bot versions, creating a more agile and reliable development process.
Solution overview
The solution uses the services described in Principal Financial Group increases Voice Virtual Assistant performance using Genesys, Amazon Lex, and Amazon QuickSight. The following services/APIs are also used as part of this solution:

AWS Step Functions to orchestrate the deployment workflow
The Test Workbench APIs, which are invoked within the Step Functions state machine as a sequence of tasks
AWS Lambda to process data to support some of the Test Workbench API inputs

VA code organization and management
The Principal VA implementation uses Genesys Cloud as the contact center application and the following AWS services organized as different stacks:

Bot stack:

The Amazon Lex V2 CDK is used for defining and deploying the bot infrastructure
Lambda functions handle the bot logic and manage routing logic (for Amazon Lex and Genesys Cloud)
AWS Secrets Manager stores secrets for calling downstream systems endpoints

Testing stack:

Step Functions orchestrates the testing workflow
Lambda functions are used in the testing process
Test files contain test cases and scenarios in Test Workbench format
Simulated data is used to test various scenarios without connecting to downstream systems or APIs

Data stack:

Amazon DynamoDB manages and stores bot prompts
Amazon Simple Storage Service (Amazon S3) stores testing data

Analytics stack:

Amazon S3 stores logs and processed data
Amazon Data Firehose streams logs to Amazon S3
Lambda orchestrates extract, transform, and load (ETL) operations
AWS Glue manages the Data Catalog and ETL jobs
Amazon Athena is used for querying and analyzing analytics data in Amazon S3
Amazon QuickSight is used for data visualization and business intelligence

CI/CD pipeline:

GitHub serves as the source code repository
A GitHub workflow automates the CI/CD pipeline

Amazon Lex V2 configuration as code and CI/CD workflow
The following diagram illustrates how multiple developers can work on changes to the bot stack and test in parallel by deploying changes locally or using a GitHub workflow.

The process consists of the following steps:

A developer clones the repository and creates a new branch for changes.
Developer A or B makes changes to the bot configuration or Lambda functions using code.
The developer creates a pull request.
The developer deploys the Amazon Lex V2 CDK stack through one of the following methods:

Create a pull request and ensure all code quality and standards checks are passing.
Merge it with the main branch.
Deploy the Amazon Lex V2 CDK stack from their local environment.

The developer runs the Test Workbench as part of the CI/CD pipeline or from their local environment using the automation scripts.

Test results are displayed in GitHub Actions and the terminal (if run locally).
The pipeline succeeds only if defined checks such as linting, unit testing, infrastructure and integration testing, and Test Workbench functional testing pass.

After all tests and checks pass, a new pre-release can be drafted to deploy to the staging environment. After staging deployment and testing (automated and UAT) are successful, a new release can be created for production deployment (after manual review and approval).
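Running the Test Workbench from the pipeline or a local environment, as described above, can be scripted roughly as follows: upload the test files to Amazon S3, start the Step Functions state machine with the bot name and file keys as input, and wait for the execution to finish. The bucket, key prefix, and input field names are placeholders, not Principal's actual implementation.

import json
import time
import boto3

s3 = boto3.client("s3")
sfn = boto3.client("stepfunctions")

def run_bot_tests(state_machine_arn: str, bucket: str, bot_name: str, test_files: list) -> dict:
    # Upload the Test Workbench test files so the state machine can import them.
    keys = []
    for path in test_files:
        key = f"test-sets/{path.split('/')[-1]}"
        s3.upload_file(path, bucket, key)
        keys.append(key)

    # Start the testing state machine with the bot name and file keys as input.
    execution = sfn.start_execution(
        stateMachineArn=state_machine_arn,
        input=json.dumps({"botName": bot_name, "testFileKeys": keys}),
    )

    # Wait for the execution to finish so the pipeline step can pass or fail on the outcome.
    while True:
        result = sfn.describe_execution(executionArn=execution["executionArn"])
        if result["status"] != "RUNNING":
            return result
        time.sleep(15)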

Amazon Lex Test Workbench automation
The solution uses GitHub and AWS services, such as Step Functions state machines and Lambda functions, to orchestrate the entire Amazon Lex V2 bot testing process (instead of using the existing manual testing process for Amazon Lex). The pipeline triggers the upload of test sets, Lambda functions that interact with the Amazon Lex V2 bot and Test Workbench, and another Lambda function that reads the test results and surfaces them in the pipeline.
To maintain consistent, repeatable evaluations of your Amazon Lex V2 bots, it’s essential to manage and organize your test datasets effectively. The following key practices help keep test sets up-to-date:

Test set files are version-controlled and linked to each bot and its version
Separate golden test sets are created for each intent and updated on a regular basis to include production customer utterances, increasing intent recognition rates
The versioned test data is deployed as part of each bot deployment in non-production environments

The following diagram illustrates the end-to-end automated process for testing Amazon Lex V2 bots after each deployment.

The post-deployment workflow consists of the following steps:

The developer checks the test file into the GitHub repository (or deploys directly from local). After each bot deployment, GitHub triggers the test script using the GitHub workflow.
The test scripts upload the test files to an S3 bucket.
The test script invokes a Step Functions state machine, using a bot name and list of file keys as inputs.
Amazon Lex Model API calls are invoked to get the bot ID (ListBots) and alias (ListBotAliases).
Each test file key is iterated within a Map state, where the following tasks are executed:

Call Amazon Lex APIs to start import jobs:

StartImport – Creates a test set ID and stores the test set in the specified S3 bucket location.
DescribeImport – Checks whether the StartImport job is complete.

Run the test set:

StartTestExecution – Creates a test execution ID and executes the test.
ListTestExecutions – Gathers all test executions. A Lambda function filters the list for the current test execution ID and its status.

Get test results.

When the test is complete:

The ListTestExecutionResultItems API is invoked to gather overall test results.
The ListTestExecutionResultItems API is invoked to fetch test failure details at the utterance level if present.

A Lambda function orchestrates the final cleanup and reporting:

DeleteTestSet cleans up test sets that are no longer needed from an S3 bucket.
The pipeline outputs the results, and if there are test failures, they are listed in the GitHub Actions or local terminal job report.

Developers then manually review the test result files in the Test Workbench console.
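The following is a simplified sketch of how the Lambda tasks inside the state machine might call the Lex V2 Model APIs to run an already-imported test set and collect overall results. Parameter shapes are abbreviated, and the locale, polling interval, and result handling are assumptions; consult the Lex V2 API reference for the full request and response structures.

import time
import boto3

lex = boto3.client("lexv2-models")

def run_test_set(bot_name: str, bot_alias_name: str, test_set_id: str) -> list:
    # Resolve the bot and alias IDs (ListBots / ListBotAliases).
    bots = lex.list_bots()["botSummaries"]
    bot_id = next(b["botId"] for b in bots if b["botName"] == bot_name)
    aliases = lex.list_bot_aliases(botId=bot_id)["botAliasSummaries"]
    alias_id = next(a["botAliasId"] for a in aliases if a["botAliasName"] == bot_alias_name)

    # Execute the test set against the bot alias (StartTestExecution).
    execution = lex.start_test_execution(
        testSetId=test_set_id,
        target={"botAliasTarget": {"botId": bot_id, "botAliasId": alias_id, "localeId": "en_US"}},
        apiMode="NonStreaming",
    )
    execution_id = execution["testExecutionId"]

    # Poll until the execution finishes, then gather overall results.
    while lex.describe_test_execution(testExecutionId=execution_id)["testExecutionStatus"] in ("Pending", "InProgress"):
        time.sleep(30)
    results = lex.list_test_execution_result_items(
        testExecutionId=execution_id,
        resultFilterBy={"resultTypeFilter": "OverallTestResults"},
    )
    return results["testExecutionResults"]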

Conclusion
In this post, we presented how Principal accelerated the development, testing, and deployment of Amazon Lex V2 bots and supporting AWS services using code. In addition to the reporting and analytics solution, this provides a robust solution for the continued enhancement and maintenance of the Virtual Assistant ecosystem.
By automating Test Workbench processes and integrating them with version control and CI/CD processes, Principal was able to decrease testing and deployment time, increase test coverage, streamline their development workflows, and deliver a quality conversational experience to customers. For a deeper dive into other relevant services, refer to Evaluating Lex V2 bot performance with the Test Workbench.
AWS and Amazon are not affiliates of any company of the Principal Financial Group. This communication is intended to be educational in nature and is not intended to be taken as a recommendation. Insurance products issued by Principal National Life Insurance Co (except in NY) and Principal Life Insurance Company. Plan administrative services offered by Principal Life. Principal Funds, Inc. is distributed by Principal Funds Distributor, Inc. Securities offered through Principal Securities, Inc., member SIPC and/or independent broker/dealers. Referenced companies are members of the Principal Financial Group, Des Moines, IA 50392. ©2025 Principal Financial Services, Inc. 4373397-042025

About the authors
Mulay Ahmed is a Solutions Architect at Principal with expertise in architecting complex enterprise-grade solutions, including AWS Cloud implementations.
Caroline Lima-Lane is a Software Engineer at Principal with a vast background in the AWS Cloud space.

Beyond vibes: How to properly select the right LLM for the right task

Choosing the right large language model (LLM) for your use case is becoming both increasingly challenging and essential. Many teams rely on one-time (ad hoc) evaluations based on limited samples from trending models, essentially judging quality on “vibes” alone.
This approach involves experimenting with a model’s responses and forming subjective opinions about its performance. However, relying on these informal tests of model output is risky and unscalable, often misses subtle errors, overlooks unsafe behavior, and provides no clear criteria for improvement.
A more holistic approach entails evaluating the model based on metrics around qualitative and quantitative aspects, such as quality of response, cost, and performance. This also requires the evaluation system to compare models based on these predefined metrics and give a comprehensive output comparing models across all these areas. However, when performed manually, these evaluations don't scale well enough to help organizations take full advantage of the model choices available.
In this post, we discuss an approach that can guide you to build comprehensive and empirically driven evaluations that can help you make better decisions when selecting the right model for your task.
From vibes to metrics and why it matters
Human brains excel at pattern-matching, and models are designed to be convincing. Although a vibes-based approach can serve as a starting point, without systematic evaluation, we lack the evidence needed to trust a model in production. This limitation makes it difficult to compare models fairly or identify specific areas for improvement.
The limitations of “just trying it out” include:

Subjective bias – Human testers might favor responses based on style or tone rather than factual accuracy. Users can be swayed by “exotic words” or formatting. A model whose writing sounds confident might win on vibes while actually introducing inaccuracies.
Lack of coverage – A few interactive prompts won’t cover the breadth of real-world inputs, often missing edge cases that reveal model weaknesses.
Inconsistency – Without defined metrics, evaluators might disagree on why one model is better based on different priorities (brevity vs. factual detail), making it difficult to align model choice with business goals.
No trackable benchmarks – Without quantitative metrics, it’s impossible to track accuracy degradation during prompt optimization or model changes.

Established benchmarks like MMLU, HellaSwag, and HELM offer valuable standardized assessments across reasoning, knowledge retrieval, and factuality dimensions, efficiently helping narrow down candidate models without extensive internal resources.
However, exclusive reliance on these benchmarks is problematic: they measure generalized rather than domain-specific performance, prioritize easily quantifiable metrics over business-critical capabilities, and can’t account for your organization’s unique constraints around latency, costs, and safety requirements. A high-ranking model might excel at trivia while failing with your industry terminology or producing responses too verbose or costly for your specific implementation.
A robust evaluation framework is vital for building trust, because no single metric can capture what makes an LLM response “good.” Instead, you must evaluate across multiple dimensions:

Accuracy – Does the model produce accurate information? Does it fully answer the question or cover required points? Is the response on-topic, contextually relevant, well-structured, and logically coherent?
Latency – How fast does the model produce a response? For interactive applications, response time directly impacts user experience.
Cost-efficiency – What is the monetary cost per API call or token? Different models have varying pricing structures and infrastructure costs.

By evaluating along these facets, you can make informed decisions aligned with product requirements. For example, if robustness under adversarial inputs is crucial, a slightly slower but more aligned model might be preferable. For simple internal tasks, trading some accuracy for cost-efficiency might make sense.
Although many metrics require qualitative judgment, you can structure and quantify these with careful evaluation methods. Industry best practices combine quantitative metrics with human or AI raters for subjective criteria, moving from “I like this answer more” to “Model A scored 4/5 on correctness and 5/5 on completeness.” This detail enables meaningful discussion and improvement, and technical managers should demand such accuracy measurements before deploying any model.
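One simple way to move from "I like this answer more" to a number you can track is a weighted composite of the per-dimension scores. The weights and example values below are purely illustrative.

def composite_score(scores: dict, weights: dict) -> float:
    # Weighted average of normalized (0-1, higher is better) per-dimension scores.
    total_weight = sum(weights.values())
    return sum(scores[name] * weight for name, weight in weights.items()) / total_weight

# Example: accuracy matters most for this product; latency and cost share the rest.
print(composite_score(
    {"accuracy": 0.92, "latency": 0.70, "cost_efficiency": 0.55},
    {"accuracy": 0.6, "latency": 0.2, "cost_efficiency": 0.2},
))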
Unique evaluation dimensions for LLM performance
In this post, we make the case for structured, multi-metric assessment of foundation models (FMs) and discuss the importance of creating ground truth as a prerequisite to model evaluation. We use the open source 360-Eval framework as a practical, code-first tool to orchestrate rigorous evaluations across multiple models and cloud providers.
We show the approach by comparing four LLMs within Amazon Bedrock, across a spectrum of correctness, completeness, relevance, format, coherence, and instruction following, to understand how well each model's responses match our ground truth dataset. Our evaluation measures the accuracy, latency, and cost for each model, painting a 360° picture of their strengths and weaknesses.
To evaluate FMs, it’s highly recommended that you break up model performance into distinct dimensions. The following is a sample set of criteria and what each one measures:

Correctness (accuracy) – The factual accuracy of the model’s output. For tasks with a known answer, you can measure this using exact match or cosine similarity; for open-ended responses, you might rely on human or LLM judgment of factual consistency.
Completeness – The extent to which the model’s response addresses all parts of the query or problem. In human/LLM evaluations, completeness is often scored on a scale (did the answer partly address or fully address the query).
Relevance – Measures if the content of the response is on-topic and pertinent to the user’s request. Relevance scoring looks at how well the response stays within scope. High relevance means the model understood the query and stayed focused on it.
Coherence – The logical flow and clarity of the response. Coherence can be judged by human or LLM evaluators, or approximated with metrics like coherence scores or by checking discourse structure.
Following instructions – How well the model obeys explicit instructions in the prompt (formatting, style, length, and so on). For example, if asked “List three bullet-point advantages,” does the model produce a three-item bullet list? If the system or user prompt sets a role or tone, does the model adhere to it? Instruction-following can be evaluated by programmatically checking if the output meets the specified criteria (for example, contains the required sections) or using evaluator ratings.

Performing such comprehensive evaluations manually can be extremely time-consuming. Each model needs to be run on dozens, if not hundreds, of prompts, and each output must be checked against all metrics. Doing this by hand or writing one-off scripts is error-prone and doesn't scale. In practice, these dimensions can be evaluated automatically using LLM-as-a-judge or human feedback. This is where evaluation frameworks come into play.
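To make the LLM-as-a-judge idea concrete before turning to a framework, the following is a minimal sketch using the Amazon Bedrock Converse API. The rubric, scoring scale, and judge model ID are assumptions for illustration.

import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

JUDGE_PROMPT = """You are grading a model response against a golden answer.
Score correctness and completeness from 1 to 5 and reply with JSON only:
{{"correctness": <int>, "completeness": <int>, "rationale": "<short text>"}}

Question: {question}
Golden answer: {golden}
Model response: {response}"""

def judge(question: str, golden: str, response: str,
          judge_model_id: str = "anthropic.claude-3-5-sonnet-20240620-v1:0") -> dict:
    # Ask the judge model for structured scores on a single test case.
    result = bedrock_runtime.converse(
        modelId=judge_model_id,
        messages=[{"role": "user", "content": [{"text": JUDGE_PROMPT.format(
            question=question, golden=golden, response=response)}]}],
        inferenceConfig={"maxTokens": 300, "temperature": 0},
    )
    return json.loads(result["output"]["message"]["content"][0]["text"])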
After you’ve chosen an evaluation philosophy, it’s wise to invest in tooling to support it. Instead of combining ad hoc evaluation scripts, you can use dedicated frameworks to streamline the process of testing LLMs across many metrics and models.
Automating 360° model evaluation with 360-Eval
360-Eval is a lightweight solution that captures the depth and breadth of model evaluation. You can use it as an evaluation orchestrator to define the following:

Your dataset of test prompts and respective golden answers (expected answers or reference outputs)
Models you want to evaluate
The metrics and tasks the framework evaluates the models against

The tool is designed to capture relevant and user-defined dimensions of model performance in one workflow, supporting multi-model comparisons out of the box. You can evaluate models hosted in Amazon Bedrock or Amazon SageMaker, or call external APIs—the framework is flexible in integrating different model endpoints. This is ideal for a scenario where you might want to use the full power of Amazon Bedrock models without having to sacrifice performance.
The framework consists of the following key components:

Data configuration – You specify your evaluation dataset; for example, a JSONL file of prompts with optional expected outputs, the task, and a description. The framework can also work with a custom prompt CSV dataset you provide.
API gateway – Using the versatile LiteLLM framework, it abstracts the API differences so the evaluation loop can treat all models uniformly. Inference metadata such as time-to-first-token (TTFT), time-to-last-token (TTLT), total token output, API error count, and pricing is also captured.
Evaluation architecture – 360-Eval uses LLM-as-a-judge to score and weight model outputs on qualities like correctness or relevance. You can feed all the metrics you care about into one pipeline. Each evaluation algorithm produces a score and verdict per test case per model.

Choosing the right model: A real-world example
For our example use case, AnyCompany is developing an innovative software as a service (SaaS) solution that streamlines database architecture for developers and businesses. Their platform accepts natural language requirements as input and uses LLMs to automatically generate PostgreSQL-specific data models. Users can describe their requirements in plain English—for example, “I need a cloud-based order management platform designed to streamline operations for small to medium businesses”—and the tool intelligently extracts the entity and attribute information and creates an optimized table structure specifically for PostgreSQL. This solution avoids hours of manual entity and database design work, reduces the expertise barrier for database modeling, and supports PostgreSQL best practices even for teams without dedicated database specialists.
In our example, we provide our model a set of requirements (as prompts) relevant to the task and ask it to extract the dominant entity and its attributes (a data extraction task) and also produce a relevant create table statement using PostgreSQL (a text-to-SQL task).
Example prompt:

Given the following requirement, extract the data model and attributes that you will
recommend. I need the output in a single line. You can provide the attributes separated
by comma: “A global manufacturing company uses a web-based supply chain management
system to track inventory across 50 locations, manage relationships with over 200
suppliers, forecast material needs, and automatically trigger purchase orders when stock
levels reach predefined thresholds…”

The following table shows our task types, criteria, and golden answers for this example prompt. We have shortened the prompt for brevity. In a real-world use case, your requirements might span multiple paragraphs.

task_type
task_criteria
golden_answer

DATA EXTRACTION
Check if the extracted entity and attributes match the requirements

Supply Chain Inventory: inventory_id, product_sku,
location_id, quantity_on_hand, reorder_threshold,
supplier_id, last_order_date, forecasted_demand,
cost_per_unit, status, last_updated

TEXT-TO-SQL
Given the requirements, check if the generated CREATE TABLE statement matches the requirements

CREATE TABLE supply_chain_inventory (
inventory_id SERIAL PRIMARY KEY,
product_sku VARCHAR(50) NOT NULL,
location_id INTEGER NOT NULL,
quantity_on_hand INTEGER NOT NULL,
reorder_threshold INTEGER NOT NULL,
supplier_id INTEGER,
last_order_date TIMESTAMP,
forecasted_demand NUMERIC(10,2),
cost_per_unit NUMERIC(10,2),
status VARCHAR(20),
last_updated TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
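Assembling the dataset programmatically can look like the sketch below. The column names follow the table above plus a prompt column; check the 360-Eval documentation for the exact schema the framework expects, because this layout is an assumption.

import csv

rows = [
    {
        "task_type": "DATA EXTRACTION",
        "task_criteria": "Check if the extracted entity and attributes match the requirements",
        "prompt": "A global manufacturing company uses a web-based supply chain management system ...",
        "golden_answer": "Supply Chain Inventory: inventory_id, product_sku, location_id, ...",
    },
]

with open("eval_dataset.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["task_type", "task_criteria", "prompt", "golden_answer"])
    writer.writeheader()
    writer.writerows(rows)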

AnyCompany wants to find a model that will solve the task in the fastest and most cost-effective way, without compromising on quality.
360-Eval UI
To reduce the complexity of the process, we have built a UI on top of the evaluation engine.
The UI_README.md file has instructions to launch and run the evaluation using the UI. You must also follow the instructions in the README.md to install the Python packages as prerequisites and enable Amazon Bedrock model access.
Let’s explore the different pages in the UI in more detail.
Setup page
As you launch the UI, you land on the initial Setup page. Here you select your evaluation data, define your label, describe your task as precisely as possible, and set the temperature the models will use when being evaluated. You then select the models you want to evaluate against your dataset and the judges that will evaluate the models’ accuracy (using custom metrics and the standard quality and relevance metrics), configure pricing and AWS Region options, and finally configure how you want the evaluation to take place, such as concurrency, requests per minute, and experiment counts (unique runs).

This is where you specify the CSV file with sample prompts, task type, and task criteria according to your needs.
Monitor page
After the evaluation criteria and parameters are defined, they are displayed on the Monitor page, which you can navigate to by choosing Monitor in the Navigation section. On this page, you can monitor all your evaluations, including those currently running, those queued, and those not yet scheduled to run. You can choose the evaluation you want to run, and if any evaluation is no longer relevant, you can remove it here as well.
The workflow is as follows:

Execute the prompts in the input file against the models selected.
Capture the metrics such as input token count, output token count, and TTFT.
Use the input and output tokens to calculate the cost of running each prompt against the models.
Use an LLM-as-a-judge to evaluate the accuracy against predefined metrics (correctness, completeness, relevance, format, coherence, following instructions) and any user-defined metrics.
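The cost calculation described above reduces to simple token arithmetic. The prices in this sketch are placeholders, not actual model pricing.

def prompt_cost_usd(input_tokens: int, output_tokens: int,
                    price_per_1k_input: float, price_per_1k_output: float) -> float:
    # Cost of a single prompt/response pair based on per-1,000-token prices.
    return (input_tokens / 1000) * price_per_1k_input + (output_tokens / 1000) * price_per_1k_output

# Example with made-up prices: 1,200 input tokens and 350 output tokens.
print(round(prompt_cost_usd(1200, 350, price_per_1k_input=0.003, price_per_1k_output=0.015), 6))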

Evaluations page
Detailed information about the evaluations, such as the evaluation configuration, the judge models used, the Regions where the models are hosted, the input and output cost, and the task and criteria each model was evaluated with, is displayed on the Evaluations page.

Reports page
Lastly, the Reports page is where you can select the completed evaluations to generate a report in HTML format. You can also delete old and irrelevant reports.

Understanding the evaluation report
The tool output is an HTML file that shows the results of the evaluation. It includes the following sections:

Executive Summary – This section provides an overall summary of the results: which model was most accurate, which model was the fastest overall, and which model provided the best success-to-cost ratio.
Recommendations – This section contains more details and a breakdown of what you see in the executive summary, in a tabular format.
Latency Metrics – In this section, you can review the performance aspect of your evaluation. We use TTFT and output tokens per second as measures of performance.
Cost Metrics – This section shows the overall cost of running the evaluation, which indicates what you can expect in your AWS billing.
Task Analysis – The tool further breaks down the performance and cost metrics by task type. In our case, there will be a section for the text-to-SQL task and one for data extraction.
Judge Scores Analysis – In this section, you can review the quality of each model based on the various metrics. You can also explore prompt optimizations to improve your model. In our case, our prompts were more biased towards the Anthropic family, but if you use the Amazon Bedrock prompt optimization feature, you might be able to address this bias.

Interpreting the evaluation results
By using the 360-Eval UI, AnyCompany ran the evaluation with their own dataset and got the following results. They chose four different LLMs in Amazon Bedrock to conduct the evaluation. For this post, the exact models used aren’t relevant. We call these models Model-A, Model-B, Model-C, and Model-D.
These results will vary in your case depending on the dataset and prompts. The results here are a reflection of our own example within a test account. As shown in the following figures, Model-A was the fastest, followed by Model-B. Model-C was 3–4 times slower than Model-A. Model-D was the slowest.

As shown in the following figure, Model-B was the cheapest. Model-A was three times more expensive than Model-B. Model-C and Model-D were both very expensive.

The next focus was the quality of the evaluation. The two most important metrics were the correctness and completeness of the response. In the following evaluation, only Model-D scored more than 3 for both task types.

Model-C was the next closest contender.

Model-B scored lowest in the correctness and completeness metrics.

Model-A missed slightly on the completeness for the text-to-SQL use case.

Evaluation summary
Let's revisit AnyCompany's criteria: to find a model that would solve the task in the fastest and most cost-effective way, without compromising on quality. There was no obvious winner.
AnyCompany then considered providing a tiered pricing model to their customers. Premium-tier customers will receive the most accurate model at a premium price, and basic-tier customers will get the model with the best price-performance.
Although, for this use case, Model-D was the slowest and most expensive, it scored highest on the most crucial metrics: correctness and completeness of responses. For a database modeling tool, accuracy is far more important than speed or cost, because incorrect database schemas might lead to significant downstream issues in application development. AnyCompany chose Model-D for premium-tier customers.
Cost is a major constraint for the basic-tier, so AnyCompany chose Model-A, because it scored reasonably well on correctness for both tasks and only slightly missed on completeness for one task type, while being faster and less expensive than the top performers.
AnyCompany also considered Model-B as a viable option for free-tier customers.
Conclusion
As organizations become more reliant on FMs, the models are also becoming more complex, and their strengths and weaknesses are more difficult to detect, so evaluating them requires a systematic approach. By using a data-driven, multi-metric evaluation, technical leaders can make informed decisions rooted in the model's actual performance, including factual accuracy, user experience, compliance, and cost.
Adopting frameworks like 360-Eval can operationalize this approach. You can encode your evaluation philosophy into a standardized procedure, making sure every new model or version is judged the same way, and enabling side-by-side comparisons.
The framework handles the heavy lifting of running models on test cases and computing metrics, so your team can focus on interpreting results and making decisions. As the field of generative AI continues to evolve rapidly, having this evaluation infrastructure can help you find the right model for your use case. Furthermore, this approach can enable faster iteration on prompts and policies, and ultimately help you develop more reliable and effective AI systems in production.

About the authors
Claudio Mazzoni is a Sr Specialist Solutions Architect on the Amazon Bedrock GTM team. Claudio excels at guiding customers through their generative AI journey. Outside of work, Claudio enjoys spending time with family, working in his garden, and cooking Uruguayan food.
Anubhav Sharma is a Principal Solutions Architect at AWS with over 2 decades of experience in coding and architecting business-critical applications. Known for his strong desire to learn and innovate, Anubhav has spent the past 6 years at AWS working closely with multiple independent software vendors (ISVs) and enterprises. He specializes in guiding these companies through their journey of building, deploying, and operating SaaS solutions on AWS.

Qualifire AI Open-Sources Rogue: An End-to-End Agentic AI Testing Framework Designed to Evaluate the Performance, Compliance, and Reliability of AI Agents

Agentic systems are stochastic, context-dependent, and policy-bounded. Conventional QA—unit tests, static prompts, or scalar “LLM-as-a-judge” scores—fails to expose multi-turn vulnerabilities and provides weak audit trails. Developer teams need protocol-accurate conversations, explicit policy checks, and machine-readable evidence that can gate releases with confidence.

Qualifire AI has open-sourced Rogue, a Python framework that evaluates AI agents over the Agent-to-Agent (A2A) protocol. Rogue converts business policies into executable scenarios, drives multi-turn interactions against a target agent, and outputs deterministic reports suitable for CI/CD and compliance reviews.

Quick Start

Prerequisites

uvx – If not installed, follow uv installation guide

Python 3.10+

An API key for an LLM provider (e.g., OpenAI, Google, Anthropic).

Installation

Option 1: Quick Install (Recommended)

Use our automated install script to get up and running quickly:

# TUI
uvx rogue-ai
# Web UI
uvx rogue-ai ui
# CLI / CI/CD
uvx rogue-ai cli

Option 2: Manual Installation

(a) Clone the repository:

git clone https://github.com/qualifire-dev/rogue.git
cd rogue

(b) Install dependencies:

If you are using uv:

uv sync

Or, if you are using pip:

pip install -e .

(c) OPTIONALLY: Set up your environment variables: Create a .env file in the root directory and add your API keys. Rogue uses LiteLLM, so you can set keys for various providers.

OPENAI_API_KEY="sk-…"
ANTHROPIC_API_KEY="sk-…"
GOOGLE_API_KEY="…"

Running Rogue

Rogue operates on a client-server architecture where the core evaluation logic runs in a backend server, and various clients connect to it for different interfaces.

Default Behavior

When you run uvx rogue-ai without any mode specified, it:

Starts the Rogue server in the background

Launches the TUI (Terminal User Interface) client

uvx rogue-ai

Available Modes

Default (Server + TUI): uvx rogue-ai – Starts server in background + TUI client

Server: uvx rogue-ai server – Runs only the backend server

TUI: uvx rogue-ai tui – Runs only the TUI client (requires server running)

Web UI: uvx rogue-ai ui – Runs only the Gradio web interface client (requires server running)

CLI: uvx rogue-ai cli – Runs non-interactive command-line evaluation (requires server running, ideal for CI/CD)

Mode Arguments

Server Mode

uvx rogue-ai server [OPTIONS]

Options:

--host HOST – Host to run the server on (default: 127.0.0.1 or HOST env var)

--port PORT – Port to run the server on (default: 8000 or PORT env var)

--debug – Enable debug logging

TUI Mode

uvx rogue-ai tui [OPTIONS]

Web UI Mode

uvx rogue-ai ui [OPTIONS]

Options:

--rogue-server-url URL – Rogue server URL (default: http://localhost:8000)

--port PORT – Port to run the UI on

--workdir WORKDIR – Working directory (default: ./.rogue)

--debug – Enable debug logging

Example: Testing the T-Shirt Store Agent

This repository includes a simple example agent that sells T-shirts. You can use it to see Rogue in action.

Install example dependencies:

If you are using uv:

uv sync --group examples

or, if you are using pip:

pip install -e .[examples]

(a) Start the example agent server in a separate terminal:

If you are using uv:

uv run examples/tshirt_store_agent

If not:

python examples/tshirt_store_agent

This will start the agent on http://localhost:10001.

(b) Configure Rogue in the UI to point to the example agent:

Agent URL: http://localhost:10001

Authentication: no-auth

(c) Run the evaluation and watch Rogue test the T-Shirt agent’s policies!

You can use either the TUI (uvx rogue-ai) or Web UI (uvx rogue-ai ui) mode.

Where Rogue Fits: Practical Use Cases

Safety & Compliance Hardening: Validate PII/PHI handling, refusal behavior, secret-leak prevention, and regulated-domain policies with transcript-anchored evidence.

E-Commerce & Support Agents: Enforce OTP-gated discounts, refund rules, SLA-aware escalation, and tool-use correctness (order lookup, ticketing) under adversarial and failure conditions.

Developer/DevOps Agents: Assess code-mod and CLI copilots for workspace confinement, rollback semantics, rate-limit/backoff behavior, and unsafe command prevention.

Multi-Agent Systems: Verify planner-executor contracts, capability negotiation, and schema conformance over A2A; evaluate interoperability across heterogeneous frameworks.

Regression & Drift Monitoring: Nightly suites against new model versions or prompt changes; detect behavioral drift and enforce policy-critical pass criteria before release.

What Exactly Is Rogue—and Why Should Agent Dev Teams Care?

Rogue is an end-to-end testing framework designed to evaluate the performance, compliance, and reliability of AI agents. Rogue synthesizes business context and risk into structured tests with clear objectives, tactics, and success criteria. The EvaluatorAgent runs protocol-correct conversations in fast single-turn or deep multi-turn adversarial modes. Bring your own model, or let Rogue use Qualifire's bespoke SLM judges to drive the tests. It provides streaming observability and deterministic artifacts: live transcripts, pass/fail verdicts, rationales tied to transcript spans, timing, and model/version lineage.

Under the Hood: How Rogue Is Built

Rogue operates on a client-server architecture:

Rogue Server: Contains the core evaluation logic

Client Interfaces: Multiple interfaces that connect to the server:

TUI (Terminal UI): Modern terminal interface built with Go and Bubble Tea

Web UI: Gradio-based web interface

CLI: Command-line interface for automated evaluation and CI/CD

This architecture allows for flexible deployment and usage patterns, where the server can run independently and multiple clients can connect to it simultaneously.

Summary

Rogue helps developer teams test agent behavior the way it actually runs in production. It turns written policies into concrete scenarios, exercises those scenarios over A2A, and records what happened with transcripts you can audit. The result is a clear, repeatable signal you can use in CI/CD to catch policy breaks and regressions before they ship.

Find Rogue on GitHub

Thanks to the Qualifire team for the thought leadership and resources for this article. The Qualifire team has supported this content.
The post Qualifire AI Open-Sources Rogue: An End-to-End Agentic AI Testing Framework Designed to Evaluate the Performance, Compliance, and Reliability of AI Agents appeared first on MarkTechPost.

QeRL: NVFP4-Quantized Reinforcement Learning (RL) Brings 32B LLM Training to a Single H100—While Improving Exploration

What would you build if you could run Reinforcement Learning (RL) post-training on a 32B LLM in 4-bit NVFP4—on a single H100—with BF16-level accuracy and 1.2–1.5× step speedups? NVIDIA researchers (with collaborators from MIT, HKU, and Tsinghua) have open-sourced QeRL (Quantization-enhanced Reinforcement Learning), a training framework that pushes Reinforcement Learning (RL) post-training into 4-bit FP4 (NVFP4) while keeping gradient math in higher precision via LoRA. The research team reports >1.5× speedups in the rollout phase, ~1.8× end-to-end vs QLoRA in one setting, and the first demonstration of RL training for a 32B policy on a single H100-80GB GPU.

https://arxiv.org/pdf/2510.11696

What QeRL changes in the Reinforcement Learning (RL) loop?

Most RLHF/GRPO/DAPO pipelines spend the bulk of wall-clock time in rollouts (token generation). QeRL shifts the policy’s weight path to NVFP4 (FP4) with dual-level scaling and keeps logits/gradients in higher precision via LoRA, so backprop remains stable while the sampling path hits hardware-efficient FP4×BF16 kernels (Marlin). The result is faster prefill/decoding during rollouts without maintaining a separate full-precision policy.

Mechanically, the research team integrates Marlin-based FP4 kernels in both rollout and prefill, while LoRA limits trainable parameters. This directly targets the stage that dominates RL cost and latency for long reasoning traces.
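As a back-of-the-envelope illustration of NVFP4's dual-level scaling, the sketch below applies a per-block scale and a per-tensor scale to already-decoded FP4 values. It omits the 4-bit E2M1 decode and is not the fused Marlin kernel path; the numbers are arbitrary.

import numpy as np

def dequantize_block(fp4_values: np.ndarray, block_scale: float, tensor_scale: float) -> np.ndarray:
    # weight = decoded FP4 value * per-block scale (stored in FP8 E4M3) * per-tensor scale (FP32)
    return fp4_values.astype(np.float32) * np.float32(block_scale) * np.float32(tensor_scale)

# Example: a 16-element block with a block scale of 0.5 and a tensor scale of 0.01.
block = np.array([1.0, -2.0, 0.5, 3.0] * 4)
print(dequantize_block(block, block_scale=0.5, tensor_scale=0.01))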

https://arxiv.org/pdf/2510.11696

Quantization as exploration, made schedulable

A core empirical finding: deterministic FP4 quantization raises policy entropy, flattening token distributions early in training and improving exploration versus 16-bit LoRA and NF4-based QLoRA baselines. To control that effect over time, QeRL introduces Adaptive Quantization Noise (AQN)—channel-wise Gaussian perturbations mapped into LayerNorm scale parameters and annealed with an exponential schedule. This keeps kernel fusion intact (no extra weight tensors) while transitioning from exploration to exploitation.
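A minimal, framework-agnostic sketch of the AQN idea follows: sample channel-wise Gaussian noise, anneal its scale exponentially over training, and fold it into the norm-layer scale vector so the quantized weight tensors and fused kernels stay untouched. The starting and ending noise scales are placeholders; the actual QeRL schedule and implementation may differ.

import torch

def aqn_sigma(step: int, total_steps: int, sigma_start: float = 1e-2, sigma_end: float = 1e-4) -> float:
    # Exponential annealing from sigma_start down to sigma_end over training.
    ratio = sigma_end / sigma_start
    return sigma_start * (ratio ** (step / max(total_steps - 1, 1)))

def apply_aqn(norm_scale: torch.Tensor, step: int, total_steps: int) -> torch.Tensor:
    # Channel-wise Gaussian perturbation mapped into the LayerNorm/RMSNorm scale parameters.
    sigma = aqn_sigma(step, total_steps)
    noise = torch.randn_like(norm_scale) * sigma
    return norm_scale * (1.0 + noise)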

In ablations, QeRL shows faster reward growth and higher final scores on math-reasoning tasks under both GRPO and DAPO, aligning with the hypothesis that structured noise in parameter space can be a useful exploration driver in RL, even though such noise is typically detrimental in supervised fine-tuning.

Reported results

On Qwen2.5 backbone models, the research team shows that NVFP4+LoRA outperforms vanilla LoRA and QLoRA in rollout throughput and overall training time, with >2× rollout throughput on 14B/32B models against QLoRA and a ~1.8× end-to-end speedup vs QLoRA in a representative setup. They also demonstrate training a 32B policy with GRPO on a single H100-80GB, enabled by the lower memory footprint of weight-only FP4.

Accuracy is competitive with higher-precision baselines. For a 7B model, the research team reports GSM8K = 90.8% and MATH500 = 77.4%, surpassing 16-bit LoRA and QLoRA under their setup and matching full-parameter fine-tuning. Across broader math benchmarks (e.g., BigMath), QeRL maintains parity or advantage, while converging faster due to improved exploration.


What this is, and what it isn't

QeRL is weight-only FP4 with LoRA updates; it does not claim FP4 precision for logits/gradients. The benefits concentrate in rollout/prefill throughput and memory footprint, with empirical evidence that quantization-induced entropy aids RL exploration when AQN modulates it over training. Generalization to modalities beyond math-reasoning tasks or to safety/tool-use RL depends on reward design and sequence lengths.

Key Takeaways

QeRL combines NVFP4 4-bit weight quantization with LoRA to accelerate the rollout phase and cut memory, enabling RL for a 32B LLM on a single H100-80GB.

Quantization acts as exploration: FP4 increases policy entropy, while Adaptive Quantization Noise (AQN) schedules channel-wise noise via LayerNorm scales.

Reported efficiency: >1.5× rollout speedups vs 16-bit LoRA and ~1.8× end-to-end vs QLoRA; >2× rollout throughput vs QLoRA on 14B/32B setups.

Accuracy holds: Qwen2.5-7B reaches 90.8% on GSM8K and 77.4% on MATH500, matching full-parameter fine-tuning under the paper’s setup.

NVFP4 is a hardware-optimized 4-bit floating format with two-level scaling (FP8 E4M3 block scalers + FP32 tensor scale), enabling efficient Marlin-based kernels; an illustrative dequantization sketch follows below.
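As a rough illustration of that two-level scaling, the sketch below dequantizes already-decoded FP4 values with per-block and per-tensor scales. The 16-element block size and the unpacked float representation are simplifying assumptions, not the actual kernel layout.

import numpy as np

def nvfp4_dequantize(fp4_values: np.ndarray, block_scales: np.ndarray,
                     tensor_scale: float, block_size: int = 16) -> np.ndarray:
    # fp4_values: elements already decoded from the 4-bit floating-point code book
    # block_scales: one scale per block of block_size elements (FP8 E4M3 on hardware, float here)
    # tensor_scale: a single FP32 scale shared by the whole tensor
    blocks = fp4_values.reshape(-1, block_size)  # length must be divisible by block_size
    return (blocks * block_scales[:, None] * tensor_scale).reshape(fp4_values.shape)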

Editorial Comments

QeRL speeds up the RL rollout stage. It quantizes weights to NVFP4 and keeps updates and logits in higher precision using LoRA. It reports >1.5× rollout speedups and can train a 32B policy on a single H100-80GB GPU. It adds Adaptive Quantization Noise to make exploration a controlled signal during training. Results are shown mainly on math-reasoning tasks using GRPO and DAPO. The gains rely on NVFP4 kernel support such as Marlin.

Check out the FULL CODES here and the Paper.

Iterative fine-tuning on Amazon Bedrock for strategic model improvemen …

Organizations often face challenges when implementing single-shot fine-tuning approaches for their generative AI models. The single-shot fine-tuning method involves selecting training data, configuring hyperparameters, and hoping the results meet expectations without the ability to make incremental adjustments. Single-shot fine-tuning frequently leads to suboptimal results and requires starting the entire process from scratch when improvements are needed.
Amazon Bedrock now supports iterative fine-tuning, enabling systematic model refinement through controlled, incremental training rounds. With this capability you can build upon previously customized models, whether they were created through fine-tuning or distillation, providing a foundation for continuous improvement without the risks associated with complete retraining.
In this post, we will explore how to implement the iterative fine-tuning capability of Amazon Bedrock to systematically improve your AI models. We’ll cover the key advantages over single-shot approaches, walk through practical implementation using both the console and SDK, discuss deployment options, and share best practices for maximizing your iterative fine-tuning results.
When to use iterative fine-tuning
Iterative fine-tuning provides several advantages over single-shot approaches that make it valuable for production environments. Risk mitigation becomes possible through incremental improvements, so you can test and validate changes before committing to larger modifications. With this approach, you can make data-driven optimization based on real performance feedback rather than theoretical assumptions about what might work. The methodology also helps developers to apply different training techniques sequentially to refine model behavior. Most importantly, iterative fine-tuning accommodates evolving business requirements driven by continuous live data traffic. As user patterns change over time and new use cases emerge that weren’t present in initial training, you can leverage this fresh data to refine your model’s performance without starting from scratch.
How to implement iterative fine-tuning on Amazon Bedrock
Setting up iterative fine-tuning involves preparing your environment and creating training jobs that build upon your existing custom models, whether through the console interface or programmatically using the SDK.
Prerequisites
Before beginning iterative fine-tuning, you need a previously customized model as your starting point. This base model can originate from either fine-tuning or distillation processes and supports customizable models and variants available on Amazon Bedrock. You’ll also need:

Standard IAM permissions for Amazon Bedrock model customization
Incremental training data focused on addressing specific performance gaps
S3 bucket for training data and job outputs

Your incremental training data should target the specific areas where your current model needs improvement rather than attempting to retrain on all possible scenarios.
Using the AWS Management Console
The Amazon Bedrock console provides a straightforward interface for creating iterative fine-tuning jobs.
Navigate to the Custom Models section and select Create fine-tuning job. The key difference in iterative fine-tuning lies in the base model selection, where you choose your previously customized model instead of a foundation model.
During training, you can visit the Custom models page in the Amazon Bedrock console to track the job status.
Once complete, you can monitor your job's performance metrics in the console through multiple metric charts, on the Training metrics and Validation metrics tabs.
Using the SDK
Programmatic implementation of iterative fine-tuning follows similar patterns to standard fine-tuning with one critical difference: specifying your previously customized model as the base model identifier. Here’s an example implementation:

import boto3
from datetime import datetime
import uuid

# Initialize Bedrock client
bedrock = boto3.client('bedrock')

# Define job parameters
job_name = f"iterative-finetuning-{datetime.now().strftime('%Y-%m-%d-%H-%M-%S')}"
custom_model_name = f"iterative-model-{str(uuid.uuid4())[:8]}"

# IAM role that Amazon Bedrock assumes to read your training data and write outputs
role_arn = "arn:aws:iam::<AccountID>:role/<your-bedrock-customization-role>"

# Key difference: Use your previously customized model ARN as base
# This could be from previous fine-tuning or distillation
base_model_id = "arn:aws:bedrock:<Region>:<AccountID>:custom-model/<your-previous-custom-model-id>"

# S3 paths for training data and outputs
training_data_uri = "s3://<your-bucket>/<iterative-training-data>"
output_path = "s3://<your-bucket>/<iterative-output-folder>/"

# Hyperparameters adjusted based on previous iteration learnings
hyperparameters = {
    "epochCount": "3"  # Example
}

# Create the iterative fine-tuning job
response = bedrock.create_model_customization_job(
    customizationType="FINE_TUNING",
    jobName=job_name,
    customModelName=custom_model_name,
    roleArn=role_arn,
    baseModelIdentifier=base_model_id,  # Your previously customized model
    hyperParameters=hyperparameters,
    trainingDataConfig={
        "s3Uri": training_data_uri
    },
    outputDataConfig={
        "s3Uri": output_path
    }
)

job_arn = response.get('jobArn')
print(f"Iterative fine-tuning job created with ARN: {job_arn}")
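To track the job programmatically rather than only in the console, you can poll its status with the same client. This is a minimal sketch; the field names follow the GetModelCustomizationJob response, and the polling interval is a choice you may want to adapt.

import time

# Poll the customization job until it reaches a terminal state
while True:
    job = bedrock.get_model_customization_job(jobIdentifier=job_arn)
    status = job["status"]
    print(f"Job status: {status}")
    if status in ("Completed", "Failed", "Stopped"):
        break
    time.sleep(60)  # check once per minute

if status == "Completed":
    print(f"Output model ARN: {job.get('outputModelArn')}")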

Setting up inference for your iteratively fine-tuned model
Once your iterative fine-tuning job completes, you have two primary options for deploying your model for inference: Provisioned Throughput and on-demand inference. Each is suited to different usage patterns and requirements.
Provisioned Throughput
Provisioned Throughput offers stable performance for predictable workloads where consistent throughput requirements exist. This option provides dedicated capacity so that the iteratively fine-tuned model maintains performance standards during peak usage periods. Setup involves purchasing model units based on expected traffic patterns and performance requirements.
On-demand inference
On-demand inference provides flexibility for variable workloads and experimentation scenarios. Amazon Bedrock now supports Amazon Nova Micro, Lite, and Pro models as well as Llama 3.3 models for on-demand inference with pay-per-token pricing. This option avoids the need for capacity planning so you can test your iteratively fine-tuned model without upfront commitments. The pricing model scales automatically with usage, making it cost-effective for applications with unpredictable or low-volume inference patterns.
Best practices
Successful iterative fine-tuning requires attention to several key areas. Most importantly, your data strategy should emphasize quality over quantity in incremental datasets. Rather than adding large volumes of new training examples, focus on high-quality data that addresses specific performance gaps identified in previous iterations.
To track progress effectively, evaluation consistency across iterations allows meaningful comparison of improvements. Establish baseline metrics during your first iteration and maintain the same evaluation framework throughout the process. You can use Amazon Bedrock Evaluations to help you systematically identify where gaps exist in your model performance after each customization run. This consistency helps you understand whether changes are producing meaningful improvements.
Finally, recognizing when to stop the iterative process helps to prevent diminishing returns on your investment. Monitor performance improvements between iterations and consider concluding the process when gains become marginal relative to the effort required.
Conclusion
Iterative fine-tuning on Amazon Bedrock provides a systematic approach to model improvement that reduces risks while enabling continuous refinement. With the iterative fine-tuning methodology organizations can build upon existing investments in custom models rather than starting from scratch when adjustments are needed.
To get started with iterative fine-tuning, access the Amazon Bedrock console and navigate to the Custom models section. For detailed implementation guidance, refer to the Amazon Bedrock documentation.

About the authors
Yanyan Zhang is a Senior Generative AI Data Scientist at Amazon Web Services, where she has been working on cutting-edge AI/ML technologies as a Generative AI Specialist, helping customers use generative AI to achieve their desired outcomes. Yanyan graduated from Texas A&M University with a PhD in Electrical Engineering. Outside of work, she loves traveling, working out, and exploring new things.
Gautam Kumar is an Engineering Manager at AWS AI Bedrock, leading model customization initiatives across large-scale foundation models. He specializes in distributed training and fine-tuning. Outside work, he enjoys reading and traveling.
Jesse Manders is a Senior Product Manager on Amazon Bedrock, the AWS Generative AI developer service. He works at the intersection of AI and human interaction with the goal of creating and improving generative AI products and services to meet our needs. Previously, Jesse held engineering team leadership roles at Apple and Lumileds, and was a senior scientist in a Silicon Valley startup. He has an M.S. and Ph.D. from the University of Florida, and an MBA from the University of California, Berkeley, Haas School of Business.

Voice AI-powered drive-thru ordering with Amazon Nova Sonic and dynami …

Artificial Intelligence (AI) is transforming the quick-service restaurant industry, particularly in drive-thru operations where efficiency and customer satisfaction intersect. Traditional systems create significant obstacles in service delivery, from staffing limitations and order accuracy issues to inconsistent customer experiences across locations. These challenges, combined with rising labor costs and demand fluctuations, have pushed the industry to seek innovative solutions.
In this post, we’ll demonstrate how to implement a Quick Service Restaurants (QSRs) drive-thru solution using Amazon Nova Sonic and AWS services. We’ll walk through building an intelligent system that combines voice AI with interactive menu displays, providing technical insights and implementation guidance to help restaurants modernize their drive-thru operations.
For QSRs, the stakes are particularly high during peak hours, when long wait times and miscommunication between customers and staff can significantly impact business performance. Common pain points include order accuracy issues, service quality variations across different shifts, and limited ability to handle sudden spikes in customer demand. Modern consumers expect the same seamless, efficient service they experience with digital ordering systems, creating an unprecedented opportunity for voice AI technology to support 24/7 availability and consistent service quality.
Amazon Nova Sonic is a foundation model (FM) within the Amazon Nova family, designed specifically for voice-enabled applications. Available through Amazon Bedrock, developers can use Nova Sonic to create applications that understand spoken language, process complex conversational interactions, and generate appropriate responses for real-time customer engagement. This innovative speech-to-speech model addresses traditional voice application challenges through:

Accurate recognition of streaming speech across accents, with robustness to background noise
Speech responses that adapt to the user's tone and sentiment
Bidirectional streaming speech I/O with low user-perceived latency
Graceful interruption handling and natural turn-taking in conversations
Industry-leading price-performance

When integrated with AWS serverless services, Nova Sonic delivers natural, human-like voice interactions that help improve the drive-thru experience. The architecture creates a cost-effective system that enhances both service consistency and operational efficiency through intelligent automation.
Solution overview
Our voice AI drive-thru solution creates an intelligent ordering system that combines real-time voice interaction with a robust backend infrastructure, delivering a natural customer experience. The system processes speech in real-time, understanding various accents, speaking styles, and handling background noise common in drive-thru environments. Integrating voice commands with interactive menu displays enhances user feedback while streamlining the ordering process by reducing verbal interactions.
The system is built on AWS serverless architecture, integrating key components including Amazon Cognito for authentication with role-based access control, AWS Amplify for the digital menu board, Amazon API Gateway to facilitate access to Amazon DynamoDB tables, AWS Lambda functions with Amazon Nova Canvas for menu image generation, and Amazon Simple Storage Service (Amazon S3) with Amazon CloudFront for image storage and delivery.
The following architecture diagram illustrates how these services interconnect to enable natural conversations between customers and the digital menu board, orchestrating the entire customer journey from drive-thru entry to order completion.

Let’s examine how each component works together to power this intelligent ordering system.
Prerequisites
You must have the following in place to complete the solution in this post:

An AWS account
FM access in Amazon Bedrock for Amazon Nova Sonic and Amazon Nova Canvas in the same AWS Region where you will deploy this solution
The accompanying AWS CloudFormation templates downloaded from the aws-samples GitHub repo

Deploy solution resources using AWS CloudFormation
Deploy the CloudFormation templates in an AWS Region where Amazon Bedrock is available and has support for the following models: Amazon Nova Sonic and Amazon Nova Canvas.
This solution consists of two CloudFormation templates that work together to create a complete restaurant drive-thru ordering system. The nova-sonic-infrastructure-drivethru.yaml template establishes the foundational AWS infrastructure, including Cognito user authentication, S3 storage with CloudFront CDN for menu images, DynamoDB tables for menu items and customer data, and API Gateway endpoints with proper CORS configuration. The nova-sonic-application-drivethru.yaml template builds on this foundation by deploying a Lambda function that populates the system with a complete embedded drive-thru menu featuring burgers, wings, fries, drinks, sauces, and combo meals. The function uses the Amazon Nova Canvas model to automatically generate professional food photography for each menu item and stores the images in the S3 bucket for delivery through CloudFront.
During the deployment of the first CloudFormation template nova-sonic-infrastructure-drivethru.yaml, you will need to specify the following parameters:

Stack name
Environment – Deployment environment: dev, staging, or prod (defaults to dev)
UserEmail – Valid email address for the user account (required)
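If you prefer to script the deployment rather than use the console, the following is a minimal boto3 sketch. The stack name, parameter values, and capabilities shown here are placeholders and assumptions you should adapt to your account; the parameter keys match the list above.

import boto3

cfn = boto3.client("cloudformation")

# Read the infrastructure template downloaded from the aws-samples repo
with open("nova-sonic-infrastructure-drivethru.yaml") as f:
    template_body = f.read()

cfn.create_stack(
    StackName="nova-sonic-drivethru-infra",  # example stack name
    TemplateBody=template_body,
    Parameters=[
        {"ParameterKey": "Environment", "ParameterValue": "dev"},
        {"ParameterKey": "UserEmail", "ParameterValue": "you@example.com"},
    ],
    Capabilities=["CAPABILITY_NAMED_IAM"],  # the template creates IAM roles; adjust as required
)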

Important: You must enable access to the selected Amazon Nova Sonic model and Amazon Nova Canvas model in the Amazon Bedrock console before deployment.
AWS resource usage will incur costs. When deployment is complete, the following resources will be deployed:

Amazon Cognito resources:

User pool – CognitoUserPool
App client – AppClient
Identity pool – CognitoIdentityPool
Groups – AppUserGroup
User – AppUser

AWS Identity and Access Management (IAM) resources:

IAM roles:

AuthenticatedRole
DefaultAuthenticatedRole
ApiGatewayDynamoDBRole
LambdaExecutionRole
S3BucketCleanupRole

Amazon DynamoDB tables:

MenuTable – Stores menu items, pricing, and customization options
LoyaltyTable – Stores customer loyalty information and points
CartTable – Stores shopping cart data for active sessions
OrderTable – Stores completed and pending orders
ChatTable – Stores completed chat details

Amazon S3, CloudFront and AWS WAF resources:

MenuImagesBucket – S3 bucket for storing menu item images
MenuImageCloudFrontDistribution – CloudFront distribution for global content delivery
CloudFrontOriginAccessIdentity – Secure access between CloudFront and S3
CloudFrontWebACL – WAF protection for CloudFront distribution with security rules

Amazon API Gateway resources:

REST API – app-api with Cognito authorization
API resources and methods:

/menu (GET, OPTIONS)
/loyalty (GET, OPTIONS)
/cart (POST, DELETE, OPTIONS)
/order (POST, OPTIONS)
/chat (POST, OPTIONS)

API deployment to specified environment stage

AWS Lambda function:

S3BucketCleanupLambda – Cleans up S3 bucket on stack deletion

CloudFormation Custom Resource:

S3BucketCleanup – Triggers S3BucketCleanupLambda

After you deploy the CloudFormation template, copy the following from the Outputs tab on the AWS CloudFormation console to use during the configuration of your frontend application:

cartApiUrl
loyaltyApiUrl
menuApiUrl
orderApiUrl
chatApiUrl
UserPoolClientId
UserPoolId
IdentityPoolId

The following screenshot shows you what the Outputs tab will look like.

These output values are essential for configuring your frontend application (deployed via AWS Amplify) to connect with the backend services. The API URLs will be used for making REST API calls, while the Cognito IDs will be used for user authentication and authorization.
During the deployment of the second CloudFormation template nova-sonic-application-drivethru.yaml you will need to specify the following parameters:

Stack name
InfrastructureStackName – The name of the stack you previously deployed using nova-sonic-infrastructure-drivethru.yaml

When deployment is complete, the following resources will be deployed:

AWS Lambda function:

DriveThruMenuLambda – Populates menu data and generates AI images

CloudFormation Custom Resource:

DriveThruMenuPopulation – Triggers DriveThruMenuLambda

Once both CloudFormation templates are successfully deployed, you’ll have a fully functional restaurant drive-thru ordering system with AI-generated menu images, complete authentication, and ready-to-use API endpoints for your Amplify frontend deployment.
Deploy the Amplify application
You need to manually deploy the Amplify application using the frontend code found on GitHub. Complete the following steps:

Download the frontend code NovaSonic-FrontEnd.zip from GitHub.
Use the .zip file to manually deploy the application in Amplify.
Return to the Amplify page and use the domain it automatically generated to access the application.

User authentication
The solution uses Amazon Cognito user pools and identity pools to implement secure, role-based access control for the restaurant's digital menu board. User pools handle authentication and group management through the AppUserGroup, and identity pools provide temporary AWS credentials mapped to specific IAM roles, including AuthenticatedRole. The system makes sure that only verified digital menu board users can access the application and interact with the menu APIs, cart management, order processing, and loyalty services, while also providing secure access to Amazon Bedrock. This combines robust security with an intuitive ordering experience for both customers and restaurant operations.
Serverless data management
The solution implements a serverless API architecture using Amazon API Gateway to create a single REST API (app-api) that facilitates communication between the frontend interface and backend services. The API includes five resource endpoints (/menu, /loyalty, /cart, /chat, /order) with Cognito-based authentication and direct DynamoDB integration for data operations. The backend utilizes five DynamoDB tables: MenuTable for menu items and pricing, LoyaltyTable for customer profiles and loyalty points, CartTable for active shopping sessions, ChatTable for capturing chat history, and OrderTable for order tracking and history. This architecture provides fast, consistent performance at scale, with Global Secondary Indexes enabling efficient queries by customer ID and order status for optimal drive-thru operations.
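As an illustration of how such a query might look, the following sketch uses boto3 to read an orders table through a GSI. The table, index, and attribute names here (OrderTable, CustomerStatusIndex, customerId, orderStatus, orderId) are hypothetical placeholders, not the exact names defined in the CloudFormation templates.

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
orders = dynamodb.Table("OrderTable")  # hypothetical table name

# Fetch a customer's pending orders via a (hypothetical) GSI on customer ID + order status
response = orders.query(
    IndexName="CustomerStatusIndex",  # hypothetical index name
    KeyConditionExpression=Key("customerId").eq("CUST-123") & Key("orderStatus").eq("PENDING"),
)
for item in response["Items"]:
    print(item["orderId"], item["orderStatus"])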
Menu and image generation and distribution
The solution uses Amazon S3 and CloudFront for secure, global content delivery of menu item images. The CloudFormation template creates a MenuImagesBucket with restricted access through a CloudFront Origin Access Identity, making sure images are served securely using the CloudFront distribution for fast loading times worldwide. AWS Lambda powers the AI-driven content generation through the DriveThruMenuLambda function, which automatically populates sample menu data and generates high-quality menu item images using Amazon Nova Canvas. This serverless function executes during stack deployment to create professional food photography for the menu items, from classic burgers to specialty wings, facilitating consistent visual presentation across the entire menu. The Lambda function integrates with DynamoDB to store generated image URLs and uses S3 for persistent storage, creating a complete automated workflow that scales based on demand while optimizing costs through pay-per-use pricing.
Voice AI processing
The solution uses Amazon Nova Sonic as the core voice AI engine. The digital menu board establishes direct integration with Amazon Nova Sonic through secure WebSocket connections, for immediate processing of customer speech input and conversion to structured ordering data. The CloudFormation template configures IAM permissions for the AuthenticatedRole to access the amazon.nova-sonic-v1:0 foundation model, allowing authenticated users to interact with the voice AI service. Nova Sonic handles complex natural language understanding and intent recognition, processing customer requests like menu inquiries, order modifications, and item customizations while maintaining conversation context throughout the ordering process. This direct integration minimizes latency concerns and provides customers with a natural, conversational ordering experience that rivals human interaction while maintaining reliable service across drive-thru locations.
Hosting the digital menu board
AWS Amplify hosts and delivers the digital menu board interface as a scalable frontend application. The interface displays AI-generated menu images through CloudFront, with real-time pricing from DynamoDB, optimized for drive-thru environments. The React-based application automatically scales during peak hours, using the global content delivery network available in CloudFront for fast loading times. It integrates with Amazon Cognito for authentication, establishes WebSocket connections to Amazon Nova Sonic for voice processing, and uses API Gateway endpoints for menu and order management. This serverless solution maintains high availability while providing real-time visual updates as customers interact through voice commands.
WebSocket connection flow
The following sequence diagram illustrates the WebSocket connection setup enabling direct browser-to-Nova Sonic communication. This architecture leverages the AWS SDK update (client-bedrock-runtime v3.842.0), which introduces WebSocketHandler support in browsers, avoiding the need for a server.

This advancement allows frontend applications to establish direct WebSocket connections to Nova Sonic, reducing latency and complexity while enabling real-time conversational AI in the browser. The initialization process includes credential validation, Bedrock client establishment, AI assistant configuration, and audio input setup (16kHz PCM). This direct client-to-service communication represents a shift from traditional architectures, offering more efficient and scalable conversational AI applications.
Voice interaction and dynamic menu
The following sequence diagram illustrates the flow of a customer’s burger query, demonstrating how natural language requests are processed to deliver synchronized audio responses and visual updates.

This diagram shows how a query (“Can you show me what burgers you have?”) is handled. Nova Sonic calls getMenuItems ({category: “burgers”}) to retrieve menu data, while Frontend App components fetch and structure burger items and prices. Nova Sonic generates a contextual response and triggers showCategory ({category: “burgers”}) to highlight the burger section in the UI. This process facilitates real-time synchronization between audio responses and visual menu updates, creating a seamless customer experience throughout the conversation.
Drive-thru solution walkthrough
After deploying your application in AWS Amplify, open the generated URL in your browser. You’ll see two setup options: Choose Sample and Manual Setup. Select Choose Sample then pick AI Drive-Thru Experience from the sample list, and then select Load Sample. This will automatically import the system prompt, tools, and tool configurations for the drive-thru solution. We will configure these settings in the following steps.

After selecting Load Sample, you’ll be prompted to configure the connection settings. You’ll need to use the Amazon Cognito and API Gateway information from your CloudFormation stack outputs. These values are required because they connect your digital menu board to backend services.
Enter the configuration values you copied from the CloudFormation outputs (nova-sonic-infrastructure-drivethru.yaml). These are organized into two sections, as demonstrated in the following videos. After you enter the configuration details in each section, select the Save button at the top of the screen.
Amazon Cognito configuration:

UserPoolId
UserPoolClientId
IdentityPoolId

Agent configuration:

Auto-Initiate Conversation – Nova Sonic is initially set to wait for you to start the conversation. However, you can enable automatic conversation initiation by checking the ‘Enable auto-initiate’ box. There is a pre-recorded ‘Hello’ that you can use that’s stored locally.

Tools global parameters:

menuAPIURL
cartAPIURL
orderAPIURL
loyaltyAPIURL
chatAPIURL

After completing the configuration, choose the Save and Exit button at the top of the page. This action redirects you to a sign-in screen. To access the system, use the username appuser and the temporary password that was automatically generated and emailed to the address you provided during the CloudFormation deployment.
After entering the temporary password, you’ll be asked to verify your account through a temporary code sent to your email.
Upon your initial login attempt, you’ll be required to create a new password to replace the temporary one, as demonstrated in the following video.

Begin your drive-thru experience by clicking the microphone icon. The AI assistant welcomes you and guides you through placing your order while dynamically updating the digital menu board to highlight relevant items. The system intelligently suggests complementary items and adapts its communication style to enhance your ordering experience.

Clean up
If you decide to discontinue using the solution, you can follow these steps to remove it, its associated resources deployed using AWS CloudFormation, and the Amplify deployment:

Delete the CloudFormation stack:

On the AWS CloudFormation console, choose Stacks in the navigation pane.
Locate the stack you created during the deployment process of nova-sonic-application-drivethru.yaml (you assigned a name to it).
Select the stack and choose Delete.
Repeat this for nova-sonic-infrastructure-drivethru.yaml

Delete the Amplify application and its resources. For instructions, refer to Clean Up Resources.

Conclusion
The voice AI-powered drive-thru ordering system using Amazon Nova Sonic provides restaurants with a practical solution to common operational challenges, including staffing constraints, order accuracy issues, and peak-hour bottlenecks. The serverless architecture built on AWS services (Amazon Cognito for authentication, API Gateway for data communication, DynamoDB for storage, and AWS Amplify for hosting) creates a scalable system that handles varying demand while maintaining consistent performance. The system supports essential restaurant operations, including menu management, cart functionality, loyalty programs, and order processing, through direct API Gateway and DynamoDB integration. For restaurants looking to modernize their drive-thru operations, this solution offers measurable benefits, including reduced wait times, improved order accuracy, and operational efficiency gains. The pay-per-use pricing model and automated scaling help control costs while supporting business growth. As customer expectations shift toward more efficient service experiences, implementing voice AI technology provides restaurants with a competitive advantage and positions them well for future technological developments in the food service industry.
Additional resources
To learn more about Amazon Nova Sonic and additional solutions, refer to the following resources:

Introducing Amazon Nova Sonic: Human-like voice conversations for generative AI applications
Frontend application source code used in this blog is available on GitHub
Voice AI-Powered Hotel In-Room Service with Amazon Nova Sonic

About the Authors

Salman Ahmed
Salman is a Senior Technical Account Manager in AWS Enterprise Support. He specializes in guiding customers through the design, implementation, and support of AWS solutions. Combining his networking expertise with a drive to explore new technologies, he helps organizations successfully navigate their cloud journey. Outside of work, he enjoys photography, traveling, and watching his favorite sports teams.

Sergio Barraza
Sergio is a Senior Technical Account Manager at AWS, helping customers on designing and optimizing cloud solutions. With more than 25 years in software development, he guides customers through AWS services adoption. Outside of work, Sergio is a multi-instrument musician playing guitar, piano, and drums, and he also practices Wing Chun Kung Fu.

Ravi Kumar
Ravi is a Senior Technical Account Manager in AWS Enterprise Support who helps customers in the travel and hospitality industry to streamline their cloud operations on AWS. He is a results-driven IT professional with over 20 years of experience. Ravi is passionate about generative AI and actively explores its applications in cloud computing. In his free time, Ravi enjoys creative activities like painting. He also likes playing cricket and traveling to new places.

Ankush Goyal
Ankush is a Senior Technical Account Manager at AWS Enterprise Support, specializing in helping customers in the travel and hospitality industries optimize their cloud infrastructure. With over 20 years of IT experience, he focuses on leveraging AWS networking services to drive operational efficiency and cloud adoption. Ankush is passionate about delivering impactful solutions and enabling clients to streamline their cloud operations.

Leland Johnson
Leland is a Sr. Solutions Architect for AWS focusing on travel and hospitality. As a Solutions Architect, he plays a crucial role in guiding customers through their cloud journey by designing scalable and secure cloud solutions. Outside of work, he enjoys playing music and flying light aircraft.

Optimizing document AI and structured outputs by fine-tuning Amazon No …

Multimodal fine-tuning represents a powerful approach for customizing vision large language models (LLMs) to excel at specific tasks that involve both visual and textual information. Although base multimodal models offer impressive general capabilities, they often fall short when faced with specialized visual tasks, domain-specific content, or output formatting requirements. Fine-tuning addresses these limitations by adapting models to your specific data and use cases, dramatically improving performance on tasks that matter to your business.
A common use case is document processing, which includes extracting structured information from complex layouts including invoices, purchase orders, forms, tables, or technical diagrams. Although off-the-shelf LLMs often struggle with specialized documents like tax forms, invoices, and loan applications, fine-tuned models can learn from highly varied data and can deliver significantly higher accuracy while reducing processing costs.
This post provides a comprehensive hands-on guide to fine-tune Amazon Nova Lite for document processing tasks, with a focus on tax form data extraction. Using our open-source GitHub repository code sample, we demonstrate the complete workflow from data preparation to model deployment. Since Amazon Bedrock provides on-demand inference with pay-per-token pricing for Amazon Nova, we can benefit from the accuracy improvement from model customization and maintain the pay-as-you-go cost structure.
The document processing challenge
Given a single or multi-page document, the goal is to extract or derive specific structured information from the document so that it can be used for downstream systems or additional insights. The following diagram shows how a vision LLM can be used to derive the structured information based on a combination of text and vision capabilities.

The key challenges for enterprises in workflow automation when processing documents, like invoices or W2 tax forms, are the following:

Complex layouts: Specialized forms contain multiple sections with specific fields arranged in a structured format.
Variability of document types: Many diverse document types exist (invoices, contracts, forms).
Variability within a single document type: Each vendor can send a different invoice format and style or type.
Data quality variations: Scanned documents vary in quality, orientation, and completeness.
Language barriers: Documents can be in multiple languages.
Critical accuracy requirements: Tax-related data extraction demands extremely high accuracy.
Structured output needs: Extracted data must be formatted consistently for downstream processing.
Scalability and integration: Grow with business needs and integrate with existing systems; for example, Enterprise Resource Planning (ERP) systems.

Approaches for intelligent document processing that use LLMs or vision LLMs fall into three main categories:

Zero-shot prompting: An LLM or vision LLM is used to derive the structured information based on the input document, instructions, and the target schema.
Few-shot prompting: A technique used with LLMs or vision LLMs where a few additional examples (document + target output) are provided within the prompt to guide the model in completing a specific task. Unlike zero-shot prompting, which relies solely on natural language instructions, few-shot prompting can improve accuracy and consistency by demonstrating the desired input-output behavior through a set of examples.
Fine-tuning: Customize or fine-tune the weights of a given LLM or vision LLM by providing larger amounts of annotated documents (input/output pairs), to teach the model exactly how to extract or interpret relevant information.

For the first two approaches, refer to the amazon-nova-samples repository, which contains sample code on how to use the Amazon Bedrock Converse API for structured output by using tool calling.
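As a rough illustration of that pattern (not the repository's exact code), the sketch below forces a single tool call whose JSON schema defines the fields to extract; the tool name, schema fields, image file, and prompt are assumptions for demonstration.

import boto3, json

runtime = boto3.client("bedrock-runtime")

# JSON schema describing the structured output we want (fields are illustrative)
extract_tool = {
    "toolSpec": {
        "name": "record_invoice_fields",
        "description": "Record the structured fields extracted from the document.",
        "inputSchema": {"json": {
            "type": "object",
            "properties": {
                "vendor_name": {"type": "string"},
                "invoice_date": {"type": "string"},
                "total_amount": {"type": "number"},
            },
            "required": ["vendor_name", "invoice_date", "total_amount"],
        }},
    }
}

with open("invoice.png", "rb") as f:
    image_bytes = f.read()

response = runtime.converse(
    modelId="amazon.nova-lite-v1:0",  # or a Region-specific inference profile ID
    messages=[{"role": "user", "content": [
        {"image": {"format": "png", "source": {"bytes": image_bytes}}},
        {"text": "Extract the invoice fields and record them with the tool."},
    ]}],
    toolConfig={"tools": [extract_tool],
                "toolChoice": {"tool": {"name": "record_invoice_fields"}}},
)

# The structured output arrives as the tool call's input payload
for block in response["output"]["message"]["content"]:
    if "toolUse" in block:
        print(json.dumps(block["toolUse"]["input"], indent=2))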
Off-the-shelf LLMs excel at general document understanding, but they might not optimally handle domain-specific challenges. A fine-tuned Nova model can enhance performance by:

Learning document-specific layouts and field relationships
Adapting to common quality variations in your document dataset
Providing consistent, structured outputs
Maintaining high accuracy across different document variations. For example, invoice documents can have hundreds of different vendors, each with different formats, layouts or even different languages.

Creating the annotated dataset and selecting the customization technique
While there are various methods for customization of Amazon Nova models available, the most relevant for document processing are the following:

Fine-tune for specific tasks: Adapt Nova models for specific tasks using supervised fine-tuning (SFT). Choose between Parameter-Efficient Fine-Tuning (PEFT) for light-weight adaptation with limited data, or full fine-tuning when you have extensive training datasets to update all parameters of the model.
Distill to create smaller, faster models: Use knowledge distillation to transfer knowledge from a larger, more intelligent model, like Nova Premier (teacher) to a smaller, faster, more cost-efficient model (student), ideal for when you don’t have enough annotated training datasets and the teacher model provides the accuracy that meets your requirement.

To be able to learn from previous examples, you need either an annotated dataset from which the model can learn, or a model that is good enough for your task to use as a teacher model.

Automated dataset annotation with historic data from Enterprise Resource Planning (ERP) systems, such as SAP: Many customers already have historic documents that have been manually processed and consumed by downstream systems, like ERP or customer relationship management (CRM) systems. Explore existing downstream systems like SAP and the data they contain. This data can often be mapped back to the original source document it was derived from and helps you bootstrap an annotated dataset very quickly.
Manual dataset annotation: Identify the most relevant documents and formats, and annotate them using human annotators, so that you have document/JSON pairs where the JSON contains the target information that you want to extract or derive from your source documents.
Annotate with the teacher model: Explore if a larger model like Nova Premier can provide accurate enough results using prompt engineering. If that is the case, you can also use distillation.

For the first and second options, we recommend supervised model fine-tuning. For the third, model distillation is the right approach.
Amazon Bedrock currently provides both fine-tuning and distillation techniques, so anyone with a basic data science skillset can easily submit jobs. They run on compute that is completely managed by Amazon, so you don't have to worry about instance sizes or capacity limits.
Nova customization is also available with Amazon SageMaker with more options and controls. For example, if you have sufficient high-quality labeled data and you want deeper customization for your use case, full rank fine-tuning might produce higher accuracy. Full rank fine tuning is supported with SageMaker training jobs and SageMaker HyperPod.
Data preparation best practices
The quality and structure of your training data fundamentally determine the success of fine-tuning. Here are key steps and considerations for preparing effective multimodal datasets and configuring your fine-tuning job:
Dataset analysis and base model evaluation
Our demonstration uses a synthetic dataset of W2 tax forms: the Fake W-2 (US Tax Form) Dataset. This public dataset comprises simulated US tax returns (W-2 statements for years 2016-19), including noisy images that mimic low-quality scanned W2 tax forms.
Before fine-tuning, it’s crucial to:

Analyze dataset characteristics (image quality, field completeness, class distribution), define use-case-specific evaluation metrics, and establish baseline model performance.
Compare each predicted field value against the ground truth, calculating precision, recall, and F1 scores for individual fields and overall performance (one possible scoring scheme is sketched after this list).
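One reasonable way to score field-level extraction (the post's exact scoring may differ) is sketched below: a field counts as a true positive when the predicted value matches the ground truth, a false positive when a value is predicted but wrong, and a false negative when no value is predicted for a ground-truth field.

def field_metrics(predictions: dict, ground_truth: dict) -> dict:
    # predictions / ground_truth: {field_name: value}; missing predictions count as false negatives
    tp = fp = fn = 0
    for field, truth in ground_truth.items():
        pred = predictions.get(field)
        if pred is None or str(pred).strip() == "":
            fn += 1
        elif str(pred).strip() == str(truth).strip():
            tp += 1
        else:
            fp += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}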

Prompt optimization
Crafting an effective prompt is essential for aligning the model with task requirements. Our system comprises two key components:

System prompt: Defines the task, provides detailed instructions for each field to be extracted, and specifies the output format.
User prompt: Follows Nova vision understanding best practices, utilizing the {media_file}-then-{text} structure as outlined in the Amazon Nova model user guide.

Iterate on your prompts using the base model to optimize performance before fine-tuning.
Dataset preparation
Prepare your dataset in JSONL format and split it into training, validation, and test sets (a minimal splitting sketch follows this list):

Training set: 70-80% of data
Validation set: 10-20% of data
Test set: 10-20% of data
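A minimal splitting sketch is shown below. The record structure is left as a placeholder because the exact JSONL schema for Amazon Nova fine-tuning should follow the Amazon Bedrock customization documentation.

import json, random

def split_and_write(records, train=0.8, val=0.1, seed=42):
    # records: list of dicts already formatted per the Nova fine-tuning JSONL schema
    random.Random(seed).shuffle(records)
    n = len(records)
    n_train, n_val = int(n * train), int(n * val)
    splits = {
        "train.jsonl": records[:n_train],
        "validation.jsonl": records[n_train:n_train + n_val],
        "test.jsonl": records[n_train + n_val:],
    }
    for path, rows in splits.items():
        with open(path, "w") as f:
            for row in rows:
                f.write(json.dumps(row) + "\n")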

Fine-tuning job configuration and monitoring
Once the dataset is prepared and uploaded to an Amazon Simple Storage Service (Amazon S3) bucket, we can configure and submit the fine-tuning job on Amazon Bedrock. Key parameters include:

Epochs – Number of complete passes through the training dataset; determines how many times the model sees the entire dataset during training
Learning rate – Step size for gradient descent optimization; controls how much model weights are adjusted in response to estimated error
Learning rate warmup steps – Number of steps used to gradually increase the learning rate; prevents instability by slowly ramping up the learning rate from a small value to the target rate

Amazon Bedrock customization provides validation loss metrics throughout the training process. Monitor these metrics to:

Assess model convergence
Detect potential overfitting
Gain early insights into model performance on unseen data

The following graph shows an example metric analysis:

When analyzing the training and validation loss curves, the relative behavior between these metrics provides crucial insights into the model’s learning dynamics. Optimal learning patterns can be observed as:

Both training and validation losses decrease steadily over time
The curves maintain relatively parallel trajectories
The gap between training and validation loss remains stable
Final loss values converge to similar ranges

Model inference options for customized models
Once your custom model has been created in Amazon Bedrock, you have two main ways to run inference with that model: use on-demand custom model inference (ODI) deployments, or use Provisioned Throughput endpoints. Let's talk about why and when to choose one over the other.
On-demand custom model deployments provide a flexible and cost-effective way to leverage your custom Bedrock models. With on-demand deployments, you only pay for the compute resources you use, based on the number of tokens processed during inference. This makes on-demand a great choice for workloads with variable or unpredictable usage patterns, where you want to avoid over-provisioning resources. The on-demand approach also offers automatic scaling, so you don’t have to worry about managing infrastructure capacity. Bedrock will automatically provision the necessary compute power to handle your requests in near real time. This self-service, serverless experience can simplify your operations and deployment workflows.
Alternatively, Provisioned Throughput endpoints are recommended for workloads with steady traffic patterns and consistent high-volume requirements, offering predictable performance and cost benefits over on-demand scaling.
This example uses the ODI option to leverage per-token pricing; the following code snippet shows how you can create an ODI endpoint for your custom model:

import time
import boto3

# Bedrock control-plane client used to create the deployment
bedrock = boto3.client('bedrock')

# Function to create on-demand inferencing deployment for custom model
def create_model_deployment(custom_model_arn):
    """
    Create an on-demand inferencing deployment for the custom model

    Parameters:
    -----------
    custom_model_arn : str
        ARN of the custom model to deploy

    Returns:
    --------
    deployment_arn : str
        ARN of the created deployment
    """
    try:
        print(f"Creating on-demand inferencing deployment for model: {custom_model_arn}")

        # Generate a unique name for the deployment
        deployment_name = f"nova-ocr-deployment-{time.strftime('%Y%m%d-%H%M%S')}"

        # Create the deployment
        response = bedrock.create_custom_model_deployment(
            modelArn=custom_model_arn,
            modelDeploymentName=deployment_name,
            description=f"on-demand inferencing deployment for model: {custom_model_arn}",
        )

        # Get the deployment ARN
        deployment_arn = response.get('customModelDeploymentArn')

        print(f"Deployment request submitted. Deployment ARN: {deployment_arn}")
        return deployment_arn

    except Exception as e:
        print(f"Error creating deployment: {e}")
        return None
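Once the deployment is active, inference can be run against it. The sketch below assumes the deployment ARN is passed as the modelId to the Converse API, per the on-demand custom model deployment workflow, and uses an illustrative image file and prompt.

import boto3

runtime = boto3.client("bedrock-runtime")

# Deploy the custom model and use the resulting deployment ARN as the model ID
deployment_arn = create_model_deployment("<your-custom-model-arn>")

with open("w2_sample.png", "rb") as f:
    image_bytes = f.read()

response = runtime.converse(
    modelId=deployment_arn,
    messages=[{"role": "user", "content": [
        {"image": {"format": "png", "source": {"bytes": image_bytes}}},
        {"text": "Extract the W-2 fields and return them as JSON."},
    ]}],
    inferenceConfig={"maxTokens": 512, "temperature": 0},
)
print(response["output"]["message"]["content"][0]["text"])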

Evaluation: Accuracy improvement with fine-tuning
Our evaluation of the base model and the fine-tuned Nova model shows significant improvements across all field categories. Let’s break down the performance gains:

For each field category, the values below show base model → fine-tuned model (improvement):

Employee information – Accuracy: 58% → 82.33% (+24.33%); Precision: 57.05% → 82.33% (+25.28%); Recall: 100% → 100% (+0%); F1 score: 72.65% → 90.31% (+17.66%)
Employer information – Accuracy: 58.67% → 92.67% (+34%); Precision: 53.66% → 92.67% (+39.01%); Recall: 100% → 100% (+0%); F1 score: 69.84% → 96.19% (+26.35%)
Earnings – Accuracy: 62.71% → 85.57% (+22.86%); Precision: 60.97% → 85.57% (+24.60%); Recall: 99.55% → 100% (+0.45%); F1 score: 75.62% → 92.22% (+16.60%)
Benefits – Accuracy: 45.50% → 60% (+14.50%); Precision: 45.50% → 60% (+14.50%); Recall: 93.81% → 100% (+6.19%); F1 score: 61.28% → 75% (+13.72%)
Multi-state employment – Accuracy: 58.29% → 94.19% (+35.90%); Precision: 52.14% → 91.83% (+39.69%); Recall: 99.42% → 100% (+0.58%); F1 score: 68.41% → 95.74% (+27.33%)

The following graphic shows a bar chart comparing the F1 scores of the base model and fine-tuned model for each field category, with the improvement percentage shown in the previous table:

Key observations:

Substantial improvements across all categories, with the most significant gains in employer information and multi-state employment
Consistent 100% recall maintained or achieved in the fine-tuned model, indicating comprehensive field extraction
Notable precision improvements, particularly in categories that were challenging for the base model

Clean up
To avoid incurring unnecessary costs when you’re no longer using your custom model, it’s important to properly clean up the resources. Follow these steps to remove both the deployment and the custom model:

Delete the custom model deployment
Delete the custom model

Cost analysis
In our example, we used an Amazon Bedrock fine-tuning job, which applies PEFT and supports ODI. PEFT fine-tuning of Nova Lite, paired with on-demand inference, offers a cost-effective and scalable solution for enhanced document processing. The cost structure is straightforward and transparent:
One-time cost:

Model training: $0.002 per 1,000 tokens × number of epochs

Ongoing costs:

Storage: $1.95 per month per custom model
On-demand Inference: Same per-token pricing as the base model

Example for one page from the above dataset: 1,895 input tokens / 1,000 × $0.00006 + 411 output tokens / 1,000 × $0.00024 ≈ $0.00021

On-demand inference allows you to run your custom Nova models without maintaining provisioned endpoints, enabling pay-as-you-go pricing based on actual token usage. This approach eliminates the need for capacity planning while ensuring cost-efficient scaling.
Conclusion
In this post, we’ve demonstrated how fine-tuning Amazon Nova Lite can transform document processing accuracy while maintaining cost efficiency. Our evaluation shows significant performance gains, with up to 39% improvement in precision for critical fields and perfect recall across key document categories. While our implementation did not require constrained decoding, tool calling with Nova can provide additional reliability for more complex structured outputs, especially when working with intricate JSON schemas. Please refer to the resource on structured output with tool calling for further information.
The flexible deployment options, including on-demand inference with pay-per-use pricing, eliminate infrastructure overhead while maintaining the same inference costs as the base model. With the dataset we used for this example, runtime inference per page cost was $0.00021, making it a cost-effective solution. Through practical examples and step-by-step guides, we’ve shown how to prepare training data, fine-tune models, and evaluate performance with clear metrics.
To get started with your own implementation, visit our GitHub repository for complete code samples and detailed documentation.

About the authors
Sharon Li is an AI/ML Specialist Solutions Architect at Amazon Web Services (AWS) based in Boston, Massachusetts. With a passion for leveraging cutting-edge technology, Sharon is at the forefront of developing and deploying innovative generative AI solutions on the AWS cloud platform.
Arlind Nocaj is a GTM Specialist Solutions Architect for AI/ML and Generative AI for Europe Central, based in the AWS Zurich office, who guides enterprise customers through their digital transformation journeys. With a PhD in network analytics and visualization (graph drawing) and over a decade of experience as a research scientist and software engineer, he brings a unique blend of academic rigor and practical expertise to his role. His primary focus lies in using the full potential of data, algorithms, and cloud technologies to drive innovation and efficiency. His areas of expertise include machine learning, generative AI, and in particular agentic systems with multimodal LLMs for document processing and structured insights.
Pat Reilly is a Sr. Specialist Solutions Architect on the Amazon Bedrock Go-to-Market team. Pat has spent the last 15 years in analytics and machine learning as a consultant. When he’s not building on AWS, you can find him fumbling around with wood projects.
Malte Reimann is a Solutions Architect based in Zurich, working with customers across Switzerland and Austria on their cloud initiatives. His focus lies in practical machine learning applications, from prompt optimization to fine-tuning vision language models for document processing. The most recent example is working in a small team to provide deployment options for Apertus on AWS. An active member of the ML community, Malte balances his technical work with a disciplined approach to fitness, preferring early morning gym sessions when the gym is empty. During summer weekends, he explores the Swiss Alps on foot and enjoys time in nature. His approach to both technology and life is straightforward: consistent improvement through deliberate practice, whether that's optimizing a customer's cloud deployment or preparing for the next hike in the clouds.

Building a Context-Folding LLM Agent for Long-Horizon Reasoning with M …

In this tutorial, we explore how to build a Context-Folding LLM Agent that efficiently solves long, complex tasks by intelligently managing limited context. We design the agent to break down a large task into smaller subtasks, perform reasoning or calculations when needed, and then fold each completed sub-trajectory into concise summaries. By doing this, we preserve essential knowledge while keeping the active memory small. Check out the FULL CODES here.

import os, re, sys, math, random, json, textwrap, subprocess, shutil, time
from typing import List, Dict, Tuple

try:
    import transformers
except:
    subprocess.run([sys.executable, "-m", "pip", "install", "-q", "transformers", "accelerate", "sentencepiece"], check=True)

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline

MODEL_NAME = os.environ.get("CF_MODEL", "google/flan-t5-small")
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)
llm = pipeline("text2text-generation", model=model, tokenizer=tokenizer, device_map="auto")

def llm_gen(prompt: str, max_new_tokens=160, temperature=0.0) -> str:
    out = llm(prompt, max_new_tokens=max_new_tokens, do_sample=temperature > 0.0, temperature=temperature)[0]["generated_text"]
    return out.strip()

We begin by setting up our environment and loading a lightweight Hugging Face model. We use this model to generate and process text locally, ensuring the agent runs smoothly on Google Colab without any API dependencies.

import ast, operator as op

OPS = {ast.Add: op.add, ast.Sub: op.sub, ast.Mult: op.mul, ast.Div: op.truediv, ast.Pow: op.pow, ast.USub: op.neg, ast.FloorDiv: op.floordiv, ast.Mod: op.mod}

def _eval_node(n):
    if isinstance(n, ast.Num): return n.n
    if isinstance(n, ast.UnaryOp) and type(n.op) in OPS: return OPS[type(n.op)](_eval_node(n.operand))
    if isinstance(n, ast.BinOp) and type(n.op) in OPS: return OPS[type(n.op)](_eval_node(n.left), _eval_node(n.right))
    raise ValueError("Unsafe expression")

def calc(expr: str):
    node = ast.parse(expr, mode='eval').body
    return _eval_node(node)

class FoldingMemory:
    def __init__(self, max_chars: int = 800):
        self.active = []; self.folds = []; self.max_chars = max_chars
    def add(self, text: str):
        self.active.append(text.strip())
        while len(self.active_text()) > self.max_chars and len(self.active) > 1:
            popped = self.active.pop(0)
            fold = f"- Folded: {popped[:120]}…"
            self.folds.append(fold)
    def fold_in(self, summary: str): self.folds.append(summary.strip())
    def active_text(self) -> str: return "\n".join(self.active)
    def folded_text(self) -> str: return "\n".join(self.folds)
    def snapshot(self) -> Dict: return {"active_chars": len(self.active_text()), "n_folds": len(self.folds)}

We define a simple calculator tool for basic arithmetic and create a memory system that dynamically folds past context into concise summaries. This helps us maintain a manageable active memory while retaining essential information.

SUBTASK_DECOMP_PROMPT = """You are an expert planner. Decompose the task below into 2-4 crisp subtasks.
Return each subtask as a bullet starting with '- ' in priority order.
Task: "{task}" """

SUBTASK_SOLVER_PROMPT = """You are a precise problem solver with minimal steps.
If a calculation is needed, write one line 'CALC(expr)'.
Otherwise write 'ANSWER: <final>'.
Think briefly; avoid chit-chat.

Task: {task}
Subtask: {subtask}
Notes (folded context):
{notes}

Now respond with either CALC(…) or ANSWER: …"""

SUBTASK_SUMMARY_PROMPT = """Summarize the subtask outcome in <=3 bullets, total <=50 tokens.
Subtask: {name}
Steps:
{trace}
Final: {final}
Return only bullets starting with '- '."""

FINAL_SYNTH_PROMPT = """You are a senior agent. Synthesize a final, coherent solution using ONLY:
- The original task
- Folded summaries (below)
Avoid repeating steps. Be concise and actionable.

Task: {task}
Folded summaries:
{folds}

Final answer:"""

def parse_bullets(text: str) -> List[str]:
    return [ln[2:].strip() for ln in text.splitlines() if ln.strip().startswith("- ")]

We design prompt templates that guide the agent in decomposing tasks, solving subtasks, and summarizing outcomes. These structured prompts enable clear communication between reasoning steps and the model’s responses. Check out the FULL CODES here.

def run_subtask(task: str, subtask: str, memory: FoldingMemory, max_tool_iters: int = 3) -> Tuple[str, str, List[str]]:
    notes = (memory.folded_text() or "(none)")
    trace = []; final = ""
    for _ in range(max_tool_iters):
        prompt = SUBTASK_SOLVER_PROMPT.format(task=task, subtask=subtask, notes=notes)
        out = llm_gen(prompt, max_new_tokens=96); trace.append(out)
        # If the model requested a calculation, run the CALC tool and ask for a final answer
        m = re.search(r"CALC\((.+?)\)", out)
        if m:
            try:
                val = calc(m.group(1))
                trace.append(f"TOOL:CALC -> {val}")
                out2 = llm_gen(prompt + f"\nTool result: {val}\nNow produce 'ANSWER: ...' only.", max_new_tokens=64)
                trace.append(out2)
                if out2.strip().startswith("ANSWER:"):
                    final = out2.split("ANSWER:", 1)[1].strip(); break
            except Exception as e:
                trace.append(f"TOOL:CALC ERROR -> {e}")
        if out.strip().startswith("ANSWER:"):
            final = out.split("ANSWER:", 1)[1].strip(); break
    if not final:
        final = "No definitive answer; partial reasoning:\n" + "\n".join(trace[-2:])
    # Fold the finished subtask into a compact summary
    summ = llm_gen(SUBTASK_SUMMARY_PROMPT.format(name=subtask, trace="\n".join(trace), final=final), max_new_tokens=80)
    summary_bullets = "\n".join(parse_bullets(summ)[:3]) or f"- {subtask}: {final[:60]}..."
    return final, summary_bullets, trace

class ContextFoldingAgent:
    def __init__(self, max_active_chars: int = 800):
        self.memory = FoldingMemory(max_chars=max_active_chars)
        self.metrics = {"subtasks": 0, "tool_calls": 0, "chars_saved_est": 0}
    def decompose(self, task: str) -> List[str]:
        plan = llm_gen(SUBTASK_DECOMP_PROMPT.format(task=task), max_new_tokens=96)
        subs = parse_bullets(plan)
        return subs[:4] if subs else ["Main solution"]
    def run(self, task: str) -> Dict:
        t0 = time.time()
        self.memory.add(f"TASK: {task}")
        subtasks = self.decompose(task)
        self.metrics["subtasks"] = len(subtasks)
        folded = []
        for st in subtasks:
            self.memory.add(f"SUBTASK: {st}")
            final, fold_summary, trace = run_subtask(task, st, self.memory)
            self.memory.fold_in(fold_summary)
            folded.append(f"- {st}: {final}")
            self.memory.add(f"SUBTASK_DONE: {st}")
        final = llm_gen(FINAL_SYNTH_PROMPT.format(task=task, folds=self.memory.folded_text()), max_new_tokens=200)
        t1 = time.time()
        return {"task": task, "final": final.strip(), "folded_summaries": self.memory.folded_text(),
                "active_context_chars": len(self.memory.active_text()),
                "subtask_finals": folded, "runtime_sec": round(t1 - t0, 2)}

We implement the agent’s core logic, in which each subtask is executed, summarized, and folded back into memory. This step demonstrates how context folding enables the agent to reason iteratively without losing track of prior reasoning. Check out the FULL CODES here.

DEMO_TASKS = [
    "Plan a 3-day study schedule for ML with daily workouts and simple meals; include time blocks.",
    "Compute a small project budget with 3 items (laptop 799.99, course 149.5, snacks 23.75), add 8% tax and 5% buffer, and present a one-paragraph recommendation."
]

def pretty(d): return json.dumps(d, indent=2, ensure_ascii=False)

if __name__ == "__main__":
    agent = ContextFoldingAgent(max_active_chars=700)
    for i, task in enumerate(DEMO_TASKS, 1):
        print("=" * 70)
        print(f"DEMO #{i}: {task}")
        res = agent.run(task)
        print("\n--- Folded Summaries ---\n" + (res["folded_summaries"] or "(none)"))
        print("\n--- Final Answer ---\n" + res["final"])
        print("\n--- Diagnostics ---")
        diag = {k: res[k] for k in ["active_context_chars", "runtime_sec"]}
        diag["n_subtasks"] = len(agent.decompose(task))
        print(pretty(diag))

We run the agent on sample tasks to observe how it plans, executes, and synthesizes final results. Through these examples, we see the complete context-folding process in action, producing concise and coherent outputs.

In conclusion, we demonstrate how context folding enables long-horizon reasoning while avoiding memory overload. We see how each subtask is planned, executed, summarized, and distilled into compact knowledge, mimicking how an intelligent agent would handle complex workflows over time. By combining decomposition, tool use, and context compression, we create a lightweight yet powerful agentic system that scales reasoning efficiently.

Check out the FULL CODES here and the Paper.
The post Building a Context-Folding LLM Agent for Long-Horizon Reasoning with Memory Compression and Tool Use appeared first on MarkTechPost.

Anthropic Launches Claude Haiku 4.5: Small AI Model that Delivers Sonnet-4-Level Coding Performance at One-Third the Cost and more than Twice the Speed

Anthropic released Claude Haiku 4.5, a latency-optimized “small” model that delivers similar levels of coding performance to Claude Sonnet 4 while running more than twice as fast at one-third the cost. The model is immediately available via Anthropic’s API and in partner catalogs on Amazon Bedrock and Google Cloud Vertex AI. Pricing is $1/MTok input and $5/MTok output. Anthropic positions Haiku 4.5 as a drop-in replacement for Haiku 3.5 and Sonnet 4 in cost-sensitive, interactive workloads.

Positioning and lineup

Haiku 4.5 targets real-time assistants, customer-support automations, and pair-programming where tight latency budgets and throughput dominate. It surpasses Sonnet 4 on “computer use” tasks—the GUI/browser manipulation underpinning products like Claude for Chrome—and is described as materially improving responsiveness in Claude Code for multi-agent projects and rapid prototyping. Anthropic makes clear that Sonnet 4.5 remains the frontier model and “the best coding model in the world,” while Haiku 4.5 offers near-frontier performance with greater cost-efficiency. A recommended pattern is Sonnet 4.5 for multi-step planning and parallel execution by a pool of Haiku 4.5 workers.

Availability, identifiers, and pricing

From day one, developers can call the model (claude-haiku-4-5) on Anthropic’s API. Anthropic also states availability on Amazon Bedrock and Vertex AI; model catalogs may update region coverage and IDs over time, but the company confirms cloud availability in the launch post. The API price for Haiku 4.5 is $1/MTok (input) and $5/MTok (output), with prompt-caching listed at $1.25/MTok write and $0.10/MTok read.
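For developers evaluating the switch, the following minimal Python sketch shows what a call to the new model looks like through the official anthropic SDK; the prompt, token budget, and the assumption that ANTHROPIC_API_KEY is set in the environment are illustrative choices, not part of Anthropic's announcement.

import anthropic  # assumes the official Anthropic Python SDK is installed (pip install anthropic)

# The client reads ANTHROPIC_API_KEY from the environment by default
client = anthropic.Anthropic()

# Minimal sketch: send a single user message to claude-haiku-4-5 (identifier from the launch post)
response = client.messages.create(
    model="claude-haiku-4-5",
    max_tokens=512,  # placeholder output budget
    messages=[{"role": "user", "content": "Summarize the trade-offs of using a small model for coding agents."}],
)
print(response.content[0].text)

In a planner-executor setup, the same call pattern applies: a Sonnet 4.5 call produces the plan, and several Haiku 4.5 calls like the one above execute the individual steps in parallel.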

Benchmarks

Anthropic summarizes results across standard and agentic suites and includes methodology details to qualify the numbers:

SWE-bench Verified: simple scaffold with two tools (bash, file edits), 73.3% averaged over 50 trials, no test-time compute, 128K thinking budget, default sampling. Includes a minor prompt addendum encouraging extensive tool use and writing tests first.

Terminal-Bench: Terminus-2 agent, average over 11 runs (6 without thinking, 5 with 32K thinking budget).

OSWorld-Verified: 100 max steps, averaged across 4 runs with a 128K total thinking budget and 2K per-step configuration.

AIME / MMMLU: averages over multiple runs using default sampling and 128K thinking budgets.

https://www.anthropic.com/news/claude-haiku-4-5

The post emphasizes coding parity with Sonnet 4 and computer-use gains relative to Sonnet 4 under these scaffolds. Users should replicate with their own orchestration, tool stacks, and thinking budgets before generalizing.

Key Takeaways

Haiku 4.5 delivers Sonnet-4-level coding performance at one-third the cost and more than twice the speed.

It surpasses Sonnet 4 on computer-use tasks, improving responsiveness in Claude for Chrome and multi-agent flows in Claude Code.

Recommended orchestration: use Sonnet 4.5 for multi-step planning and parallelize execution with multiple Haiku 4.5 workers.

Pricing is $1/$5 per million input/output tokens; available via Claude API, Amazon Bedrock, and Google Cloud Vertex AI.

Released under ASL-2 with a lower measured misalignment rate than Sonnet 4.5 and Opus 4.1 in Anthropic’s tests.

Editorial Comments

Anthropic’s positioning of Claude Haiku 4.5 is strategically sound: by delivering similar levels of coding performance to Claude Sonnet 4 at one-third the cost and more than twice the speed, while surpassing Sonnet 4 on computer use, the company gives devs a clean planner–executor split—Sonnet 4.5 for multi-step planning and a pool of Haiku 4.5 workers for parallel execution—without forcing architectural changes (“drop-in replacement” across API, Amazon Bedrock, Vertex AI). The ASL-2 release, coupled with a documented lower misalignment rate than Sonnet 4.5 and Opus 4.1, lowers the friction for enterprise rollout where safety gates and cost envelopes dominate deployment math.

Check out the Technical details, system card, model page, and documentation.
The post Anthropic Launches Claude Haiku 4.5: Small AI Model that Delivers Sonnet-4-Level Coding Performance at One-Third the Cost and more than Twice the Speed appeared first on MarkTechPost.

Meta AI’s ‘Early Experience’ Trains Language Agents without Rewards—and Outperforms Imitation Learning

How would your agent stack change if a policy could train purely from its own outcome-grounded rollouts—no rewards, no demos—yet beat imitation learning across eight benchmarks? Meta Superintelligence Labs proposes ‘Early Experience’, a reward-free training approach that improves policy learning in language agents without large human demonstration sets and without reinforcement learning (RL) in the main loop. The core idea is simple: let the agent branch from expert states, take its own actions, collect the resulting future states, and convert those consequences into supervision. The research team instantiates this with two concrete strategies—Implicit World Modeling (IWM) and Self-Reflection (SR)—and reports consistent gains across eight environments and multiple base models.

https://arxiv.org/pdf/2510.08558

What does Early Experience change?

Traditional pipelines lean on imitation learning (IL) over expert trajectories, which is cheap to optimize but hard to scale and brittle out-of-distribution; reinforcement learning (RL) promises learning from experience but needs verifiable rewards and stable infrastructure—often missing in web and multi-tool settings. Early Experience sits between them: it is reward-free like imitation learning (IL), but the supervision is grounded in consequences of the agent’s own actions, not just expert actions. In short, the agent proposes, acts, and learns from what actually happens next—no reward function required.

Implicit World Modeling (IWM): Train the model to predict the next observation given the state and chosen action, tightening the agent’s internal model of environment dynamics and reducing off-policy drift.

Self-Reflection (SR): Present expert and alternative actions at the same state; have the model explain why the expert action is better using the observed outcomes, then fine-tune the policy from this contrastive signal.

Both strategies use the same budgets and decoding settings as IL; only the data source differs (agent-generated branches rather than more expert trajectories).

https://arxiv.org/pdf/2510.08558

Understanding the Benchmarks

The research team evaluates on eight language-agent environments spanning web navigation, long-horizon planning, scientific/embodied tasks, and multi-domain API workflows, for example, WebShop (transactional browsing), TravelPlanner (constraint-rich planning), ScienceWorld, ALFWorld, Tau-Bench, and others. Early Experience yields average absolute gains of +9.6 points in success rate and +9.4 points on out-of-domain (OOD) tasks over IL across the full matrix of tasks and models. These gains persist when the same checkpoints are used to initialize RL (GRPO), improving post-RL ceilings by up to +6.4 points compared to reinforcement learning (RL) started from imitation learning (IL).

Efficiency: less expert data, same optimization budget

A key practical win is demo efficiency. With a fixed optimization budget, Early Experience matches or beats IL using a fraction of expert data. On WebShop, 1/8 of the demonstrations with Early Experience already exceeds IL trained on the full demo set; on ALFWorld, parity is hit at 1/2 the demos. The advantage grows with more demonstrations, indicating the agent-generated future states provide supervision signals that demonstrations alone do not capture.

How is the data built?

The pipeline seeds from a limited set of expert rollouts to obtain representative states. At selected states, the agent proposes alternative actions, executes them, and records the next observations.

For IWM, the training data are triplets ⟨state, action, next-state⟩ and the objective is next-state prediction.

For SR, the prompts include the expert action and several alternatives plus their observed outcomes; the model produces a grounded rationale explaining why the expert action is preferable, and this supervision is then used to improve the policy.
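To make the pipeline concrete, here is a minimal, hypothetical sketch of the data-building loop; the env.reset_to/env.step interface, the propose_actions helper, and the record fields are illustrative assumptions rather than the paper's actual code.

from dataclasses import dataclass
from typing import List, Dict

@dataclass
class Branch:
    state: str       # observation at the branching point (taken from an expert rollout)
    action: str      # action the agent proposed itself
    next_state: str  # observation actually produced by executing that action

def build_early_experience(expert_states: List[Dict], env, propose_actions, k: int = 3):
    """Branch from expert states, execute the agent's own actions, and record the outcomes.
    env.reset_to(state) and env.step(action) are assumed, illustrative interfaces."""
    iwm_triplets, sr_prompts = [], []
    for rec in expert_states:  # rec: {"state": ..., "expert_action": ...}
        alternatives = propose_actions(rec["state"], k=k)  # agent-generated alternative actions
        outcomes = []
        for act in alternatives:
            env.reset_to(rec["state"])
            next_obs = env.step(act)
            # Implicit World Modeling: supervise next-observation prediction on the triplet
            iwm_triplets.append(Branch(rec["state"], act, next_obs))
            outcomes.append((act, next_obs))
        # Self-Reflection: contrast the expert action with the observed alternatives
        sr_prompts.append({
            "state": rec["state"],
            "expert_action": rec["expert_action"],
            "alternatives": outcomes,
            "instruction": "Explain, using the observed outcomes, why the expert action is preferable.",
        })
    return iwm_triplets, sr_prompts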

Where does reinforcement learning (RL) fit?

Early Experience is not “RL without rewards.” It is a supervised recipe that uses agent-experienced outcomes as labels. In environments with verifiable rewards, the research team simply add RL after Early Experience. Because the initialization is better than IL, the same RL schedule climbs higher and faster, with up to +6.4 final success over IL-initialized RL across tested domains. This positions Early Experience as a bridge: reward-free pre-training from consequences, followed (where possible) by standard reinforcement learning (RL).

Key Takeaways

Reward-free training via agent-generated future states (not rewards) using Implicit World Modeling and Self-Reflection outperforms imitation learning across eight environments.

Reported absolute gains over IL: +18.4 (WebShop), +15.0 (TravelPlanner), +13.3 (ScienceWorld) under matched budgets and settings.

Demo efficiency: exceeds IL on WebShop with 1/8 of demonstrations; reaches ALFWorld parity with 1/2—at fixed optimization cost.

As an initializer, Early Experience boosts subsequent RL (GRPO) endpoints by up to +6.4 versus RL started from IL.

Validated on multiple backbone families (3B–8B) with consistent in-domain and out-of-domain improvements; positioned as a bridge between imitation learning (IL) and reinforcement learning (RL).

Editorial Comments

Early Experience is a pragmatic contribution: it replaces brittle rationale-only augmentation with outcome-grounded supervision that an agent can generate at scale, without reward functions. The two variants—Implicit World Modeling (next-observation prediction to anchor environment dynamics) and Self-Reflection (contrastive, outcome-verified rationales against expert actions)—directly attack off-policy drift and long-horizon error accumulation, explaining the consistent gains over imitation learning across eight environments and the stronger RL ceilings when used as an initializer for GRPO. In web and tool-use settings where verifiable rewards are scarce, this reward-free supervision is the missing middle between IL and RL and is immediately actionable for production agent stacks.

Check out the PAPER here.
The post Meta AI’s ‘Early Experience’ Trains Language Agents without Rewards—and Outperforms Imitation Learning appeared first on MarkTechPost.

Transforming enterprise operations: Four high-impact use cases with Amazon Nova

Since the launch of Amazon Nova at AWS re:Invent 2024, we have seen growing adoption across industries, with customers reporting notable gains in operational efficiency, compliance, and customer satisfaction. With its capabilities in secure, multimodal AI and domain customization, Nova is enhancing workflows and enabling cost efficiencies across core use cases.
In this post, we share four high-impact, widely adopted use cases built with Nova in Amazon Bedrock, supported by real-world customer deployments, offerings available from AWS Partners, and hands-on experiences. These examples are ideal for organizations researching their own AI adoption strategies and use cases across industries.
Customer service
Traditional chatbots often frustrate users with scripted, inflexible responses that fail to understand context or intent. For enterprises, these are missed opportunities to resolve issues quickly, lower support costs, and drive customer loyalty. AI-powered applications can understand natural language, adapt to individual customer needs, and integrate with backend systems in real time. Organizations are transforming support from a cost center into a strategic driver of satisfaction and retention. These are often high-volume and interactive scenarios, so the balance of cost, speed, and intelligence is critical.
Customer service applications built with Nova in Amazon Bedrock can seamlessly integrate with business data stored on AWS, and offer the security, privacy, and reliability needed for production use in enterprise environments.

Infosys, a leading global IT services and consulting organization, developed Infosys Event AI for real-time transcription, multilingual translation, and intelligent summarization of live event content. Infosys Event AI is built with Amazon Nova Pro in Amazon Bedrock. During a recent event in Bangalore, the AI assistant handled around 230 users per minute and was queried an average of 57 times per minute, generating more than 9,000 session summaries. This solution enhanced knowledge retention, engagement, and inclusivity by making event insights instantly accessible in multiple languages and formats for hearing-impaired and remote participants. By transforming event content into a persistent, searchable multilingual knowledge asset, Infosys Event AI accelerates learning and collaboration.
Fortinet, an AWS Partner and cybersecurity company, uses Amazon Nova Micro to power its AI support assistant, delivering significant performance improvements at a fraction of the cost. By switching to Nova Micro in Amazon Bedrock, Fortinet achieved an 85 times reduction in inference costs, dramatically lowering TCO while maintaining rapid response times. The assistant now helps users quickly navigate complex documentation across more than 60 products, improving support efficiency and elevating customer satisfaction.
Amazon Customer Service uses Nova with its AI-driven issue resolution system. The system is a two-step approach combining intent detection and issue resolution. Amazon Customer Service customized Nova Micro, resulting in 76.9% accuracy for in-domain issues and 69.2% in generalization testing, surpassing current baselines by 5.4% and 7.3%, respectively. Additionally, Nova Lite is used for tool selection, achieving 86.1% accuracy and 4.8% improvement over existing systems.
AWS Summit New York City 2025 was attended by 18,000 participants and featured Diana, a customer service AI assistant developed with Nova Sonic. Attendees could dial a phone number to reach the Nova Sonic-powered voice assistant, which answered hundreds of queries about the event, including session details, location, and FAQs.

Search
Large enterprises face slow, siloed, and inefficient search across vast stores of structured and unstructured data, costing time, productivity, and customer responsiveness. By adopting AI-powered, multimodal search that understands natural language and enforces secure access, organizations can deliver instant, relevant answers from documents, images, and technical files. This accelerates decision-making, shortens deal cycles, improves customer satisfaction, and reduces the cost of knowledge discovery at scale. Search applications increasingly rely on a mix of information across modalities, including text, documents, images, and video.
Nova is among the fastest and most cost-effective multimodal models, offering vision fine-tuning capabilities. Nova also integrates with broader Amazon models including Amazon Titan Multimodal Embeddings and data services including Amazon OpenSearch Service for more robust search capabilities and performance.

Siemens faced growing performance bottlenecks as its massive datasets strained traditional search systems, slowing retrieval speeds and impacting productivity across its global operations. To address this, Siemens integrated Amazon Nova, achieving a threefold boost in search performance that dramatically accelerated data retrieval and improved workflow efficiency. Amazon Nova delivers high-speed, scalable search capabilities, and Siemens’s implementation facilitates seamless integration with existing systems, maintaining business continuity with minimal disruption. This enhanced user experience and positioned Siemens to handle future data growth with ease, supported by continuous performance monitoring and tight infrastructure alignment.
CBRE Global Pulse System (GPS)—built with Amazon Nova Pro in Amazon Bedrock and OpenSearch Service—transforms property search across thousands of users worldwide. Built in partnership with AWS Professional Services and GenAI Specialists, GPS replaces slow, fragmented legacy systems with an AI-driven, multimodal search platform capable of handling complex queries, massive PDFs, and strict permission controls. Key results include 75% faster document ingestion, 70% lower database latency, 87% faster keyword searches, and 51% faster natural language queries. When fully deployed to over 6,000 users later in 2025, GPS is projected to save over 98,000 employee workdays annually, unlocking $320,000 ARR and significant operational efficiency. By shifting from Anthropic’s Claude Sonnet to Nova Pro and Anthropic’s Claude Haiku 3, CBRE also cut AI inference costs by 3.5 times and 12 times, respectively, without sacrificing accuracy.

Video understanding and analysis
Organizations are adopting video understanding applications to drive business value across multiple fronts, including customer behavior analysis, traffic patterns, and manufacturing quality control. Security and safety benefits are realized through real-time threat detection and workplace safety monitoring, and customer experience is enhanced through personalized content recommendations and improved content searchability. Organizations gain competitive advantage through data-driven decision-making and innovation in service delivery, while reducing costs by minimizing manual review processes and decreasing security incidents. This comprehensive approach to video analysis helps companies extract insights from their video data, ultimately leading to improved operations, better decision-making, and enhanced customer experiences. As developers build, iterate, and evolve these applications, there is a growing demand to natively understand video as opposed to dealing with the overhead of frames, time stamps, and synchronization.
Amazon Nova models can analyze, classify, and summarize information in the video based on provided instructions. Applications built with Nova understanding models in Amazon Bedrock offer comprehensive analysis of multiple video formats through flexible input methods, with the ability to analyze, classify, and summarize video content while handling files up to 1 GB through Amazon Simple Storage Service (Amazon S3) integration.

Bitcentral partnered with Caylent to transform how archived content is discovered, accessed, and reused. Using Nova Pro in Amazon Bedrock, Caylent deployed a solution that aligned with the needs of journalists, producers, and broadcasters across more than 1,600 client sites. By embedding semantic video search, contextual metadata generation, and AI-powered content analysis into its workflows, Bitcentral redefined how archived footage is indexed, discovered, and reused. Journalists and producers can now surface high-value content in real time and unlock new revenue streams.
Loka, an AWS Premier Partner, built a video surveillance offering to automatically identify and classify millions of visual events in video footage. This system effectively distinguishes between routine events and critical incidents, helping filter out non-essential activities and alerts. The solution proved highly successful, reducing irrelevant alerts by 55% while maintaining a threat detection rate above 97%. By implementing this automated filtering system, Loka doubled video monitoring efficiency for their client. The tool, built on Amazon Bedrock using Amazon Nova Pro, significantly reduced the workload for human operators while improving overall threat detection capabilities.
Accenture Spotlight can analyze long-form videos and automatically generate personalized short-form clips and highlights, which are particularly useful for sports content like soccer, Formula 1, and rugby. Spotlight is capable of matching content to specific audience demographics and can process real-time CCTV footage in retail settings to create personalized offers. The system is built with Amazon Nova in Amazon Bedrock and operates through three specialized super agents working under a central orchestrator. Spotlight can process videos in minutes rather than the traditional hours or days, while achieving cost savings that are 10 times better than conventional methods. The solution is versatile enough to be used across different industries, from media and entertainment to retail, while maintaining high quality standards and brand alignment through its human-in-the-loop quality assurance option.

Creative content generation
Organizations are seeking ways to revolutionize creative content generation, including stock imagery, marketing campaign assets, and product visualizations. This work is often slowed down by fragmented workflows, high production costs, and the need to continuously balance scale with personalization. Marketing teams struggle to keep up with the demand for fresh, high-quality assets across multiple channels, while creative fatigue and long lead times limit their agility.
Amazon Nova addresses these challenges with Nova Canvas and Nova Reel: high-quality creative models that transform text and image inputs into professional-grade images and videos. Nova creative models are designed to deliver customizable visual content with control features, making creative content generation accessible and efficient for media, entertainment, retail, marketing, and advertising industries.

Dentsu is reimagining how ads come to life with Amazon Nova creative generation models. What used to take weeks of brainstorming, filming, and editing now happens in days. Their creative teams can sketch out an idea in plain language and watch it turn into polished videos and custom images, ready for markets across the globe in over 200 languages. Built-in safeguards like moderation, watermarking, and IP indemnity mean every piece stays brand safe. For Co-op, Dentsu went a step further—pairing Nova with Amazon Ads to design custom audience profiles that delivered a +4-point lift in brand preference among 25–34-year-olds and a +5-point lift in favorability among affluent shoppers.
Quantiphi, an AWS Premier Global Consulting Partner, developed Qreator, a generative AI-powered marketing content creation service built on AWS. Their service helps marketers create content through natural language prompts while maintaining brand consistency and cross-channel adaptability. With Qreator, businesses can achieve an approximately 30% reduction in content creation time and get to market approximately 40% faster, automating what was a manual process and improving consistency across formats and channels.
The Fragrance Lab is a unique AWS activation that was showcased at the Cannes Lions International Festival of Creativity. It demonstrates how to build personalized products and campaign assets using Amazon Nova foundation models in Amazon Bedrock. Although our activation at Cannes Lions focused on personalized fragrance development and ad campaign creation, the underlying architecture and methodology can be adapted across diverse categories, such as fashion, food, and beverage, opening endless possibilities for customized customer experiences. The Fragrance Lab activation won two International Business Awards: Gold for Exhibition Event Experience and Silver for Experiential Event.

Conclusion
The four use cases presented in this post demonstrate the utility of Amazon Nova across industries and applications. From Infosys’s Event AI improving accessibility and engagement, to CBRE’s revolutionary property search system, to Loka’s intelligent video surveillance, and Dentsu’s creative content generation, each implementation showcases significant, measurable improvements in efficiency, cost reduction, and customer satisfaction.
Organizations using Amazon Nova are achieving tangible business outcomes through evidence-based adoption strategies. By partnering with Amazon and AWS Partners, organizations are accelerating their AI transformation while maintaining strong foundations in security, compliance, and privacy-by-design principles.
To get started building with Nova, visit the Amazon Nova user guide or the AWS Management Console.

About the Authors
Abhinav Bhargava is a Sr Product Marketing Manager at AWS on the Amazon Nova team, where he focuses on scaling generative AI adoption through customer-centric solutions. With a background in design and sustainability, he brings a unique perspective to connecting technology and creativity to drive enterprise innovation. Based in Seattle, Abhinav enjoys playing volleyball, traveling, and learning about new cultures.
Raechel Frick is a Sr Product Marketing Manager at AWS. With over 20 years of experience in the tech industry, she brings a customer-first approach and growth mindset to building integrated marketing programs. Based in the greater Seattle area, Raechel balances her professional life with being a soccer mom and after-school carpool manager, demonstrating her ability to excel both in the corporate world and family life.

Building smarter AI agents: AgentCore long-term memory deep dive

Building AI agents that remember user interactions requires more than just storing raw conversations. While Amazon Bedrock AgentCore short-term memory captures immediate context, the real challenge lies in transforming these interactions into persistent, actionable knowledge that spans across sessions. This is the information that transforms fleeting interactions into meaningful, continuous relationships between users and AI agents. In this post, we’re pulling back the curtain on how the Amazon Bedrock AgentCore Memory long-term memory system works.
If you’re new to AgentCore Memory, we recommend reading our introductory blog post first: Amazon Bedrock AgentCore Memory: Building context-aware agents. In brief, AgentCore Memory is a fully managed service that enables developers to build context-aware AI agents by providing both short-term working memory and long-term intelligent memory capabilities.
The challenge of persistent memory
When humans interact, we don’t just remember exact conversations—we extract meaning, identify patterns, and build understanding over time. Teaching AI agents to do the same requires solving several complex challenges:

Agent memory systems must distinguish between meaningful insights and routine chatter, determining which utterances deserve long-term storage versus temporary processing. A user saying “I’m vegetarian” should be remembered, but “hmm, let me think” should not.
Memory systems need to recognize related information across time and merge it without creating duplicates or contradictions. When a user mentions they’re allergic to shellfish in January and says “can’t eat shrimp” in March, these need to be recognized as related facts and consolidated with the existing knowledge.
Memories must be processed in order of temporal context. Preferences that change over time (for example, the user loved spicy chicken in a restaurant last year, but today, they prefer mild flavors) require careful handling to make sure the most recent preference is respected while maintaining historical context.
As memory stores grow to contain thousands or millions of records, finding relevant memories quickly becomes a significant challenge. The system must balance comprehensive memory retention with efficient retrieval.

Solving these problems requires sophisticated extraction, consolidation, and retrieval mechanisms that go beyond simple storage. Amazon Bedrock AgentCore Memory tackles these complexities by implementing a research-backed long-term memory pipeline that mirrors human cognitive processes while maintaining the precision and scale required for enterprise applications.
How AgentCore long-term memory works
When the agentic application sends conversational events to AgentCore Memory, it initiates a pipeline to transform raw conversational data into structured, searchable knowledge through a multi-stage process. Let’s explore each component of this system. 
1. Memory extraction: From conversation to insights
When new events are stored in short-term memory, an asynchronous extraction process analyzes the conversational content to identify meaningful information. This process leverages large language models (LLMs) to understand context and extract relevant details that should be preserved in long-term memory. The extraction engine processes incoming messages alongside prior context to generate memory records in a predefined schema. As a developer, you can configure one or more Memory strategies to extract only the information types relevant to your application needs. The extraction process supports three built-in memory strategies:

Semantic memory: Extracts facts and knowledge. Example:

“The customer’s company has 500 employees across Seattle, Austin, and Boston”

User preferences: Captures explicit and implicit preferences given context. Example:

{"preference": "Prefers Python for development work", "categories": ["programming", "code-style"], "context": "User wants to write a student enrollment website"}

Summary memory: Creates running narratives of conversations under different topics scoped to sessions and preserves the key information in a structured XML format. Example:

<topic=“Material-UI TextareaAutosize inputRef Warning Fix Implementation”> A developer successfully implemented a fix for the issue in Material-UI where the TextareaAutosize component gives a “Does not recognize the ‘inputRef’ prop” warning when provided to OutlinedInput through the ‘inputComponent’ prop. </topic>

For each strategy, the system processes events with timestamps for maintaining the continuity of context and conflict resolution. Multiple memories can be extracted from a single event, and each memory strategy operates independently, allowing parallel processing.
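AgentCore Memory performs this extraction as a managed, asynchronous process, but conceptually it reduces to an LLM call that turns new events plus prior context into structured, timestamped records per strategy. The following sketch is purely illustrative: the prompt wording, llm_call function, and record schema are assumptions, not the service's internal implementation.

import json

EXTRACTION_PROMPT = """Extract {strategy} memories from the conversation below.
Return a JSON list of records; return [] if nothing is worth remembering long-term.

Prior context:
{context}

New events:
{events}"""

def extract_memories(llm_call, strategy: str, events: list, context: str) -> list:
    """Illustrative extraction step: one call per configured strategy, each run independently."""
    prompt = EXTRACTION_PROMPT.format(
        strategy=strategy,
        context=context,
        events="\n".join(f"[{e['timestamp']}] {e['role']}: {e['text']}" for e in events),
    )
    raw = llm_call(prompt)
    try:
        records = json.loads(raw)
    except json.JSONDecodeError:
        records = []  # discard malformed output rather than storing noise
    # Keep timestamps so consolidation can reason about recency later
    return [{"strategy": strategy, "record": r, "as_of": events[-1]["timestamp"]} for r in records]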
2. Memory consolidation
Rather than simply adding new memories to existing storage, the system performs intelligent consolidation to merge related information, resolve conflicts, and minimize redundancies. This consolidation makes sure the agent’s memory remains coherent and up to date as new information arrives.
The consolidation process works as follows:

Retrieval: For each newly extracted memory, the system retrieves the most semantically similar existing memories from the same namespace and strategy.
Intelligent processing: The new memory and retrieved memories are sent to the LLM with a consolidation prompt. The prompt preserves the semantic context, thus avoiding unnecessary updates (for example, “loves pizza” and “likes pizza” are considered essentially the same information). Preserving these core principles, the prompt is designed to handle various scenarios:

You are an expert in managing data. Your job is to manage memory store. 
Whenever a new input is given, your job is to decide which operation to perform.

Here is the new input text.
TEXT: {query}

Here is the relevant and existing memories
MEMORY: {memory}

You can call multiple tools to manage the memory stores…
Based on this prompt, the LLM determines the appropriate action:

ADD: When the new information is distinct from existing memories
UPDATE: Enhance existing memories when the new knowledge complements or updates the existing memories
NO-OP: When the information is redundant

Vector store updates: The system applies the determined actions, maintaining an immutable audit trail by marking the outdated memories as INVALID instead of instantly deleting them.

This approach makes sure that contradictory information is resolved (prioritizing recent information), duplicates are minimized, and related memories are appropriately merged.
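The following illustrative sketch mirrors the retrieve-decide-apply flow described above; the vector_store and llm_decide interfaces and the field names are assumptions for explanation only, not the managed service's actual code.

def consolidate(new_memory: dict, vector_store, llm_decide, top_k: int = 5):
    """Illustrative consolidation: compare a newly extracted memory against its nearest neighbors."""
    # 1. Retrieval: most semantically similar existing memories in the same namespace/strategy
    neighbors = vector_store.search(new_memory["text"],
                                    namespace=new_memory["namespace"],
                                    strategy=new_memory["strategy"],
                                    top_k=top_k)
    # 2. Intelligent processing: an LLM decides ADD / UPDATE / NO-OP given the consolidation prompt
    decision = llm_decide(new_text=new_memory["text"],
                          existing=[m["text"] for m in neighbors])
    # 3. Vector store updates: apply the action while keeping an immutable audit trail
    if decision["action"] == "ADD":
        vector_store.insert(new_memory)
    elif decision["action"] == "UPDATE":
        outdated = decision["target_id"]
        vector_store.mark_invalid(outdated)  # never hard-delete; mark the old record INVALID
        vector_store.insert({**new_memory, "supersedes": outdated})
    # NO-OP: redundant information, nothing is written
    return decision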
Handling edge cases
The consolidation process gracefully handles several challenging scenarios:

Out-of-order events: Although the system processes events in temporal order within sessions, it can handle late-arriving events through careful timestamp tracking and consolidation logic.
Conflicting information: When new information contradicts existing memories, the system prioritizes recency while maintaining a record of previous states:

Existing: “Customer budget is $500”
New: “Customer mentioned budget increased to $750”
Result: New active memory with $750, previous memory marked inactive

Memory failures: If consolidation fails for one memory, it doesn’t impact others. The system uses exponential backoff and retry mechanisms to handle transient failures. If consolidation ultimately fails, the memory is added to the system to help prevent potential loss of information.

Advanced custom memory strategy configurations
While built-in memory strategies cover common use cases, AgentCore Memory recognizes that different domains require tailored approaches for memory extraction and consolidation. The system supports built-in strategies with overrides for custom prompts that extend the built-in extraction and consolidation logic, letting teams adapt memory handling to their specific requirements. To maintain system compatibility and focus on criteria and logic rather than output formats, custom prompts help developers customize what information gets extracted or filtered out, how memories should be consolidated, and how to resolve conflicts between contradictory information.
AgentCore Memory also supports custom model selection for memory extraction and consolidation. This flexibility helps developers balance accuracy and latency based on their specific needs. You can define these settings via the APIs when you create the memory resource, as a strategy override, or via the console.

Apart from override functionality, we also offer self-managed strategies that provide complete control over your memory processing pipeline. With self-managed strategies, you can implement custom extraction and consolidation algorithms using any models or prompts while leveraging AgentCore Memory for storage and retrieval. Also, using the Batch APIs, you can directly ingest extracted records into AgentCore Memory while maintaining full ownership of the processing logic.
Performance characteristics
We evaluated our built-in memory strategy across three public benchmarking datasets to assess different aspects of long-term conversational memory:

LoCoMo: Multi-session conversations generated through a machine-human pipeline with persona-based interactions and temporal event graphs. Tests long-term memory capabilities across realistic conversation patterns.
LongMemEval: Evaluates memory retention in long conversations across multiple sessions and extended time periods. We randomly sampled 200 QA pairs for evaluation efficiency.
PrefEval: Tests preference memory across 20 topics using 21-session instances to evaluate the system’s ability to remember and consistently apply user preferences over time.
PolyBench-QA: A question-answering dataset containing 807 Question Answer (QA) pairs across 80 trajectories, collected from a coding agent solving tasks in PolyBench.

We use two standard metrics: correctness and compression rate. LLM-based correctness evaluates whether the system can correctly recall and use stored information when needed. Compression rate measures how effectively the memory system stores information: it is the fraction of the full context that is removed, that is, 1 minus the ratio of output memory token count to full context token count. Higher compression rates indicate the system maintains essential information while reducing storage overhead. This compression directly translates to faster inference speeds and lower token consumption, the most critical consideration for deploying agents at scale, because it enables more efficient processing of large conversational histories and reduces operational costs.

Memory Type                               Dataset         Correctness   Compression Rate
RAG baseline (full conversation history)  LoCoMo          77.73%        0%
RAG baseline (full conversation history)  LongMemEval-S   75.2%         0%
RAG baseline (full conversation history)  PrefEval        51%           0%
Semantic Memory                           LoCoMo          70.58%        89%
Semantic Memory                           LongMemEval-S   73.60%        94%
Preference Memory                         PrefEval        79%           68%
Summarization                             PolyBench-QA    83.02%        95%

The retrieval-augmented generation (RAG) baseline performs well on factual QA tasks due to complete conversation history access, but struggles with preference inference. The memory system achieves strong practical trade-offs: although information compression leads to slightly lower correctness on some factual tasks, it provides 89-95% compression rates for scalable deployment, maintains bounded context sizes, and each memory strategy performs effectively on its specialized use cases.
For more complex tasks requiring inference (understanding user preferences or behavioral patterns), memory demonstrates clear advantages in both performance accuracy and storage efficiency—the extracted insights are more valuable than raw conversational data for these use cases.
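As a quick sanity check on how the compression numbers translate into token savings, the following throwaway calculation uses purely illustrative token counts (not figures from the benchmarks).

# Illustrative only: compression rate as the fraction of context tokens removed
full_context_tokens = 120_000    # hypothetical full conversation history
memory_output_tokens = 13_200    # hypothetical extracted long-term memory records

compression_rate = 1 - memory_output_tokens / full_context_tokens
print(f"Compression rate: {compression_rate:.0%}")  # -> Compression rate: 89%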
Beyond accuracy metrics, AgentCore Memory delivers the performance characteristics necessary for production deployment.

Extraction and consolidation operations complete within 20-40 seconds for standard conversations after the extraction is triggered.
Semantic search retrieval (retrieve_memory_records API) returns results in approximately 200 milliseconds.
Parallel processing architecture enables multiple memory strategies to process independently; thus, different memory types can be processed simultaneously without blocking each other.

These latency characteristics, combined with the high compression rates, enable the system to maintain responsive user experiences while managing extensive conversational histories efficiently across large-scale deployments.
Best practices for long-term memory
To maximize the effectiveness of long-term memory in your agents:

Choose the right memory strategies: Select built-in strategies that align with your use case or create custom strategies for domain-specific needs. Semantic memory captures factual knowledge, preference memory captures individual preferences, and summarization memory distills complex information for better context management. For example, a customer support agent might use semantic memory to capture customer transaction history and past issues, while summarization memory creates short narratives of current support conversations and troubleshooting workflows across different topics.
Design meaningful namespaces: Structure your namespaces to reflect your application’s hierarchy. This also enables precise memory isolation and efficient retrieval. For example, use customer-support/user/john-doe for individual agent memories and customer-support/shared/product-knowledge for team-wide information.
Monitor consolidation patterns: Regularly review what memories are being created (using the list_memories or retrieve_memory_records API), updated, or skipped; a hedged retrieval sketch follows this list. This helps refine your extraction strategies and helps the system capture relevant information that’s better suited to your use case.
Plan for async processing: Remember that long-term memory extraction is asynchronous. Design your application to handle the delay between event ingestion and memory availability. Consider using short-term memory for immediate retrieval needs while long-term memories are being processed and consolidated in the background. You might also want to implement fallback mechanisms or loading states to manage user expectations during processing delays.
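For the monitoring tip above, a minimal sketch of polling extracted records with boto3 might look like the following; the client name matches the AgentCore data plane referenced in this post, but the parameter names, identifiers, and response shape are assumptions, so verify them against the AgentCore Memory API reference before use.

import boto3

# Assumption: the AgentCore Memory data plane is exposed via the "bedrock-agentcore" client
client = boto3.client("bedrock-agentcore")

# Hypothetical identifiers and parameter shapes, for illustration only
response = client.retrieve_memory_records(
    memoryId="my-memory-resource-id",
    namespace="customer-support/user/john-doe",
    searchCriteria={"searchQuery": "what does this user prefer?", "topK": 5},
)
for record in response.get("memoryRecordSummaries", []):
    print(record)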

Conclusion
The Amazon Bedrock AgentCore Memory long-term memory system represents a significant advancement in building AI agents. By combining sophisticated extraction algorithms, intelligent consolidation processes, and immutable storage designs, it provides a robust foundation for agents that learn, adapt, and improve over time.
The science behind this system, from research-backed prompts to innovative consolidation workflow, makes sure that your agents don’t just remember, but understand. This transforms one-time interactions into continuous learning experiences, creating AI agents that become more helpful and personalized with every conversation.
Resources:
- AgentCore Memory Docs
- AgentCore Memory code samples
- Getting started with AgentCore
- Workshop

About the authors
Akarsha Sehwag is a Generative AI Data Scientist for Amazon Bedrock AgentCore GTM team. With over six years of expertise in AI/ML, she has built production-ready enterprise solutions across diverse customer segments in Generative AI, Deep Learning and Computer Vision domains. Outside of work, she likes to hike, bike or play Badminton.
Jiarong Jiang is a Principal Applied Scientist at AWS, driving innovations in Retrieval-Augmented Generation (RAG) and agent memory systems to improve the accuracy and intelligence of enterprise AI. She’s passionate about enabling customers to build context-aware, reasoning-driven applications that leverage their own data effectively.
Jay Lopez-Braus is a Senior Technical Product Manager at AWS. He has over ten years of product management experience. In his free time, he enjoys all things outdoors.
Dani Mitchell is a Generative AI Specialist Solutions Architect at Amazon Web Services (AWS). He is focused on helping accelerate enterprises across the world on their generative AI journeys with Amazon Bedrock and Bedrock AgentCore.
Peng Shi is a Senior Applied Scientist at AWS, where he leads advancements in agent memory systems to enhance the accuracy, adaptability, and reasoning capabilities of AI. His work focuses on creating more intelligent and context-aware applications that bridge cutting-edge research with real-world impact.

Configure and verify a distributed training cluster with AWS Deep Learning Containers

Training state-of-the-art large language models (LLMs) demands massive, distributed compute infrastructure. Meta’s Llama 3, for instance, was trained on 16,000 NVIDIA H100 GPUs for over 30.84 million GPU hours. Amazon Elastic Kubernetes Service (Amazon EKS) is a managed service that simplifies the deployment, management, and scaling of Kubernetes clusters at the scale needed to train LLMs. To facilitate the configuration of such large, distributed workloads, AWS Deep Learning Containers (DLCs) provide pre-built, performance-optimized images for popular frameworks like PyTorch, so teams can launch jobs faster and with fewer compatibility issues. However, even with Amazon EKS and DLCs, configuring clusters for large training workloads is not a trivial task.
A source of complexity for the configuration of the training cluster is the configuration of the GPUs in the GPU-powered instances used in distributed training. GPU-powered Amazon Elastic Compute Cloud (Amazon EC2) instances come in two families: the G family (for example, G6 with NVIDIA L4 Tensor Core GPUs) for cost-efficient inference and lighter training, and the P family (for example, P6 with NVIDIA GB200 NVL72) for massive, distributed jobs. A single P5 has 8 H100 GPUs with 640 GB HBM3 and delivers 3,200 Gbps EFA networking, ideal for multi-billion-parameter model training. Although G instances are more affordable, they lack the high-bandwidth, low-latency fabric, and memory throughput needed for extreme scale. P instances, though fast, require precise configuration of networking, storage, and GPU topologies, making them powerful but operationally complex and a potential source of misconfigurations or errors for the distributed job.
Misconfiguration issues in distributed training with Amazon EKS can be prevented following a systematic approach to launch required components and verify their proper configuration. This post walks through the steps to set up and verify an EKS cluster for training large models using DLCs.
Solution overview
The solution consists of the following high-level steps:

Build a Docker image with the required dependencies using a PyTorch Framework DLC.
Launch the required infrastructure in a stable, GPU-ready cluster with Amazon EKS.
Install task-specific plugins required for GPU device plugins, Elastic Fabric Adapter (EFA) support, distributed training frameworks, and persistent file storage.
Run health checks to verify node readiness and the correct configuration of NVIDIA and EFA plugins.
Launch a small training job to verify the whole system.

We walk through these steps using a fleet of two p4d.24xlarge instances that we are consuming from a capacity reservation. The scripts used in this post are available in GitHub. Similar scripts for other GPU-powered instances are available in the following GitHub repository. The overall component setup, including worker nodes with persistent storage, plugins, and drivers, is shown in the following diagram.

Prerequisites
To deploy this solution, you need to have these prerequisites:

An AWS account with billing enabled
Sufficient service quotas for on-demand G instances, or access to a capacity reservation
Hugging Face token with access to Meta Llama 2 7B

Build Docker image from AWS DLC
DLCs are pre-built, performance-optimized Docker images that make it straightforward to run popular frameworks like PyTorch and TensorFlow on AWS. Each DLC ships with a fully integrated stack that includes compatible versions of CUDA, cuDNN, and NCCL, plus optional EFA support for high-throughput, low-latency distributed training. These containers are validated across Amazon EC2, Amazon Elastic Container Service (Amazon ECS), and Amazon EKS, providing consistent performance on G- and P-family GPU instances. This uniform environment is critical for distributed workloads, where even minor version mismatches can trigger throughput degradation, stalled all-reduce operations, or CUDA/NCCL errors. Although it’s possible to build training containers from scratch, doing so at production scale is tedious: GPU drivers, CUDA, NCCL, and networking libraries must be aligned with strict version and hardware requirements. DLCs simplify this by providing secure, regularly updated images that are already optimized for AWS infrastructure.
Most distributed training jobs need additional libraries, launch utilities, or orchestration scripts that the base DLCs don’t include. As a result, teams typically use DLCs as a foundation and extend them with the dependencies required for their workloads. This approach preserves the reliability of AWS optimized images while providing the flexibility to customize for large-scale training.
In this post, we show the process of building a custom Docker container by adding custom Python libraries to the PyTorch 2.7.1 Training DLC to launch a training job with Meta Llama 2 7B. For more details, refer to AWS Deep Learning Containers for PyTorch 2.7 Training on EC2, ECS and EKS. To prevent mismatches with the NVIDIA drivers and CUDA versions, we recommend using an EC2 instance powered by a Deep Learning AMI (DLAMI) to build the image. The DLAMI is used only for building a container image used by the training job referenced in this post. It’s different from an Amazon EKS optimized AMI, which is used to run worker nodes in an EKS cluster to run that training job.
Complete the following steps to build a Docker image:

Launch an EC2 instance using the “Deep Learning Base OSS Nvidia Driver GPU AMI (Ubuntu 24.04)” for 64-bit (x86) architecture. Use at least a c5.4xlarge instance or larger, and enable HTTP/HTTPS traffic from the internet.

Allocate at least 100 GiB for storage.

Connect to the EC2 instance using an SSH client and your private key for authentication.
Clone the GitHub repository to access the scripts for this post:

git clone https://github.com/aws-samples/sample-aws-deep-learning-containers.git
cd sample-aws-deep-learning-containers/training/eks

Install the AWS CLI, kubectl, and eksctl to manage the training clusters from the command line of the EC2 instance:

source ./setup_ec2.sh

Run the following script to authenticate into the DLC registry, build the custom image with the dependencies specified in the Dockerfile, and push the custom image to a private repository:

bash ./build.sh

Launch EKS cluster
In this step, we use a YAML file to launch an EKS cluster that contains the required infrastructure for the distributed training job. We launch two managed node groups in an existing virtual private cloud (VPC) and subnets:

A system node group (c5.2xlarge) for running cluster system pods and auto scaling components
A GPU node group (p4d.24xlarge) with EFA enabled networking and RAID0 local storage, designed for distributed training

The script also installs several Amazon EKS add-ons (for example, an EBS CSI driver, Amazon CloudWatch observability, or a node monitoring agent) for storage provisioning and cluster observability.

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: eks-p4d
  region: PLACEHOLDER_AWS_REGION
  version: "1.33"

# List availability zones where cluster subnets will be created
availabilityZones:
  - PLACEHOLDER_AZ1
  - PLACEHOLDER_AZ2

# Substitute vpc and subnet ids below
# if you want a VPC to be created, comment out vpc related lines
vpc:
  id: PLACEHOLDER_VPC_ID
  subnets:
    private:
      private-one:
        id: PLACEHOLDER_SUBNET_PRIVATE_1
      private-two:
        id: PLACEHOLDER_SUBNET_PRIVATE_2
    public:
      public-one:
        id: PLACEHOLDER_SUBNET_PUBLIC_1
      public-two:
        id: PLACEHOLDER_SUBNET_PUBLIC_2

iam:
  withOIDC: true

# EKS-managed node group(s)
managedNodeGroups:
  # Nodegroup for system pods
  - name: sys
    instanceType: c5.2xlarge
    desiredCapacity: 1
    iam:
      withAddonPolicies:
        autoScaler: true
        cloudWatch: true
    nodeRepairConfig:
      enabled: true

  # GPU nodegroup
  # List availability zones where instances from this nodegroup will be launched
  # Update capacityReservationID with your own if you have a capacity reservation
  # Update desiredCapacity to the number of instances you want to launch
  - name: p4d
    instanceType: p4d.24xlarge
    instancePrefix: p4d
    privateNetworking: true
    efaEnabled: true
    minSize: 0
    desiredCapacity: 2
    maxSize: 4
    volumeSize: 500
    # if you have a Capacity Reservation, the AZ has to be the same
    # if you don't have a CR, nodes will be assigned per availability
    availabilityZones: ["PLACEHOLDER_AZ"]
    capacityReservation:
      capacityReservationTarget:
        capacityReservationID: "cr-xxxxxxxxxx"
    # Utilize the local instance store volume(s)
    overrideBootstrapCommand: |
      apiVersion: node.eks.aws/v1alpha1
      kind: NodeConfig
      spec:
        instance:
          localStorage:
            strategy: RAID0
    iam:
      withAddonPolicies:
        autoScaler: true
        cloudWatch: true
        ebs: true
        fsx: true
    nodeRepairConfig:
      enabled: true

addons:
  # vpc-cni, coredns, and kube-proxy addons are installed by default by EKS
  # we set up additional drivers as add-ons, including storage plugins
  - name: aws-ebs-csi-driver
    wellKnownPolicies: # add IAM and service account
      ebsCSIController: true
  - name: aws-fsx-csi-driver
    attachPolicyARNs:
      - arn:aws:iam::aws:policy/AmazonFSxFullAccess
  - name: eks-node-monitoring-agent
    resolveConflicts: overwrite
  - name: amazon-cloudwatch-observability
    resolveConflicts: overwrite
    attachPolicyARNs:
      - arn:aws:iam::aws:policy/CloudWatchFullAccess

Other sample configurations for training clusters are available in the GitHub repo:

eks-g4dn-vpc.yaml – G4dn with EFA
eks-p4de-odcr.yaml – P4de with capacity reservation
eks-p5-odcr.yaml – P5 with capacity reservation

You can modify the chosen YAML file with your AWS Region, Kubernetes version, VPC and subnets, and optional capacity reservation details. Managed node groups are recommended because they handle node lifecycle, software, and cluster integration automatically, reducing operational overhead compared to self-managed nodes.
After the YAML file has been updated, launch your cluster:

eksctl create cluster -f ./eks-p4d-odcr.yaml

Provisioning takes 15–30 minutes. You can verify the status of your nodes with the following command:

kubectl get nodes

With a successful deployment, you should see all nodes in Ready status.
Use the following command to see all pods created by installed add-ons in Running status:

kubectl get pods -A

Install training-specific plugins
After you set up a basic EKS cluster, you must install additional plugins to enable critical functionalities for distributed training workloads. These plugins make sure GPUs, high-speed networking, distributed training frameworks, and persistent storage are available and correctly integrated into the cluster:

NVIDIA GPU plugin – The NVIDIA device plugin exposes GPU resources to Kubernetes, enabling pods to request and use GPUs
EFA plugin – The EFA device plugin provides high-performance networking for EFA enabled instances (for example P4 and P5), which is essential for multi-node training
Distributed training plugins – These plugins include services like etcd—for rendezvous in PyTorch—and the Kubeflow Training Operator (with the MPI Operator) to enable large-scale job orchestration
Persistent file storage – The FSx CSI driver and EBS CSI driver enable scalable, high-throughput storage for datasets, model checkpoints, monitoring, and logs in Amazon FSx for Lustre and Amazon Elastic Block Store (Amazon EBS), respectively

By enabling these plugins, the cluster becomes production-ready for large-scale training workloads.
Install the NVIDIA device plugin
Because we’re using an Amazon EKS optimized AMI with GPU support, the NVIDIA device plugin is already included. Verify that the plugin pods are running with the following command:

kubectl get pods -n kube-system | grep nvidia

The expected output is as follows:

nvidia-device-plugin-daemonset-xxxxx 1/1 Running 0 3m48s
nvidia-device-plugin-daemonset-yyyyy 1/1 Running 0 3m48s

If the plugin is missing, install it manually with the following command:

kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.17.3/deployments/static/nvidia-device-plugin.yml

Verify the availability of GPUs in your nodes with the following command:

kubectl get nodes -o json | jq '.items[].status.capacity."nvidia.com/gpu"'

The expected output for nodes with 8 GPUs is as follows:

"8"
"8"

Install the EFA plugin
If you are using EFA enabled instances (such as P4d, P4de, or P5), verify that EFA resources are advertised:

kubectl get nodes -o=custom-columns=NAME:.metadata.name,EFA:.status.allocatable.vpc\.amazonaws\.com/efa

The expected values will depend on your instance type:

P4d or P4de: 4
P5: 32

If EFA is not visible, use the following command to install the plugin:

kubectl apply -f https://raw.githubusercontent.com/aws-samples/aws-do-eks/main/Container-Root/eks/deployment/efa-device-plugin/efa-k8s-device-plugin.yaml
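
Once the device plugin is running, EFA interfaces appear as the extended resource vpc.amazonaws.com/efa, which training pods request alongside GPUs. The following manifest is only an illustration of that resource-request pattern (the pod name, image, and counts are placeholders, not part of the sample repository; a p4d.24xlarge exposes 8 GPUs and 4 EFA interfaces):

# Illustrative pod showing how GPUs and EFA interfaces are requested together.
apiVersion: v1
kind: Pod
metadata:
  name: efa-resource-demo
spec:
  restartPolicy: Never
  containers:
    - name: worker
      image: public.ecr.aws/docker/library/busybox:latest
      command: ["sleep", "3600"]
      resources:
        requests:
          nvidia.com/gpu: 8
          vpc.amazonaws.com/efa: 4
        limits:
          nvidia.com/gpu: 8
          vpc.amazonaws.com/efa: 4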

Install distributed training plugins: etcd and Kubeflow Training Operator
In distributed PyTorch training workloads on Kubernetes, etcd serves as the rendezvous backend that coordinates workers. It acts as a central meeting point where training workers perform three critical functions: register their presence in the cluster, discover their peer workers, and synchronize startup across the distributed training job. This coordination pattern is particularly valuable when running large-scale machine learning (ML) workloads on Amazon EKS.
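
The repository's etcd.yaml manifest is not reproduced here; conceptually, it boils down to a single etcd pod plus a ClusterIP Service that workers use as the rendezvous endpoint. The following is a minimal sketch under that assumption (image tag, names, and flags are illustrative, not the repository's exact manifest):

# Illustrative single-node etcd rendezvous backend for torchrun.
apiVersion: v1
kind: Service
metadata:
  name: etcd
spec:
  selector:
    app: etcd
  ports:
    - name: client
      port: 2379
      targetPort: 2379
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: etcd
spec:
  replicas: 1
  selector:
    matchLabels:
      app: etcd
  template:
    metadata:
      labels:
        app: etcd
    spec:
      containers:
        - name: etcd
          image: quay.io/coreos/etcd:v3.5.9
          command:
            - /usr/local/bin/etcd
            - --listen-client-urls=http://0.0.0.0:2379
            - --advertise-client-urls=http://etcd:2379
            - --enable-v2=true   # v2 API is needed if torchrun uses the etcd-v2 rendezvous backend
          ports:
            - containerPort: 2379
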
Create an etcd store with the following command:

kubectl apply -f etcd.yaml

Verify its deployment:

kubectl get pods

The output should look like the following code:

NAME READY STATUS RESTARTS AGE
etcd-xxxxx-xxx 1/1 Running 0 10s

The Kubeflow Training Operator simplifies distributed PyTorch training on Amazon EKS by providing custom resources (such as PyTorchJob) that automate the complex orchestration of multi-node training deployments, including worker pod lifecycle management and fault handling. By using the built-in MPI Operator, it enables efficient inter-node communication patterns critical for distributed deep learning workloads, handling the intricacies of MPI process placement, rank assignment, and network configuration that would otherwise require significant manual setup and expertise.
Deploy Kubeflow Training Operator:

kubectl apply --server-side -k "github.com/kubeflow/training-operator.git/manifests/overlays/standalone?ref=v1.9.3"

The Kubeflow Training Operator (v1) is the legacy predecessor of Kubeflow Trainer (v2), which is currently in alpha status; its APIs may change.
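
To make the PyTorchJob abstraction concrete before we use it for the FSDP validation job later, here is a pared-down, illustrative spec for a two-node elastic job. It is not the repository's fsdp.yaml; the image reference, script name, and resource counts are placeholders:

# Illustrative two-node PyTorchJob using etcd rendezvous; not the repo's fsdp.yaml.
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: demo-fsdp
spec:
  elasticPolicy:
    rdzvBackend: etcd
    rdzvHost: etcd            # the etcd Service deployed above
    rdzvPort: 2379
    minReplicas: 2
    maxReplicas: 2
    nProcPerNode: 8           # one process per GPU on p4d.24xlarge
  pytorchReplicaSpecs:
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch   # PyTorchJob expects this container name
              image: <account-id>.dkr.ecr.<region>.amazonaws.com/<custom-dlc-image>:latest
              command: ["torchrun", "train.py"]   # rendezvous settings are injected by the operator
              resources:
                limits:
                  nvidia.com/gpu: 8
                  vpc.amazonaws.com/efa: 4
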
Install storage plugins: FSx for Lustre and Amazon EBS
For latency-sensitive and high-bandwidth throughput dynamic workloads, such as distributed training and model serving across multiple GPU compute instances, we recommend FSx for Lustre. It provides a fully managed, high-performance parallel file system that is designed for compute-intensive workloads like high-performance computing (HPC) and ML.
We installed the FSx for Lustre CSI driver as an Amazon EKS add-on when creating the cluster, so FSx for Lustre file systems can be mounted on Amazon EKS as persistent volumes (PVs). In this step, you deploy an FSx for Lustre file system, either standalone or linked to an Amazon Simple Storage Service (Amazon S3) bucket so that it acts as a high-performance cache for S3 data, providing fast I/O and high throughput for data access across your GPU compute instances.
Create the FSx for Lustre file system with the following command:

bash ./fsx_create.sh

Create a PVC object to allow Kubernetes pods to claim storage on the FSx for Lustre file system:

kubectl apply -f ./fsx-pvc-static.yaml

In FSx for Lustre, throughput scales with storage type and provisioned capacity. Optimize your deployment based on your dataset size and checkpointing needs.
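
The contents of fsx-pvc-static.yaml are not reproduced here. For orientation, static provisioning of an existing FSx for Lustre file system generally pairs a PersistentVolume that references the file system with a PersistentVolumeClaim bound to it; the sketch below uses placeholder values for the file system ID, DNS name, mount name, and capacity:

# Illustrative static provisioning of an existing FSx for Lustre file system.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: fsx-pv
spec:
  capacity:
    storage: 1200Gi
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: fsx.csi.aws.com
    volumeHandle: <fs-xxxxxxxxxxxxxxxxx>
    volumeAttributes:
      dnsname: <fs-xxxxxxxxxxxxxxxxx>.fsx.<region>.amazonaws.com
      mountname: <mountname>
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: fsx-claim
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""
  volumeName: fsx-pv
  resources:
    requests:
      storage: 1200Gi
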
The EBS CSI driver gives Amazon EKS the ability to dynamically create and attach block volumes (using Amazon EBS) to pods. When creating node groups, EBS root volumes can be preconfigured (size, type: gp2/gp3/io1/io2). We have already installed the EBS CSI driver through the EKS cluster setup. Verify that the instance role includes the policy arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy, because without it, EBS PVC provisioning will fail.
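
As a minimal illustration of dynamic provisioning with the EBS CSI driver (names and size are placeholders, not part of the sample repository), a gp3 StorageClass plus a PersistentVolumeClaim is typically all that is required:

# Illustrative dynamic EBS provisioning through the EBS CSI driver.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: scratch-ebs
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: gp3
  resources:
    requests:
      storage: 100Gi
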
In summary, by layering these plugins on top of a baseline EKS cluster, you can unlock the following:

GPUs for compute
High-performance networking
Orchestration for distributed training
Persistent storage

Together, these plugins create an environment capable of supporting large-scale, fault-tolerant, high-performance deep learning workloads on Amazon EKS.
Verify plugins for distributed training
When you first launch a distributed GPU training cluster on Amazon EKS (with AWS DLCs), it’s critical to validate that the environment is healthy before starting large-scale jobs. This prevents wasted time and cost due to misconfigurations or hardware issues. The checks discussed in this section cover the most important areas.
GPU driver and NVIDIA-SMI validation
Each GPU node must have a valid driver installation that matches the CUDA version in your AWS DLC. You can verify this either by running a script inside a GPU-enabled pod or by connecting with AWS Systems Manager.
Regardless of the option you chose, confirm the following as part of your validation:

The driver version matches the CUDA version in your DLC
The GPU model, temperature, and utilization look correct
No errors are reported

Option 1: Run inside a GPU-enabled debug pod
The NVIDIA System Management Interface (nvidia-smi) is a command line utility intended to aid in the management and monitoring of NVIDIA GPU devices. This utility makes it possible for administrators to query GPU device state.
Apply an nvidia-smi job manifest using the following code:

kubectl apply -f nvidia_smi.yaml
kubectl logs nvidia-smi
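
The repository's nvidia_smi.yaml is not shown here; conceptually, it is just a pod that requests a single GPU and runs nvidia-smi so its logs capture the driver and device state. A minimal equivalent sketch (the CUDA image tag is an assumption) looks like the following:

# Illustrative GPU smoke-test pod; the repository's nvidia_smi.yaml serves the same purpose.
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-smi
spec:
  restartPolicy: Never
  containers:
    - name: nvidia-smi
      image: nvidia/cuda:12.4.1-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1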

Option 2: Connect directly using Systems Manager
Find the instance ID of your node:

aws ec2 describe-instances \
  --filters "Name=tag:eks:nodegroup-name,Values=eks-p4d" \
  --query "Reservations[].Instances[].InstanceId" \
  --output text

Start a Systems Manager session:

aws ssm start-session --target <instance-id>

Run the nvidia-smi check to query the state of your GPUs:

nvidia-smi

NCCL and multi-node communication
Distributed training depends on fast GPU-to-GPU communication, often using the NVIDIA Collective Communications Library (NCCL).
Deploy NCCL tests with the following script:

kubectl apply -f ./nccl-tests.yaml

Verify that the NCCL worker pods are up and running:

kubectl get pods | grep nccl

The results should look like the following code:

nccl-tests-launcher 1/1 Running 0 12s
nccl-tests-worker-0 1/1 Running 0 13s
nccl-tests-worker-1 1/1 Running 0 12s

Validate the following:

All-reduce and communication operations complete without errors
Bandwidth and latency values are within expected ranges
If using EFA, confirm that NCCL is using the aws-ofi-nccl (Libfabric/EFA) plugin as its transport layer, which is optimal for EFA networking; one way to check is sketched after this list
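
A practical way to perform that last check is to raise NCCL's log level on the test workers and look for the OFI/EFA provider in the logs. The fragment below slots into the worker container spec of your NCCL test manifest; the exact variables your manifest already sets may differ, so treat this as an illustration:

# Illustrative diagnostics for EFA-backed NCCL; add to the worker container spec.
env:
  - name: NCCL_DEBUG
    value: "INFO"              # prints lines such as "NET/OFI Selected Provider is efa"
  - name: FI_PROVIDER
    value: "efa"               # ask Libfabric for the EFA provider
  - name: FI_EFA_USE_DEVICE_RDMA
    value: "1"                 # commonly set on P4d to enable GPUDirect RDMA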

Validate training environment with sample workload
Finally, validate that your framework (PyTorch), GPUs, and networking all integrate properly by running a small training workload. We demonstrate this by running supervised fine-tuning on a Meta Llama 2 model.

Get a Hugging Face token. Llama 2 7B is a gated model, so you must request access to it and then pass your Hugging Face token to the FSDP script. To register and obtain a token, see User access tokens. Then insert the token into your conf file.
Run the validation script to load the environment variables and generate a job YAML manifest from the template:

bash ./fsdp.sh

Start a PyTorch distributed job:

kubectl apply -f ./fsdp.yaml

The expected output is as follows:

pytorchjob.kubeflow.org/fsdp created

Check that the worker pods have been created:

kubectl get pods | grep fsdp

The output should show both FSDP worker pods as Running:

fsdp-worker-0 1/1 Running 0 7m11s
fsdp-worker-1 1/1 Running 0 7m11s

Inspect the job:

kubectl describe -f ./fsdp.yaml

You should see pod events like those in the following screenshot.

After the pod is created, review the logs for errors or failures:

kubectl logs -f fsdp-worker-0

When the job is complete, the pods should move to a Completed state:

fsdp-worker-0 0/1 Completed 0 9m32s
fsdp-worker-1 0/1 Completed 0 9m32s

If the job starts properly, you can stop the job with the following commands:

kubectl delete -f ./fsdp.yaml
kubectl delete -f ./etcd.yaml

Both the worker pods and the etcd pod must be deleted and recreated before launching a new job; otherwise, you might encounter a RendezvousClosedError.
These initial health checks help validate the following:

The cluster and nodes are ready
GPUs are installed, visible, and healthy
Multi-node communication is optimized
The AWS DLC environment can run ML workloads

After these checks pass, you can scale up to large-scale distributed training jobs.
Clean up
To avoid incurring ongoing costs, delete the cluster with the following command when it’s no longer needed:

eksctl delete cluster -f ./eks-p4d-odcr.yaml

Conclusion
Distributed training requires an infrastructure foundation that delivers both computing power and predictability. When you integrate the Amazon EKS optimized AMI together with AWS DLCs, the result is a GPU-enabled cluster offering a consistent, validated runtime environment that spans all nodes. The implementation of high-bandwidth, low-latency networking capabilities enhanced with EFA helps distributed workloads execute at maximum efficiency. The addition of GPU plugins, coupled with storage integration and distributed training frameworks, creates a streamlined approach to scaling and orchestration. The final step of executing targeted initial health checks, which include NCCL connectivity testing, confirms the cluster is fully prepared for long-duration training operations. After these components are properly configured, teams can redirect their energy from infrastructure maintenance to achieving breakthrough advances in model performance.
For scripts for running FSDP distributed training on Amazon EKS, refer to the following GitHub repo. For distributed training reference architectures and tests, refer to the following GitHub repo. For a list of available DLC images, refer to the following GitHub repo. For an alternative implementation for running ML training and inference on Amazon EKS using a JARK stack, refer to Deploy Generative AI Models on Amazon EKS.

About the authors
Meryem Ozcelik is a GenAI/ML Specialist Solution Architect at Amazon Web Services. Her work focuses on designing and implementing generative AI and machine learning solutions, specializing in Amazon Bedrock, SageMaker, and AI/ML workload optimization on AWS. She helps accelerate AI adoption through architectural guidance, best practices, and scalable ML infrastructure design. Meryem holds a Master’s Degree in Computer Science from Georgia Institute of Technology.
Pratik Yeole is a solutions architect specializing in container services at AWS. He helps customers adopt modern cloud-native architectures and best practices. He is a tenured Amazonian with expertise in containers and AI/ML. For leisure, he plays cricket, chess and enjoys game nights/hikes/restaurants with family and friends.
Felipe Lopez is a Senior AI/ML Specialist Solutions Architect at AWS. Prior to joining AWS, Felipe worked with GE Digital and SLB, where he focused on modeling and optimization products for industrial applications.
Jinyan Li is a Software Development Engineer at Amazon Web Services. Her work focuses on building and improving containerized environments for machine learning workloads on AWS. She holds a Master’s degree in Computer Science from Northeastern University.
Sirut “G” Buasai is a Software Development Engineer at Amazon Web Services, working within the SageMaker AI organization. He specializes in optimizing deep learning containers and developing cloud-native solutions for machine learning workloads. His expertise includes container optimization, Kubernetes development, and ML model performance benchmarking.

Alibaba’s Qwen AI Releases Compact Dense Qwen3-VL 4B/8B (Instruct & Thinking) With FP8 Checkpoints

Do you actually need a giant VLM when dense Qwen3-VL 4B/8B (Instruct/Thinking) with FP8 runs in low VRAM yet retains 256K→1M context and the full capability surface? Alibaba’s Qwen team has expanded its multimodal lineup with dense Qwen3-VL models at 4B and 8B scales, each shipping in two task profiles—Instruct and Thinking—plus FP8-quantized checkpoints for low-VRAM deployment. The drop arrives as a smaller, edge-friendly complement to the previously released 30B (MoE) and 235B (MoE) tiers and keeps the same capability surface: image/video understanding, OCR, spatial grounding, and GUI/agent control.

https://github.com/QwenLM/Qwen3-VL/tree/main

What’s in the release?

SKUs and variants: The new additions comprise four dense models—Qwen3-VL-4B and Qwen3-VL-8B, each in Instruct and Thinking editions—alongside FP8 versions of the 4B/8B Instruct and Thinking checkpoints. The official announcement explicitly frames these as “compact, dense” models with lower VRAM usage and full Qwen3-VL capabilities retained.

Context length and capability surface: The model cards list native 256K context with expandability to 1M, and document the full feature set: long-document and video comprehension, 32-language OCR, 2D/3D spatial grounding, visual coding, and agentic GUI control on desktop and mobile. These attributes carry over to the new 4B/8B SKUs.

Architecture notes: Qwen3-VL highlights three core updates: Interleaved-MRoPE for robust positional encoding over time/width/height (long-horizon video), DeepStack for fusing multi-level ViT features and sharpening image–text alignment, and Text–Timestamp Alignment beyond T-RoPE for event localization in video. These design details appear in the new cards as well, signaling architectural continuity across sizes.

Project timeline: The Qwen3-VL GitHub “News” section records the publication of Qwen3-VL-4B (Instruct/Thinking) and Qwen3-VL-8B (Instruct/Thinking) on Oct 15, 2025, following earlier releases of the 30B MoE tier and organization-wide FP8 availability.

FP8: deployment-relevant details

Numerics and parity claim: The FP8 repositories describe fine-grained FP8 quantization with a block size of 128 and report performance metrics nearly identical to the original BF16 checkpoints. For teams evaluating precision trade-offs on multimodal stacks (vision encoders, cross-modal fusion, long-context attention), having vendor-produced FP8 weights reduces re-quantization and re-validation burden.

Tooling status: The 4B-Instruct-FP8 card notes that Transformers does not yet load these FP8 weights directly, and recommends vLLM or SGLang for serving; the card includes working launch snippets. Separately, the vLLM recipes guide recommends FP8 checkpoints for H100 memory efficiency. Together, these point to immediate, supported paths for low-VRAM inference.

Key Takeaways

Qwen released dense Qwen3-VL 4B and 8B models, each in Instruct and Thinking variants, with FP8 checkpoints.

FP8 uses fine-grained FP8 (block size 128) with near-BF16 metrics; Transformers loading is not yet supported—use vLLM/SGLang.

Capability surface is preserved: 256K→1M context, 32-language OCR, spatial grounding, video reasoning, and GUI/agent control.

Model Card-reported sizes: Qwen3-VL-4B ≈ 4.83B params; Qwen3-VL-8B-Instruct ≈ 8.77B params.

Editorial Comments

Qwen’s decision to ship dense Qwen3-VL 4B/8B in both Instruct and Thinking forms with FP8 checkpoints is the practical part of the story: lower-VRAM, deployment-ready weights (fine-grained FP8, block size 128) and explicit serving guidance (vLLM/SGLang) make these models straightforward to deploy. The capability surface—256K context expandable to 1M, 32-language OCR, spatial grounding, video understanding, and agent control—remains intact at these smaller scales, which matters more than leaderboard rhetoric for teams targeting single-GPU or edge budgets.

Check out the model on Hugging Face and the GitHub repo.