Integrating custom dependencies in Amazon SageMaker Canvas workflows

When implementing machine learning (ML) workflows in Amazon SageMaker Canvas, organizations might need to consider external dependencies required for their specific use cases. Although SageMaker Canvas provides powerful no-code and low-code capabilities for rapid experimentation, some projects might require specialized dependencies and libraries that aren’t included by default in SageMaker Canvas. This post provides an example of how to incorporate code that relies on external dependencies into your SageMaker Canvas workflows.
Amazon SageMaker Canvas is a low-code/no-code (LCNC) ML platform that guides users through every stage of the ML journey, from initial data preparation to final model deployment. Without writing a single line of code, users can explore datasets, transform data, build models, and generate predictions.
SageMaker Canvas offers comprehensive data wrangling capabilities that help you prepare your data, including:

Over 300 built-in transformation steps
Feature engineering capabilities
Data normalization and cleansing functions
A custom code editor supporting Python, PySpark, and SparkSQL

In this post, we demonstrate how to incorporate dependencies stored in Amazon Simple Storage Service (Amazon S3) within an Amazon SageMaker Data Wrangler flow. Using this approach, you can run custom scripts that depend on modules not inherently supported by SageMaker Canvas.
Solution overview
To showcase the integration of custom scripts and dependencies from Amazon S3 into SageMaker Canvas, we explore the following example workflow.
The solution follows three main steps:

Upload custom scripts and dependencies to Amazon S3
Use SageMaker Data Wrangler in SageMaker Canvas to transform your data using the uploaded code
Train and export the model

The following diagram is the architecture for the solution.

In this example, we work with two complementary datasets available in SageMaker Canvas that contain shipping information for computer screen deliveries. By joining these datasets, we create a comprehensive dataset that captures various shipping metrics and delivery outcomes. Our goal is to build a predictive model that can determine whether future shipments will arrive on time based on historical shipping patterns and characteristics.
Prerequisites
As a prerequisite, you need access to Amazon S3 and Amazon SageMaker AI. If you don’t already have a SageMaker AI domain configured in your account, you also need permissions to create a SageMaker AI domain.
Create the data flow
To create the data flow, follow these steps:

On the Amazon SageMaker AI console, in the navigation pane, under Applications and IDEs, select Canvas, as shown in the following screenshot. You might need to create a SageMaker domain if you haven’t done so already.
After your domain is created, choose Open Canvas.

In Canvas, select the Datasets tab and select canvas-sample-shipping-logs.csv, as shown in the following screenshot. After the preview appears, choose + Create a data flow.

The initial data flow will open with one source and one data type.

At the top right of the screen, select Add data → tabular. Choose Canvas Datasets as the source and select canvas-sample-product-descriptions.csv.
Choose Next as shown in the following screenshot. Then choose Import.

After both datasets have been added, select the plus sign. From the dropdown menu, choose Combine data. From the next dropdown menu, choose Join.

To perform an inner join on the ProductId column, in the right-hand menu, under Join type, choose Inner join. Under Join keys, choose ProductId, as shown in the following screenshot.

After the datasets have been joined, select the plus sign. In the dropdown menu, select + Add transform. A preview of the dataset will open.

The dataset contains XShippingDistance (long) and YShippingDistance (long) columns. For our purposes, we want to use a custom function that will find the total distance using the X and Y coordinates and then drop the individual coordinate columns. For this example, we find the total distance using a function that relies on the mpmath library.

To call the custom function, select + Add transform. In the dropdown menu, select Custom transform. Change the editor to Python (Pandas) and try to run the following function from the Python editor:

from mpmath import sqrt  # Import sqrt from mpmath

def calculate_total_distance(df, x_col="XShippingDistance", y_col="YShippingDistance", new_col="TotalDistance"):
    # Use mpmath's sqrt to calculate the total distance for each row
    df[new_col] = df.apply(lambda row: float(sqrt(row[x_col]**2 + row[y_col]**2)), axis=1)

    # Drop the original x and y columns
    df = df.drop(columns=[x_col, y_col])

    return df

df = calculate_total_distance(df)

Running the function produces the following error: ModuleNotFoundError: No module named 'mpmath', as shown in the following screenshot.

This error occurs because mpmath isn’t a module that is inherently supported by SageMaker Canvas. To use a function that relies on this module, we need to approach the use of a custom function differently.
Zip the script and dependencies
To use a function that relies on a module that isn’t natively supported in Canvas, the custom script must be zipped with the module(s) it relies on. For this example, we used our local integrated development environment (IDE) to create a script.py that relies on the mpmath library.
The script.py file contains two functions: one function that is compatible with the Python (Pandas) runtime (function calculate_total_distance), and one that is compatible with the Python (Pyspark) runtime (function udf_total_distance).

def calculate_total_distance(df, x_col="XShippingDistance", y_col="YShippingDistance", new_col="TotalDistance"):
    from mpmath import sqrt  # Import sqrt from mpmath

    # Use mpmath's sqrt to calculate the total distance for each row
    df[new_col] = df.apply(lambda row: float(sqrt(row[x_col]**2 + row[y_col]**2)), axis=1)

    # Drop the original x and y columns
    df = df.drop(columns=[x_col, y_col])

    return df

def udf_total_distance(df, x_col="XShippingDistance", y_col="YShippingDistance", new_col="TotalDistance"):
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import FloatType

    spark = SparkSession.builder \
        .master("local") \
        .appName("DistanceCalculation") \
        .getOrCreate()

    def calculate_distance(x, y):
        import sys

        # Add the path to mpmath
        mpmath_path = "/tmp/maths"
        if mpmath_path not in sys.path:
            sys.path.insert(0, mpmath_path)

        from mpmath import sqrt
        return float(sqrt(x**2 + y**2))

    # Register and apply UDF
    distance_udf = udf(calculate_distance, FloatType())
    df = df.withColumn(new_col, distance_udf(df[x_col], df[y_col]))
    df = df.drop(x_col, y_col)

    return df

To make sure the script can run, install mpmath into the same directory as script.py (for example, by running pip install mpmath --target . from that directory).
Run zip -r my_project.zip . to create a .zip file containing the script and the mpmath installation. The current directory now contains the .zip file, our Python script, and the installed library our script depends on, as shown in the following screenshot.
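If you prefer to script the packaging step, the following is a minimal sketch, assuming script.py sits in the current working directory; the archive name and output location are arbitrary choices:

import subprocess
import shutil
import sys
from pathlib import Path

project_dir = Path(".")  # directory that contains script.py

# Install mpmath next to script.py so the library travels inside the archive
subprocess.run(
    [sys.executable, "-m", "pip", "install", "mpmath", "--target", str(project_dir)],
    check=True,
)

# Create the archive outside the project directory to avoid zipping the zip itself
shutil.make_archive("/tmp/my_project", "zip", root_dir=str(project_dir))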

Upload to Amazon S3
After creating the .zip file, upload it to an Amazon S3 bucket.
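You can do the upload from the Amazon S3 console or with a short boto3 call; the bucket name and key in the following sketch are placeholders for your own values:

import boto3

s3_client = boto3.client("s3")

# Placeholder bucket and key; replace with your own values
bucket_name = "your-canvas-artifacts-bucket"
zip_key = "functions/my_project.zip"

s3_client.upload_file("my_project.zip", bucket_name, zip_key)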

After the zip file has been uploaded to Amazon S3, it’s accessible in SageMaker Canvas.
Run the custom script
Return to the data flow in SageMaker Canvas, replace the prior custom function code with the following code, and choose Update.

import zipfile
import boto3
import sys
from pathlib import Path
import shutil
import importlib.util

def load_script_and_dependencies(bucket_name, zip_key, extract_to):
    """
    Downloads a zip file from S3, unzips it, and ensures dependencies are available.

    Args:
        bucket_name (str): Name of the S3 bucket.
        zip_key (str): Key for the .zip file in the bucket.
        extract_to (str): Directory to extract files to.

    Returns:
        str: Path to the extracted folder containing the script and dependencies.
    """

    s3_client = boto3.client("s3")

    # Local path for the zip file
    zip_local_path = '/tmp/dependencies.zip'

    # Download the .zip file from S3
    s3_client.download_file(bucket_name, zip_key, zip_local_path)
    print(f"Downloaded zip file from S3: {zip_key}")

    # Unzip the file
    try:
        with zipfile.ZipFile(zip_local_path, 'r') as zip_ref:
            zip_ref.extractall(extract_to)
            print(f"Extracted files to {extract_to}")
    except Exception as e:
        raise RuntimeError(f"Failed to extract zip file: {e}")

    # Add the extracted folder to the Python path
    if extract_to not in sys.path:
        sys.path.insert(0, extract_to)

    return extract_to


def call_function_from_script(script_path, function_name, df):
    """
    Dynamically loads a function from a Python script using importlib.
    """
    try:
        # Get the script name from the path
        module_name = script_path.split('/')[-1].replace('.py', '')

        # Load the module specification
        spec = importlib.util.spec_from_file_location(module_name, script_path)
        if spec is None:
            raise ImportError(f"Could not load specification for module {module_name}")

        # Create the module
        module = importlib.util.module_from_spec(spec)
        sys.modules[module_name] = module

        # Execute the module
        spec.loader.exec_module(module)

        # Get the function from the module
        if not hasattr(module, function_name):
            raise AttributeError(f"Function '{function_name}' not found in the script.")

        loaded_function = getattr(module, function_name)

        # Clean up: remove module from sys.modules after execution
        del sys.modules[module_name]

        # Call the function
        return loaded_function(df)

    except Exception as e:
        raise RuntimeError(f"Error loading or executing function: {e}")

bucket_name = 'canvasdatabuckett'  # S3 bucket name
zip_key = 'functions/my_project.zip'  # S3 path to the zip file with our custom dependency
script_name = 'script.py'  # Name of the script in the zip file
function_name = 'calculate_total_distance'  # Name of the function to call from our script
extract_to = '/tmp/maths'  # Local path for our custom script and dependencies

# Step 1: Load the script and dependencies
extracted_path = load_script_and_dependencies(bucket_name, zip_key, extract_to)

# Step 2: Call the function from the script
script_path = f"{extracted_path}/{script_name}"
df = call_function_from_script(script_path, function_name, df)

This example code unzips the .zip file and adds the required dependencies to the local path so they’re available to the function at run time. Because mpmath was added to the local path, you can now call a function that relies on this external library.
The preceding code runs using the Python (Pandas) runtime and calculate_total_distance function. To use the Python (Pyspark) runtime, update the function_name variable to call the udf_total_distance function instead.
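For example, after switching the editor to Python (Pyspark), only the function_name assignment needs to change. A minimal sketch, reusing the variables defined in the preceding block:

# Call the PySpark-compatible function from script.py instead
function_name = 'udf_total_distance'

df = call_function_from_script(script_path, function_name, df)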
Complete the data flow
As a last step, remove irrelevant columns before training the model. Follow these steps:

On the SageMaker Canvas console, select + Add transform. From the dropdown menu, select Manage columns.
Under Transform, choose Drop column. Under Columns to drop, add ProductId_0, ProductId_1, and OrderID, as shown in the following screenshot.

The final dataset should contain 13 columns. The complete data flow is pictured in the following image.

Train the model
To train the model, follow these steps:

At the top right of the page, select Create model and name your dataset and model.
Select Predictive analysis as the problem type and OnTimeDelivery as the target column, as shown in the following screenshot.

When building the model, you can choose to run a Quick build or a Standard build. A Quick build prioritizes speed over accuracy and produces a trained model in less than 20 minutes. A Standard build prioritizes accuracy over speed, but the model takes longer to train.
Results
After the model build is complete, you can view the model’s accuracy, along with metrics such as F1 score, precision, and recall. With a Standard build, the model achieved 94.5% accuracy.

After the model training is complete, there are four ways you can use your model:

Deploy the model directly from SageMaker Canvas to an endpoint
Add the model to the SageMaker Model Registry
Export your model to a Jupyter Notebook
Send your model to Amazon QuickSight for use in dashboard visualizations

Clean up
To manage costs and prevent additional workspace charges, choose Log out to sign out of SageMaker Canvas when you’re done using the application, as shown in the following screenshot. You can also configure SageMaker Canvas to automatically shut down when idle.
If you created an S3 bucket specifically for this example, you might also want to empty and delete your bucket.

Summary
In this post, we demonstrated how you can upload custom dependencies to Amazon S3 and integrate them into SageMaker Canvas workflows. By walking through a practical example of implementing a custom distance calculation function with the mpmath library, we showed how to:

Package custom code and dependencies into a .zip file
Store and access these dependencies from Amazon S3
Implement custom data transformations in SageMaker Data Wrangler
Train a predictive model using the transformed data

This approach means that data scientists and analysts can extend SageMaker Canvas capabilities beyond the more than 300 included functions.
To try custom transforms yourself, refer to the Amazon SageMaker Canvas documentation and sign in to SageMaker Canvas today. For additional insights into how you can optimize your SageMaker Canvas implementation, we recommend exploring these related posts:

Seamlessly transition between no-code and code-first machine learning with Amazon SageMaker Canvas and Amazon SageMaker Studio
Unlock the power of data governance and no-code machine learning with Amazon SageMaker Canvas and Amazon DataZone

About the Author
Nadhya Polanco is an Associate Solutions Architect at AWS based in Brussels, Belgium. In this role, she supports organizations looking to incorporate AI and Machine Learning into their workloads. In her free time, Nadhya enjoys indulging in her passion for coffee and exploring new destinations.

Generate training data and cost-effectively train categorical models w …

In this post, we explore how you can use Amazon Bedrock to generate high-quality categorical ground truth data, which is crucial for training machine learning (ML) models in a cost-sensitive environment. Generative AI solutions can play an invaluable role during the model development phase by simplifying training and test data creation for multiclass classification supervised learning use cases. We dive deep into this process on how to use XML tags to structure the prompt and guide Amazon Bedrock in generating a balanced label dataset with high accuracy. We also showcase a real-world example for predicting the root cause category for support cases. This use case, solvable through ML, can enable support teams to better understand customer needs and optimize response strategies.
Business challenge
The exploration and methodology described in this post address two key challenges: the cost of generating a ground truth dataset for multiclass classification use cases can be prohibitive, and conventional approaches and synthetic dataset creation strategies for generating ground truth data fall short of producing balanced classes and meeting the desired performance for real-world use cases.
Ground truth data generation is expensive and time consuming
Ground truth annotation needs to be accurate and consistent, often requiring massive time and expertise to ensure the dataset is balanced, diverse, and large enough for model training and testing. For a multiclass classification problem such as support case root cause categorization, this challenge compounds manyfold.
Let’s say the task at hand is to predict the root cause categories (Customer Education, Feature Request, Software Defect, Documentation Improvement, Security Awareness, and Billing Inquiry) for customer support cases. Based on our experiments using best-in-class supervised learning algorithms available in AutoGluon, we arrived at a 3,000 sample size for the training dataset for each category to attain an accuracy of 90%. This requirement translates into a time and effort investment from trained personnel, who could be support engineers or other technical staff, to review tens of thousands of support cases to arrive at an even distribution of 3,000 per category. With each support case and the related correspondences averaging 5 minutes per review and assessment from a human labeler, this translates into 1,500 hours (5 minutes x 18,000 support cases) of work, or 188 days considering an 8-hour workday. Besides the time in review and labeling, there is an upfront investment in training the labelers so that the exercise, split between 10 or more labelers, stays consistent. To break this down further, a ground truth labeling campaign split between 10 labelers would require close to 4 weeks to label 18,000 cases if the labelers spend 40 hours a week on the exercise.
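To make the arithmetic explicit, here is the same estimate expressed as a few lines of Python; the figures mirror the ones in the preceding paragraph:

# Back-of-the-envelope labeling effort, using the figures from the text
cases = 6 * 3000                                  # 6 categories x 3,000 samples each = 18,000 cases
minutes_per_case = 5
total_hours = cases * minutes_per_case / 60       # 1,500 hours
workdays = total_hours / 8                        # ~188 eight-hour workdays
weeks_with_10_labelers = total_hours / (10 * 40)  # ~3.75 weeks at 40 hours per labeler per week
print(total_hours, round(workdays), round(weeks_with_10_labelers, 2))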
Not only is such an extended and effort-intensive campaign expensive, but it can cause inconsistent labeling for categories every time the labeler puts aside the task and resumes it later. The exercise also doesn’t guarantee a balanced labeled ground truth dataset because some root cause categories such as Customer Education could be far more common than Feature Request or Software Defect, thereby extending the campaign.
Conventional techniques to get balanced classes or synthetic data generation have shortfalls
A balanced labeled dataset is critical for a multiclass classification use case to mitigate bias and make sure the model learns to accurately classify all classes, rather than favoring the majority class. If the dataset is imbalanced, with one or more classes having significantly fewer instances than others, the model might struggle to learn the patterns and features associated with the minority classes, leading to poor performance and biased predictions. This issue is particularly problematic in applications where accurate classification of minority classes is critical, such as medical diagnoses, fraud detection, or root cause categorization. For the use case of labeling the support root cause categories, it’s often harder to source examples for categories such as Software Defect, Feature Request, and Documentation Improvement for labeling than it is for Customer Education. This results in an imbalanced class distribution for training and test datasets.
To address this challenge, various techniques can be employed, including oversampling the minority classes, undersampling the majority classes, using ensemble methods that combine multiple classifiers trained on different subsets of the data, or synthetic data generation to augment minority classes. However, the ideal approach for achieving optimal performance is to start with a balanced and highly accurate labeled dataset for ground truth training.
Although oversampling for minority classes means extended and expensive data labeling with humans who review the support cases, synthetic data generation to augment the minority classes poses its own challenges. For the multiclass classification problem to label support case data, synthetic data generation can quickly result in overfitting. This is because it can be difficult to synthesize real-world examples of technical case correspondences that contain complex content related to software configuration, implementation guidance, documentation references, technical troubleshooting, and the like.
Because ground truth labeling is expensive and synthetic data generation isn’t an option for use cases such as root cause prediction, the effort to train a model is often put aside. This results in a missed opportunity to review the root cause trends that can guide investment in the right areas such as education for customers, documentation improvement, or other efforts to reduce the case volume and improve customer experience.
Solution overview
The preceding section discussed why conventional ground truth data generation techniques aren’t viable for certain supervised learning use cases and fall short in training a highly accurate model to predict the support case root cause in our example. Let’s look at how generative AI can help solve this problem.
Generative AI supports key use cases such as content creation, summarization, code generation, creative applications, data augmentation, natural language processing, scientific research, and many others. Amazon Bedrock is well-suited for this data augmentation exercise to generate high-quality ground truth data. Using highly tuned and custom tailored prompts with examples and techniques discussed in the following sections, support teams can pass the anonymized support case correspondence to Anthropic’s Claude 3.5 Sonnet on Amazon Bedrock or other available large language models (LLMs) to predict the root cause label for a support case from one of the many categories (Customer Education, Feature Request, Software Defect, Documentation Improvement, Security Awareness, and Billing Inquiry). After achieving the desired accuracy, you can use this ground truth data in an ML pipeline with automated machine learning (AutoML) tools such as AutoGluon to train a model and inference the support cases.
Checking LLM accuracy for ground truth data
To evaluate an LLM for the task of category labeling, the process begins by determining if labeled data is available. If labeled data exists, the next step is to check if the model’s use case produces discrete outcomes. Where discrete outcomes with labeled data exist, standard ML methods such as precision, recall, or other classic ML metrics can be used. These metrics provide high precision but are limited to specific use cases due to limited ground truth data.
If the use case doesn’t yield discrete outputs, task-specific metrics are more appropriate. These include metrics such as ROUGE or cosine similarity for text similarity, and specific benchmarks for assessing toxicity (Detoxify), prompt stereotyping (cross-entropy loss), or factual knowledge (HELM, LAMA).
If labeled data is unavailable, the next question is whether the testing process should be automated. The automation decision depends on the cost-accuracy trade-off because higher accuracy comes at a higher cost. For cases where automation is not required, human-in-the-loop (HIL) approaches can be used. This involves manual evaluation based on predefined assessment rules (for example, ground truth), yielding high evaluation precision, but it is often time-consuming and costly.
When automation is preferred, using another LLM to assess outputs can be effective. Here, a reliable LLM can be instructed to rate generated outputs, providing automated scores and explanations. However, the precision of this method depends on the reliability of the chosen LLM. Each path represents a tailored approach based on the availability of labeled data and the need for automation, allowing for flexibility in assessing a wide range of FM applications.
The following figure illustrates an FM evaluation workflow.

For the use case, if a historic collection of 10,000 or more support cases labeled using Amazon SageMaker Ground Truth with HIL is available, it can be used for evaluating the accuracy of the LLM prediction. The key goal for generating new ground truth data using Amazon Bedrock should be to augment it for increasing diversity and increasing the training data size for AutoGluon training to arrive at a performant model that can be used for the final inference or root cause prediction. In the following sections, we explain how to take an incremental and measured approach to improve Anthropic’s Claude 3.5 Sonnet prediction accuracy through prompt engineering.
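For instance, assuming the human (SageMaker Ground Truth) labels and the LLM predictions are available as two aligned lists, standard scikit-learn metrics can quantify the agreement; the sample values here are illustrative only:

from sklearn.metrics import accuracy_score, classification_report

# Illustrative, aligned lists of human labels and LLM-predicted labels
human_labels = ["Software Defect", "Customer Education", "Billing Inquiry"]
llm_labels = ["Software Defect", "Customer Education", "Feature Request"]

print(f"Accuracy: {accuracy_score(human_labels, llm_labels):.2%}")
print(classification_report(human_labels, llm_labels, zero_division=0))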
Prompt engineering for FM accuracy and consistency
Prompt engineering is the art and science of designing a prompt to get an LLM to produce the desired output. We suggest consulting LLM prompt engineering documentation such as Anthropic prompt engineering for experiments. Based on experiments conducted without a finely tuned and optimized prompt, we observed low accuracy rates of less than 60%. In the following sections, we provide a detailed explanation on how to construct your first prompt, and then gradually improve it to consistently achieve over 90% accuracy.
Designing the prompt
Before starting any scaled use of generative AI, you should have the following in place:

A clear definition of the problem you are trying to solve along with the end goal.
A way to test the model’s output for accuracy. The thumbs up/down technique to determine accuracy, along with comparison against the 10,000-case labeled dataset from SageMaker Ground Truth, is well-suited for this exercise.
A defined success criterion on how accurate the model needs to be.

It’s helpful to think of an LLM as a new employee who is very well read, but knows nothing about your culture, your norms, what you are trying to do, or why you are trying to do it. The LLM’s performance will depend on how precisely you can explain what you want. How would a skilled manager handle a very smart, but new and inexperienced employee? The manager would provide contextual background, explain the problem, explain the rules they should apply when analyzing the problem, and give some examples of what good looks like along with why it is good. Later, if they saw the employee making mistakes, they might try to simplify the problem and provide constructive feedback by giving examples of what not to do, and why. One difference is that an employee would understand the job they are being hired for, so we need to explicitly tell the LLM to assume the persona of a support employee.
Prerequisites
To follow along with this post, set up Amazon SageMaker Studio to run Python in a notebook and interact with Amazon Bedrock. You also need the appropriate permissions to access Amazon Bedrock models.
Set up SageMaker Studio
Complete the following steps to set up SageMaker Studio:

On the SageMaker console, choose Studio under Applications and IDEs in the navigation pane.
Create a new SageMaker Studio instance if you haven’t already.
If prompted, set up a user profile for SageMaker Studio by providing a user name and specifying AWS Identity and Access Management (IAM) permissions.
Open a SageMaker Studio notebook:

Choose JupyterLab.
Create a private JupyterLab space.
Configure the space (set the instance type to ml.m5.large for optimal performance).
Launch the space.
On the File menu, choose New and Notebook to create a new notebook.

Configure SageMaker to meet your security and compliance objectives. Refer to Configure security in Amazon SageMaker AI for details.

Set up permissions for Amazon Bedrock access
Make sure you have the following permissions:

IAM role with Amazon Bedrock permissions – Make sure that your SageMaker Studio execution role has the necessary permissions to access Amazon Bedrock. Attach the AmazonBedrockFullAccess policy or a custom policy with specific Amazon Bedrock permissions to your IAM role.
AWS SDKs and authentication – Verify that your AWS credentials (usually from the SageMaker role) have Amazon Bedrock access. Refer to Getting started with the API to set up your environment to make Amazon Bedrock requests through the AWS API.
Model access – Grant permission to use Anthropic’s Claude 3.5 Sonnet. For instructions, see Add or remove access to Amazon Bedrock foundation models.

Test the code using the native inference API for Anthropic’s Claude
The following code uses the native inference API to send a text message to Anthropic’s Claude. The Python code invokes the Amazon Bedrock Runtime service:

import boto3
import json
from datetime import datetime
import time

# Create an Amazon Bedrock Runtime client in the AWS Region of your choice.
client = boto3.client("bedrock-runtime", region_name="us-east-1")

# Set the model ID, e.g., Anthropic's Claude 3.5 Sonnet.
model_id = "anthropic.claude-3-5-sonnet-20240620-v1:0"

# Load the prompt from a file (shown and explained later in this post)
with open('prompt.txt', 'r') as file:
    data = file.read()

def callBedrock(body):
    # Format the request payload using the model's native structure.
    prompt = data + body

    # Truncate the prompt to the max input window size of Claude 3.5 Sonnet
    prompt = prompt[:180000]

    # Define parameters passed to the model.
    native_request = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 512,
        "temperature": 0.2,
        "messages": [
            {
                "role": "user",
                "content": [{"type": "text", "text": prompt}],
            }
        ],
    }

    # Convert the native request to JSON.
    request = json.dumps(native_request)

    try:
        # Invoke the model with the request.
        response = client.invoke_model(modelId=model_id, body=request)
    except Exception as e:
        print(f"ERROR: Can't invoke '{model_id}'. Reason: {e}")
        raise

    # Load the response returned from Amazon Bedrock into a JSON object
    model_response = json.loads(response["body"].read())

    # Extract and return the response text.
    response_text = model_response["content"][0]["text"]
    return response_text
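As a quick end-to-end check, you can call the helper with a single anonymized case transcript; the string below is a made-up example, not real case data:

# Hypothetical, anonymized case text used only to smoke-test the call
sample_case = (
    "Customer: Our nightly export job fails with error DB-7721 after the latest upgrade. "
    "Agent: We reproduced the issue and engineering is preparing a hotfix."
)

print(callBedrock(sample_case))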

Construct the initial prompt
We demonstrate the approach for the specific use case of root cause prediction with a goal of achieving 90% accuracy. Start by creating a prompt similar to the one you would give to humans in natural language. This can be a simple description of each root cause label and why you would choose it, how to interpret the case correspondences, and how to analyze and choose the corresponding root cause label, along with examples for every category. Ask the model to also provide its reasoning so you can understand how it reached certain decisions. It can be especially interesting to understand the reasoning behind the decisions you don’t agree with. See the following example code:

Please familiarize yourself with these categories.  When you evaluate a case, evaluate the definitions in order and label the case with the first definition that fits.  If a case morphs from one type to another, choose the type the case started out as. 

Read the correspondence, especially the original request, and the last correspondence from the support agent to the customer. If there are a lot of correspondences, or the case does not seem straightforward to infer, read the correspondences in date-stamped order to understand what happened. If the case references documentation, read or skim the documentation to determine whether the documentation clearly supports what the support agent mentioned and whether it answers the customer's issue.

Software Defect:  “Software Defect” are cases where the application does not work as expected. The support agent confirms this through analysis and troubleshooting and mentions internal team is working on a fix or patch to address the bug or defect.

An example of Software Defect case is [Customer: “Our data pipeline jobs are failing with a ‘memory allocation error’ during the aggregation phase. This started occurring after upgrading to version 4.2.1. The same ETL workflows were running fine before the upgrade. We’ve verified our infrastructure meets all requirements.” Agent: “After analyzing the logs, we’ve confirmed a memory leak in the aggregation module – a regression introduced in 4.2.1. Engineering has identified the root cause and is developing an emergency patch. We expect to release version 4.2.2 within 48 hours to resolve this issue.”]
…. 

Analyze the results
We recommend using a small sample (for example, 150) of random cases, running them through Anthropic’s Claude 3.5 Sonnet using the initial prompt, and manually checking the initial results. You can load the input data and model output into Excel and add the following columns for analysis (a short pandas sketch after the list shows how to summarize them):

Claude Label – A calculated column with Anthropic’s Claude’s category
Label – True category after reviewing each case and selecting a specific root cause category to compare with the model’s prediction and derive an accuracy measurement
Close Call – 1 or 0 so that you can take numerical averages
Notes – For cases where there was something noteworthy about the case or inaccurate categorizations
Claude Correct – A calculated column (0 or 1) based on whether our category matched the model’s output category
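Once those columns are filled in, a few lines of pandas can summarize the run. This is a sketch that assumes the sheet is exported to a file named analysis.csv with the column names listed above:

import pandas as pd

# Hypothetical export of the analysis spreadsheet described above
analysis = pd.read_csv("analysis.csv")

accuracy = analysis["Claude Correct"].mean()
close_call_rate = analysis["Close Call"].mean()
print(f"Accuracy: {accuracy:.1%}, close calls: {close_call_rate:.1%}")

# See where the misses concentrate by crossing true labels with Claude's labels
print(pd.crosstab(analysis["Label"], analysis["Claude Label"]))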

Although the first run is expected to have low accuracy unfit for using the prompt for generating the ground truth data, the reasoning will help you understand why Anthropic’s Claude mislabeled the cases. In the example, many of the misses fell into these categories and the accuracy was only 61%:

Cases where Anthropic’s Claude categorized Customer Education cases as Software Defect because it interpreted the support agent instructions to reconfigure something as a workaround for a Software Defect.
Cases where users asked questions about billing that Anthropic’s Claude categorized as Customer Education. Although billing questions could also be Customer Education cases, we wanted these to be categorized as the more specific Billing Inquiry category. Likewise, although Security Awareness cases are also Customer Education, we wanted to categorize these as the more specific Security Awareness category.

Iterate on the prompt and make changes
Providing the LLM explicit instructions on correcting these errors should result in a major boost in accuracy. We tested the following adjustments with Anthropic’s Claude:

We defined and assigned a persona with background information for the LLM: “You are a Support Agent and an expert on the enterprise application software. You will be classifying customer cases into categories…”
We ordered the categories from more deterministic and well-defined to less specific and instructed Anthropic’s Claude to evaluate the categories in the order they appear in the prompt.
We followed the Anthropic documentation suggestion to use XML tags and enclosed the root cause categories in light XML (not a formal XML document), with elements delimited by tags. It’s ideal to create a categories node with a separate sub-node for each category. Each category node should consist of the category name, a description, and what the output would look like. The categories should be delimited by begin and end tags.

You are a Support Agent and an expert on the enterprise application software. You will be classifying the customer support cases into categories, based on the given interaction between an agent and a customer. You can only choose ONE Category from the list below. You follow instructions well, step by step, and evaluate the categories in the order they appear in the prompt when making a decision.

The categories are defined as:

<categories>
<category>
<name>
“Software Defect”
</name>
<description>
“Software Defect” are cases where the application software does not work as expected. The agent confirms the application is not working as expected and may refer to internal team working on a fix or patch to address the bug or defect. The category includes common errors or failures related to performance, software version, functional defect, unexpected exception or usability bug when the customer is following the documented steps.
</description>
</category>

</categories>

We created a good examples node with at least one good example for every category. Each good example consisted of the example, the classification, and the reasoning:

Here are some good examples with reasoning:

<good examples>
<example>
<example data>
Customer: “Our data pipeline jobs are failing with a ‘memory allocation error’ during the aggregation phase. This started occurring after upgrading to version 4.2.1. The same ETL workflows were running fine before the upgrade. We’ve verified our infrastructure meets all requirements.”
Agent: “After analyzing the logs, we’ve confirmed a memory leak in the aggregation module – a regression introduced in 4.2.1. Engineering has identified the root cause and is developing an emergency patch. We expect to release version 4.2.2 within 48 hours to resolve this issue.”
</example data>
<classification>
“Software Defect”
</classification>
<explanation>
Customer is reporting a data processing exception with a specific version and the agent confirms this is a regression and defect. The agent confirms that engineering is working to provide an emergency patch for the issue.
</explanation>
</example>

</good examples>

We created a bad examples node with examples of where the LLM miscategorized previous cases. The bad examples node should have the same set of fields as the good examples, such as example data, classification, explanation, but the explanation explained the error. The following is a snippet:

Here are some examples for wrong classification with reasoning:

<bad examples>

<example>
<example data>
Customer: “We need the ability to create custom dashboards that can aggregate data across multiple tenants in real-time. Currently, we can only view metrics per individual tenant, which requires manual consolidation for our enterprise reporting needs.”
Agent: “I understand your need for cross-tenant analytics. While the current functionality is limited to single-tenant views as designed, I’ve submitted your request to our product team as a high-priority feature enhancement. They’ll evaluate it for inclusion in our 2025 roadmap. I’ll update you when there’s news about this capability.”
</example data>
<example output>
<classification>
“Software Defect”
</classification>
<explanation>
Classification should be Feature Request and not Software Defect. The application does not have the function or capability being requested but it is working as documented or advertised. In the example, the agent mentions they have submitted the request to their product team to consider in the future roadmap.
</explanation>
</example>

</bad examples>

We also added instructions for how to format the output:

Given the above categories defined in XML, logically think through which category fits best and then complete the classification. Provide a response in XML with the following elements: classification, explanation (limited to 2 sentences). Return your results as this sample output XML below and do not append your thought process to the response.
 
<response>
<classification> Software Defect </classification>
<explanation> The support case is for ETL Pipeline Performance Degradation where the customer reports their nightly data transformation job takes 6 hours to complete instead of 2 hours before but no changes to configuration occurred. The agent mentions Engineering confirmed memory leak in version 5.1.2 and are deploying a Hotfix indicating this is a Software Defect.
</explanation>
</response>

Test with the new prompt
The preceding approach should result in improved prediction accuracy. In our experiment, we saw 84% accuracy with the new prompt, and the output was consistent and more straightforward to parse. Anthropic’s Claude followed the suggested output format in almost all cases. We wrote code to fix errors such as unexpected tags in the output and to drop responses that could not be parsed.
The following is the code to parse the output:

# This Python script parses LLM output into a comma separated list with the SupportID, Category, Reason
# Command line is python parse_llm_output.py PathToLLMOutput.txt PathToParsedOutput.csv
# Note: It will overwrite the output file without confirming
# It will write completion status and any error messages to stdout

import re
import sys

# These tokens are based on the format of the Claude output.
# The pattern captures three groups (CaseID, RootCause, and Reasoning) that we extract with re.match.
pattern = re.compile(
    "^([0-9]*).*<classification>(.*)</classification><explanation>(.*)</explanation>"
)

endToken = "</response>"
checkToken = "<classification>"

acceptableClassifications = [
    "Billing Inquiry",
    "Documentation Improvement",
    "Feature Request",
    "Security Awareness",
    "Software Defect",
    "Customer Education",
]

def parseResponse(response):
    # Parsing is trivial with regular expression groups
    m = pattern.match(response)
    return m

# Get the input and output files
if len(sys.argv) != 3:
    print("Command line error: parse_llm_output.py inputfile outputfile")
    exit(1)

# Open the files
input = open(sys.argv[1], encoding="utf8")
output = open(sys.argv[2], "w")

# Read the entire file in. This works well with 30,000 responses, but would need to be adjusted for, say, 3,000,000 responses
responses = input.read()

# Get rid of the double quotes and newlines to avoid incorrect Excel parsing; these are unnecessary
responses = responses.replace('"', "")
responses = responses.replace("\n", "")

# Initialize our placeholder and counters
parsedChars = 0
skipped = 0
invalid = 0
responseCount = 0

# Write the header
output.write("CaseID,RootCause,Reason\n")

# Find the first response
index = responses.find(endToken, parsedChars)

while index > 0:
    # Extract the response
    response = responses[parsedChars : index + len(endToken)]
    # Parse it
    parsedResponse = parseResponse(response)

    # Is the response valid?
    if parsedResponse is None or len(response.split(checkToken)) != 2:
        # This happens when there is a missing /response delimiter or some other formatting problem; it clutters up this and the next response
        skipped = skipped + 2
    else:
        # If we have a valid response, write it to the file; enclose the reason in double quotes because it uses commas
        if parsedResponse.group(2) not in acceptableClassifications:
            # Make sure the classification is one we expect
            print("Invalid Classification: {0}".format(parsedResponse.group(2)))
            invalid = invalid + 1
        else:
            # Write a valid line to the output file
            output.write(
                '{0},{1},"{2}"\n'.format(
                    parsedResponse.group(1),
                    parsedResponse.group(2),
                    parsedResponse.group(3),
                )
            )

    # Move the pointer past where we parsed and update the counter
    parsedChars = index + len(endToken)
    responseCount = responseCount + 1

    # Find the next response
    index = responses.find(endToken, parsedChars)

print("skipped {0} of {1} responses".format(skipped, responseCount))
print("{0} of these were invalid".format(invalid))

Most mislabeled cases were close calls or had very similar traits. For example, when a customer described a problem, the support agent suggested possible solutions and asked for logs in order to troubleshoot. However, the customer self-resolved the case and so the resolution details weren’t conclusive. For this scenario, the root cause prediction was inaccurate. In our experiment, Anthropic’s Claude labeled these cases as Software Defects, but the most likely scenario is that the customer figured it out for themselves and never followed up.
Continued fine-tuning of the prompt to adjust examples and include such scenarios incrementally can help to get over 90% prediction accuracy, as we confirmed with our experimentation. The following code is an example of how to adjust the prompt and add a few more bad examples:

<example>
<example data>
Subject: Unable to configure custom routing rules in application gateway
Customer: Our team can’t set up routing rules in the application gateway. We’ve tried following the documentation but the traffic isn’t being directed as expected. This is blocking our production deployment.
Agent: I understand you’re having difficulties with routing rules configuration. To better assist you, could you please provide:
Current routing rule configuration
Application gateway logs
Expected traffic flow diagram
[No response from customer for 5 business days – Case closed by customer]
</example data>
    <example output>
      <classification>
       Software Defect
      </classification>
 <explanation>
Classification should be Customer Education and not Software Defect. The agent acknowledges the problem and asks the customer for additional information to troubleshoot, however, the customer does not reply and closes the case. Cases where the agent tells the customer how to solve the problem and provides documentation or asks for further details to troubleshoot but the customer self-resolves the case should be labeled Customer Education.
</explanation>
</example>

With the preceding adjustments and refinement to the prompt, we consistently obtained over 90% accuracy and noted that a few miscategorized cases were close calls where humans chose multiple categories including the one Anthropic’s Claude chose. See the appendix at the end of this post for the final prompt.
Run batch inference at scale with AutoGluon Multimodal
As illustrated in the previous sections, by crafting a well-defined and tailored prompt, Amazon Bedrock can help automate generation of ground truth data with balanced categories. This ground truth data is necessary to train the supervised learning model for a multiclass classification use case. We suggest taking advantage of the preprocessing capabilities of SageMaker to further refine the fields, encoding them into a format that’s optimal for model ingestion. The manifest files can be set up as the catalyst, triggering an AWS Lambda function that sets the entire SageMaker pipeline into action. This end-to-end process seamlessly handles data inference and stores the results in Amazon Simple Storage Service (Amazon S3). We recommend AutoGluon Multimodal for training and prediction, and deploying a model for a batch inference pipeline to predict the root cause for new or updated support cases at scale on a daily cadence.
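The following is a minimal sketch of that training and batch prediction step with AutoGluon Multimodal, assuming the labeled ground truth and the new cases are available as CSV files with a case text column and a root_cause label column; the file and column names are placeholders:

import pandas as pd
from autogluon.multimodal import MultiModalPredictor

# Hypothetical ground truth generated with Amazon Bedrock: case text plus the root cause label
train_df = pd.read_csv("ground_truth_cases.csv")      # columns: case_text, root_cause
new_cases_df = pd.read_csv("new_support_cases.csv")   # column: case_text

# Train a classifier on the labeled text
predictor = MultiModalPredictor(label="root_cause").fit(train_data=train_df)

# Batch inference on new or updated support cases
new_cases_df["predicted_root_cause"] = predictor.predict(new_cases_df)
new_cases_df.to_csv("predicted_root_causes.csv", index=False)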
Clean up
To prevent unnecessary expenses, it’s essential to properly decommission all provisioned resources. This cleanup process involves stopping notebook instances and deleting JupyterLab spaces, SageMaker domains, S3 buckets, IAM roles, and associated user profiles. Refer to Clean up Amazon SageMaker notebook instance resources for details.
Conclusion
This post explored how Amazon Bedrock and advanced prompt engineering can generate high-quality labeled data for training ML models. Specifically, we focused on a use case of predicting the root cause category for customer support cases, a multiclass classification problem. Traditional approaches to generating labeled data for such problems are often prohibitively expensive, time-consuming, and prone to class imbalances. Amazon Bedrock, guided by XML prompt engineering, demonstrated the ability to generate balanced labeled datasets, at a lower cost, with over 90% accuracy for the experiment, and can help overcome labeling challenges for training categorical models for real-world use cases.
The following are our key takeaways:

Generative AI can simplify labeled data generation for complex multiclass classification problems
Prompt engineering is crucial for guiding LLMs to achieve desired outputs accurately
An iterative approach, incorporating good/bad examples and specific instructions, can significantly improve model performance
The generated labeled data can be integrated into ML pipelines for scalable inference and prediction using AutoML multimodal supervised learning algorithms for batch inference

Review your ground truth training costs, considering both the time and effort of HIL labeling and the service costs, and do a comparative analysis with Amazon Bedrock to plan your next categorical model training at scale.
Appendix
The following code is the final prompt:

You are a Support Agent and an expert in the enterprise application software. You will be classifying the customer support cases into one of the 6 categories, based on the given interaction between the Support Agent and a customer. You can only choose ONE Category from the list below. You follow instructions well, step by step, and evaluate the categories in the order they appear in the prompt when making a decision.
 
The categories are defined as:
 
<categories>
 
<category>
<name>
“Billing Inquiry”
</name>
<description>
“Billing Inquiry” cases are the ones related to Account or Billing inquiries and questions related to charges, savings, or discounts. It also includes requests to provide guidance on account closing, request for Credit, cancellation requests, billing questions, and questions about discounts.
</description>
</category>
 
<category>
<name>
“Security Awareness”
</name>
<description>
“Security Awareness” cases are the cases associated with a security related incident. Security Awareness cases include exposed credentials, mitigating a security vulnerability, DDoS attacks, security concerns related to malicious traffic. Note that general security questions where the agent is helping to educate the user on the best practice such as SSO or MFA configuration, Security guidelines, or setting permissions for users and roles should be labeled as Customer Education and not Security Awareness.
</description>
</category>
 
<category>
<name>
“Feature Request”
</name>
<description>
“Feature Request” are the cases where the customer is experiencing a limitation in the application software and asking for a feature they want to have. Customer highlights a limitation and is requesting for the capability. For a Feature Request case, the support agent typically acknowledges that the question or expectation is a feature request for the software. Agent may use words such as the functionality or feature does not exist or it is currently not supported.
</description>
</category>
 
<category>
<name>
“Software Defect”
</name>
<description>
“Software Defect” are cases where the application does not work as expected. The support agent confirms this through analysis and troubleshooting and mentions internal team is working on a fix or patch to address the bug or defect.
</description>
</category>
 
<category>
<name>
“Documentation Improvement”
</name>
<description>
“Documentation Improvement” are cases where there is a lack of documentation, incorrect documentation, or insufficient documentation and when the case is not attributed to a Software Defect or a Feature Request. In Documentation Improvement cases the agent acknowledges the application documentation is incomplete or not up to date, or that they will ask documentation team to improve the documentation. For Documentation Improvement cases, the agent may suggest a workaround that is not part of application documentation and does not reference the standard application documentation or link. References to workarounds or sources such as Github or Stack Overflow, when used as an example of a solution, are examples of a Documentation Improvement case because the details and examples are missing from the official documentation.
</description>
</category>
 
<category>
<name>
“Customer Education”
</name>
<description>
“Customer Education” cases are cases where the customer could have resolved the case using the information in the existing application documentation. In these cases, the agent is educating the customer that they are not using the feature correctly or have an incorrect configuration, while guiding them to the documentation. Customer Education cases include scenarios where an agent provides troubleshooting steps for a problem or answers a question and provides links to the official application documentation. User Education cases include cases when the customer asks for best practices and the agent provides knowledge article links to the support center documentation. Customer Education also includes cases created by the agent or application developers to suggest and educate the customer on a change to reduce cost, improve security, or improve application performance. Customer Education cases include cases where the customer asks a question or requests help with an error or configuration and the agent guides them appropriately with steps or documentation links. Customer Education cases also include the cases where the customer is using an unsupported configuration or version that may be End Of Life (EOL). Customer Education cases also include inconclusive cases where the customer reported an issue with the application but the case is closed without resolution details.
</description>
</category>
 
</categories>
 
Here are some good examples with reasoning:
 
<good examples>
 
<example>
<example data>
Customer: “I noticed unexpected charges of $12,500 on our latest invoice, which is significantly higher than our usual $7,000 monthly spend. We haven’t added new users, so I’m concerned about this increase.”
Support: “I understand your concern about the increased charges. Upon review, I see that 50 Premium Sales Cloud licenses were automatically activated on January 15th when your sandbox environments were refreshed. I can help adjust your sandbox configuration and discuss Enterprise License Agreement options to optimize costs.”
Customer: “Thank you for clarifying. Please tell me more about the Enterprise License options.”
</example data>
<example output>
<classification>
“Billing Inquiry”
</classification>
<explanation>
Customer is asking a question to clarify the unexpected increase in their billing statement charge and the agent explains why this occurred. The customer wants to learn more about ways to optimize costs.
</explanation>
 
<example>
<example data>
Customer: “URGENT: We’ve detected unauthorized API calls from an unknown IP address accessing sensitive customer data in our production environment. Our monitoring shows 1000+ suspicious requests in the last hour.”
Support: “I understand the severity of this security incident. I’ve immediately revoked the compromised API credentials and initiated our security protocol. The suspicious traffic has been blocked. I’m escalating this to our Security team for forensic analysis. I’ll stay engaged until this is resolved.”
</example data>
<example output>
<classification>
“Security Awareness”
</classification>
<explanation>
Customer reported unauthorized API calls and suspicious requests. The agent confirms revoking compromised API credentials and initiating the protocol.
</explanation>
 
<example>
<example data>
Customer: “Is there a way to create custom notification templates for different user groups? We need department-specific alert formats, but I can only find a single global template option.”
Support: “I understand you’re looking to customize notification templates per user group. Currently, this functionality isn’t supported in our platform – we only offer the global template system. I’ll submit this as a feature request to our product team. In the meantime, I can suggest using notification tags as a workaround.”
Customer: “Thanks, please add my vote for this feature.”
</example data>
<example output>
<classification>
“Feature Request”
</classification>
<explanation>
Customer is asking for a new feature to have custom notification templates for different user groups since they have a use case that is currently not supported by the application. The agent confirms the functionality does not exist and mentions submitting a feature request to the product team.
</explanation>
 
<example>
<example data>
Customer: “Our data pipeline jobs are failing with a ‘memory allocation error’ during the aggregation phase. This started occurring after upgrading to version 4.2.1. The same ETL workflows were running fine before the upgrade. We’ve verified our infrastructure meets all requirements.”
Support: “After analyzing the logs, we’ve confirmed a memory leak in the aggregation module – a regression introduced in 4.2.1. Engineering has identified the root cause and is developing an emergency patch. We expect to release version 4.2.2 within 48 hours to resolve this issue.”
</example data>
<example output>
<classification>
“Software Defect”
</classification>
<explanation>
Customer is reporting a data processing exception with a specific version and the agent confirms this is a regression and defect. The agent confirms that engineering is working to provide an emergency patch for the issue.
</explanation>
 
<example>
<example data>
Customer: “The data export function is failing consistently when we include custom fields. The export starts but crashes at 45% with error code DB-7721. This worked fine last week before the latest release.”
Support: “I’ve reproduced the issue in our test environment and confirmed this is a bug introduced in version 4.2.1. Our engineering team has identified the root cause – a query optimization error affecting custom field exports. They’re working on a hotfix (patch 4.2.1.3).”
Customer: “Please notify when fixed.”
</example data>
<example output>
<classification>
“Software Defect”
</classification>
<explanation>
This is a Software Defect because the data export function fails to export custom fields as expected. The agent acknowledged the issue and confirmed engineering is working on a hotfix.
</explanation>
</example>
 
<example>
<example data>
Customer: “I’m trying to implement the batch processing API but the documentation doesn’t explain how to handle partial failures or provide retry examples. The current docs only show basic success scenarios.”
Support: “The documentation is lacking detailed error handling examples for batch processing. I’ll submit this to our documentation team to add comprehensive retry logic examples and partial failure scenarios. For now, I can share a working code snippet that demonstrates proper error handling and retry mechanisms.”
Customer: “Thanks, the code example would help.”
</example data>
<example output>
<classification>
Documentation Improvement
</classification>
<explanation>
The agent acknowledges the gap in the documentation and mentions they will pass this on to the documentation team for further improvements. The agent also offers a working code snippet with retry examples.
</explanation>
</example>
 
<example>
<example data>
Customer: “We can’t get our SSO integration working. The login keeps failing and we’re not sure what’s wrong with our configuration.”
Support: “I can help guide you through the SSO setup. Looking at your configuration, I notice the SAML assertion isn’t properly formatted. Please follow our step-by-step SSO configuration guide here [link to docs]. Pay special attention to section 3.2 about SAML attributes. The guide includes validation steps to ensure proper integration.”
Customer: “Found the issue in section 3.2. Working now, thanks!”
</example data>
<example output>
<classification>
Customer Education
</classification>
<explanation>
Customer is asking for help and guidance to get their SSO integration working. The agent reviewed the configuration and presented the necessary steps along with links to the documentation.
</explanation>
</example>
 
</good examples>
 
Here are some examples for wrong classification with reasoning:
 
<bad examples>
 
<example>
<example data>
Customer: “We want to enhance our application security. Currently, each team member has individual login credentials. What’s the recommended approach?”
Support: “I recommend implementing SAML-based SSO with your existing identity provider. This will:
Centralize authentication
Enable MFA enforcement
Streamline user provisioning
Enhance security auditing”
</example data>
<example output>
<classification>
“Security Awareness”
</classification>
<explanation>
Classification should be Customer Education and not Security Awareness. General security questions where the agent is educating the user, such as on security guidelines and best practices, should be labeled as Customer Education.
</explanation>
</example>
 
<example>
<example data>
Customer: “Our SAP invoices aren’t syncing instantly with Salesforce opportunities. We’ve configured MuleSoft Composer as per documentation, but updates only happen intermittently.”
Support: “I understand you’re looking for real-time synchronization. Currently, MuleSoft Composer’s fastest sync interval is 15 minutes by design. While I can help optimize your current setup, I’ll submit a feature request for real-time sync capability. Here’s how to optimize the current polling interval: doc link”
</example data>
<example output>
<classification>
Customer Education
</classification>
<explanation>
Classification should be Feature Request and not Customer Education. The agent tells the customer that the fastest sync interval is 15 minutes by design and points out they will submit a feature request. Cases where the customer asks for features should be classified as Feature Request.
</explanation>
</example>
 
<example>
<example data>
Customer: “Our sales ETL pipeline keeps timing out with error ‘V_001’ at the transform step. This was working perfectly before.”
Support: “I’ve analyzed your configuration. The timeout occurs because the transformation spans 5 years of data containing 23 cross-object formula fields and is running without filters. Please implement these optimization steps from our documentation: Document link on ETL performance”
</example data>
<example output>
<classification>
Software Defect
</classification>
<explanation>
Classification should be Customer Education and not Software Defect. The agent tells the user that the timeout is caused by a misconfiguration and that the query needs to be restricted using filters. The agent provides documentation explaining how to troubleshoot the issue. Cases where the agent tells the user how to solve the problem and provides documentation should be labeled as Customer Education.
</explanation>
</example>
 
<example>
<example data>
Customer: “We are trying to deploy a custom workflow template but receiving this error: Resource handler returned message: ‘Error: Multiple or missing values for mandatory single-value field, Field: ACTION_TYPE, Parameter: Workflow Action (Status Code: 400, Request ID: TKT-2481-49bc)’ when deploying through Flow Designer.”
Support: “I’ve reviewed your Flow Designer deployment (instance: dev85xxx.xxx.com/flow/TKT-2481-49bc) which failed to create a Workflow Action resource. This error occurs when the action configuration is ambiguous. After checking the Flow Designer documentation [1], each Action Step in your template must define exactly one ‘Action Type’ attribute. The Flow Designer documentation [2] specifies that each workflow action requires a single, explicit action type definition. You cannot have multiple or undefined action types in a single step. This is similar to an issue reported in the Product Community [3]. Please review your workflow template and ensure each action step has exactly one defined Action Type. The documentation provides detailed configuration examples at [4]. Let me know if you need any clarification on implementing these changes.
</example data>
<example output>
<classification>
Documentation Improvement
</classification>
<explanation>
Classification should be Customer Education and not Documentation Improvement. The agent tells the user they have to change the action configuration and define an Action Type attribute. Cases where the agent tells the user how to solve the problem and provides documentation should be classified as Customer Education.
</explanation>
</example>
 
</bad examples>
 
Given the above categories defined in XML, logically think through which category fits best and then complete the classification. Provide a response in XML with the following elements: classification and explanation (limited to 2 sentences). Return your results in the format of the sample output XML below, and do not append your thought process to the response.
 
<response>
<classification> Software Defect </classification>
<explanation> The support case is for ETL Pipeline Performance Degradation where the customer reports their nightly data transformation job takes 6 hours to complete instead of 2 hours before but no changes to configuration occurred. The agent mentions Engineering confirmed memory leak in version 5.1.2 and are deploying a Hotfix indicating this is a Software Defect.
</explanation>
</response>
 
Here is the conversation you need to categorize:
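When a prompt like this is used in an application, the conversation to classify is appended at runtime, and the model's XML response can then be parsed with a few lines of code. The following is a minimal sketch of such a parser; the helper name and regular expressions are illustrative assumptions, not part of the original prompt:

import re

def parse_classification_response(response_text: str) -> dict:
    """Extract the classification and explanation from the model's XML response.

    Assumes the model follows the <response><classification>...</classification>
    <explanation>...</explanation></response> format requested in the prompt above.
    """
    def extract(tag: str) -> str:
        match = re.search(rf"<{tag}>(.*?)</{tag}>", response_text, re.DOTALL)
        return match.group(1).strip() if match else ""

    return {
        "classification": extract("classification"),
        "explanation": extract("explanation"),
    }

# Example usage with the sample output shown above
sample = """<response>
<classification> Software Defect </classification>
<explanation> The agent confirms a memory leak regression and an emergency patch. </explanation>
</response>"""
print(parse_classification_response(sample))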

About the Authors
Sumeet Kumar is a Sr. Enterprise Support Manager at AWS leading the technical and strategic advisory team of TAM builders for automotive and manufacturing customers. He has diverse support operations experience and is passionate about creating innovative solutions using AI/ML.
Andy Brand is a Principal Technical Account Manager at AWS, where he helps education customers develop secure, performant, and cost-effective cloud solutions. With over 40 years of experience building, operating, and supporting enterprise software, he has a proven track record of addressing complex challenges.
Tom Coombs is a Principal Technical Account Manager at AWS, based in Switzerland. In Tom’s role, he helps enterprise AWS customers operate effectively in the cloud. From a development background, he specializes in machine learning and sustainability.
Ramu Ponugumati is a Sr. Technical Account Manager and a specialist in analytics and AI/ML at AWS. He works with enterprise customers to modernize and cost optimize workloads, and helps them build reliable and secure applications on the AWS platform. Outside of work, he loves spending time with his family, playing badminton, and hiking.

Google DeepMind Researchers Propose CaMeL: A Robust Defense that Creat …

Large Language Models (LLMs) are becoming integral to modern technology, driving agentic systems that interact dynamically with external environments. Despite their impressive capabilities, LLMs are highly vulnerable to prompt injection attacks. These attacks occur when adversaries inject malicious instructions through untrusted data sources, aiming to compromise the system by extracting sensitive data or executing harmful operations. Traditional security methods, such as model training and prompt engineering, have shown limited effectiveness, underscoring the urgent need for robust defenses.

Google DeepMind Researchers propose CaMeL, a robust defense that creates a protective system layer around the LLM, securing it even when underlying models may be susceptible to attacks. Unlike traditional approaches that require retraining or model modifications, CaMeL introduces a new paradigm inspired by proven software security practices. It explicitly extracts control and data flows from user queries, ensuring untrusted inputs never alter program logic directly. This design isolates potentially harmful data, preventing it from influencing the decision-making processes inherent to LLM agents.

Technically, CaMeL functions by employing a dual-model architecture: a Privileged LLM and a Quarantined LLM. The Privileged LLM orchestrates the overall task, isolating sensitive operations from potentially harmful data. The Quarantined LLM processes data separately and is explicitly stripped of tool-calling capabilities to limit potential damage. CaMeL further strengthens security by assigning metadata or “capabilities” to each data value, defining strict policies about how each piece of information can be utilized. A custom Python interpreter enforces these fine-grained security policies, monitoring data provenance and ensuring compliance through explicit control-flow constraints.
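To make the capability idea concrete, the following is a highly simplified, hypothetical sketch of how data values might carry provenance metadata that a policy check consults before a sensitive tool call. It illustrates the concept only; it is not CaMeL's actual interpreter or API, and all names are invented for illustration:

from dataclasses import dataclass, field

@dataclass
class TaggedValue:
    """A value paired with capability metadata describing where it came from."""
    value: str
    sources: set = field(default_factory=set)  # e.g. {"user"} or {"email:untrusted"}

def send_payment(recipient: TaggedValue, amount: TaggedValue):
    # Policy: arguments of a sensitive tool must not derive from untrusted data.
    for arg in (recipient, amount):
        if any(src.startswith("email") for src in arg.sources):
            raise PermissionError(f"Blocked: {arg.value!r} derives from untrusted data")
    print(f"Paying {amount.value} to {recipient.value}")

# Arguments filled from the user's query (trusted) pass the policy check,
# while values extracted from external content keep an untrusted tag and are rejected.
send_payment(TaggedValue("Alice", {"user"}), TaggedValue("10 USD", {"user"}))
try:
    send_payment(TaggedValue("attacker@example.com", {"email:untrusted"}), TaggedValue("10 USD", {"user"}))
except PermissionError as err:
    print(err)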

Results from empirical evaluation using the AgentDojo benchmark highlight CaMeL’s effectiveness. In controlled tests, CaMeL successfully thwarted prompt injection attacks by enforcing security policies at granular levels. The system demonstrated the ability to maintain functionality, solving 67% of tasks securely within the AgentDojo framework. Compared to other defenses like “Prompt Sandwiching” and “Spotlighting,” CaMeL performed significantly better in terms of security, providing near-total protection against attacks while incurring moderate overhead. The overhead primarily manifests in token usage, with approximately a 2.82× increase in input tokens and a 2.73× increase in output tokens, which is acceptable considering the security guarantees provided.

Moreover, CaMeL addresses subtle vulnerabilities, such as data-to-control flow manipulations, by strictly managing dependencies through its metadata-based policies. For instance, a scenario where an adversary attempts to leverage benign-looking instructions from email data to control the system execution flow would be mitigated effectively by CaMeL’s rigorous data tagging and policy enforcement mechanisms. This comprehensive protection is essential, given that conventional methods might fail to recognize such indirect manipulation threats.

In conclusion, CaMeL represents a significant advancement in securing LLM-driven agentic systems. Its ability to robustly enforce security policies without altering the underlying LLM offers a powerful and flexible approach to defending against prompt injection attacks. By adopting principles from traditional software security, CaMeL not only mitigates explicit prompt injection risks but also safeguards against sophisticated attacks leveraging indirect data manipulation. As LLM integration expands into sensitive applications, adopting CaMeL could be vital in maintaining user trust and ensuring secure interactions within complex digital ecosystems.

Check out the Paper. All credit for this research goes to the researchers of this project.

The post Google DeepMind Researchers Propose CaMeL: A Robust Defense that Creates a Protective System Layer around the LLM, Securing It even when Underlying Models may be Susceptible to Attacks appeared first on MarkTechPost.

This AI Paper Introduces PLAN-AND-ACT: A Modular Framework for Long-Ho …

Large language models are powering a new wave of digital agents to handle sophisticated web-based tasks. These agents are expected to interpret user instructions, navigate interfaces, and execute complex commands in ever-changing environments. The difficulty lies not in understanding language but in translating that understanding into precise, sequenced actions while adapting to dynamic contexts. Success for long-horizon tasks like booking travel or retrieving specific web data depends on managing a sequence of steps that evolves with each action. Despite major progress in language capabilities, creating agents that can effectively plan and adapt at each step remains an unsolved problem.

Composing broad goals into actionable steps is a major issue in building such agents. When a user requests “follow the top contributor of this GitHub project,” the agent must interpret the command and determine how to navigate to the contributor’s section, identify the relevant person, and initiate the following action. This task becomes even more complex in dynamic environments where content may shift between executions. Without a clear planning and updating strategy, agents can make inconsistent decisions or fail entirely. The scarcity of training data that shows how to plan and execute long tasks correctly adds another layer of difficulty.

Previously, researchers attempted to address these issues with models that either relied on single-agent strategies or applied reinforcement learning to guide actions. Single-agent systems like ReAct attempted to merge reasoning and execution but often faltered as the model was overwhelmed by thinking and acting at once. Reinforcement learning approaches showed promise but proved unstable and highly sensitive to environment-specific tuning. Collecting training data for these methods required extensive interaction with environments, making it time-consuming and impractical to scale. These methods also struggled to maintain performance consistency when tasks changed mid-process.

Researchers from UC Berkeley, the University of Tokyo, and ICSI introduced a new PLAN-AND-ACT system. Companies like Apple, Nvidia, Microsoft, and Intel supported the work. This framework splits task planning and execution into two modules: a PLANNER and an EXECUTOR. The PLANNER is tasked with creating a structured plan based on the user’s request, essentially outlining what steps need to be taken. The EXECUTOR then translates each step into environment-specific actions. By separating these responsibilities, the system allows the PLANNER to focus on strategy while the EXECUTOR handles execution, improving the reliability of both components. This modular design marks a significant shift from previous approaches.
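The division of labor can be pictured with a minimal, hypothetical sketch: one component drafts a high-level plan from the user's goal, and a second maps each step to a concrete environment action. The function names and plan format below are illustrative only and are not the paper's implementation; in practice both components would be LLM calls:

def planner(goal: str) -> list[str]:
    """Draft a high-level plan for the goal (in PLAN-AND-ACT, an LLM call)."""
    return [
        "Open the repository's Contributors page",
        "Identify the contributor with the most commits",
        "Open that contributor's profile and click Follow",
    ]

def executor(step: str, page_state: str) -> str:
    """Translate one plan step into a concrete UI action (also an LLM call in practice)."""
    return f"ACTION for '{step}' given page: {page_state[:40]}..."

def run(goal: str, initial_page: str):
    page_state = initial_page
    for step in planner(goal):
        action = executor(step, page_state)
        print(action)
        # In a real agent, the action would be executed, page_state updated,
        # and dynamic replanning would revise the remaining steps as the page changes.

run("follow the top contributor of this GitHub project", "<html>repo home page</html>")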

The methodology behind PLAN-AND-ACT is detailed and focuses heavily on scalable training. Since human-annotated planning data is limited, researchers introduced a synthetic data generation pipeline. They began by collecting action trajectories from simulated agents—sequences of clicks, inputs, and responses. Large language models then analyzed these trajectories to reconstruct high-level plans grounded in actual outcomes. For example, a plan might specify identifying the top contributor, while the actions linked to it include clicking the “Contributors” tab and parsing the resulting HTML. The team expanded their dataset with 10,000 additional synthetic plans and then generated 5,000 more targeted plans based on failure analysis. This synthetic training method saved time and produced high-quality data that reflected real execution needs.

In testing, PLAN-AND-ACT achieved a task success rate of 53.94% on the WebArena-Lite benchmark, surpassing the previous best result of 49.1% from WebRL. Without any planner, a base executor only achieved 9.85%. Adding a non-finetuned planner boosted performance to 29.63% while finetuning on 10,000 synthetic plans brought results up to 44.24%. Incorporating dynamic replanning added a final 10.31% performance gain. Across all experiments, the data showed that most performance improvements came from enhancing the PLANNER rather than the EXECUTOR. Even with a base EXECUTOR, having a strong PLANNER led to substantial success rate increases, validating the researchers’ hypothesis that separating planning and execution yields better task outcomes.

In conclusion, this paper highlights how identifying the gap between goal understanding and environment interaction can lead to more effective AI systems. By focusing on structured planning and scalable data generation, the researchers proposed a method that solves a specific problem and demonstrates a framework that can extend to broader applications. PLAN-AND-ACT shows that effective planning, not just execution, is critical to AI agent success in complex environments.

Check out the Paper. All credit for this research goes to the researchers of this project.

The post This AI Paper Introduces PLAN-AND-ACT: A Modular Framework for Long-Horizon Planning in Web-Based Language Agents appeared first on MarkTechPost.

DeepSeek AI Unveils DeepSeek-V3-0324: Blazing Fast Performance on Mac …

Artificial intelligence (AI) has made significant strides in recent years, yet challenges persist in achieving efficient, cost-effective, and high-performance models. Developing large language models (LLMs) often requires substantial computational resources and financial investment, which can be prohibitive for many organizations. Additionally, ensuring that these models possess strong reasoning capabilities and can be deployed effectively on consumer-grade hardware remains a hurdle.​

DeepSeek AI has addressed these challenges head-on with the release of DeepSeek-V3-0324, a significant upgrade to its V3 large language model. This new model not only enhances performance but also operates at an impressive speed of 20 tokens per second on a Mac Studio, a consumer-grade device. This advancement intensifies the competition with industry leaders like OpenAI, showcasing DeepSeek’s commitment to making high-quality AI models more accessible and efficient. ​

DeepSeek-V3-0324 introduces several technical improvements over its predecessor. Notably, it demonstrates significant enhancements in reasoning capabilities, with benchmark scores showing substantial increases:

MMLU-Pro: 75.9 → 81.2 (+5.3)

GPQA: 59.1 → 68.4 (+9.3)​

AIME: 39.6 → 59.4 (+19.8)​

LiveCodeBench: 39.2 → 49.2 (+10.0)

These improvements indicate a more robust understanding and processing of complex tasks. Additionally, the model has enhanced front-end web development skills, producing more executable code and aesthetically pleasing web pages and game interfaces. Its Chinese writing proficiency has also seen advancements, aligning with the R1 writing style and improving the quality of medium-to-long-form content. Furthermore, function calling accuracy has been increased, addressing issues present in previous versions.

The release of DeepSeek-V3-0324 under the MIT License underscores DeepSeek AI’s dedication to open-source collaboration, allowing developers worldwide to utilize and build upon this technology without restrictive licensing constraints. The model’s ability to run efficiently on devices like the Mac Studio, achieving 20 tokens per second, exemplifies its practical applicability and efficiency. This performance level not only makes advanced AI more accessible but also reduces the dependency on expensive, specialized hardware, thereby lowering the barrier to entry for many users and organizations. ​

In conclusion, DeepSeek AI’s release of DeepSeek-V3-0324 marks a significant milestone in the AI landscape. By addressing key challenges related to performance, cost, and accessibility, DeepSeek has positioned itself as a formidable competitor to established entities like OpenAI. The model’s technical advancements and open-source availability promise to democratize AI technology further, fostering innovation and broader adoption across various sectors.

Check out the Model on Hugging Face. All credit for this research goes to the researchers of this project.

The post DeepSeek AI Unveils DeepSeek-V3-0324: Blazing Fast Performance on Mac Studio, Heating Up the Competition with OpenAI appeared first on MarkTechPost.

Amazon SageMaker JumpStart adds fine-tuning support for models in a pr …

Amazon SageMaker JumpStart is a machine learning (ML) hub that provides pre-trained models, solution templates, and algorithms to help developers quickly get started with machine learning. Within SageMaker JumpStart, the private model hub feature allows organizations to create their own internal repository of ML models, enabling teams to share and manage models securely within their organization.
Today, we are announcing an enhanced private hub feature with several new capabilities that give organizations greater control over their ML assets. These enhancements include the ability to fine-tune SageMaker JumpStart models directly within the private hub, support for adding and managing custom-trained models, deep linking capabilities for associated notebooks, and improved model version management. These new features streamline the ML workflow by combining the convenience of pre-built solutions with the flexibility of custom development, while maintaining enterprise-grade security and governance.
For enterprise customers, the ability to curate and fine-tune both pre-built and custom models is crucial for successful AI implementation. Model curation provides quality control, compliance, and security while preventing duplicate efforts across teams. When enterprises fine-tune curated models, they can specialize general-purpose solutions for their specific industry needs and gain competitive advantages through improved performance on their proprietary data. Similarly, the ability to fine-tune custom models enables organizations to continuously improve their AI solutions, adapt to changing business conditions, and preserve institutional knowledge, while maintaining cost-efficiency.
A common enterprise scenario involves centralized data science teams developing foundation models (FMs), evaluating the performance against open source FMs, and iterating on performance. After they develop their custom FM, it can serve as a baseline for the entire organization, and individual departments—such as legal, finance, or customer service—can fine-tune these models using their department-specific data that might be subject to different privacy requirements or access controls. This hub-and-spoke approach to model development maximizes resource efficiency while allowing for specialized optimization at the department level. This comprehensive approach to model management, now supported by the enhanced private hub features in SageMaker JumpStart, enables enterprises to balance standardization with customization while maintaining proper governance and control over their ML assets.
Solution overview
SageMaker JumpStart has introduced several new enhancements to its private model hub feature, allowing administrators greater control and flexibility in managing their organization’s ML models. These enhancements include:

Fine-tuning of models referenced in the private hub – Administrators can now add models from the SageMaker JumpStart catalog to their private hub and fine-tune them using Amazon SageMaker training jobs, without having to create the models from scratch.
Support for custom models – In addition to the pre-trained SageMaker JumpStart models, administrators can now add their own custom-trained models to the private hub and fine-tune them as needed.
Deep linking of notebooks – Administrators can now deep link to specific notebooks associated with the models in the private hub, making it straightforward for users to access and work with the models.
Updating models in the private hub – The private hub now supports updating models over time as new versions or iterations become available, allowing organizations to stay current with the latest model improvements.

These new capabilities give AWS customers more control over their ML infrastructure and enable faster model deployment and experimentation, while still maintaining the appropriate access controls and permissions within their organization.
In the following sections, we provide guidance on how to use these new private model hub features using the Amazon SageMaker SDK and Amazon SageMaker Studio console.
To learn more about how to manage models using private hubs, see Manage Amazon SageMaker JumpStart foundation model access with private hubs.
Prerequisites
To use the SageMaker Python SDK and run the code associated with this post, you need the following prerequisites:

An AWS account that contains your AWS resources
An AWS Identity and Access Management (IAM) role with access to SageMaker Studio notebooks
SageMaker JumpStart enabled in a SageMaker Studio domain

Create a private hub, curate models, and configure access control
This section provides a step-by-step guide for administrators to create a private hub, curate models, and configure access control for your organization’s users.

Because the granular model access control feature is integrated into the latest SageMaker Python SDK, first update the SDK to use it with a private hub:

!pip3 install sagemaker --force-reinstall --quiet

Next, import the SageMaker and Boto3 libraries:

import boto3
from sagemaker import Session
from sagemaker.session import Hub

Configure your private hub:

HUB_NAME="CompanyHub"
HUB_DISPLAY_NAME="Allowlisted Models"
HUB_DESCRIPTION="These are allowlisted models taken from the SageMaker Public Hub"
REGION="<your_region_name>"  # for example, "us-west-2"

In the preceding code, HUB_NAME specifies the name of your hub. HUB_DISPLAY_NAME is the display name for your hub that will be shown to users in UI experiences. HUB_DESCRIPTION is the description for your hub that will be shown to users.
Use an AWS Region where SageMaker JumpStart is available, as of March 2025: us-west-2, us-east-1, us-east-2, eu-west-1, eu-central-1, eu-central-2, eu-north-1, eu-south-2, me-south-1, me-central-1, ap-south-1, ap-south-2, eu-west-3, af-south-1, sa-east-1, ap-east-1, ap-northeast-2, ap-northeast-3, ap-southeast-3, ap-southeast-4, ap-southeast-5, ap-southeast-7, eu-west-2, eu-south-1, ap-northeast-1, us-west-1, ap-southeast-1, ap-southeast-2, ca-central-1, ca-west-1, cn-north-1, cn-northwest-1, il-central-1, mx-central-1, us-gov-east-1, us-gov-west-1.

Set up a Boto3 client for SageMaker:

sm_client = boto3.client('sagemaker')
session = Session(sagemaker_client=sm_client)
session.get_caller_identity_arn()

Check whether the following policy has already been added to your admin IAM role; if not, you can add it as an inline policy (use the Region configured in Step 3):

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "s3:ListBucket",
                "s3:GetObject",
                "s3:GetObjectTagging"
            ],
            "Resource": [
                "arn:aws:s3:::jumpstart-cache-prod-<REGION>",
                "arn:aws:s3:::jumpstart-cache-prod-<REGION>/*"
            ],
            "Effect": "Allow"
        }
    ]
}

In addition to setting up IAM permissions to the admin role, you need to scope down permissions for your users so they can’t access public contents.

Use the following policy to deny your users access to the public hub. It can be added as an inline policy to the user's IAM role (use the Region configured in Step 3):

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": "s3:*",
            "Effect": "Deny",
            "Resource": [
                "arn:aws:s3:::jumpstart-cache-prod-<REGION>",
                "arn:aws:s3:::jumpstart-cache-prod-<REGION>/*"
            ],
            "Condition": {
                "StringNotLike": {"s3:prefix": ["*.ipynb", "*/eula.txt"]}
            }
        },
        {
            "Action": "sagemaker:*",
            "Effect": "Deny",
            "Resource": [
                "arn:aws:sagemaker:<REGION>:aws:hub/SageMakerPublicHub",
                "arn:aws:sagemaker:<REGION>:aws:hub-content/SageMakerPublicHub/*/*"
            ]
        }
    ]
}

After you have set up the private hub configuration and permissions, you’re ready to create the private hub.

Use the following code to create the private hub within your AWS account in the Region you specified earlier:

hub = Hub(hub_name=HUB_NAME, sagemaker_session=session)

try:
    hub.create(
        description=HUB_DESCRIPTION,
        display_name=HUB_DISPLAY_NAME
    )
    print(f"Successfully created Hub with name {HUB_NAME} in {REGION}")
except Exception as e:
    if "ResourceInUse" in str(e):
        print(f"A hub with the name {HUB_NAME} already exists in your account.")
    else:
        raise e

Use describe() to verify the configuration of your hub. After your private hub is set up, you can add references to models from the SageMaker JumpStart public hub to your private hub. You don't need to manage any model artifacts; the SageMaker team manages version and security updates. For a list of available models, refer to Built-in Algorithms with pre-trained Model Table.
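For example, the verification call might look like the following (the exact output fields can vary by SDK version):

# Verify the hub configuration
print(hub.describe())
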
To search programmatically, run the following command:

from sagemaker.jumpstart.filters import Or

filter_value = Or(
    "framework == meta",
    "framework == deepseek"
)
models = []
next_token = None

while True:
    response = hub.list_sagemaker_public_hub_models(
        filter=filter_value,
        next_token=next_token
    )
    models.extend(response["hub_content_summaries"])
    next_token = response.get("next_token")

    if not next_token:
        break

print(models)

The filter argument is optional. For a list of filters you can apply, refer to the following GitHub repo.

Use the retrieved models from the preceding command to create model references for your private hub:

for model in models:
    print(f"Adding {model.get('hub_content_name')} to Hub")
    hub.create_model_reference(model_arn=model.get("hub_content_arn"),
                               model_name=model.get("hub_content_name"))

The SageMaker JumpStart private hub offers other useful features for managing and interacting with the curated models. Administrators can check the metadata of a specific model using the hub.describe_model(model_name=<model_name>) command. To list the available models in the private hub, you can use a simple loop:

response = hub.list_models()
models = response["hub_content_summaries"]
while response.get("next_token"):
    response = hub.list_models(next_token=response["next_token"])
    models.extend(response["hub_content_summaries"])

for model in models:
    print(model.get('HubContentArn'))

If you need to remove a specific model reference from the private hub, use the following command:

hub.delete_model_reference("<model_name>")

If you want to delete the private hub from your account and Region, you will need to delete all the HubContents first, then delete the private hub. Use the following code:

for model in models:
    hub.delete_model_reference(model_name=model.get('HubContentName'))

hub.delete()

Fine-tune models referenced in the private hub
This section walks through how to interact with allowlisted models in SageMaker JumpStart. We demonstrate how to list available models, identify a model from the public hub, and fine-tune the model using the SageMaker Python SDK as well as the SageMaker Studio UI.
User experience using the SageMaker Python SDK
To interact with your models using the SageMaker Python SDK, complete the following steps:

Just like the admin process, the first step is to force reinstall the SageMaker Python SDK:

!pip3 install sagemaker --force-reinstall --quiet

When interacting with the SageMaker SDK functions, add references to the hub_arn:

model_id="meta-vlm-llama-3-2-11b-vision"
model_version="2.1.8"
hub_arn="<YourHubARN>"

from sagemaker import hyperparameters

my_hyperparameters = hyperparameters.retrieve_default(
    model_id=model_id, model_version=model_version, hub_arn=hub_arn
)
print(my_hyperparameters)
hyperparameters.validate(
    model_id=model_id, model_version=model_version, hyperparameters=my_hyperparameters, hub_arn=hub_arn
)

You can then start a training job by specifying the model ID, version, and hub name:

from sagemaker.jumpstart.estimator import JumpStartEstimator

estimator = JumpStartEstimator(
    model_id=model_id,
    hub_name=hub_arn,
    model_version=model_version,
    environment={"accept_eula": "false"},  # change to {"accept_eula": "true"} to accept the EULA before training
    disable_output_compression=True,
    instance_type="ml.p4d.24xlarge",
    hyperparameters=my_hyperparameters,
)
estimator.fit({"training": train_data_location})
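
After the training job completes, the fine-tuned model can be deployed for inference from the same estimator. The following is a minimal sketch; the instance type and prediction payload are illustrative, and some models also require accepting an EULA or other deployment arguments:

# Deploy the fine-tuned model and run a quick test invocation
predictor = estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.12xlarge",  # illustrative; choose an instance type supported by the model
)

# Payload format varies by model; this dict is only an example
response = predictor.predict({"inputs": "Describe the image in one sentence."})
print(response)

# Clean up the endpoint when you're done to avoid ongoing charges
predictor.delete_endpoint()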

For a custom model, see the example notebooks in GitHub.
User experience in SageMaker Studio
Complete the following steps to interact with allowlisted models using SageMaker Studio:

On the SageMaker Studio console, choose JumpStart in the navigation pane or in the Prebuilt and automated solutions section.
Choose one of the model hubs you have access to.

If you have access to multiple hubs, you will see a list of hubs, as shown in the following screenshot.

If you have access to only one hub, you will be redirected to the model list.

To fine-tune a model, choose Train (this option will be enabled if it’s supported).
Modify your training job configurations like training data, instance type, and hyperparameters, and choose Submit.

Deep link notebooks in the private hub
You can now also access the notebook associated with the model in your curated hub.

Choose your model, then choose Preview notebooks.
Choose Open in JupyterLab to start the deep link workflow.
Select a running JupyterLab space and choose Open notebook.

You will need to upgrade your space to use a SageMaker distribution of at least 2.4.1. For more information on how to upgrade your SageMaker distribution, see Update the SageMaker Distribution Image.

This will automatically open the selected notebook in your JupyterLab instance, with your private HubName inputted into the necessary classes.

Update models in the private hub
Modify your existing private HubContent by calling the new sagemaker:UpdateHubContent API. You can now update an existing HubContent version in place without needing to delete and re-add it. Updating the HubContentDocument isn't supported at this time, because such changes can be backward incompatible and fundamentally alter the performance and usage of the model itself. Refer to the public API documentation for more details.

client.update_hub_content(
    hub_content_name="my-model",
    hub_content_version="1.0.0",
    hub_content_type="Model",
    hub_name="my-hub",
    support_status="DEPRECATED"
)

Additionally, you can modify your ModelReferences by calling the new sagemaker:UpdateHubContentReference API. Refer to the public API documentation for more usage details.

client.update_hub_content_reference(
    hub_content_name="your-model",
    hub_content_type="ModelReference",
    hub_name="my-hub",
    min_version="1.2.0"
)

Conclusion
This post demonstrated the new enhancements to the SageMaker JumpStart private model hub feature, which gives enterprise customers greater control and flexibility in managing their ML assets. The key capabilities introduced include the ability to fine-tune pre-built SageMaker JumpStart models directly within the private hub, support for importing and fine-tuning custom-trained models, deep linking to associated notebooks for streamlined access and collaboration, and improved model version management through APIs. These features enable enterprises to curate a centralized repository of trusted, specialized ML models, while still providing the flexibility for individual teams and departments to fine-tune and adapt these models to their specific needs. The seamless integration with SageMaker Studio further streamlines the model development and deployment workflow, empowering enterprises to accelerate their ML initiatives while maintaining the appropriate security and control over their ML assets.
Now that you’ve seen how the enhanced private model hub features in Amazon SageMaker JumpStart can give your organization greater control and flexibility over managing your machine learning assets, start leveraging these capabilities to curate a centralized repository of trusted models and accelerate your AI initiatives.

About the Authors
Marc Karp is an ML Architect with the Amazon SageMaker Service team. He focuses on helping customers design, deploy, and manage ML workloads at scale. In his spare time, he enjoys traveling and exploring new places.
Niris Okram is a senior academic research specialist solutions architect at AWS. He has extensive experience working with public, private and research customers on various fields related to cloud. He is passionate about designing and building systems to accelerate the customer’s mission on AWS cloud.
Benjamin Crabtree is a software engineer with the Amazon SageMaker and Bedrock teams. He is passionate about democratizing the new and frequent breakthroughs in AI. Ben received his undergraduate degree from the University of Michigan and now lives in Brooklyn, NY.
Banu Nagasundaram leads product, engineering, and strategic partnerships for SageMaker JumpStart, SageMaker’s machine learning and GenAI hub. She is passionate about building solutions that help customers accelerate their AI journey and unlock business value.

Generative AI-powered game design: Accelerating early development with …

In the competitive world of game development, staying ahead of technological advancements is crucial. Generative AI has emerged as a game changer, offering unprecedented opportunities for game designers to push boundaries and create immersive virtual worlds. At the forefront of this revolution is Stability AI’s cutting-edge text-to-image AI model, Stable Diffusion 3.5 Large (SD3.5 Large), which is transforming the way we approach game environment creation.
SD3.5 Large, available in Amazon Bedrock, is Stability AI’s most advanced text-to-image model to date. With 8.1 billion parameters, this model excels at generating high-quality, 1-megapixel images from text descriptions with exceptional prompt adherence, making it ideal for creating detailed game environments at speed. Its improved architecture, based on the Multimodal Diffusion Transformer (MMDiT), combines multiple pre-trained text encoders for enhanced text understanding and uses QK-normalization to improve training stability.
The model demonstrates improved performance in image quality, typography, and complex prompt understanding. It excels at creating diverse, high-quality images across multiple styles, making it valuable for industries such as media, gaming, advertising, and education.
In this post, we explore how you can use SD3.5 Large to address practical gaming needs such as early concept art and character design.
Key improvements in SD3.5 Large compared to SD3 Large
SD3.5 Large offers the following improvements:

Enhanced photorealism – Delivers detailed 3D imagery with unprecedented realism
Superior scene complexity – Handles multiple subjects in intricate scenes with remarkable accuracy
Improved anatomical rendering – Generates more precise and natural human representations
Diverse representation – Creates images with inclusive representation of skin tones and features without extensive prompting

Real-world use cases for game environment creation
Image generation is poised to revolutionize a few key areas within the gaming industry. Firstly, it will significantly enhance the ideation and design process, allowing teams to rapidly create new scenes and objects, thereby accelerating the design cycle. Secondly, it will enable in-game content generation, empowering users to create new objects, modify avatar skins, or generate new textures. Although current adoption is more prevalent in the design phase, the continued advancement of generative AI is expected to lead to increased user-generated AI content (such as player avatars), which will substantially boost user creativity and overall gaming experience. This shift towards AI-assisted content creation in gaming promises to open up new realms of possibilities for both developers and players alike.
The following are sample prompts for creating early game worlds and their output:

A vibrant fantasy landscape featuring rolling hills, a sparkling river, and a majestic castle in the distance under a bright blue sky.

A dense tropical rainforest teeming with exotic plants and wildlife, sunlight filtering through the thick canopy, with a hidden waterfall cascading into a crystal-clear pool.

A futuristic city skyline at dusk, featuring sleek skyscrapers with neon lights and flying vehicles soaring between them, reflecting on the glassy surface of a river.

The following are sample prompts for creating early game assets and props from different angles:

An intricately designed realistic game weapon prop of a fiery blue and green blade set against a blurred background of a gargantuan temple. The blade merges geometrical design of the blade with an alien cultural aesthetic.
Close-up, side-angle view of an intricately designed realistic, game weapon prop of a fiery blue and green blade set against a blurred background of a gargantuan temple. The blade merges geometrical design of the blade with an alien cultural aesthetic.
Top-down view of an intricately designed realistic, game weapon prop of a fiery blue and green blade set against a blurred background of a gargantuan temple. The blade merges geometrical design of the blade with an alien cultural aesthetic.

Solution overview
To demonstrate the power of SD3.5 Large in game environment creation, let’s walk through a hypothetical workflow. We have provided a Jupyter notebook to deploy a sample gaming use case in the following GitHub repo. Use the us-west-2 AWS Region to run this demo.
Prerequisites
This notebook is designed to run on AWS, using Amazon Bedrock for both Anthropic’s Claude 3 Sonnet and Stability AI model access. Make sure you have the following set up before moving forward:

An AWS account.
An Amazon SageMaker domain.
Access to Stability AI’s SD3.5 Large text-to-image model through the Amazon Bedrock console. For instructions, see Manage access to Amazon Bedrock foundation models.

Define the game world
Start by outlining the core concepts of your game world, including its theme, atmosphere, and key locations. For example, “Mystic Realms is set in a vibrant fantasy world where players embark on quests to uncover ancient secrets and battle mystical creatures. The game features diverse environments, including enchanted forests, mystical mountains, and forgotten ruins. The atmosphere is whimsical and magical, with bright colors and fantastical elements that evoke a sense of wonder.”
Craft detailed prompts for worlds and objects
Use natural language to describe specific environments and objects you want to create. The following screenshot shows some generated prompts.

You can also generate initial concept images with Amazon Bedrock following these steps:

On the Amazon Bedrock console, under Foundation models in the navigation pane, choose Model catalog.
For Providers, select Stability AI, then choose Stable Diffusion 3.5 Large.
Choose Open in playground.
Enter your prompt and choose Run. A high-fidelity image will be generated in seconds.
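
As an alternative to the console steps above, you can call the model programmatically through the Amazon Bedrock Runtime API. The following is a minimal sketch using one of the sample prompts from earlier in this post; the model ID and the request and response fields shown are assumptions that you should verify against the current Stability AI documentation for Amazon Bedrock:

import base64
import json

import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-west-2")

body = {
    "prompt": (
        "A vibrant fantasy landscape featuring rolling hills, a sparkling river, "
        "and a majestic castle in the distance under a bright blue sky."
    ),
    "mode": "text-to-image",   # assumed request field
    "aspect_ratio": "16:9",    # assumed request field
    "output_format": "png",    # assumed request field
}

response = bedrock_runtime.invoke_model(
    modelId="stability.sd3-5-large-v1:0",  # assumed model ID; check the Bedrock model catalog
    body=json.dumps(body),
)

result = json.loads(response["body"].read())
image_bytes = base64.b64decode(result["images"][0])  # assumed response shape

with open("concept_art.png", "wb") as f:
    f.write(image_bytes)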

Iterate and refine
After you have a base concept you’re happy with, you can generate variations to explore different possibilities for the same environment. Analyze the generated images and refine your prompts to achieve the desired results. You might want to adjust elements like lighting, color palette, or specific environmental features. Finally, use the generated images as reference material for 3D artists to create fully realized game environments.
Clean up
To avoid charges, you must stop the active SageMaker notebook instances if you used the notebook demo. For instructions, refer to Clean up Amazon SageMaker notebook instance resources.
Conclusion
Stability AI’s latest series of models represents a significant advancement in generative AI, providing game developers, designers, and content creators with a powerful tool to enhance creative workflows and explore new dimensions of visual storytelling. By using Stability AI’s capabilities, organizations can address practical gaming needs, from concept art and character design to level creation and marketing campaigns. However, it’s essential to approach this technology with a responsible and ethical mindset, considering potential biases, respecting intellectual property rights, and mitigating the risks of misuse. By embracing these models while being aware of their limitations and ethical considerations, gaming professionals can push the boundaries of what’s possible in game design and visual content creation.
To get started, check out Stability AI models available in Amazon Bedrock.

About the Authors
Isha Dua is a Senior Solutions Architect based in the San Francisco Bay Area. She helps AWS Enterprise customers grow by understanding their goals and challenges, and guiding them on how they can architect their applications in a cloud-native manner while making sure they are resilient and scalable. She’s passionate about machine learning technologies and environmental sustainability.
Parth Patel is a Senior Solutions Architect at AWS in the San Francisco Bay Area. Parth guides customers to accelerate their journey to the cloud and help them adopt and grow on the AWS Cloud successfully. He focuses on machine learning, environmental sustainability, and application modernization.

Google AI Released Gemini 2.5 Pro Experimental: An Advanced AI Model t …

​In the evolving field of artificial intelligence, a significant challenge has been developing models that can effectively reason through complex problems, generate accurate code, and process multiple forms of data. Traditional AI systems often excel in specific tasks but struggle to generalize across diverse domains, limiting their practical applications. This fragmentation underscores the need for more integrated and versatile AI solutions.​

Addressing this, Google has introduced Gemini 2.5 Pro Experimental, an advanced AI model designed to enhance reasoning, coding, and multimodal capabilities. Building upon its predecessors, Gemini 2.5 Pro is engineered to tackle complex challenges in fields such as coding, science, and mathematics. Its multimodal design enables it to interpret and generate text, audio, images, video, and code, broadening its applicability across various sectors. ​

From a technical standpoint, Gemini 2.5 Pro incorporates advanced reasoning capabilities, allowing the model to process tasks methodically and make informed decisions. It features a substantial context window, currently supporting up to 1 million tokens, with plans to expand to 2 million tokens. This extensive context window enables the model to comprehend large datasets and address intricate problems that require synthesizing information from multiple sources. In coding applications, Gemini 2.5 Pro demonstrates proficiency by creating visually compelling web applications and efficiently performing code transformation and editing tasks.

https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/#advanced-coding

Empirical evaluations highlight Gemini 2.5 Pro’s strong performance. It leads in benchmarks related to mathematics and science, such as GPQA and AIME 2025, reflecting its robust reasoning capabilities. Notably, it achieved a score of 18.8% on Humanity’s Last Exam, a dataset designed to assess advanced knowledge and reasoning. In coding benchmarks, Gemini 2.5 Pro scored 63.8% on SWE-Bench Verified, indicating its competence in agentic code evaluations. Furthermore, it topped the LMArena leaderboard by a significant margin, underscoring its advanced capabilities in multimodal reasoning, coding, and STEM fields.

In conclusion, Gemini 2.5 Pro Experimental represents a notable advancement in AI, reflecting Google’s commitment to developing more intelligent and versatile models. By integrating reasoning capabilities directly into its architecture, Gemini 2.5 Pro addresses previous limitations, offering enhanced performance and improved accuracy. Its ability to handle complex problems across coding, science, and mathematics, coupled with its multimodal proficiency, positions it as a valuable tool in the AI landscape. As AI continues to evolve, models like Gemini 2.5 Pro pave the way for more sophisticated and context-aware applications, fostering innovation across various sectors.

Check out the Technical details and Try it here. All credit for this research goes to the researchers of this project.

The post Google AI Released Gemini 2.5 Pro Experimental: An Advanced AI Model that Excels in Reasoning, Coding, and Multimodal Capabilities appeared first on MarkTechPost.

A Code Implementation for Advanced Human Pose Estimation Using MediaPi …

Human pose estimation is a cutting-edge computer vision technology that transforms visual data into actionable insights about human movement. By utilizing advanced machine learning models like MediaPipe’s BlazePose and powerful libraries such as OpenCV, developers can track body key points with high accuracy. In this tutorial, we explore the seamless integration of these tools, demonstrating how Python-based frameworks enable sophisticated pose detection across various domains, from sports analytics to healthcare monitoring and interactive applications.

First, we install the essential libraries:

!pip install mediapipe opencv-python-headless matplotlib

Then, we import the important libraries needed for our implementation:

import cv2
import mediapipe as mp
import matplotlib.pyplot as plt
import numpy as np

We initialize the MediaPipe Pose model in static image mode with segmentation enabled and a minimum detection confidence of 0.5. We also load the drawing utilities for rendering landmarks and applying drawing styles.

mp_pose = mp.solutions.pose
mp_drawing = mp.solutions.drawing_utils
mp_drawing_styles = mp.solutions.drawing_styles

pose = mp_pose.Pose(
    static_image_mode=True,
    model_complexity=1,
    enable_segmentation=True,
    min_detection_confidence=0.5
)

Here, we define the detect_pose function, which reads an image, processes it to detect human pose landmarks using MediaPipe, and returns the annotated image along with the detected landmarks. If landmarks are found, they are drawn using default styling.

def detect_pose(image_path):
    image = cv2.imread(image_path)
    image_rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

    results = pose.process(image_rgb)

    annotated_image = image_rgb.copy()
    if results.pose_landmarks:
        mp_drawing.draw_landmarks(
            annotated_image,
            results.pose_landmarks,
            mp_pose.POSE_CONNECTIONS,
            landmark_drawing_spec=mp_drawing_styles.get_default_pose_landmarks_style()
        )

    return annotated_image, results.pose_landmarks

We define the visualize_pose function, which displays the original and pose-annotated images side by side using matplotlib. The extract_keypoints function converts detected pose landmarks into a dictionary of named keypoints with their x, y, z coordinates and visibility scores.

def visualize_pose(original_image, annotated_image):
    plt.figure(figsize=(16, 8))

    plt.subplot(1, 2, 1)
    plt.title('Original Image')
    plt.imshow(cv2.cvtColor(original_image, cv2.COLOR_BGR2RGB))
    plt.axis('off')

    plt.subplot(1, 2, 2)
    plt.title('Pose Estimation')
    plt.imshow(annotated_image)
    plt.axis('off')

    plt.tight_layout()
    plt.show()

def extract_keypoints(landmarks):
    if landmarks:
        keypoints = {}
        for idx, landmark in enumerate(landmarks.landmark):
            keypoints[mp_pose.PoseLandmark(idx).name] = {
                'x': landmark.x,
                'y': landmark.y,
                'z': landmark.z,
                'visibility': landmark.visibility
            }
        return keypoints
    return None
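
As an optional aside for downstream analysis (for example, the sports analytics use cases mentioned earlier), the extracted keypoints can be turned into joint angles. The helper below is a small, hypothetical addition that is not part of the original tutorial; it assumes the dictionary produced by extract_keypoints:

def joint_angle(keypoints, a, b, c):
    """Return the angle (in degrees) at keypoint b formed by keypoints a-b-c.

    a, b, c are landmark names such as 'LEFT_SHOULDER', 'LEFT_ELBOW', 'LEFT_WRIST'.
    """
    def _xy(name):
        return np.array([keypoints[name]['x'], keypoints[name]['y']])

    v1 = _xy(a) - _xy(b)
    v2 = _xy(c) - _xy(b)
    cos_angle = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-9)
    return float(np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0))))

# Example usage (after running extract_keypoints on a detected pose):
# print(joint_angle(keypoints, 'LEFT_SHOULDER', 'LEFT_ELBOW', 'LEFT_WRIST'))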

Finally, we load an image from the specified path, detect and visualize human pose landmarks using MediaPipe, and then extract and print the coordinates and visibility of each detected keypoint.

image_path = '/content/Screenshot 2025-03-26 at 12.56.05 AM.png'
original_image = cv2.imread(image_path)
annotated_image, landmarks = detect_pose(image_path)

visualize_pose(original_image, annotated_image)

keypoints = extract_keypoints(landmarks)
if keypoints:
    print("Detected Keypoints:")
    for name, details in keypoints.items():
        print(f"{name}: {details}")

Sample Processed Output

In this tutorial, we explored human pose estimation using MediaPipe and OpenCV, demonstrating a comprehensive approach to body keypoint detection. We implemented a robust pipeline that transforms images into detailed skeletal maps, covering key steps including library installation, pose detection function creation, visualization techniques, and keypoint extraction. Using advanced machine learning models, we showcased how developers can transform raw visual data into meaningful movement insights across various domains like sports analytics and healthcare monitoring.

Here is the Colab Notebook.

The post A Code Implementation for Advanced Human Pose Estimation Using MediaPipe, OpenCV and Matplotlib appeared first on MarkTechPost.

RWKV-7: Advancing Recurrent Neural Networks for Efficient Sequence Mod …

Autoregressive Transformers have become the leading approach for sequence modeling due to their strong in-context learning and parallelizable training enabled by softmax attention. However, softmax attention has quadratic complexity in sequence length, leading to high computational and memory demands, especially for long sequences. While GPU optimizations mitigate this for short sequences, inference remains costly at scale. Researchers have explored recurrent architectures with compressive states that offer linear complexity and constant memory use to address this. Advances in linear attention and state-space models (SSMs) have shown promise, with RNN-based approaches like RWKV-4 achieving competitive performance while significantly lowering inference costs.

Researchers from multiple institutions, including the RWKV Project, EleutherAI, Tsinghua University, and others, introduce RWKV-7 “Goose,” a novel sequence modeling architecture that establishes new state-of-the-art (SoTA) performance at the 3 billion parameter scale for multilingual tasks. Despite being trained on significantly fewer tokens than competing models, RWKV-7 achieves comparable English language performance while maintaining constant memory usage and inference time per token. The architecture extends the delta rule by incorporating vector-valued state gating, adaptive in-context learning rates, and a refined value replacement mechanism. These improvements enhance expressivity, enable efficient state tracking, and allow recognition of all regular languages, exceeding the theoretical capabilities of Transformers under standard complexity assumptions. To support its development, the researchers release an extensive 3.1 trillion-token multilingual corpus, alongside multiple pre-trained RWKV-7 models ranging from 0.19 to 2.9 billion parameters, all available under an open-source Apache 2.0 license.

RWKV-7 introduces key innovations layered on the RWKV-6 architecture, including token-shift, bonus mechanisms, and a ReLU² feedforward network. The model’s training corpus, RWKV World v3, enhances its English, code, and multilingual capabilities. In addition to releasing trained models, the team provides proof that RWKV-7 can solve problems beyond TC₀ complexity, including S₅ state tracking and regular language recognition. This demonstrates its ability to handle computationally complex tasks more efficiently than Transformers. Furthermore, the researchers propose a cost-effective method to upgrade the RWKV architecture without full retraining, facilitating incremental improvements. The development of larger datasets and models will continue under open-source licensing, ensuring broad accessibility and reproducibility.

The RWKV-7 model employs a structured approach to sequence modeling, denoting model dimensions as D and using trainable matrices for computations. It introduces vector-valued state gating, in-context learning rates, and a refined delta rule formulation. The time-mixing process involves weight preparation using low-rank MLPs, with key components like replacement keys, decay factors, and learning rates designed for efficient state evolution. A weighted key-value (WKV) mechanism facilitates dynamic state transitions, approximating a forget gate. Additionally, RWKV-7 enhances expressivity through per-channel modifications and a two-layer MLP, improving computational stability and efficiency while preserving state-tracking capabilities.
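
To make the state-evolution idea concrete, here is a deliberately simplified sketch of a gated delta-rule update of the kind this model family builds on. It is illustrative only and not the exact RWKV-7 formulation, which adds low-rank weight preparation, a bonus term, and further gating:

import numpy as np

def delta_rule_step(S, k, v, w, beta):
    """One recurrent step of a simplified, gated delta rule.

    S    : (d, d) matrix-valued state
    k, v : (d,) key and value vectors for the current token
    w    : (d,) per-channel decay, playing the role of vector-valued state gating
    beta : in-context learning rate in [0, 1]
    """
    S = S * w[None, :]                      # decay the state channel-wise
    v_pred = S @ k                          # value the state currently associates with this key
    S = S + beta * np.outer(v - v_pred, k)  # replace the stored value, moving it toward v
    return S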

RWKV-7 models were assessed using the LM Evaluation Harness on various English and multilingual benchmarks, demonstrating competitive performance with state-of-the-art models while utilizing fewer training tokens. Notably, RWKV-7 outperformed its predecessor on MMLU and significantly improved on multilingual tasks. Additionally, evaluations on recent internet data confirmed its effectiveness in handling newly published information. The model excelled in associative recall, mechanistic architecture design, and long-context retention. Despite constraints in training resources, RWKV-7 demonstrated superior efficiency, achieving strong benchmark results while requiring fewer FLOPs than leading transformer models.

In conclusion, RWKV-7 is an RNN-based architecture that achieves state-of-the-art results across multiple benchmarks while requiring significantly fewer training tokens. It maintains high parameter efficiency, linear time complexity, and constant memory usage, making it a strong alternative to Transformers. However, it faces limitations such as numerical precision sensitivity, lack of instruction tuning, prompt sensitivity, and restricted computational resources. Future improvements include optimizing speed, incorporating chain-of-thought reasoning, and scaling with larger datasets. The RWKV-7 models and training code are openly available under the Apache 2.0 License to encourage research and development in efficient sequence modeling.

Check out the Paper. All credit for this research goes to the researchers of this project.


Amazon Bedrock launches Session Management APIs for generative AI applications

Amazon Bedrock announces the preview launch of Session Management APIs, a new capability that enables developers to simplify state and context management for generative AI applications built with popular open source frameworks such as LangGraph and LlamaIndex. Session Management APIs provide an out-of-the-box solution that enables developers to securely manage state and conversation context across multi-step generative AI workflows, alleviating the need to build, maintain, or scale custom backend solutions. In this post, we discuss the new Session Management APIs and how to handle session state in your generative AI applications.
By preserving session state between interactions, Session Management APIs enhance workflow continuity, enabling generative AI applications, such as virtual assistants and multi-agent research workflows, that require persistent context across extended interactions. Developers can use this capability to checkpoint workflow stages, save intermediate states, and resume tasks from points of failure or interruption. Additionally, they can pause and replay sessions and use detailed traces to debug and enhance their generative AI applications. By treating sessions as a first-class resource, this capability enables developers to enforce granular access control through AWS Identity and Access Management (IAM) and encrypt data using AWS Key Management Service (AWS KMS), making sure that data from different user sessions is securely isolated and supporting multi-tenant applications with strong privacy protections.
Building generative AI applications requires more than model API calls. Your applications must handle conversation history, user preferences, state tracking, and contextual shifts. As these applications grow in complexity, robust state management becomes crucial. Key reasons include:

Contextual coherence – Maintaining state makes sure that the application can track the flow of information, leading to more coherent and contextually relevant outputs.
User interaction tracking – In interactive applications, state management allows the system to remember user inputs and preferences, facilitating personalized experiences.
Resource optimization – Efficient state management helps in allocating computational resources effectively, making sure that the application runs smoothly without unnecessary redundancy.
Error handling and recovery – Developers can use this capability to checkpoint workflow stages, save intermediate states, and resume tasks from points of failure or interruption.

Background
State persistence in generative AI applications refers to the ability to maintain and recall information across multiple interactions. This is crucial for creating coherent and contextually relevant experiences. Some of the information that you might need to persist includes:

User information – Basic details about the user, such as ID, preferences, or history
Conversation history – A record of previous interactions within the current session
Context markers – Indicators of the current topic, intent, or stage in a multi-turn conversation
Application state – The current status of ongoing processes or workflows

Effective use of session attributes enables personalization by tailoring responses based on the ongoing conversation, continuity by allowing conversations to pick up where they left off even after interruptions, and complex task handling by managing multi-step processes or decision trees effectively. These capabilities enhance the user experience and the overall functionality of generative AI applications.
Challenges
Implementing robust state management in generative AI applications presents several interconnected challenges. The system must handle state persistence and retrieval in milliseconds to maintain fluid conversations. As traffic grows and contextual data expands, state management also needs to efficiently scale.
When you build your own state management system, you need to implement backend services and infrastructure that handle persistence, checkpointing, and retrieval operations. For this post, we consider LangGraph to discuss the concepts of short-term memory and available options. Short-term memory stores information within a single conversation thread, which is managed as part of the agent’s state and persisted using thread-scoped checkpoints. You can persist short-term memory in a database like PostgreSQL using either a synchronous or asynchronous connection. However, you need to set up the infrastructure, implement data governance, and enable security and monitoring.
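
For reference, a minimal sketch of that do-it-yourself route with LangGraph's PostgreSQL checkpointer might look like the following (assuming the langgraph-checkpoint-postgres package, a reachable PostgreSQL instance, and a graph_builder you have already defined; the connection string is a placeholder):

from langgraph.checkpoint.postgres import PostgresSaver

DB_URI = "postgresql://user:password@localhost:5432/checkpoints"  # placeholder

with PostgresSaver.from_conn_string(DB_URI) as checkpointer:
    checkpointer.setup()  # creates the checkpoint tables on first use
    graph = graph_builder.compile(checkpointer=checkpointer)
    # You own the database: provisioning, scaling, encryption, and monitoring are on you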
Solution overview
The Session Management APIs in Amazon Bedrock offer a comprehensive solution that streamlines the development and deployment of generative AI applications by alleviating the need for custom infrastructure setup and maintenance. This capability not only minimizes the complexities of handling data persistence, retrieval, and checkpointing, but also provides enterprise-grade security features with built-in tenant isolation capabilities. You can offload the heavy lifting of managing state and context of your DIY generative AI solutions to Session Management APIs, while still using your preferred OSS tool. This will accelerate your path to deploy secure and scalable generative AI solutions.
The Session Management APIs also support human-in-the-loop scenarios, where manual intervention is required within automated workflows. Additionally, it provides comprehensive debugging and traceability features, maintaining detailed execution logs for troubleshooting and compliance purposes. The ability to quickly retrieve and analyze session data empowers developers to optimize their applications based on actual usage patterns and performance metrics.
To understand how Session Management APIs integrate with LangGraph applications, let’s look at the following high-level flow.

Example use case
To demonstrate the power and simplicity of Session Management APIs, let’s walk through a practical example of building a shoe shopping assistant. We will show how BedrockSessionSaver provides a custom checkpointing solution backed by the Session Management APIs. The complete code for this example is available in the AWS Samples GitHub repository.
First, let’s understand how Session Management APIs work with our application, as illustrated in the following diagram.

This process flow shows how each user interaction creates a new invocation in the session, maintains conversation context, and automatically persists state while the LangGraph application focuses on business logic. The seamless integration between these components enables sophisticated, stateful conversations without the complexity of managing infrastructure for state and context persistence.
Prerequisites
To follow along with this post, you need an AWS account with the appropriate permissions.
Set up the environment
We use the following code to set up the environment:

%pip install -U langgraph_checkpoint_aws

import boto3
from langgraph_checkpoint_aws.saver import BedrockSessionSaver

# Configure the Bedrock runtime client
bedrock_client = boto3.client("bedrock-runtime", region_name="<aws_region>")

Initialize the model
For our large language model (LLM), we use Anthropic’s Claude 3 Sonnet on Amazon Bedrock:

from langchain_aws import ChatBedrockConverse

llm = ChatBedrockConverse(
    model="anthropic.claude-3-sonnet-20240229-v1:0",
    temperature=0,
    max_tokens=None,
    client=bedrock_client,
)

Implement tools
Our assistant needs tools to search the product database and manage the shopping cart. These tools can use the information saved in the user session:

from langchain_core.tools import tool

@tool
def search_shoes(preference: str) -> list:
    """Search for shoes based on user preferences and interests."""
    # Placeholder: query the product catalog here and return matching items
    return []
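
The snippets that follow compile and stream a graph_builder that isn’t shown in this excerpt. The following is a minimal sketch of how such a graph might be assembled with standard LangGraph components; the state schema and node names are illustrative assumptions rather than the exact code from the AWS Samples repository:

from typing import Annotated
from typing_extensions import TypedDict

from langgraph.graph import StateGraph, START
from langgraph.graph.message import add_messages
from langgraph.prebuilt import ToolNode, tools_condition

class State(TypedDict):
    # Conversation history, appended to on every turn
    messages: Annotated[list, add_messages]

llm_with_tools = llm.bind_tools([search_shoes])

def chatbot(state: State):
    # Call the model with the accumulated message history
    return {"messages": [llm_with_tools.invoke(state["messages"])]}

graph_builder = StateGraph(State)
graph_builder.add_node("chatbot", chatbot)
graph_builder.add_node("tools", ToolNode([search_shoes]))
graph_builder.add_edge(START, "chatbot")
graph_builder.add_conditional_edges("chatbot", tools_condition)
graph_builder.add_edge("tools", "chatbot")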

Set up Session Management APIs
We use the following code to integrate the Session Management APIs:

# Initialize session saver
session_saver = BedrockSessionSaver(
    region_name="<aws_region>",
)

# Compile graph with session management
graph = graph_builder.compile(checkpointer=session_saver)

# Create a new session
session_id = session_saver.session_client.client.create_session()["sessionId"]

Run the conversation
Now we can run our stateful conversation:

from langchain_core.messages import BaseMessage

config = {"configurable": {"thread_id": session_id}}

while True:
    user_input = input("User: ")
    if user_input.lower() in ["quit", "exit", "q"]:
        print("Goodbye!")
        break
    for event in graph.stream(
        {"messages": [("user", user_input)]},
        config,
    ):
        for value in event.values():
            if isinstance(value["messages"][-1], BaseMessage):
                print("Assistant:", value["messages"][-1].content)

Access session history
You can quickly retrieve the conversation history using the graph instance:

for i in graph.get_state_history(config, limit=5):
    print(i)

Although it’s simple to access data using BedrockSessionSaver in LangGraph, there might be instances where you need to access session data directly—whether for auditing purposes or external processing. The Session Management APIs provide this functionality, though it’s important to note that the retrieved data is in serialized format. To work with this data meaningfully, you need to perform deserialization first:

# The Session Management APIs are exposed through the Bedrock Agent Runtime client
client = boto3.client("bedrock-agent-runtime")

# List all invocation steps
steps = client.list_invocation_steps(
    sessionIdentifier=session_id,
)

# Get specific step details
step_details = client.get_invocation_step(
    sessionIdentifier=session_id,
    invocationIdentifier="your-invocation-id",
    invocationStepId="your-step-id",
)

Replay and fork actions
You might want to analyze the steps to understand the reasoning, debug, or try out different paths. You can invoke the graph with a checkpoint to replay specific actions from that point:

config_replay = {
    "configurable": {
        "thread_id": session_id,
        "checkpoint_id": "<checkpoint_id>",
    }
}

for event in graph.stream(None, config_replay, stream_mode="values"):
    print(event)

The graph replays previously executed steps before the provided checkpoint_id and executes the steps after checkpoint_id.
You can also try forking to revisit an agent’s past actions and explore alternative paths within the graph:

config = {
    "configurable": {
        "thread_id": session_id,
        "checkpoint_id": "<checkpoint_id>",
    }
}

graph.update_state(config, {"state": "updated state"})

Human-in-the-loop
Human-in-the-loop (HITL) interaction patterns allow the graph to stop at specific steps and seek human approval before proceeding. This is important if you have to review specific tool calls. In LangGraph, breakpoints are built on checkpoints, which save the graph’s state after each node execution. You can use the Session Management APIs to effectively implement HITL in your graph.
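
As a rough illustration, the following minimal sketch reuses the graph_builder and session_saver from earlier (the node name "tools" matches the assembly sketch above): compile the graph with a breakpoint before tool execution, then resume after a human approves.

# Pause before the "tools" node so a human can review pending tool calls
graph = graph_builder.compile(
    checkpointer=session_saver,
    interrupt_before=["tools"],
)
config = {"configurable": {"thread_id": session_id}}

# Run until the breakpoint; the checkpoint is persisted through the Session Management APIs
for event in graph.stream({"messages": [("user", "Find me trail running shoes")]}, config):
    print(event)

# Inspect what is about to run, then resume from the saved checkpoint by passing None
print(graph.get_state(config).next)
for event in graph.stream(None, config):
    print(event)
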
This example demonstrates how Session Management APIs seamlessly integrate with LangGraph to create a stateful conversation that maintains context across interactions. The Session Management APIs handle the complexity of state persistence, allowing you to focus on building the conversation logic.
The complete code is available in the AWS Samples GitHub repository. Feel free to clone it and experiment with your own modifications.
Clean up
To avoid incurring ongoing charges, clean up the resources you created as part of this solution.
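
For example, if you created sessions while experimenting, you can close and remove them. The following is a minimal sketch assuming the bedrock-agent-runtime client created earlier; EndSession and DeleteSession are the Session Management API operations for closing and deleting a session:

# End the session, then delete it so no session data is retained
client.end_session(sessionIdentifier=session_id)
client.delete_session(sessionIdentifier=session_id)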
Considerations and best practices
When implementing the Session Management APIs, consider these key practices for optimal results:

Session lifecycle management – Plan your session lifecycles carefully, from creation to termination. Initialize sessions using CreateSession at the start of conversations and properly close them with EndSession when complete. This approach promotes efficient resource utilization and maintains clean state boundaries between interactions.
Security and compliance – For applications handling sensitive information, implement appropriate data protection measures using the Session Management APIs’ built-in security features. By default, AWS managed keys are used for session encryption. For additional security requirements, you can encrypt session data with a customer managed key (see the sketch after this list). Use the service’s data retention and deletion capabilities to maintain compliance with relevant regulations while maintaining proper data governance.
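
The following minimal sketch shows what creating a session encrypted with a customer managed key might look like; the key ARN is a placeholder, and the encryptionKeyArn parameter reflects our reading of the CreateSession API:

import boto3

client = boto3.client("bedrock-agent-runtime")

# Create a session whose state is encrypted with your own KMS key
session = client.create_session(
    encryptionKeyArn="arn:aws:kms:<aws_region>:<account_id>:key/<key_id>",
)
print(session["sessionId"])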

Conclusion
The Session Management APIs in Amazon Bedrock offer a powerful solution for handling state in generative AI applications. By using this fully managed capability, developers can focus on creating innovative AI experiences without getting caught up in the complexities of infrastructure management. The seamless integration with LangGraph enhances its utility, allowing for rapid development and deployment of sophisticated, stateful AI applications.
As the field of generative AI continues to evolve, robust state management will become increasingly crucial. The Session Management APIs provide the scalability, security, and flexibility needed to help meet these growing demands, enabling developers to build more contextually aware, personalized, and reliable AI-powered applications.
By adopting the Session Management APIs, developers can accelerate their path to production, provide better user experiences through consistent and coherent interactions, and focus their efforts on the unique value propositions of their AI applications rather than the underlying infrastructure challenges.
Try out the Session Management APIs for your own use case, and share your feedback in the comments.

About the authors
Jagdeep Singh Soni is a Senior Partner Solutions Architect at AWS based in the Netherlands. He uses his passion for Generative AI to help customers and partners build GenAI applications using AWS services. Jagdeep has 15 years of experience in innovation, experience engineering, digital transformation, cloud architecture and ML applications.
Ishan Singh is a Generative AI Data Scientist at Amazon Web Services, where he helps customers build innovative and responsible generative AI solutions and products. With a strong background in AI/ML, Ishan specializes in building Generative AI solutions that drive business value. Outside of work, he enjoys playing volleyball, exploring local bike trails, and spending time with his wife and dog, Beau.
Rupinder Grewal is a Tech Lead Gen AI Specialist. He enjoys playing tennis and biking on mountain trails.
Krishna Gourishetti is a Senior Software Engineer for the Bedrock Agents team in AWS. He is passionate about building scalable software solutions that solve customer problems. In his free time, Krishna loves to go on hikes.
Aniketh Manjunath is a Software Development Engineer at Amazon Bedrock. He is passionate about distributed machine learning systems. Outside of work, he enjoys hiking, watching movies, and playing cricket.
Sarthak Handa serves as a Principal Product Manager at Amazon Web Services (AWS) AI/ML in Seattle, Washington, where his primary focus is on developing AI services that facilitate advancements in the healthcare industry. Prior to his work at AWS, Sarthak spent several years as a startup founder, building technology solutions for the healthcare and disaster relief sectors.

Enhance deployment guardrails with inference component rolling updates …

Deploying models efficiently, reliably, and cost-effectively is a critical challenge for organizations of all sizes. As organizations increasingly deploy foundation models (FMs) and other machine learning (ML) models to production, they face challenges related to resource utilization, cost-efficiency, and maintaining high availability during updates. Amazon SageMaker AI introduced inference component functionality that can help organizations reduce model deployment costs by optimizing resource utilization through intelligent model packing and scaling. Inference components abstract ML models and enable assigning dedicated resources and specific scaling policies per model.
However, updating these models—especially in production environments with strict latency SLAs—has historically risked downtime or resource bottlenecks. Traditional blue/green deployments often struggle with capacity constraints, making updates unpredictable for GPU-heavy models. To address this, we’re excited to announce another powerful enhancement to SageMaker AI: rolling updates for inference component endpoints, a feature designed to streamline updates for models of different sizes while minimizing operational overhead.
In this post, we discuss the challenges faced by organizations when updating models in production. Then we deep dive into the new rolling update feature for inference components and provide practical examples using DeepSeek distilled models to demonstrate this feature. Finally, we explore how to set up rolling updates in different scenarios.
Challenges with blue/green deployment
Traditionally, SageMaker AI inference has supported the blue/green deployment pattern for updating inference components in production. Though effective for many scenarios, this approach comes with specific challenges:

Resource inefficiency – Blue/Green deployment requires provisioning resources for both the current (blue) and new (green) environments simultaneously. For inference components running on expensive GPU instances like P4d or G5, this means potentially doubling the resource requirements during deployments. Consider an example where a customer has 10 copies of an inference component spread across 5 ml.p4d.24xlarge instances, all operating at full capacity. With blue/green deployment, SageMaker AI would need to provision five additional ml.p4d.24xlarge instances to host the new version of the inference component before switching traffic and decommissioning the old instances.
Limited computing resources – For customers using powerful GPU instances like the P or G series, the required capacity might not be available in a given Availability Zone or Region. This often results in instance capacity exceptions during deployments, causing update failures and rollbacks.
All-or-nothing transitions – Traditional blue/green deployments shift all traffic at one time or based on a configured schedule. This leaves limited room for gradual validation and increases the area of effect if issues arise with the new deployment.

Although blue/green deployment has been a reliable strategy for zero-downtime updates, its limitations become glaring when deploying large-scale large language models (LLMs) or high-throughput models on premium GPU instances. These challenges demand a more nuanced approach—one that incrementally validates updates while optimizing resource usage. Rolling updates for inference components are designed to eliminate the rigidity of blue/green deployments. By updating models in controlled batches, dynamically scaling infrastructure, and integrating real-time safety checks, this strategy makes sure deployments remain cost-effective, reliable, and adaptable—even for GPU-heavy workloads.
Rolling deployment for inference component updates
As mentioned earlier, inference components are introduced as a SageMaker AI feature to optimize costs; they allow you to define and deploy the specific resources needed for your model inference workload. By right-sizing compute resources to match your model’s requirements, you can save costs during updates compared to traditional deployment approaches.
With rolling updates, SageMaker AI deploys new model versions in configurable batches of inference components while dynamically scaling instances. This is particularly impactful for LLMs:

Batch size flexibility – When updating the inference components in a SageMaker AI endpoint, you can specify the batch size for each rolling step. For each step, SageMaker AI provisions capacity based on the specified batch size on the new endpoint fleet, routes traffic to that fleet, and stops capacity on the old endpoint fleet. Smaller models like DeepSeek Distilled Llama 8B can use larger batches for rapid updates, and larger models like DeepSeek Distilled Llama 70B use smaller batches to limit GPU contention.
Automated safety guards – Integrated Amazon CloudWatch alarms monitor metrics on an inference component. You can configure the alarms to check if the newly deployed version of inference component is working properly or not. If the CloudWatch alarms are triggered, SageMaker AI will start an automated rollback.

The new functionality is implemented through extensions to the SageMaker AI API, primarily with new parameters in the UpdateInferenceComponent API:

sagemaker_client.update_inference_component(
    InferenceComponentName=inference_component_name,
    RuntimeConfig={"CopyCount": number},
    Specification={ ... },
    DeploymentConfig={
        "RollingUpdatePolicy": {
            "MaximumBatchSize": {  # Value must be between 5% to 50% of the IC's total copy count.
                "Type": "COPY_COUNT",  # COPY_COUNT | CAPACITY_PERCENT
                "Value": 1  # Minimum value of 1
            },
            "MaximumExecutionTimeoutInSeconds": 600,  # Minimum value of 600. Maximum value of 28800.
            "RollbackMaximumBatchSize": {
                "Type": "COPY_COUNT",  # COPY_COUNT | CAPACITY_PERCENT
                "Value": 1
            },
            "WaitIntervalInSeconds": 120  # Minimum value of 0. Maximum value of 3600.
        },
        "AutoRollbackConfiguration": {
            "Alarms": [
                {
                    "AlarmName": "string"  # Optional
                }
            ]
        }
    },
)

The preceding code uses the following parameters:

MaximumBatchSize – This is a required parameter and defines the batch size for each rolling step in the deployment process. For each step, SageMaker AI provisions capacity on the new endpoint fleet, routes traffic to that fleet, and stops capacity on the old endpoint fleet. The value must be between 5–50% of the copy count of the inference component.

Type – This parameter specifies the endpoint capacity type, either COPY_COUNT or CAPACITY_PERCENT.
Value – This defines the capacity size, either as a number of inference component copies or a capacity percentage.

MaximumExecutionTimeoutInSeconds – This is the maximum time that the rolling deployment would spend on the overall execution. Exceeding this limit causes a timeout.
RollbackMaximumBatchSize – This is the batch size for a rollback to the old endpoint fleet. If this field is absent, the value is set to the default, which is 100% of the total capacity. When the default is used, SageMaker AI provisions the entire capacity of the old fleet at the same time during rollback.

Value – This defines the rollback capacity size, used together with the specified Type. If you don’t specify the fields in this object, or if you set the Value to 100%, SageMaker AI uses a blue/green rollback strategy and rolls traffic back to the blue fleet.

WaitIntervalInSeconds – This is the wait (baking) period between batches, during which SageMaker AI monitors the configured alarms on the newly deployed copies before moving on to the next batch.
AutoRollbackConfiguration – This is the automatic rollback configuration for handling endpoint deployment failures and recovery.

AlarmName – This CloudWatch alarm is configured to monitor metrics on an InferenceComponent. You can configure it to check if the newly deployed version of InferenceComponent is working properly or not.

For more information about the SageMaker AI API, refer to the SageMaker AI API Reference.
Customer experience
Let’s explore how rolling updates work in practice with several common scenarios, using different-sized LLMs. You can find the example notebook in the GitHub repo.
Scenario 1: Multiple single GPU cluster
In this scenario, assume you’re running an endpoint with three ml.g5.2xlarge instances, each with a single GPU. The endpoint hosts an inference component that requires one GPU accelerator, which means each instance holds one copy. When you want to update the inference component to use a new inference component version, you can use rolling updates to minimize disruption.
You can configure a rolling update with a batch size of one, meaning SageMaker AI will update one copy at a time. During the update process, SageMaker AI first identifies available capacity in the existing instances. Because none of the existing instances has space for additional temporary workloads, SageMaker AI will launch new ml.g5.2xlarge instances one at a time to deploy one copy of the new inference component version to a GPU instance. After the specified wait interval, and after the new inference component’s container passes its health check, SageMaker AI removes one copy of the old version (because each copy is hosted on one instance, that instance is torn down accordingly), completing the update for the first batch.
This process repeats for the second copy of the inference component, providing a smooth transition with zero downtime. The gradual nature of the update minimizes risk and allows you to maintain consistent availability throughout the deployment process. The following diagram shows this process.

Scenario 2: Update with automatic rollback
In another scenario, you might be updating your inference component from Llama-3.1-8B-Instruct to DeepSeek-R1-Distill-Llama-8B, but the new model version has different API expectations. In this use case, you have configured a CloudWatch alarm to monitor for 4xx errors, which would indicate API compatibility issues.
You can initiate a rolling update with a batch size of one copy. SageMaker AI deploys the first copy of the new version on a new GPU instance. When the new instance is ready to serve traffic, SageMaker AI will forward a proportion of the invocation requests to this new model. However, in this example, the new model version, which is missing the “MESSAGES_API_ENABLED” environment variable configuration, will begin to return 4xx errors when receiving requests in the Messages API format.

The configured CloudWatch alarm detects these errors and transitions to the alarm state. SageMaker AI automatically detects this alarm state and initiates a rollback process according to the rollback configuration. Following the specified rollback batch size, SageMaker AI removes the problematic new model version and maintains the original working version, preventing widespread service disruption. The endpoint returns to its original state with traffic being handled by the properly functioning original model version.
The following code snippet shows how to set up a CloudWatch alarm to monitor 4xx errors:

# Create a CloudWatch client and an alarm on 4xx errors for the inference component
cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName=f"SageMaker-{endpoint_name}-4xx-errors",
    ComparisonOperator="GreaterThanThreshold",
    EvaluationPeriods=1,
    MetricName="Invocation4XXErrors",
    Namespace="AWS/SageMaker",
    Period=300,
    Statistic="Sum",
    Threshold=5.0,
    ActionsEnabled=True,
    AlarmDescription="Alarm when greater than 5 4xx errors",
    Dimensions=[
        {
            "Name": "InferenceComponentName",
            "Value": inference_component_name
        },
    ],
)

Then you can use this CloudWatch alarm in the update request:

sagemaker_client.update_inference_component(
    InferenceComponentName=inference_component_name,
    # ... other parameters as in the earlier example ...
    DeploymentConfig={
        "RollingUpdatePolicy": {
            "MaximumBatchSize": {
                "Type": "COPY_COUNT",
                "Value": 1
            },
            "WaitIntervalInSeconds": 120,
            "RollbackMaximumBatchSize": {
                "Type": "COPY_COUNT",
                "Value": 1
            }
        },
        "AutoRollbackConfiguration": {
            "Alarms": [
                {"AlarmName": f"SageMaker-{endpoint_name}-4xx-errors"}
            ]
        }
    }
)

Scenario 3: Update with sufficient capacity in the existing instances
If an existing endpoint has multiple GPU accelerators and not all the accelerators are used, the update can use existing GPU accelerators without launching new instances to the endpoint. Consider if you have an endpoint configured with an initial two ml.g5.12xlarge instances that have four GPU accelerators in each instance. The endpoint hosts two inference components: IC-1 requires one accelerator and IC-2 also requires one accelerator. On one ml.g5.12xlarge instance, there are four copies of IC-1 that have been created; on the other instance, two copies of IC-2 have been created. There are still two GPU accelerators available on the second instance.
When you initiate an update for IC-1 with a batch size of two copies, SageMaker AI determines that there is sufficient capacity in the existing instances to host the new versions while maintaining the old ones. It will create two copies of the new IC-1 version on the second instance. When the containers are up and running, SageMaker AI starts routing traffic to the new IC-1 copies and then removes two of the old IC-1 copies from the instance. You are not charged until the new inference components start taking invocations and generating responses.
Now another two free GPU slots are available. SageMaker AI will update the second batch, and it will use the free GPU accelerators that just became available. After the processes are complete, the endpoint has four IC-1 with the new version and two copies of IC-2 that weren’t changed.

Scenario 4: Update requiring additional instance capacity
Consider if you have an endpoint configured with initially one ml.g5.12xlarge instance (4 GPUs total) and configured managed instance scaling (MIS) with a maximum instance number set to two. The endpoint hosts two inference components: IC-1 requiring 1 GPU with two copies (Llama 8B), and IC-2 (DeepSeek Distilled Llama 14B model) also requiring 1 GPU with two copies—utilizing all 4 available GPUs.
When you initiate an update for IC-1 with a batch size of two copies, SageMaker AI determines that there’s insufficient capacity in the existing instances to host the new versions while maintaining the old ones. Instead of failing the update, because you have configured MIS, SageMaker AI will automatically provision a second ml.g5.12xlarge instance to host the new inference components.
During the update process, SageMaker AI deploys two copies of the new IC-1 version onto the newly provisioned instance, as shown in the following diagram. After the new inference components are up and running, SageMaker AI begins removing the old IC-1 copies from the original instances. By the end of the update, the first instance will host IC-2 utilizing 2 GPUs, and the newly provisioned second instance will host the updated IC-1 with two copies using 2 GPUs. There will be new spaces available in the two instances, and you can deploy more inference component copies or new models to the same endpoint using the available GPU resources. If you set up managed instance auto scaling and set inference component auto scaling to zero, you can scale down the inference component copies to zero, which will result in the corresponding instance being scaled down. When the inference component is scaled up, SageMaker AI will launch the inference components in the existing instance with the available GPU accelerators, as mentioned in scenario 3.

Scenario 5: Update facing insufficient capacity
In scenarios where there isn’t enough GPU capacity, SageMaker AI provides clear feedback about capacity constraints. Consider if you have an endpoint running on 30 ml.g6e.16xlarge instances, each already fully utilized with inference components. You want to update an existing inference component using a rolling deployment with a batch size of 4, but after the first four batches are updated, there isn’t enough GPU capacity available for the remaining update. In this case, SageMaker AI will automatically roll back to the previous setup and stop the update process.
There are two possible final statuses for this rollback. In the first case, the rollback succeeds because capacity is available to launch the instances for the old model version. In the second case, the capacity issue persists during the rollback, and the endpoint shows as UPDATE_ROLLBACK_FAILED. The existing instances can still serve traffic, but to move the endpoint out of the failed status, you need to contact your AWS support team.
Additional considerations
As mentioned earlier, when using blue/green deployment to update the inference components on an endpoint, you need to provision resources for both the current (blue) and new (green) environments simultaneously. When you’re using rolling updates for inference components on the endpoint, you can use the following equation to calculate the number of account service quotas for the instance type required. The GPU instance required for the endpoint has X number of GPU accelerators, and each inference component copy requires Y number of GPU accelerators. The maximum batch size is set to Z and the current endpoint has N instances. Therefore, the account-level service quota required for this instance type for the endpoint should be greater than the output of the equation:
ROUNDUP(Z x Y / X) + N
For example, let’s assume the current endpoint has 8 (N) ml.g5.12xlarge instances, each of which has 4 GPU accelerators. You set the maximum batch size to 2 (Z) copies, and each copy needs 1 (Y) GPU accelerator. The minimum AWS service quota value for ml.g5.12xlarge is ROUNDUP(2 x 1 / 4) + 8 = 9. In another scenario, when each copy of the inference component requires 4 GPU accelerators, the required account-level service quota for the same instance type should be ROUNDUP(2 x 4 / 4) + 8 = 10.
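
As a quick sanity check, a few lines of Python reproduce this calculation (the helper function name is illustrative):

import math

def required_instance_quota(gpus_per_instance, gpus_per_copy, max_batch_size, current_instances):
    """Estimate the account-level instance quota needed for a rolling update."""
    return math.ceil(max_batch_size * gpus_per_copy / gpus_per_instance) + current_instances

# 8 ml.g5.12xlarge instances (4 GPUs each), batch size of 2 copies, 1 GPU per copy
print(required_instance_quota(gpus_per_instance=4, gpus_per_copy=1, max_batch_size=2, current_instances=8))  # 9

# Same fleet, but each copy needs 4 GPUs
print(required_instance_quota(gpus_per_instance=4, gpus_per_copy=4, max_batch_size=2, current_instances=8))  # 10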
Conclusion
Rolling updates for inference components represent a significant enhancement to the deployment capabilities of SageMaker AI. This feature directly addresses the challenges of updating model deployments in production, particularly for GPU-heavy workloads, and it eliminates capacity guesswork and reduces rollback risk. By combining batch-based updates with automated safeguards, SageMaker AI makes sure deployments are agile and resilient.
Key benefits include:

Reduced resource overhead during deployments, eliminating the need to provision duplicate fleets
Improved deployment guardrails with gradual updates and automatic rollback capabilities
Continued availability during updates with configurable batch sizes
Straightforward deployment of resource-intensive models that require multiple accelerators

Whether you’re deploying compact models or larger multi-accelerator models, rolling updates provide a more efficient, cost-effective, and safer path to keeping your ML models current in production.
We encourage you to try this new capability with your SageMaker AI endpoints and discover how it can enhance your ML operations. For more information, check out the SageMaker AI documentation or connect with your AWS account team.

About the authors
Melanie Li, PhD, is a Senior Generative AI Specialist Solutions Architect at AWS based in Sydney, Australia, where her focus is on working with customers to build solutions leveraging state-of-the-art AI and machine learning tools. She has been actively involved in multiple Generative AI initiatives across APJ, harnessing the power of Large Language Models (LLMs). Prior to joining AWS, Dr. Li held data science roles in the financial and retail industries.
Andrew Smith is a Cloud Support Engineer in the SageMaker, Vision & Other team at AWS, based in Sydney, Australia. He supports customers using many AI/ML services on AWS with expertise in working with Amazon SageMaker. Outside of work, he enjoys spending time with friends and family as well as learning about different technologies.
Dustin Liu is a solutions architect at AWS, focused on supporting financial services and insurance (FSI) startups and SaaS companies. He has a diverse background spanning data engineering, data science, and machine learning, and he is passionate about leveraging AI/ML to drive innovation and business transformation.
Vivek Gangasani is a Senior GenAI Specialist Solutions Architect at AWS. He helps emerging generative AI companies build innovative solutions using AWS services and accelerated compute. Currently, he is focused on developing strategies for fine-tuning and optimizing the inference performance of large language models. In his free time, Vivek enjoys hiking, watching movies, and trying different cuisines.
Shikher Mishra is a Software Development Engineer with SageMaker Inference team with over 9+ years of industry experience. He is passionate about building scalable and efficient solutions that empower customers to deploy and manage machine learning applications seamlessly. In his spare time, Shikher enjoys outdoor sports, hiking and traveling.
June Won  is a product manager with Amazon SageMaker JumpStart. He focuses on making foundation models easily discoverable and usable to help customers build generative AI applications. His experience at Amazon also includes mobile shopping applications and last mile delivery.

Evaluate and improve performance of Amazon Bedrock Knowledge Bases

Amazon Bedrock Knowledge Bases is a fully managed capability that helps implement entire Retrieval Augmented Generation (RAG) workflows from ingestion to retrieval and prompt augmentation without having to build custom integrations to data sources and manage data flows.
There is no single way to optimize knowledge base performance: each use case is impacted differently by configuration parameters. As such, it’s important to test often and iterate quickly to identify the best configuration for each use case.
In this post, we discuss how to evaluate the performance of your knowledge base, including the metrics and data to use for evaluation. We also address some of the tactics and configuration changes that can improve specific metrics.
Measure the performance of your knowledge base
RAG is a complex AI system, combining several critical steps. In order to identify what is impacting the performance of the pipeline, it’s important to evaluate each step independently. The knowledge base evaluation framework decomposes the evaluation into the following stages:

Retrieval – The process of retrieving relevant parts of documents based on a query and adding the retrieved elements as context to the final prompt for the knowledge base
Generation – Sending the user’s prompt and the retrieved context to a large language model (LLM) and then sending the output from the LLM back to the user

The following diagram illustrates the standard steps in a RAG pipeline.

To see this evaluation framework in action, open the Amazon Bedrock console, and in the navigation pane, choose Evaluations. Choose the Knowledge Bases tab to review the evaluation.

Evaluate the retrieval
We recommend initially evaluating the retrieval process independently, because the accuracy and quality of this foundational stage can significantly impact downstream performance metrics in the RAG workflow, potentially introducing errors or biases that propagate through subsequent pipeline stages.

There are two metrics used to evaluate retrieval:

Context relevance – Evaluates whether the retrieved information directly addresses the query’s intent. It focuses on precision of the retrieval system.
Context coverage – Measures how comprehensively the retrieved texts cover the expected ground truth. It requires ground truth texts for comparison to assess recall and completeness of retrieved information.

Context relevance and context coverage metrics are compiled by comparing search results from the RAG pipeline with expected answers in the test dataset. The following diagram illustrates this workflow.

Running the evaluation requires you to bring a dataset that adheres to specific formatting guidelines. The dataset must be in JSON Lines format, with each line representing a valid JSON object. To maintain optimal performance, the dataset should be limited to a maximum of 1,000 prompts per evaluation. Each individual prompt within the dataset must be a well-structured, valid JSON object that can be properly parsed and processed by the evaluation system.
If you choose to evaluate for context coverage, you will need to provide a ground truth, which is text that serves as the baseline for measuring coverage. The ground truth must include referenceContexts, and each prompt in the ground truth must have corresponding reference contexts for accurate evaluation.
The following example code shows the required fields:

{
    "conversationTurns": [{
        "referenceContexts": [{
            "content": [{
                "text": "ground truth text"
            }]
        }],
        "prompt": {
            "content": [{
                "text": "query text"
            }]
        }
    }]
}

For more details, see Creating a prompt dataset for Retrieve only evaluation jobs.
Evaluate the generation
After validating that your RAG workflow successfully retrieves relevant context from your vector database and aligns with your predefined performance standards, you can proceed to evaluate the generation stage of your pipeline. The Amazon Bedrock evaluation tool provides a comprehensive assessment framework with eight metrics that cover both response quality and responsible AI considerations.
Response quality includes the following metrics:

Helpfulness – Evaluates how useful and comprehensive the generated responses are in answering questions
Correctness – Assesses the accuracy of responses to questions
Logical coherence – Examines responses for logical gaps, inconsistencies, or contradictions
Completeness – Evaluates whether responses address all aspects of the questions
Faithfulness – Measures factual accuracy and resistance to hallucinations

Responsible AI includes the following metrics:

Harmfulness – Evaluates responses for the presence of hate, insult, or violent content
Stereotyping – Assesses for generalized statements about groups or individuals
Refusal – Measures how appropriately the system declines to answer inappropriate questions

Response quality and responsible AI metrics are compiled by comparing search results and the generated response from the RAG pipeline with ground truth answers. The following diagram illustrates this workflow.

The dataset for evaluation must adhere to specific structural requirements, using JSON Lines format with a maximum of 1,000 prompts per evaluation. Each prompt is required to be a valid JSON object with a well-defined structure. Within this structure, two critical fields play essential roles: the prompt field contains the query text used for model evaluation, and the referenceResponses field stores the expected ground truth responses against which the model’s performance will be measured. This format promotes a standardized, consistent approach to evaluating model outputs across different test scenarios.
The following example code shows the required fields:

{
    "conversationTurns": [{
        "referenceResponses": [{
            "content": [{
                "text": "This is a reference text"
            }]
        }],

        ## your prompt to the model
        "prompt": {
            "content": [{
                "text": "This is a prompt"
            }]
        }
    }]
}

For more details, see Creating a prompt dataset for Retrieve and generate evaluation jobs.
The following screenshot shows an Amazon Bedrock evaluation results sample dashboard.

After processing, the evaluation provides comprehensive insights, delivering both aggregate metrics and granular performance breakdowns for each individual metric. These detailed results include sample conversations that illustrate performance nuances. To derive maximum value, we recommend conducting a qualitative review, particularly focusing on conversations that received low scores across any metrics. This deep-dive analysis can help you understand the underlying factors contributing to poor performance and inform strategic improvements to your RAG workflow.
Building a comprehensive test dataset: Strategies and considerations
Creating a robust test dataset is crucial for meaningful evaluation. In this section, we discuss three primary approaches to dataset development.
Human-annotated data collection
Human annotation remains the gold standard for domain-specific, high-quality datasets. You can:

Use your organization’s proprietary documents and answers
Use open-source document collections like Clueweb (a 10-billion web document repository)
Employ professional data labeling services such as Amazon SageMaker Ground Truth
Use a crowdsourcing marketplace like Amazon Mechanical Turk for distributed annotation

Human data annotation is recommended for domain-specific, high-quality, and nuanced results. However, generating and maintaining large datasets using human annotators is a time-consuming and costly approach.
Synthetic data generation using LLMs
Synthetic data generation offers a more automated, potentially cost-effective alternative with two primary methodologies:

Self-instruct approach:

Iterative process using a single target model
Model generates multiple responses to queries
Provides continuous feedback and refinement

Knowledge distillation approach:

Uses multiple models
Generates responses based on preexisting model training
Enables faster dataset creation by using previously trained models

Synthetic data generation requires careful navigation of several key considerations. Organizations must typically secure End User License Agreements and might need access to multiple LLMs. Although the process demands minimal human expert validation, these strategic requirements underscore the complexity of generating synthetic datasets efficiently. This approach offers a streamlined alternative to traditional data annotation methods, balancing legal compliance with technical innovation.
Continuous dataset improvement: The feedback loop strategy
Develop a dynamic, iterative approach to dataset enhancement that transforms user interactions into valuable learning opportunities. Begin with your existing data as a foundational baseline, then implement a robust user feedback mechanism that systematically captures and evaluates real-world model interactions. Establish a structured process for reviewing and integrating flagged responses, treating each piece of feedback as a potential refinement point for your dataset. For an example of such a feedback loop implemented in AWS, refer to Improve LLM performance with human and AI feedback on Amazon SageMaker for Amazon Engineering.
This approach transforms dataset development from a static, one-time effort into a living, adaptive system. By continuously expanding and refining your dataset through user-driven insights, you create a self-improving mechanism that progressively enhances model performance and evaluation metrics. Remember: dataset evolution is not a destination, but an ongoing journey of incremental optimization.
When developing your test dataset, strive for a strategic balance that precisely represents the range of scenarios your users will encounter. The dataset should comprehensively span potential use cases and edge cases, while avoiding unnecessary repetition. Because each evaluation example incurs a cost, focus on creating a dataset that maximizes insights and performance understanding, selecting examples that reveal unique model behaviors rather than redundant iterations. The goal is to craft a targeted, efficient dataset that provides meaningful performance assessment without wasting resources on superfluous testing.
Performance improvement tools
Comprehensive evaluation metrics are more than just performance indicators—they’re a strategic roadmap for continuous improvement in your RAG pipeline. These metrics provide critical insights that transform abstract performance data into actionable intelligence, enabling you to do the following:

Diagnose specific pipeline weaknesses
Prioritize improvement efforts
Objectively assess knowledge base readiness
Make data-driven optimization decisions

By systematically analyzing your metrics, you can definitively answer key questions: Is your knowledge base robust enough for deployment? What specific components require refinement? Where should you focus your optimization efforts for maximum impact?
Think of metrics as a diagnostic tool that illuminates the path from current performance to exceptional AI system reliability. They don’t just measure—they guide, providing a clear, quantitative framework for strategic enhancement.
Although a truly comprehensive exploration of RAG pipeline optimization would require an extensive treatise, this post offers a systematic framework for transformative improvements across critical dimensions.
Data foundation and preprocessing
Data foundation and preprocessing consists of the following best practices:

Clean and preprocess source documents to improve quality, removing noise, standardizing formats, and maintaining data consistency
Augment training data with relevant external sources, expanding dataset diversity and coverage
Implement named entity recognition and linking to improve retrieval, enhancing semantic understanding and context identification
Use text summarization techniques to condense long documents, reducing complexity while preserving key information

Chunking strategies
Consider the following chunking strategies:

Use semantic chunking instead of fixed-size chunking to preserve context, maintaining meaningful information boundaries.
Explore various chunk sizes (128–1,024 characters), adapting to semantic text structure and preserving meaning through intelligent segmentation. For more details on Amazon Bedrock chunking strategies, see How content chunking works for knowledge bases, and the configuration sketch after this list.
Implement sliding window chunking with overlap, minimizing information loss between chunks, typically 10–20% overlap to provide contextual continuity.
Consider hierarchical chunking for long documents, capturing both local and global contextual nuances.
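
For instance, fixed-size chunking with a sliding-window overlap can be configured when you create the data source. The following is a minimal sketch under the assumption that you're using the boto3 bedrock-agent client; the knowledge base ID and bucket ARN are placeholders, and only the chunking-related fields are shown:

import boto3

bedrock_agent = boto3.client("bedrock-agent")

response = bedrock_agent.create_data_source(
    knowledgeBaseId="<knowledge_base_id>",
    name="my-data-source",
    dataSourceConfiguration={
        "type": "S3",
        "s3Configuration": {"bucketArn": "arn:aws:s3:::<bucket_name>"},
    },
    vectorIngestionConfiguration={
        "chunkingConfiguration": {
            "chunkingStrategy": "FIXED_SIZE",
            "fixedSizeChunkingConfiguration": {
                "maxTokens": 512,         # experiment with different chunk sizes
                "overlapPercentage": 20,  # sliding-window overlap between adjacent chunks
            },
        }
    },
)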

Embedding techniques
Embedding techniques include the following:

If your text contains multiple languages, you might want to try using the Cohere Embed (Multilingual) embedding model. This could improve semantic understanding and retrieval relevance.
Experiment with embedding dimensions, balancing performance and computational efficiency.
Implement sentence or paragraph embeddings, moving beyond word-level representations.

Retrieval optimization
Consider the following best practices for retrieval optimization:

Statically or dynamically adjust the number of retrieved chunks, optimizing information density. In your RetrieveAndGenerate (or Retrieve) request, modify "retrievalConfiguration": { "vectorSearchConfiguration": { "numberOfResults": NUMBER }} (see the sketch after this list).
Implement metadata filtering, adding contextual layers to chunk retrieval. For example, prioritizing recent information in time-sensitive scenarios. For code samples for metadata filtering using Amazon Bedrock Knowledge Bases, refer to the following GitHub repo.
Use hybrid search combining dense and sparse retrieval, blending semantic and keyword search approaches.
Apply reranking models to improve precision, reorganizing retrieved contexts by relevance.
Experiment with diverse similarity metrics, exploring beyond standard cosine similarity.
Implement query expansion techniques, transforming queries for more effective retrieval. One example is query decomposition, breaking complex queries into targeted sub-questions.
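
As an illustration of a few of these knobs, a RetrieveAndGenerate request might look like the following minimal sketch; the knowledge base ID and model ARN are placeholders, and numberOfResults and overrideSearchType correspond to the retrieved-chunk count and hybrid search options above:

import boto3

bedrock_agent_runtime = boto3.client("bedrock-agent-runtime")

response = bedrock_agent_runtime.retrieve_and_generate(
    input={"text": "How do I return a damaged item?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "<knowledge_base_id>",
            "modelArn": "<model_arn>",
            "retrievalConfiguration": {
                "vectorSearchConfiguration": {
                    "numberOfResults": 10,           # tune how many chunks are retrieved
                    "overrideSearchType": "HYBRID",  # blend semantic and keyword search
                }
            },
        },
    },
)
print(response["output"]["text"])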

The following screenshot shows these options on the Amazon Bedrock console.

Prompt engineering
After you select a model, you can edit the prompt template:

Design context-aware prompts, explicitly guiding models to use retrieved information
Implement few-shot prompting, using dynamic, query-matched examples
Create dynamic prompts based on query and documents, adapting instruction strategy contextually
Include explicit usage instructions for retrieved information, achieving faithful and precise response generation

The following screenshot shows an example of editing the prompt template on the Amazon Bedrock console.

Model selection and guardrails
When choosing your model and guardrails, consider the following:

Choose LLMs based on specific task requirements, aligning model capabilities with the use case
Fine-tune models on domain-specific data, enhancing specialized performance
Experiment with model sizes, balancing performance and computational efficiency
Consider specialized model configurations, using smaller models for retrieval and larger for generation
Implement contextual grounding checks, making sure responses remain true to provided information, such as contextual grounding with Amazon Bedrock Guardrails (see the following screenshot)
Explore advanced search paradigms, such as knowledge graph search (GraphRAG)

Navigating knowledge base improvements: Key considerations
When optimizing a RAG system, understanding your performance requirements is crucial. The acceptable performance bar depends entirely on your application’s context—whether it’s an internal tool, a system augmenting human workers, or a customer-facing service. A 0.95 metric score might be sufficient for some applications, where 1 in 20 answers could have minor inaccuracies, but potentially unacceptable for high-stakes scenarios. The key is to align your optimization efforts with the specific reliability and precision needs of your particular use case.
Another key is to prioritize refining the retrieval mechanism before addressing generation. Upstream performance directly influences downstream metrics, making retrieval optimization critical. Certain techniques, particularly chunking strategies, have nuanced impacts across both stages. For instance, increasing chunk size can improve retrieval efficiency by reducing search complexity, but simultaneously risks introducing irrelevant details that might compromise the generation’s correctness. This delicate balance requires careful, incremental adjustments to make sure both retrieval precision and response quality are systematically enhanced.
The following figure illustrates the aforementioned tools and how they relate to retrieval, generation, and both.

Diagnose the issue
When targeting a specific performance metric, adopt a forensic, human-centric approach to diagnosis. Treat your AI system like a colleague whose work requires thoughtful, constructive feedback. This includes the following steps:

Failure pattern identification:

Systematically map question types that consistently underperform
Identify specific characteristics triggering poor performance, such as:

List-based queries
Specialized vocabulary domains
Complex topic intersections

Contextual retrieval forensics:

Conduct granular chunk relevance analysis
Quantify irrelevant or incorrect retrieved contexts
Map precision distribution within the retrieved set (for example, the first 5 out of 15 chunks are relevant, the subsequent 10 are not)
Understand retrieval mechanism’s contextual discrimination capabilities

Ground truth comparative analysis:

Rigorously compare generated responses against reference answers
Diagnose potential ground truth limitations
Develop targeted improvement instructions—think about what specific guidance would enhance response accuracy, and which nuanced context might be missing
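To make the chunk-level forensics concrete, the following sketch computes a simple precision profile over a retrieved set given human relevance judgments; the labels and list sizes are illustrative.

```python
# Minimal sketch: precision profile of a retrieved chunk set, given
# human relevance labels (True = relevant). Inputs are illustrative.
def precision_profile(relevance_labels: list[bool]) -> list[float]:
    """Return precision@k for every k in the retrieved set."""
    profile = []
    hits = 0
    for k, is_relevant in enumerate(relevance_labels, start=1):
        hits += int(is_relevant)
        profile.append(hits / k)
    return profile

# Example: first 5 of 15 retrieved chunks are relevant, the remaining 10 are not.
labels = [True] * 5 + [False] * 10
print(precision_profile(labels)[4])   # precision@5  -> 1.0
print(precision_profile(labels)[14])  # precision@15 -> 0.333...
```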

Develop a strategic approach to improvement
When confronting complex RAG pipeline challenges, adopt a methodical, strategic approach that transforms performance optimization from a daunting task into a systematic journey of incremental enhancement.
The key is to identify tactics with direct, measurable impact on your specific target metric, concentrating on optimization points that offer the highest potential return on effort. This means carefully analyzing each potential strategy through the lens of its probable performance improvement, focusing on techniques that can deliver meaningful gains with minimal systemic disruption. The following figure illustrates which sets of techniques to prioritize when working to improve metrics.

Additionally, you should prioritize low-friction optimization tactics, such as configurable parameters in your knowledge base, or implementations that have minimal infrastructure disruption. It’s recommended to avoid full vector database reimplementation unless necessary.
You should take a lean approach—make your RAG pipeline improvement into a methodical, scientific process of continuous refinement. Embrace an approach of strategic incrementalism: make purposeful, targeted adjustments that are small enough to be precisely measured, yet meaningful enough to drive performance forward.
Each modification becomes an experimental intervention, rigorously tested to understand its specific impact. Implement a comprehensive version tracking system that captures not just the changes made, but the rationale behind each adjustment, the performance metrics before and after, and the insights gained.
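One lightweight way to capture this is a structured experiment log. The following sketch shows one possible record shape; the field names and metric values are illustrative assumptions, not a prescribed schema.

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
import json

# Minimal sketch of an experiment record for RAG pipeline changes.
# Field names and metric values are illustrative, not a prescribed schema.
@dataclass
class RagExperiment:
    change: str                         # what was modified
    rationale: str                      # why it was modified
    metrics_before: dict[str, float]
    metrics_after: dict[str, float]
    notes: str = ""
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

experiment = RagExperiment(
    change="Increased chunk size from 300 to 512 tokens",
    rationale="Retrieval recall was low on multi-paragraph answers",
    metrics_before={"context_relevance": 0.81, "correctness": 0.74},
    metrics_after={"context_relevance": 0.86, "correctness": 0.77},
    notes="Correctness gain smaller than retrieval gain; watch for noisy chunks",
)
with open("rag_experiments.jsonl", "a") as log:
    log.write(json.dumps(asdict(experiment)) + "\n")
```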
Lastly, approach performance evaluation with a holistic, empathetic methodology that transcends mere quantitative metrics. Treat the assessment process as a collaborative dialogue of growth and understanding, mirroring the nuanced approach you would take when coaching a talented team member. Instead of reducing performance to cold, numerical indicators, seek to uncover the underlying dynamics, contextual challenges, and potential for development. Recognize that meaningful evaluation goes beyond surface-level measurements, requiring deep insight into capabilities, limitations, and the unique context of performance.
Conclusion
Optimizing Amazon Bedrock Knowledge Bases for RAG is an iterative process that requires systematic testing and refinement. Success comes from methodically using techniques like prompt engineering and chunking to improve both the retrieval and generation stages of RAG. By tracking key metrics throughout this process, you can measure the impact of your optimizations and ensure they meet your application’s requirements.
To learn more about optimizing your Amazon Bedrock Knowledge Bases, see our guide on how to Evaluate the performance of Amazon Bedrock resources.

About the Authors
Clement Perrot is a Senior Solutions Architect and AI/ML Specialist at AWS, where he helps early-stage startups build and use AI on the AWS platform. Prior to AWS, Clement was an entrepreneur, whose last two AI and consumer hardware startups were acquired.
Miriam Lebowitz is a Solutions Architect focused on empowering early-stage startups at AWS. She uses her experience with AI/ML to guide companies to select and implement the right technologies for their business objectives, setting them up for scalable growth and innovation in the competitive startup world.
Tamil Sambasivam is a Solutions Architect and AI/ML Specialist at AWS. She helps enterprise customers solve their business problems by recommending the right AWS solutions. Her strong background in Information Technology (24+ years of experience) helps customers strategize, develop, and modernize their solutions in the AWS Cloud. In her spare time, Tamil likes to travel and garden.

Lyra: A Computationally Efficient Subquadratic Architecture for Biological Sequence Modeling

Deep learning architectures like CNNs and Transformers have significantly advanced biological sequence modeling by capturing local and long-range dependencies. However, their application in biological contexts is constrained by high computational demands and the need for large datasets. CNNs efficiently detect local sequence patterns with subquadratic scaling, whereas Transformers leverage self-attention to model global interactions but require quadratic scaling, making them computationally expensive. Hybrid models, such as Enformer, integrate CNNs and Transformers to balance local and global context modeling, but they still face scalability issues. Large-scale Transformer-based models, including AlphaFold2 and ESM3, have achieved breakthroughs in protein structure prediction and sequence-function modeling. Yet, their reliance on extensive parameter scaling limits their efficiency in biological systems where data availability is often restricted. This highlights the need for more computationally efficient approaches to model sequence-to-function relationships accurately.

To overcome these challenges, epistasis—the interaction between mutations within a sequence—provides a structured mathematical framework for biological sequence modeling. Multilinear polynomials can represent these interactions, offering a principled way to understand sequence-function relationships. State space models (SSMs) naturally align with this polynomial structure, using hidden dimensions to approximate epistatic effects. Unlike Transformers, SSMs utilize Fast Fourier Transform (FFT) convolutions to model global dependencies efficiently while maintaining subquadratic scaling. Additionally, integrating gated depthwise convolutions enhances local feature extraction and expressivity through adaptive feature selection. This hybrid approach balances computational efficiency with interpretability, making it a promising alternative to Transformer-based architectures for biological sequence modeling.
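The FFT convolution at the heart of this efficiency argument can be illustrated in a few lines. The following sketch applies a per-channel convolution in O(L log L) time using zero-padded FFTs; the random kernel stands in for an SSM-derived kernel and is purely illustrative.

```python
import torch

# Minimal sketch of the FFT convolution trick used by SSM layers: a length-L
# convolution computed in O(L log L) instead of O(L^2). The kernel here is a
# random placeholder standing in for an SSM-derived convolution kernel.
def fft_conv(u: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    L = u.shape[-1]
    n = 2 * L                                   # zero-pad to avoid circular wrap-around
    u_f = torch.fft.rfft(u, n=n)
    k_f = torch.fft.rfft(k, n=n)
    return torch.fft.irfft(u_f * k_f, n=n)[..., :L]

u = torch.randn(1, 8, 1024)   # (batch, channels, sequence length)
k = torch.randn(8, 1024)      # one kernel per channel
print(fft_conv(u, k).shape)   # torch.Size([1, 8, 1024])
```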

Researchers from institutions, including MIT, Harvard, and Carnegie Mellon, introduce Lyra, a subquadratic sequence modeling architecture designed for biological applications. Lyra integrates SSMs to capture long-range dependencies with projected gated convolutions for local feature extraction, enabling efficient O(N log N) scaling. It effectively models epistatic interactions and achieves state-of-the-art performance across over 100 biological tasks, including protein fitness prediction, RNA function analysis, and CRISPR guide design. Lyra operates with significantly fewer parameters—up to 120,000 times smaller than existing models—while being 64.18 times faster in inference, democratizing access to advanced biological sequence modeling.

Lyra consists of two key components: Projected Gated Convolution (PGC) blocks and a state-space layer with depthwise convolution (S4D). With approximately 55,000 parameters, the model includes two PGC blocks for capturing local dependencies, followed by an S4D layer for modeling long-range interactions. PGC processes input sequences by projecting them to intermediate dimensions, applying depthwise 1D convolutions and linear projections, and recombining features through element-wise multiplication. S4D leverages diagonal state-space models to compute convolution kernels using matrices A, B, and C, efficiently capturing sequence-wide dependencies through weighted exponential terms and enhancing Lyra’s ability to model biological data effectively.
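Read literally, that description suggests the following PyTorch-style sketch of a PGC block: project the input, split it into two branches, apply a depthwise 1D convolution on one branch, and recombine by element-wise multiplication before a final projection. This is an interpretation of the description above, not the authors' reference implementation, and the dimensions are illustrative.

```python
import torch
import torch.nn as nn

# Sketch of a Projected Gated Convolution (PGC) block as described above:
# project to an intermediate dimension, apply a depthwise 1D convolution and a
# linear projection on parallel branches, then recombine by element-wise
# multiplication. Interpretation of the paper's description; sizes are illustrative.
class PGCBlock(nn.Module):
    def __init__(self, dim: int, expand: int = 2, kernel_size: int = 3):
        super().__init__()
        inner = dim * expand
        self.in_proj = nn.Linear(dim, 2 * inner)        # split into two branches
        self.dw_conv = nn.Conv1d(
            inner, inner, kernel_size,
            padding=kernel_size // 2, groups=inner,      # depthwise convolution
        )
        self.out_proj = nn.Linear(inner, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, length, dim)
        u, v = self.in_proj(x).chunk(2, dim=-1)
        u = self.dw_conv(u.transpose(1, 2)).transpose(1, 2)  # local features
        return self.out_proj(u * v)                          # gating by multiplication

x = torch.randn(1, 128, 64)          # (batch, sequence length, channels)
print(PGCBlock(dim=64)(x).shape)     # torch.Size([1, 128, 64])
```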

Lyra is a sequence modeling architecture designed to capture local and long-range dependencies in biological sequences efficiently. It integrates PGCs for localized modeling and diagonalized S4D for global interactions. Lyra approximates complex epistatic interactions using polynomial expressivity, outperforming Transformer-based models in tasks like protein fitness landscape prediction and deep mutational scanning. It achieves state-of-the-art accuracy across various protein and nucleic acid modeling applications, including disorder prediction, mutation impact analysis, and RNA-dependent RNA polymerase detection, while maintaining a significantly smaller parameter count and lower computational cost than existing large-scale models.

In conclusion, Lyra introduces a subquadratic architecture for biological sequence modeling, leveraging SSMs to approximate multilinear polynomial functions efficiently. This enables superior modeling of epistatic interactions while significantly reducing computational demands. By integrating PGCs for local feature extraction, Lyra achieves state-of-the-art performance across over 100 biological tasks, including protein fitness prediction, RNA analysis, and CRISPR guide design. It outperforms large foundation models with far fewer parameters and faster inference, requiring only one or two GPUs for training within hours. Lyra's efficiency democratizes access to advanced biological modeling for applications in therapeutics, pathogen surveillance, and biomanufacturing.

Check out the Paper. All credit for this research goes to the researchers of this project.

The post Lyra: A Computationally Efficient Subquadratic Architecture for Biological Sequence Modeling appeared first on MarkTechPost.

SuperBPE: Advancing Language Models with Cross-Word Tokenization

Language models (LMs) face a fundamental challenge in how they perceive textual data through tokenization. Current subword tokenizers segment text into vocabulary tokens that cannot bridge whitespace, adhering to an artificial constraint that treats space as a semantic boundary. This practice ignores the reality that meaning often crosses word boundaries – multi-word expressions like “a lot of” function as single semantic units, with English speakers mentally storing thousands of such phrases. Cross-linguistically, the same concepts may be expressed as single or multiple words, depending on the language. Notably, some languages like Chinese and Japanese use no whitespace, allowing tokens to span multiple words or sentences without apparent performance degradation.

Previous research has explored several approaches beyond traditional subword tokenization. Some studies investigated processing text at multiple granularity levels or creating multi-word tokens through frequency-based n-gram identification. Other researchers have explored multi-token prediction (MTP), allowing language models to predict multiple tokens in a single step, which confirms models’ capability to process more than one subword simultaneously. However, these approaches require architectural modifications and fix the number of tokens predicted per step. Some researchers have pursued tokenizer-free approaches, modeling text directly as byte sequences. However, this significantly increases sequence lengths and computational requirements, leading to complex architectural solutions.

Researchers from the University of Washington, NVIDIA, and the Allen Institute for AI have proposed SuperBPE, a tokenization algorithm that creates a vocabulary containing both traditional subword tokens and innovative “superword” tokens that span multiple words. This approach enhances the popular byte-pair encoding (BPE) algorithm through a pretokenization curriculum: whitespace boundaries are initially maintained to learn subword tokens, then removed to allow superword token formation. While standard BPE quickly reaches diminishing returns and begins using increasingly rare subwords as vocabulary size grows, SuperBPE continues discovering common multi-word sequences to encode as single tokens, improving encoding efficiency.

SuperBPE operates through a two-stage training process that modifies the pretokenization step of traditional BPE. This approach intuitively builds semantic units first and then combines them into common sequences for greater efficiency. Setting t = T (where t is the transition point and T is the target vocabulary size) reproduces standard BPE, while t = 0 yields naive whitespace-free BPE. Training SuperBPE requires more computational resources than standard BPE because, without whitespace pretokenization, the training data consists of extremely long “words” with minimal deduplication. However, this additional training cost amounts to a few hours on 100 CPUs and is incurred only once, which is negligible compared to the resources required for language model pretraining.
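The curriculum itself can be illustrated with a toy, self-contained implementation. The sketch below learns merges within words up to a transition point, then continues on whole lines so that later merges can cross spaces; it is an illustration of the idea, not the authors' implementation, and the corpus and merge counts are arbitrary.

```python
from collections import Counter

# Toy illustration of the SuperBPE curriculum: learn BPE merges within words
# (whitespace pretokenization) up to a transition point, then continue on whole
# lines so later merges can form multi-word "superword" tokens.
# Not the authors' implementation; corpus and counts are arbitrary.
def get_pairs(seqs):
    pairs = Counter()
    for seq, freq in seqs.items():
        for a, b in zip(seq, seq[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(seqs, pair):
    merged = {}
    for seq, freq in seqs.items():
        out, i = [], 0
        while i < len(seq):
            if i < len(seq) - 1 and (seq[i], seq[i + 1]) == pair:
                out.append(seq[i] + seq[i + 1])
                i += 2
            else:
                out.append(seq[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

def train_superbpe(corpus, transition_point, target_merges):
    # Stage 1: whitespace pretokenization, so merges cannot cross spaces.
    seqs = Counter(tuple(w) for line in corpus for w in line.split())
    merges = []
    while len(merges) < transition_point:
        pairs = get_pairs(seqs)
        if not pairs:
            break
        best = pairs.most_common(1)[0][0]
        seqs = merge_pair(seqs, best)
        merges.append(best)

    # Stage 2: treat whole lines (spaces included) as sequences, so later
    # merges can cross spaces and form superword tokens.
    seqs = Counter(tuple(line) for line in corpus)
    for m in merges:                      # replay stage-1 merges on full lines
        seqs = merge_pair(seqs, m)
    while len(merges) < target_merges:
        pairs = get_pairs(seqs)
        if not pairs:
            break
        best = pairs.most_common(1)[0][0]
        seqs = merge_pair(seqs, best)
        merges.append(best)
    return merges

corpus = ["a lot of people", "a lot of time", "a lot of effort"] * 5
print(train_superbpe(corpus, transition_point=4, target_merges=8))
```

Setting transition_point equal to target_merges reproduces standard BPE in this toy, while a transition point of zero skips the whitespace-constrained stage entirely, mirroring the t = T and t = 0 cases described above.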

SuperBPE shows impressive performance across 30 benchmarks spanning knowledge, reasoning, coding, reading comprehension, and more. All SuperBPE models outperform the BPE baseline, with the strongest 8B model achieving an average improvement of 4.0% and surpassing the baseline on 25 out of 30 individual tasks. Multiple-choice tasks show substantial gains, with a +9.7% improvement. The only statistically significant underperformance occurs in the LAMBADA task, where SuperBPE experiences a final accuracy drop from 75.8% to 70.6%. Moreover, all reasonable transition points yield stronger results than the baseline. The most encoding-efficient transition point delivers a +3.1% performance improvement while reducing inference compute by 35%.

In conclusion, researchers introduced SuperBPE, a more effective tokenization approach developed by enhancing the standard BPE algorithm to incorporate superword tokens. Despite tokenization serving as the fundamental interface between language models and text, tokenization algorithms have remained relatively static. SuperBPE challenges this status quo by recognizing that tokens can extend beyond traditional subword boundaries to include multi-word expressions. SuperBPE tokenizers enable language models to achieve superior performance across numerous downstream tasks while reducing inference computational costs. These advantages require no modifications to the underlying model architecture, making SuperBPE a seamless replacement for traditional BPE in modern language model development pipelines.

Check out the Paper and Project Page. All credit for this research goes to the researchers of this project.

The post SuperBPE: Advancing Language Models with Cross-Word Tokenization appeared first on MarkTechPost.