A Code Implementation for Advanced Human Pose Estimation Using MediaPipe, OpenCV and Matplotlib

Human pose estimation is a computer vision technique that transforms visual data into actionable insights about human movement. By utilizing machine learning models like MediaPipe’s BlazePose together with libraries such as OpenCV, developers can track body keypoints with high accuracy. In this tutorial, we explore the integration of these tools, demonstrating how Python-based frameworks enable sophisticated pose detection across domains ranging from sports analytics to healthcare monitoring and interactive applications.

First, we install the essential libraries:

!pip install mediapipe opencv-python-headless matplotlib

Then, we import the important libraries needed for our implementation:

import cv2
import mediapipe as mp
import matplotlib.pyplot as plt
import numpy as np

We initialize the MediaPipe Pose model in static image mode with model complexity 1, segmentation enabled, and a minimum detection confidence of 0.5. We also load MediaPipe’s utilities for drawing landmarks and applying drawing styles.

mp_pose = mp.solutions.pose
mp_drawing = mp.solutions.drawing_utils
mp_drawing_styles = mp.solutions.drawing_styles

pose = mp_pose.Pose(
    static_image_mode=True,
    model_complexity=1,
    enable_segmentation=True,
    min_detection_confidence=0.5
)

Here, we define the detect_pose function, which reads an image, processes it to detect human pose landmarks using MediaPipe, and returns the annotated image along with the detected landmarks. If landmarks are found, they are drawn using default styling.

def detect_pose(image_path):
    image = cv2.imread(image_path)
    image_rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

    results = pose.process(image_rgb)

    annotated_image = image_rgb.copy()
    if results.pose_landmarks:
        mp_drawing.draw_landmarks(
            annotated_image,
            results.pose_landmarks,
            mp_pose.POSE_CONNECTIONS,
            landmark_drawing_spec=mp_drawing_styles.get_default_pose_landmarks_style()
        )

    return annotated_image, results.pose_landmarks

We define the visualize_pose function, which displays the original and pose-annotated images side by side using matplotlib. The extract_keypoints function converts detected pose landmarks into a dictionary of named keypoints with their x, y, z coordinates and visibility scores.

def visualize_pose(original_image, annotated_image):
    plt.figure(figsize=(16, 8))

    plt.subplot(1, 2, 1)
    plt.title('Original Image')
    plt.imshow(cv2.cvtColor(original_image, cv2.COLOR_BGR2RGB))
    plt.axis('off')

    plt.subplot(1, 2, 2)
    plt.title('Pose Estimation')
    plt.imshow(annotated_image)
    plt.axis('off')

    plt.tight_layout()
    plt.show()

def extract_keypoints(landmarks):
    if landmarks:
        keypoints = {}
        for idx, landmark in enumerate(landmarks.landmark):
            keypoints[mp_pose.PoseLandmark(idx).name] = {
                'x': landmark.x,
                'y': landmark.y,
                'z': landmark.z,
                'visibility': landmark.visibility
            }
        return keypoints
    return None

Finally, we load an image from the specified path, detect and visualize human pose landmarks using MediaPipe, and then extract and print the coordinates and visibility of each detected keypoint.

image_path = '/content/Screenshot 2025-03-26 at 12.56.05 AM.png'
original_image = cv2.imread(image_path)
annotated_image, landmarks = detect_pose(image_path)

visualize_pose(original_image, annotated_image)

keypoints = extract_keypoints(landmarks)
if keypoints:
    print("Detected Keypoints:")
    for name, details in keypoints.items():
        print(f"{name}: {details}")

Sample Processed Output

In this tutorial, we explored human pose estimation using MediaPipe and OpenCV, demonstrating a comprehensive approach to body keypoint detection. We implemented a robust pipeline that transforms images into detailed skeletal maps, covering key steps including library installation, pose detection function creation, visualization techniques, and keypoint extraction. Using advanced machine learning models, we showcased how developers can transform raw visual data into meaningful movement insights across various domains like sports analytics and healthcare monitoring.

Here is the Colab Notebook.


RWKV-7: Advancing Recurrent Neural Networks for Efficient Sequence Modeling

Autoregressive Transformers have become the leading approach for sequence modeling due to their strong in-context learning and parallelizable training enabled by softmax attention. However, softmax attention has quadratic complexity in sequence length, leading to high computational and memory demands, especially for long sequences. While GPU optimizations mitigate this for short sequences, inference remains costly at scale. Researchers have explored recurrent architectures with compressive states that offer linear complexity and constant memory use to address this. Advances in linear attention and state-space models (SSMs) have shown promise, with RNN-based approaches like RWKV-4 achieving competitive performance while significantly lowering inference costs.

Researchers from multiple institutions, including the RWKV Project, EleutherAI, Tsinghua University, and others, introduce RWKV-7 “Goose,” a novel sequence modeling architecture that establishes new state-of-the-art (SoTA) performance at the 3 billion parameter scale for multilingual tasks. Despite being trained on significantly fewer tokens than competing models, RWKV-7 achieves comparable English language performance while maintaining constant memory usage and inference time per token. The architecture extends the delta rule by incorporating vector-valued state gating, adaptive in-context learning rates, and a refined value replacement mechanism. These improvements enhance expressivity, enable efficient state tracking, and allow recognition of all regular languages, exceeding the theoretical capabilities of Transformers under standard complexity assumptions. To support its development, the researchers release an extensive 3.1 trillion-token multilingual corpus, alongside multiple pre-trained RWKV-7 models ranging from 0.19 to 2.9 billion parameters, all available under an open-source Apache 2.0 license.

RWKV-7 introduces key innovations layered on the RWKV-6 architecture, including token-shift, bonus mechanisms, and a ReLU² feedforward network. The model’s training corpus, RWKV World v3, enhances its English, code, and multilingual capabilities. In addition to releasing trained models, the team provides proof that RWKV-7 can solve problems beyond TC⁰ complexity, including S₅ state tracking and regular language recognition. This demonstrates its ability to handle computationally complex tasks more efficiently than Transformers. Furthermore, the researchers propose a cost-effective method to upgrade the RWKV architecture without full retraining, facilitating incremental improvements. The development of larger datasets and models will continue under open-source licensing, ensuring broad accessibility and reproducibility.

The RWKV-7 model employs a structured approach to sequence modeling, denoting model dimensions as D and using trainable matrices for computations. It introduces vector-valued state gating, in-context learning rates, and a refined delta rule formulation. The time-mixing process involves weight preparation using low-rank MLPs, with key components like replacement keys, decay factors, and learning rates designed for efficient state evolution. A weighted key-value (WKV) mechanism facilitates dynamic state transitions, approximating a forget gate. Additionally, RWKV-7 enhances expressivity through per-channel modifications and a two-layer MLP, improving computational stability and efficiency while preserving state-tracking capabilities.
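To make the state-evolution idea concrete, the following is a LaTeX schematic of a delta-rule-style recurrence with per-channel decay and a data-dependent learning rate; it is an illustrative simplification rather than RWKV-7’s exact published parameterization (S_t is the matrix-valued state, k_t and v_t are key and value vectors, w_t is a vector-valued decay, eta_t is the in-context learning rate, and r_t is the receptance used for the readout o_t):

% Schematic only: a classic delta rule augmented with vector-valued decay
S_t = S_{t-1}\,\mathrm{diag}(w_t) + \eta_t \left( v_t - S_{t-1} k_t \right) k_t^{\top}, \qquad o_t = S_t\, r_t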

RWKV-7 models were assessed using the LM Evaluation Harness on various English and multilingual benchmarks, demonstrating competitive performance with state-of-the-art models while utilizing fewer training tokens. Notably, RWKV-7 outperformed its predecessor in MMLU and significantly improved multilingual tasks. Additionally, evaluations of recent internet data confirmed its effectiveness in handling information. The model excelled in associative recall, mechanistic architecture design, and long-context retention. Despite constraints in training resources, RWKV-7 demonstrated superior efficiency, achieving strong benchmark results while requiring fewer FLOPs than leading transformer models.

In conclusion, RWKV-7 is an RNN-based architecture that achieves state-of-the-art results across multiple benchmarks while requiring significantly fewer training tokens. It maintains high parameter efficiency, linear time complexity, and constant memory usage, making it a strong alternative to Transformers. However, it faces limitations such as numerical precision sensitivity, lack of instruction tuning, prompt sensitivity, and restricted computational resources. Future improvements include optimizing speed, incorporating chain-of-thought reasoning, and scaling with larger datasets. The RWKV-7 models and training code are openly available under the Apache 2.0 License to encourage research and development in efficient sequence modeling.

Check out the Paper. All credit for this research goes to the researchers of this project.


Amazon Bedrock launches Session Management APIs for generative AI applications

Amazon Bedrock announces the preview launch of Session Management APIs, a new capability that enables developers to simplify state and context management for generative AI applications built with popular open source frameworks such as LangGraph and LlamaIndex. Session Management APIs provide an out-of-the-box solution that enables developers to securely manage state and conversation context across multi-step generative AI workflows, alleviating the need to build, maintain, or scale custom backend solutions. In this post, we discuss the new Session Management APIs and how to handle session state in your generative AI applications.
By preserving session state between interactions, Session Management APIs enhance workflow continuity, enabling generative AI applications, such as virtual assistants and multi-agent research workflows, that require persistent context across extended interactions. Developers can use this capability to checkpoint workflow stages, save intermediate states, and resume tasks from points of failure or interruption. Additionally, they can pause and replay sessions and use detailed traces to debug and enhance their generative AI applications. By treating sessions as a first-class resource, this capability enables developers to enforce granular access control through AWS Identity and Access Management (IAM) and encrypt data using AWS Key Management Service (AWS KMS), making sure that data from different user sessions is securely isolated and supporting multi-tenant applications with strong privacy protections.
Building generative AI applications requires more than model API calls. Your applications must handle conversation history, user preferences, state tracking, and contextual shifts. As these applications grow in complexity, robust state management becomes crucial. Key reasons include:

Contextual coherence – Maintaining state makes sure that the application can track the flow of information, leading to more coherent and contextually relevant outputs.
User interaction tracking – In interactive applications, state management allows the system to remember user inputs and preferences, facilitating personalized experiences.
Resource optimization – Efficient state management helps in allocating computational resources effectively, making sure that the application runs smoothly without unnecessary redundancy.
Error handling and recovery – Developers can use this capability to checkpoint workflow stages, save intermediate states, and resume tasks from points of failure or interruption.

Background
State persistence in generative AI applications refers to the ability to maintain and recall information across multiple interactions. This is crucial for creating coherent and contextually relevant experiences. Some of the information that you might need to persist includes:

User information – Basic details about the user, such as ID, preferences, or history
Conversation history – A record of previous interactions within the current session
Context markers – Indicators of the current topic, intent, or stage in a multi-turn conversation
Application state – The current status of ongoing processes or workflows
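As a purely illustrative sketch (the field names below are hypothetical, not a Bedrock schema), the persisted state for a single session might look like the following:

# Hypothetical session state; illustrates the categories above, not an API schema
session_state = {
    "user": {"id": "user-123", "preferences": {"language": "en"}},
    "conversation_history": [
        {"role": "user", "content": "Find me trail running shoes"},
        {"role": "assistant", "content": "Sure, what size do you wear?"},
    ],
    "context_markers": {"topic": "shoe_search", "stage": "gathering_requirements"},
    "application_state": {"cart": [], "pending_workflow": None},
}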

Effective use of session attributes enables personalization by tailoring responses based on the ongoing conversation, continuity by allowing conversations to pick up where they left off even after interruptions, and complex task handling by managing multi-step processes or decision trees effectively. These capabilities enhance the user experience and the overall functionality of generative AI applications.
Challenges
Implementing robust state management in generative AI applications presents several interconnected challenges. The system must handle state persistence and retrieval in milliseconds to maintain fluid conversations. As traffic grows and contextual data expands, state management also needs to efficiently scale.
When you build your own state management system, you need to implement backend services and infrastructure that handle persistence, checkpointing, and retrieval operations. For this post, we consider LangGraph to discuss the concepts of short-term memory and available options. Short-term memory stores information within a single conversation thread, which is managed as part of the agent’s state and persisted using thread-scoped checkpoints. You can persist short-term memory in a database like PostgreSQL using either a synchronous or asynchronous connection. However, you need to set up the infrastructure, implement data governance, and enable security and monitoring.
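For comparison, the following is a minimal sketch of that do-it-yourself approach using LangGraph’s PostgreSQL checkpointer; it assumes the langgraph-checkpoint-postgres package, a reachable PostgreSQL instance with placeholder credentials, and a graph_builder constructed the same way as later in this post:

# DIY short-term memory: you own the database plus its scaling, security, and monitoring
from langgraph.checkpoint.postgres import PostgresSaver

DB_URI = "postgresql://user:password@localhost:5432/checkpoints"  # placeholder connection string

with PostgresSaver.from_conn_string(DB_URI) as checkpointer:
    checkpointer.setup()  # create the checkpoint tables on first run
    graph = graph_builder.compile(checkpointer=checkpointer)
    config = {"configurable": {"thread_id": "my-thread-1"}}
    graph.invoke({"messages": [("user", "Hello")]}, config)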
Solution overview
The Session Management APIs in Amazon Bedrock offer a comprehensive solution that streamlines the development and deployment of generative AI applications by alleviating the need for custom infrastructure setup and maintenance. This capability not only minimizes the complexities of handling data persistence, retrieval, and checkpointing, but also provides enterprise-grade security features with built-in tenant isolation capabilities. You can offload the heavy lifting of managing state and context of your DIY generative AI solutions to Session Management APIs, while still using your preferred OSS tool. This will accelerate your path to deploy secure and scalable generative AI solutions.
The Session Management APIs also support human-in-the-loop scenarios, where manual intervention is required within automated workflows. Additionally, it provides comprehensive debugging and traceability features, maintaining detailed execution logs for troubleshooting and compliance purposes. The ability to quickly retrieve and analyze session data empowers developers to optimize their applications based on actual usage patterns and performance metrics.
To understand how Session Management APIs integrate with LangGraph applications, let’s look at the following high-level flow.

Example use case
To demonstrate the power and simplicity of Session Management APIs, let’s walk through a practical example of building a shoe shopping assistant. We will show how BedrockSessionSaver provides a custom checkpointing solution backed by the Session Management APIs. The complete code for this example is available in the AWS Samples GitHub repository.
First, let’s understand how Session Management APIs work with our application, as illustrated in the following diagram.

This process flow shows how each user interaction creates a new invocation in the session, maintains conversation context, and automatically persists state while the LangGraph application focuses on business logic. The seamless integration between these components enables sophisticated, stateful conversations without the complexity of managing infrastructure for state and context persistence.
Prerequisites
To follow along with this post, you need an AWS account with the appropriate permissions.
Set up the environment
We use the following code to set up the environment:

%pip install -U langgraph_checkpoint_aws

import boto3
from langgraph_checkpoint_aws.saver import BedrockSessionSaver

# Configure Bedrock client
bedrock_client = boto3.client("bedrock-runtime", region_name="<aws_region>")

Initialize the model
For our large language model (LLM), we use Anthropic’s Claude 3 Sonnet on Amazon Bedrock:

from langchain_aws import ChatBedrockConverse
llm = ChatBedrockConverse(
    model="anthropic.claude-3-sonnet-20240229-v1:0",
    temperature=0,
    max_tokens=None,
    client=bedrock_client,
)

Implement tools
Our assistant needs tools to search the product database and manage the shopping cart. These tools can use the information saved in the user session:

from langchain_core.tools import tool

@tool
def search_shoes(preference: str) -> list:
    """Search for shoes based on user preferences and interests."""
    # Placeholder: query your product catalog here and return matching shoes
    return []

Set up Session Management APIs
We use the following code to integrate the Session Management APIs:

# Initialize session saver
session_saver = BedrockSessionSaver(
    region_name="<aws_region>",
)

# Compile graph with session management
graph = graph_builder.compile(checkpointer=session_saver)

# Create a new session
session_id = session_saver.session_client.client.create_session()["sessionId"]

Run the conversation
Now we can run our stateful conversation:

from langchain_core.messages import BaseMessage

config = {"configurable": {"thread_id": session_id}}

while True:
    user_input = input("User: ")
    if user_input.lower() in ["quit", "exit", "q"]:
        print("Goodbye!")
        break
    for event in graph.stream(
        {"messages": [("user", user_input)]},
        config
    ):
        for value in event.values():
            if isinstance(value["messages"][-1], BaseMessage):
                print("Assistant:", value["messages"][-1].content)

Access session history
You can quickly retrieve the conversation history using the graph instance:

for i in graph.get_state_history(config, limit=5):
    print(i)

Although it’s simple to access data using BedrockSessionSaver in LangGraph, there might be instances where you need to access session data directly—whether for auditing purposes or external processing. The Session Management APIs provide this functionality, though it’s important to note that the retrieved data is in serialized format. To work with this data meaningfully, you need to perform deserialization first:

# List all invocation steps (client: the boto3 client that exposes the Session Management APIs)
steps = client.list_invocation_steps(
    sessionIdentifier=session_id,
)

# Get specific step details
step_details = client.get_invocation_step(
    sessionIdentifier=session_id,
    invocationIdentifier="your-invocation-id",
    invocationStepId="your-step-id",
)

Replay and fork actions
You might want to analyze the steps to understand the reasoning, debug, or try out different paths. You can invoke the graph with a checkpoint to replay specific actions from that point:

config_replay = {
    "configurable": {
        "thread_id": session_id,
        "checkpoint_id": "<checkpoint_id>",
    }
}
for event in graph.stream(None, config_replay, stream_mode="values"):
    print(event)

The graph replays previously executed steps before the provided checkpoint_id and executes the steps after checkpoint_id.
You can also try forking to revisit an agent’s past actions and explore alternative paths within the graph:

config = {
    "configurable": {
        "thread_id": session_id,
        "checkpoint_id": "<checkpoint_id>",
    }
}
graph.update_state(config, {"state": "updated state"})

Human-in-the-loop
Human-in-the-loop (HITL) interaction patterns allow the graph to stop at specific steps and seek human approval before proceeding. This is important if you have to review specific tool calls. In LangGraph, breakpoints are built on checkpoints, which save the graph’s state after each node execution. You can use the Session Management APIs to effectively implement HITL in your graph.
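The following is a minimal sketch of that pattern; the node name "tools" is illustrative and assumes your graph routes tool calls through a node with that name:

# Pause before the "tools" node so a human can approve tool calls;
# the state is checkpointed in the Bedrock session, so execution can resume later
graph = graph_builder.compile(
    checkpointer=session_saver,
    interrupt_before=["tools"],  # hypothetical node name
)

config = {"configurable": {"thread_id": session_id}}
for event in graph.stream({"messages": [("user", "Order the blue sneakers")]}, config):
    print(event)  # runs until the breakpoint before "tools"

# ...after human review and approval, resume from the saved checkpoint
for event in graph.stream(None, config):
    print(event)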
This example demonstrates how Session Management APIs seamlessly integrate with LangGraph to create a stateful conversation that maintains context across interactions. The Session Management APIs handle the complexity of state persistence, allowing you to focus on building the conversation logic.
The complete code is available in the AWS Samples GitHub repository. Feel free to clone it and experiment with your own modifications.
Clean up
To avoid incurring ongoing charges, clean up the resources you created as part of this solution.
Considerations and best practices
When implementing the Session Management APIs, consider these key practices for optimal results:

Session lifecycle management – Plan your session lifecycles carefully, from creation to termination. Initialize sessions using CreateSession at the start of conversations and properly close them with EndSession when complete (see the sketch after this list). This approach promotes efficient resource utilization and maintains clean state boundaries between interactions.
Security and compliance – For applications handling sensitive information, implement appropriate data protection measures using the Session Management APIs’ built-in security features. By default, AWS managed keys are used for session encryption. For additional security requirements, you can encrypt session data with a customer managed key. Use the service’s data retention and deletion capabilities to maintain compliance with relevant regulations while maintaining proper data governance.
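A minimal lifecycle sketch with boto3 might look like the following; the client name and the end_session parameter are assumptions inferred from the snake_case operations used earlier in this post, so verify them against the API reference for your SDK version:

import boto3

# Assumption: the Session Management APIs are exposed on the Bedrock agent runtime client
client = boto3.client("bedrock-agent-runtime", region_name="<aws_region>")

# Create a session at the start of a conversation
session_id = client.create_session()["sessionId"]

# ...run your conversation or LangGraph workflow...

# Close the session when the conversation is complete
client.end_session(sessionIdentifier=session_id)  # parameter name assumed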

Conclusion
The Session Management APIs in Amazon Bedrock offer a powerful solution for handling state in generative AI applications. By using this fully managed capability, developers can focus on creating innovative AI experiences without getting caught up in the complexities of infrastructure management. The seamless integration with LangGraph enhances its utility, allowing for rapid development and deployment of sophisticated, stateful AI applications.
As the field of generative AI continues to evolve, robust state management will become increasingly crucial. The Session Management APIs provide the scalability, security, and flexibility needed to help meet these growing demands, enabling developers to build more contextually aware, personalized, and reliable AI-powered applications.
By adopting the Session Management APIs, developers can accelerate their path to production, provide better user experiences through consistent and coherent interactions, and focus their efforts on the unique value propositions of their AI applications rather than the underlying infrastructure challenges.
Try out the Session Management APIs for your own use case, and share your feedback in the comments.

About the authors
Jagdeep Singh Soni is a Senior Partner Solutions Architect at AWS based in the Netherlands. He uses his passion for Generative AI to help customers and partners build GenAI applications using AWS services. Jagdeep has 15 years of experience in innovation, experience engineering, digital transformation, cloud architecture and ML applications.
Ishan Singh is a Generative AI Data Scientist at Amazon Web Services, where he helps customers build innovative and responsible generative AI solutions and products. With a strong background in AI/ML, Ishan specializes in building Generative AI solutions that drive business value. Outside of work, he enjoys playing volleyball, exploring local bike trails, and spending time with his wife and dog, Beau.
Rupinder Grewal is a Tech Lead Gen AI Specialist. He enjoys playing tennis and biking on mountain trails.
Krishna Gourishetti is a Senior Software Engineer for the Bedrock Agents team in AWS. He is passionate about building scalable software solutions that solve customer problems. In his free time, Krishna loves to go on hikes.
Aniketh Manjunath is a Software Development Engineer at Amazon Bedrock. He is passionate about distributed machine learning systems. Outside of work, he enjoys hiking, watching movies, and playing cricket.
Sarthak Handa serves as a Principal Product Manager at Amazon Web Services (AWS) AI/ML in Seattle, Washington, where his primary focus is on developing AI services that facilitate advancements in the healthcare industry. Prior to his work at AWS, Sarthak spent several years as a startup founder, building technology solutions for the healthcare and disaster relief sectors.

Enhance deployment guardrails with inference component rolling updates …

Deploying models efficiently, reliably, and cost-effectively is a critical challenge for organizations of all sizes. As organizations increasingly deploy foundation models (FMs) and other machine learning (ML) models to production, they face challenges related to resource utilization, cost-efficiency, and maintaining high availability during updates. Amazon SageMaker AI introduced inference component functionality that can help organizations reduce model deployment costs by optimizing resource utilization through intelligent model packing and scaling. Inference components abstract ML models and enable assigning dedicated resources and specific scaling policies per model.
However, updating these models—especially in production environments with strict latency SLAs—has historically risked downtime or resource bottlenecks. Traditional blue/green deployments often struggle with capacity constraints, making updates unpredictable for GPU-heavy models. To address this, we’re excited to announce another powerful enhancement to SageMaker AI: rolling updates for inference component endpoints, a feature designed to streamline updates for models of different sizes while minimizing operational overhead.
In this post, we discuss the challenges faced by organizations when updating models in production. Then we deep dive into the new rolling update feature for inference components and provide practical examples using DeepSeek distilled models to demonstrate this feature. Finally, we explore how to set up rolling updates in different scenarios.
Challenges with blue/green deployment
Traditionally, SageMaker AI inference has supported the blue/green deployment pattern for updating inference components in production. Though effective for many scenarios, this approach comes with specific challenges:

Resource inefficiency – Blue/Green deployment requires provisioning resources for both the current (blue) and new (green) environments simultaneously. For inference components running on expensive GPU instances like P4d or G5, this means potentially doubling the resource requirements during deployments. Consider an example where a customer has 10 copies of an inference component spread across 5 ml.p4d.24xlarge instances, all operating at full capacity. With blue/green deployment, SageMaker AI would need to provision five additional ml.p4d.24xlarge instances to host the new version of the inference component before switching traffic and decommissioning the old instances.
Limited computing resources – For customers using powerful GPU instances like the P or G series, the required capacity might not be available in a given Availability Zone or Region. This often results in instance capacity exceptions during deployments, causing update failures and rollbacks.
All-or-nothing transitions – Traditional blue/green deployments shift all traffic at one time or based on a configured schedule. This leaves limited room for gradual validation and increases the area of effect if issues arise with the new deployment.

Although blue/green deployment has been a reliable strategy for zero-downtime updates, its limitations become glaring when deploying large-scale large language models (LLMs) or high-throughput models on premium GPU instances. These challenges demand a more nuanced approach—one that incrementally validates updates while optimizing resource usage. Rolling updates for inference components are designed to eliminate the rigidity of blue/green deployments. By updating models in controlled batches, dynamically scaling infrastructure, and integrating real-time safety checks, this strategy makes sure deployments remain cost-effective, reliable, and adaptable—even for GPU-heavy workloads.
Rolling deployment for inference component updates
As mentioned earlier, inference components are introduced as a SageMaker AI feature to optimize costs; they allow you to define and deploy the specific resources needed for your model inference workload. By right-sizing compute resources to match your model’s requirements, you can save costs during updates compared to traditional deployment approaches.
With rolling updates, SageMaker AI deploys new model versions in configurable batches of inference components while dynamically scaling instances. This is particularly impactful for LLMs:

Batch size flexibility – When updating the inference components in a SageMaker AI endpoint, you can specify the batch size for each rolling step. For each step, SageMaker AI provisions capacity based on the specified batch size on the new endpoint fleet, routes traffic to that fleet, and stops capacity on the old endpoint fleet. Smaller models like DeepSeek Distilled Llama 8B can use larger batches for rapid updates, and larger models like DeepSeek Distilled Llama 70B use smaller batches to limit GPU contention.
Automated safety guards – Integrated Amazon CloudWatch alarms monitor metrics on an inference component. You can configure the alarms to check whether the newly deployed version of the inference component is working properly. If the CloudWatch alarms are triggered, SageMaker AI starts an automated rollback.

The new functionality is implemented through extensions to the SageMaker AI API, primarily with new parameters in the UpdateInferenceComponent API:

sagemaker_client.update_inference_component(
    InferenceComponentName=inference_component_name,
    RuntimeConfig={"CopyCount": number},
    Specification={...},
    DeploymentConfig={
        "RollingUpdatePolicy": {
            "MaximumBatchSize": {  # Value must be between 5% to 50% of the IC's total copy count
                "Type": "COPY_COUNT",  # COPY_COUNT | CAPACITY_PERCENT
                "Value": 1  # Minimum value of 1
            },
            "MaximumExecutionTimeoutInSeconds": 600,  # Minimum value of 600. Maximum value of 28800
            "RollbackMaximumBatchSize": {
                "Type": "COPY_COUNT",  # COPY_COUNT | CAPACITY_PERCENT
                "Value": 1
            },
            "WaitIntervalInSeconds": 120  # Minimum value of 0. Maximum value of 3600
        },
        "AutoRollbackConfiguration": {
            "Alarms": [
                {
                    "AlarmName": "string"  # Optional
                }
            ]
        }
    },
)

The preceding code uses the following parameters:

MaximumBatchSize – This is a required parameter and defines the batch size for each rolling step in the deployment process. For each step, SageMaker AI provisions capacity on the new endpoint fleet, routes traffic to that fleet, and stops capacity on the old endpoint fleet. The value must be between 5–50% of the copy count of the inference component.

Type – This parameter specifies the endpoint capacity type and accepts either COPY_COUNT or CAPACITY_PERCENT.
Value – This defines the capacity size, either as a number of inference component copies or a capacity percentage.

MaximumExecutionTimeoutInSeconds – This is the maximum time that the rolling deployment would spend on the overall execution. Exceeding this limit causes a timeout.
RollbackMaximumBatchSize – This is the batch size for a rollback to the old endpoint fleet. If this field is absent, the value is set to the default, which is 100% of the total capacity. When the default is used, SageMaker AI provisions the entire capacity of the old fleet at the same time during rollback.

Value – The capacity size to use with the specified Type during rollback. If you don’t specify the fields in this object, or if you set the Value to 100%, SageMaker AI uses a blue/green rollback strategy and rolls all traffic back to the blue fleet at once.

WaitIntervalInSeconds – This is the amount of time SageMaker AI waits between batches, during which it monitors the configured alarms on the new fleet before proceeding to the next batch.
AutoRollbackConfiguration – This is the automatic rollback configuration for handling endpoint deployment failures and recovery.

AlarmName – The name of a CloudWatch alarm configured to monitor metrics on an inference component. You can configure it to check whether the newly deployed version of the inference component is working properly.

For more information about the SageMaker AI API, refer to the SageMaker AI API Reference.
Customer experience
Let’s explore how rolling updates work in practice with several common scenarios, using different-sized LLMs. You can find the example notebook in the GitHub repo.
Scenario 1: Multiple single GPU cluster
In this scenario, assume you’re running an endpoint with three ml.g5.2xlarge instances, each with a single GPU. The endpoint hosts an inference component that requires one GPU accelerator, which means each instance holds one copy. When you want to update the inference component to use a new inference component version, you can use rolling updates to minimize disruption.
You can configure a rolling update with a batch size of one, meaning SageMaker AI will update one copy at a time. During the update process, SageMaker AI first identifies available capacity in the existing instances. Because none of the existing instances has space for additional temporary workloads, SageMaker AI will launch new ml.g5.2xlarge instances one at a time to deploy one copy of the new inference component version to a GPU instance. After the specified wait interval and the new inference component’s container passes healthy check, SageMaker AI removes one copy of the old version (because each copy is hosted on one instance, this instance will be torn down accordingly), completing the update for the first batch.
This process repeats for the second copy of the inference component, providing a smooth transition with zero downtime. The gradual nature of the update minimizes risk and allows you to maintain consistent availability throughout the deployment process. The following diagram shows this process.

Scenario 2: Update with automatic rollback
In another scenario, you might be updating your inference component from Llama-3.1-8B-Instruct to DeepSeek-R1-Distill-Llama-8B, but the new model version has different API expectations. In this use case, you have configured a CloudWatch alarm to monitor for 4xx errors, which would indicate API compatibility issues.
You can initiate a rolling update with a batch size of one copy. SageMaker AI deploys the first copy of the new version on a new GPU instance. When the new instance is ready to serve traffic, SageMaker AI will forward a proportion of the invocation requests to this new model. However, in this example, the new model version, which is missing the “MESSAGES_API_ENABLED” environment variable configuration, will begin to return 4xx errors when receiving requests in the Messages API format.

The configured CloudWatch alarm detects these errors and transitions to the alarm state. SageMaker AI automatically detects this alarm state and initiates a rollback process according to the rollback configuration. Following the specified rollback batch size, SageMaker AI removes the problematic new model version and maintains the original working version, preventing widespread service disruption. The endpoint returns to its original state with traffic being handled by the properly functioning original model version.
The following code snippet shows how to set up a CloudWatch alarm to monitor 4xx errors:

import boto3

cloudwatch = boto3.client('cloudwatch')

# Create alarm
cloudwatch.put_metric_alarm(
    AlarmName=f'SageMaker-{endpoint_name}-4xx-errors',
    ComparisonOperator='GreaterThanThreshold',
    EvaluationPeriods=1,
    MetricName='Invocation4XXErrors',
    Namespace='AWS/SageMaker',
    Period=300,
    Statistic='Sum',
    Threshold=5.0,
    ActionsEnabled=True,
    AlarmDescription='Alarm when greater than 5 4xx errors',
    Dimensions=[
        {
            'Name': 'InferenceComponentName',
            'Value': inference_component_name
        },
    ],
)

Then you can use this CloudWatch alarm in the update request:

sagemaker_client.update_inference_component(
    InferenceComponentName=inference_component_name,
    # … … (other parameters as shown in the earlier example)
    DeploymentConfig={
        "RollingUpdatePolicy": {
            "MaximumBatchSize": {
                "Type": "COPY_COUNT",
                "Value": 1
            },
            "WaitIntervalInSeconds": 120,
            "RollbackMaximumBatchSize": {
                "Type": "COPY_COUNT",
                "Value": 1
            }
        },
        "AutoRollbackConfiguration": {
            "Alarms": [
                {"AlarmName": f'SageMaker-{endpoint_name}-4xx-errors'}
            ]
        }
    }
)

Scenario 3: Update with sufficient capacity in the existing instances
If an existing endpoint has multiple GPU accelerators and not all of them are in use, the update can use the free accelerators without launching new instances. Consider an endpoint initially configured with two ml.g5.12xlarge instances, each with four GPU accelerators. The endpoint hosts two inference components: IC-1 requires one accelerator and IC-2 also requires one accelerator. One ml.g5.12xlarge instance holds four copies of IC-1; the other holds two copies of IC-2, leaving two GPU accelerators available on the second instance.
When you initiate an update for IC-1 with a batch size of two copies, SageMaker AI determines that there is sufficient capacity in the existing instances to host the new versions while maintaining the old ones. It creates two copies of the new IC-1 version on the second instance. When the containers are up and running, SageMaker AI starts routing traffic to the new IC-1 copies and removes two of the old IC-1 copies from the first instance. You are not charged until the new inference components start taking invocations and generating responses.
Now another two free GPU slots are available. SageMaker AI will update the second batch, and it will use the free GPU accelerators that just became available. After the processes are complete, the endpoint has four IC-1 with the new version and two copies of IC-2 that weren’t changed.

Scenario 4: Update requiring additional instance capacity
Consider if you have an endpoint configured with initially one ml.g5.12xlarge instance (4 GPUs total) and configured managed instance scaling (MIS) with a maximum instance number set to two. The endpoint hosts two inference components: IC-1 requiring 1 GPU with two copies (Llama 8B), and IC-2 (DeepSeek Distilled Llama 14B model) also requiring 1 GPU with two copies—utilizing all 4 available GPUs.
When you initiate an update for IC-1 with a batch size of two copies, SageMaker AI determines that there’s insufficient capacity in the existing instances to host the new versions while maintaining the old ones. Instead of failing the update, because you have configured MIS, SageMaker AI automatically provisions a second ml.g5.12xlarge instance to host the new inference components.
During the update process, SageMaker AI deploys two copies of the new IC-1 version onto the newly provisioned instance, as shown in the following diagram. After the new inference components are up and running, SageMaker AI begins removing the old IC-1 copies from the original instances. By the end of the update, the first instance will host IC-2 utilizing 2 GPUs, and the newly provisioned second instance will host the updated IC-1 with two copies using 2 GPUs. There will be new spaces available in the two instances, and you can deploy more inference component copies or new models to the same endpoint using the available GPU resources. If you set up managed instance auto scaling and set inference component auto scaling to zero, you can scale down the inference component copies to zero, which will result in the corresponding instance being scaled down. When the inference component is scaled up, SageMaker AI will launch the inference components in the existing instance with the available GPU accelerators, as mentioned in scenario 3.

Scenario 5: Update facing insufficient capacity
In scenarios where there isn’t enough GPU capacity, SageMaker AI provides clear feedback about capacity constraints. Consider if you have an endpoint running on 30 ml.g6e.16xlarge instances, each already fully utilized with inference components. You want to update an existing inference component using a rolling deployment with a batch size of 4, but after the first four batches are updated, there isn’t enough GPU capacity available for the remaining update. In this case, SageMaker AI will automatically roll back to the previous setup and stop the update process.
There can be two cases for this rollback final status. In the first case, the rollback was successful because there was new capacity available to launch the instances for the old model version. However, there could be another case where the capacity issue persists during rolling back, and the endpoint will show as UPDATE_ROLLBACK_FAILED. The existing instances can still serve traffic, but to move the endpoint out of the failed status, you need to contact your AWS support team.
Additional considerations
As mentioned earlier, when using blue/green deployment to update the inference components on an endpoint, you need to provision resources for both the current (blue) and new (green) environments simultaneously. With rolling updates, you can use the following equation to calculate the account-level service quota required for the instance type. Suppose the GPU instance used by the endpoint has X GPU accelerators, each inference component copy requires Y GPU accelerators, the maximum batch size is set to Z, and the current endpoint has N instances. The account-level service quota for this instance type must then be at least:
ROUNDUP(Z x Y / X) + N
For example, let’s assume the current endpoint has 8 (N) ml.g5.12xlarge instances, each with 4 (X) GPU accelerators. You set the maximum batch size to 2 (Z) copies, and each copy needs 1 (Y) GPU accelerator. The minimum AWS service quota value for ml.g5.12xlarge is ROUNDUP(2 x 1 / 4) + 8 = 9. In another scenario, where each inference component copy requires 4 GPU accelerators, the required account-level service quota for the same instance type is ROUNDUP(2 x 4 / 4) + 8 = 10.
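A quick way to sanity-check this calculation is a small helper function; the variable names are illustrative:

import math

def required_instance_quota(gpus_per_instance, gpus_per_copy, max_batch_size, current_instances):
    """Account-level instance quota needed for a rolling update: ROUNDUP(Z x Y / X) + N."""
    return math.ceil(max_batch_size * gpus_per_copy / gpus_per_instance) + current_instances

print(required_instance_quota(4, 1, 2, 8))   # 9, matching the first example
print(required_instance_quota(4, 4, 2, 8))   # 10, matching the second example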
Conclusion
Rolling updates for inference components represent a significant enhancement to the deployment capabilities of SageMaker AI. This feature directly addresses the challenges of updating model deployments in production, particularly for GPU-heavy workloads, and it eliminates capacity guesswork and reduces rollback risk. By combining batch-based updates with automated safeguards, SageMaker AI makes sure deployments are agile and resilient.
Key benefits include:

Reduced resource overhead during deployments, eliminating the need to provision duplicate fleets
Improved deployment guardrails with gradual updates and automatic rollback capabilities
Continued availability during updates with configurable batch sizes
Straightforward deployment of resource-intensive models that require multiple accelerators

Whether you’re deploying compact models or larger multi-accelerator models, rolling updates provide a more efficient, cost-effective, and safer path to keeping your ML models current in production.
We encourage you to try this new capability with your SageMaker AI endpoints and discover how it can enhance your ML operations. For more information, check out the SageMaker AI documentation or connect with your AWS account team.

About the authors
Melanie Li, PhD, is a Senior Generative AI Specialist Solutions Architect at AWS based in Sydney, Australia, where her focus is on working with customers to build solutions leveraging state-of-the-art AI and machine learning tools. She has been actively involved in multiple Generative AI initiatives across APJ, harnessing the power of Large Language Models (LLMs). Prior to joining AWS, Dr. Li held data science roles in the financial and retail industries.
Andrew Smith is a Cloud Support Engineer in the SageMaker, Vision & Other team at AWS, based in Sydney, Australia. He supports customers using many AI/ML services on AWS with expertise in working with Amazon SageMaker. Outside of work, he enjoys spending time with friends and family as well as learning about different technologies.
Dustin Liu is a solutions architect at AWS, focused on supporting financial services and insurance (FSI) startups and SaaS companies. He has a diverse background spanning data engineering, data science, and machine learning, and he is passionate about leveraging AI/ML to drive innovation and business transformation.
Vivek Gangasani is a Senior GenAI Specialist Solutions Architect at AWS. He helps emerging generative AI companies build innovative solutions using AWS services and accelerated compute. Currently, he is focused on developing strategies for fine-tuning and optimizing the inference performance of large language models. In his free time, Vivek enjoys hiking, watching movies, and trying different cuisines.
Shikher Mishra is a Software Development Engineer with SageMaker Inference team with over 9+ years of industry experience. He is passionate about building scalable and efficient solutions that empower customers to deploy and manage machine learning applications seamlessly. In his spare time, Shikher enjoys outdoor sports, hiking and traveling.
June Won is a product manager with Amazon SageMaker JumpStart. He focuses on making foundation models easily discoverable and usable to help customers build generative AI applications. His experience at Amazon also includes mobile shopping applications and last mile delivery.

Evaluate and improve performance of Amazon Bedrock Knowledge Bases

Amazon Bedrock Knowledge Bases is a fully managed capability that helps implement entire Retrieval Augmented Generation (RAG) workflows from ingestion to retrieval and prompt augmentation without having to build custom integrations to data sources and manage data flows.
There is no single way to optimize knowledge base performance: each use case is impacted differently by configuration parameters. As such, it’s important to test often and iterate quickly to identify the best configuration for each use case.
In this post, we discuss how to evaluate the performance of your knowledge base, including the metrics and data to use for evaluation. We also address some of the tactics and configuration changes that can improve specific metrics.
Measure the performance of your knowledge base
RAG is a complex AI system, combining several critical steps. In order to identify what is impacting the performance of the pipeline, it’s important to evaluate each step independently. The knowledge base evaluation framework decomposes the evaluation into the following stages:

Retrieval – The process of retrieving relevant parts of documents based on a query and adding the retrieved elements as context to the final prompt for the knowledge base
Generation – Sending the user’s prompt and the retrieved context to a large language model (LLM) and then sending the output from the LLM back to the user

The following diagram illustrates the standard steps in a RAG pipeline.

To see this evaluation framework in action, open the Amazon Bedrock console, and in the navigation pane, choose Evaluations. Choose the Knowledge Bases tab to review the evaluation.

Evaluate the retrieval
We recommend initially evaluating the retrieval process independently, because the accuracy and quality of this foundational stage can significantly impact downstream performance metrics in the RAG workflow, potentially introducing errors or biases that propagate through subsequent pipeline stages.

There are two metrics used to evaluate retrieval:

Context relevance – Evaluates whether the retrieved information directly addresses the query’s intent. It focuses on precision of the retrieval system.
Context coverage – Measures how comprehensively the retrieved texts cover the expected ground truth. It requires ground truth texts for comparison to assess recall and completeness of retrieved information.

Context relevance and context coverage metrics are compiled by comparing search results from the RAG pipeline with expected answers in the test dataset. The following diagram illustrates this workflow.

Running the evaluation requires you to bring a dataset that adheres to specific formatting guidelines. The dataset must be in JSON Lines format, with each line representing a valid JSON object. To maintain optimal performance, the dataset should be limited to a maximum of 1,000 prompts per evaluation. Each individual prompt within the dataset must be a well-structured, valid JSON object that can be properly parsed and processed by the evaluation system.
If you choose to evaluate for context coverage, you will need to provide a ground truth, which is text that serves as the baseline for measuring coverage. The ground truth must include referenceContexts, and each prompt in the ground truth must have corresponding reference contexts for accurate evaluation.
The following example code shows the required fields:

{
    "conversationTurns": [{
        "referenceContexts": [{
            "content": [{
                "text": "ground truth text"
            }]
        }],
        "prompt": {
            "content": [{
                "text": "query text"
            }]
        }
    }]
}

For more details, see Creating a prompt dataset for Retrieve only evaluation jobs.
Evaluate the generation
After validating that your RAG workflow successfully retrieves relevant context from your vector database and aligns with your predefined performance standards, you can proceed to evaluate the generation stage of your pipeline. The Amazon Bedrock evaluation tool provides a comprehensive assessment framework with eight metrics that cover both response quality and responsible AI considerations.
Response quality includes the following metrics:

Helpfulness – Evaluates how useful and comprehensive the generated responses are in answering questions
Correctness – Assesses the accuracy of responses to questions
Logical coherence – Examines responses for logical gaps, inconsistencies, or contradictions
Completeness – Evaluates whether responses address all aspects of the questions
Faithfulness – Measures factual accuracy and resistance to hallucinations

Responsible AI includes the following metrics:

Harmfulness – Evaluates responses for the presence of hate, insult, or violent content
Stereotyping – Assesses for generalized statements about groups or individuals
Refusal – Measures how appropriately the system declines to answer inappropriate questions

Response quality and responsible AI metrics are compiled by comparing search results and the generated response from the RAG pipeline with ground truth answers. The following diagram illustrates this workflow.

The dataset for evaluation must adhere to specific structural requirements, using JSON Lines format with a maximum of 1,000 prompts per evaluation. Each prompt is required to be a valid JSON object with a well-defined structure. Within this structure, two critical fields play essential roles: the prompt field contains the query text used for model evaluation, and the referenceResponses field stores the expected ground truth responses against which the model’s performance will be measured. This format promotes a standardized, consistent approach to evaluating model outputs across different test scenarios.
The following example code shows the required fields:

{
    "conversationTurns": [{
        "referenceResponses": [{
            "content": [{
                "text": "This is a reference text"
            }]
        }],

        ## your prompt to the model
        "prompt": {
            "content": [{
                "text": "This is a prompt"
            }]
        }
    }]
}

For more details, see Creating a prompt dataset for Retrieve and generate evaluation jobs.
The following screenshot shows an Amazon Bedrock evaluation results sample dashboard.

After processing, the evaluation provides comprehensive insights, delivering both aggregate metrics and granular performance breakdowns for each individual metric. These detailed results include sample conversations that illustrate performance nuances. To derive maximum value, we recommend conducting a qualitative review, particularly focusing on conversations that received low scores across any metrics. This deep-dive analysis can help you understand the underlying factors contributing to poor performance and inform strategic improvements to your RAG workflow.
Building a comprehensive test dataset: Strategies and considerations
Creating a robust test dataset is crucial for meaningful evaluation. In this section, we discuss three primary approaches to dataset development.
Human-annotated data collection
Human annotation remains the gold standard for domain-specific, high-quality datasets. You can:

Use your organization’s proprietary documents and answers
Use open-source document collections like Clueweb (a 10-billion web document repository)
Employ professional data labeling services such as Amazon SageMaker Ground Truth
Use a crowdsourcing marketplace like Amazon Mechanical Turk for distributed annotation

Human data annotation is recommended for domain-specific, high-quality, and nuanced results. However, generating and maintaining large datasets using human annotators is a time-consuming and costly approach.
Synthetic data generation using LLMs
Synthetic data generation offers a more automated, potentially cost-effective alternative with two primary methodologies:

Self-instruct approach:

Iterative process using a single target model
Model generates multiple responses to queries
Provides continuous feedback and refinement

Knowledge distillation approach:

Uses multiple models
Generates responses based on preexisting model training
Enables faster dataset creation by using previously trained models

Synthetic data generation requires careful navigation of several key considerations. Organizations must typically secure End User License Agreements and might need access to multiple LLMs. Although the process demands minimal human expert validation, these strategic requirements underscore the complexity of generating synthetic datasets efficiently. This approach offers a streamlined alternative to traditional data annotation methods, balancing legal compliance with technical innovation.
Continuous dataset improvement: The feedback loop strategy
Develop a dynamic, iterative approach to dataset enhancement that transforms user interactions into valuable learning opportunities. Begin with your existing data as a foundational baseline, then implement a robust user feedback mechanism that systematically captures and evaluates real-world model interactions. Establish a structured process for reviewing and integrating flagged responses, treating each piece of feedback as a potential refinement point for your dataset. For an example of such a feedback loop implemented in AWS, refer to Improve LLM performance with human and AI feedback on Amazon SageMaker for Amazon Engineering.
This approach transforms dataset development from a static, one-time effort into a living, adaptive system. By continuously expanding and refining your dataset through user-driven insights, you create a self-improving mechanism that progressively enhances model performance and evaluation metrics. Remember: dataset evolution is not a destination, but an ongoing journey of incremental optimization.
When developing your test dataset, strive for a strategic balance that precisely represents the range of scenarios your users will encounter. The dataset should comprehensively span potential use cases and edge cases, while avoiding unnecessary repetition. Because each evaluation example incurs a cost, focus on creating a dataset that maximizes insights and performance understanding, selecting examples that reveal unique model behaviors rather than redundant iterations. The goal is to craft a targeted, efficient dataset that provides meaningful performance assessment without wasting resources on superfluous testing.
Performance improvement tools
Comprehensive evaluation metrics are more than just performance indicators—they’re a strategic roadmap for continuous improvement in your RAG pipeline. These metrics provide critical insights that transform abstract performance data into actionable intelligence, enabling you to do the following:

Diagnose specific pipeline weaknesses
Prioritize improvement efforts
Objectively assess knowledge base readiness
Make data-driven optimization decisions

By systematically analyzing your metrics, you can definitively answer key questions: Is your knowledge base robust enough for deployment? What specific components require refinement? Where should you focus your optimization efforts for maximum impact?
Think of metrics as a diagnostic tool that illuminates the path from current performance to exceptional AI system reliability. They don’t just measure—they guide, providing a clear, quantitative framework for strategic enhancement.
Although a truly comprehensive exploration of RAG pipeline optimization would require an extensive treatise, this post offers a systematic framework for transformative improvements across critical dimensions.
Data foundation and preprocessing
Data foundation and preprocessing consists of the following best practices:

Clean and preprocess source documents to improve quality, removing noise, standardizing formats, and maintaining data consistency
Augment training data with relevant external sources, expanding dataset diversity and coverage
Implement named entity recognition and linking to improve retrieval, enhancing semantic understanding and context identification
Use text summarization techniques to condense long documents, reducing complexity while preserving key information

Chunking strategies
Consider the following chunking strategies:

Use semantic chunking instead of fixed-size chunking to preserve context, maintaining meaningful information boundaries.
Explore various chunk sizes (128–1,024 characters), adapting to semantic text structure and preserving meaning through intelligent segmentation. For more details on Amazon Bedrock chunking strategies, see How content chunking works for knowledge bases.
Implement sliding window chunking with overlap (typically 10–20%) to minimize information loss between chunks and provide contextual continuity; a minimal sketch follows this list.
Consider hierarchical chunking for long documents, capturing both local and global contextual nuances.
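
To make the overlap guidance concrete, the following is a minimal sliding-window chunker in Python. It is an illustrative sketch only; the chunk_size and overlap_ratio values are assumptions for demonstration, not Amazon Bedrock defaults.

def sliding_window_chunks(text, chunk_size=512, overlap_ratio=0.15):
    """Split text into fixed-size chunks with a configurable overlap.

    An overlap_ratio of 0.10-0.20 keeps 10-20% of each chunk's tail
    in the next chunk, preserving context across boundaries.
    """
    step = max(1, int(chunk_size * (1 - overlap_ratio)))
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break
    return chunks

# Example: a 10,000-character document, 512-character chunks, 15% overlap
print(len(sliding_window_chunks("x" * 10_000)))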

Embedding techniques
Embedding techniques include the following:

If your text contains multiple languages, you might want to try the Cohere Embed (Multilingual) embedding model, which could improve semantic understanding and retrieval relevance (see the sketch after this list).
Experiment with embedding dimensions, balancing performance and computational efficiency.
Implement sentence or paragraph embeddings, moving beyond word-level representations.
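
As an illustrative sketch (not an official AWS snippet), this is roughly how multilingual embeddings could be generated through Amazon Bedrock with Boto3. The model ID and request body shown follow the Cohere Embed v3 conventions and are assumptions to verify against the current documentation for your Region.

import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime")  # assumes credentials and Region are already configured

def embed_texts(texts, model_id="cohere.embed-multilingual-v3"):
    """Return one embedding vector per input text (illustrative request shape)."""
    body = json.dumps({"texts": texts, "input_type": "search_document"})
    response = bedrock_runtime.invoke_model(modelId=model_id, body=body)
    payload = json.loads(response["body"].read())
    return payload["embeddings"]

vectors = embed_texts(["¿Cómo restablezco mi contraseña?", "How do I reset my password?"])
print(len(vectors), len(vectors[0]))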

Retrieval optimization
Consider the following best practices for retrieval optimization:

Statically or dynamically adjust the number of retrieved chunks to optimize information density. In your RetrieveAndGenerate (or Retrieve) request, modify "retrievalConfiguration": { "vectorSearchConfiguration": { "numberOfResults": NUMBER }} (see the sketch after this list).
Implement metadata filtering to add contextual layers to chunk retrieval (for example, prioritizing recent information in time-sensitive scenarios). For code samples for metadata filtering using Amazon Bedrock Knowledge Bases, refer to the following GitHub repo.
Use hybrid search combining dense and sparse retrieval, blending semantic and keyword search approaches.
Apply reranking models to improve precision, reorganizing retrieved contexts by relevance.
Experiment with diverse similarity metrics, exploring beyond standard cosine similarity.
Implement query expansion techniques, transforming queries for more effective retrieval. One example is query decomposition, breaking complex queries into targeted sub-questions.
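
For reference, here is a minimal sketch of adjusting numberOfResults in a Retrieve call with Boto3. The knowledge base ID is a placeholder, and the configuration keys should be checked against the current Amazon Bedrock API reference.

import boto3

agent_runtime = boto3.client("bedrock-agent-runtime")

response = agent_runtime.retrieve(
    knowledgeBaseId="KB_ID_PLACEHOLDER",  # replace with your knowledge base ID
    retrievalQuery={"text": "What is the refund policy?"},
    retrievalConfiguration={
        "vectorSearchConfiguration": {
            "numberOfResults": 10  # statically raise or lower the retrieved-chunk count
        }
    },
)

for result in response["retrievalResults"]:
    print(result["content"]["text"][:120])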

The following screenshot shows these options on the Amazon Bedrock console.

Prompt engineering
After you select a model, you can edit the prompt template:

Design context-aware prompts, explicitly guiding models to use retrieved information (a minimal template sketch follows this list)
Implement few-shot prompting, using dynamic, query-matched examples
Create dynamic prompts based on query and documents, adapting instruction strategy contextually
Include explicit usage instructions for retrieved information, achieving faithful and precise response generation
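
As an illustration of a context-aware template with explicit usage instructions (not the default Amazon Bedrock template), consider the sketch below. The $search_results$ and $query$ placeholders are illustrative; check the console documentation for the variables your configuration actually supports.

PROMPT_TEMPLATE = """You are a customer support assistant.
Answer the question using only the information in the search results below.
If the search results do not contain the answer, say you don't know.

Search results:
$search_results$

Question: $query$

Answer concisely and cite which search result you relied on."""

def build_prompt(query, retrieved_chunks):
    """Fill the template with the user query and retrieved context."""
    context = "\n\n".join(retrieved_chunks)
    return (PROMPT_TEMPLATE
            .replace("$search_results$", context)
            .replace("$query$", query))

print(build_prompt("How long do refunds take?", ["Refunds are processed within 5-7 business days."]))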

The following screenshot shows an example of editing the prompt template on the Amazon Bedrock console.

Model selection and guardrails
When choosing your model and guardrails, consider the following:

Choose LLMs based on specific task requirements, aligning model capabilities with the use case
Fine-tune models on domain-specific data, enhancing specialized performance
Experiment with model sizes, balancing performance and computational efficiency
Consider specialized model configurations, using smaller models for retrieval and larger for generation
Implement contextual grounding checks, making sure responses remain true to provided information, such as contextual grounding with Amazon Bedrock Guardrails (see the following screenshot)
Explore advanced search paradigms, such as knowledge graph search (GraphRAG)

Navigating knowledge base improvements: Key considerations
When optimizing a RAG system, understanding your performance requirements is crucial. The acceptable performance bar depends entirely on your application’s context—whether it’s an internal tool, a system augmenting human workers, or a customer-facing service. A 0.95 metric score might be sufficient for some applications, where 1 in 20 answers could have minor inaccuracies, but potentially unacceptable for high-stakes scenarios. The key is to align your optimization efforts with the specific reliability and precision needs of your particular use case.
Another key is to prioritize refining the retrieval mechanism before addressing generation. Upstream performance directly influences downstream metrics, making retrieval optimization critical. Certain techniques, particularly chunking strategies, have nuanced impacts across both stages. For instance, increasing chunk size can improve retrieval efficiency by reducing search complexity, but simultaneously risks introducing irrelevant details that might compromise the generation’s correctness. This delicate balance requires careful, incremental adjustments to make sure both retrieval precision and response quality are systematically enhanced.
The following figure illustrates the aforementioned tools and how they relate to retrieval, generation, and both.

Diagnose the issue
When targeting a specific performance metric, adopt a forensic, human-centric approach to diagnosis. Treat your AI system like a colleague whose work requires thoughtful, constructive feedback. This includes the following steps:

Failure pattern identification:

Systematically map question types that consistently underperform
Identify specific characteristics triggering poor performance, such as:

List-based queries
Specialized vocabulary domains
Complex topic intersections

Contextual retrieval forensics:

Conduct granular chunk relevance analysis
Quantify irrelevant or incorrect retrieved contexts
Map the precision distribution within the retrieved set (for example, the first 5 out of 15 chunks are relevant, the subsequent 10 are not); a small helper for this follows the list
Understand retrieval mechanism’s contextual discrimination capabilities

Ground truth comparative analysis:

Rigorously compare generated responses against reference answers
Diagnose potential ground truth limitations
Develop targeted improvement instructions—think about what specific guidance would enhance response accuracy, and which nuanced context might be missing
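
A small helper like the following (an illustrative sketch, not part of any AWS SDK) can quantify how relevance is distributed across the retrieved set:

def precision_at_k(relevance_flags, k):
    """Fraction of the top-k retrieved chunks judged relevant.

    relevance_flags: list of 0/1 judgments in retrieval order.
    """
    top_k = relevance_flags[:k]
    return sum(top_k) / max(1, len(top_k))

# Example: the first 5 of 15 retrieved chunks are relevant, the remaining 10 are not
flags = [1] * 5 + [0] * 10
for k in (5, 10, 15):
    print(f"precision@{k} = {precision_at_k(flags, k):.2f}")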

Develop a strategic approach to improvement
When confronting complex RAG pipeline challenges, adopt a methodical, strategic approach that transforms performance optimization from a daunting task into a systematic journey of incremental enhancement.
The key is to identify tactics with direct, measurable impact on your specific target metric, concentrating on optimization points that offer the highest potential return on effort. This means carefully analyzing each potential strategy through the lens of its probable performance improvement, focusing on techniques that can deliver meaningful gains with minimal systemic disruption. The following figure illustrates which sets of techniques to prioritize when working to improve metrics.

Additionally, you should prioritize low-friction optimization tactics, such as configurable parameters in your knowledge base, or implementations that have minimal infrastructure disruption. It’s recommended to avoid full vector database reimplementation unless necessary.
You should take a lean approach—make your RAG pipeline improvement into a methodical, scientific process of continuous refinement. Embrace an approach of strategic incrementalism: make purposeful, targeted adjustments that are small enough to be precisely measured, yet meaningful enough to drive performance forward.
Each modification becomes an experimental intervention, rigorously tested to understand its specific impact. Implement a comprehensive version tracking system that captures not just the changes made, but the rationale behind each adjustment, the performance metrics before and after, and the insights gained.
Lastly, approach performance evaluation with a holistic, empathetic methodology that transcends mere quantitative metrics. Treat the assessment process as a collaborative dialogue of growth and understanding, mirroring the nuanced approach you would take when coaching a talented team member. Instead of reducing performance to cold, numerical indicators, seek to uncover the underlying dynamics, contextual challenges, and potential for development. Recognize that meaningful evaluation goes beyond surface-level measurements, requiring deep insight into capabilities, limitations, and the unique context of performance.
Conclusion
Optimizing Amazon Bedrock Knowledge Bases for RAG is an iterative process that requires systematic testing and refinement. Success comes from methodically using techniques like prompt engineering and chunking to improve both the retrieval and generation stages of RAG. By tracking key metrics throughout this process, you can measure the impact of your optimizations and ensure they meet your application’s requirements.
To learn more about optimizing your Amazon Bedrock Knowledge Bases, see our guide on how to Evaluate the performance of Amazon Bedrock resources.

About the Authors
Clement Perrot is a Senior Solutions Architect and AI/ML Specialist at AWS, where he helps early-stage startups build and use AI on the AWS platform. Prior to AWS, Clement was an entrepreneur, whose last two AI and consumer hardware startups were acquired.
Miriam Lebowitz is a Solutions Architect focused on empowering early-stage startups at AWS. She uses her experience with AI/ML to guide companies to select and implement the right technologies for their business objectives, setting them up for scalable growth and innovation in the competitive startup world.
Tamil Sambasivam is a Solutions Architect and AI/ML Specialist at AWS. She helps enterprise customers solve their business problems by recommending the right AWS solutions. Her strong background in Information Technology (24+ years of experience) helps customers strategize, develop, and modernize their solutions in the AWS Cloud. In her spare time, Tamil likes to travel and garden.

Lyra: A Computationally Efficient Subquadratic Architecture for Biolog …

Deep learning architectures like CNNs and Transformers have significantly advanced biological sequence modeling by capturing local and long-range dependencies. However, their application in biological contexts is constrained by high computational demands and the need for large datasets. CNNs efficiently detect local sequence patterns with subquadratic scaling, whereas Transformers leverage self-attention to model global interactions but require quadratic scaling, making them computationally expensive. Hybrid models, such as Enformer, integrate CNNs and Transformers to balance local and global context modeling, but they still face scalability issues. Large-scale Transformer-based models, including AlphaFold2 and ESM3, have achieved breakthroughs in protein structure prediction and sequence-function modeling. Yet, their reliance on extensive parameter scaling limits their efficiency in biological systems where data availability is often restricted. This highlights the need for more computationally efficient approaches to model sequence-to-function relationships accurately.

To overcome these challenges, epistasis—the interaction between mutations within a sequence—provides a structured mathematical framework for biological sequence modeling. Multilinear polynomials can represent these interactions, offering a principled way to understand sequence-function relationships. State space models (SSMs) naturally align with this polynomial structure, using hidden dimensions to approximate epistatic effects. Unlike Transformers, SSMs utilize Fast Fourier Transform (FFT) convolutions to model global dependencies efficiently while maintaining subquadratic scaling. Additionally, integrating gated depthwise convolutions enhances local feature extraction and expressivity through adaptive feature selection. This hybrid approach balances computational efficiency with interpretability, making it a promising alternative to Transformer-based architectures for biological sequence modeling.

Researchers from institutions, including MIT, Harvard, and Carnegie Mellon, introduce Lyra, a subquadratic sequence modeling architecture designed for biological applications. Lyra integrates SSMs to capture long-range dependencies with projected gated convolutions for local feature extraction, enabling efficient O(N log N) scaling. It effectively models epistatic interactions and achieves state-of-the-art performance across over 100 biological tasks, including protein fitness prediction, RNA function analysis, and CRISPR guide design. Lyra operates with significantly fewer parameters—up to 120,000 times smaller than existing models—while being 64.18 times faster in inference, democratizing access to advanced biological sequence modeling.

Lyra consists of two key components: Projected Gated Convolution (PGC) blocks and a state-space layer with depthwise convolution (S4D). With approximately 55,000 parameters, the model includes two PGC blocks for capturing local dependencies, followed by an S4D layer for modeling long-range interactions. PGC processes input sequences by projecting them to intermediate dimensions, applying depthwise 1D convolutions and linear projections, and recombining features through element-wise multiplication. S4D leverages diagonal state-space models to compute convolution kernels using matrices A, B, and C, efficiently capturing sequence-wide dependencies through weighted exponential terms and enhancing Lyra’s ability to model biological data effectively.
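
Based only on the description above, a rough PyTorch sketch of a PGC-style block might look like the following. The dimensions and layer choices are assumptions for illustration, not the authors’ implementation.

import torch
import torch.nn as nn

class PGCBlock(nn.Module):
    """Projected Gated Convolution (sketch): project, depthwise-convolve, gate by element-wise product."""
    def __init__(self, dim, hidden_dim, kernel_size=3):
        super().__init__()
        self.proj_in = nn.Linear(dim, 2 * hidden_dim)      # project to intermediate dimensions
        self.dw_conv = nn.Conv1d(hidden_dim, hidden_dim,   # depthwise 1D convolution
                                 kernel_size, padding=kernel_size // 2,
                                 groups=hidden_dim)
        self.proj_out = nn.Linear(hidden_dim, dim)         # linear projection back to the model dimension

    def forward(self, x):                                  # x: (batch, length, dim)
        u, v = self.proj_in(x).chunk(2, dim=-1)
        u = self.dw_conv(u.transpose(1, 2)).transpose(1, 2)  # convolve along the sequence axis
        return self.proj_out(u * v)                        # recombine via element-wise multiplication

x = torch.randn(2, 128, 64)
print(PGCBlock(dim=64, hidden_dim=128)(x).shape)           # torch.Size([2, 128, 64])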

Lyra is a sequence modeling architecture designed to capture local and long-range dependencies in biological sequences efficiently. It integrates PGCs for localized modeling and diagonalized S4D for global interactions. Lyra approximates complex epistatic interactions using polynomial expressivity, outperforming Transformer-based models in tasks like protein fitness landscape prediction and deep mutational scanning. It achieves state-of-the-art accuracy across various protein and nucleic acid modeling applications, including disorder prediction, mutation impact analysis, and RNA-dependent RNA polymerase detection, while maintaining a significantly smaller parameter count and lower computational cost than existing large-scale models.

In conclusion, Lyra introduces a subquadratic architecture for biological sequence modeling, leveraging SSMs to approximate multilinear polynomial functions efficiently. This enables superior modeling of epistatic interactions while significantly reducing computational demands. By integrating PGCs for local feature extraction, Lyra achieves state-of-the-art performance across over 100 biological tasks, including protein fitness prediction, RNA analysis, and CRISPR guide design. It outperforms large foundation models with far fewer parameters and faster inference, requiring only one or two GPUs for training within hours. Lyra’s efficiency democratizes access to advanced biological modeling, with applications in therapeutics, pathogen surveillance, and biomanufacturing.

Check out the Paper. All credit for this research goes to the researchers of this project.

The post Lyra: A Computationally Efficient Subquadratic Architecture for Biological Sequence Modeling appeared first on MarkTechPost.

SuperBPE: Advancing Language Models with Cross-Word Tokenization

Language models (LMs) face a fundamental challenge in how to perceive textual data through tokenization. Current subword tokenizers segment text into vocabulary tokens that cannot bridge whitespace, adhering to an artificial constraint that treats space as a semantic boundary. This practice ignores the reality that meaning often exceeds individual words – multi-word expressions like “a lot of” function as single semantic units, with English speakers mentally storing thousands of such phrases. Cross-linguistically, the same concepts may be expressed as single or multiple words, depending on the language. Notably, some languages like Chinese and Japanese use no whitespace, allowing tokens to span multiple words or sentences without apparent performance degradation.

Previous research has explored several approaches beyond traditional subword tokenization. Some studies investigated processing text at multiple granularity levels or creating multi-word tokens through frequency-based n-gram identification. Other researchers have explored multi-token prediction (MTP), allowing language models to predict various tokens in a single step, which confirms models’ capability to process more than one subword simultaneously. However, these approaches require architectural modifications and fix the number of tokens predicted per step. Some researchers have pursued tokenizer-free approaches, modeling text directly as byte sequences. However, this significantly increases sequence lengths and computational requirements, leading to complex architectural solutions.

Researchers from the University of Washington, NVIDIA, and the Allen Institute for AI have proposed SuperBPE, a tokenization algorithm that creates a vocabulary containing both traditional subword tokens and innovative "superword" tokens that span multiple words. This approach enhances the popular byte-pair encoding (BPE) algorithm with a pretokenization curriculum: whitespace boundaries are initially maintained to learn subword tokens, then removed to allow superword token formation. While standard BPE quickly reaches diminishing returns and begins using increasingly rare subwords as vocabulary size grows, SuperBPE continues discovering common multi-word sequences to encode as single tokens, improving encoding efficiency.

SuperBPE operates through a two-stage training process that modifies the pretokenization step of traditional BPE described above. This approach intuitively builds semantic units first and then combines them into common sequences for greater efficiency. Setting t=T (where t is the transition point and T is the target vocabulary size) reproduces standard BPE, while t=0 produces a naive whitespace-free BPE. Training SuperBPE requires more computational resources than standard BPE because, without whitespace pretokenization, the training data consists of extremely long "words" with minimal deduplication. However, this extra training cost amounts to a few hours on 100 CPUs and is incurred only once, which is negligible compared to the resources required for language model pretraining.
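
To make the curriculum concrete, here is a toy, self-contained sketch of the idea: byte-pair merges treat whitespace as a hard boundary before a transition point t and ignore it afterward. This is not the authors’ implementation and omits many details of real BPE trainers.

from collections import Counter

def train_toy_superbpe(corpus, target_size, transition):
    """Toy BPE: learn subword merges first, then allow merges across spaces ("superwords")."""
    seq = list(corpus)          # character-level symbols, spaces included
    merges = []
    while len(merges) < target_size:
        cross_word_allowed = len(merges) >= transition  # stage 2: drop the whitespace constraint
        pairs = Counter()
        for a, b in zip(seq, seq[1:]):
            if not cross_word_allowed and " " in (a + b):
                continue                                 # stage 1: never merge across whitespace
            pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges.append((a, b))
        out, i = [], 0                                   # apply the merge greedily over the sequence
        while i < len(seq):
            if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return merges

print(train_toy_superbpe("a lot of a lot of a lot of time", target_size=12, transition=6)[-3:])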

SuperBPE shows impressive performance across 30 benchmarks spanning knowledge, reasoning, coding, reading comprehension, and more. All SuperBPE models outperform the BPE baseline, with the strongest 8B model achieving an average improvement of 4.0% and surpassing the baseline on 25 out of 30 individual tasks. Multiple-choice tasks show substantial gains, with a +9.7% improvement. The only statistically significant underperformance occurs in the LAMBADA task, where SuperBPE experiences a final accuracy drop from 75.8% to 70.6%. Moreover, all reasonable transition points yield stronger results than the baseline. The most encoding-efficient transition point delivers a +3.1% performance improvement while reducing inference compute by 35%.

In conclusion, researchers introduced SuperBPE, a more effective tokenization approach developed by enhancing the standard BPE algorithm to incorporate superword tokens. Despite tokenization serving as the fundamental interface between language models and text, tokenization algorithms have remained relatively static. SuperBPE challenges this status quo by recognizing that tokens can extend beyond traditional subword boundaries to include multi-word expressions. SuperBPE tokenizers enable language models to achieve superior performance across numerous downstream tasks while reducing inference computational costs. These advantages require no modifications to the underlying model architecture, making SuperBPE a seamless replacement for traditional BPE in modern language model development pipelines.

Check out the Paper and Project Page. All credit for this research goes to the researchers of this project.

The post SuperBPE: Advancing Language Models with Cross-Word Tokenization appeared first on MarkTechPost.

TxAgent: An AI Agent that Delivers Evidence-Grounded Treatment Recomme …

Precision therapy has emerged as a critical approach in healthcare, tailoring treatments to individual patient profiles to optimise outcomes while reducing risks. However, determining the appropriate medication involves a complex analysis of numerous factors: patient characteristics, comorbidities, potential drug interactions, contraindications, current clinical guidelines, drug mechanisms, and disease biology. While Large Language Models (LLMs) have demonstrated therapeutic task capabilities through pretraining and fine-tuning medical data, they face significant limitations. These models lack access to updated biomedical knowledge, frequently generate hallucinations, and struggle to reason reliably across multiple clinical variables. Also, retraining LLMs with new medical information proves computationally prohibitive due to catastrophic forgetting. The models also risk incorporating unverified or deliberately misleading medical content from their extensive training data, further compromising their reliability in clinical applications.

Tool-augmented LLMs have been developed to address knowledge limitations through external retrieval mechanisms like retrieval-augmented generation (RAG). These systems attempt to overcome hallucination issues by fetching drug and disease information from external databases. However, they still fall short in executing the multi-step reasoning process essential for effective treatment selection. Precision therapy would benefit significantly from iterative reasoning capabilities where models could access verified information sources, systematically evaluate potential interactions, and dynamically refine treatment recommendations based on comprehensive clinical analysis.

Researchers from Harvard Medical School, MIT Lincoln Laboratory, Kempner Institute for the Study of Natural and Artificial Intelligence, Harvard University, Broad Institute of MIT and Harvard, and Harvard Data Science Initiative introduce TXAGENT, representing an innovative AI system delivering evidence-grounded treatment recommendations by integrating multi-step reasoning with real-time biomedical tools. The agent generates natural language responses while providing transparent reasoning traces that document its decision-making process. It employs goal-driven tool selection, accessing external databases and specialized machine learning models to ensure accuracy. Supporting this framework is TOOLUNIVERSE, a comprehensive biomedical toolbox containing 211 expert-curated tools covering drug mechanisms, interactions, clinical guidelines, and disease annotations. These tools incorporate trusted sources like openFDA, Open Targets, and the Human Phenotype Ontology. To optimize tool selection, TXAGENT implements TOOLRAG, an ML-based retrieval system that dynamically identifies the most relevant tools from TOOLUNIVERSE based on query context.

TXAGENT’s architecture integrates three core components: TOOLUNIVERSE, comprising 211 diverse biomedical tools; a specialized LLM fine-tuned for multi-step reasoning and tool execution; and the TOOLRAG model for adaptive tool retrieval. Tool compatibility is enabled through TOOLGEN, a multi-agent system that generates tools from API documentation. The agent undergoes fine-tuning with TXAGENT-INSTRUCT, an extensive dataset containing 378,027 instruction-tuning samples derived from 85,340 multi-step reasoning traces, encompassing 177,626 reasoning steps and 281,695 function calls. This dataset is generated by QUESTIONGEN and TRACEGEN, multi-agent systems that create diverse therapeutic queries and stepwise reasoning traces covering treatment information and drug data from FDA labels dating back to 1939.

TXAGENT demonstrates exceptional capabilities in therapeutic reasoning through its multi-tool approach. The system utilizes numerous verified knowledge bases, including FDA-approved drug labels and Open Targets, to ensure accurate and reliable responses with transparent reasoning traces. It excels in four key areas: knowledge grounding using tool calls, retrieving verified information from trusted sources; goal-oriented tool selection through the TOOLRAG model; multi-step therapeutic reasoning for complex problems requiring multiple information sources; and real-time retrieval from continuously updated knowledge sources. Importantly, TXAGENT successfully identified indications for Bizengri, a drug approved in December 2024, well after its base model’s knowledge cutoff, by querying the openFDA API directly rather than relying on outdated internal knowledge.

TXAGENT represents a significant advancement in AI-assisted precision medicine, addressing critical limitations of traditional LLMs through multi-step reasoning and targeted tool integration. By generating transparent reasoning trails alongside recommendations, the system provides interpretable decision-making processes for therapeutic problems. The integration of TOOLUNIVERSE enables real-time access to verified biomedical knowledge, allowing TXAGENT to make recommendations based on current data rather than static training information. This approach enables the system to stay current with newly approved medications, assess appropriate indications, and deliver evidence-based prescriptions. By grounding all responses in verified sources and providing traceable decision steps, TXAGENT establishes a new standard for trustworthy AI in clinical decision support.

Check out the Paper, Project Page and GitHub Page. All credit for this research goes to the researchers of this project.

The post TxAgent: An AI Agent that Delivers Evidence-Grounded Treatment Recommendations by Combining Multi-Step Reasoning with Real-Time Biomedical Tool Integration appeared first on MarkTechPost.

Meet LocAgent: Graph-Based AI Agents Transforming Code Localization fo …

Software maintenance is an integral part of the software development lifecycle, where developers frequently revisit existing codebases to fix bugs, implement new features, and optimize performance. A critical task in this phase is code localization, pinpointing specific locations in a codebase that must be modified. This process has gained significance with modern software projects’ increasing scale and complexity. The growing reliance on automation and AI-driven tools has led to integrating large language models (LLMs) in supporting tasks like bug detection, code search, and suggestion. However, despite the advancement of LLMs in language tasks, enabling these models to understand the semantics and structures of complex codebases remains a technical challenge researchers strive to overcome.

One of the most persistent problems in software maintenance is accurately identifying the relevant parts of a codebase that need changes based on user-reported issues or feature requests. Often, issue descriptions in natural language mention symptoms but not the actual root cause in code. This disconnect makes it difficult for developers and automated tools to link descriptions to the exact code elements needing updates. Furthermore, traditional methods struggle with complex code dependencies, especially when the relevant code spans multiple files or requires hierarchical reasoning. Poor code localization contributes to inefficient bug resolution, incomplete patches, and longer development cycles.

Prior methods for code localization mostly depend on dense retrieval models or agent-based approaches. Dense retrieval requires embedding the entire codebase into a searchable vector space, which is difficult to maintain and update for large repositories. These systems often perform poorly when issue descriptions lack direct references to relevant code. On the other hand, some recent approaches use agent-based models that simulate a human-like exploration of the codebase. However, they often rely on directory traversal and lack an understanding of deeper semantic links like inheritance or function invocation. This limits their ability to handle complex relationships between code elements not explicitly linked.

A team of researchers from Yale University, University of Southern California, Stanford University, and All Hands AI developed LocAgent, a graph-guided agent framework to transform code localization. Rather than depending on lexical matching or static embeddings, LocAgent converts entire codebases into directed heterogeneous graphs. These graphs include nodes for directories, files, classes, and functions and edges to capture relationships like function invocation, file imports, and class inheritance. This structure allows the agent to reason across multiple levels of code abstraction. The system then applies tools like SearchEntity, TraverseGraph, and RetrieveEntity to allow LLMs to explore the system step-by-step. The use of sparse hierarchical indexing ensures rapid access to entities, and the graph design supports multi-hop traversal, which is essential for finding connections across distant parts of the codebase.

LocAgent performs indexing within seconds and supports real-time usage, making it practical for developers and organizations. The researchers fine-tuned two open-source models, Qwen2.5-7B, and Qwen2.5-32B, on a curated set of successful localization trajectories. These models performed impressively on standard benchmarks. For instance, on the SWE-Bench-Lite dataset, LocAgent achieved 92.7% file-level accuracy using Qwen2.5-32B, compared to 86.13% with Claude-3.5 and lower scores from other models. On the newly introduced Loc-Bench dataset, which contains 660 examples across bug reports (282), feature requests (203), security issues (31), and performance problems (144), LocAgent again showed competitive results, achieving 84.59% Acc@5 and 87.06% Acc@10 at the file level. Even the smaller Qwen2.5-7B model delivered performance close to high-cost proprietary models while costing only $0.05 per example, a stark contrast to the $0.66 cost of Claude-3.5.

The core mechanism relies on a detailed graph-based indexing process. Each node, whether representing a class or function, is uniquely identified by a fully qualified name and indexed using BM25 for flexible keyword search. The model enables agents to simulate a reasoning chain that begins with extracting issue-relevant keywords, proceeds through graph traversals, and concludes with code retrievals for specific nodes. These actions are scored using a confidence estimation approach based on prediction consistency over multiple iterations. Notably, when the researchers disabled tools like TraverseGraph or SearchEntity, performance dropped by up to 18%, highlighting their importance. Further, multi-hop reasoning was critical; fixing traversal hops to one led to a decline in function-level accuracy from 71.53% to 66.79%.
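
As an illustration of the keyword-indexing idea only (using the third-party rank_bm25 package, not LocAgent’s actual code), BM25 search over fully qualified entity names and short descriptions could look like this; the entities below are hypothetical.

from rank_bm25 import BM25Okapi  # pip install rank-bm25

# Hypothetical code entities: fully qualified name plus a short description.
entities = {
    "repo.auth.login.validate_token": "check JWT signature and expiry",
    "repo.auth.session.refresh_session": "renew an expired user session",
    "repo.db.models.User": "user table definition and helpers",
}

tokenized = [(name + " " + desc).lower().replace(".", " ").split()
             for name, desc in entities.items()]
bm25 = BM25Okapi(tokenized)

query = "session token expired".lower().split()
scores = bm25.get_scores(query)
for name, score in sorted(zip(entities, scores), key=lambda x: -x[1]):
    print(f"{score:5.2f}  {name}")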

When applied to downstream tasks like GitHub issue resolution, LocAgent increased the issue pass rate (Pass@10) from 33.58% in baseline Agentless systems to 37.59% with the fine-tuned Qwen2.5-32B model. The framework’s modularity and open-source nature make it a compelling solution for organizations looking for in-house alternatives to commercial LLMs. The introduction of Loc-Bench, with its broader representation of maintenance tasks, ensures fair evaluation without contamination from pre-training data.

Some Key Takeaways from the Research on LocAgent include the following:

LocAgent transforms codebases into heterogeneous graphs for multi-level code reasoning.

It achieved up to 92.7% file-level accuracy on SWE-Bench-Lite with Qwen2.5-32B.

It reduced code localization cost by approximately 86% compared to proprietary models.

It introduced the Loc-Bench dataset with 660 examples: 282 bug reports, 203 feature requests, 31 security issues, and 144 performance problems.

Fine-tuned models (Qwen2.5-7B and Qwen2.5-32B) performed comparably to Claude-3.5.

Tools like TraverseGraph and SearchEntity proved essential, with accuracy drops when they were disabled.

It demonstrated real-world utility by improving GitHub issue resolution rates.

It offers a scalable, cost-efficient, and effective alternative to proprietary LLM solutions.

Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.

The post Meet LocAgent: Graph-Based AI Agents Transforming Code Localization for Scalable Software Maintenance appeared first on MarkTechPost.

A Unified Acoustic-to-Speech-to-Language Embedding Space Captures the …

Language processing in the brain presents a challenge due to its inherently complex, multidimensional, and context-dependent nature. Psycholinguists have attempted to construct well-defined symbolic features and processes for domains, such as phonemes for speech analysis and part-of-speech units for syntactic structures. Despite acknowledging some cross-domain interactions, research has focused on modeling each linguistic subfield in isolation through controlled experimental manipulations. This divide-and-conquer strategy shows limitations, as a significant gap has emerged between natural language processing and formal psycholinguistic theories. These models and theories struggle to capture the subtle, non-linear, context-dependent interactions occurring within and across levels of linguistic analysis.

Recent advances in LLMs have dramatically improved conversational language processing, summarization, and generation. These models excel in handling syntactic, semantic, and pragmatic properties of written text and in recognizing speech from acoustic recordings. Multimodal, end-to-end models represent a significant theoretical advancement over text-only models by providing a unified framework for transforming continuous auditory input into speech and word-level linguistic dimensions during natural conversations. Unlike traditional approaches, these deep acoustic-to-speech-to-language models shift to multidimensional vectorial representations where all elements of speech and language are embedded into continuous vectors across a population of simple computing units by optimizing straightforward objectives.

Researchers from Hebrew University, Google Research, Princeton University, Maastricht University, Massachusetts General Hospital and Harvard Medical School, New York University School of Medicine, and Harvard University have presented a unified computational framework that connects acoustic, speech, and word-level linguistic structures to investigate the neural basis of everyday conversations in the human brain. They utilized electrocorticography to record neural signals across 100 hours of natural speech production and comprehension as participants engaged in open-ended, real-life conversations. The team extracted several types of embeddings, including low-level acoustic, mid-level speech, and contextual word embeddings, from a multimodal speech-to-text model called Whisper. Their model predicts neural activity at each level of the language processing hierarchy across hours of previously unseen conversations.

The internal workings of the Whisper acoustic-to-speech-to-language model are examined to model and predict neural activity during daily conversations. Three types of embeddings are extracted from the model for every word patients speak or hear: acoustic embeddings from the auditory input layer, speech embeddings from the final speech encoder layer, and language embeddings from the decoder’s final layers. For each embedding type, electrode-wise encoding models are constructed to map the embeddings to neural activity during speech production and comprehension. The encoding models show a remarkable alignment between human brain activity and the model’s internal population code, accurately predicting neural responses across hundreds of thousands of words in conversational data.
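
Encoding models of this kind are commonly fit as regularized linear regressions from embedding features to electrode activity. The following is a generic scikit-learn sketch under that assumption, with synthetic placeholder data rather than the study’s recordings.

import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_words, embed_dim, n_electrodes = 5000, 384, 64

# Placeholder data: per-word embeddings and per-word neural responses (e.g., high-gamma power).
X = rng.standard_normal((n_words, embed_dim))
W_true = rng.standard_normal((embed_dim, n_electrodes)) * 0.1
Y = X @ W_true + rng.standard_normal((n_words, n_electrodes))

X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.2, random_state=0)
model = RidgeCV(alphas=np.logspace(-2, 4, 7)).fit(X_tr, Y_tr)  # one linear map per electrode

# Evaluate as the correlation between predicted and held-out activity, electrode by electrode.
pred = model.predict(X_te)
r = [np.corrcoef(pred[:, e], Y_te[:, e])[0, 1] for e in range(n_electrodes)]
print(f"mean encoding correlation: {np.mean(r):.2f}")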

The Whisper model’s acoustic, speech, and language embeddings show exceptional predictive accuracy for neural activity across hundreds of thousands of words during speech production and comprehension throughout the cortical language network. During speech production, hierarchical processing is observed: articulatory areas (preCG, postCG, STG) are better predicted by speech embeddings, while higher-level language areas (IFG, pMTG, AG) align with language embeddings. The encoding models show temporal specificity, with performance peaking more than 300ms before word onset during production and 300ms after onset during comprehension, with speech embeddings better predicting activity in perceptual and articulatory areas and language embeddings excelling in high-order language areas.

In summary, the acoustic-to-speech-to-language model offers a unified computational framework for investigating the neural basis of natural language processing. This integrated approach is a paradigm shift toward non-symbolic models based on statistical learning and high-dimensional embedding spaces. As these models evolve to process natural speech better, their alignment with cognitive processes may similarly improve. Some advanced models like GPT-4o incorporate visual modality alongside speech and text, while others integrate embodied articulation systems mimicking human speech production. The fast improvement of these models supports a shift to a unified linguistic paradigm that emphasizes the role of usage-based statistical learning in language acquisition as it is materialized in real-life contexts.

Check out the Paper, and Google Blog. All credit for this research goes to the researchers of this project.

The post A Unified Acoustic-to-Speech-to-Language Embedding Space Captures the Neural Basis of Natural Language Processing in Everyday Conversations appeared first on MarkTechPost.

Achieving Critical Reliability in Instruction-Following with LLMs: How …

Ensuring reliable instruction-following in LLMs remains a critical challenge. This is particularly important in customer-facing applications, where mistakes can be costly. Traditional prompt engineering techniques fail to deliver consistent results. A more structured and managed approach is necessary to improve adherence to business rules while maintaining flexibility.

This article explores key innovations, including granular atomic guidelines, dynamic evaluation and filtering of instructions, and Attentive Reasoning Queries (ARQs), while acknowledging implementation limitations and trade-offs.

The Challenge: Inconsistent AI Performance in Customer Service

LLMs are already providing tangible business value when used as assistants to human representatives in customer service scenarios. However, their reliability as autonomous customer-facing agents remains a challenge. Traditional approaches to developing conversational LLM applications often fail in real-world use cases. The two most common approaches are:

Iterative prompt engineering, which leads to inconsistent, unpredictable behavior.

Flowchart-based processing, which sacrifices the real magic of LLM-powered interactions: dynamic, free-flowing, human-like interactions.

In high-stakes customer-facing applications, such as banking, even minor errors can have serious consequences. For instance, an incorrectly executed API call (like transferring money) can lead to lawsuits and reputational damage. Conversely, mechanical interactions that lack naturalness and rapport hurt customer trust and engagement, limiting containment rates (cases resolved without human intervention).

For LLMs to reach their full potential as dynamic, autonomous agents in real-world cases, we must make them follow business-specific instructions consistently and at scale, while maintaining the flexibility of natural, free-flowing interactions.

How to Create a Reliable, Autonomous Customer Service Agent with LLMs

To address these gaps in LLMs and current approaches, and achieve a level of reliability and control that works well in real-world cases, we must question the approaches that failed. One of the first questions I had when I started working on Parlant (an open-source framework for customer-facing AI agents) was, "If an AI agent is found to mishandle a particular customer scenario, what would be the optimal process for fixing it?" Adding additional demands to an already-lengthy prompt, like "Here's how you should approach scenario X…" would quickly become complicated to manage, and the results weren't consistent anyhow. Besides that, adding those instructions unconditionally posed an alignment risk since LLMs are inherently biased by their input. It was therefore important that instructions for scenario X did not leak into other scenarios which potentially required a different approach.

We thus realized that instructions needed to apply only in their intended context. This made sense because, in real-life, when we catch unsatisfactory behavior in real-time in a customer-service interaction, we usually know how to correct it: We’re able to specify both what needs to improve as well as the context in which our feedback should apply. For example, “Be concise and to the point when discussing premium-plan benefits,” but “Be willing to explain our offering at length when comparing it to other solutions.”

In addition to this contextualization of instructions, in training a highly capable agent that can handle many use cases, we’d clearly need to tweak many instructions over time as we shaped our agent’s behavior to business needs and preferences. We needed a systematic approach.

Stepping back and rethinking, from first principles, our ideal expectations from modern AI-based interactions and how to develop them, this is what we understood about how such interactions should feel to customers:

Empathetic and coherent: Customers should feel in good hands when using AI.

Fluid, like Instant Messaging (IM): Allowing customers to switch topics back and forth, express themselves using multiple messages, and ask about multiple topics at a time.

Personalized: You should feel that the AI agent knows it’s speaking to you and understands your context.

From a developer perspective, we also realized that:

Crafting the right conversational UX is an evolutionary process. We should be able to confidently modify agent behavior in different contexts, quickly and easily, without worrying about breaking existing behavior.

Instructions should be respected consistently. This is hard to do with LLMs, which are inherently unpredictable creatures. An innovative solution was required.

Agent decisions should be transparent. The spectrum of possible issues related to natural language and behavior is too wide. Resolving issues in instruction-following without clear indications of how an agent interpreted our instructions in a given scenario would be highly impractical in production environments with deadlines.

Implementing Parlant’s Design Goals

Our main challenge was how to control and adjust an AI agent’s behavior while ensuring that instructions are not spoken in vain—that the AI agent implements them accurately and consistently. This led to a strategic design decision: granular, atomic guidelines.

1. Granular Atomic Guidelines

Complex prompts often overwhelm LLMs, leading to incomplete or inconsistent outputs with respect to the instructions they specify. We solved this in Parlant by dropping broad prompts for self-contained, atomic guidelines. Each guideline consists of:

Condition: A natural-language query that determines when the instruction should apply (e.g., “The customer inquires about a refund…”)

Action: The specific instruction the LLM should follow (e.g., “Confirm order details and offer an overview of the refund process.”)

By segmenting instructions into manageable units and systematically focusing the model’s attention on one guideline at a time, we could get the LLM to evaluate and enforce them with higher accuracy.
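
Conceptually (this is an illustrative sketch, not Parlant’s API), a guideline can be represented and matched like this. The condition_matches argument stands in for an LLM call that judges whether a condition applies to the current conversation; a simple keyword stub is used here.

from dataclasses import dataclass

@dataclass
class Guideline:
    condition: str  # natural-language query describing when the guideline applies
    action: str     # the instruction to follow when it does

guidelines = [
    Guideline("The customer inquires about a refund",
              "Confirm order details and offer an overview of the refund process."),
    Guideline("The customer compares us to other solutions",
              "Be willing to explain our offering at length."),
]

def relevant_guidelines(conversation_state, guidelines, condition_matches):
    """Keep only guidelines whose condition holds for the current conversation state."""
    return [g for g in guidelines if condition_matches(g.condition, conversation_state)]

stub = lambda condition, state: any(w in state.lower() for w in condition.lower().split() if len(w) > 5)
print(relevant_guidelines("I'd like a refund for order #123", guidelines, stub))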

2. Filtering and Supervision Mechanism

LLMs are highly influenced by the content of their prompts, even if parts of the prompt are not directly relevant to the conversation at hand.

Instead of presenting all guidelines at once, we made Parlant dynamically match and apply only the relevant set of instructions at each step of the conversation. This real-time matching can then be leveraged for:

Reduced cognitive overload for the LLM: We’d avoid prompt leaks and increase the model’s focus on the right instructions, leading to higher consistency. 

Supervision: We added a mechanism to highlight each guideline’s impact and enforce its application, increasing conformance across the board.

Explainability: Every evaluation and decision generated by the system includes a rationale detailing how guidelines were interpreted and the reasoning behind skipping or activating them at each point in the conversation.

Continuous improvement: By monitoring guideline effectiveness and agent interpretation, developers could easily refine their AI’s behavior over time. Because guidelines are atomic and supervised, you could easily make structured changes without breaking fragile prompts. 

3. Attentive Reasoning Queries (ARQs)

While “Chain of Thought” (CoT) prompting improves reasoning, it remains limited in its ability to maintain consistent, context-sensitive responses over time. Parlant introduces Attentive Reasoning Queries (ARQs)—a technique we’ve devised to ensure that multi-step reasoning stays effective, accurate, and predictable, even across thousands of runs. You can find our research paper on ARQs vs. CoT on parlant.io and arxiv.org.

ARQs work by directing the LLM’s attention back to high-priority instructions at key points in the response generation process, getting the LLM to attend to those instructions and reason about them right before it needs to apply them. We found that “localizing” the reasoning around the part of the response where a specific instruction needs to be applied provided significantly greater accuracy and consistency than a preliminary, nonspecific reasoning process like CoT.

Acknowledging Limitations

While these innovations improve instruction-following, there are challenges to consider:

Computational overhead: Implementing filtering and reasoning mechanisms increases processing time. However, with hardware and LLMs improving by the day, we saw this as a possibly controversial, yet strategic design choice.

Alternative approaches: In some low-risk applications, such as assistive AI co-pilots, simpler methods like prompt-tuning or workflow-based approaches often suffice.

Why Consistency Is Crucial for Enterprise-Grade Conversational AI

In regulated industries like finance, healthcare, and legal services, even 99% accuracy poses significant risk. A bank handling millions of monthly conversations cannot afford thousands of potentially critical errors. Beyond accuracy, AI systems must be constrained such that errors, even when they occur, remain within strict, acceptable bounds.

In response to the demand for greater accuracy in such applications, AI solution vendors often argue that humans also make mistakes. While this is true, the difference is that, with human employees, correcting them is usually straightforward. You can ask them why they handled a situation the way they did. You can provide direct feedback and monitor their results. But relying on “best-effort” prompt-engineering, while being blind to why an AI agent even made some decision in the first place, is an approach that simply doesn’t scale beyond basic demos.

This is why a structured feedback mechanism is so important. It allows you to pinpoint what changes need to be made, and how to make them while keeping existing functionality intact. It’s this realization that put us on the right track with Parlant early on.

Handling Millions of Customer Interactions with Autonomous AI Agents

For enterprises to deploy AI at scale, consistency and transparency are non-negotiable. A financial chatbot providing unauthorized advice, a healthcare assistant misguiding patients, or an e-commerce agent misrepresenting products can all have severe consequences.

Parlant redefines AI alignment by enabling:

Enhanced operational efficiency: Reducing human intervention while ensuring high-quality AI interactions.

Consistent brand alignment: Maintaining coherence with business values.

Regulatory compliance: Adhering to industry standards and legal requirements.

This methodology represents a shift in how AI alignment is approached in the first place. Using modular guidelines with intelligent filtering instead of long, complex prompts; adding explicit supervision and validation mechanisms to ensure things go as planned—these innovations mark a new standard for achieving reliability with LLMs. As AI-driven automation continues to expand in adoption, ensuring consistent instruction-following will become an accepted necessity, not an innovative luxury.

If your company is looking to deploy robust AI-powered customer service or any other customer-facing application, you should look into Parlant, an agent framework for controlled, explainable, and enterprise-ready AI interactions.

The post Achieving Critical Reliability in Instruction-Following with LLMs: How to Achieve AI Customer Service That’s 100% Reliable appeared first on MarkTechPost.

Fin-R1: A Specialized Large Language Model for Financial Reasoning and …

LLMs are advancing rapidly across multiple domains, yet their effectiveness in tackling complex financial problems remains an area of active investigation. The iterative development of LLMs has significantly driven the evolution of artificial intelligence toward artificial general intelligence (AGI). OpenAI’s o1 series and similar models like QwQ and Marco-o1 have improved complex reasoning capabilities by extending “chain-of-thought” reasoning through an iterative “exploration-reflection” approach. In finance, models such as XuanYuan-FinX1-Preview and Fino1 have showcased the potential of LLMs in cognitive reasoning tasks. Meanwhile, DeepSeekR1 adopts a different strategy, relying solely on RL with multi-stage training to enhance reasoning and inference abilities. By combining thousands of unsupervised RL training steps with a small cold-start dataset, DeepSeekR1 demonstrates strong emergent reasoning performance and readability, highlighting the effectiveness of RL-based methodologies in improving large-scale language models.

Despite these advancements, general-purpose LLMs struggle to adapt to specialized financial reasoning tasks. Financial decision-making requires interdisciplinary knowledge, including legal regulations, economic indicators, and mathematical modeling, while also demanding logical, step-by-step reasoning. Several challenges arise when deploying LLMs in financial applications. First, fragmented financial data complicates knowledge integration, leading to inconsistencies that hinder comprehensive understanding. Second, the black-box nature of LLMs makes their reasoning process difficult to interpret, conflicting with regulatory requirements for transparency and accountability. Finally, LLMs often struggle with generalization across financial scenarios, producing unreliable outputs in high-risk applications. These limitations pose significant barriers to their adoption in real-world financial systems, where accuracy and traceability are critical.

Researchers from Shanghai University of Finance & Economics, Fudan University, and FinStep have developed Fin-R1, a specialized LLM for financial reasoning. With a compact 7-billion-parameter architecture, Fin-R1 reduces deployment costs while addressing key financial challenges: fragmented data, lack of reasoning control, and weak generalization. It is trained on Fin-R1-Data, a high-quality dataset containing 60,091 chain-of-thought (CoT) examples sourced from authoritative financial data. A two-stage training approach—Supervised Fine-Tuning (SFT) followed by RL—enhances Fin-R1’s accuracy and interpretability. It performs well on financial benchmarks, excelling in financial compliance and robo-advisory applications.

The study presents a two-stage framework for constructing Fin-R1. The data generation phase involves creating a high-quality financial reasoning dataset, Fin-R1-Data, through data distillation with DeepSeek-R1 and filtering using an LLM-as-judge approach. In the model training phase, Fin-R1 is fine-tuned on Qwen2.5-7B-Instruct using SFT and Group Relative Policy Optimization (GRPO) to enhance reasoning and output consistency. The dataset combines open-source and proprietary financial data, refined through rigorous filtering. Training integrates supervised learning and reinforcement learning, incorporating structured prompts and reward mechanisms to improve financial reasoning accuracy and standardization.
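
GRPO’s central step, normalizing each sampled response’s reward against its group of samples, can be sketched as follows. This is an illustrative sketch of the general method, not Fin-R1’s training code.

import torch

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize each response's reward against the
    mean and standard deviation of its own group of sampled responses."""
    rewards = torch.as_tensor(rewards, dtype=torch.float32)  # shape: (group_size,)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Toy example: four sampled answers to one financial question, scored by a reward model
print(group_relative_advantages([1.0, 0.0, 0.5, 1.0]))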

The reasoning abilities of Fin-R1 in financial scenarios were evaluated through a comparative analysis against several state-of-the-art models, including DeepSeek-R1, Fin-R1-SFT, and various Qwen and Llama-based architectures. Despite its compact 7B parameter size, Fin-R1 achieved a notable average score of 75.2, ranking second overall. It outperformed all models of similar scale and exceeded DeepSeek-R1-Distill-Llama-70B by 8.7 points. Fin-R1 ranked highest in FinQA and ConvFinQA with scores of 76.0 and 85.0, respectively, demonstrating strong financial reasoning and cross-task generalization, particularly in benchmarks like Ant_Finance, TFNS, and Finance-Instruct-500K.

In conclusion, Fin-R1 is a large financial reasoning language model designed to tackle key challenges in financial AI, including fragmented data, inconsistent reasoning logic, and limited business generalization. It delivers state-of-the-art performance by utilizing a two-stage training process—SFT and RL—on the high-quality Fin-R1-Data dataset. With a compact 7B parameter scale, it achieves scores of 85.0 in ConvFinQA and 76.0 in FinQA, outperforming larger models. Future work aims to enhance financial multimodal capabilities, strengthen regulatory compliance, and expand real-world applications, driving innovation in fintech while ensuring efficient and intelligent financial decision-making.

Check out the Paper and Model on Hugging Face. All credit for this research goes to the researchers of this project.

The post Fin-R1: A Specialized Large Language Model for Financial Reasoning and Decision-Making appeared first on MarkTechPost.

Meta AI Researchers Introduced SWEET-RL and CollaborativeAgentBench: A …

Large language models (LLMs) are rapidly transforming into autonomous agents capable of performing complex tasks that require reasoning, decision-making, and adaptability. These agents are deployed in web navigation, personal assistance, and software development. To act effectively in real-world settings, these agents must handle multi-turn interactions that span several steps or decision points. This introduces the need for training methods beyond simple response generation and instead focuses on optimizing the entire trajectory of interactions. Reinforcement learning (RL) has emerged as a compelling approach to train such agents by refining their decision-making based on long-term rewards.

Despite their potential, LLM-based agents struggle with multi-turn decision-making. A major challenge lies in assigning proper credit to actions taken at earlier stages of interaction, which influence later outcomes. Traditional training methods rely on next-token prediction or imitate high-probability actions, which do not account for long-term dependencies or cumulative goals. As a result, these methods fail to address the high variance and inefficiency of long-horizon tasks, particularly in collaborative scenarios where understanding human intent and reasoning across multiple steps is critical.

Various reinforcement learning techniques have been adapted to fine-tune LLMs, especially in single-turn human feedback settings. Methods like PPO, RAFT, and DPO have been explored but exhibit significant limitations when applied to sequential interactions: they often fail at effective credit assignment across turns, making them less suited to multi-turn decision-making tasks. The benchmarks used to evaluate such methods also lack the diversity and complexity required to robustly assess performance in collaborative, real-world settings. Value-based learning approaches are another alternative, but their need for custom heads and large amounts of task-specific fine-tuning data limits their generalization capabilities.

Researchers from FAIR at Meta and UC Berkeley proposed a new reinforcement learning method called SWEET-RL (Step-WisE Evaluation from Training-time Information). They also introduced a benchmark known as CollaborativeAgentBench, or ColBench. This benchmark is central to the study, providing over 10,000 training tasks and over 1,000 test cases across two domains: backend programming and frontend design. ColBench simulates real collaboration between an AI agent and a human partner, where agents must ask questions, refine their understanding, and provide iterative solutions. For programming, agents must write Python functions, asking clarifying questions to fill in missing specifications. For frontend tasks, agents must generate HTML code that matches a visual target through feedback-based corrections. Each task is designed to stretch the reasoning ability of the agent and mimic real-world constraints, such as interactions capped at 10 turns per session.
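As a rough illustration of this setup, the following sketch shows a turn-capped collaboration session; the agent and human-simulator interfaces are hypothetical placeholders, not ColBench's actual API.

MAX_TURNS = 10  # ColBench caps each collaboration session at 10 turns

def run_session(agent, human_simulator, task):
    """Minimal sketch of one turn-capped collaboration episode (interfaces assumed)."""
    history = [task.description]
    for _ in range(MAX_TURNS):
        message = agent.respond(history)                 # a clarifying question, draft code, or HTML
        history.append(message)
        feedback, done = human_simulator.reply(history)  # clarification or acceptance
        history.append(feedback)
        if done:
            break
    return task.evaluate(history)  # unit tests for backend code, visual similarity for frontend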

SWEET-RL is built around an asymmetric actor-critic structure. The critic has access to additional information during training, such as the correct solution, which is not visible to the actor. This information allows the critic to evaluate each decision made by the agent at a much finer resolution. Instead of training a value function that estimates overall reward, SWEET-RL directly models an advantage function at each turn, using the Bradley-Terry optimization objective. The advantage function measures how much better or worse a particular action is compared to alternatives, helping the agent learn precise behaviors. For example, if an action aligns better with the human partner's expectation, it receives a higher advantage score. This method simplifies credit assignment and aligns better with the token-level prediction objective used to pre-train LLMs.
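A minimal sketch of that training signal, assuming the critic has already produced per-turn advantage scores for a preferred and a rejected action, is a Bradley-Terry objective over the score difference; the pairing scheme and example numbers below are assumptions for illustration.

import numpy as np

def bradley_terry_loss(adv_chosen, adv_rejected):
    """Negative log-likelihood that the chosen action beats the rejected one
    under a Bradley-Terry model of the critic's turn-level advantage scores."""
    margin = np.asarray(adv_chosen, dtype=float) - np.asarray(adv_rejected, dtype=float)
    return float(np.mean(np.log1p(np.exp(-margin))))  # equals -log(sigmoid(margin)), averaged

# Hypothetical critic scores for three turn-level comparisons.
print(bradley_terry_loss([1.2, 0.4, 2.0], [0.3, 0.9, -0.5]))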

SWEET-RL achieved a 6% absolute improvement over other multi-turn reinforcement learning methods across both programming and design tasks. On backend programming tasks, it passed 48.0% of tests and achieved a success rate of 34.4%, compared to 28.2% for Multi-Turn DPO and 22.4% for zero-shot performance. On frontend design tasks, it reached a cosine similarity score of 76.9% and a win rate of 40.4%, improving from 38.6% with DPO and 33.8% with fine-tuning. Even when evaluated against top proprietary models like GPT-4o and O1-Mini, SWEET-RL closed the performance gap significantly, enabling the open-source Llama-3.1-8B model to match or exceed GPT-4o’s frontend win rate of 40.4%.

This research demonstrates that effective training of interactive agents hinges on precise, turn-by-turn feedback rather than generalized value estimations or broad supervision. SWEET-RL significantly improves credit assignment by leveraging training-time information and an architecture-aligned optimization approach. It enhances generalization, reduces training variance, and shows strong scalability, achieving better results with increased data. The algorithm also remains effective when applied to off-policy datasets, underlining its practicality in real-world scenarios with imperfect data. The research team created a meaningful evaluation framework by introducing ColBench as a benchmark tailored for realistic, multi-turn tasks. This combination with SWEET-RL provides a strong foundation for developing agents that can reason, adapt, and collaborate effectively over extended interactions.

Several key takeaways from this research include:

SWEET-RL improved backend programming success rates from 28.2% (DPO) to 34.4% and frontend win rates from 38.6% to 40.4%.  

It allowed Llama-3.1-8B to match the performance of GPT-4o, reducing dependency on proprietary models.  

The critic uses training-time information (e.g., correct solutions) that is invisible to the actor, creating an asymmetric training setup.  

Tasks in ColBench are capped at 10 rounds per session and include over 10,000 procedurally generated training examples.  

ColBench measures outcomes using unit test pass rates (for code) and cosine similarity (for web design), providing reliable evaluation; a minimal sketch of both metrics follows this list.

SWEET-RL directly learns a turn-wise advantage function, improving credit assignment without needing an intermediate value function.  

The model scales effectively with more data and performs well even on off-policy datasets from weaker models.  

Compared to traditional fine-tuning methods, SWEET-RL delivers higher performance with less overfitting and greater generalization.
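The two ColBench metrics mentioned above are straightforward to compute once an embedding of the generated and target pages is available; the sketch below shows both, with the embedding step itself left as an assumption.

import numpy as np

def pass_rate(test_results):
    """Fraction of unit tests passed by the agent's final code (backend tasks)."""
    return sum(bool(r) for r in test_results) / len(test_results)

def cosine_similarity(a, b):
    """Cosine similarity between embeddings of the generated and target pages (frontend tasks)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(pass_rate([True, True, False, True]))                    # 0.75
print(cosine_similarity([0.2, 0.9, 0.1], [0.25, 0.85, 0.05]))  # close to 1.0 for near-identical pages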

Check out the Paper, GitHub Page and Dataset. All credit for this research goes to the researchers of this project.

The post Meta AI Researchers Introduced SWEET-RL and CollaborativeAgentBench: A Step-Wise Reinforcement Learning Framework to Train Multi-Turn Language Agents for Realistic Human-AI Collaboration Tasks appeared first on MarkTechPost.

Microsoft AI Releases RD-Agent: An AI-Driven Tool for Performing R& …

Research and development (R&D) is crucial in driving productivity, particularly in the AI era. However, conventional automation methods in R&D often lack the intelligence to handle complex research challenges and innovation-driven tasks, making them less effective than human experts. Human researchers, by contrast, leverage deep domain knowledge to generate ideas, test hypotheses, and refine processes through iterative experimentation. The rise of LLMs offers a potential solution by introducing advanced reasoning and decision-making capabilities, allowing them to function as intelligent agents that enhance efficiency in data-driven R&D workflows.

Despite their potential, LLMs must overcome key challenges to deliver meaningful industrial impact in R&D. A major limitation is their inability to evolve beyond their initial training, restricting their capacity to adapt to emerging developments. Additionally, while LLMs possess broad general knowledge, they often lack the depth required for specialized domains, limiting their effectiveness in solving industry-specific problems. To maximize their impact, LLMs must continuously acquire specialized knowledge through practical industry applications, ensuring they remain relevant and capable of addressing complex R&D challenges.

Researchers at Microsoft Research Asia have developed RD-Agent, an AI-powered tool designed to automate R&D processes using LLMs. RD-Agent operates through an autonomous framework with two key components: Research, which generates and explores new ideas, and Development, which implements them. The system continuously improves through iterative refinement. RD-Agent functions as both a research assistant and a data-mining agent, automating tasks like reading papers, identifying financial and healthcare data patterns, and optimizing feature engineering. Now open-source on GitHub, RD-Agent is actively evolving to support more applications and enhance industry productivity.
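To picture how the two components interact, here is a minimal sketch of the propose-implement-feedback cycle; the agent interfaces and the evaluate hook are hypothetical placeholders rather than RD-Agent's actual API.

def rd_loop(research_agent, development_agent, evaluate, n_iterations=5):
    """Sketch of RD-Agent-style iteration: propose an idea, implement it, learn from feedback."""
    knowledge = []  # accumulated findings fed back into idea generation
    for _ in range(n_iterations):
        hypothesis = research_agent.propose(knowledge)        # e.g. a new factor, feature, or model tweak
        implementation = development_agent.implement(hypothesis)
        feedback = evaluate(implementation)                   # backtest, benchmark, or experiment result
        knowledge.append((hypothesis, feedback))
    return knowledge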

In R&D, two primary challenges must be addressed: enabling continuous learning and acquiring specialized knowledge. Traditional LLMs, once trained, struggle to expand their expertise, limiting their ability to tackle industry-specific problems. To overcome this, RD-Agent employs a dynamic learning framework that integrates real-world feedback, allowing it to refine hypotheses and accumulate domain knowledge over time. By automating the research process, RD-Agent continuously proposes, tests, and improves ideas, linking scientific exploration with real-world validation. This iterative feedback loop ensures that knowledge is systematically acquired and applied, much as human experts refine their understanding through experience.

In the development phase, RD-Agent enhances efficiency by prioritizing tasks and optimizing execution strategies through Co-STEER, a data-driven approach that evolves via continuous learning. This system begins with simple tasks and refines its development methods based on real-world feedback. To evaluate R&D capabilities, researchers have introduced RD2Bench, a benchmarking system that assesses LLM agents on model and data development tasks. Looking ahead, automating feedback comprehension, task scheduling, and cross-domain knowledge transfer remains a major challenge. By integrating research and development processes through continuous feedback, RD-Agent aims to revolutionize automated R&D, boosting innovation and efficiency across disciplines.

In conclusion, RD-Agent is an open-source AI-driven framework designed to automate and enhance R&D processes. It integrates two core components, Research for idea generation and Development for implementation, to ensure continuous improvement through iterative feedback. By incorporating real-world data, RD-Agent evolves dynamically and acquires specialized knowledge. The system employs Co-STEER, a data-centric approach, and RD2Bench, a benchmarking tool, to refine development strategies and evaluate AI-driven R&D capabilities. This integrated approach enhances innovation, fosters cross-domain knowledge transfer, and improves efficiency, marking a significant step toward intelligent and automated research and development.

Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.

The post Microsoft AI Releases RD-Agent: An AI-Driven Tool for Performing R&D with LLM-based Agents appeared first on MarkTechPost.

Kyutai Releases MoshiVis: The First Open-Source Real-Time Speech Model …

Artificial intelligence has made significant strides in recent years, yet integrating real-time speech interaction with visual content remains a complex challenge. Traditional systems often rely on separate components for voice activity detection, speech recognition, textual dialogue, and text-to-speech synthesis. This segmented approach can introduce delays and may not capture the nuances of human conversation, such as emotions or non-speech sounds. These limitations are particularly evident in applications designed to assist visually impaired individuals, where timely and accurate descriptions of visual scenes are essential.

Addressing these challenges, Kyutai has introduced MoshiVis, an open-source Vision Speech Model (VSM) that enables natural, real-time speech interactions about images. Building upon their earlier work with Moshi—a speech-text foundation model designed for real-time dialogue—MoshiVis extends these capabilities to include visual inputs. This enhancement allows users to engage in fluid conversations about visual content, marking a noteworthy advancement in AI development.

Technically, MoshiVis augments Moshi by integrating lightweight cross-attention modules that infuse visual information from an existing visual encoder into Moshi’s speech token stream. This design ensures that Moshi’s original conversational abilities remain intact while introducing the capacity to process and discuss visual inputs. A gating mechanism within the cross-attention modules enables the model to selectively engage with visual data, maintaining efficiency and responsiveness. Notably, MoshiVis adds approximately 7 milliseconds of latency per inference step on consumer-grade devices, such as a Mac Mini with an M4 Pro Chip, resulting in a total of 55 milliseconds per inference step. This performance stays well below the 80-millisecond threshold for real-time latency, ensuring smooth and natural interactions.
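The general shape of such a gated cross-attention adapter can be sketched as follows; the layer sizes, gating form, and initialization are assumptions for illustration, not the released MoshiVis architecture.

import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    """Speech tokens attend to visual features; a learned gate (initialized at zero,
    i.e. closed) controls how much visual signal is injected into the speech stream."""
    def __init__(self, dim, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(dim))  # gate starts closed: behavior matches the base model

    def forward(self, speech_tokens, image_features):
        visual_ctx, _ = self.attn(speech_tokens, image_features, image_features)
        return speech_tokens + torch.tanh(self.gate) * visual_ctx

# Toy shapes: batch of 1, 16 speech tokens, 49 image patches, hidden size 512.
x = torch.randn(1, 16, 512)
img = torch.randn(1, 49, 512)
print(GatedCrossAttention(512)(x, img).shape)  # torch.Size([1, 16, 512])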

In practical applications, MoshiVis demonstrates its ability to provide detailed descriptions of visual scenes through natural speech. For instance, when presented with an image depicting green metal structures surrounded by trees and a building with a light brown exterior, MoshiVis articulates:

“I see two green metal structures with a mesh top, and they’re surrounded by large trees. In the background, you can see a building with a light brown exterior and a black roof, which appears to be made of stone.”

This capability opens new avenues for applications such as providing audio descriptions for the visually impaired, enhancing accessibility, and enabling more natural interactions with visual information. By releasing MoshiVis as an open-source project, Kyutai invites the research community and developers to explore and expand upon this technology, fostering innovation in vision-speech models. The availability of the model weights, inference code, and visual speech benchmarks further supports collaborative efforts to refine and diversify the applications of MoshiVis.

In conclusion, MoshiVis represents a significant advancement in AI, merging visual understanding with real-time speech interaction. Its open-source nature encourages widespread adoption and development, paving the way for more accessible and natural interactions with technology. As AI continues to evolve, innovations like MoshiVis bring us closer to seamless integration of multimodal understanding, enhancing user experiences across various domains.

Check out the Technical details and Try it here. All credit for this research goes to the researchers of this project.

The post Kyutai Releases MoshiVis: The First Open-Source Real-Time Speech Model that can Talk About Images appeared first on MarkTechPost.