TikTok Researchers Introduce SWE-Perf: The First Benchmark for Repository-Level Code Performance Optimization

Introduction

As large language models (LLMs) advance in software engineering tasks—ranging from code generation to bug fixing—performance optimization remains an elusive frontier, especially at the repository level. To bridge this gap, researchers from TikTok and collaborating institutions have introduced SWE-Perf—the first benchmark specifically designed to evaluate the ability of LLMs to optimize code performance in real-world repositories.

Unlike prior benchmarks focused on correctness or function-level efficiency (e.g., SWE-Bench, Mercury, EFFIBench), SWE-Perf captures the complexity and contextual depth of repository-scale performance tuning. It provides a reproducible, quantitative foundation to study and improve the performance optimization capabilities of modern LLMs.

Image source: https://arxiv.org/abs/2507.12415

Why SWE-Perf Is Needed

Real-world codebases are often large, modular, and intricately interdependent. Optimizing them for performance requires understanding of cross-file interactions, execution paths, and computational bottlenecks—challenges beyond the scope of isolated function-level datasets.

LLMs today are largely evaluated on tasks like syntax correction or small function transformations. But in production environments, performance tuning across repositories can yield more substantial system-wide benefits. SWE-Perf is explicitly built to measure LLM capabilities in such settings.

Image source: https://arxiv.org/abs/2507.12415

Dataset Construction

SWE-Perf is constructed from over 100,000 pull requests across high-profile GitHub repositories. The final dataset covers 9 repositories and includes:

140 curated instances demonstrating measurable and stable performance improvements.

Complete codebases pre- and post-optimization.

Target functions categorized as oracle (file-level) or realistic (repo-level).

Unit tests and Docker environments for reproducible execution and performance measurement.

Expert-authored patches used as gold standards.

To ensure validity, each unit test must:

Pass before and after the patch.

Show statistically significant runtime gains over 20 repetitions (Mann-Whitney U test, p < 0.1).

Performance is measured using minimum performance gain (δ), isolating statistical improvements attributable to the patch while filtering noise.
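For intuition, the following minimal Python sketch mirrors this gating idea. It is not the authors' harness: the runtime lists are made up, and the conservative gain formula shown is only one possible reading of δ, not the paper's exact definition.

from scipy.stats import mannwhitneyu

def is_valid_improvement(pre_runtimes, post_runtimes, alpha=0.1):
    """Gate a patch: post-patch runtimes must be significantly faster than pre-patch."""
    # One-sided Mann-Whitney U test: are post-patch runtimes stochastically smaller?
    _, p_value = mannwhitneyu(post_runtimes, pre_runtimes, alternative="less")
    return p_value < alpha

def conservative_gain(pre_runtimes, post_runtimes):
    # Illustrative "minimum gain": worst post-patch run vs. best pre-patch run,
    # so random measurement noise cannot be credited to the patch.
    return max(min(pre_runtimes) - max(post_runtimes), 0.0)

pre = [2.10, 2.05, 2.12, 2.08, 2.11, 2.09, 2.07, 2.13, 2.06, 2.10] * 2   # 20 repetitions
post = [1.81, 1.79, 1.84, 1.80, 1.82, 1.78, 1.83, 1.80, 1.79, 1.81] * 2
print(is_valid_improvement(pre, post), conservative_gain(pre, post))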

Benchmark Settings: Oracle vs. Realistic

Oracle Setting: The model receives only the target functions and corresponding files. This setting tests localized optimization skills.

Realistic Setting: The model is given an entire repository and must identify and optimize performance-critical paths autonomously. This is a closer analog to how human engineers work.

Evaluation Metrics

SWE-Perf defines a three-tier evaluation framework, reporting each metric independently:

Apply: Can the model-generated patch be applied cleanly?

Correctness: Does the patch preserve functional integrity (all unit tests pass)?

Performance: Does the patch yield measurable runtime improvement?

The metrics are not aggregated into a single score, allowing more nuanced evaluation of tradeoffs between syntactic correctness and performance gains.
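As an illustration only (not the official evaluation harness; the per-instance result fields are assumed), reporting the three tiers separately could look like this:

def summarize(results):
    """Tally Apply, Correctness, and Performance independently over benchmark instances."""
    n = len(results)
    apply_rate = sum(r["applied"] for r in results) / n
    correctness = sum(r["applied"] and r["tests_passed"] for r in results) / n
    performance = sum(r["gain"] for r in results) / n   # gain is 0.0 when not applicable
    return {"apply": apply_rate, "correctness": correctness, "performance": performance}

results = [
    {"applied": True,  "tests_passed": True,  "gain": 0.021},   # applied, correct, faster
    {"applied": True,  "tests_passed": False, "gain": 0.0},     # applied but broke a test
    {"applied": False, "tests_passed": False, "gain": 0.0},     # patch did not apply
]
print(summarize(results))   # three numbers, never collapsed into one score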

Experimental Results

The benchmark evaluates several top-tier LLMs under both oracle and realistic settings:

Model | Setting | Performance (%)
Claude-4-opus | Oracle | 1.28
GPT-4o | Oracle | 0.60
Gemini-2.5-Pro | Oracle | 1.48
Claude-3.7 (Agentless) | Realistic | 0.41
Claude-3.7 (OpenHands) | Realistic | 2.26
Expert (Human Patch) | – | 10.85

Notably, even the best-performing LLM configurations fall significantly short of human-level performance. The agent-based method OpenHands, built on Claude-3.7-sonnet, outperforms other configurations in the realistic setting but still lags behind expert-crafted optimizations.

Key Observations

Agent-based frameworks like OpenHands are better suited for complex, multi-step optimization, outperforming direct model prompts and pipeline-based approaches like Agentless.

Performance degrades as the number of target functions increases—LLMs struggle with broader optimization scopes.

LLMs exhibit limited scalability in long-runtime scenarios, where expert systems continue to show performance gains.

Patch analysis shows LLMs focus more on low-level code structures (e.g., imports, environment setup), while experts target high-level semantic abstractions for performance tuning.

Conclusion

SWE-Perf represents a pivotal step toward measuring and improving the performance optimization capabilities of LLMs in realistic software engineering workflows. It uncovers a significant capability gap between existing models and human experts, offering a strong foundation for future research in repository-scale performance tuning. As LLMs evolve, SWE-Perf can serve as a north star guiding them toward practical, production-ready software enhancement at scale.

Check out the Paper, GitHub Page and Project. All credit for this research goes to the researchers of this project.


Build an AI-powered automated summarization system with Amazon Bedrock …

Extracting meaningful insights from unstructured data presents significant challenges for many organizations. Meeting recordings, customer interactions, and interviews contain invaluable business intelligence that remains largely inaccessible due to the prohibitive time and resource costs of manual review. Organizations frequently struggle to efficiently capture and use key information from these interactions, resulting in not only productivity gaps but also missed opportunities to use critical decision-making information.
This post introduces a serverless meeting summarization system that harnesses the advanced capabilities of Amazon Bedrock and Amazon Transcribe to transform audio recordings into concise, structured, and actionable summaries. By automating this process, organizations can reclaim countless hours while making sure key insights, action items, and decisions are systematically captured and made accessible to stakeholders.
Many enterprises have standardized on infrastructure as code (IaC) practices using Terraform, often as a matter of organizational policy. These practices are typically driven by the need for consistency across environments, seamless integration with existing continuous integration and delivery (CI/CD) pipelines, and alignment with broader DevOps strategies. For these organizations, having AWS solutions implemented with Terraform helps them maintain governance standards while adopting new technologies. Enterprise adoption of IaC continues to grow rapidly as organizations recognize the benefits of automated, version-controlled infrastructure deployment.
This post addresses this need by providing a complete Terraform implementation of a serverless audio summarization system. With this solution, organizations can deploy an AI-powered meeting summarization solution while maintaining their infrastructure governance standards. The business benefits are substantial: reduced meeting follow-up time, improved knowledge sharing, consistent action item tracking, and the ability to search across historical meeting content. Teams can focus on acting upon meeting outcomes rather than struggling to document and distribute them, driving faster decision-making and better organizational alignment.
What are Amazon Bedrock and Amazon Transcribe?
Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, DeepSeek, Luma, Meta, Mistral AI, poolside (coming soon), Stability AI, TwelveLabs (coming soon), Writer, and Amazon Nova through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI. With Amazon Bedrock, you can experiment with and evaluate top FMs for your use case, customize them with your data using techniques such as fine-tuning and Retrieval Augmented Generation (RAG), and build agents that execute tasks using your enterprise systems and data sources.
Amazon Transcribe is a fully managed, automatic speech recognition (ASR) service that makes it straightforward for developers to add speech to text capabilities to their applications. It is powered by a next-generation, multi-billion parameter speech FM that delivers high-accuracy transcriptions for streaming and recorded speech. Thousands of customers across industries use it to automate manual tasks, unlock rich insights, increase accessibility, and boost discoverability of audio and video content.
Solution overview
Our comprehensive audio processing system combines powerful AWS services to create a seamless end-to-end solution for extracting insights from audio content. The architecture consists of two main components: a user-friendly frontend interface that handles customer interactions and file uploads, and a backend processing pipeline that transforms raw audio into valuable, structured information. This serverless architecture facilitates scalability, reliability, and cost-effectiveness while delivering insightful AI-driven analysis capabilities without requiring specialized infrastructure management.
The frontend workflow consists of the following steps:

Users upload audio files through a React-based frontend delivered globally using Amazon CloudFront.
Amazon Cognito provides secure authentication and authorization for users.
The application retrieves meeting summaries and statistics through the AWS AppSync GraphQL API, which invokes AWS Lambda functions to query the data stored in Amazon DynamoDB.

The processing consists of the following steps:

Audio files are stored in an Amazon Simple Storage Service (Amazon S3) bucket.
When an audio file is uploaded to Amazon S3 in the audio/{user_id}/ prefix, an S3 event notification sends a message to an Amazon Simple Queue Service (Amazon SQS) queue.
The SQS queue triggers a Lambda function, which initiates the processing workflow.
AWS Step Functions orchestrates the entire transcription and summarization workflow with built-in error handling and retries.
Amazon Transcribe converts speech to text with high accuracy.
Amazon Bedrock uses an FM (specifically Anthropic’s Claude) to generate comprehensive, structured summaries.
Results are stored in both Amazon S3 (raw data) and Amazon DynamoDB (structured data) for persistence and quick retrieval.

For additional security, AWS Identity and Access Management helps manage identities and access to AWS services and resources.
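As a rough sketch of step 3 above, the queue-processing Lambda can forward each S3 event it receives from SQS into the Step Functions workflow. This is illustrative only, not the repository's exact code, and it assumes the state machine ARN is supplied through a STATE_MACHINE_ARN environment variable:

import json
import os

import boto3

sfn = boto3.client("stepfunctions")
STATE_MACHINE_ARN = os.environ["STATE_MACHINE_ARN"]  # assumed environment variable

def handler(event, context):
    """Triggered by SQS; each record wraps an S3 event for audio/{user_id}/ uploads."""
    for record in event["Records"]:
        s3_event = json.loads(record["body"])
        for s3_record in s3_event.get("Records", []):
            bucket = s3_record["s3"]["bucket"]["name"]
            key = s3_record["s3"]["object"]["key"]
            # Start the transcription and summarization workflow for this audio file.
            sfn.start_execution(
                stateMachineArn=STATE_MACHINE_ARN,
                input=json.dumps({"bucket": bucket, "key": key}),
            )
    return {"statusCode": 200}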
The following diagram illustrates this architecture.

This architecture provides several key benefits:

Fully serverless – Automatic scaling and no infrastructure to manage
Event-driven – Real-time responses from components based on events
Resilient – Built-in error handling and retry mechanism
Secure – Authentication, authorization, and encryption throughout
Cost-effective – Pay-per-use price model
Globally available – Content delivery optimized for users worldwide
Highly extensible – Seamless integration with additional services

Let’s walk through the key components of our solution in more detail.
Project structure
Our meeting audio summarizer project follows a structure with frontend and backend components:

sample-meeting-audio-summarizer-in-terraform/                                         
├── backend/                                                                          
│   ├── functions/                           # Lambda function code                   
│   │   ├── audio-processing/                # Audio processing functions             
│   │   ├── authentication/                  # Authentication functions               
│   │   ├── data-access/                     # Data access functions                  
│   │   ├── queue-processing/                # SQS queue processing functions         
│   │   ├── summarization/                   # Summarization functions                
│   │   ├── transcription/                   # Transcription functions                
│   │   └── zipped/                          # Zipped Lambda functions for deployment 
│   └── terraform/                           # Infrastructure as Code                 
│       ├── modules/                         # Terraform modules                      
│       │   ├── api/                         # AppSync GraphQL API                    
│       │   ├── auth/                        # Cognito authentication                 
│       │   ├── compute/                     # Lambda functions                       
│       │   ├── messaging/                   # SQS queues and S3 notifications        
│       │   ├── network/                     # CloudFront and S3 website              
│       │   ├── orchestration/               # Step Functions                         
│       │   ├── queue-processor/             # Queue processing Lambda                
│       │   └── storage/                     # S3 and DynamoDB                        
│       ├── main.tf                          # Main Terraform configuration           
│       ├── outputs.tf                       # Output values                          
│       ├── variables.tf                     # Input variables                        
│       └── terraform.tfvars                 # Variable values                        
├── docs/                                    # Documentation and architecture diagrams
├── frontend/                                # React web application                  
│   ├── public/                              # Public assets                          
│   └── src/                                 # React application source               
│       ├── components/                      # React components                       
│       ├── graphql/                         # GraphQL queries and mutations          
│       ├── pages/                           # Page components                        
│       └── services/                        # Service integrations                   
└── scripts/                                 # Deployment and utility scripts
    ├── deploy.sh                            # Main deployment script
    └── zip-lambdas.sh                       # Script to zip all backend lambdas

Infrastructure setup with Terraform
Our solution uses Terraform to define and provision the AWS infrastructure in a consistent and repeatable way. The main Terraform configuration orchestrates the various modules. The following code shows three of them:

# Compute Module - Lambda functions
module "compute" {
  source = "./modules/compute"

  aws_region                        = var.aws_region
  aws_account                       = data.aws_caller_identity.current.account_id
  meeting_statistics_table_name     = var.meeting_statistics_table_name
  meeting_summaries_table_name      = var.meeting_summaries_table_name
  cognito_user_pool_id              = module.auth.cognito_user_pool_id
  iam_roles                         = module.auth.iam_roles
  storage_bucket                    = module.storage.storage_bucket
  model_id                          = var.model_id
  inference_profile_prefix          = var.inference_profile_prefix
}

# Orchestration Module - Step Functions
module "orchestration" {
  source = "./modules/orchestration"

  aws_region                              = var.aws_region
  aws_account                             = data.aws_caller_identity.current.account_id
  storage_bucket                          = module.storage.storage_bucket
  iam_roles                               = module.auth.iam_roles
  lambda_functions                        = module.compute.lambda_functions
}

# Queue Processor Module - ProcessTranscriptionQueueFunction Lambda
module "queue_processor" {
  source = "./modules/queue-processor"

  storage_bucket                     = module.storage.storage_bucket
  state_machine_arn                  = module.orchestration.state_machine_arn
  lambda_function_transcription_role = module.auth.iam_roles.lambda_function_transcription_role

  depends_on = [
    module.storage,
    module.orchestration
  ]
}

Audio processing workflow
The core of our solution is a Step Functions workflow that orchestrates the processing of audio files. The workflow handles language detection, transcription, summarization, and notification in a resilient way with proper error handling.
Amazon Bedrock for summarization
The summarization component is powered by Amazon Bedrock, which provides access to state-of-the-art FMs. Our solution uses Anthropic’s Claude 3.7 Sonnet version 1 to generate comprehensive meeting summaries:
prompt = f"""Even if it is a raw transcript of a meeting discussion, lacking clear structure and context and containing multiple speakers, incomplete sentences, and tangential topics, PLEASE PROVIDE a clear and thorough analysis as detailed as possible of this conversation. DO NOT miss any information. CAPTURE as much information as possible. Use bullet points instead of dashes in your summary. IMPORTANT: For ALL section headers, use plain text with NO markdown formatting (no #, ##, **, or * symbols). Each section header should be in ALL CAPS followed by a colon. For example: "TITLE:" not "# TITLE" or "## TITLE".
CRITICAL INSTRUCTION: DO NOT use any markdown formatting symbols like #, ##, **, or * in your response, especially for the TITLE section. The TITLE section MUST start with "TITLE:" and not "# TITLE:" or any variation with markdown symbols.
FORMAT YOUR RESPONSE EXACTLY AS FOLLOWS:
TITLE: Give the meeting a short title of 2 or 3 words that is related to the overall context of the meeting; find a unique name such as a company name or stakeholder and include it in the title
TYPE: Depending on the context of the meeting, the conversation, the topic, and discussion, ALWAYS assign a type of meeting to this summary. Allowed meeting types are: Client meeting, Team meeting, Technical meeting, Training Session, Status Update, Brainstorming Session, Review Meeting, External Stakeholder Meeting, Decision Making Meeting, and Problem Solving Meeting. This is crucial, don't overlook this.
STAKEHOLDERS: Provide a list of the participants in the meeting, their company, and their corresponding roles. If the name is not provided or not understood, please replace the name with the word 'Not stated'. If a speaker does not introduce themselves, then don't include them in the STAKEHOLDERS section.
CONTEXT: Provide a 10-15 sentence summary or context with the following information: Main reason for contact, Resolution provided, Final outcome, considering all the information above
MEETING OBJECTIVES: Provide all the objectives or goals of the meeting. Be thorough and detailed.
CONVERSATION DETAILS: Customer's main concerns/requests, Solutions discussed, Important information verified, Decisions made
KEY POINTS DISCUSSED (Elaborate on each point, if applicable): List all significant topics and issues, Important details or numbers mentioned, Any policies or procedures explained, Special requests or exceptions
ACTION ITEMS & NEXT STEPS (Elaborate on each point, if applicable): What the customer needs to do: Immediate actions required, Future steps to take, Important dates or deadlines. What the company will do (Elaborate on each point, if applicable): Processing or handling steps, Follow-up actions promised, Timeline for completion
ADDITIONAL NOTES (Elaborate on each point, if applicable): Any notable issues or concerns, Follow-up recommendations, Important reminders
TECHNICAL REQUIREMENTS & RESOURCES (Elaborate on each point, if applicable): Systems or tools discussed/needed, Technical specifications mentioned, Required access or permissions, Resource allocation details
"""
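The prompt is then sent to the model through the Amazon Bedrock runtime. The following is a minimal sketch of such a call using the Converse API; the helper name, inference parameters, and the way the transcript is appended are illustrative assumptions rather than the repository's exact code. Note that invoking Claude 3.7 Sonnet may require an inference profile prefix (for example "us."), which corresponds to the inference_profile_prefix variable shown later in this post.

import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

# Mirrors the model_id value configured in terraform.tfvars; an inference profile
# prefix (for example "us.") may need to be prepended in your account.
MODEL_ID = "anthropic.claude-3-7-sonnet-20250219-v1:0"

def summarize_transcript(prompt: str, transcript: str) -> str:
    """Send the summarization prompt plus the meeting transcript to Amazon Bedrock."""
    response = bedrock.converse(
        modelId=MODEL_ID,
        messages=[{
            "role": "user",
            "content": [{"text": prompt + "\n\nTRANSCRIPT:\n" + transcript}],
        }],
        inferenceConfig={"maxTokens": 4096, "temperature": 0.2},
    )
    return response["output"]["message"]["content"][0]["text"]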
Frontend implementation
The frontend is built with React and provides the following features:

User authentication and authorization using Amazon Cognito
Audio file upload interface with progress indicators
Summary viewing with formatted sections (stakeholders, key points, action items)
Search functionality across meeting summaries
Meeting statistics visualization

The frontend communicates with the backend through the AWS AppSync GraphQL API, which provides a unified interface for data operations.
Security considerations
Security is a top priority in our solution, which we address with the following measures:

User authentication is handled by Amazon Cognito
API access is secured with Amazon Cognito user pools
S3 bucket access is restricted to authenticated users
IAM roles follow the principle of least privilege
Data is encrypted at rest and in transit
Step Functions provide secure orchestration with proper error handling

Benefits of using Amazon Bedrock
Amazon Bedrock offers several key advantages for our meeting summarization system:

Access to state-of-the-art models – Amazon Bedrock provides access to leading FMs like Anthropic’s Claude 3.7 Sonnet version 1, which delivers high-quality summarization capabilities without the need to train custom models.
Fully managed integration – Amazon Bedrock integrates seamlessly with other AWS services, allowing for a fully serverless architecture that scales automatically with demand.
Cost-efficiency – On-Demand pricing means you only pay for the actual processing time, making it cost-effective for variable workloads.
Security and compliance – Amazon Bedrock maintains data privacy and security, making sure sensitive meeting content remains protected within your AWS environment.
Customizable prompts – The ability to craft detailed prompts allows for tailored summaries that extract exactly the information your organization needs from meetings. Amazon Bedrock also provides prompt management and optimization, as well as the playground for quick prototyping.
Multilingual support – Amazon Bedrock can process content in multiple languages, making it suitable for global organizations.
Reduced development time – Pre-trained models minimize the need for extensive AI development expertise and infrastructure.
Continuous improvement – Amazon Bedrock provides a choice of models, and you can switch to a newer model by updating a single model ID string.

Prerequisites
Before implementing this solution, make sure you have:

An AWS account with permissions to create and manage the required services, such as Amazon S3, DynamoDB, Lambda, Amazon Transcribe, Amazon Bedrock, Step Functions, AWS AppSync, Amazon CloudWatch, and IAM
Terraform v1.5.0 or later installed
The AWS Command Line Interface (AWS CLI) configured with appropriate credentials
Access to Amazon Bedrock FMs (Anthropic’s Claude 3.7 Sonnet version 1 recommended)
Basic familiarity with Terraform and AWS services

In the following sections, we walk through the steps to deploy the meeting audio summarizer solution.
Clone the repository
First, clone the repository containing the Terraform code:
git clone https://github.com/aws-samples/sample-meeting-audio-summarizer-in-terraform
cd sample-meeting-audio-summarizer-in-terraform
Configure AWS credentials
Make sure your AWS credentials are properly configured. You can use the AWS CLI to set up your credentials:
aws configure --profile meeting-summarizer
You will be prompted to enter your AWS access key ID, secret access key, default AWS Region, and output format.
Install frontend dependencies
To set up the frontend development environment, navigate to the frontend directory and install the required dependencies:
cd frontend
npm install
Create configuration files
Move to the terraform directory:
cd ../backend/terraform/  
Update the terraform.tfvars file in the backend/terraform directory with your specific values. This configuration supplies values for the variables previously defined in the variables.tf file.
You can customize other variables defined in variables.tf according to your needs. In the terraform.tfvars file, you provide actual values for the variables declared in variables.tf, so you can customize the deployment without modifying the core configuration files:

aws_region                              = "us-east-1"
aws_profile                             = "YOUR-AWS-PROFILE"
environment                             = "prod"
app_name                                = "meeting-audio-summarizer"
dynamodb_read_capacity                  = 5
dynamodb_write_capacity                 = 5
cognito_allowed_email_domains           = ["example.com"]
model_id                                = "anthropic.claude-3-7-sonnet-20250219-v1:0"
inference_profile_prefix                = "us"
frontend_bucket_name                    = "a-unique-bucket-name"
storage_bucket                          = "a-unique-bucket-name"
cognito_domain_prefix                   = "meeting-summarizer"
meeting_statistics_table_name           = "MeetingStatistics"
meeting_summaries_table_name            = "MeetingSummaries"

For a-unique-bucket-name, choose a unique name that is meaningful and makes sense to you.
Initialize and apply Terraform
Navigate to the terraform directory and initialize the Terraform environment:
terraform init
To upgrade the previously selected plugins to the newest version that complies with the configuration’s version constraints, use the following command:
terraform init -upgrade
This will cause Terraform to ignore selections recorded in the dependency lock file and take the newest available version matching the configured version constraints.
Review the planned changes:
terraform plan
Apply the Terraform configuration to create the resources:
terraform apply
When prompted, enter yes to confirm the deployment. You can run terraform apply -auto-approve to skip the approval question.
Deploy the solution
After the backend deployment is complete, deploy the entire solution using the provided deployment script:
cd ../../scripts
sudo chmod +x deploy.sh
./deploy.sh
This script handles the entire deployment process, including:

Deploying the backend infrastructure using Terraform
Automatically configuring the frontend with backend resource information
Building and deploying the frontend application
Setting up CloudFront distribution
Invalidating the CloudFront cache to make sure the latest content is served

Verify the deployment
After the entire solution (both backend and frontend) is deployed, in your terminal you should see something similar to the following text:

Deployment complete! 🙂

============================================================================
Your app is available at: https://d1e5vh2t5qryy2.cloudfront.net.
============================================================================

The CloudFront URL (*.cloudfront.net/) is unique, so yours will not be the same.
Enter the URL into your browser to open the application. You will see a login page like the following screenshot. You must create an account to access the application.

Start by uploading a file:

View generated summaries in a structured format:

See meeting statistics:

Clean up
To clean up the solution, run the following command:
terraform destroy
This command will completely remove the AWS resources provisioned by Terraform in your environment. When executed, it will display a detailed plan showing the resources that will be destroyed, and prompt for confirmation before proceeding. The process may take several minutes as it systematically removes infrastructure components in the correct dependency order.
Remember to verify the destruction is complete by checking your AWS Console to make sure no billable resources remain active.
Cost considerations
When implementing this solution, it’s important to understand the cost implications of each component. Let’s analyze the costs for a realistic usage scenario with the following assumptions:

50 hours of audio processing per month
Average meeting length of 30 minutes
100 active users accessing the system
5 million API queries per month

The majority of the cost comes from Amazon Transcribe (approximately 73% of total cost at $72.00), with AWS AppSync being the second largest cost component (approximately 20% at $20.00). Despite providing the core AI functionality, Amazon Bedrock costs approximately 3% of total at $3.00, and DynamoDB, CloudFront, Lambda, Step Functions, Amazon SQS, and Amazon S3 make up the remaining 4%.
We can take advantage of the following cost optimization opportunities:

Implement audio compression to reduce storage and processing costs
Use Amazon Transcribe Medical for medical meetings (if applicable) for higher accuracy
Implement caching strategies for frequently accessed summaries to reduce AppSync and DynamoDB costs
Consider reserved capacity for DynamoDB if usage patterns are predictable

The following table summarizes these costs. Refer to the AWS pricing pages for each service to learn more about the AWS pricing model.

Service | Usage | Unit Cost | Monthly Cost
Amazon Bedrock | 500K input tokens, 100K output tokens | $3.00 per million input tokens; $15.00 per million output tokens | $3.00
Amazon CloudFront | 5 GB data transfer | $0.085 per GB | $0.43
Amazon Cognito | 100 monthly active users (MAU) | Free tier (first 50K users) | $0.00
Amazon DynamoDB | 5 RCU/WCU, ~1 GB storage | $0.25 per RCU/WCU + $0.25 per GB | $2.75
Amazon SQS | 1,000 messages | $0.40 per million | $0.01
Amazon S3 | 3 GB audio + 12 MB transcripts/summaries | $0.023 per GB | $0.07
AWS Step Functions | 1,000 state transitions | $0.025 per 1,000 | $0.03
AWS AppSync | 5M queries | $4.00 per million | $20.00
AWS Lambda | 300 invocations, 5 s avg. runtime, 256 MB | Various | $0.10
Amazon Transcribe | 50 hours of audio | $1.44 per hour | $72.00
TOTAL | | | $98.39

Next steps
The next phase of our meeting summarization solution will incorporate several advanced AI technologies to deliver greater business value. Amazon Sonic Model can improve transcription accuracy by better handling multiple speakers, accents, and technical terminology—addressing a key pain point for global organizations with diverse teams. Meanwhile, Amazon Bedrock Flows can enhance the system’s analytical capabilities by implementing automated meeting categorization, role-based summary customization, and integration with corporate knowledge bases to provide relevant context. These improvements can help organizations extract actionable insights that would otherwise remain buried in conversation.
The addition of real-time processing capabilities helps teams see key points, action items, and decisions as they emerge during meetings, enabling immediate clarification and reducing follow-up questions. Enhanced analytics functionality tracks patterns across multiple meetings over time, giving management visibility into communication effectiveness, decision-making processes, and project progress. By integrating with existing productivity tools such as calendars, daily agendas, task management systems, and communication services, the solution makes sure that meeting intelligence flows directly into daily workflows, minimizing manual transfer of information and ensuring critical insights drive tangible business outcomes across departments.
Conclusion
Our meeting audio summarizer combines AWS serverless technologies with generative AI to solve a critical productivity challenge. It automatically transcribes and summarizes meetings, saving organizations thousands of hours while making sure insights and action items are systematically captured and shared with stakeholders.
The serverless architecture scales effortlessly with fluctuating meeting volumes, costs just $0.98 per meeting on average, and minimizes infrastructure management and maintenance overhead. Amazon Bedrock provides enterprise-grade AI capabilities without requiring specialized machine learning expertise or significant development resources, and the Terraform-based infrastructure as code enables rapid deployment across environments, customization to meet specific organizational requirements, and seamless integration with existing CI/CD pipelines.
As the field of generative AI continues to evolve and new, better-performing models become available, the solution’s ability to perform its tasks will automatically improve on performance and accuracy without additional development effort, enhancing summarization quality, language understanding, and contextual awareness. This makes the meeting audio summarizer an increasingly valuable asset for modern businesses looking to optimize meeting workflows, enhance knowledge sharing, and boost organizational productivity.
Additional resources
Refer to Amazon Bedrock Documentation for more details on model selection, prompt engineering, and API integration for your generative AI applications. Additionally, see Amazon Transcribe Documentation for information about the speech-to-text service’s features, language support, and customization options for achieving accurate audio transcription. For infrastructure deployment needs, see Terraform AWS Provider Documentation for detailed explanations of resource types, attributes, and configuration options for provisioning AWS resources programmatically. To enhance your infrastructure management skills, see Best practices for using the Terraform AWS Provider, where you can find recommended approaches for module organization, state management, security configurations, and resource naming conventions that will help make sure your AWS infrastructure deployments remain scalable and maintainable.

About the authors
Dunieski Otano is a Solutions Architect at Amazon Web Services based out of Miami, Florida. He works with World Wide Public Sector MNO (Multi-International Organizations) customers. His passion is Security, Machine Learning and Artificial Intelligence, and Serverless. He works with his customers to help them build and deploy high available, scalable, and secure solutions. Dunieski holds 14 AWS certifications and is an AWS Golden Jacket recipient. In his free time, you will find him spending time with his family and dog, watching a great movie, coding, or flying his drone.
Joel Asante, an Austin-based Solutions Architect at Amazon Web Services (AWS), works with GovTech (Government Technology) customers. With a strong background in data science and application development, he brings deep technical expertise to creating secure and scalable cloud architectures for his customers. Joel is passionate about data analytics, machine learning, and robotics, leveraging his development experience to design innovative solutions that meet complex government requirements. He holds 13 AWS certifications and enjoys family time, fitness, and cheering for the Kansas City Chiefs and Los Angeles Lakers in his spare time.
Ezzel Mohammed is a Solutions Architect at Amazon Web Services (AWS) based in Dallas, Texas. He works on the International Organizations team within the World Wide Public Sector, collaborating closely with UN agencies to deliver innovative cloud solutions. With a Computer Science background, Ezzeldien brings deep technical expertise in system design, helping customers architect and deploy highly available and scalable solutions that meet international compliance requirements. He holds 9 AWS certifications and is passionate about applying AI Engineering and Machine Learning to address global challenges. In his free time, he enjoys going on walks, watching soccer with friends and family, playing volleyball, and reading tech articles.

Kyruus builds a generative AI provider matching solution on AWS

This post was written with Zach Heath of Kyruus Health.
When health plan members need care, they shouldn’t need a dictionary. Yet millions face this exact challenge—describing symptoms in everyday language while healthcare references clinical terminology and complex specialty classifications. This disconnect forces members to become amateur medical translators, attempting to convert phrases like “my knee hurts when I climb stairs” into specialized search criteria such as orthopedics or physical medicine. Traditional provider directories compound this problem with overwhelming filter options and medical jargon, leading to frustrated members, delayed care access, and ultimately higher costs for both individuals and health plans.
Kyruus Health, a leading provider of care access solutions, serves over 1,400 hospitals, 550 medical groups, and 100 health plan brands—connecting more than 500,000 providers with patients seeking care and facilitating over 1 million appointments annually. To address the challenges of healthcare navigation, they developed Guide, an AI-powered solution that understands natural language and connects members with the right providers. With Guide, members can express health concerns in their own words and receive personalized provider matches without requiring clinical knowledge. Health plans implementing this solution have reported enhanced member experience and higher Net Promoter Scores (NPS), along with improved care access conversion and appointment scheduling rates.
In this post, we demonstrate how Kyruus Health uses AWS services to build Guide. We show how Amazon Bedrock, a fully managed service that provides access to foundation models (FMs) from leading AI companies and Amazon through a single API, and Amazon OpenSearch Service, a managed search and analytics service, work together to understand everyday language about health concerns and connect members with the right providers. We explore the solution architecture, share implementation insights, and examine how this approach delivers measurable business value for health plans and their members.
Solution overview
Guide transforms healthcare provider search by translating natural language health concerns into precisely matched provider recommendations. The solution uses Amazon Bedrock with Anthropic’s Claude 3.5 Sonnet to understand everyday descriptions of health concerns and convert them into structured medical parameters. Then it uses OpenSearch Service to match these parameters against comprehensive provider data and deliver targeted recommendations.
This architecture makes it possible for members to express health needs in plain language while making sure provider matches meet clinical requirements. The entire solution maintains HIPAA compliance through end-to-end encryption and fine-grained access controls, allowing Kyruus Health to focus on improving the member experience instead of managing complex infrastructure.
The following diagram illustrates the solution architecture.

This architecture translates natural language queries into structured healthcare parameters through the following steps:

A member enters a query like “I’ve been having shooting pain down my leg for two weeks” through the health plan application. Amazon API Gateway securely receives the member’s query request.
API Gateway routes the request to Guide’s conversation service running on Amazon Elastic Container Service (Amazon ECS).
Guide’s conversation service calls Amazon Bedrock, where Anthropic’s Claude 3.5 Sonnet processes the natural language. The model identifies potential sciatica and translates this everyday description into structured medical parameters, including appropriate specialties like neurology or orthopedics.
The health plan application initiates a new API call through API Gateway to the Provider Search Service running on Amazon ECS, using the structured parameters derived from the previous steps.
The Provider Search Service queries OpenSearch Service, which contains comprehensive provider data previously ingested from Amazon Simple Storage Service (Amazon S3), including specialties, clinical focus areas, locations, and insurance network participation.

Matched providers are then returned to the health plan application and presented to the member through an intuitive conversational interface. This architecture demonstrates the powerful combination of Amazon Bedrock FMs with purpose-built AWS services like OpenSearch Service, creating an end-to-end solution that bridges the gap between complex healthcare data and intuitive member experiences.
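To make the flow concrete, here is a highly simplified sketch of the two core calls. The index name, document fields, prompt, endpoint, and model ID are assumptions for illustration, not Kyruus Health's actual schema or code:

import json

import boto3
from opensearchpy import OpenSearch

bedrock = boto3.client("bedrock-runtime")
search = OpenSearch(hosts=[{"host": "search-providers.us-east-1.es.amazonaws.com", "port": 443}], use_ssl=True)

def extract_search_parameters(member_query: str) -> dict:
    """Ask the model to translate everyday language into structured clinical parameters."""
    prompt = (
        "Map this health concern to likely clinical specialties and return only JSON "
        'of the form {"specialties": [...], "clinical_focus": [...]}.\n\n' + member_query
    )
    response = bedrock.converse(
        modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",   # placeholder model ID
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    # Assumes the model follows the instruction and returns clean JSON.
    return json.loads(response["output"]["message"]["content"][0]["text"])

def match_providers(params: dict) -> list:
    """Query provider documents in OpenSearch Service using the structured parameters."""
    body = {
        "query": {
            "bool": {
                "must": [{"terms": {"specialties": params["specialties"]}}],
                "filter": [{"term": {"accepting_new_patients": True}}],
            }
        }
    }
    return search.search(index="providers", body=body)["hits"]["hits"]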
Building with Tribe AI
To accelerate their AI transformation, Kyruus Health partnered with Tribe AI, an AWS Partner with extensive experience in building and implementing enterprise-grade generative AI solutions at scale. Tribe AI’s proven track record in deploying FMs in complex, regulatory environments like healthcare helped de-risk the adoption of generative AI for Kyruus. This partnership allowed Kyruus to focus on their healthcare domain expertise while using Tribe AI’s technical implementation knowledge to bring Guide from concept to production.
Implementation insights
Kyruus Health’s successful implementation of Guide yielded key insights that can help organizations building healthcare AI initiatives:

Healthcare-specific testing infrastructure is essential – Kyruus Health prioritized testing with real healthcare scenarios from the start. This process made sure Guide could accurately translate everyday descriptions into appropriate provider specialties, maintaining reliability where matching decisions directly impact health outcomes and plan costs.
User-centered design principles must guide AI implementation – By focusing first on member needs rather than technical capabilities, Kyruus Health made sure their solution addressed the actual friction points in healthcare navigation. This approach led directly to significant improvements in satisfaction and reduced search abandonment rates, demonstrating how AI implementations should start with human needs rather than technical possibilities.
Strategic model selection drives business outcomes – Rather than using a single model for all tasks, Kyruus Health discovered the power of strategically deploying specialized models for different aspects of healthcare navigation—including complex symptom interpretation and clinical specialty mapping. This targeted approach improved provider match accuracy by aligning specific AI capabilities to distinct parts of the matching process, optimizing both performance and cost while delivering more precise provider recommendations.

These insights demonstrate how a thoughtful implementation approach can transform complex healthcare navigation challenges into intuitive member experiences that deliver measurable business results.
Guide member experience in action
The following screenshot shows how the AWS architecture translates into the real-world member experience. When a member enters their symptom description and location preference, Guide processes this natural language input through Amazon Bedrock and identifies appropriate specialists using OpenSearch Service. The system interprets the medical concern and location requirements, responding with relevant specialists within the requested distance who are accepting new patients. This streamlined experience has delivered higher match rates and increased appointment completion for health plans.

Conclusion
Guide demonstrates how generative AI powered by AWS transforms healthcare navigation by bridging the gap between everyday language and clinical terminology. In this post, we explored how an architecture combining Amazon Bedrock and OpenSearch Service processes natural language queries into personalized provider matches, helping members find appropriate healthcare providers using natural language descriptions of their symptoms.
For health plans evaluating digital initiatives, Guide offers a blueprint for solving complex healthcare challenges while delivering measurable improvements in member satisfaction and appointment conversion rates. To build your own generative AI solutions, explore Amazon Bedrock for managed access to FMs. For healthcare-specific guidance, check out the AWS Healthcare Industry Lens and browse implementation examples, use cases, and technical guidance in the AWS Healthcare and Life Sciences Blog.

About the authors
Zach Heath is a Senior Staff Software Engineer at Kyruus Health. A passionate technologist, he specializes in architecting and implementing robust, scalable software solutions that transform healthcare search experiences by connecting patients with the right care through innovative technology.
Anil Chinnam is a Solutions Architect at AWS. He is a generative AI enthusiast passionate about translating cutting-edge technologies into tangible business value for healthcare customers. As a trusted technical advisor, he helps customers drive cloud adoption and business transformation outcomes.

Use generative AI in Amazon Bedrock for enhanced recommendation genera …

In the manufacturing world, valuable insights from service reports often remain underutilized in document storage systems. This post explores how Amazon Web Services (AWS) customers can build a solution that automates the digitisation and extraction of crucial information from many reports using generative AI.
The solution uses Amazon Nova Pro on Amazon Bedrock and Amazon Bedrock Knowledge Bases to generate recommended actions that are aligned with the observed equipment state, using an existing knowledge base of expert recommendations. The knowledge base expands over time as the solution is used.
Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies such as AI21 Labs, Anthropic, Cohere, Meta, Stability AI, Mistral, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI.
Amazon Bedrock Knowledge Bases offers fully managed, end-to-end Retrieval Augmented Generation (RAG) workflows to create highly accurate, low-latency, custom generative AI applications by incorporating contextual information from your company’s data sources, making it a well-suited service to store engineers’ expert recommendations from past reports and allow FMs to accurately customise their responses.
Traditional service and maintenance cycles rely on manual report submission by engineers with expert knowledge. Time spent referencing past reports can lead to operational delays and business disruption. This solution empowers equipment maintenance teams to:

Ingest inspection and maintenance reports (in multiple languages) and extract equipment status and open actions, increasing visibility and actionability
Generate robust, trustworthy recommendations using experienced engineers’ expertise
Expand the initial knowledge base built by expert engineers to include valid generated recommendations
Accelerate maintenance times and prevent unplanned downtime with a centralised, AI-powered tool that streamlines your equipment maintenance processes on AWS

To help you implement this solution, we provide a GitHub repository containing deployable code and infrastructure as code (IaC) templates. You can quickly set up and customise the solution in your own AWS environment using the GitHub repository.
Solution overview
The following diagram is an architecture representation of the solution presented in this post, showcasing the various AWS services in use. Using this GitHub repository, you can deploy the solution into your AWS account to test it.

The following are key workflows of the solution:

Automated service report ingestion with Amazon Textract – The report ingestion workflow processes and translates service reports into a standardised format. This workflow uses Amazon Textract for optical character recognition (OCR), Amazon Translate for language translation, and Amazon Comprehend for language detection. These services provide reports that are accurately processed and prepared for metadata extraction, regardless of their original format or language.
Intelligent recommendation generation using RAG – Following ingestion, the metadata extraction and standardisation process uses RAG architecture with the Amazon Nova Pro in Amazon Bedrock and Amazon Bedrock Knowledge Bases. This workflow extracts crucial metadata from the reports and uses the RAG process to generate precise and actionable maintenance recommendations. The metadata is standardised for consistency and reliability, providing a solid foundation for the recommendations.
Expert validation with Amazon SageMaker Ground Truth – To validate and refine the generated recommendations, the solution incorporates an expert review process using Amazon SageMaker Ground Truth. This workflow involves creating customised labelling jobs where experts review and validate the recommendations for accuracy and reliability. This feedback loop helps continually improve the model’s performance, making the maintenance recommendations more trustworthy.
Expanding the knowledge base for future processing – The knowledge base for this tool needs to be expanded with new rules for each equipment type, drawing from two main sources:

Analysing past equipment and maintenance reports to obtain labeled data on recommended actions.
Reinforcing valid recommendations generated by the tool and verified by human experts.

This compiled set of rules is reviewed by experts, assigned criticality, and then automatically synced into the Amazon Bedrock Knowledge Bases to continually improve the solution’s confidence in generating the next recommended action. Together, these workflows create a seamless and efficient process from report ingestion to actionable recommendations, producing high-quality insights for maintenance operations. This solution is deployable and scalable using IaC with Terraform for ease of implementation and expansion across various environments. Teams have the flexibility to efficiently roll out the solution to customers globally, enhancing maintenance operations and reducing unplanned downtimes. In the following sections, we walk through the steps to customize and deploy the solution.
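As a rough illustration of the ingestion workflow described above (not the repository's exact Lambda code; the function name, bucket handling, and size limits are simplified assumptions), the three services can be chained like this:

import boto3

textract = boto3.client("textract")
comprehend = boto3.client("comprehend")
translate = boto3.client("translate")

def ingest_report(bucket: str, key: str, target_language: str = "en") -> str:
    """OCR a service report stored in S3, detect its language, and translate it if needed."""
    # 1. Optical character recognition with Amazon Textract
    #    (multi-page PDFs require the asynchronous Textract APIs instead).
    ocr = textract.detect_document_text(Document={"S3Object": {"Bucket": bucket, "Name": key}})
    text = "\n".join(b["Text"] for b in ocr["Blocks"] if b["BlockType"] == "LINE")

    # 2. Language detection with Amazon Comprehend (a prefix of the text is sufficient).
    langs = comprehend.detect_dominant_language(Text=text[:4500])
    source_language = langs["Languages"][0]["LanguageCode"]

    # 3. Translation with Amazon Translate, only when the report is not already in English.
    if source_language != target_language:
        text = translate.translate_text(
            Text=text[:9500],  # Amazon Translate enforces a per-request size limit
            SourceLanguageCode=source_language,
            TargetLanguageCode=target_language,
        )["TranslatedText"]
    return text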
Prerequisites
To deploy the solution, you must have an AWS account with the appropriate permissions and access to Amazon Nova FMs on Amazon Bedrock. This can be enabled from the Amazon Bedrock console page.
Clone the GitHub repository
Clone the GitHub repository containing the IaC for the solution to your local machine.
Customise the ReportsProcessing function
To customize the ReportsProcessing AWS Lambda function, follow these steps:

Open the lambdas/python/ReportsProcessing/extract_observations.py file. This file contains the logic for the ReportsProcessing Lambda function.
Modify the code in this file to include your custom logic for processing reports based on their specific document styles. For example, you might need to modify the extract_metadata function to handle different report formats or adjust the logic in the standardize_metadata function to comply with your organisation’s standards.

Customise the RecommendationGeneration function
To customize the RecommendationGeneration Lambda, follow these steps:

Open the lambdas/python/RecommendationGeneration/generate_recommendations.py file. This file contains the logic for the RecommendationGeneration Lambda function, which uses the RAG architecture.
Modify the code in this file to include your custom logic for generating recommendations based on your specific requirements. For example, you might need to adjust the query_formulation() function to modify the prompt sent to Amazon Nova Pro or update the retrieve_rules function to customize the retrieval process from the knowledge base.

Update the Terraform configuration
If you made changes to the Lambda function names, roles, or other AWS resources, update the corresponding Terraform configuration files in the terraform directory to reflect these changes.
Initialise the Terraform working directory
Open a terminal or command prompt and navigate to the terraform directory within the cloned repository. Enter the following command to initialize the Terraform working directory:

terraform init

Preview the Terraform changes
Before applying the changes, preview the Terraform run plan by entering the following command:

terraform plan

This command will show you the changes that Terraform plans to make to your AWS infrastructure.
Deploy the Terraform stack
If you’re satisfied with the planned changes, deploy the Terraform stack to your AWS account by entering the following command:

terraform apply

Enter yes and press Enter to proceed with the deployment.
Create an Amazon Bedrock knowledge base
After you deploy the Terraform stack, create an Amazon Bedrock knowledge base to store and retrieve the maintenance rules and recommendations:

aws bedrock-agent create-knowledge-base \
    --knowledge-base-name "EquipmentMaintenanceKB" \
    --description "Knowledge base for equipment maintenance recommendations" \
    --storage-configuration '{ "type": "VECTOR_STORE", "vectorStore": { "embeddingModelArn": "arn:aws:bedrock:{aws-region}::foundation-model/amazon.titan-embed-text-v1" } }'

Once the knowledge base is created, remember to update the RecommendationGeneration Lambda function’s environment variable with the appropriate knowledge base ID.
Upload a test report and validate the solution for generated recommendations
To test the solution, upload a sample maintenance report to the designated Amazon Simple Storage Service (Amazon S3) bucket:

aws s3 cp sample_report.pdf s3://iac-created-reports-bucket/incoming

Once the file is uploaded, navigate to the created AWS Step Functions state machine and validate that a successful execution occurs. The output of a successful execution should contain the extracted observations from the input document as well as newly generated recommendations pulled from the knowledge base.
Clean up
When you’re done with this solution, clean up the resources you created to avoid ongoing charges.
Conclusion
This post provided an overview of implementing a risk-based maintenance solution to preempt potential failures and avoid equipment downtime for maintenance teams. This solution highlights the benefits of Amazon Bedrock. By using Amazon Nova Pro with RAG for your equipment maintenance reports, engineers and scientists can focus their efforts on improving accuracy of recommendations and increasing development velocity. The key capabilities of this solution include:

Automated ingestion and standardization of maintenance reports using Amazon Textract, Amazon Comprehend, and Amazon Translate
Intelligent recommendation generation powered by RAG and Amazon Nova Pro on Amazon Bedrock
Continual expert validation and knowledge base expansion using SageMaker Ground Truth
Scalable and production-ready deployment using IaC with Terraform

By using the breadth of AWS services and the flexibility of Amazon Bedrock, equipment maintenance teams can streamline their operations and reduce unplanned downtimes.
AWS Professional Services is ready to help your team develop scalable and production-ready generative AI solutions on AWS. For more information, refer to the AWS Professional Services page or reach out to your account manager to get in touch.

About the authors
Jyothsna Puttanna is an AI/ML Consultant at AWS Professional Services. Jyothsna works closely with customers building their machine learning solutions on AWS. She specializes in distributed training, experimentation, and generative AI.
Shantanu Sinha is a Senior Engagement Manager at AWS Professional Services, based out of Berlin, Germany. Shantanu’s focus is on using generative AI to unlock business value and identify strategic business opportunities for his clients.
Selena Tabbara is a Data Scientist at AWS Professional Services specializing in AI/ML and Generative AI solutions for enterprise customers in energy, automotive and manufacturing industry.

MIRIX: A Modular Multi-Agent Memory System for Enhanced Long-Term Reas …

Recent developments in LLM agents have largely focused on enhancing capabilities in complex task execution. However, a critical dimension remains underexplored: memory—the capacity of agents to persist, recall, and reason over user-specific information across time. Without persistent memory, most LLM-based agents remain stateless, unable to build context beyond a single prompt, limiting their usefulness in real-world settings where consistency and personalization are essential.

To address this, MIRIX AI introduces MIRIX, a modular multi-agent memory system explicitly designed to enable robust long-term memory for LLM-based agents. Unlike flat, purely text-centric systems, MIRIX integrates structured memory types across modalities—including visual input—and is built upon a coordinated multi-agent architecture for memory management.

Core Architecture and Memory Composition

MIRIX features six specialized, compositional memory components, each governed by a corresponding Memory Manager:

Core Memory: Stores persistent agent and user information, segmented into ‘persona’ (agent profile, tone, and behavior) and ‘human’ (user facts such as name, preferences, and relationships).

Episodic Memory: Captures time-stamped events and user interactions with structured attributes like event_type, summary, details, actors, and timestamp.

Semantic Memory: Encodes abstract concepts, knowledge graphs, and named entities, with entries organized by type, summary, details, and source.

Procedural Memory: Contains structured workflows and task sequences using clearly defined steps and descriptions, often formatted as JSON for easy manipulation.

Resource Memory: Maintains references to external documents, images, and audio, recorded by title, summary, resource type, and content or link for contextual continuity.

Knowledge Vault: Secures verbatim facts and sensitive information such as credentials, contacts, and API keys with strict access controls and sensitivity labels.

A Meta Memory Manager orchestrates the activities of these six specialized managers, enabling intelligent message routing, hierarchical storage, and memory-specific retrieval operations. Additional agents—with roles like chat and interface—collaborate within this architecture.
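To make the structured entries above concrete, here is a minimal sketch of how two of the memory types might be represented. The field names follow the article; the class names and storage choices are illustrative, not MIRIX's actual implementation.

from dataclasses import dataclass, field
from datetime import datetime

# Illustrative schemas only; MIRIX's real storage layer may differ.
@dataclass
class EpisodicEvent:
    event_type: str          # e.g., "meeting", "search"
    summary: str
    details: str
    actors: list[str]
    timestamp: datetime

@dataclass
class CoreMemory:
    persona: dict = field(default_factory=dict)  # agent profile, tone, behavior
    human: dict = field(default_factory=dict)    # user facts: name, preferences, relationships

# Each specialized Memory Manager would own and update one such store.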

Active Retrieval and Interaction Pipeline

A core innovation of MIRIX is its Active Retrieval mechanism. On user input, the system first autonomously infers a topic, then retrieves relevant memory entries from all six components, and finally tags the retrieved data for contextual injection into the resulting system prompt. This process decreases reliance on outdated parametric model knowledge and provides much stronger answer grounding.

Multiple retrieval strategies—including embedding_match, bm25_match, and string_match—are available, ensuring accurate and context-aware access to memory. The architecture allows for further expansion of retrieval tools as needed.
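A rough sketch of how an Active Retrieval pass could be wired together is shown below, assuming simple in-memory stores, a placeholder topic-inference step, and only the string_match strategy; none of these function names come from the MIRIX codebase.

def infer_topic(user_input: str) -> str:
    # Placeholder: in MIRIX this inference is performed by the agent itself.
    return user_input.split()[0].lower()

def retrieve(stores: dict[str, list[str]], topic: str) -> dict[str, list[str]]:
    # Minimal string_match strategy; embedding_match or bm25_match would slot in here.
    return {name: [e for e in entries if topic in e.lower()] for name, entries in stores.items()}

def build_prompt(user_input: str, retrieved: dict[str, list[str]]) -> str:
    # Tag each retrieved entry with its memory component before injecting it into the system prompt.
    context = "\n".join(f"[{name}] {entry}" for name, entries in retrieved.items() for entry in entries)
    return f"Relevant memory:\n{context}\n\nUser: {user_input}"

stores = {"episodic": ["Met Alice on Monday to discuss the budget"], "semantic": ["Alice is the finance lead"]}
print(build_prompt("alice budget status?", retrieve(stores, infer_topic("alice budget status?"))))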

System Implementation and Application

MIRIX is deployed as a cross-platform assistant application developed with React-Electron (for the UI) and Uvicorn (for the backend API). The assistant monitors screen activity by capturing screenshots every 1.5 seconds; only non-redundant screens are kept, and memory updates are triggered in batches after collecting 20 unique screenshots (approximately once per minute). Uploads to the Gemini API are streaming, enabling efficient visual data processing and sub-5-second latency for updating memory from visual inputs.
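The capture loop can be sketched roughly as follows, under stated assumptions: a stand-in capture_screen function, hashing for redundancy detection, and a batch flush every 20 unique frames. The real application streams uploads to the Gemini API rather than the print call used here.

import hashlib
import time

BATCH_SIZE = 20
CAPTURE_INTERVAL_S = 1.5

def capture_screen() -> bytes:
    # Stand-in for a real screenshot call (a platform-specific capture API in practice).
    return str(time.time()).encode()

def run_capture_loop(iterations: int = 60) -> None:
    seen_hashes: set[str] = set()
    batch: list[bytes] = []
    for _ in range(iterations):
        frame = capture_screen()
        digest = hashlib.sha256(frame).hexdigest()
        if digest not in seen_hashes:          # keep only non-redundant screens
            seen_hashes.add(digest)
            batch.append(frame)
        if len(batch) >= BATCH_SIZE:           # flush roughly once per minute
            print(f"uploading {len(batch)} unique screenshots for memory update")
            batch.clear()
        time.sleep(CAPTURE_INTERVAL_S)

# run_capture_loop()  # uncomment to run; captures are stubs here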

Users interact through a chat interface, which dynamically draws on the agent’s memory components to generate context-aware, personalized responses. Semantic and procedural memories are rendered as expandable trees or lists, providing transparency and allowing users to audit and inspect what the agent “remembers” about them.

Evaluation on Multimodal and Conversational Benchmarks

MIRIX is validated on two rigorous tasks:

ScreenshotVQA: A visual question-answering benchmark requiring persistent, long-term memory over high-resolution screenshots. MIRIX outperforms retrieval-augmented generation (RAG) baselines—specifically SigLIP and Gemini—by 35% in LLM-as-a-Judge accuracy, while reducing retrieval storage needs by 99.9% compared to text-heavy methods.

LOCOMO: A textual benchmark assessing long-form conversation memory. MIRIX achieves 85.38% average accuracy, outperforming strong open-source systems such as LangMem and Mem0 by over 8 points, and approaching full-context sequence upper bounds.

The modular design enables high performance across both multimodal and text-only inference domains.

Use Cases: Wearables and the Memory Marketplace

MIRIX is designed for extensibility, with support for lightweight AI wearables—including smart glasses and pins—via its efficient, modular architecture. Hybrid deployment allows both on-device and cloud-based memory handling, while practical applications include real-time meeting summarization, granular location and context recall, and dynamic modeling of user habits.

A visionary feature of MIRIX is the Memory Marketplace: a decentralized ecosystem enabling secure memory sharing, monetization, and collaborative AI personalization between users. The Marketplace is designed with fine-grained privacy controls, end-to-end encryption, and decentralized storage to ensure data sovereignty and user self-ownership.

Conclusion

MIRIX represents a significant step toward endowing LLM-based agents with human-like memory. Its structured, multi-agent compositional architecture enables robust memory abstraction, multimodal support, and real-time, contextually grounded reasoning. With empirical gains across challenging benchmarks and an accessible, cross-platform application interface, MIRIX sets a new standard for memory-augmented AI systems.

FAQs

1. What makes MIRIX different from existing memory systems like Mem0 or Zep?
MIRIX introduces multi-component, compositional memory (beyond text passage storage), multimodal support (including vision), and a multi-agent retrieval architecture for more scalable, accurate, and context-rich long-term memory management.

2. How does MIRIX ensure low-latency memory updates from visual inputs?
By using streaming uploads in combination with Gemini APIs, MIRIX is able to update screenshot-based visual memory with under 5 seconds of latency, even during active user sessions.

3. Is MIRIX compatible with closed-source LLMs like GPT-4?
Yes. Since MIRIX operates as an external system (and not as a model plugin or retrainer), it can augment any LLM, regardless of its base architecture or licensing, including GPT-4, Gemini, and other proprietary models.

Check out the Paper, GitHub and Project. All credit for this research goes to the researchers of this project.

The post MIRIX: A Modular Multi-Agent Memory System for Enhanced Long-Term Reasoning and Personalization in LLM-Based Agents appeared first on MarkTechPost.

Can LLM Reward Models Be Trusted? Master-RM Exposes and Fixes Their Weaknesses

Generative reward models, where large language models (LLMs) serve as evaluators, are gaining prominence in reinforcement learning with verifiable rewards (RLVR). These models are preferred over rule-based systems for tasks involving open-ended or complex responses. Instead of relying on strict rules, LLMs compare a candidate response to a reference answer and generate binary feedback. However, despite aligning well with human evaluations, these models are surprisingly susceptible to superficial cues such as punctuation or boilerplate phrases (e.g., “Let’s solve this step by step”), which can yield false positive signals.

The Problem with Superficial Exploits

LLMs used as judges in RLVR can be manipulated by inserting trivial cues that mimic reasoning patterns. Researchers from Tencent AI Lab, Princeton University, and the University of Virginia found that even non-informative responses—like the word “Solution” or punctuation marks—can trigger positive evaluations. This behavior poses a serious risk to algorithms like preference optimization and rejection sampling, where accurate reward signals are vital. The issue is systemic, affecting both proprietary (e.g., GPT-4o, Claude-4) and open models (e.g., LLaMA3, Qwen2.5).

Introducing Master-RM: A Robust Reward Model

To counteract these vulnerabilities, the research team developed Master-RM, a new reward model trained with an augmented dataset containing 20,000 adversarial responses. These responses include generic reasoning openers and meaningless statements labeled as invalid. By fine-tuning on this enriched dataset, Master-RM significantly reduced false positive rates across benchmarks like GSM8K, MATH, and NaturalReasoning. It consistently outperformed both general-purpose and task-specific reward models, achieving near-zero error rates even under adversarial conditions.
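The data-augmentation step can be pictured schematically as pairing each valid example with a truncated, opener-only response labeled invalid. The sketch below is purely illustrative: the opener strings echo examples from the article, and the record format is an assumption, not the released dataset's schema.

import random

MASTER_KEYS = ["Solution", "Let's solve this step by step.", "Thought process:"]

def augment(valid_examples: list[dict]) -> list[dict]:
    """Add adversarial responses that contain only a generic opener and are labeled invalid."""
    augmented = list(valid_examples)
    for ex in valid_examples:
        augmented.append({
            "question": ex["question"],
            "reference": ex["reference"],
            "response": random.choice(MASTER_KEYS),  # superficial cue, no actual reasoning
            "label": "invalid",
        })
    return augmented

data = [{"question": "2+2?", "reference": "4", "response": "2+2=4, so the answer is 4.", "label": "valid"}]
print(augment(data))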

Key Findings

Systemic Vulnerability: All evaluated models—including GPT-4o and LLaMA3—showed elevated false positive rates when exposed to “master key” hacks.

Model Scaling: Smaller models matched token patterns literally; mid-sized models made semantic errors; larger models overgeneralized.

Data Augmentation Works: Training on a mix of valid and manipulated responses drastically improves robustness without compromising accuracy.

Image source: https://arxiv.org/abs/2507.08794

Benchmark Performance

Master-RM was validated on five diverse reasoning benchmarks. Compared to models like Omni-Judge and Multi-sub RM, it maintained superior consistency with gold standards such as GPT-4o while showing minimal false positives. Even when evaluated with adversarial variants across languages and task domains, Master-RM retained its reliability.

Conclusion

This study identifies a critical weakness in using LLMs as judges within RLVR systems. Simple superficial patterns can compromise the learning pipeline by misleading the reward function. Master-RM offers a viable defense, showcasing that targeted data augmentation can harden reward models against manipulation. The model and its training set are now available via Hugging Face, paving the way for more trustworthy LLM-based evaluation in reinforcement learning.

Frequently Asked Questions (FAQs)

Q1: What are “master key” hacks in LLM-based reward models?
A1: “Master key” hacks refer to superficial textual cues, such as punctuation or boilerplate reasoning phrases, that can trigger false positive judgments in LLMs used as evaluators in RLVR systems.

Q2: How does Master-RM improve robustness compared to existing models?
A2: Master-RM is trained with a curated set of adversarial examples labeled as invalid. This data augmentation reduces susceptibility to superficial manipulations while maintaining consistency with high-performing models like GPT-4o.

Q3: Where can I access Master-RM and its training data?
A3: Both the model and dataset are publicly available on Hugging Face at Master-RM Model and Master-RM Dataset.

Check out the Paper. All credit for this research goes to the researchers of this project.

The post Can LLM Reward Models Be Trusted? Master-RM Exposes and Fixes Their Weaknesses appeared first on MarkTechPost.

Model Context Protocol (MCP) for Enterprises: Secure Integration with AWS, Azure, and Google Cloud (2025 Update)

Table of contents
1. MCP Overview & Ecosystem
2. AWS: MCP at Cloud Scale
3. Microsoft Azure: MCP in Copilot & AI Foundry
4. Google Cloud: MCP Toolbox & Vertex AI
5. Cross-Cloud Best Practices
6. Security & Risk Management (2025 Threat Landscape)
7. Expanded Ecosystem: Beyond the “Big Three”
8. Example: AWS MSK MCP Integration Flow
9. Summary (July 2025)

The Model Context Protocol (MCP), open-sourced by Anthropic in November 2024, has rapidly become the cross-cloud standard for connecting AI agents to tools, services, and data across the enterprise landscape. Since its release, major cloud vendors and leading AI providers have shipped first-party MCP integrations, and independent platforms are quickly expanding the ecosystem.

1. MCP Overview & Ecosystem

What is MCP?

MCP is an open standard (JSON-RPC 2.0-based) that enables AI systems (like large language models) to securely discover and call functions, tools, APIs, or data stores exposed by any MCP-compatible server.

It was purpose-built to eliminate the “N×M” connector problem in tool integrations: once a tool speaks MCP, any agent or app that supports MCP can interface with it securely and predictably.

Official SDKs: Python, TypeScript, C#, Java. Reference servers exist for databases, GitHub, Slack, Postgres, Google Drive, Stripe, and more.
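Because the wire format is ordinary JSON-RPC 2.0, a tool invocation is just a small JSON payload. The sketch below builds one by hand; the tools/call method and params shape reflect the public MCP specification as commonly documented, and the tool name is a placeholder borrowed from the MSK example later in this article.

import json

# A hand-built MCP request: invoke a tool exposed by an MCP server over JSON-RPC 2.0.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "msk.getClusterInfo",                       # placeholder tool name
        "arguments": {"clusterArn": "arn:aws:kafka:..."},   # truncated placeholder argument
    },
}
print(json.dumps(request, indent=2))
# In practice an MCP SDK (Python, TypeScript, C#, Java) builds and transports this for you.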

Who’s Adopting MCP?

Cloud Providers: AWS (API MCP Server, MSK, Price List), Azure (AI Foundry MCP Server), Google Cloud (MCP Toolbox for Databases).

AI Platforms: OpenAI (Agents SDK, ChatGPT desktop), Google DeepMind (Gemini), Microsoft Copilot Studio, Claude Desktop.

Developer Tools: Replit, Zed, Sourcegraph, Codeium.

Enterprise Platforms: Block, Apollo, FuseBase, Wix—each embedding MCP for integrating AI assistants within custom business workflows.

Ecosystem Growth: The global MCP server market is projected to reach $10.3B in 2025, reflecting rapid enterprise adoption and ecosystem maturity.

2. AWS: MCP at Cloud Scale

What’s New (July 2025):

AWS API MCP Server: Developer preview launched July 2025; lets MCP-compatible AI agents securely call any AWS API via natural language.

Amazon MSK MCP Server: Now provides a standardized language interface to monitor Kafka metrics and manage clusters via agentic apps. Built-in security via IAM, fine-grained permissions, and OpenTelemetry tracing.

Price List MCP Server: Real-time AWS pricing and availability—query rates by region on demand.

Additional Offerings: Code Assistant MCP Server, Bedrock agent runtime, and sample servers for quick onboarding. All are open source where feasible.

Integration Steps:

Deploy the desired MCP server using Docker or ECS, leveraging official AWS guidance.

Harden endpoints with TLS, Cognito, WAF, and IAM roles.

Define API visibility/capabilities—e.g., msk.getClusterInfo.

Issue OAuth tokens or IAM credentials for secure access.

Connect with AI clients (Claude Desktop, OpenAI, Bedrock, etc.).

Monitor via CloudWatch and OpenTelemetry for observability.

Rotate credentials and review access policies regularly.

Why AWS Leads:

Unmatched scalability, official support for the widest set of AWS services, and fine-grained multi-region pricing/context APIs.

3. Microsoft Azure: MCP in Copilot & AI Foundry

What’s New:

Azure AI Foundry MCP Server: Unified protocol now connects Azure services (CosmosDB, SQL, SharePoint, Bing, Fabric), freeing developers from custom integration code.

Copilot Studio: Seamlessly discovers and invokes MCP capabilities—making it easy to add new data or actions to Microsoft 365 workflows.

SDKs: Python, TypeScript, and community kits receive regular updates.

Integration Steps:

Build/launch an MCP server in Azure Container Apps or Azure Functions.

Secure endpoints using TLS, Azure AD (OAuth), and RBAC.

Publish agent for Copilot Studio or Claude integration.

Connect to backend tools via MCP schemas: CosmosDB, Bing API, SQL, etc.

Use Azure Monitor and Application Insights for telemetry and security monitoring.

Why Azure Stands Out:

Deep integration with the Microsoft productivity suite, enterprise-grade identity, governance, and no/low-code agent enablement.

4. Google Cloud: MCP Toolbox & Vertex AI

What’s New:

MCP Toolbox for Databases: Released July 2025, this open-source module simplifies AI-agent access to Cloud SQL, Spanner, AlloyDB, BigQuery, and more—reducing integration to <10 lines of Python code.

Vertex AI: Native MCP via Agent Development Kit (ADK) allows robust multi-agent workflows across tools and data.

Security Models: Centralized connection-pooling, IAM integration, and VPC Service Controls.

Integration Steps:

Launch MCP Toolbox from Cloud Marketplace or deploy as a managed microservice.

Secure with IAM, VPC Service Controls, and OAuth2.

Register MCP tools and expose APIs for AI agent consumption.

Invoke database operations (e.g., bigquery.runQuery) via Vertex AI or MCP-enabled LLMs.

Audit all access via Cloud Audit Logs and Binary Authorization.

Why GCP Excels:

Best-in-class data tool integration, rapid agent orchestration, and strong enterprise network hygiene.

5. Cross-Cloud Best Practices

Area | Best Practices (2025)
Security | OAuth 2.0, TLS, fine-grained IAM/AAD/Cognito roles, audit logs, Zero Trust config
Discovery | Dynamic MCP capability discovery at startup; schemas must be kept up-to-date
Schema | Well-defined JSON-RPC schemas with robust error/edge-case handling
Performance | Use batching, caching, and paginated discovery for large tools lists
Testing | Test invalid parameters, multi-agent concurrency, logging, and traceability
Monitoring | Export telemetry via OpenTelemetry, CloudWatch, Azure Monitor, and App Insights

6. Security & Risk Management (2025 Threat Landscape)

Known Risks:

Prompt injection, privilege abuse, tool poisoning, impersonation, shadow MCP (rogue server), and new vulnerabilities enabling remote code execution in some MCP client libraries.

Mitigation: Only connect to trusted MCP servers over HTTPS, sanitize all AI inputs, validate tool metadata, deploy strong signature verification, and regularly review privilege scopes and audit logs.

Recent Vulnerabilities:

July 2025: CVE-2025-53110 and CVE-2025-6514 highlight the risk of remote code execution from malicious MCP servers. All users should urgently update affected libraries and restrict exposure to public/untrusted MCP endpoints.

7. Expanded Ecosystem: Beyond the “Big Three”

Anthropic: Core reference MCP servers—Postgres, GitHub, Slack, Puppeteer. Maintains rapid releases with new capabilities.

OpenAI: Full MCP support in GPT-4o, Agents SDK, sandbox and production use; extensive tutorials now available.

Google DeepMind: Gemini API has native SDK support for MCP definitions, broadening coverage in enterprise and research scenarios.

Other Companies Adopting MCP:

Netflix: Internal data orchestration.

Databricks: Integrating MCP for data pipeline agents.

Docusign, Litera: Automating legal agreements over MCP.

Replit, Zed, Codeium, Sourcegraph: Live code context tools.

Block (Square), Apollo, FuseBase, Wix: Next-gen enterprise integration.

8. Example: AWS MSK MCP Integration Flow

Deploy AWS MSK MCP server (use official AWS GitHub sample).

Secure with Cognito (OAuth2), WAF, IAM.

Configure available API actions and token rotation.

Connect supported AI agent (Claude, OpenAI, Bedrock).

Use agentic invocations, e.g., msk.getClusterInfo.

Monitor and analyze with CloudWatch/OpenTelemetry.

Iterate by adding new tool APIs; enforce least privilege.

9. Summary (July 2025)

MCP is the core open standard for AI-to-tool integrations.

AWS, Azure, and Google Cloud each offer robust first-party MCP support, often open source, with secure enterprise patterns.

Leading AI and developer platforms (OpenAI, DeepMind, Anthropic, Replit, Sourcegraph) are now MCP ecosystem “first movers.”

Security threats are real and dynamic—update tools, use Zero Trust, and follow best practices for credential management.

MCP unlocks rich, maintainable agentic workflows without per-agent or per-tool custom APIs.

The post Model Context Protocol (MCP) for Enterprises: Secure Integration with AWS, Azure, and Google Cloud- 2025 Update appeared first on MarkTechPost.

Deep Research Agents: A Systematic Roadmap for LLM-Based Autonomous Research Systems

A team of researchers from University of Liverpool, Huawei Noah’s Ark Lab, University of Oxford and University College London presents a report explaining Deep Research Agents (DR agents), a new paradigm in autonomous research. These systems are powered by Large Language Models (LLMs) and designed to handle complex, long-horizon tasks that require dynamic reasoning, adaptive planning, iterative tool use, and structured analytical outputs. Unlike traditional Retrieval-Augmented Generation (RAG) methods or static tool-use models, DR agents are capable of navigating evolving user intent and ambiguous information landscapes by integrating both structured APIs and browser-based retrieval mechanisms.

Limitations in Existing Research Frameworks

Prior to Deep Research Agents (DR agents), most LLM-driven systems focused on factual retrieval or single-step reasoning. RAG systems improved factual grounding, while tools like FLARE and Toolformer enabled basic tool use. However, these models lacked real-time adaptability, deep reasoning, and modular extensibility. They struggled with long-context coherence, efficient multi-turn retrieval, and dynamic workflow adjustment—key requirements for real-world research.

Architectural Innovations in Deep Research Agents (DR agents)

The foundational design of Deep Research Agents (DR agents) addresses the limitations of static reasoning systems. Key technical contributions include:

Workflow Classification: Differentiation between static (manual, fixed-sequence) and dynamic (adaptive, real-time) research workflows.

Model Context Protocol (MCP): A standardized interface enabling secure, consistent interaction with external tools and APIs.

Agent-to-Agent (A2A) Protocol: Facilitates decentralized, structured communication among agents for collaborative task execution.

Hybrid Retrieval Methods: Supports both API-based (structured) and browser-based (unstructured) data acquisition.

Multi-Modal Tool Use: Integration of code execution, data analytics, multimodal generation, and memory optimization within the inference loop.

System Pipeline: From Query to Report Generation

A typical Deep Research Agent (DR agent) processes a research query through:

Intent understanding via planning-only, intent-to-planning, or unified intent-planning strategies

Retrieval using both APIs (e.g., arXiv, Wikipedia, Google Search) and browser environments for dynamic content

Tool invocation through MCP for execution tasks like scripting, analytics, or media processing

Structured reporting, including evidence-grounded summaries, tables, or visualizations

Memory mechanisms such as vector databases, knowledge graphs, or structured repositories enable agents to manage long-context reasoning and reduce redundancy.
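A compressed sketch of this query-to-report pipeline is shown below, with every component stubbed out. The function names and report format are assumptions for illustration, not interfaces of the surveyed systems.

def understand_intent(query: str) -> dict:
    return {"goal": query, "subtasks": [query]}          # stand-in for intent-to-planning

def retrieve_evidence(subtask: str) -> list[str]:
    return [f"(stub) evidence for: {subtask}"]           # API- or browser-based retrieval goes here

def invoke_tool(evidence: list[str]) -> str:
    return f"analysis of {len(evidence)} snippets"       # MCP tool invocation goes here

def write_report(goal: str, findings: list[str]) -> str:
    body = "\n".join(f"- {f}" for f in findings)
    return f"Report: {goal}\n{body}"

plan = understand_intent("Survey recent work on deep research agents")
findings = [invoke_tool(retrieve_evidence(t)) for t in plan["subtasks"]]
print(write_report(plan["goal"], findings))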

Comparison with RAG and Traditional Tool-Use Agents

Unlike RAG methods that operate on static retrieval pipelines, Deep Research Agents (DR agents):

Perform multi-step planning with evolving task goals

Adapt retrieval strategies based on task progress

Coordinate among multiple specialized agents (in multi-agent settings)

Utilize asynchronous and parallel workflows

This architecture enables more coherent, scalable, and flexible research task execution.

Industrial Implementations of DR Agents

OpenAI DR: Uses an o3 reasoning model with RL-based dynamic workflows, multimodal retrieval, and code-enabled report generation.

Gemini DR: Built on Gemini-2.0 Flash; supports large context windows, asynchronous workflows, and multi-modal task management.

Grok DeepSearch: Combines sparse attention, browser-based retrieval, and a sandboxed execution environment.

Perplexity DR: Applies iterative web search with hybrid LLM orchestration.

Microsoft Researcher & Analyst: Integrate OpenAI models within Microsoft 365 for domain-specific, secure research pipelines.

Benchmarking and Performance

Deep Research Agents (DR agents) are tested using both QA and task-execution benchmarks:

QA: HotpotQA, GPQA, 2WikiMultihopQA, TriviaQA

Complex Research: MLE-Bench, BrowseComp, GAIA, HLE

Benchmarks measure retrieval depth, tool use accuracy, reasoning coherence, and structured reporting. Agents like DeepResearcher and SimpleDeepSearcher consistently outperform traditional systems.

FAQs

Q1: What are Deep Research Agents?
A: DR agents are LLM-based systems that autonomously conduct multi-step research workflows using dynamic planning and tool integration.

Q2: How are DR agents better than RAG models?
A: DR agents support adaptive planning, multi-hop retrieval, iterative tool use, and real-time report synthesis.

Q3: What protocols do DR agents use?
A: MCP (for tool interaction) and A2A (for agent collaboration).

Q4: Are these systems production-ready?
A: Yes. OpenAI, Google, Microsoft, and others have deployed DR agents in public and enterprise applications.

Q5: How are DR agents evaluated?
A: Using QA benchmarks like HotpotQA and HLE, and execution benchmarks like MLE-Bench and BrowseComp.

Check out the Paper. All credit for this research goes to the researchers of this project.

The post Deep Research Agents: A Systematic Roadmap for LLM-Based Autonomous Research Systems appeared first on MarkTechPost.

MemAgent: A Reinforcement Learning Framework Redefining Long-Context Processing in LLMs

Handling extremely long documents remains a persistent challenge for large language models (LLMs). Even with techniques such as length extrapolation and sparse attention, models often suffer from performance degradation and high computational costs. To address this, researchers from ByteDance Seed and Tsinghua University introduce MemAgent, a reinforcement learning-based memory agent designed to enable long-context processing with linear complexity and minimal performance loss.

Limitations of Existing Approaches

Current solutions for long-context modeling fall into three main categories:

Length Extrapolation Methods (e.g., NTK, PI, YaRN, DCA): Extend the context window via positional embedding manipulations. However, they often face performance degradation and scaling issues.

Sparse and Linear Attention Mechanisms: Reduce attention complexity to O(n) but typically require retraining from scratch and rely on fixed patterns or human-defined rules.

Context Compression: Use token-level or external memory modules to condense long inputs but often disrupt standard generation and struggle with extrapolation.

These approaches fail to deliver all three critical attributes: arbitrary input length support, consistent accuracy, and efficient linear complexity.

MemAgent: Human-Like Memory Strategy

Inspired by how humans summarize key information while ignoring noise, MemAgent processes input as a stream of evidence. At each step, it reads a document chunk and an internal memory, overwriting the latter with updated, compressed context.

Key innovations:

Fixed-Length Token-Based Memory: Compresses essential information while maintaining model compatibility.

Segment-Wise Overwrite Mechanism: Supports infinite text lengths without growing memory.

Linear Complexity: Memory update and decoding cost remain constant per chunk.
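The segment-wise overwrite loop can be sketched minimally as below, assuming a chunking helper and a stubbed update_memory call standing in for the RL-trained model; the point is that memory stays at a fixed size no matter how long the input is.

MEMORY_TOKENS = 64  # fixed memory budget (illustrative)

def chunk(text: str, size: int = 200) -> list[str]:
    return [text[i:i + size] for i in range(0, len(text), size)]

def update_memory(memory: str, chunk_text: str, query: str) -> str:
    # Stand-in for the RL-trained agent: keep only query-relevant words, overwrite the rest.
    relevant = [w for w in chunk_text.split() if w.lower() in query.lower()]
    return " ".join((memory.split() + relevant)[-MEMORY_TOKENS:])

def answer(document: str, query: str) -> str:
    memory = ""
    for c in chunk(document):                 # one pass, linear in document length
        memory = update_memory(memory, c, query)
    return f"Answer derived from memory: {memory!r}"

print(answer("filler text " * 50 + "Greenwich Village appears here", "Which New York city? Greenwich Village"))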

Multi-Conv RL Training with GRPO

MemAgent treats each document chunk interaction as an independent dialogue. It is trained via Group Relative Policy Optimization (GRPO) within a multi-conversation RL pipeline called DAPO, enabling reward-driven memory update.

Key elements include:

Rule-Based Verifier: Calculates outcome rewards by comparing model answers with multiple ground truths.

Token-Level RL Signal: Applied uniformly across conversations stemming from a sample.

This setup encourages memory compression focused on answer-relevant information and discards distractors.

Performance Evaluation

Using the RULER benchmark and synthetic datasets from HotpotQA and SQuAD, MemAgent was trained with an 8K context window and extrapolated up to 3.5 million tokens.

Model | 224K | 896K | 3.5M
Qwen2.5-Instruct-14B-1M | 37.5% | 0.0% | N/A
QwenLong-L1-32B | 17.2% | 11.7% | N/A
RL-MemAgent-14B | 81.3% | 77.3% | 78.1%

MemAgent maintained over 95% accuracy on RULER benchmarks (8K to 512K tokens) and consistently outperformed long-context and distillation-based baselines.

Case Study: Multi-Hop QA

Given the query “The director of the romantic comedy ‘Big Stone Gap’ is based in what New York city?”, MemAgent progressively tracked relevant content across 3 chunks:

Recognized unrelated content but retained location information.

Maintained memory against irrelevant chunks.

Correctly updated memory upon encountering Adriana Trigiani’s biography.

Final answer: Greenwich Village, New York City.

Theoretical Foundation and Complexity

MemAgent reformulates the autoregressive model using latent memory variables m_1, ..., m_K, where c_k denotes the k-th document chunk:

p(x_{1:N}) = Σ_{m_{1:K}} Π_{k=1}^{K} p(c_k | m_{k-1}) · p(m_k | c_k, m_{k-1})

This enables O(N) compute cost and human-readable intermediate memory—unlike attention-based feature compression. RL is essential, as memory updates are discrete and can’t be learned via backpropagation.

Conclusion

MemAgent offers a scalable and efficient solution to the long-context trilemma: unlimited input length, near-lossless accuracy, and linear complexity. Its RL-based overwrite memory mechanism allows LLMs to read, abstract, and generate over multi-million-token inputs without architectural modification.

FAQs

Q1: What is MemAgent?
A: MemAgent is a reinforcement learning-based framework that equips LLMs with memory tokens to handle extremely long contexts efficiently.

Q2: How is it different from attention or extrapolation methods?
A: Unlike attention-based scaling or extrapolation techniques, MemAgent uses token-based memory updated via reinforcement learning.

Q3: What models can MemAgent be applied to?
A: Any Transformer-based LLM. No changes to the model architecture are required.

Q4: How does it scale with input size?
A: It maintains linear computational complexity regardless of input length by fixing the memory size.

Q5: What are the applications of MemAgent?
A: Long-document QA, agent memory systems, legal document review, scientific literature analysis, and real-time decision-making with large evidence bases.

Check out the Paper. All credit for this research goes to the researchers of this project.

The post MemAgent: A Reinforcement Learning Framework Redefining Long-Context Processing in LLMs appeared first on MarkTechPost.

The Definitive Guide to AI Agents: Architectures, Frameworks, and Real-World Applications (2025)

Table of contents
What is an AI Agent?
Why AI Agents Matter in 2025
Types of AI Agents
Key Components of an AI Agent
Leading AI Agent Frameworks in 2025
Practical Use Cases for AI Agents
AI Agent vs. Chatbot vs. LLM
The Future of Agentic AI Systems
FAQs About AI Agents
Conclusion

What is an AI Agent?

An AI Agent is an autonomous software system that can perceive its environment, interpret data, reason, and execute actions to achieve specific goals without explicit human intervention. Unlike traditional automation, AI agents integrate decision-making, learning, memory, and multi-step planning capabilities—making them suitable for complex real-world tasks. In essence, an AI agent acts as a cognitive layer atop data and tools, intelligently navigating, transforming, or responding to situations in real time.

Why AI Agents Matter in 2025

AI agents are now at the forefront of next-generation software architecture. As businesses look to integrate generative AI into workflows, AI agents enable modular, extensible, and autonomous decision systems. With multi-agent systems, real-time memory, tool execution, and planning capabilities, agents are revolutionizing industries from DevOps to education. The shift from static prompts to dynamic, goal-driven agents is as significant as the leap from static websites to interactive web applications.

Types of AI Agents

1. Simple Reflex Agents

These agents operate based on the current percept, ignoring the rest of the percept history. They function using condition-action rules (if-then statements). For example, a thermostat responds to temperature changes without storing previous data.
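Condition-action rules are easy to see in code. A toy thermostat agent, entirely illustrative, looks like this:

def thermostat_agent(temperature_c: float) -> str:
    # Simple reflex agent: acts only on the current percept via if-then rules, no stored state.
    if temperature_c < 19:
        return "turn_heating_on"
    if temperature_c > 24:
        return "turn_cooling_on"
    return "do_nothing"

for reading in (17.5, 21.0, 26.3):
    print(reading, "->", thermostat_agent(reading))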

2. Model-Based Reflex Agents

These agents enhance reflex behavior by maintaining an internal state that depends on the percept history. The state captures information about the world, helping the agent handle partially observable environments.

3. Goal-Based Agents

Goal-based agents evaluate future actions to achieve a desired state or goal. By simulating different possibilities, they can select the most efficient path to meet specific objectives. Planning and search algorithms are fundamental here.

4. Utility-Based Agents

These agents not only pursue goals but also consider the desirability of outcomes by maximizing a utility function. They are essential in scenarios requiring trade-offs or probabilistic reasoning (e.g., economic decision-making).

5. Learning Agents

Learning agents continuously improve their performance by learning from experience. They consist of four main components: a learning element, a performance element, a critic (to provide feedback), and a problem generator (to suggest exploratory actions).

6. Multi-Agent Systems (MAS)

These systems involve multiple AI agents interacting in a shared environment. Each agent may have different goals, and they may cooperate or compete. MAS is useful in robotics, distributed problem-solving, and simulations.

7. Agentic LLMs

Emerging in 2024–2025, these are advanced agents powered by large language models. They incorporate capabilities such as reasoning, planning, memory, and tool use. Examples include AutoGPT, LangChain Agents, and CrewAI.

Key Components of an AI Agent

1. Perception (Input Interface)

The perception module enables the agent to observe and interpret its environment. It processes raw inputs such as text, audio, sensor data, or visual feeds and translates them into internal representations for reasoning.

2. Memory (Short-Term and Long-Term)

Memory allows agents to store and retrieve past interactions, actions, and observations. Short-term memory supports context retention within a session, while long-term memory can persist across sessions to build user or task profiles. Often implemented using vector databases.
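The vector-store pattern behind long-term memory can be illustrated with a toy bag-of-words embedding and cosine similarity; a real system would use a learned embedding model and a dedicated vector database, so treat everything here as a simplified stand-in.

import numpy as np

VOCAB = ["dark", "mode", "dog", "rex", "meeting", "friday", "named", "called", "moved", "prefers"]

def embed(text: str) -> np.ndarray:
    # Toy bag-of-words embedding over a tiny fixed vocabulary (NOT a real embedding model).
    words = text.lower().replace("?", "").split()
    return np.array([float(w in words) for w in VOCAB])

memory = {t: embed(t) for t in ["user prefers dark mode", "user dog is named Rex", "meeting moved to Friday"]}

def recall(query: str, k: int = 1) -> list[str]:
    q = embed(query)
    def cos(v: np.ndarray) -> float:
        return float(v @ q) / ((np.linalg.norm(v) * np.linalg.norm(q)) or 1.0)
    return sorted(memory, key=lambda t: cos(memory[t]), reverse=True)[:k]

print(recall("what is the dog called?"))  # -> ['user dog is named Rex']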

3. Planning and Decision-Making

This component enables agents to define a sequence of actions to achieve a goal. It uses planning algorithms (e.g., Tree-of-Thoughts, graph search, reinforcement learning) and can evaluate multiple strategies based on goals or utilities.

4. Tool Use and Action Execution

Agents interact with APIs, scripts, databases, or other software tools to act in the world. The execution layer handles these interactions securely and effectively, including function calls, shell commands, or web navigation.

5. Reasoning and Control Logic

Reasoning frameworks manage how an agent interprets observations and decides on actions. This includes logic chains, prompt engineering techniques (e.g., ReAct, CoT), and routing logic between modules.

6. Feedback and Learning Loop

Agents assess the success of their actions and update their internal state or behavior. This may involve user feedback, task outcome evaluation, or self-reflective strategies to improve over time.

7. User Interface

For human-agent interaction, a user interface—like a chatbot, voice assistant, or dashboard—facilitates communication and feedback. It bridges natural language understanding and action interfaces.

Leading AI Agent Frameworks in 2025

• LangChain

A dominant open-source framework for constructing LLM-based agents using chains, prompts, tool integration, and memory. It supports integrations with OpenAI, Anthropic, FAISS, Weaviate, web scraping tools, Python/JS execution, and more.

• Microsoft AutoGen

A framework geared toward multi-agent orchestration and code automation. It defines distinct agent roles—Planner, Developer, Reviewer—that communicate via natural language, enabling collaborative workflows.

• Semantic Kernel

An enterprise-grade toolkit from Microsoft that embeds AI into apps using “skills” and planners. It is model-agnostic, supports enterprise languages (Python, C#), and seamlessly integrates with LLMs like OpenAI and Hugging Face.

• OpenAI Agents SDK (Swarm)

A lightweight SDK defining agents, tools, handoffs, and guardrails. Optimized for GPT-4 and function-calling, it enables structured workflows with built-in monitoring and traceability.

• SuperAGI

A comprehensive agent-operating system offering persistent multi-agent execution, memory handling, visual runtime interface, and a marketplace for plug-and-play components.

• CrewAI

Focused on team-style orchestration, CrewAI allows developers to define specialized agent roles (e.g., Planner, Coder, Critic) and coordinate them in pipelines. It integrates seamlessly with LangChain and emphasizes collaboration.

• IBM watsonx Orchestrate

A no-code, enterprise SaaS solution for orchestrating “digital worker” agents across business workflows with drag-and-drop simplicity.

Practical Use Cases for AI Agents

Enterprise IT & Service Desk Automation

AI agents streamline internal support workflows—routing helpdesk tickets, diagnosing issues, and resolving common problems automatically. For instance, agents like IBM’s AskIT reduce IT support calls by 70%, while Atomicwork’s Diagnostics Agent supports self-service troubleshooting directly within teams’ chat tools.

Customer-Facing Support & Sales Assistance

These agents handle high-volume inquiries—from order tracking to product recommendations—by integrating with CRMs and knowledge bases. They boost user experience and deflect routine tickets. Case in point: e-commerce chatbots that manage returns, process refunds, and reduce support costs by ~65%. Botpress-powered sales agents have even increased lead volume by ~50%.

Contract & Document Analysis (Legal & Finance)

AI agents can analyze, extract, and summarize data from contracts and financial documents—reducing time spent by up to 75%. This supports sectors like banking, insurance, and legal where rapid, reliable insight is crucial.

E‑commerce & Inventory Optimization

Agents predict demand, track inventory, and handle returns or refunds with minimal human oversight. Walmart-style AI assistants and image-based product search (e.g., Pinterest Lens) enhance personalized shopping experiences and conversion rates.

Logistics & Operational Efficiency

In logistics, AI agents optimize delivery routes and manage supply chains. For example, UPS reportedly saved $300 million annually using AI-driven route optimization. In manufacturing, agents monitor equipment health via sensor data to predict and preempt breakdowns.

HR, Finance & Back‑Office Workflow Automation

AI agents automate internal tasks—from processing vacation requests to payroll queries. IBM’s digital HR agents automate 94% of routine queries, significantly reducing HR workload. Agents also streamline invoice processing, financial reconciliation, and compliance checks using document intelligence techniques.

Research, Knowledge Management & Analytics

AI agents support research by summarizing reports, retrieving relevant insights, and generating dashboards. Google Cloud’s generative AI agents can transform large datasets and documents into conversational insights for analysts.

AI Agent vs. Chatbot vs. LLM

Feature | Chatbot | LLM | AI Agent
Purpose | Task-specific dialogue | Text generation | Goal-oriented autonomy
Tool Use | No | Limited | Extensive (APIs, code, search)
Memory | Stateless | Short-term | Stateful + persistent
Adaptability | Predefined | Moderately adaptive | Fully adaptive with feedback loop
Autonomy | Reactive | Assistive | Autonomous + interactive

The Future of Agentic AI Systems

The trajectory is clear: AI agents will become modular infrastructure layers across enterprise, consumer, and scientific domains. With advancements in:

Planning Algorithms (e.g., Graph-of-Thoughts, PRM-based planning)

Multi-Agent Coordination

Self-correction and Evaluation Agents

Persistent Memory Storage and Querying

Tool Security Sandboxing and Role Guardrails

…we expect AI agents to mature into co-pilot systems that blend decision-making, autonomy, and accountability.

FAQs About AI Agents

Q: Are AI agents just LLMs with prompts?
A: No. True AI agents orchestrate memory, reasoning, planning, tool use, and adaptiveness beyond static prompts.

Q: Where can I build my first AI agent?
A: Try LangChain templates, Autogen Studio, or SuperAgent—all designed to simplify agent creation.

Q: Do AI agents work offline?
A: Most rely on cloud-based LLM APIs, but local models (e.g., Mistral, LLaMA, Phi) can run agents offline.

Q: How are AI agents evaluated?
A: Emerging benchmarks include AARBench (task execution), AgentEval (tool use), and HELM (holistic evaluation).

Conclusion

AI Agents represent a major evolution in AI system design—moving from passive generative models to proactive, adaptive, and intelligent agents that can interface with the world. Whether you’re automating DevOps, personalizing education, or building intelligent assistants, the agentic paradigm offers scalable and explainable intelligence.
The post The Definitive Guide to AI Agents: Architectures, Frameworks, and Real-World Applications (2025) appeared first on MarkTechPost.

You Don’t Need to Share Data to Train a Language Model Anymore—FlexOlmo Demonstrates How

The development of large-scale language models (LLMs) has historically required centralized access to extensive datasets, many of which are sensitive, copyrighted, or governed by usage restrictions. This constraint severely limits the participation of data-rich organizations operating in regulated or proprietary environments. FlexOlmo—introduced by researchers at the Allen Institute for AI and collaborators—proposes a modular training and inference framework that enables LLM development under data governance constraints.

Limitations of Current LLM Training Pipelines

Current LLM training pipelines rely on aggregating all training data into a single corpus, which imposes a static inclusion decision and eliminates the possibility of opt-out post-training. This approach is incompatible with:

Regulatory regimes (e.g., HIPAA, GDPR, data sovereignty laws),

License-bound datasets (e.g., non-commercial or attribution-restricted),

Context-sensitive data (e.g., internal source code, clinical records).

FlexOlmo addresses two objectives:

Decentralized, modular training: Allow independently trained modules on disjoint, locally held datasets.

Inference-time flexibility: Enable deterministic opt-in/opt-out mechanisms for dataset contributions without retraining.

Model Architecture: Expert Modularity via Mixture-of-Experts (MoE)

FlexOlmo builds upon a Mixture-of-Experts (MoE) architecture where each expert corresponds to a feedforward network (FFN) module trained independently. A fixed public model (denoted as M_pub) serves as the shared anchor. Each data owner trains an expert M_i using their private dataset D_i, while all attention layers and other non-expert parameters remain frozen.

Key architectural components:

Sparse activation: Only a subset of expert modules is activated per input token.

Expert routing: Token-to-expert assignment is governed by a router matrix derived from domain-informed embeddings, eliminating the need for joint training.

Bias regularization: A negative bias term is introduced to calibrate selection across independently trained experts, preventing over-selection of any single expert.

This design maintains interoperability among modules while enabling selective inclusion during inference.
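A numpy sketch of the routing idea, under simplifying assumptions: router rows play the role of the domain-informed expert embeddings, a negative bias down-weights the independently trained experts, and only the top-k experts fire per token. The shapes and the bias value are illustrative, not the paper's settings.

import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 8, 4, 2

# Row 0 stands in for the public expert; rows 1..3 stand in for independently trained experts
# whose router embeddings were initialized from the mean embedding of samples of their private data.
router = rng.standard_normal((n_experts, d_model))

# Negative bias on non-public experts to prevent over-selection (illustrative value).
bias = np.array([0.0, -0.5, -0.5, -0.5])

def route(token_hidden: np.ndarray) -> list[int]:
    scores = router @ token_hidden + bias
    return list(np.argsort(scores)[-top_k:][::-1])   # indices of the experts to activate

print(route(rng.standard_normal(d_model)))

# Opt-out at inference time: dropping an expert's row (and its bias entry) removes its influence entirely.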

Asynchronous and Isolated Optimization

Each expert M_i is trained via a constrained procedure to ensure alignment with M_pub. Specifically:

Training is performed on a hybrid MoE instance comprising M_i and M_pub.

The M_pub expert and shared attention layers are frozen.

Only the FFNs corresponding to M_i and the router embeddings r_i are updated.

To initialize r_i, a set of samples from D_i is embedded using a pretrained encoder, and their average forms the router embedding. Optional lightweight router tuning can further improve performance using proxy data from the public corpus.

Dataset Construction: FLEXMIX

The training corpus, FLEXMIX, is divided into:

A public mix, composed of general-purpose web data.

Seven closed sets simulating non-shareable domains: News, Reddit, Code, Academic Text, Educational Text, Creative Writing, and Math.

Each expert is trained on a disjoint subset, with no joint data access. This setup approximates real-world usage where organizations cannot pool data due to legal, ethical, or operational constraints.

Evaluation and Baseline Comparisons

FlexOlmo was evaluated on 31 benchmark tasks across 10 categories, including general language understanding (e.g., MMLU, AGIEval), generative QA (e.g., GEN5), code generation (e.g., Code4), and mathematical reasoning (e.g., Math2).

Baseline methods include:

Model soup: Averaging weights of individually fine-tuned models.

Branch-Train-Merge (BTM): Weighted ensembling of output probabilities.

BTX: Converting independently trained dense models into a MoE via parameter transplant.

Prompt-based routing: Using instruction-tuned classifiers to route queries to experts.

Compared to these methods, FlexOlmo achieves:

A 41% average relative improvement over the base public model.

A 10.1% improvement over the strongest merging baseline (BTM).

The gains are especially notable on tasks aligned with closed domains, confirming the utility of specialized experts.

Architectural Analysis

Several controlled experiments reveal the contribution of architectural decisions:

Removing expert-public coordination during training significantly degrades performance.

Randomly initialized router embeddings reduce inter-expert separability.

Disabling the bias term skews expert selection, particularly when merging more than two experts.

Token-level routing patterns show expert specialization at specific layers. For instance, mathematical input activates the math expert at deeper layers, while introductory tokens rely on the public model. This behavior underlines the model’s expressivity compared to single-expert routing strategies.

Opt-Out and Data Governance

A key feature of FlexOlmo is deterministic opt-out capability. Removing an expert from the router matrix fully removes its influence at inference time. Experiments show that removing the News expert reduces performance on NewsG but leaves other tasks unaffected, confirming the localized influence of each expert.

Privacy Considerations

Training data extraction risks were evaluated using known attack methods. Results indicate:

0.1% extraction rate for a public-only model.

1.6% for a dense model trained on the math dataset.

0.7% for FlexOlmo with the math expert included.

While these rates are low, differential privacy (DP) training can be applied independently to each expert for stronger guarantees. The architecture does not preclude the use of DP or encrypted training methods.

Scalability

The FlexOlmo methodology was applied to an existing strong baseline (OLMo-2 7B), pretrained on 4T tokens. Incorporating two additional experts (Math, Code) improved average benchmark performance from 49.8 to 52.8, without retraining the core model. This demonstrates scalability and compatibility with existing training pipelines.

Conclusion

FlexOlmo introduces a principled framework for building modular LLMs under data governance constraints. Its design supports distributed training on locally maintained datasets and enables inference-time inclusion/exclusion of dataset influence. Empirical results confirm its competitiveness against both monolithic and ensemble-based baselines.

The architecture is particularly applicable to environments with:

Data locality requirements,

Dynamic data use policies,

Regulatory compliance constraints.

FlexOlmo provides a viable pathway for constructing performant language models while adhering to real-world data access boundaries.

Check out the Paper, Model on Hugging Face and Codes. All credit for this research goes to the researchers of this project.

The post You Don’t Need to Share Data to Train a Language Model Anymore—FlexOlmo Demonstrates How appeared first on MarkTechPost.

o1 Style Thinking with Chain-of-Thought Reasoning using Mirascope

In this tutorial, we’ll explore how to implement Chain-of-Thought (CoT) reasoning using the Mirascope library and Groq’s LLaMA 3 model. Rather than having the model jump straight to an answer, CoT reasoning encourages it to break the problem down into logical steps—much like how a human would solve it. This approach improves accuracy, transparency, and helps tackle complex, multi-step tasks more reliably. We’ll guide you through setting up the schema, defining step-by-step reasoning calls, generating final answers, and visualizing the thinking process in a structured way.

We’ll be asking the LLM a relative velocity question – “If a train leaves City A at 9:00 AM traveling at 60 km/h, and another train leaves City B (which is 300 km away from City A) at 10:00 AM traveling at 90 km/h toward City A, at what time will the trains meet?”

Installing the dependencies

!pip install "mirascope[groq]"
# datetime is part of the Python standard library, so no separate install is needed

Groq API Key

For this tutorial, we require a Groq API key to make LLM calls. You can get one at https://console.groq.com/keys
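Mirascope's Groq integration typically picks the key up from the environment via the underlying Groq client, so setting GROQ_API_KEY before making any calls is usually enough. In a notebook you can set it like this (the key value is a placeholder):

import os

# Replace with your actual key from https://console.groq.com/keys
os.environ["GROQ_API_KEY"] = "gsk_..."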

Importing the libraries & defining a Pydantic schema

This section imports the required libraries and defines a COTResult Pydantic model. The schema structures each reasoning step with a title, content, and a next_action flag to indicate whether the model should continue reasoning or return the final answer.

from typing import Literal
from datetime import datetime

from mirascope.core import groq
from pydantic import BaseModel, Field

history: list[dict] = []


class COTResult(BaseModel):
    title: str = Field(..., description="The title of the step")
    content: str = Field(..., description="The output content of the step")
    next_action: Literal["continue", "final_answer"] = Field(
        ..., description="The next action to take"
    )

Defining Step-wise Reasoning and Final Answer Functions

These functions form the core of the Chain-of-Thought (CoT) reasoning workflow. The cot_step function allows the model to think iteratively by reviewing prior steps and deciding whether to continue or conclude. This enables deeper reasoning, especially for multi-step problems. The final_answer function consolidates all reasoning into a single, focused response, making the output clean and ready for end-user consumption. Together, they help the model approach complex tasks more logically and transparently.

@groq.call("llama-3.3-70b-versatile", json_mode=True, response_model=COTResult)
def cot_step(prompt: str, step_number: int, previous_steps: str) -> str:
    return f"""
    You are an expert AI assistant that explains your reasoning step by step.
    For this step, provide a title that describes what you're doing, along with the content.
    Decide if you need another step or if you're ready to give the final answer.

    Guidelines:
    - Use AT MOST 5 steps to derive the answer.
    - Be aware of your limitations as an LLM and what you can and cannot do.
    - In your reasoning, include exploration of alternative answers.
    - Consider you may be wrong, and if you are wrong in your reasoning, where it would be.
    - Fully test all other possibilities.
    - YOU ARE ALLOWED TO BE WRONG. When you say you are re-examining
    - Actually re-examine, and use another approach to do so.
    - Do not just say you are re-examining.

    IMPORTANT: Do not use code blocks or programming examples in your reasoning. Explain your process in plain language.

    This is step number {step_number}.

    Question: {prompt}

    Previous steps:
    {previous_steps}
    """


@groq.call("llama-3.3-70b-versatile")
def final_answer(prompt: str, reasoning: str) -> str:
    return f"""
    Based on the following chain of reasoning, provide a final answer to the question.
    Only provide the text response without any titles or preambles.
    Retain any formatting as instructed by the original prompt, such as exact formatting for free response or multiple choice.

    Question: {prompt}

    Reasoning:
    {reasoning}

    Final Answer:
    """

Generating and Displaying Chain-of-Thought Responses

This section defines two key functions to manage the full Chain-of-Thought reasoning loop:

generate_cot_response handles the iterative reasoning process. It sends the user query to the model step-by-step, tracks each step’s content, title, and response time, and stops when the model signals it has reached the final answer or after a maximum of 5 steps. It then calls final_answer to produce a clear conclusion based on the accumulated reasoning.

display_cot_response neatly prints the step-by-step breakdown along with the time taken for each step, followed by the final answer and the total processing time.

Together, these functions help visualize how the model reasons through a complex prompt and allow for better transparency and debugging of multi-step outputs.

def generate_cot_response(
    user_query: str,
) -> tuple[list[tuple[str, str, float]], float]:
    steps: list[tuple[str, str, float]] = []
    total_thinking_time: float = 0.0
    step_count: int = 1
    reasoning: str = ""
    previous_steps: str = ""

    while True:
        start_time: datetime = datetime.now()
        cot_result = cot_step(user_query, step_count, previous_steps)
        end_time: datetime = datetime.now()
        thinking_time: float = (end_time - start_time).total_seconds()

        steps.append(
            (
                f"Step {step_count}: {cot_result.title}",
                cot_result.content,
                thinking_time,
            )
        )
        total_thinking_time += thinking_time

        reasoning += f"\n{cot_result.content}\n"
        previous_steps += f"\n{cot_result.content}\n"

        if cot_result.next_action == "final_answer" or step_count >= 5:
            break

        step_count += 1

    # Generate the final answer from the accumulated reasoning
    start_time = datetime.now()
    final_result: str = final_answer(user_query, reasoning).content
    end_time = datetime.now()
    thinking_time = (end_time - start_time).total_seconds()
    total_thinking_time += thinking_time

    steps.append(("Final Answer", final_result, thinking_time))

    return steps, total_thinking_time


def display_cot_response(
    steps: list[tuple[str, str, float]], total_thinking_time: float
) -> None:
    for title, content, thinking_time in steps:
        print(f"{title}:")
        print(content.strip())
        print(f"**Thinking time: {thinking_time:.2f} seconds**\n")

    print(f"**Total thinking time: {total_thinking_time:.2f} seconds**")

Running the Chain-of-Thought Workflow

The run function initiates the full Chain-of-Thought (CoT) reasoning process by sending a multi-step math word problem to the model. It begins by printing the user’s question, then uses generate_cot_response to compute a step-by-step reasoning trace. These steps, along with the total processing time, are displayed using display_cot_response.

Finally, the function logs both the question and the model’s final answer into a shared history list, preserving the full interaction for future reference or auditing. This function ties together all earlier components into a complete, user-facing reasoning flow.

def run() -> None:
    question: str = "If a train leaves City A at 9:00 AM traveling at 60 km/h, and another train leaves City B (which is 300 km away from City A) at 10:00 AM traveling at 90 km/h toward City A, at what time will the trains meet?"
    print("(User):", question)
    # Generate the Chain-of-Thought response
    steps, total_thinking_time = generate_cot_response(question)
    display_cot_response(steps, total_thinking_time)

    # Add the interaction to the history
    history.append({"role": "user", "content": question})
    history.append(
        {"role": "assistant", "content": steps[-1][1]}
    )  # Add only the final answer to the history


# Run the function
run()

Check out the Codes. All credit for this research goes to the researchers of this project.

The post o1 Style Thinking with Chain-of-Thought Reasoning using Mirascope appeared first on MarkTechPost.

EG-CFG: Enhancing Code Generation with Real-Time Execution Feedback

LLMs have made impressive strides in generating code for various programming tasks. However, they mostly rely on recognizing patterns from static code examples rather than understanding how the code behaves during execution. This often leads to programs that look correct but fail when run. While recent methods introduce iterative refinement and self-debugging, they typically act in separate steps, generating, testing, and then revising. Unlike human programmers who constantly run fragments of code and adjust based on real-time output, these models cannot integrate execution feedback continuously, limiting their effectiveness in producing truly functional code.

The Role of Program Synthesis and Prompting in Code Generation

Program synthesis has long been used to evaluate LLMs and automate code generation benchmarks, such as MBPP, HumanEval, and CodeContests, by testing models on various coding challenges. While prompting strategies, such as few-shot and Chain-of-Thought, have improved performance, newer methods now incorporate feedback loops that utilize tools or execution results to refine outputs. Some frameworks even assign tasks to multiple LLM agents, each tackling different aspects of the problem. However, most approaches still rely on simple decoding methods. Unlike traditional strategies, newer guidance techniques, such as CFG, offer a more dynamic approach but haven’t yet been widely applied with real-time execution feedback.

Introducing EG-CFG: Execution-Guided Code Generation from Tel Aviv University

Researchers at Tel Aviv University have introduced EG-CFG, a new method for code generation that actively utilizes execution feedback during the generation process, a technique commonly employed by human programmers. Instead of waiting until the end, EG-CFG evaluates partial code as it’s being written, guiding the model toward correct and executable outputs. It uses a beam search to generate multiple code options, runs them, and integrates runtime outcomes to influence the next steps. This real-time feedback loop significantly boosts performance across standard benchmarks, such as MBPP, HumanEval, and CodeContests, even surpassing closed-source models, while also enabling efficient parallel reasoning and dynamic exploration.

How EG-CFG Works: Real-Time Feedback Meets Beam Search and AST Parsing

The EG-CFG method improves code generation by guiding language models using real-time execution feedback during inference. For a given programming task, it generates partial code solutions and explores multiple continuations using beam search. These continuations are checked for syntax using AST parsing, and only valid ones are executed on test cases to gather detailed runtime traces, including variable states and errors. This feedback is then injected into the model’s prompt to inform future predictions. A guidance mechanism interpolates between the model’s standard output and feedback-informed suggestions, helping the model refine its solution step by step until it passes all test cases.
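
The paper folds this feedback directly into the decoding distribution via classifier-free guidance, which the snippet below does not attempt to reproduce. It is only a minimal, hypothetical sketch of the outer loop described above: gating candidate continuations on syntactic validity with Python's ast module, executing the survivors against a test case, and collecting runtime traces that could be injected back into the prompt. The candidate strings, test case, and feedback format are all illustrative assumptions.

# Illustrative only: a simplified filter-and-execute loop in the spirit of EG-CFG.
import ast
import traceback

def gather_execution_feedback(partial_code: str, candidates: list[str], test_case: str) -> list[dict]:
    """Run each syntactically valid candidate continuation against a test case
    and collect runtime feedback that could be injected back into the prompt."""
    feedback = []
    for cand in candidates:
        program = partial_code + cand
        # 1) Syntax gate: discard continuations that do not parse into a valid AST.
        try:
            ast.parse(program)
        except SyntaxError as err:
            feedback.append({"candidate": cand, "status": "syntax_error", "detail": str(err)})
            continue
        # 2) Execute the candidate together with the test case and record the outcome.
        env: dict = {}
        try:
            exec(program + "\n" + test_case, env)  # sketch only; do not exec untrusted code
            feedback.append({"candidate": cand, "status": "passed", "detail": ""})
        except Exception:
            feedback.append({"candidate": cand, "status": "runtime_error",
                             "detail": traceback.format_exc(limit=1)})
    return feedback

# Example usage with a toy task: complete `def add(a, b):`
print(gather_execution_feedback(
    "def add(a, b):\n",
    ["    return a + b\n", "    return a +\n"],
    "assert add(2, 3) == 5",
))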

Benchmark Results: EG-CFG Outperforms GPT-4 and Claude on HumanEval and MBPP-ET

The EG-CFG method was tested using two versions of DeepSeek models: a 1.3B-parameter model run locally and the larger V3-0324 model accessed through an API. It was evaluated on five code benchmarks: MBPP, HumanEval, CodeContests, MBPP-ET, and HumanEval-ET. On HumanEval, EG-CFG with DeepSeek V3 solved 90.1% of the tasks correctly, outperforming GPT-4 (85.5%) and Claude 2 (83.2%). On MBPP-ET, it achieved 81.4% accuracy, a new state-of-the-art result. Notably, the smaller 1.3B model also showed strong gains, improving from 46.3% to 61.7% on HumanEval when guided with EG-CFG. An ablation study confirmed the importance of components like dynamic feedback and beam search in driving these results.

Conclusion: EG-CFG Simulates Human Debugging to Advance Code Generation

In conclusion, the EG-CFG method introduces a new way to generate code using language models by incorporating real-time execution feedback during generation. Unlike traditional approaches that rely on static patterns, EG-CFG simulates how human programmers test and refine code. It uses beam search to explore possible code completions, tests them with real inputs, and then guides generation based on the results. This happens line by line, ensuring feedback is both structured and actionable. The method also supports multiple agents working in parallel, boosting efficiency. EG-CFG achieves top accuracy across standard benchmarks, showing strong results even on complex coding tasks and with smaller models.

Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.


Build real-time travel recommendations using AI agents on Amazon Bedro …

Generative AI is transforming how businesses deliver personalized experiences across industries, including travel and hospitality. Travel agents are enhancing their services by offering personalized holiday packages, carefully curated for each customer's unique preferences, including accessibility needs, dietary restrictions, and activity interests. Meeting these expectations requires a solution that combines comprehensive travel knowledge with real-time pricing and availability information.
In this post, we show how to build a generative AI solution using Amazon Bedrock that creates bespoke holiday packages by combining customer profiles and preferences with real-time pricing data. We demonstrate how to use Amazon Bedrock Knowledge Bases for travel information, Amazon Bedrock Agents for real-time flight details, and Amazon OpenSearch Serverless for efficient package search and retrieval.
Solution overview
Travel agencies face increasing demand for personalized recommendations while struggling with real-time data accuracy and scalability. Consider a travel agency that needs to offer accessible holiday packages: it must match specific accessibility requirements with real-time flight and accommodation availability, but is constrained by manual processing times and outdated information in traditional systems. This AI-powered solution combines personalization with real-time data integration, enabling the agency to automatically match accessibility requirements with current travel options and deliver accurate recommendations in minutes rather than hours. The solution uses a three-layer architecture to help travel agents create personalized holiday recommendations:

Frontend layer – Provides an interface where travel agents input customer requirements and preferences
Orchestration layer – Processes requests and enriches them with customer data
Recommendation layer – Combines two key components:

Travel data storage – Maintains a searchable repository of travel packages
Real-time information retrieval – Fetches current flight details through API integration

The following diagram illustrates this architecture.

With this layered approach, travel agents can capture customer requirements, enrich them with stored preferences, integrate real-time data, and deliver personalized recommendations that match customer needs. The following diagram illustrates how these components are implemented using AWS services.

The AWS implementation includes:

Amazon API Gateway – Receives requests and routes them to AWS Lambda functions facilitating secure API calls for retrieving recommendations
AWS Lambda – Processes input data, creates the enriched prompt, and executes the recommendation workflow (a simplified sketch follows this list)
Amazon DynamoDB – Stores customer preferences and travel history
Amazon Bedrock Knowledge Bases – Helps travel agents build a curated database of destinations, travel packages, and deals, making sure recommendations are based on reliable and up-to-date information
Amazon OpenSearch Serverless – Enables simple, scalable, and high-performing vector search
Amazon Simple Storage Service (Amazon S3) – Stores large datasets such as flight schedules and promotional materials
Amazon Bedrock Agents – Integrates real-time information retrieval, making sure recommended itineraries reflect current availability, pricing, and scheduling through external API integrations
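
To make the orchestration step more concrete, the following is a minimal, hypothetical sketch of what the recommendation Lambda handler could look like. The knowledge base ID, model ARN, key schema, and prompt wording are placeholders, not values from the actual CloudFormation template; only the service combination (DynamoDB enrichment plus an Amazon Bedrock knowledge base query) reflects the architecture described here.

# Hypothetical orchestration Lambda; resource names and key schema are placeholders.
import json
import boto3

dynamodb = boto3.resource("dynamodb")
bedrock_agent_runtime = boto3.client("bedrock-agent-runtime")

PREFERENCES_TABLE = "UserPreferences"           # table described later in this post
KNOWLEDGE_BASE_ID = "<your-knowledge-base-id>"  # placeholder
MODEL_ARN = "arn:aws:bedrock:<region>::foundation-model/anthropic.claude-3-haiku-20240307-v1:0"

def lambda_handler(event, context):
    request = json.loads(event["body"])

    # Enrich the request with the customer's stored preferences from DynamoDB
    # (the partition key name "userId" is an assumption for this sketch).
    profile = dynamodb.Table(PREFERENCES_TABLE).get_item(
        Key={"userId": request["userId"]}
    ).get("Item", {})

    # Build the enriched prompt (the full template used by the solution is shown later).
    prompt = (
        f"Travel style: {profile.get('travelStyle', 'unspecified')}\n"
        f"Accessibility needs: {profile.get('accessibility', [])}\n"
        f"Budget: {request['budget']}, Duration: {request['duration']}\n"
        "Recommend a personalized holiday package."
    )

    # Query the Amazon Bedrock knowledge base and generate a recommendation.
    response = bedrock_agent_runtime.retrieve_and_generate(
        input={"text": prompt},
        retrieveAndGenerateConfiguration={
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": KNOWLEDGE_BASE_ID,
                "modelArn": MODEL_ARN,
            },
        },
    )
    return {"statusCode": 200, "body": json.dumps(response["output"]["text"])}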

This solution uses an AWS CloudFormation template that automatically provisions and configures the required resources. The template handles the complete setup process, including service configurations and necessary permissions.
For the latest information about service quotas that might affect your deployment, refer to AWS service quotas.
Prerequisites
To deploy and use this solution, you must have the following:

An AWS account with access to Amazon Bedrock
Permissions to create and manage the following services:

Amazon Bedrock
Amazon OpenSearch Serverless
Lambda
DynamoDB
Amazon S3
API Gateway

Access to foundation models in Amazon Bedrock for Amazon Titan Text Embeddings V2 and Anthropic Claude 3 Haiku models

Deploy the CloudFormation stack
You can deploy this solution in your AWS account using AWS CloudFormation. Complete the following steps:

Choose Launch Stack:

You will be redirected to the Create stack wizard on the AWS CloudFormation console with the stack name and the template URL already filled in.

Leave the default settings and complete the stack creation.
Choose View stack events to go to the AWS CloudFormation console to see the deployment details.

The stack takes around 10 minutes to create the resources. Wait until the stack status is CREATE_COMPLETE before continuing to the next steps.
The CloudFormation template automatically creates and configures components for data storage and management, Amazon Bedrock, and the API and interface.
Data storage and management
The template sets up the following data storage and management resources:

An S3 bucket with a sample dataset (travel_data.json and promotions.csv), a prompt template, and the API schema

DynamoDB tables populated with sample user profiles and travel history

An OpenSearch Serverless collection with optimized settings for travel package searches

A vector index with settings compatible with the Amazon Bedrock knowledge base

Amazon Bedrock configuration
For Amazon Bedrock, the CloudFormation template creates the following resources:

A knowledge base with the travel dataset and data sources ingested from Amazon S3 with automatic synchronization

An Amazon Bedrock agent, which is automatically prepared

A new version and alias for the agent

Agent action groups with mock flight data integration

An action group invocation, configured with the FlightPricingLambda Lambda function and the API schema retrieved from the S3 bucket

API and interface setup
To enable API access and the UI, the template configures the following resources:

API Gateway endpoints
Lambda functions with a mock flight API for demonstration purposes
A web interface for travel agents

Verify the setup
After stack creation is complete, you can verify the setup on the Outputs tab of the AWS CloudFormation console, which provides the following information:

WebsiteURL – Access the travel agent interface
ApiEndpoint – Use for programmatic access to the recommendation system

Test the endpoints
The web interface provides an intuitive form where travel agents can input customer requirements, including:

Customer ID (for example, Joe or Will)
Travel budget
Preferred dates
Number of travelers
Travel style

You can call the API directly using the following code:

curl -X POST \
  <ApiEndpoint> \
  -H 'Content-Type: application/json' \
  -d '{
    "userId": "Joe",
    "budget": "3000 GBP",
    "duration": "7 days",
    "travelDate": "2025-07-15",
    "numberOfTravelers": 2
  }'
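
If you prefer Python over curl, a roughly equivalent call with the requests library looks like the sketch below; substitute the ApiEndpoint value from your stack outputs for the placeholder URL.

# Equivalent request using the Python requests library; replace <ApiEndpoint>
# with the API Gateway endpoint from the CloudFormation stack outputs.
import requests

payload = {
    "userId": "Joe",
    "budget": "3000 GBP",
    "duration": "7 days",
    "travelDate": "2025-07-15",
    "numberOfTravelers": 2,
}
response = requests.post("<ApiEndpoint>", json=payload, timeout=60)
print(response.status_code)
print(response.json())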

Test the solution
For demonstration purposes, we create sample user profiles in the UserPreferences and TravelHistory tables in DynamoDB.
The UserPreferences table stores user-specific travel preferences. For instance, Joe represents a luxury traveler with wheelchair accessibility requirements.

Will represents a budget traveler with elderly-friendly needs. These profiles help showcase how the system handles different customer requirements and preferences.

The TravelHistory table stores past trips taken by users. The following shows the past trips taken by the user Joe, including destinations, trip durations, ratings, and travel dates.

Let’s walk through a typical use case to demonstrate how a travel agent can use this solution to create personalized holiday recommendations. Consider a scenario where a travel agent is helping Joe, a customer who requires wheelchair accessibility, plan a luxury vacation. The travel agent enters the following information:

Customer ID: Joe
Budget: 4,000 GBP
Duration: 5 days
Travel dates: July 15, 2025
Number of travelers: 2
Travel style: Luxury

When a travel agent submits a request, the system orchestrates a series of actions through the PersonalisedHolidayFunction Lambda function, which will query the knowledge base, check real-time flight information using the mock API, and return personalized recommendations that match the customer’s specific needs and preferences. The recommendation layer uses the following prompt template:

Based on the profile and requirements:

User Preferences:
– Travel Preferences: {travelStyle}
– Interests: {interests}
– Dietary Restrictions: {dietaryRestrictions}
– Accessibility Needs: {accessibility}

Current Request:
– Budget: {budget}
– Duration: {duration}
– Travel Date: {travelDate}
– Number of Travelers: {numberOfTravelers}

Previous Destinations: {previousDestinations}

Instructions:
1. Match the user’s budget, travel style and interests
2. Consider dietary restrictions and accessibility needs
3. Avoid previously visited destinations
4. Include:
   – Recommended destinations
   – Suitable accommodations
   – Relevant activities and experiences
   – Transportation options
   – Estimated cost breakdown
   – Travel tips

Please follow the <Instructions> and provide a personalized holiday recommendation in the below format:
Destination: [Primary recommended destination]

[Detailed recommendation]

The system retrieves Joe’s preferences from the user profile, including:

{
    “userPreferences”: {
        “preferences”: “Prefer warm climate and cultural experiences”,
        “budget”: 3000,
        “duration”: “5 days”,
        “travelDate”: “2025-03-04”,
        “interests”: [
            “photography”,
            “food”,
            “beach”
        ],
        “travelStyle”: “Luxury”,
        “numberOfTravelers”: 2,
        “dietaryRestrictions”: [
            “plant based”,
            “vegetarian”
        ],
        “accessibility”: [
            “wheelchair-accessible”
        ],
        “previousDestinations”: [
            “Maldives”,
            “Bali”
        ]
    }
}
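
To illustrate how these stored preferences feed the prompt template shown earlier, here is a minimal, hypothetical sketch of the merge step. The helper and variable names (build_prompt, PROMPT_TEMPLATE, profile, request) are illustrative, not the actual Lambda code; the placeholder names match the template and the profile fields above.

# Illustrative prompt assembly; PROMPT_TEMPLATE is the template shown above and
# prefs is the "userPreferences" object retrieved from DynamoDB.
def build_prompt(template: str, prefs: dict, request: dict) -> str:
    """Merge stored preferences and the current request into the prompt template."""
    return template.format(
        travelStyle=prefs["travelStyle"],
        interests=", ".join(prefs["interests"]),
        dietaryRestrictions=", ".join(prefs["dietaryRestrictions"]),
        accessibility=", ".join(prefs["accessibility"]),
        budget=request["budget"],
        duration=request["duration"],
        travelDate=request["travelDate"],
        numberOfTravelers=request["numberOfTravelers"],
        previousDestinations=", ".join(prefs["previousDestinations"]),
    )

# Example: prompt = build_prompt(PROMPT_TEMPLATE, profile["userPreferences"], request)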

The system then generates personalized recommendations that consider the following:

Destinations with proven wheelchair accessibility
Available luxury accommodations
Flight details for the recommended destination

Each recommendation includes the following details:

Detailed accessibility information
Real-time flight pricing and availability
Accommodation details with accessibility features
Available activities and experiences
Total package cost breakdown

Clean up
To avoid incurring future charges, delete the CloudFormation stack. For more information, see Delete a stack from the CloudFormation console.
The template includes proper deletion policies, making sure the resources you created, including S3 buckets, DynamoDB tables, and OpenSearch collections, are properly removed.
Next steps
To further enhance this solution, consider the following:

Explore multi-agent capabilities:

Create specialized agents for different travel aspects (hotels, activities, local transport)
Enable agent-to-agent communication for complex itinerary planning
Implement an orchestrator agent to coordinate responses and resolve conflicts

Implement multi-language support using multi-language foundation models in Amazon Bedrock
Integrate with customer relationship management (CRM) systems

Conclusion
In this post, you learned how to build an AI-powered holiday recommendation system using Amazon Bedrock that helps travel agents deliver personalized experiences. Our implementation demonstrated how combining Amazon Bedrock Knowledge Bases with Amazon Bedrock Agents effectively bridges historical travel information with real-time data needs, while using a serverless architecture and vector search for efficient matching of customer preferences with travel packages. The solution shows how travel recommendation systems can balance comprehensive travel knowledge, real-time data accuracy, and personalization at scale. This approach is particularly valuable for travel organizations that need to integrate real-time pricing data, handle specific accessibility requirements, or scale their personalized recommendations. It provides a practical starting point with clear paths for enhancement based on specific business needs, whether you are modernizing travel planning systems or handling complex customer requirements.
Related resources
To learn more, refer to the following resources:

Documentation:

Amazon Bedrock Documentation
Automate tasks in your application using AI agents
Retrieve data and generate AI responses with Amazon Bedrock Knowledge Bases
Amazon OpenSearch Serverless Developer Guide
Building Lambda functions with Python

Code samples:

Amazon Bedrock RAG with Knowledge Bases and Agents
Amazon Bedrock Samples Repository
Amazon Bedrock Agent Samples Repository

Additional learning:

AWS Machine Learning Blog
AWS Training and Certification

About the Author
Vishnu Vardhini
Vishnu Vardhini is a Solutions Architect at AWS based in Scotland, focusing on SMB customers across industries. With expertise in Security, Cloud Engineering and DevOps, she architects scalable and secure AWS solutions. She is passionate about helping customers leverage Machine Learning and Generative AI to drive business value.

Deploy a full stack voice AI agent with Amazon Nova Sonic

AI-powered speech solutions are transforming contact centers by enabling natural conversations between customers and AI agents, shortening wait times, and dramatically reducing operational costs—all without sacrificing the human-like interaction customers expect. With the recent launch of Amazon Nova Sonic in Amazon Bedrock, you can now build sophisticated conversational AI agents that communicate naturally through voice, without the need for separate speech recognition and text-to-speech components. Amazon Nova Sonic is a speech-to-speech model in Amazon Bedrock that enables real-time, human-like voice conversations.
Whereas many early Amazon Nova Sonic implementations focused on local development, this solution provides a complete cloud-deployed architecture that you can use as a foundation for building real proof of concept applications. This asset is deployable through the AWS Cloud Development Kit (AWS CDK) and provides a foundation for building further Amazon Nova use cases using preconfigured infrastructure components, while allowing you to customize the architecture to address your specific business requirements.
In this post, we show how to create an AI-powered call center agent for a fictional company called AnyTelco. The agent, named Telly, can handle customer inquiries about plans and services while accessing real-time customer data using custom tools implemented with the Model Context Protocol (MCP) framework.
Solution overview
The following diagram provides an overview of the deployable solution.

The solution is composed of the following layers:

Frontend layer – The frontend layer of this system is built with scalability and performance in mind:

Amazon CloudFront distribution serves as the content delivery network for the web application.
Amazon Simple Storage Service (Amazon S3) hosts static assets.
The UI handles audio streaming and user interaction.

Communication layer – The communication layer facilitates seamless real-time interactions:

Network Load Balancer manages WebSocket connections. WebSockets enable two-way interactive communication sessions between a user’s browser and the server, which is essential for real-time audio streaming applications.
Amazon Cognito provides user authentication and JSON Web Token (JWT) validation. It also handles authorization and user management for web and mobile applications, alleviating the need to build and maintain your own identity systems.

Processing layer – The processing layer forms the computational backbone of the system:

Amazon Elastic Container Service (Amazon ECS) runs the containerized backend service.
AWS Fargate provides the serverless compute backend, with orchestration handled by the Amazon ECS engine.
The Python backend processes audio streams and manages Amazon Nova Sonic interactions.

Intelligence layer – The intelligence layer uses AI and data technologies to power the core functionalities:

The Amazon Nova Sonic model in Amazon Bedrock handles speech processing.
Amazon DynamoDB stores customer information.
Amazon Bedrock Knowledge Bases connects foundation models (FMs) with your organization’s data sources, allowing AI applications to reference accurate, up-to-date information specific to your business.

The following sequence diagram highlights the flow when a user initiates conversation. The user only signs in one time, but authentication Steps 3 and 4 happen every time the user starts a new session. The conversational loop in Steps 6–12 is repeated throughout the conversational interaction. Steps a–c only happen when the Amazon Nova Sonic agent decides to use a tool. In scenarios without tool use, the flow goes directly from Step 9 to Step 10.

Prerequisites
Before getting started, verify that you have the following:

Python 3.12
Node.js v20
npm v10.8
An AWS account
The AWS CDK set up (for prerequisites and installation instructions, see Getting started with the AWS CDK)
Amazon Nova Sonic enabled in Amazon Bedrock (for more information, see Add or remove access to Amazon Bedrock foundation models)
Chrome or Safari browser environment (Firefox is not supported at the time of writing)
A working microphone and speakers

Deploy the solution
You can find the solution and full deployment instructions on the GitHub repository. The solution uses the AWS CDK to automate infrastructure deployment. Use the following code terminal commands to get started in your AWS Command Line Interface (AWS CLI) environment:

git clone https://github.com/aws-samples/sample-sonic-cdk-agent.git
cd nova-s2s-call-center

# Configure environment variables
cp template.env .env

# Edit .env with your settings

# Deploy the solution
./deploy.sh

The deployment creates two AWS CloudFormation stacks:

Network stack for virtual private cloud (VPC) and networking components
Stack for application resources

The output of the second stack gives you a CloudFront distribution link, which takes you to the login page.

You can create an Amazon Cognito admin user with the following AWS CLI command:

aws cognito-idp admin-create-user \
  --user-pool-id YOUR_USER_POOL_ID \
  --username USERNAME \
  --user-attributes Name=email,Value=USER_EMAIL \
  --temporary-password TEMPORARY_PASSWORD \
  --region YOUR_AWS_REGION

The preceding command uses the following parameters:

YOUR_USER_POOL_ID: The ID of your Amazon Cognito user pool
USERNAME: The desired user name for the user
USER_EMAIL: The email address of the user
TEMPORARY_PASSWORD: A temporary password for the user
YOUR_AWS_REGION: Your AWS Region (for example, us-east-1)

Log in with your temporary password from the CloudFront distribution link, and you will be asked to set a new password.
You can choose Start Session to start a conversation with your assistant. Experiment with prompts and different tools for your use case.

Customizing the application
A key feature of this solution is its flexibility—you can tailor the AI agent’s capabilities to your specific use case. The sample implementation demonstrates this extensibility through custom tools and knowledge integration:

Customer information lookup – Retrieves customer profile data from DynamoDB using phone numbers as keys
Knowledge base search – Queries an Amazon Bedrock knowledge base for company information, plan details, and pricing

These features showcase how to enhance the functionality of Amazon Nova Sonic with external data sources and domain-specific knowledge. The architecture is designed for seamless customization in several key areas.
Modifying the system prompt
The solution includes a UI in which you can adjust the AI agent’s behavior by modifying its system prompt. This enables rapid iteration on the agent’s personality, knowledge base, and conversation style without redeploying the entire application.

Adding new tools
You can also extend the AI agent’s capabilities by implementing additional tools using the MCP framework. The process involves:

Implementing the tool logic, typically as a new Python module
Registering the tool with the MCP server by using the @mcp_server.tool custom decorator and defining the tool specification, including its name, description, and input schema in /backend/tools/mcp_tool_registry.py

For example, the following code illustrates how to add a knowledge base lookup tool:

@mcp_server.tool(
    name="lookup",
    description="Runs query against a knowledge base to retrieve information.",
)
async def lookup_tool(
    query: Annotated[str, Field(description="the query to search")]
) -> dict:
    """Look up information in the knowledge base"""
    results = knowledge_base_lookup.main(query)
    return results

The decorator handles registration with the MCP server, and the function body contains your tool’s implementation logic.
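
Following the same pattern, a customer information tool backed by DynamoDB (as described earlier in this section) might look like the sketch below. The table name, key attribute, and module layout are assumptions for illustration only; the snippet assumes it lives in the same registry module where mcp_server is defined.

# Hypothetical second tool: look up a customer profile in DynamoDB by phone number.
# Table name and key attribute are placeholders, not values from the sample repository.
import boto3
from typing import Annotated
from pydantic import Field

customer_table = boto3.resource("dynamodb").Table("CustomerProfiles")

@mcp_server.tool(
    name="customer_lookup",
    description="Retrieves a customer profile by phone number.",
)
async def customer_lookup_tool(
    phone_number: Annotated[str, Field(description="customer phone number in E.164 format")]
) -> dict:
    """Fetch the customer's profile record from DynamoDB."""
    item = customer_table.get_item(Key={"phone_number": phone_number}).get("Item")
    return item or {"error": "customer not found"}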
Expanding the knowledge base
The solution uses Amazon Bedrock Knowledge Bases to provide the AI agent with company-specific information. You can update this knowledge base with:

Frequently asked questions and their answers
Product catalogs and specifications
Company policies and procedures

Clean up
You can remove the stacks with the following command:

# move to the cdk folder, assuming you are in the project root folder
cd cdk
# Removes both stacks sequentially
npx cdk destroy --all

Conclusion
AI agents are transforming how organizations approach customer service, with solutions offering the ability to handle multiple conversations simultaneously, provide consistent service around the clock, and scale instantly while maintaining quality and reducing operational costs. This solution makes those benefits accessible by providing a deployable foundation for Amazon Nova Sonic applications on AWS. The solution demonstrates how AI agents can effectively handle customer inquiries, access real-time data, and provide personalized service—all while maintaining the natural conversational flow that customers expect.
By combining the Amazon Nova Sonic model with a robust cloud architecture, secure authentication, and flexible tool integration, organizations can quickly move from concept to proof of concept. Beyond helping you build voice AI applications, this solution helps companies drive better customer satisfaction and productivity across a range of industries.
To learn more, refer to the following resources:

Introducing Amazon Nova Sonic: Human-like voice conversations for generative AI applications
Using the Amazon Nova Sonic Speech-to-Speech model
Amazon Nova Sonic Workshop

About the authors
Reilly Manton is a Solutions Architect in AWS Telecoms Prototyping. He combines visionary thinking and technical expertise to build innovative solutions. Focusing on generative AI and machine learning, he empowers telco customers to enhance their technological capabilities.
Shuto Araki is a Software Development Engineer at AWS. He works with customers in the telecom industry, focusing on AI security and networks. Outside of work, he enjoys cycling throughout the Netherlands.
Ratan Kumar is a Principal Solutions Architect at Amazon Web Services. A trusted technology advisor with over 20 years of experience across a range of industry domains, Ratan is passionate about empowering enterprise customers to innovate and transform their businesses by unlocking the potential of the AWS Cloud.
Chad Hendren is a Principal Solutions Architect at Amazon Web Services. His passion is AI/ML and Generative AI applied to Customer Experience. He is a published author and inventor with 30 years of telecommunications experience.