Going beyond AI assistants: Examples from Amazon.com reinventing indus …

Generative AI revolutionizes business operations through various applications, including conversational assistants such as Amazon’s Rufus and Amazon Seller Assistant. Additionally, some of the most impactful generative AI applications operate autonomously behind the scenes, an essential capability that empowers enterprises to transform their operations, data processing, and content creation at scale. These non-conversational implementations, often in the form of agentic workflows powered by large language models (LLMs), execute specific business objectives across industries without direct user interaction.
Non-conversational applications offer unique advantages such as higher latency tolerance, batch processing, and caching, but their autonomous nature requires stronger guardrails and exhaustive quality assurance compared to conversational applications, which benefit from real-time user feedback and supervision.
This post examines four diverse Amazon.com examples of such generative AI applications:

Amazon.com listing creation and catalog data quality improvements – Demonstrating how LLMs are helping selling partners and Amazon.com create higher-quality listings at scale
Prescription processing in Amazon Pharmacy – Showcasing implementation in a highly regulated environment and task decomposition for agentic workflows
Review highlights – Illustrating massive scale batch processing, traditional machine learning (ML) integration, use of smaller LLMs, and cost-effective solution at scale
Amazon Ads creative image and video generation – Highlighting multimodal generative AI and responsible AI practices in creative endeavors

Each case study reveals different aspects of implementing non-conversational generative AI applications, from technical architecture to operational considerations. Throughout these examples, you will learn how the comprehensive suite of AWS services, including Amazon Bedrock and Amazon SageMaker, is key to success. Finally, we list key learnings commonly shared across these use cases.
Creating high-quality product listings on Amazon.com
Creating high-quality product listings with comprehensive details helps customers make informed purchase decisions. Traditionally, selling partners manually entered dozens of attributes per product. The new generative AI solution, launched in 2024, transforms this process by proactively acquiring product information from brand websites and other sources to improve the customer experience across numerous product categories.
Generative AI simplifies the selling partner experience by enabling information input in various formats such as URLs, product images, or spreadsheets and automatically translating this into the required structure and format. Over 900,000 selling partners have used it, with nearly 80% of generated listing drafts accepted with minimal edits. AI-generated content provides comprehensive product details that help with clarity and accuracy, which can contribute to product discoverability in customer searches.
For new listings, the workflow begins with selling partners providing initial information. The system then generates comprehensive listings using multiple information sources, including titles, descriptions, and detailed attributes. Generated listings are shared with selling partners for approval or editing.
For existing listings, the system identifies products that can be enriched with additional data.
Data integration and processing for a large variety of outputs
The Amazon team built robust connectors for internal and external sources with LLM-friendly APIs using Amazon Bedrock and other AWS services to seamlessly integrate into Amazon.com backend systems.
A key challenge is synthesizing diverse data into cohesive listings across more than 50 attributes, both textual and numerical. LLMs require specific control mechanisms and instructions to accurately interpret ecommerce concepts because they might not perform optimally with such complex, varied data. For example, LLMs might misinterpret “capacity” in a knife block as dimensions rather than number of slots, or mistake “Fit Wear” as a style description instead of a brand name. Prompt engineering and fine-tuning were extensively used to address these cases.
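To make this concrete, the following is a minimal sketch of the prompt-engineering idea: encode attribute-specific guidance into the prompt and ask the model for structured output. It assumes the Amazon Bedrock Converse API via boto3; the model ID, attribute guide, and prompt wording are illustrative assumptions, not the production system.

import boto3
import json

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

ATTRIBUTE_GUIDE = {
    "capacity": "For knife blocks, capacity means the number of knife slots, not dimensions.",
    "brand": "Treat terms like 'Fit Wear' as brand names unless context clearly says otherwise.",
}

def extract_attributes(raw_product_text: str) -> dict:
    """Ask an LLM to map unstructured product text to structured attributes."""
    prompt = (
        "Extract the following attributes as JSON with keys 'brand' and 'capacity'.\n"
        + "\n".join(f"- {k}: {v}" for k, v in ATTRIBUTE_GUIDE.items())
        + f"\n\nProduct text:\n{raw_product_text}\n\nReturn only JSON."
    )
    response = bedrock.converse(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",  # assumed model choice
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"temperature": 0.0, "maxTokens": 512},
    )
    text = response["output"]["message"]["content"][0]["text"]
    return json.loads(text)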
Generation and validation with LLMs
The generated product listings should be complete and correct. To help achieve this, the solution implements a multistep workflow that uses LLMs for both generation and validation of attributes. This dual-LLM approach helps prevent hallucinations, which is critical when dealing with safety hazards or technical specifications. The team developed advanced self-reflection techniques to make sure the generation and validation processes complement each other effectively.
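The sketch below illustrates the generate-then-validate pattern in its simplest form: one LLM role drafts the listing, a second role checks the draft against the source data, and failed drafts are revised. The call_llm callable stands in for any text-generation call (for example, Amazon Bedrock); the prompts and retry policy are assumptions, not the production workflow.

import json
from typing import Callable

def generate_listing(call_llm: Callable[[str], str], source_data: str) -> str:
    return call_llm(
        "Draft a product listing as JSON (title, description, attributes) "
        f"using only facts from this source data:\n{source_data}"
    )

def validate_listing(call_llm: Callable[[str], str], source_data: str, draft: str) -> dict:
    verdict = call_llm(
        "You are a validator. Check the draft listing against the source data and "
        "flag any attribute not supported by the source (a possible hallucination). "
        'Respond as JSON: {"valid": true or false, "issues": ["..."]}.\n'
        f"Source:\n{source_data}\n\nDraft:\n{draft}"
    )
    return json.loads(verdict)

def create_listing(call_llm: Callable[[str], str], source_data: str, max_attempts: int = 3) -> str:
    draft = generate_listing(call_llm, source_data)
    for _ in range(max_attempts):
        result = validate_listing(call_llm, source_data, draft)
        if result.get("valid"):
            return draft
        # Feed validator feedback back into the generator and try again.
        draft = call_llm(f"Revise this listing to fix these issues {result['issues']}:\n{draft}")
    return draft  # a real system would escalate to human review here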
The following figure illustrates the generation process with validation both performed by LLMs.

Figure 1. Product Listing creation workflow

Multi-layer quality assurance with human feedback
Human feedback is central to the solution’s quality assurance. The process includes Amazon.com experts for initial evaluation and selling partner input for acceptance or edits. This provides high-quality output and enables ongoing enhancement of AI models.
The quality assurance process includes automated testing methods combining ML-, algorithm-, or LLM-based evaluations. Failed listings undergo regeneration, and successful listings proceed to further testing. Using causal inference models, we identify underlying features affecting listing performance and opportunities for enrichment. Ultimately, listings that pass quality checks and receive selling partner acceptance are published, making sure customers receive accurate and comprehensive product information.
The following figure illustrates the workflow of going to production with testing, evaluation, and monitoring of product listing generation.

Figure 2. Product Listing testing and human in the loop workflow

Application-level system optimization for accuracy and cost
Given the high standards for accuracy and completeness, the team adopted a comprehensive experimentation approach with an automated optimization system. This system explores various combinations of LLMs, prompts, playbooks, workflows, and AI tools to iterate for higher business metrics, including cost. Through continuous evaluation and automated testing, the product listing generator effectively balances performance, cost, and efficiency while staying adaptable to new AI developments. This approach means customers benefit from high-quality product information, and selling partners have access to cutting-edge tools for creating listings efficiently.
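The following is a simplified sketch of that experimentation idea: score each combination of model and prompt template against an evaluation set and keep the cheapest configuration that clears a quality bar. The configuration names, cost figures, and evaluate callable are illustrative assumptions.

from itertools import product
from typing import Callable, Dict, List, Tuple

def pick_configuration(
    models: Dict[str, float],                 # model name -> relative cost per call
    prompt_templates: List[str],
    evaluate: Callable[[str, str], float],    # (model, template) -> quality in [0, 1]
    quality_floor: float = 0.9,
) -> Tuple[str, str]:
    candidates = []
    for model, template in product(models, prompt_templates):
        quality = evaluate(model, template)
        if quality >= quality_floor:
            candidates.append((models[model], model, template))
    if not candidates:
        raise ValueError("No configuration met the quality floor; relax it or add candidates.")
    _, best_model, best_template = min(candidates)  # cheapest acceptable configuration
    return best_model, best_template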
Generative AI-powered prescription processing in Amazon Pharmacy
Building upon the human-AI hybrid workflows previously discussed in the seller listing example, Amazon Pharmacy demonstrates how these principles can be applied in a Health Insurance Portability and Accountability Act (HIPAA)-regulated industry. Having shared a conversational assistant for patient care specialists in the post Learn how Amazon Pharmacy created their LLM-based chat-bot using Amazon SageMaker, we now focus on automated prescription processing, which you can read about in The life of a prescription at Amazon Pharmacy and the following research paper in Nature Magazine.
At Amazon Pharmacy, we developed an AI system built on Amazon Bedrock and SageMaker to help pharmacy technicians process medication directions more accurately and efficiently. This solution integrates human experts with LLMs in creation and validation roles to enhance precision in medication instructions for our patients.
Agentic workflow design for healthcare accuracy
The prescription processing system combines human expertise (data entry technicians and pharmacists) with AI support for direction suggestions and feedback. The workflow, shown in the following diagram, begins with a pharmacy knowledge-based preprocessor standardizing raw prescription text in Amazon DynamoDB, followed by fine-tuned small language models (SLMs) on SageMaker identifying critical components (dosage, frequency).


Figure 3. (a) Data entry technician and pharmacist workflow with two GenAI modules, (b) Suggestion module workflow and (c) Flagging module workflow

The system seamlessly integrates experts such as data entry technicians and pharmacists, with generative AI complementing the overall workflow for agility and accuracy to better serve our patients. A direction assembly system with safety guardrails then generates instructions that data entry technicians use to create their typed directions through the suggestion module. The flagging module flags or corrects errors and enforces further safety measures as feedback to the data entry technician. The technician finalizes highly accurate, safe typed directions for pharmacists, who can either provide feedback or pass the directions to the downstream service.
One highlight from the solution is the use of task decomposition, which empowers engineers and scientists to break the overall process into a multitude of steps with individual modules made of substeps. The team extensively used fine-tuned SLMs. In addition, the process employs traditional ML procedures such as named entity recognition (NER) or estimation of final confidence with regression models. Using SLMs and traditional ML in such contained, well-defined procedures significantly improved processing speed while maintaining rigorous safety standards due to incorporation of appropriate guardrails on specific steps.
The system comprises multiple well-defined substeps, with each subprocess operating as a specialized component working semi-autonomously yet collaboratively within the workflow toward the overall objective. This decomposed approach, with specific validations at each stage, proved more effective than end-to-end solutions while enabling the use of fine-tuned SLMs. The team used AWS Fargate to orchestrate the workflow given its current integration into existing backend systems.
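As a conceptual sketch of this decomposition, the snippet below strings together an NER step, a fine-tuned SLM suggestion step, and a confidence gate, each passed in as a callable. The stage names, data structure, and threshold are illustrative assumptions rather than the Amazon Pharmacy implementation.

from dataclasses import dataclass, field
from typing import Callable, Dict

@dataclass
class DirectionDraft:
    raw_text: str
    entities: Dict[str, str] = field(default_factory=dict)  # e.g., {"dose": "10 mg", "frequency": "twice daily"}
    suggestion: str = ""
    confidence: float = 0.0

def process_prescription(
    raw_text: str,
    extract_entities: Callable[[str], Dict[str, str]],          # NER model (traditional ML)
    suggest_direction: Callable[[Dict[str, str]], str],         # fine-tuned SLM
    score_confidence: Callable[[str, Dict[str, str]], float],   # regression model
    confidence_threshold: float = 0.9,
) -> DirectionDraft:
    draft = DirectionDraft(raw_text=raw_text)
    draft.entities = extract_entities(raw_text)
    draft.suggestion = suggest_direction(draft.entities)
    draft.confidence = score_confidence(draft.suggestion, draft.entities)
    if draft.confidence < confidence_threshold:
        # Low-confidence outputs are flagged for the data entry technician
        # instead of being passed through automatically.
        draft.suggestion = ""
    return draft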
In their product development journey, the team turned to Amazon Bedrock, which provided high-performing LLMs with ease-of-use features tailored to generative AI applications. SageMaker enabled further LLM selections, deeper customizability, and traditional ML methods. To learn more about this technique, see How task decomposition and smaller LLMs can make AI more affordable and read about the Amazon Pharmacy business case study.
Building a reliable application with guardrails and HITL
To comply with HIPAA standards and provide patient privacy, we implemented strict data governance practices alongside a hybrid approach that combines fine-tuned LLMs using Amazon Bedrock APIs with Retrieval Augmented Generation (RAG) using Amazon OpenSearch Service. This combination enables efficient knowledge retrieval while maintaining high accuracy for specific subtasks.
Managing LLM hallucinations—which is critical in healthcare—required more than just fine-tuning on large datasets. Our solution implements domain-specific guardrails built on Amazon Bedrock Guardrails, complemented by human-in-the-loop (HITL) oversight to promote system reliability.
The Amazon Pharmacy team continues to enhance this system through real-time pharmacist feedback and expanded prescription format capabilities. This balanced approach of innovation, domain expertise, advanced AI services, and human oversight not only improves operational efficiency, but means that the AI system properly augments healthcare professionals in delivering optimal patient care.
Generative AI-powered customer review highlights
Whereas our previous example showcased how Amazon Pharmacy integrates LLMs into real-time workflows for prescription processing, this next use case demonstrates how similar techniques—SLMs, traditional ML, and thoughtful workflow design—can be applied to offline batch inferencing at massive scale.
Amazon has introduced AI-generated customer review highlights to process over 200 million annual product reviews and ratings. This feature distills shared customer opinions into concise paragraphs highlighting positive, neutral, and negative feedback about products and their features. Shoppers can quickly grasp consensus while maintaining transparency by providing access to related customer reviews and keeping original reviews available.
The system enhances shopping decisions through an interface where customers can explore review highlights by selecting specific features (such as picture quality, remote functionality, or ease of installation for a Fire TV). Features are visually coded with green check marks for positive sentiment, orange minus signs for negative, and gray for neutral—which means shoppers can quickly identify product strengths and weaknesses based on verified purchase reviews. The following screenshot shows review highlights regarding noise level for a product.

Figure 4. Example review highlights for a product.

A recipe for cost-effective use of LLMs for offline use cases
The team developed a cost-effective hybrid architecture combining traditional ML methods with specialized SLMs. This approach assigns sentiment analysis and keyword extraction to traditional ML while using optimized SLMs for complex text generation tasks, improving both accuracy and processing efficiency. The following diagram shows traditional ML and LLMs working together in the overall workflow.

Figure 5. Use of traditional ML and LLMs in a workflow.
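As a simplified illustration of this hybrid pattern, the sketch below uses a lexicon-based sentiment step and frequency-based keyword extraction as stand-ins for the traditional ML components, and a stubbed summarize callable in place of the SLM. The lexicons, prompt, and function names are assumptions, not the production design.

from collections import Counter
from typing import Callable, List

POSITIVE = {"great", "quiet", "easy", "love"}
NEGATIVE = {"noisy", "broke", "hard", "poor"}

def sentiment(review: str) -> str:
    words = set(review.lower().split())
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

def top_features(reviews: List[str], k: int = 3) -> List[str]:
    # Stand-in for keyword/aspect extraction (e.g., an NER or topic model).
    counts = Counter(w for r in reviews for w in r.lower().split() if len(w) > 5)
    return [w for w, _ in counts.most_common(k)]

def build_highlight(reviews: List[str], summarize: Callable[[str], str]) -> str:
    labels = Counter(sentiment(r) for r in reviews)
    features = top_features(reviews)
    prompt = (
        f"Write one short paragraph summarizing {len(reviews)} reviews. "
        f"Sentiment mix: {dict(labels)}. Frequently mentioned features: {features}. "
        "Mention both positives and negatives."
    )
    return summarize(prompt)  # only this final step needs an LLM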

The feature employs SageMaker batch transform for asynchronous processing, significantly reducing costs compared to real-time endpoints. To deliver a near zero-latency experience, the solution caches extracted insights alongside existing reviews, reducing wait times and enabling simultaneous access by multiple customers without additional computation. The system processes new reviews incrementally, updating insights without reprocessing the complete dataset. For optimal performance and cost-effectiveness, the feature uses Amazon Elastic Compute Cloud (Amazon EC2) Inf2 instances for batch transform jobs, providing up to 40% better price performance than comparable alternatives.
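The following is a hedged sketch of launching such an asynchronous batch transform job with the SageMaker Python SDK. The model name, S3 URIs, payload settings, and Inf2 instance availability in your account and Region are assumptions to adapt.

from sagemaker.transformer import Transformer

transformer = Transformer(
    model_name="review-highlights-slm",          # a model already registered in SageMaker (assumed)
    instance_count=1,
    instance_type="ml.inf2.xlarge",              # AWS Inferentia2-based instance
    output_path="s3://my-bucket/review-highlights/output/",
    strategy="MultiRecord",                      # batch multiple records per request
    max_payload=6,                               # MB per request
)

transformer.transform(
    data="s3://my-bucket/review-highlights/input/",  # JSON Lines with review batches
    content_type="application/jsonlines",
    split_type="Line",
    wait=False,                                   # run asynchronously and poll for completion
)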
By following this comprehensive approach, the team effectively managed costs while handling the massive scale of reviews and products so that the solution remained both efficient and scalable.
Amazon Ads AI-powered creative image and video generation
Having explored mostly text-centric generative AI applications in the previous examples, we now turn to multimodal generative AI with Amazon Ads creative content generation for sponsored ads. The solution offers both image and video generation capabilities, which we describe in this section. Both capabilities use Amazon Nova creative content generation models at their core.
Working backward from customer need, a March 2023 Amazon survey revealed that nearly 75% of advertisers struggling with campaign success cited creative content generation as their primary challenge. Many advertisers—particularly those without in-house capabilities or agency support—face significant barriers due to the expertise and costs of producing quality visuals. The Amazon Ads solution democratizes visual content creation, making it accessible and efficient for advertisers of different sizes. The impact has been substantial: advertisers using AI-generated images in Sponsored Brands campaigns saw nearly 8% click-through rates (CTR) and submitted 88% more campaigns than non-users.
Last year, the AWS Machine Learning Blog published a post detailing the image generation solution. Since then, Amazon has adopted Amazon Nova Canvas as its foundation for creative image generation, creating professional-grade images from text or image prompts with features for text-based editing and controls for color scheme and layout adjustments.
In September 2024, the Amazon Ads team added the ability to create short-form video ads from product images. This feature uses foundation models available on Amazon Bedrock to give customers control over visual style, pacing, camera motion, rotation, and zooming through natural language, using an agentic workflow that first describes video storyboards and then generates the content for the story. The following screenshot shows an example of creative image generation for product backgrounds on Amazon Ads.

Figure 6. Ads image generation example for a product.

As discussed in the original post, responsible AI is at the center of the solution, and Amazon Nova creative models come with built-in controls to support safety and responsible AI use, including watermarking and content moderation.
The solution uses AWS Step Functions with AWS Lambda functions for serverless orchestration of both the image and video generation processes. Generated content is stored in Amazon Simple Storage Service (Amazon S3) with metadata in DynamoDB, and Amazon API Gateway provides customer access to the generation capabilities. The solution now employs Amazon Bedrock Guardrails in addition to maintaining Amazon Rekognition and Amazon Comprehend integration at various steps for additional safety checks. The following screenshot shows creative AI-generated videos in the Amazon Ads campaign builder.

Figure 7. Ads video generation for a product
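To make the serverless orchestration concrete, here is a hedged sketch of a single Lambda step that requests a product-background image from an image model on Amazon Bedrock and stores the result. The model ID, request and response schema, and resource names are assumptions to verify against the Amazon Nova Canvas documentation; they are not the production workflow.

import base64
import json
import uuid
import boto3

bedrock = boto3.client("bedrock-runtime")
s3 = boto3.client("s3")
table = boto3.resource("dynamodb").Table("ad-creatives")  # assumed table name

def handler(event, context):
    prompt = event["prompt"]  # e.g., "product on a marble kitchen counter, soft light"
    body = {
        "taskType": "TEXT_IMAGE",  # assumed request schema
        "textToImageParams": {"text": prompt},
        "imageGenerationConfig": {"numberOfImages": 1, "width": 1024, "height": 1024},
    }
    response = bedrock.invoke_model(
        modelId="amazon.nova-canvas-v1:0",  # assumed model identifier
        body=json.dumps(body),
    )
    image_b64 = json.loads(response["body"].read())["images"][0]  # assumed response shape
    key = f"generated/{uuid.uuid4()}.png"
    s3.put_object(Bucket="my-ad-creatives-bucket", Key=key, Body=base64.b64decode(image_b64))
    table.put_item(Item={"creative_id": key, "prompt": prompt, "status": "PENDING_REVIEW"})
    return {"s3_key": key}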

Creating high-quality ad creatives at scale presented complex challenges. The generative AI model needed to produce appealing, brand-appropriate images across diverse product categories and advertising contexts while remaining accessible to advertisers regardless of technical expertise. Quality assurance and improvement are fundamental to both image and video generation capabilities. The system undergoes continual enhancement through extensive HITL processes enabled by Amazon SageMaker Ground Truth. This implementation delivers a powerful tool that transforms advertisers’ creative process, making high-quality visual content creation more accessible across diverse product categories and contexts.
This is just the beginning of Amazon Ads using generative AI to empower advertising customers to create the content they need to drive their advertising objectives. The solution demonstrates how reducing creative barriers directly increases advertising activity while maintaining high standards for responsible AI use.
Key technical learnings and discussions
Non-conversational applications benefit from higher latency tolerance, enabling batch processing and caching, but require robust validation mechanisms and stronger guardrails due to their autonomous nature. These insights apply to both non-conversational and conversational AI implementations:

Task decomposition and agentic workflows – Breaking complex problems into smaller components has proven valuable across implementations. This deliberate decomposition by domain experts enables specialized models for specific subtasks, as demonstrated in Amazon Pharmacy prescription processing, where fine-tuned SLMs handle discrete tasks such as dosage identification. This strategy allows for specialized agents with clear validation steps, improving reliability and simplifying maintenance. The Amazon seller listing use case exemplifies this through its multistep workflow with separate generation and validation processes. Additionally, the review highlights use case showcased cost-effective and controlled use of LLMs by handing preprocessing and other well-defined parts of the task to traditional ML instead of an LLM.
Hybrid architectures and model selection – Combining traditional ML with LLMs provides better control and cost-effectiveness than pure LLM approaches. Traditional ML excels at well-defined tasks, as shown in the review highlights system for sentiment analysis and information extraction. Amazon teams have strategically deployed both large and small language models based on requirements, integrating RAG with fine-tuning for effective domain-specific applications like the Amazon Pharmacy implementation.
Cost optimization strategies – Amazon teams achieved efficiency through batch processing, caching mechanisms for high-volume operations, specialized instance types such as AWS Inferentia and AWS Trainium, and optimized model selection. Review highlights demonstrates how incremental processing reduces computational needs, and Amazon Ads used Amazon Nova foundation models (FMs) to cost-effectively create creative content.
Quality assurance and control mechanisms – Quality control relies on domain-specific guardrails through Amazon Bedrock Guardrails and multilayered validation combining automated testing with human evaluation. Dual-LLM approaches for generation and validation help prevent hallucinations in Amazon seller listings, and self-reflection techniques improve accuracy. Amazon Nova creative FMs provide inherent responsible AI controls, complemented by continual A/B testing and performance measurement.
HITL implementation – The HITL approach spans multiple layers, from expert evaluation by pharmacists to end-user feedback from selling partners. Amazon teams established structured improvement workflows, balancing automation and human oversight based on specific domain requirements and risk profiles.
Responsible AI and compliance – Responsible AI practices include content ingestion guardrails for regulated environments and adherence to regulations such as HIPAA. Amazon teams integrated content moderation for user-facing applications, maintained transparency in review highlights by providing access to source information, and implemented data governance with monitoring to promote quality and compliance.

These patterns enable scalable, reliable, and cost-effective generative AI solutions while maintaining quality and responsibility standards. The implementations demonstrate that effective solutions require not just sophisticated models, but careful attention to architecture, operations, and governance, supported by AWS services and established practices.
Next steps
The examples from Amazon.com shared in this post illustrate how generative AI can create value beyond traditional conversational assistants. We invite you to follow these examples or create your own solution to discover how generative AI can reinvent your business or even your industry. You can visit the AWS generative AI use cases page to start the ideation process.
These examples showed that effective generative AI implementations often benefit from combining different types of models and workflows. To learn what FMs are supported by AWS services, refer to Supported foundation models in Amazon Bedrock and Amazon SageMaker JumpStart Foundation Models. We also suggest you explore Amazon Bedrock Flows, which can ease the path towards building workflows. Additionally, we remind you that Trainium and Inferentia accelerators provide important cost savings in these applications.
Agentic workflows, as illustrated in our examples, have proven particularly valuable. We recommend exploring Amazon Bedrock Agents for quickly building agentic workflows.
Successful generative AI implementation extends beyond model selection—it represents a comprehensive software development process from experimentation to application monitoring. To begin building your foundation across these essential services, we invite you to explore Amazon QuickStart.
Conclusion
These examples demonstrate how generative AI extends beyond conversational assistants to drive innovation and efficiency across industries. Success comes from combining AWS services with strong engineering practices and business understanding. Ultimately, effective generative AI solutions focus on solving real business problems while maintaining high standards of quality and responsibility.
To learn more about how Amazon uses AI, refer to Artificial Intelligence in Amazon News.

About the Authors
Burak Gozluklu is a Principal AI/ML Specialist Solutions Architect and lead GenAI Scientist Architect for Amazon.com on AWS, based in Boston, MA. He helps strategic customers adopt AWS technologies and specifically Generative AI solutions to achieve their business objectives. Burak has a PhD in Aerospace Engineering from METU, an MS in Systems Engineering, and a post-doc in system dynamics from MIT in Cambridge, MA. He maintains his connection to academia as a research affiliate at MIT. Outside of work, Burak is an enthusiast of yoga.
Emilio Maldonado is a Senior leader at Amazon responsible for Product Knowledge, oriented at building systems to scale the e-commerce Catalog metadata, organize all product attributes, and leverage GenAI to infer precise information that guides Sellers and Shoppers to interact with products. He’s passionate about developing dynamic teams and forming partnerships. He holds a Bachelor of Science in C.S. from Tecnologico de Monterrey (ITESM) and an MBA from Wharton, University of Pennsylvania.
Wenchao Tong is a Sr. Principal Technologist at Amazon Ads in Palo Alto, CA, where he spearheads the development of GenAI applications for creative building and performance optimization. His work empowers customers to enhance product and brand awareness and drive sales by leveraging innovative AI technologies to improve creative performance and quality. Wenchao holds a Master’s degree in Computer Science from Tongji University. Outside of work, he enjoys hiking, board games, and spending time with his family.
Alexandre Alves is a Sr. Principal Engineer at Amazon Health Services, specializing in ML, optimization, and distributed systems. He helps deliver wellness-forward health experiences.
Puneet Sahni is a Sr. Principal Engineer at Amazon. He works on improving the data quality of all products available in the Amazon catalog. He is passionate about leveraging product data to improve our customer experiences. He has a Master’s degree in Electrical Engineering from the Indian Institute of Technology (IIT) Bombay. Outside of work, he enjoys spending time with his young kids and travelling.
Vaughn Schermerhorn is a Director at Amazon, where he leads Shopping Discovery and Evaluation—spanning Customer Reviews, content moderation, and site navigation across Amazon’s global marketplaces. He manages a multidisciplinary organization of applied scientists, engineers, and product leaders focused on surfacing trustworthy customer insights through scalable ML models, multimodal information retrieval, and real-time system architecture. His team develops and operates large-scale distributed systems that power billions of shopping decisions daily. Vaughn holds degrees from Georgetown University and San Diego State University and has lived and worked in the U.S., Germany, and Argentina. Outside of work, he enjoys reading, travel, and time with his family.
Tarik Arici is a Principal Applied Scientist at Amazon Selection and Catalog Systems (ASCS), working on Catalog Quality Enhancement using GenAI workflows. He has a PhD in Electrical and Computer Engineering from Georgia Tech. Outside of work, Tarik enjoys swimming and biking.

Apple and Duke Present a Reinforcement Learning Approach That Enables …

Long CoT reasoning improves large language models’ performance on complex tasks but comes with drawbacks. The typical “think-then-answer” method slows down response times, disrupting real-time interactions like those in chatbots. It also risks inaccuracies, as errors in earlier reasoning steps can lead to a misleading final answer. Unlike humans, who often share partial thoughts or conclusions during conversations, LLMs delay responses until all reasoning is complete. While RL is commonly used to train reasoning models, it mainly rewards final answers, overlooking useful intermediate insights. There is growing interest in teaching models that alternate between thinking and answering, but this remains a challenge. 

RL has become a popular method to enhance reasoning in LLMs, building on its success in aligning models with human preferences. Two common reward types guide RL: outcome-based rewards (ORM), which focus on the final answer, and process-based rewards (PRM), which provide feedback on intermediate reasoning steps. While PRMs offer more detailed supervision, they often rely on human annotation and additional models, making them complex and prone to issues like reward hacking. Separately, efforts to improve LLM reasoning have explored prompting strategies, structured reasoning, tool integration, and methods to reduce latency and improve efficiency. 

Researchers from Apple and Duke University introduce Interleaved Reasoning, a new RL approach that enables language models to alternate between thinking and answering when solving complex, multi-step questions. Instead of waiting until the end to respond, models provide informative intermediate answers, which improves feedback for users and guides their reasoning. Using a straightforward rule-based reward, the model is trained to produce helpful reasoning steps, leading to over 80% faster responses and up to 19.3% better accuracy. Trained only on QA and logic datasets, the method demonstrates strong generalization to more challenging benchmarks, such as MATH, GPQA, and MMLU. 

The study proposes a reinforcement learning framework to train LLMs for Interleaved Reasoning, where models alternate between internal thinking and user-facing intermediate answers. Each intermediate step, or “sub-answer,” is shared once the model reaches a meaningful milestone in reasoning. A specialized training template with <think> and <answer> tags is used. The approach utilizes rule-based rewards—specifically, format, final accuracy, and conditional intermediate accuracy—to guide learning. Notably, intermediate rewards are applied only when specific criteria are met, ensuring the model prioritizes overall correctness. They also test different reward schemes, such as all-or-none, partial credit, and time-discounted rewards, to optimize the quality of reasoning. 
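Below is a minimal sketch of the kind of rule-based reward described here, combining a format check, final-answer accuracy, and conditional credit for intermediate answers. The weights and the gating condition are illustrative assumptions, not the paper's exact recipe.

import re
from typing import List

def format_ok(response: str) -> bool:
    # Require at least one <think>...</think> and one <answer>...</answer> block.
    return bool(re.search(r"<think>.*?</think>", response, re.DOTALL)) and \
           bool(re.search(r"<answer>.*?</answer>", response, re.DOTALL))

def reward(response: str, intermediate_refs: List[str], final_ref: str) -> float:
    if not format_ok(response):
        return -1.0  # malformed outputs get no further credit
    answers = re.findall(r"<answer>(.*?)</answer>", response, re.DOTALL)
    final_correct = bool(answers) and answers[-1].strip() == final_ref.strip()
    r = 1.0 if final_correct else 0.0
    # Conditional intermediate reward: only granted when the final answer is correct,
    # so the model cannot farm partial credit at the expense of overall correctness.
    if final_correct and intermediate_refs:
        matched = sum(
            any(ref.strip() in a for a in answers[:-1]) for ref in intermediate_refs
        )
        r += 0.5 * matched / len(intermediate_refs)
    return r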

The interleaved reasoning approach was evaluated on both familiar and unfamiliar datasets using Qwen2.5 models (1.5B and 7B). Unlike traditional methods that separate thinking and answering, the interleaved method provides answers incrementally, improving both speed and usefulness. When combined with intermediate rewards, it significantly enhances model performance while reducing response delays by over 80%. Even without exposure to new domains during training, the model adapts well, showing strong generalization. These results highlight the value of interleaved reasoning in making AI systems more responsive and effective in real-world, multi-step reasoning tasks. 

In conclusion, the study explores how interleaved reasoning—where models alternate between reasoning and generating intermediate answers—can significantly improve performance and responsiveness. Using the Qwen2.5-1.5B model, the authors show that providing timely intermediate feedback during training boosts accuracy and accelerates response generation. Different RL strategies were tested, with PPO showing stable results, and conditional, time-discounted rewards proving to be the most effective. The method scales well to complex tasks and outperforms traditional think-then-answer baselines. Unlike token-level reward models, this approach employs simple rule-based rewards after completing full reasoning steps, thereby avoiding reward hacking. Ultimately, interleaved reasoning enhances reasoning quality and efficiency without relying on external tools. 

Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter.
The post Apple and Duke Present a Reinforcement Learning Approach That Enables LLMs to Provide Intermediate Answers, Enhancing Speed and Accuracy appeared first on MarkTechPost.

DeepSeek Releases R1-0528: An Open-Source Reasoning AI Model Deliverin …

DeepSeek, the Chinese AI Unicorn, has released an updated version of its R1 reasoning model, named DeepSeek-R1-0528. This release enhances the model’s capabilities in mathematics, programming, and general logical reasoning, positioning it as a formidable open-source alternative to leading models like OpenAI’s o3 and Google’s Gemini 2.5 Pro.

Technical Enhancements

The R1-0528 update introduces significant improvements in reasoning depth and inference accuracy. Notably, the model’s performance on the AIME 2025 math benchmark has increased from 70% to 87.5%, reflecting a more profound reasoning process that averages 23,000 tokens per question, up from 12,000 in the previous version. This enhancement is attributed to increased computational resources and algorithmic optimizations applied during post-training.

In addition to mathematical reasoning, the model has shown improved performance in code generation tasks. According to LiveCodeBench benchmarks, R1-0528 ranks just below OpenAI’s o4 mini and o3 models, outperforming xAI’s Grok 3 mini and Alibaba’s Qwen 3 in code generation tasks.

Open-Source Model Weights

DeepSeek continues its commitment to open-source and open-weights AI by releasing R1-0528 under the MIT license, allowing developers to modify and deploy the model freely. The model’s weights are available on Hugging Face, and detailed documentation is provided for local deployment and API integration. This approach contrasts with the proprietary nature of many leading AI models, promoting transparency and accessibility in AI development.

Distilled Model for Lightweight Deployment

Recognizing the need for more accessible AI solutions, DeepSeek has also released a distilled version of R1-0528, named DeepSeek-R1-0528-Qwen3-8B. This model, fine-tuned from Alibaba’s Qwen3-8B using text generated by R1-0528, achieves state-of-the-art performance among open-source models on the AIME 2024 benchmark. It is designed to run efficiently on a single GPU, making advanced AI capabilities more accessible to developers with limited computational resources.
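For readers who want to try the distilled checkpoint locally, here is a hedged sketch using Hugging Face transformers. The repository ID follows the release naming and should be verified on Hugging Face; memory requirements depend on your GPU and the chosen dtype.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-0528-Qwen3-8B"  # verify the exact repo name on Hugging Face
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # fits a single high-memory GPU more easily
    device_map="auto",
)

prompt = "Prove that the sum of two even integers is even."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.6)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))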

Censorship Considerations

While DeepSeek’s advancements in AI are noteworthy, the R1-0528 model has been observed to exhibit stricter content moderation compared to its predecessors. Independent testing revealed that the model avoids or provides limited responses to politically sensitive topics, such as the Tiananmen Square protests and the status of Taiwan, aligning with Chinese regulations that mandate AI models to adhere to content restrictions.

Tweet from xlr8harder (@xlr8harder), May 29, 2025: “Here are the reasoning traces on the internment camps question–again mentioning Xianjiang, and reasoning quite clearly about why it’s not complying.” pic.twitter.com/ooEwmF23TY

Global Implications

The release of R1-0528 underscores China’s growing influence in the AI sector, challenging the dominance of U.S.-based companies. DeepSeek’s ability to develop competitive AI models at a fraction of the cost of their Western counterparts has prompted responses from companies like OpenAI, which has expressed concerns about the potential for these models to be manipulated by the Chinese government. This development highlights the shifting dynamics in global AI development and the increasing importance of open-source models in fostering innovation and competition.

Conclusion

DeepSeek’s R1-0528 model represents a significant advancement in open-source AI, offering enhanced reasoning capabilities and accessibility for developers. By providing both a full-scale model and a distilled version suitable for single-GPU deployment, DeepSeek is making strides in democratizing AI technology. However, the model’s adherence to content moderation policies reflects the complex interplay between technological advancement and regulatory compliance. As the AI landscape continues to evolve, DeepSeek’s developments will likely play a pivotal role in shaping the future of open-source AI.

Check out the Open-Source Weights and Try it now. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter.
The post DeepSeek Releases R1-0528: An Open-Source Reasoning AI Model Delivering Enhanced Math and Code Performance with Single-GPU Efficiency appeared first on MarkTechPost.

A Coding Guide for Building a Self-Improving AI Agent Using Google’s …

In this tutorial, we will explore how to create a sophisticated Self-Improving AI Agent using Google’s cutting-edge Gemini API. This self-improving agent demonstrates autonomous problem-solving, dynamically evaluates performance, learns from successes and failures, and iteratively enhances its capabilities through reflective analysis and self-modification. The tutorial walks through structured code implementation, detailing mechanisms for memory management, capability tracking, iterative task analysis, solution generation, and performance evaluation, all integrated within a powerful self-learning feedback loop.

import google.generativeai as genai
import json
import time
import re
from typing import Dict, List, Any
from datetime import datetime
import traceback

We set up the foundational components to build an AI-powered self-improving agent utilizing Google’s Generative AI API. Libraries such as json, time, re, and datetime facilitate structured data management, performance tracking, and text processing, while type hints (Dict, List, Any) help ensure robust and maintainable code.

class SelfImprovingAgent:
    def __init__(self, api_key: str):
        """Initialize the self-improving agent with Gemini API"""
        genai.configure(api_key=api_key)
        self.model = genai.GenerativeModel('gemini-1.5-flash')

        self.memory = {
            'successful_strategies': [],
            'failed_attempts': [],
            'learned_patterns': [],
            'performance_metrics': [],
            'code_improvements': []
        }

        self.capabilities = {
            'problem_solving': 0.5,
            'code_generation': 0.5,
            'learning_efficiency': 0.5,
            'error_handling': 0.5
        }

        self.iteration_count = 0
        self.improvement_history = []

    def analyze_task(self, task: str) -> Dict[str, Any]:
        """Analyze a given task and determine approach"""
        analysis_prompt = f"""
        Analyze this task and provide a structured approach:
        Task: {task}

        Please provide:
        1. Task complexity (1-10)
        2. Required skills
        3. Potential challenges
        4. Recommended approach
        5. Success criteria

        Format as JSON.
        """

        try:
            response = self.model.generate_content(analysis_prompt)
            json_match = re.search(r'{.*}', response.text, re.DOTALL)
            if json_match:
                return json.loads(json_match.group())
            else:
                return {
                    "complexity": 5,
                    "skills": ["general problem solving"],
                    "challenges": ["undefined requirements"],
                    "approach": "iterative improvement",
                    "success_criteria": ["task completion"]
                }
        except Exception as e:
            print(f"Task analysis error: {e}")
            return {"complexity": 5, "skills": [], "challenges": [], "approach": "basic", "success_criteria": []}

    def solve_problem(self, problem: str) -> Dict[str, Any]:
        """Attempt to solve a problem using current capabilities"""
        self.iteration_count += 1
        print(f"\n=== Iteration {self.iteration_count} ===")
        print(f"Problem: {problem}")

        task_analysis = self.analyze_task(problem)
        print(f"Task Analysis: {task_analysis}")

        solution_prompt = f"""
        Based on my previous learning and capabilities, solve this problem:
        Problem: {problem}

        My current capabilities: {self.capabilities}
        Previous successful strategies: {self.memory['successful_strategies'][-3:]}  # Last 3
        Known patterns: {self.memory['learned_patterns'][-3:]}  # Last 3

        Provide a detailed solution with:
        1. Step-by-step approach
        2. Code implementation (if applicable)
        3. Expected outcome
        4. Potential improvements
        """

        try:
            start_time = time.time()
            response = self.model.generate_content(solution_prompt)
            solve_time = time.time() - start_time

            solution = {
                'problem': problem,
                'solution': response.text,
                'solve_time': solve_time,
                'iteration': self.iteration_count,
                'task_analysis': task_analysis
            }

            quality_score = self.evaluate_solution(solution)
            solution['quality_score'] = quality_score

            self.memory['performance_metrics'].append({
                'iteration': self.iteration_count,
                'quality': quality_score,
                'time': solve_time,
                'complexity': task_analysis.get('complexity', 5)
            })

            if quality_score > 0.7:
                self.memory['successful_strategies'].append(solution)
                print(f"Solution Quality: {quality_score:.2f} (Success)")
            else:
                self.memory['failed_attempts'].append(solution)
                print(f"Solution Quality: {quality_score:.2f} (Needs Improvement)")

            return solution

        except Exception as e:
            print(f"Problem solving error: {e}")
            error_solution = {
                'problem': problem,
                'solution': f"Error occurred: {str(e)}",
                'solve_time': 0,
                'iteration': self.iteration_count,
                'quality_score': 0.0,
                'error': str(e)
            }
            self.memory['failed_attempts'].append(error_solution)
            return error_solution

    def evaluate_solution(self, solution: Dict[str, Any]) -> float:
        """Evaluate the quality of a solution"""
        evaluation_prompt = f"""
        Evaluate this solution on a scale of 0.0 to 1.0:

        Problem: {solution['problem']}
        Solution: {solution['solution'][:500]}...  # Truncated for evaluation

        Rate based on:
        1. Completeness (addresses all aspects)
        2. Correctness (logically sound)
        3. Clarity (well explained)
        4. Practicality (implementable)
        5. Innovation (creative approach)

        Respond with just a decimal number between 0.0 and 1.0.
        """

        try:
            response = self.model.generate_content(evaluation_prompt)
            score_match = re.search(r'(\d+\.?\d*)', response.text)
            if score_match:
                score = float(score_match.group(1))
                return min(max(score, 0.0), 1.0)
            return 0.5
        except Exception:
            return 0.5

    def learn_from_experience(self):
        """Analyze past performance and improve capabilities"""
        print("\nLearning from experience...")

        if len(self.memory['performance_metrics']) < 2:
            return

        learning_prompt = f"""
        Analyze my performance and suggest improvements:

        Recent Performance Metrics: {self.memory['performance_metrics'][-5:]}
        Successful Strategies: {len(self.memory['successful_strategies'])}
        Failed Attempts: {len(self.memory['failed_attempts'])}

        Current Capabilities: {self.capabilities}

        Provide:
        1. Performance trends analysis
        2. Identified weaknesses
        3. Specific improvement suggestions
        4. New capability scores (0.0-1.0 for each capability)
        5. New patterns learned

        Format as JSON with keys: analysis, weaknesses, improvements, new_capabilities, patterns
        """

        try:
            response = self.model.generate_content(learning_prompt)

            json_match = re.search(r'{.*}', response.text, re.DOTALL)
            if json_match:
                learning_results = json.loads(json_match.group())

                # Snapshot capabilities before any updates so the history entry is always valid.
                old_capabilities = self.capabilities.copy()

                if 'new_capabilities' in learning_results:
                    for capability, score in learning_results['new_capabilities'].items():
                        if capability in self.capabilities:
                            self.capabilities[capability] = min(max(float(score), 0.0), 1.0)

                    print("Capability Updates:")
                    for cap, (old, new) in zip(self.capabilities.keys(),
                                               zip(old_capabilities.values(), self.capabilities.values())):
                        change = new - old
                        print(f"  {cap}: {old:.2f} → {new:.2f} ({change:+.2f})")

                if 'patterns' in learning_results:
                    self.memory['learned_patterns'].extend(learning_results['patterns'])

                self.improvement_history.append({
                    'iteration': self.iteration_count,
                    'timestamp': datetime.now().isoformat(),
                    'learning_results': learning_results,
                    'capabilities_before': old_capabilities,
                    'capabilities_after': self.capabilities.copy()
                })

                print(f"Learned {len(learning_results.get('patterns', []))} new patterns")

        except Exception as e:
            print(f"Learning error: {e}")

    def generate_improved_code(self, current_code: str, improvement_goal: str) -> str:
        """Generate improved version of code"""
        improvement_prompt = f"""
        Improve this code based on the goal:

        Current Code:
        {current_code}

        Improvement Goal: {improvement_goal}
        My current capabilities: {self.capabilities}
        Learned patterns: {self.memory['learned_patterns'][-3:]}

        Provide improved code with:
        1. Enhanced functionality
        2. Better error handling
        3. Improved efficiency
        4. Clear comments explaining improvements
        """

        try:
            response = self.model.generate_content(improvement_prompt)

            improved_code = {
                'original': current_code,
                'improved': response.text,
                'goal': improvement_goal,
                'iteration': self.iteration_count
            }

            self.memory['code_improvements'].append(improved_code)
            return response.text

        except Exception as e:
            print(f"Code improvement error: {e}")
            return current_code

    def self_modify(self):
        """Attempt to improve the agent's own code"""
        print("\nAttempting self-modification...")

        current_method = """
        def solve_problem(self, problem: str) -> Dict[str, Any]:
            # Current implementation
            pass
        """

        improved_method = self.generate_improved_code(
            current_method,
            "Make problem solving more efficient and accurate"
        )

        print("Generated improved method structure")
        print("Note: Actual self-modification requires careful implementation in production")

    def run_improvement_cycle(self, problems: List[str], cycles: int = 3):
        """Run a complete improvement cycle"""
        print(f"Starting {cycles} improvement cycles with {len(problems)} problems")

        for cycle in range(cycles):
            print(f"\n{'='*50}")
            print(f"IMPROVEMENT CYCLE {cycle + 1}/{cycles}")
            print(f"{'='*50}")

            cycle_results = []
            for problem in problems:
                result = self.solve_problem(problem)
                cycle_results.append(result)
                time.sleep(1)

            self.learn_from_experience()

            if cycle < cycles - 1:
                self.self_modify()

            avg_quality = sum(r.get('quality_score', 0) for r in cycle_results) / len(cycle_results)
            print(f"\nCycle {cycle + 1} Summary:")
            print(f"  Average Solution Quality: {avg_quality:.2f}")
            print(f"  Current Capabilities: {self.capabilities}")
            print(f"  Total Patterns Learned: {len(self.memory['learned_patterns'])}")

            time.sleep(2)

    def get_performance_report(self) -> str:
        """Generate a comprehensive performance report"""
        if not self.memory['performance_metrics']:
            return "No performance data available yet."

        metrics = self.memory['performance_metrics']
        avg_quality = sum(m['quality'] for m in metrics) / len(metrics)
        avg_time = sum(m['time'] for m in metrics) / len(metrics)

        report = f"""
        AGENT PERFORMANCE REPORT
        {'='*40}

        Total Iterations: {self.iteration_count}
        Average Solution Quality: {avg_quality:.3f}
        Average Solve Time: {avg_time:.2f}s

        Successful Solutions: {len(self.memory['successful_strategies'])}
        Failed Attempts: {len(self.memory['failed_attempts'])}
        Success Rate: {len(self.memory['successful_strategies']) / max(1, self.iteration_count) * 100:.1f}%

        Current Capabilities:
        {json.dumps(self.capabilities, indent=2)}

        Patterns Learned: {len(self.memory['learned_patterns'])}
        Code Improvements: {len(self.memory['code_improvements'])}
        """

        return report

The SelfImprovingAgent class defined above implements a robust framework that leverages Google’s Gemini API for autonomous task-solving, self-assessment, and adaptive learning. It incorporates structured memory systems, capability tracking, iterative problem-solving with continuous improvement cycles, and even attempts controlled self-modification. This implementation allows the agent to progressively enhance its accuracy, efficiency, and problem-solving sophistication over time, creating a dynamic AI that can autonomously evolve and adapt.

def main():
    """Main function to demonstrate the self-improving agent"""

    API_KEY = "Use Your GEMINI KEY Here"

    if API_KEY == "Use Your GEMINI KEY Here":
        print("Please set your Gemini API key in the API_KEY variable")
        print("Get your API key from: https://makersuite.google.com/app/apikey")
        return

    agent = SelfImprovingAgent(API_KEY)

    test_problems = [
        "Write a function to calculate the factorial of a number",
        "Create a simple text-based calculator that handles basic operations",
        "Design a system to find the shortest path between two points in a graph",
        "Implement a basic recommendation system for movies based on user preferences",
        "Create a machine learning model to predict house prices based on features"
    ]

    print("Self-Improving Agent Demo")
    print("This agent will attempt to solve problems and improve over time")

    agent.run_improvement_cycle(test_problems, cycles=3)

    print("\n" + agent.get_performance_report())

    print("\n" + "="*50)
    print("TESTING IMPROVED AGENT")
    print("="*50)

    final_problem = "Create an efficient algorithm to sort a large dataset"
    final_result = agent.solve_problem(final_problem)

    print(f"\nFinal Problem Solution Quality: {final_result.get('quality_score', 0):.2f}")

The main() function serves as the entry point for demonstrating the SelfImprovingAgent class. It initializes the agent with the user’s Gemini API key and defines practical programming and system design tasks. The agent then iteratively tackles these tasks, analyzing its performance to refine its problem-solving abilities over multiple improvement cycles. Finally, it tests the agent’s enhanced capabilities with a new complex task, showcasing measurable progress and providing a detailed performance report.

def setup_instructions():
    """Print setup instructions for Google Colab"""
    instructions = """
    SETUP INSTRUCTIONS FOR GOOGLE COLAB:

    1. Install the Gemini API client:
       !pip install google-generativeai

    2. Get your Gemini API key:
       - Go to https://makersuite.google.com/app/apikey
       - Create a new API key
       - Copy the key

    3. Replace 'Use Your GEMINI KEY Here' with your actual API key

    4. Run the code!

    CUSTOMIZATION OPTIONS:
    - Modify the test_problems list to add your own challenges
    - Adjust the improvement cycles count
    - Add new capabilities to track
    - Extend the learning mechanisms

    IMPROVEMENT IDEAS:
    - Add persistent memory (save/load agent state)
    - Implement more sophisticated evaluation metrics
    - Add domain-specific problem types
    - Create visualization of improvement over time
    """
    print(instructions)


if __name__ == "__main__":
    setup_instructions()
    print("\n" + "="*60)
    main()

Finally, we define the setup_instructions() function, which guides users through preparing their Google Colab environment to run the self-improving agent. It explains step-by-step how to install dependencies, set up and configure the Gemini API key, and highlight various options for customizing and enhancing the agent’s functionality. This approach simplifies user onboarding, facilitating easy experimentation and extending the agent’s capabilities further.

In conclusion, the implementation demonstrated in this tutorial offers a comprehensive framework for creating AI agents that perform tasks and actively enhance their capabilities over time. By harnessing the Gemini API’s advanced generative power and integrating a structured self-improvement loop, developers can build agents capable of sophisticated reasoning, iterative learning, and self-modification.

Check out the Notebook on GitHub. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter.
The post A Coding Guide for Building a Self-Improving AI Agent Using Google’s Gemini API with Intelligent Adaptation Features appeared first on MarkTechPost.

Revolutionizing earth observation with geospatial foundation models on …

Emerging transformer-based vision models for geospatial data—also called geospatial foundation models (GeoFMs)—offer a new and powerful technology for mapping the earth’s surface at a continental scale, providing stakeholders with the tooling to detect and monitor surface-level ecosystem conditions such as forest degradation, natural disaster impact, crop yield, and many others.
GeoFMs represent an emerging research field and are a type of pre-trained vision transformer (ViT) specifically adapted to geospatial data sources. GeoFMs offer immediate value without training: they excel as embedding models for geospatial similarity search and ecosystem change detection. With minimal labeled data, GeoFMs can be fine-tuned for custom tasks such as land surface classification, semantic segmentation, or pixel-level regression. Many leading models are available under very permissive licenses, making them accessible to a wide audience. Examples include SatVision-Base, Prithvi-100M, SatMAE, and Clay (used in this solution).
In this post, we explore how Clay Foundation’s Clay foundation model, available on Hugging Face, can be deployed for large-scale inference and fine-tuning on Amazon SageMaker. For illustrative purposes, we focus on a deforestation use case from the Amazon rainforest, one of the most biodiverse ecosystems in the world. Given the strong evidence that the Amazon forest system could soon be reaching a tipping point, it presents an important domain of study and a high-impact application area for GeoFMs, for example, through early detection of forest degradation. However, the solution presented here generalizes to a wide range of geospatial use cases. It also comes with ready-to-deploy code samples to help you get started quickly with deploying GeoFMs in your own applications on AWS.
Let’s dive in!
Solution overview
At the core of our solution is a GeoFM. Architecturally, GeoFMs build on the ViT architecture first introduced in the seminal research paper An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale. To account for the specific properties of geospatial data (multiple channels ranging from ultraviolet to infrared, varying electromagnetic spectrum coverage, and the spatio-temporal nature of the data), GeoFMs incorporate several architectural innovations such as variable input size (to capture multiple channels) or the addition of positional embeddings that capture spatio-temporal aspects such as seasonality and location on earth. The pre-training of these models is conducted on unlabeled geospatial data sampled from across the globe using masked autoencoders (MAE) as self-supervised learners. Sampling from global-scale data helps ensure that diverse ecosystems and surface types are represented appropriately in the training set. The result is a set of general-purpose models that can be used for three core use cases:

Geospatial similarity search: Quickly map diverse surface types with semantic geospatial search using the embeddings to find similar items (such as deforested areas).
Embedding-based change detection: Analyze a time series of geospatial embeddings to identify surface disruptions over time for a specific region.
Custom geospatial machine learning: Fine-tune a specialized regression, classification, or segmentation model for geospatial machine learning (ML) tasks. While this requires a certain amount of labeled data, overall data requirements are typically much lower compared to training a dedicated model from the ground up.

The general solution flow is shown in the following diagram. Note that this flow diagram is highly abstracted and omits certain architectural details for reasons of clarity. For a full architecture diagram demonstrating how the flow can be implemented on AWS, see the accompanying GitHub repository. This repository also contains detailed deployment instructions to get you started quickly with applying GeoFMs to your own use cases.

Retrieve and process satellite imagery for GeoFM inference or training: The first step is to get the raw geospatial data into a format that’s consumable by the GeoFM. This entails breaking down the large raw satellite imagery into equally sized 256×256 pixel chips (the size the model expects) and normalizing pixel values, among other data preparation steps required by the GeoFM that you choose. This routine can be conducted at scale using an Amazon SageMaker AI processing job.
Retrieve model weights and deploy the GeoFM: Next, retrieve the open weights of the GeoFM from a model registry of your choice (Hugging Face in this example) and deploy the model for inference. The best deployment option ultimately depends on how the model is consumed. If you need to generate embeddings asynchronously, use a SageMaker AI processing or transform step. For real-time inference, consider deploying to a SageMaker AI real-time endpoint, which can be configured to auto-scale with demand, allowing for large-scale inference. In this example, we use a SageMaker AI processing job with a custom Docker image for generating embeddings in batch.
Generate geospatial embeddings: The GeoFM is an encoder-only model, meaning that it outputs an embedding vector. During inference, you perform a forward pass of the pre-processed satellite image chip through the GeoFM. This produces the corresponding embedding vector, which can be thought of as a compressed representation of the information contained in the image. This process is equivalent to using text embedding models for RAG use cases or similar.

The generated geospatial embeddings can be used largely as-is for two key use cases: geospatial similarity search and ecosystem change detection.

Run similarity search on the embeddings to identify semantically similar images: The GeoFM embeddings reside in the same vector space. This allows us to find similar items by locating vectors that are close to a given query point. A common high-performance search algorithm for this is approximate nearest neighbor (ANN). For scalability and search performance, we index the embedding vectors in a vector database.
Analyze the time series of embeddings for break points that indicate change: Instead of looking for similarity between embedding vectors, you can also look at their distance. Doing this for a specific region and across time lets you pinpoint specific times where change occurs. This allows you to use embeddings for surface change detection over time, a very common use case in geospatial analytics. Both ideas are illustrated in the sketch following this list.
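
The following minimal sketch illustrates both ideas on precomputed embeddings using plain NumPy; the array shapes, the random stand-in data, and the 0.3 distance threshold are illustrative assumptions, not part of the actual pipeline code.

# Minimal sketch: similarity search and change detection on precomputed embeddings.
import numpy as np

def cosine_similarity(matrix, vector):
    """Cosine similarity between each row of a matrix and a single vector."""
    return (matrix @ vector) / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(vector) + 1e-12)

# Similarity search: rank all chips by similarity to a reference chip
embeddings = np.random.rand(1000, 768)   # stand-in for GeoFM chip embeddings
query = embeddings[42]                   # embedding of the reference chip
scores = cosine_similarity(embeddings, query)
top_k = np.argsort(scores)[::-1][:10]    # indices of the 10 most similar chips

# Change detection: cosine distance of one chip's embedding over time vs. a baseline
timeline = np.random.rand(50, 768)       # same chip, 50 acquisition dates
baseline = timeline[:10].mean(axis=0)    # average of an assumed stable period
distances = 1 - cosine_similarity(timeline, baseline)
change_points = np.where(distances > 0.3)[0]  # dates exceeding an illustrative threshold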

Optionally, you can also fine-tune a model on top of the GeoFM.

Train a custom head and run inference: To fine-tune a model, you add a custom (and typically lightweight) head on top of the GeoFM and fine-tune it on an often small labeled dataset. The GeoFM weights remain frozen and are not retrained. The custom head takes the GeoFM-generated embedding vectors as input and produces classification masks, pixel-level regression results, or simply a class per image, depending on the use case.
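
As a rough illustration, the following PyTorch sketch trains a small classification head on frozen GeoFM embeddings; the layer sizes, class count, and random stand-in data are assumptions for illustration, not taken from the actual implementation.

# Minimal sketch of a task-specific head trained on frozen GeoFM embeddings.
# The 768-d input matches the Clay v1 embedding size; the number of classes is illustrative.
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    def __init__(self, embed_dim: int = 768, num_classes: int = 5):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, 256),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(256, num_classes),
        )

    def forward(self, x):
        return self.mlp(x)

head = ClassificationHead()
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# In practice, embeddings come from the frozen GeoFM; random tensors stand in here
embeddings = torch.randn(32, 768)          # (batch, embedding_dim)
labels = torch.randint(0, 5, (32,))        # (batch,)
logits = head(embeddings)                  # only the head's weights receive gradients
loss = loss_fn(logits, labels)
loss.backward()
optimizer.step()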

We explore the key steps of this workflow in the next sections. For additional details on the implementation, including how to build a high-quality user interface with Solara, see the accompanying GitHub repository.
Geospatial data processing and embedding generation
Our comprehensive, four-stage data processing pipeline transforms raw satellite imagery into analysis-ready vector embeddings that power advanced geospatial analytics. This orchestrated workflow uses Amazon SageMaker AI Pipelines to create a robust, reproducible, and scalable processing architecture. The end-to-end solution can process Earth observation data for a selected region of interest, with built-in flexibility to adapt to different use cases. In this example, we use Sentinel-2 imagery from the Amazon Registry of Open Data for monitoring deforestation in the Brazilian rainforest. However, our pipeline architecture is designed to work seamlessly with other satellite image providers and resolutions (such as NAIP at 1 m/pixel resolution, or Maxar and Planet Labs imagery at sub-meter resolution).
Pipeline architecture overview

The SageMaker pipeline consists of four processing steps, shown in the preceding figure. Each step builds on the outputs of the previous step, with intermediate results stored in Amazon Simple Storage Service (Amazon S3).

Pre-process satellite tiles: Divides the satellite imagery into chips. We chose a chip size of 256×256 pixels, as expected by Clay v1. For Sentinel-2 images this corresponds to an area of 2.56 × 2.56 km.
Generate embeddings: Creates 768-dimensional vector representations for the chips using the Clay v1 model.
Process embeddings: Performs dimensionality reduction and computes similarity metrics (for downstream analyses).
Consolidate and index: Consolidates outputs and loads the embedding vectors into a vector store.

# Pipeline definition (simplified)
from sagemaker.workflow.pipeline import Pipeline

pipeline = Pipeline(
    name=f"EmbeddingPipeline-{aoi_name}",
    steps=[preprocess_step, embeddings_step, postprocessing_step, consolidation_step],
    parameters=[bucket_name_param, aoi_name_param, s2_grid_id_param,
                chip_size_param, batch_size_param, s2_bands_param],
    sagemaker_session=pipeline_session
)

Step 1: Satellite data acquisition and chipping
The pipeline starts by accessing Sentinel-2 multispectral satellite imagery through the AWS Open Data program from S3 buckets. This imagery provides 10-meter resolution across multiple spectral bands including RGB (visible light) and NIR (near-infrared), which are critical for environmental monitoring.
This step divides large satellite scenes into manageable 256×256 pixel chips and filters out chips with excessive cloud cover, which enables efficient parallel processing and creates uniform inputs for the foundation model. This step runs on a SageMaker AI Processing job with a custom Docker image optimized for geospatial operations.

# Chip generation - simplified version
import os
import logging

from joblib import Parallel, delayed, parallel_backend

logger = logging.getLogger(__name__)

# process_single_chip is defined in the accompanying GitHub repository

def generate_chips_from_xarray(ds, chip_size, aoi_name, s2_scene_id):
    """Generates image chips of specified size from xarray dataset
    using Area of Interest (AOI) name and Sentinel-2 scene id"""
    try:
        # Calculate optimal number of jobs for parallel processing
        n_jobs = min(os.cpu_count(), 16)  # Cap parallelism at 16 or the available CPU count
        logger.info(f"Using {n_jobs} parallel jobs for chip generation")

        # Generate all x,y coordinates for processing
        coordinates = [(x, y)
                       for x in range(0, ds.sizes['x'], chip_size)
                       for y in range(0, ds.sizes['y'], chip_size)]

        logger.info(f"Generating {len(coordinates)} chips from scene")

        # Process chips in parallel
        with parallel_backend('threading', n_jobs=n_jobs):
            results = Parallel(verbose=1)(
                delayed(process_single_chip)(
                    x, y, ds, chip_size, s2_scene_id
                ) for x, y in coordinates
            )

        # Filter out None results (e.g., chips with too many cloud pixels)
        valid_results = [r for r in results if r is not None]

        logger.info(f"Successfully generated {len(valid_results)} valid chips")

        return valid_results
    except Exception as e:
        logger.error(f"Error in chip generation for scene {s2_scene_id}: {str(e)}")
        raise

For each chip, this step generates:

NetCDF datacubes (.netcdf) containing the full multispectral information
RGB thumbnails (.png) for visualization
Rich metadata (.parquet) with geolocation, timestamps, and other metadata

Step 2: Embedding generation using a Clay foundation model
The second step transforms the preprocessed image chips into vector embeddings using the Clay v1 foundation model. This is the most computationally intensive part of the pipeline, using multiple GPU instances (ml.g5.xlarge) to efficiently process the satellite imagery.
For each chip, this step:

Accesses the NetCDF datacube from Amazon S3
Normalizes the spectral bands according to the Clay v1 model’s input requirements
Generates both patch-level and class token (CLS) embeddings
Stores the embeddings as NumPy arrays (.npy) alongside the original data on S3 as intermediate store

While Clay can use all Sentinel-2 spectral bands, our implementation uses RGB and NIR as input bands to generate a 768-dimensional embedding, which provides excellent results in our examples. Customers can easily adapt the input bands based on their specific use cases. These embeddings encapsulate high-level features such as vegetation patterns, urban structures, water bodies, and land use characteristics—without requiring explicit feature engineering.
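The following is a simplified sketch of this per-chip routine; the band variable names, normalization statistics, the geofm_encoder callable, and the output naming are placeholder assumptions rather than the exact Clay v1 preprocessing code.

# Illustrative per-chip normalization and embedding step; band names, statistics,
# and the geofm_encoder callable are placeholders for the Clay v1 specifics.
import numpy as np
import xarray as xr

BANDS = ["red", "green", "blue", "nir"]
BAND_MEAN = {"red": 1369.0, "green": 1587.0, "blue": 1489.0, "nir": 2858.0}  # illustrative values
BAND_STD = {"red": 2026.0, "green": 2011.0, "blue": 2146.0, "nir": 2016.0}   # illustrative values

def embed_chip(chip_path: str, geofm_encoder) -> np.ndarray:
    """Load a 256x256 chip datacube, normalize its bands, and return its embedding."""
    ds = xr.open_dataset(chip_path)
    # Stack the selected bands into a (bands, height, width) array, normalized per band
    stacked = np.stack(
        [(ds[band].values - BAND_MEAN[band]) / BAND_STD[band] for band in BANDS],
        axis=0,
    ).astype(np.float32)
    embedding = geofm_encoder(stacked)  # forward pass through the frozen GeoFM encoder
    np.save(chip_path.replace(".netcdf", "_embedding.npy"), embedding)  # intermediate store
    return embedding
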
Step 3: Embedding processing and analysis
The third step analyzes the embeddings to extract meaningful insights, particularly for time-series analysis. Running on high-memory instances, this step:

Performs dimensionality reduction on the embeddings using principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE) (to be used later for change detection)
Computes cosine similarity between embeddings over time (an alternative for change detection)
Identifies significant changes in the embeddings that might indicate surface changes
Saves processed embeddings in Parquet format for efficient querying

The output includes processed embedding files that contain both the original high-dimensional vectors and their reduced representations, along with computed similarity metrics.
For change detection applications, this step establishes a baseline for each geographic location and calculates deviations from this baseline over time. These deviations, captured as vector distances, provide a powerful indicator of surface changes like deforestation, urban development, or natural disasters.
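A minimal sketch of this post-processing logic follows, assuming a pandas DataFrame with one row per chip and date and a 768-dimensional embedding column; the column names and the 20% baseline window are illustrative choices, not the repository's exact code.

# Sketch of the embedding post-processing step on an assumed (chip_id, date, embedding) DataFrame.
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

def process_embeddings(df: pd.DataFrame) -> pd.DataFrame:
    emb = np.vstack(df["embedding"].values)

    # Dimensionality reduction for visualization and change detection
    pca = PCA(n_components=2)
    reduced = pca.fit_transform(emb)
    df["patch_emb_pca_1"] = reduced[:, 0]
    df["patch_emb_pca_2"] = reduced[:, 1]

    # Per-chip cosine similarity against a baseline period (earliest 20% of dates)
    def baseline_similarity(group: pd.DataFrame) -> pd.Series:
        group = group.sort_values("date")
        vectors = np.vstack(group["embedding"].values)
        n_baseline = max(1, int(0.2 * len(group)))
        baseline = vectors[:n_baseline].mean(axis=0)
        sims = vectors @ baseline / (
            np.linalg.norm(vectors, axis=1) * np.linalg.norm(baseline) + 1e-12
        )
        return pd.Series(sims, index=group.index)

    df["baseline_cosine_similarity"] = (
        df.groupby("chip_id", group_keys=False).apply(baseline_similarity)
    )
    return df
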
Step 4: Consolidation and vector database integration
The final pipeline step consolidates the processed embeddings into a unified dataset and loads them into vector databases optimized for similarity search. The outputs include consolidated embedding files, GeoJSON grid files for visualization, and configuration files for frontend applications.
The solution supports two vector database options:

Amazon OpenSearch Serverless for production deployments with very low latency and high scalability requirements
LanceDB for lightweight deployments and smaller regions

Both options provide efficient ANN search capabilities, enabling sub-second query performance. The choice between them depends on the scale of deployment, integration requirements, and operational preferences.
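For illustration, the following sketch indexes embeddings into LanceDB and runs an ANN query; the table schema, paths, and the _distance result column reflect recent LanceDB versions and should be treated as assumptions rather than the exact code in the repository.

# Illustrative LanceDB indexing and ANN query; schema and values are placeholders.
import lancedb
import numpy as np

db = lancedb.connect("./geofm-embeddings")   # local, file-based vector store
records = [
    {
        "chip_id": f"chip_{i}",
        "vector": np.random.rand(768).astype("float32").tolist(),  # GeoFM embedding
        "date": "2024-01-01",
        "bbox": [-55.0, -12.0, -54.97, -11.97],   # chip footprint (lon/lat), illustrative
    }
    for i in range(100)
]
table = db.create_table("sentinel2_chips", data=records)

# Approximate nearest neighbor search against a query embedding
query_vector = records[0]["vector"]
results = table.search(query_vector).limit(10).to_pandas()
print(results[["chip_id", "_distance"]])
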
With this robust data processing and embedding generation foundation in place, let’s explore the real-world applications enabled by the pipeline, beginning with geospatial similarity search.
Geospatial similarity search
Organizations working with Earth observation data have traditionally struggled with efficiently identifying specific landscape patterns across large geographic regions. Traditional Earth observation analysis requires specialized models trained on labeled datasets for each target feature. This approach forces organizations into a lengthy process of data collection, annotation, and model training before obtaining results.
In contrast, the GeoFM-powered similarity search converts satellite imagery into 768-dimensional vector embeddings that capture the semantic essence of landscape features, eliminating the need for manual feature engineering and computation of specialized indices like NDVI or NDWI.
This capability uses the Clay foundation model’s pre-training on diverse global landscapes to understand complex relationships between features without explicit programming. The result is an intuitive image-to-image search capability where users can select a reference area—such as early-stage deforestation or wildfire damage—and instantly find similar patterns across vast territories in seconds rather than weeks.
Similarity search implementation
Our implementation provides a streamlined workflow for finding similar geographic areas using the embeddings generated by the data processing pipeline. The search process involves:

Reference area selection: Users select a reference chip representing a search term (for example, a deforested patch, urban development, or agricultural field)
Search parameters: Users specify the number of results and a similarity threshold
Vector search execution: The system retrieves similar chips using cosine similarity between embeddings
Result visualization: Matching chips are highlighted on the map (a code sketch of this workflow follows the list)
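
Building on the LanceDB sketch from the previous section, a hypothetical helper for this workflow could look as follows; the metric setting, threshold value, and returned columns are illustrative assumptions.

# Hypothetical search helper: query by a reference chip embedding, keep only results
# above a similarity threshold, and return footprints for map highlighting.
def find_similar_chips(table, reference_vector, num_results=25, min_similarity=0.85):
    hits = (
        table.search(reference_vector)
        .metric("cosine")            # assumes cosine distance is used for the search
        .limit(num_results)
        .to_pandas()
    )
    hits["similarity"] = 1.0 - hits["_distance"]   # convert cosine distance to similarity
    return hits[hits["similarity"] >= min_similarity][["chip_id", "bbox", "similarity"]]

# Example: use the embedding of a known deforested chip as the search term
# similar_chips = find_similar_chips(table, reference_vector=records[0]["vector"])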

Let’s dive deeper on a real-world application, taking our running example of detecting deforestation in the Mato Grosso region of the Brazilian Amazon. Traditional monitoring approaches often detect forest loss too late—after significant damage has already occurred. The Clay-powered similarity search capability offers a new approach by enabling early detection of emerging deforestation patterns before they expand into large-scale clearing operations.
Using a single reference chip showing the initial signs of forest degradation—such as selective logging, small clearings, or new access roads—analysts can instantly identify similar patterns across vast areas of the Amazon rainforest. As demonstrated in the following example images, the system effectively recognizes the subtle signatures of early-stage deforestation based on a single reference image. This capability enables environmental protection agencies and conservation organizations to deploy resources precisely, improving anti-deforestation efforts by addressing threats early enough to prevent major forest loss. While a single reference chip image led to good results in our examples, alternative approaches exist, such as an average vector strategy, which uses embeddings from multiple reference images to enhance the similarity search results.

Ecosystem change detection
Unlike similarity search, change detection focuses on measuring the distance between embedding vectors of the same region over time. The core assumption is that the more distant two embedding vectors are from each other, the more dissimilar the underlying satellite imagery is. Applied to a single region over time, this lets you pinpoint so-called change points: periods where a significant and long-lasting change in surface conditions occurred.
Our solution implements a timeline view of Sentinel-2 satellite observations from 2018 to present. Each observation point corresponds to a unique satellite image, allowing for detailed temporal analysis. While embedding vectors are high-dimensional, we use the previously computed PCA (and optionally t-SNE) to reduce dimensionality to a single dimension for visualization purposes.
Let’s review a compelling example from our analysis of deforestation in the Amazon. The following image is a time series plot of geospatial embeddings (first principal component) for a single 256×256 pixel chip. Cloudy images and major outliers have been removed.

Points clustered closely on the y-axis indicate similar ground conditions; sudden and persistent discontinuities in the embedding values signal significant change. Here’s what the analysis shows:

Stable forest conditions from 2018 through 2020
A significant discontinuity in embedding values during 2021. Closer review of the underlying satellite imagery shows clear evidence of forest clearing and conversion to agricultural fields
Further transformation visible in 2024 imagery

Naturally, we need a way to automate the process of change detection so that it can be applied at scale. Given that we do not typically have extensive changepoint training datasets, we need an unsupervised approach that works without labeled data. The intuition behind unsupervised change detection is the following: identify what normal looks like, then highlight large enough deviations from normal and flag them as change points; after a change point has occurred, characterize the new normal and repeat the process.
The following function performs harmonic regression analysis on the embedding time series data, specifically designed to model yearly seasonality patterns. The function fits a harmonic regression with a specified frequency (default 365 days for annual patterns) to the embedding data of a baseline period (the year 2018 in this example). It then generates predictions and calculates error metrics (absolute and percentage deviations). Large deviations from the normal seasonal pattern indicate change and can be automatically flagged using thresholding.

import datetime

import numpy as np
import pandas as pd
import matplotlib.dates as mdates
from sklearn.linear_model import Lasso
from sklearn.preprocessing import PolynomialFeatures

def fit_harmonic_regression(data, outliers, date_col="date", y_col="patch_emb_pca_1",
                            baseline_start_date=datetime.datetime(2018, 1, 1),
                            monitoring_start_date=datetime.datetime(2019, 1, 1),
                            deg=3, reg=0.001, avg_days_yr=365):
    """
    Fits a harmonic regression model to time series data using Lasso regularization.

    Parameters:
    - data: DataFrame containing the time series data
    - outliers: List of dates to exclude from the fitting process
    - date_col: Name of the column containing dates (default: 'date')
    - y_col: Name of the column containing target values (default: 'patch_emb_pca_1')
    - baseline_start_date: Start date for the baseline period (default: Jan 1, 2018)
    - monitoring_start_date: Start date for the monitoring period (default: Jan 1, 2019)
    - deg: Degree of polynomial features (default: 3)
    - reg: Regularization strength for Lasso (default: 0.001)
    - avg_days_yr: Number of days in a year (default: 365)
    """

    # Convert date column to datetime format and create numerical representation
    data[date_col] = pd.to_datetime(data[date_col])
    data["date_numerical"] = data[date_col].apply(lambda x: mdates.date2num(x))
    t_full = data["date_numerical"]
    y_full = data[y_col]

    # Filter data for model fitting:
    # - Remove outliers
    # - Keep only data between baseline_start_date and monitoring_start_date
    t_fitting = data[
        (~data[date_col].isin(outliers)) &
        (data[date_col] >= baseline_start_date) &
        (data[date_col] < monitoring_start_date)
    ]["date_numerical"]

    y_fitting = data[
        (~data[date_col].isin(outliers)) &
        (data[date_col] >= baseline_start_date) &
        (data[date_col] < monitoring_start_date)
    ][y_col]

    # Create design matrix for harmonic regression
    # Convert time to angular frequency (2π/period)
    w = 2 * np.pi / avg_days_yr

    # Generate polynomial features from sine and cosine transformations
    poly = PolynomialFeatures(deg)
    # Create feature matrix for fitting period
    X_fitting = poly.fit_transform(np.column_stack((
        np.sin(w * t_fitting),
        np.cos(w * t_fitting)
    )))
    # Create feature matrix for all data points
    X_full = poly.fit_transform(np.column_stack((
        np.sin(w * t_full),
        np.cos(w * t_full)
    )))

    # Initialize and fit Lasso regression model
    lasso_model = Lasso(alpha=reg)
    lasso_model.fit(X_fitting, y_fitting)

    # Generate predictions for all time points
    y_fit = lasso_model.predict(X_full)

    # Calculate error metrics
    absolute_deviation = np.abs(y_full - y_fit)
    percentage_deviation = (absolute_deviation / np.abs(y_fit)) * 100

    # Compile results into a DataFrame
    df = pd.DataFrame()
    df["date"] = data[date_col]
    df["date_numerical"] = data["date_numerical"]
    df[f"{y_col}_true"] = data[y_col]                  # Original values
    df[f"{y_col}_pred"] = y_fit                        # Predicted values
    df[f"{y_col}_abs_error"] = absolute_deviation      # Absolute error
    df[f"{y_col}_perc_error"] = percentage_deviation   # Percentage error
    # Add time-based features
    df["year"] = df["date"].apply(lambda x: x.year)
    df["month"] = df["date"].apply(lambda x: x.month)
    df["year_month"] = df.apply(
        lambda x: "{}_{}".format(str(x.year), str(x.month)),
        axis=1
    )
    # Return the fitted model, polynomial transformer, and results DataFrame
    return lasso_model, poly, df

When applied to the chips across an area of observation and combined with a threshold on the maximum deviation from the fitted harmonic regression, this approach lets us automatically map change intensity, allowing analysts to quickly zoom in on problematic areas.
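As a sketch of this thresholding step, the following hypothetical helper flags change points in the DataFrame returned by fit_harmonic_regression; the 50% deviation threshold is an illustrative value that needs tuning per use case.

# Hypothetical helper: flag observations that deviate strongly from the fitted
# harmonic baseline; threshold_pct is illustrative and should be tuned.
def flag_change_points(results_df, y_col="patch_emb_pca_1", threshold_pct=50.0):
    """Return the observations where the percentage deviation exceeds the threshold."""
    flagged = results_df[results_df[f"{y_col}_perc_error"] > threshold_pct]
    return flagged[["date", f"{y_col}_true", f"{y_col}_pred", f"{y_col}_perc_error"]]

# Mapping change intensity across an area of observation (chip_results is assumed to
# map chip IDs to the DataFrames returned by fit_harmonic_regression):
# change_intensity = {chip_id: len(flag_change_points(df)) for chip_id, df in chip_results.items()}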

While this method performs well in our analyses, it is also quite rigid in that it requires careful tuning of error thresholds and the definition of a baseline period. More sophisticated approaches are available, ranging from general-purpose time series methods that automate the baseline definition and change point detection using recursive methods (for example, Gaussian processes) to specialized algorithms for geospatial change detection (for example, LandTrendr and Continuous Change Detection and Classification (CCDC)).
In sum, our approach to change detection demonstrates the power of geospatial embedding vectors in tracking environmental changes over time, providing valuable insights for land use monitoring, environmental protection, and urban planning applications.
GeoFM fine-tuning for your custom use case
Fine-tuning is a specific implementation of transfer learning, in which a pre-trained foundation model is adapted to specific tasks through targeted additional training on specialized labeled datasets. For GeoFMs, these specific tasks can target agriculture, disaster monitoring, or urban analysis. The model retains its broad spatial understanding while developing expertise for particular regions, ecosystems, or analytical tasks. This approach significantly reduces computational and data requirements compared to building specialized models from scratch, without sacrificing accuracy. Fine-tuning typically involves preserving the pre-trained Clay encoder, which has already learned rich representations of spectral patterns, spatial relationships, and temporal dynamics from massive satellite imagery, while attaching and training a specialized task-specific head.
For pixel-wise prediction tasks, such as land use segmentation, the specialized head is typically a decoder architecture, whereas for class-level outputs (classification tasks) the head can be as basic as a multilayer perceptron network.
The segmentation framework combines the powerful pre-trained Clay encoder with an efficient convolutional decoder, turning Clay’s rich understanding of satellite imagery into detailed land use maps. Training focuses exclusively on this new decoder: a lightweight stack of convolutional layers and pixel shuffle upsampling that takes the feature representations from Clay’s frozen encoder and gradually transforms them back to full-resolution images in which each pixel is classified according to its land use type. By freezing the encoder (a large vision transformer with 24 layers and 16 attention heads) and only training the compact decoder, the model achieves a good balance between computational efficiency and segmentation accuracy.
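The following PyTorch sketch outlines what such a decoder head can look like; the 16×16 patch grid, channel sizes, and class count are illustrative assumptions rather than the exact architecture used in the repository.

# Rough sketch of a lightweight convolutional decoder head with pixel shuffle upsampling.
# Assumes 768-d Clay patch embeddings reshaped to a (batch, 768, 16, 16) feature map.
import torch
import torch.nn as nn

class SegmentationHead(nn.Module):
    def __init__(self, embed_dim: int = 768, num_classes: int = 7):
        super().__init__()
        self.decoder = nn.Sequential(
            nn.Conv2d(embed_dim, 256, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.PixelShuffle(upscale_factor=4),         # 16x16 -> 64x64, channels 256 -> 16
            nn.Conv2d(16, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.PixelShuffle(upscale_factor=4),         # 64x64 -> 256x256, channels 64 -> 4
            nn.Conv2d(4, num_classes, kernel_size=1),  # per-pixel class logits
        )

    def forward(self, patch_embeddings):
        # patch_embeddings: (batch, 768, 16, 16) produced by the frozen Clay encoder
        return self.decoder(patch_embeddings)

head = SegmentationHead()
logits = head(torch.randn(2, 768, 16, 16))   # -> (2, num_classes, 256, 256)
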
We applied this segmentation architecture to a labeled land use land cover (LULC) dataset from Impact Observatory, hosted on the Amazon Registry of Open Data. For illustrative purposes, we again focused on our running example from Brazil’s Mato Grosso region. We trained the decoder head for 10 epochs, which took 17 minutes in total, and tracked intersection over union (IoU) and F1 score as segmentation accuracy metrics. After just one training epoch, the model already achieved 85.7% validation IoU. With the full 10 epochs completed, performance increased to an impressive 92.4% IoU and 95.6% F1 score. In the following image, we show ground truth satellite imagery (upper) and the model’s predictions (lower). The visual comparison highlights how accurately this approach can classify different land use categories.

Conclusion
Novel GeoFMs provide an encouraging new approach to geospatial analytics. Through their extensive pre-training, these models have incorporated a deep implicit understanding of geospatial data and can be used out of the box for high-impact use cases such as similarity search or change detection. They can also serve as the basis for specialized models using a fine-tuning process that is significantly less data-hungry (less labeled data needed) and has lower compute requirements.
In this post, we have shown how you can deploy a state-of-the-art GeoFM (Clay) on AWS and have explored one specific use case, monitoring deforestation in the Amazon rainforest, in greater detail. The same approach is applicable to a large variety of industry use cases. For example, insurance companies can use a similar approach to ours to assess damage after natural disasters such as hurricanes, floods, or fires, and to keep track of their insured assets. Agricultural organizations can use GeoFMs for crop type identification, crop yield predictions, or other use cases. We also envision high-impact use cases in industries like urban planning, emergency and disaster response, supply chain and global trade, sustainability and environmental modeling, and many others. To get started applying GeoFMs to your own earth observation use case, check out the accompanying GitHub repository, which has the prerequisites and a step-by-step walkthrough to run it on your own area of interest.

About the Authors
Dr. Karsten Schroer is a Senior Machine Learning (ML) Prototyping Architect at AWS, focused on helping customers leverage artificial intelligence (AI), ML, and generative AI technologies. With deep ML expertise, he collaborates with companies across industries to design and implement data- and AI-driven solutions that generate business value. Karsten holds a PhD in applied ML.
Bishesh Adhikari is a Senior ML Prototyping Architect at AWS with over a decade of experience in software engineering and AI/ML. Specializing in GenAI, LLMs, NLP, CV, and GeoSpatial ML, he collaborates with AWS customers to build solutions for challenging problems through co-development. His expertise accelerates customers’ journey from concept to production, tackling complex use cases across various industries. In his free time, he enjoys hiking, traveling, and spending time with family and friends.
Dr. Iza Moise is a Senior Machine Learning (ML) Prototyping Architect at AWS, with expertise in both traditional ML and advanced techniques like foundation models and vision transformers. She focuses on applied ML across diverse scientific fields, publishing and reviewing at Amazon’s internal ML conferences. Her strength lies in translating theoretical advances into practical solutions that deliver measurable impact through thoughtful implementation.

Create an agentic RAG application for advanced knowledge discovery wit …

Agentic Retrieval Augmented Generation (RAG) applications represent an advanced approach in AI that integrates foundation models (FMs) with external knowledge retrieval and autonomous agent capabilities. These systems dynamically access and process information, break down complex tasks, use external tools, apply reasoning, and adapt to various contexts. They go beyond simple question answering by performing multi-step processes, making decisions, and generating complex outputs.
In this post, we demonstrate an example of building an agentic RAG application using the LlamaIndex framework. LlamaIndex is a framework that connects FMs with external data sources. It helps ingest, structure, and retrieve information from databases, APIs, PDFs, and more, enabling the agent and RAG for AI applications.
This application serves as a research tool, using the Mistral Large 2 FM on Amazon Bedrock to generate responses for the agent flow. The example application interacts with well-known websites, such as arXiv, GitHub, TechCrunch, and DuckDuckGo, and can access knowledge bases containing documentation and internal knowledge.
This application can be further expanded to accommodate broader use cases requiring dynamic interaction with internal and external APIs, as well as the integration of internal knowledge bases to provide more context-aware responses to user queries.
Solution overview
This solution uses the LlamaIndex framework to build an agent flow with two main components: AgentRunner and AgentWorker. The AgentRunner serves as an orchestrator that manages conversation history, creates and maintains tasks, executes task steps, and provides a user-friendly interface for interactions. The AgentWorker handles the step-by-step reasoning and task execution.
For reasoning and task planning, we use Mistral Large 2 on Amazon Bedrock. You can use other text generation FMs available from Amazon Bedrock. For the full list of supported models, see Supported foundation models in Amazon Bedrock. The agent integrates with GitHub, arXiv, TechCrunch, and DuckDuckGo APIs, while also accessing internal knowledge through a RAG framework to provide context-aware answers.
In this solution, we present two options for building the RAG framework:

Document integration with Amazon OpenSearch Serverless – The first option involves using LlamaIndex to programmatically load and process documents. It splits the documents into chunks using various chunking strategies and then stores these chunks in an Amazon OpenSearch Serverless vector store for future retrieval.
Document integration with Amazon Bedrock Knowledge Bases – The second option uses Amazon Bedrock Knowledge Bases, a fully managed service that handles the loading, processing, and chunking of documents. This service can quickly create a new vector store on your behalf with a few configurations and clicks. You can choose from Amazon OpenSearch Serverless, Amazon Aurora PostgreSQL-Compatible Edition Serverless, and Amazon Neptune Analytics. Additionally, the solution includes a document retrieval rerank feature to enhance the relevance of the responses.

You can select the RAG implementation option that best suits your preference and developer skill level.
The following diagram illustrates the solution architecture.

In the following sections, we present the steps to implement the agentic RAG application. You can also find the sample code in the GitHub repository.
Prerequisites
The solution has been tested in the AWS Region us-west-2. Complete the following steps before proceeding:

Set up the following resources:

Create an Amazon SageMaker domain.
Create a SageMaker domain user profile.
Launch Amazon SageMaker Studio, select JupyterLab, and create a space.
Select the instance t3.medium and the image SageMaker Distribution 2.3.1, then run the space.

Request model access:

On the Amazon Bedrock console, choose Model access in the navigation pane.
Choose Modify model access.
Select the models Mistral Large 2 (24.07), Amazon Titan Text Embeddings V2, and Rerank 1.0 from the list, and request access to these models.

Configure AWS Identity and Access Management (IAM) permissions:

In the SageMaker console, go to the SageMaker user profile details and find the execution role that the SageMaker notebook uses. It should look like AmazonSageMaker-ExecutionRole-20250213T123456.

In the IAM console, create an inline policy for this execution role so that the role can perform the following actions:

Access to Amazon Bedrock services including:

Reranking capabilities
Retrieving information
Invoking models
Listing available foundation models

IAM permissions to:

Create policies
Attach policies to roles within your account

Full access to Amazon OpenSearch Serverless

Run the following command in the JupyterLab notebook terminal to download the sample code from GitHub:

git init
git remote add origin https://github.com/aws-samples/mistral-on-aws.git
git sparse-checkout init
git sparse-checkout set "notebooks/mistral-llamaindex-agentic-rag"
git pull origin main

Finally, install the required Python packages by running the following command in the terminal:

cd mistral-llamaindex-agentic-rag
pip install -r requirements.txt

Initialize the models
Initialize the FM used for orchestrating the agentic flow with Amazon Bedrock Converse API. This API provides a unified interface for interacting with various FMs available on Amazon Bedrock. This standardization simplifies the development process, allowing developers to write code one time and seamlessly switch between different models without adjusting for model-specific differences. In this example, we use the Mistral Large 2 model on Amazon Bedrock.
Next, initialize the embedding model from Amazon Bedrock, which is used for converting document chunks into embedding vectors. For this example, we use Amazon Titan Text Embeddings V2. See the following code:

# Initialise and configure the BedrockConverse LLM with the Mistral Large 2 model and set it as the default in Settings

from llama_index.llms.bedrock_converse import BedrockConverse
from llama_index.core import Settings

llm = BedrockConverse(model="mistral.mistral-large-2407-v1:0", max_tokens=2048)
Settings.llm = llm

# Initialise and configure the embedding model with Amazon Titan Text Embeddings V2, and set it as the default in Settings

from llama_index.embeddings.bedrock import BedrockEmbedding

embed_model = BedrockEmbedding(model_name="amazon.titan-embed-text-v2:0")
Settings.embed_model = embed_model

Integrate API tools
Implement two functions to interact with the GitHub and TechCrunch APIs. The APIs shown in this post don’t require credentials. To provide clear communication between the agent and the foundation model, follow Python function best practices, including:

Type hints for parameter and return value validation
Detailed docstrings explaining function purpose, parameters, and expected returns
Clear function descriptions

The following code sample shows the function that integrates with the GitHub API. After the function is created, use the FunctionTool.from_defaults() method to wrap the function as a tool and integrate it seamlessly into the LlamaIndex workflow.
See the code repository for the full code samples of the function that integrates with the TechCrunch API.

# Define a function to search GitHub repositories by topic, sorting by stars or update date, and return top results
import requests
from llama_index.core.tools import FunctionTool

def github_search(topic: str, num_results: int = 3, sort_by: str = "stars") -> list:
    """
    Retrieve a specified number of GitHub repositories based on a given topic,
    ranked by the specified criteria.

    This function uses the GitHub API to search for repositories related to a
    specific topic or keyword. The results can be sorted by the number of stars
    (popularity) or the most recent update, with the most relevant repositories
    appearing first according to the chosen sorting method.

    Parameters:
    -----------
    topic : str
        The topic or keyword to search for in GitHub repositories.
        The topic cannot contain blank spaces.
    num_results : int, optional
        The number of repository results to retrieve. Defaults to 3.
    sort_by : str, optional
        The criterion for sorting the results. Options include:
        - 'stars': Sort by the number of stars (popularity).
        - 'updated': Sort by the date of the last update (most recent first).
        Defaults to 'stars'.

    Returns:
    --------
    list
        A list of dictionaries, where each dictionary contains information
        about a repository. Each dictionary includes:
        - 'html_url': The URL of the repository.
        - 'description': A brief description of the repository.
        - 'stargazers_count': The number of stars (popularity) the repository has.
    """

    url = f"https://api.github.com/search/repositories?q=topic:{topic}&sort={sort_by}&order=desc"
    response = requests.get(url).json()
    code_repos = [
        {
            'html_url': item['html_url'],
            'description': item['description'],
            'stargazers_count': item['stargazers_count'],
        }
        for item in response['items'][:num_results]
    ]
    return code_repos

github_tool = FunctionTool.from_defaults(fn=github_search)

For arXiv and DuckDuckGo integration, we use LlamaIndex’s pre-built tools instead of creating custom functions. You can explore other available pre-built tools in the LlamaIndex documentation to avoid duplicating existing solutions.

# Import and configure the ArxivToolSpec and DuckDuckGoSearchToolSpec from LlamaIndex prebuilt tools

from llama_index.tools.arxiv import ArxivToolSpec
from llama_index.tools.duckduckgo import DuckDuckGoSearchToolSpec

arxiv_tool = ArxivToolSpec()
search_tool = DuckDuckGoSearchToolSpec()

api_tools = arxiv_tool.to_tool_list() + search_tool.to_tool_list()

# Consolidate all tools into one list.
api_tools.extend([news_tool, github_tool])

RAG option 1: Document integration with Amazon OpenSearch Serverless
Next, programmatically build the RAG component using LlamaIndex to load, process, and chunk documents, then store the embedding vectors in Amazon OpenSearch Serverless. This approach offers greater flexibility for advanced scenarios, such as loading various file types (including .epub and .ppt) and selecting advanced chunking strategies based on file types (such as HTML, JSON, and code).
Before moving forward, you can download some PDF documents for testing from the AWS website using the following command, or you can use your own documents. The following documents are AWS guides that help in choosing the right generative AI service (such as Amazon Bedrock or Amazon Q) based on use case, customization needs, and automation potential. They also assist in selecting AWS machine learning (ML) services (such as SageMaker) for building models, using pre-trained AI, and using cloud infrastructure.

# download test documents from the following links
!wget -O docs/genai_on_aws.pdf "https://docs.aws.amazon.com/pdfs/decision-guides/latest/generative-ai-on-aws-how-to-choose/generative-ai-on-aws-how-to-choose.pdf?did=wp_card&trk=wp_card#guide"
!wget -O docs/ml_on_aws.pdf "https://docs.aws.amazon.com/pdfs/decision-guides/latest/machine-learning-on-aws-how-to-choose/machine-learning-on-aws-how-to-choose.pdf?did=wp_card&trk=wp_card#guide"

Load the PDF documents using SimpleDirectoryReader() in the following code. For a full list of supported file types, see the LlamaIndex documentation.

# use LlamaIndex to load documents
from llama_index.core import SimpleDirectoryReader

loader = SimpleDirectoryReader('docs/')
documents = loader.load_data()

Next, create an Amazon OpenSearch Serverless collection as the vector database. Check the utils.py file for details on the create_collection() function.

# Create Amazon OpenSearch Serverless collection
from utils import *
import sagemaker
import random

region_name = "us-west-2"
suffix = random.randrange(1, 500)
collection_name = "llamaindex-blog-" + str(suffix)
notebook_execution_role = sagemaker.get_execution_role()
endpoint = create_collection(collection_name, notebook_execution_role)

After you create the collection, create an index to store embedding vectors:

## create an index in the collection
index_name = "pdf-rag"
create_index(index_name, endpoint, emb_dim=1024)

Next, use the following code to implement a document search system using LlamaIndex integrated with Amazon OpenSearch Serverless. It first sets up AWS authentication to securely access OpenSearch Service, then configures a vector client that can handle 1024-dimensional embeddings (specifically designed for the Amazon Titan Embedding V2 model). The code processes input documents by breaking them into manageable chunks of 1,024 tokens with a 20-token overlap, converts these chunks into vector embeddings, and stores them in the OpenSearch Serverless vector index. You can select a different or more advanced chunking strategy by modifying the transformations parameter in the VectorStoreIndex.from_documents() method. For more information and examples, see the LlamaIndex documentation.

import boto3
from llama_index.vector_stores.opensearch import OpensearchVectorStore, OpensearchVectorClient
from opensearchpy import RequestsHttpConnection, AWSV4SignerAuth
from llama_index.core import VectorStoreIndex, StorageContext
from llama_index.core.node_parser import SentenceSplitter

## integrate Amazon OpenSearch Serverless collection and index to LlamaIndex

dim = 1024  # Amazon Titan Embedding V2 model dimension
service = 'aoss'
credentials = boto3.Session().get_credentials()
awsauth = AWSV4SignerAuth(credentials, region_name, service)

client = OpensearchVectorClient(
    endpoint,
    index_name,
    dim,
    embedding_field="vector",
    text_field="chunk",
    http_auth=awsauth,
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection,
)

# initialise vector store and save document chunks to the vector store
vector_store = OpensearchVectorStore(client)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
    transformations=[SentenceSplitter(chunk_size=1024, chunk_overlap=20)]
)

You can add a reranking step in the RAG pipeline, which improves the quality of information retrieved by making sure that the most relevant documents are presented to the language model, resulting in more accurate and on-topic responses:

from llama_index.postprocessor.bedrock_rerank import AWSBedrockRerank

reranker = AWSBedrockRerank(
    top_n=3,
    model_id="amazon.rerank-v1:0",  # another rerank model option is: cohere.rerank-v3-5:0
    region_name="us-west-2",
)
query_engine = index.as_query_engine(
    similarity_top_k=10,
    node_postprocessors=[reranker],
)

Use the following code to test the RAG framework. You can compare results by enabling or disabling the reranker model.

response = query_engine.query(
    "In which situation should I use Amazon Bedrock over Amazon SageMaker?",
)
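
For example, to compare against a pipeline without reranking, you can build a second query engine that relies on vector similarity alone; the top-k value here is illustrative.

# Query engine without the reranker, for side-by-side comparison
query_engine_no_rerank = index.as_query_engine(similarity_top_k=3)
response_no_rerank = query_engine_no_rerank.query(
    "In which situation should I use Amazon Bedrock over Amazon SageMaker?",
)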

Next, convert the vector store into a LlamaIndex QueryEngineTool, which requires a tool name and a comprehensive description. This tool is then combined with other API tools to create an agent worker that executes tasks in a step-by-step manner. The code initializes an AgentRunner to orchestrate the entire workflow, analyzing text inputs and generating responses. The system can be configured to support parallel tool execution for improved efficiency.

# create QueryEngineTool based on the OpenSearch vector store

from llama_index.core.tools import QueryEngineTool, ToolMetadata
from llama_index.core.agent import FunctionCallingAgentWorker, AgentRunner

oss_tool = QueryEngineTool(
    query_engine=query_engine,
    metadata=ToolMetadata(
        name="oss_guide_tool",
        description="""
        These decision guides help users select appropriate AWS machine learning and generative AI services based on specific needs.
        They cover pre-built solutions, customizable platforms, and infrastructure options for ML workflows,
        while outlining how generative AI can automate processes, personalize content, augment data, reduce costs,
        and enable faster experimentation in various business contexts.""",
    ),
)

all_tools = api_tools + [oss_tool]

# system_prompt and text_input are defined earlier in the notebook
agent_worker = FunctionCallingAgentWorker.from_tools(
    all_tools,
    llm=llm,
    verbose=True,  # Set verbose=True to display the full trace of steps.
    system_prompt=system_prompt,
    # allow_parallel_tool_calls=True  # Uncomment this line to allow multiple tool invocations
)
agent = AgentRunner(agent_worker)
response = agent.chat(text_input)

You have now completed building the agentic RAG application using LlamaIndex and Amazon OpenSearch Serverless. You can test the chatbot application with your own questions. For example, ask about the latest news and features regarding Amazon Bedrock, or inquire about the latest papers and most popular GitHub repositories related to generative AI.
RAG option 2: Document integration with Amazon Bedrock Knowledge Bases
In this section, you use Amazon Bedrock Knowledge Bases to build the RAG framework. You can create an Amazon Bedrock knowledge base on the Amazon Bedrock console or follow the provided notebook example to create it programmatically. Create a new Amazon Simple Storage Service (Amazon S3) bucket for the knowledge base, then upload the previously downloaded files to this S3 bucket. You can select different embedding models and chunking strategies that work better for your data. After you create the knowledge base, remember to sync the data. Data synchronization might take a few minutes.
To enable your newly created knowledge base to invoke the rerank model, you need to modify its permissions. First, open the Amazon Bedrock console and locate the service role that matches the one shown in the following screenshot.

Choose the role and add the following provided IAM permission policy as an inline policy. This additional authorization grants your knowledge base the necessary permissions to successfully invoke the rerank model on Amazon Bedrock.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "bedrock:InvokeModel",
            "Resource": "arn:aws:bedrock:us-west-2::foundation-model/amazon.rerank-v1:0"
        },
        {
            "Effect": "Allow",
            "Action": "bedrock:Rerank",
            "Resource": "*"
        }
    ]
}

Use the following code to integrate the knowledge base into the LlamaIndex framework. Specific configurations can be provided in the retrieval_config parameter, where numberOfResults is the maximum number of retrieved chunks from the vector store, and overrideSearchType has two valid values: HYBRID and SEMANTIC. In the rerankConfiguration, you can optionally provide a rerank modelConfiguration and numberOfRerankedResults to sort the retrieved chunks by relevancy scores and select only the defined number of results. For the full list of available configurations for retrieval_config, refer to the Retrieve API documentation.

# Configure a knowledge base retriever using AmazonKnowledgeBasesRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.retrievers.bedrock import AmazonKnowledgeBasesRetriever

# maximum number of relevant text chunks that will be retrieved
# If you need quick, focused answers: lower numbers (1-3)
# If you need detailed, comprehensive answers: higher numbers (5-10)
top_k = 10

# search mode options: HYBRID, SEMANTIC
# HYBRID search combines the strengths of semantic search and keyword search
# Balances semantic understanding with exact matching
# https://docs.llamaindex.ai/en/stable/examples/retrievers/bedrock_retriever/
search_mode = "HYBRID"

kb_retriever = AmazonKnowledgeBasesRetriever(
    knowledge_base_id=knowledge_base_id,
    retrieval_config={
        "vectorSearchConfiguration": {
            "numberOfResults": top_k,
            "overrideSearchType": search_mode,
            "rerankingConfiguration": {
                "bedrockRerankingConfiguration": {
                    "modelConfiguration": {
                        "modelArn": "arn:aws:bedrock:us-west-2::foundation-model/amazon.rerank-v1:0"
                    },
                    "numberOfRerankedResults": 3
                },
                "type": "BEDROCK_RERANKING_MODEL"
            }
        },
    }
)
kb_engine = RetrieverQueryEngine(retriever=kb_retriever)

Like the first option, you can create the knowledge base as a QueryEngineTool in LlamaIndex and combine it with other API tools. Then, you can create a FunctionCallingAgentWorker using these combined tools and initialize an AgentRunner to interact with them. By using this approach, you can chat with and take advantage of the capabilities of the integrated tools.

# Create a query tool for Bedrock Knowledge Base
kb_tool = QueryEngineTool(
    query_engine=kb_engine,
    metadata=ToolMetadata(
        name="kb_tool",
        description="""
        These decision guides help users select appropriate AWS machine learning and generative AI services based on specific needs.
        They cover pre-built solutions, customizable platforms, and infrastructure options for ML workflows,
        while outlining how generative AI can automate processes, personalize content, augment data, reduce costs,
        and enable faster experimentation in various business contexts.""",
    ),
)

# Update the agent to include all API tools and the Knowledge Base tool.
all_tools = api_tools + [kb_tool]

agent_worker = FunctionCallingAgentWorker.from_tools(
    all_tools,
    llm=llm,
    verbose=True,  # Set verbose=True to display the full trace of steps.
    system_prompt=system_prompt,
    # allow_parallel_tool_calls=True  # Uncomment this line to allow multiple tool invocations
)
agent = AgentRunner(agent_worker)
response = agent.chat(text_input)

Now you have built the agentic RAG solution using LlamaIndex and Amazon Bedrock Knowledge Bases.
Clean up
When you finish experimenting with this solution, use the following steps to clean up the AWS resources to avoid unnecessary costs:

In the Amazon S3 console, delete the S3 bucket and data created for this solution.
In the OpenSearch Service console, delete the collection that was created for storing the embedding vectors.
In the Amazon Bedrock Knowledge Bases console, delete the knowledge base you created.
In the SageMaker console, navigate to your domain and user profile, and launch SageMaker Studio to stop or delete the JupyterLab instance.

Conclusion
This post demonstrated how to build a powerful agentic RAG application using LlamaIndex and Amazon Bedrock that goes beyond traditional question answering systems. By integrating Mistral Large 2 as the orchestrating model with external APIs (GitHub, arXiv, TechCrunch, and DuckDuckGo) and internal knowledge bases, you’ve created a versatile technology discovery and research tool.
We showed you two complementary approaches to implement the RAG framework: a programmatic implementation using LlamaIndex with Amazon OpenSearch Serverless, providing maximum flexibility for advanced use cases, and a managed solution using Amazon Bedrock Knowledge Bases that simplifies document processing and storage with minimal configuration. You can try out the solution using the following code sample.
For more relevant information, see Amazon Bedrock, Amazon Bedrock Knowledge Bases, Amazon OpenSearch Serverless, and Use a reranker model in Amazon Bedrock. Refer to Mistral AI in Amazon Bedrock to see the latest Mistral models that are available on both Amazon Bedrock and AWS Marketplace.

About the Authors
Ying Hou, PhD, is a Sr. Specialist Solution Architect for Gen AI at AWS, where she collaborates with model providers to onboard the latest and most intelligent AI models onto AWS platforms. With deep expertise in Gen AI, ASR, computer vision, NLP, and time-series forecasting models, she works closely with customers to design and build cutting-edge ML and GenAI applications. Outside of architecting innovative AI solutions, she enjoys spending quality time with her family, getting lost in novels, and exploring the UK’s national parks.
Preston Tuggle is a Sr. Specialist Solutions Architect with the Third-Party Model Provider team at AWS. He focuses on working with model providers across Amazon Bedrock and Amazon SageMaker, helping them accelerate their go-to-market strategies through technical scaling initiatives and customer engagement.

Text-to-image basics with Amazon Nova Canvas

AI image generation has emerged as one of the most transformative technologies in recent years, revolutionizing how you create and interact with visual content. Amazon Nova Canvas is a generative model in the suite of Amazon Nova creative models that enables you to generate realistic and creative images from plain text descriptions.
This post serves as a beginner’s guide to using Amazon Nova Canvas. We begin with the steps to get set up on Amazon Bedrock. Amazon Bedrock is a fully managed service that hosts leading foundation models (FMs) for various use cases such as text, code, and image generation; summarization; question answering; and custom use cases that involve fine-tuning and Retrieval Augmented Generation (RAG). In this post, we focus on the Amazon Nova image generation models available in AWS Regions in the US, in particular, the Amazon Nova Canvas model. We then provide an overview of the image generation process (diffusion) and dive deep into the input parameters for text-to-image generation with Amazon Nova Canvas.
Get started with image generation on Amazon Bedrock
Complete the following steps to get set up with access to Amazon Nova Canvas and the image playground:

Create an AWS account if you don’t have one already.
Open the Amazon Bedrock console as an AWS Identity and Access Management (IAM) administrator or appropriate IAM user.
Confirm and choose one of the Regions where the Amazon Nova Canvas model is available (for example, US East (N. Virginia)).
In the navigation pane, choose Model access under Bedrock configurations.

Under What is Model access, choose Modify model access or Enable specific models (if not yet activated).

Select Nova Canvas, then choose Next.

On the Review and submit page, choose Submit.

Refresh the Base models list. If the Amazon Nova Canvas model shows the Access granted status, you are ready to proceed with the next steps.

In the navigation pane, choose Image / Video under Playgrounds.

Choose Select model, then choose Amazon and Nova Canvas. Then choose Apply.

You are all set up to start generating images with Amazon Nova Canvas on Amazon Bedrock. The following screenshot shows an example of our playground.

Understanding the generation process
Amazon Nova Canvas uses diffusion-based approaches to generate images:

Starting point – The process begins with random noise (a pure static image).
Iterative denoising – The model gradually removes noise in steps, guided by your prompts. The amount of noise to remove at each step is learned during training. For instance, for a model to generate an image of a cat, it has to be trained on multiple cat images into which noise is iteratively inserted until the image is pure noise. By learning how much noise is added at each step, the model effectively learns the reverse process: starting with a noisy image and iteratively subtracting noise to arrive at the image of a cat.
Text conditioning – The text prompt serves as the conditioning that guides the image generation process. The prompt is encoded as a numerical vector and referenced against similar vectors in a text-image embedding space that correspond to images; guided by these vectors, the noisy image is gradually transformed into an image that captures the input prompt.
Image conditioning – In addition to text prompts, Amazon Nova Canvas also accepts images as inputs.
Safety and fairness – To comply with safety and fairness goals, both the prompt and the generated output image go through filters. If no filter is triggered, the final image is returned.

Prompting fundamentals
Image generation begins with effective prompting—the art of crafting text descriptions that guide the model toward your desired output. Well-constructed prompts include specific details about subject, style, lighting, perspective, mood, and composition, and work better when structured as image captions rather than a command or conversation. For example, rather than saying “generate an image of a mountain,” a more effective prompt might be “a majestic snow-capped mountain peak at sunset with dramatic lighting and wispy clouds, photorealistic style.” Refer to Amazon Nova Canvas prompting best practices for more information about prompting.
Let’s address the following prompt elements and observe their impact on the final output image:

Subject descriptions (what or who is in the image) – In the following example, we use the prompt “a cat sitting on a chair.”

Style references (photography, oil painting, 3D render) – In the following examples, we use the prompts “A cat sitting on a chair, oil painting style” and then “A cat sitting on a chair, anime style.”

Compositional elements and technical specifications (foreground, background, perspective, lighting) – In the following examples, we use the prompts “A cat sitting on a chair, mountains in the background,” and “A cat sitting on a chair, sunlight from the right low angle shot.”

Positive and negative prompts
Positive prompts tell the model what to include. These are the elements, styles, and characteristics you want to observe in the final image. Avoid the use of negation words like “no,” “not,” or “without” in your prompt. Amazon Nova Canvas has been trained on image-caption pairs, and captions rarely describe what isn’t in an image. Therefore, the model has never learned the concept of negation. Instead, use negative prompts to specify elements to exclude from the output.
Negative prompts specify what to avoid. Common negative prompts include “blurry,” “distorted,” “low quality,” “poor anatomy,” “bad proportions,” “disfigured hands,” or “extra limbs,” which help models avoid typical generation artifacts.
In the following examples, we first use the prompt “An aerial view of an archipelago,” then we refine the prompt as “An aerial view of an archipelago. Negative Prompt: Beaches.”

The balance between positive and negative prompting creates a defined creative space for the model to work within, often resulting in more predictable and desirable outputs.
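To make this concrete, the following shows how the archipelago example could be expressed as an Amazon Nova Canvas request body, with the exclusion passed as negativeText rather than a negation in the main prompt; the generation parameter values are illustrative.

#Illustrative request body combining a positive prompt with a negative prompt
request_body = {
    "taskType": "TEXT_IMAGE",
    "textToImageParams": {
        "text": "An aerial view of an archipelago",   #Positive prompt
        "negativeText": "beaches"                     #Negative prompt
    },
    "imageGenerationConfig": {
        "numberOfImages": 1,
        "cfgScale": 6.5,   #Illustrative values
        "seed": 12
    }
}
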
Image dimensions and aspect ratios
Amazon Nova Canvas is trained on 1:1, portrait and landscape resolutions, with generation tasks having a maximum output resolution of 4.19 million pixels (that is, 2048×2048, 2816×1536). For editing tasks, the image should be 4,096 pixels on its longest side, have an aspect ratio between 1:4 and 4:1, and have a total pixel count of 4.19 million or smaller. Understanding dimensional limitations helps avoid stretched or distorted results, particularly for specialized composition needs.
Classifier-free guidance scale
The classifier-free guidance (CFG) scale controls how strictly the model follows your prompt:

Low values (1.1–3) – More creative freedom for the AI, potentially more aesthetic, but low contrast and less prompt-adherent results
Medium values (4–7) – Balanced approach, typically recommended for most generations
High values (8–10) – Strict prompt adherence, which can produce more precise results but sometimes at the cost of natural aesthetics and increased color saturation

In the following examples, we use the prompt “Cherry blossoms, bonsai, Japanese style landscape, high resolution, 8k, lush greens in the background.”
The first image with CFG 2 captures some elements of cherry blossoms and bonsai. The second image with CFG 8 adheres more to the prompt with a potted bonsai, more pronounced cherry blossom flowers, and lush greens in the background.

Think of the CFG scale as adjusting how literally your instructions are taken vs. how much artistic interpretation the model applies.
Seed values and reproducibility
Every image generation begins with a randomization seed—essentially a starting number that determines initial conditions:

Seeds are typically represented as long integers (for example, 1234567890)
Using the same seed, prompt, and parameters reproduces identical images every time
Saving seeds allows you to revisit successful generations or create variations on promising results
Seed values have no inherent quality; they are simply different starting points

Reproducibility through seed values is essential for professional workflows, allowing refined iterations on the prompt or other input parameters to clearly see their effect, rather than completely random generations. The following images are generated using two slightly different prompts (“A portrait of a girl smiling” vs. “A portrait of a girl laughing”), while holding the seed value and all other parameters constant.

All preceding images in this post have been generated using the text-to-image (TEXT_IMAGE) task type of Amazon Nova Canvas, available through the Amazon Bedrock InvokeModel API. The following is the API request and response structure for image generation:

#Request Structure
{
    "taskType": "TEXT_IMAGE",
    "textToImageParams": {
        "text": string, #Positive prompt
        "negativeText": string #Negative prompt
    },
    "imageGenerationConfig": {
        "width": int, #Image resolution width
        "height": int, #Image resolution height
        "quality": "standard" | "premium", #Image quality
        "cfgScale": float, #Classifier-free guidance scale
        "seed": int, #Seed value
        "numberOfImages": int #Number of images to be generated (max 5)
    }
}
#Response Structure
{
    "images": string[], #List of Base64-encoded images
    "error": string
}

Code example
This solution can also be tested locally with a Python script or a Jupyter notebook. For this post, we use an Amazon SageMaker AI notebook running Python (v3.12). For more information, see Run example Amazon Bedrock API requests using an Amazon SageMaker AI notebook. For instructions to set up your SageMaker notebook instance, refer to Create an Amazon SageMaker notebook instance. Make sure the instance is set up in the same Region where Amazon Nova Canvas access is enabled. For this post, we create a region variable set to the Region where Amazon Nova Canvas is enabled (us-east-1); you must modify this variable if you've enabled the model in a different Region. The following code demonstrates text-to-image generation by invoking the Amazon Nova Canvas v1.0 model using Amazon Bedrock. To understand the API request and response structure for different types of generations, parameters, and more code examples, refer to Generating images with Amazon Nova.

import base64 #For encoding/decoding base64 data
import io #For handling byte streams
import json #For JSON processing
import boto3 #AWS SDK for Python
from PIL import Image #Python Imaging Library for image processing
from botocore.config import Config #For AWS client configuration

#Create a variable to fix the region to where Nova Canvas is enabled
region = "us-east-1"

#Set up an Amazon Bedrock runtime client
client = boto3.client(service_name='bedrock-runtime', region_name=region, config=Config(read_timeout=300))

#Set the content type and accept headers for the API call
accept = "application/json"
content_type = "application/json"

#Define the prompt for image generation
prompt = """A cat sitting on a chair, mountains in the background, low angle shot."""

#Create the request body with generation parameters
api_request = json.dumps({
    "taskType": "TEXT_IMAGE", #Specify text-to-image generation
    "textToImageParams": {
        "text": prompt
    },
    "imageGenerationConfig": {
        "numberOfImages": 1, #Generate one image
        "height": 720, #Image height in pixels
        "width": 1280, #Image width in pixels
        "cfgScale": 7.0, #CFG scale
        "seed": 0 #Seed number for generation
    }
})

#Call the Bedrock model to generate the image
response = client.invoke_model(body=api_request, modelId='amazon.nova-canvas-v1:0', accept=accept,
                               contentType=content_type)

#Parse the JSON response
response_json = json.loads(response.get("body").read())

#Extract the base64-encoded image from the response
base64_image = response_json.get("images")[0]
#Convert the base64 string to ASCII bytes
base64_bytes = base64_image.encode('ascii')
#Decode the base64 bytes to get the actual image bytes
image_data = base64.b64decode(base64_bytes)

#Convert bytes to an image object
output_image = Image.open(io.BytesIO(image_data))
#Display the image
output_image.show()
#Save the image to the current working directory
output_image.save('output_image.png')

Clean up
When you have finished testing this solution, clean up your resources to prevent AWS charges from being incurred:

Back up the Jupyter notebooks in the SageMaker notebook instance.
Shut down and delete the SageMaker notebook instance.

Cost considerations
Consider the following costs from the solution deployed on AWS:

You will incur charges for generative AI inference on Amazon Bedrock. For more details, refer to Amazon Bedrock pricing.
You will incur charges for your SageMaker notebook instance. For more details, refer to Amazon SageMaker pricing.

Conclusion
This post introduced you to AI image generation and provided an overview of accessing image models available on Amazon Bedrock. We then walked through the diffusion process and key parameters, with examples using Amazon Nova Canvas. The code template and examples demonstrated in this post aim to familiarize you with the basics of Amazon Nova Canvas and help you get started with your AI image generation use cases on Amazon Bedrock.
For more details on text-to-image generation and other capabilities of Amazon Nova Canvas, see Generating images with Amazon Nova. Give it a try and let us know your feedback in the comments.

About the Author
Arjun Singh is a Sr. Data Scientist at Amazon, experienced in artificial intelligence, machine learning, and business intelligence. He is a visual person and deeply curious about generative AI technologies in content creation. He collaborates with customers to build ML and AI solutions to achieve their desired outcomes. He graduated with a Master’s in Information Systems from the University of Cincinnati. Outside of work, he enjoys playing tennis, working out, and learning new skills.

This AI Paper Introduces WEB-SHEPHERD: A Process Reward Model for Web …

Web navigation focuses on teaching machines how to interact with websites to perform tasks such as searching for information, shopping, or booking services. Building a capable web navigation agent is a complex task because it requires understanding the structure of websites, interpreting user goals, and making a series of decisions across multiple steps. These tasks are further complicated by the need for agents to adapt in dynamic web environments, where content can change frequently and where multimodal information, such as text and images, must be understood together.

A key problem in web navigation is the absence of reliable and detailed reward models that can guide agents in real-time. Existing methods primarily rely on multimodal large language models (MLLMs) like GPT-4o and GPT-4o-mini as evaluators, which are expensive, slow, and often inaccurate, especially when handling long sequences of actions in multi-step tasks. These models use prompting-based evaluation or binary success/failure feedback but fail to provide step-level guidance, often leading to errors such as repeated actions or missing critical steps like clicking specific buttons or filling form fields. This limitation reduces the practicality of deploying web agents in real-world scenarios, where efficiency, accuracy, and cost-effectiveness are crucial.

The research team from Yonsei University and Carnegie Mellon University introduced WEB-SHEPHERD, a process reward model specifically designed for web navigation tasks. WEB-SHEPHERD is the first model to evaluate web navigation agents at the step level, using structured checklists to guide assessments. The researchers also developed the WEBPRM COLLECTION, a dataset of 40,000 step-level annotated web navigation tasks, and the WEBREWARDBENCH benchmark for evaluating PRMs. These resources were designed to enable WEB-SHEPHERD to provide detailed feedback by breaking down complex tasks into smaller, measurable subgoals.

WEB-SHEPHERD works by generating a checklist for each task based on the user’s instruction, such as “Search for product” or “Click on product page,” and evaluates the agent’s progress against these subgoals. The model uses next-token prediction to generate feedback and assigns rewards based on checklist completion. This process enables WEB-SHEPHERD to assess the correctness of each step with fine-grained judgment. The model estimates the reward for each step by combining the probabilities of “Yes,” “No,” and “In Progress” tokens and averages these across the checklist. This detailed scoring system enables agents to receive targeted feedback on their progress, enhancing their ability to navigate complex websites.
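To make that scoring concrete, the following is a simplified Python sketch (not the authors' released code) of a checklist-averaged step reward, assuming each checklist item yields probabilities for the "Yes," "In Progress," and "No" judgment tokens and that "In Progress" counts as partial credit:

def step_reward(checklist_probs):
    #checklist_probs: one dict of judgment-token probabilities per checklist item,
    #for example {"yes": 0.7, "in_progress": 0.2, "no": 0.1}
    scores = []
    for p in checklist_probs:
        #"Yes" counts fully, "In Progress" counts as partial credit, "No" counts as zero
        scores.append(p["yes"] + 0.5 * p["in_progress"])
    return sum(scores) / len(scores)

#Two checklist items: the first is likely satisfied, the second is likely not
print(step_reward([
    {"yes": 0.7, "in_progress": 0.2, "no": 0.1},
    {"yes": 0.1, "in_progress": 0.3, "no": 0.6},
]))  #0.525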

The researchers demonstrated that WEB-SHEPHERD significantly outperforms existing models. On the WEBREWARDBENCH benchmark, WEB-SHEPHERD achieved a Mean Reciprocal Rank (MRR) score of 87.6% and a trajectory accuracy of 55% in the text-only setting, compared to GPT-4o-mini’s 47.5% MRR and 0% trajectory accuracy without checklists. When tested in WebArena-lite using GPT-4o-mini as the policy model, WEB-SHEPHERD achieved a 34.55% success rate, which is 10.9 points higher than using GPT-4o-mini as the evaluator, while also being ten times more cost-efficient. In ablation studies, the researchers observed that WEB-SHEPHERD’s performance dropped significantly when checklists or feedback were removed, proving their importance for accurate reward assignments. They also showed that multimodal input, surprisingly, did not always improve performance and sometimes introduced noise.

This research highlights the critical role of detailed process-level rewards in building reliable web agents. The team’s work addresses the core challenge of web navigation—evaluating complex, multi-step actions—and offers a solution that is both scalable and cost-effective. With WEB-SHEPHERD, agents can now receive accurate feedback during navigation, enabling them to make better decisions and complete tasks more effectively.

Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.
The post This AI Paper Introduces WEB-SHEPHERD: A Process Reward Model for Web Agents with 40K Dataset and 10× Cost Efficiency appeared first on MarkTechPost.

National University of Singapore Researchers Introduce Dimple: A Discr …

In recent months, there has been growing interest in applying diffusion models—originally designed for continuous data, such as images—to natural language processing tasks. This has led to the development of Discrete Diffusion Language Models (DLMs), which treat text generation as a denoising process. Unlike traditional autoregressive models, DLMs enable parallel decoding and provide better control over structure, offering advantages such as flexible initialization of entire sequences, explicit control over output format, and improved infilling through bidirectional attention. Furthermore, their non-sequential nature opens the door to faster generation. Despite these benefits, most current multimodal large language models (MLLMs)—such as LLaMA, Qwen-VL, and InternVL—still rely solely on autoregressive methods.

Work in diffusion-based language models has explored both continuous and discrete diffusion spaces. Continuous approaches, such as DiffuSeq and SED, use embedding or relaxed categorical spaces for smoother generation. In contrast, discrete models like SDDM and RDM tailor the diffusion process to linguistic structures. Training techniques vary, but commonly use masked language modeling losses or entropy-based score matching. Some hybrid models, such as AR-Diffusion and SSD-LM, combine autoregressive and diffusion strategies to leverage the strengths of both approaches. Meanwhile, open-source MLLMs such as LLaVA and InternVL have advanced through visual instruction tuning and joint pretraining, yet still follow an autoregressive generation scheme. 

Researchers at the National University of Singapore present Dimple, the first Discrete DMLLM, which integrates a vision encoder with a discrete diffusion-based language model. To overcome the instability and performance issues of purely diffusion-based training, they introduce a two-phase training method—Autoregressive-then-Diffusion—combining initial autoregressive alignment with subsequent diffusion-based masked language modeling. Dimple-7B surpasses LLaVA-NEXT by 3.9% on benchmarks. The team also introduces Confident Decoding for dynamic token generation and explores Structure Priors for precise control over output. These innovations significantly improve inference efficiency, generation flexibility, and structural controllability without sacrificing performance. 

Dimple is a Discrete Diffusion Multimodal LLM that integrates a vision encoder with a diffusion-based language model. To address inefficiencies in diffusion training, such as sparse supervision and limited generation coverage, the model is trained in two phases: first with autoregressive training using a causal attention mask for vision-language alignment, then with diffusion training to restore generation capabilities. During inference, a dynamic “Confident Decoding” strategy adapts token updates based on prediction confidence. Despite using significantly fewer training samples, Dimple exhibits competitive performance on multiple benchmarks, outperforming similar-scale autoregressive models, although it trails behind larger-scale state-of-the-art systems. 
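The following is a simplified Python sketch (not the authors' implementation) of one confident-decoding step for a masked diffusion decoder: only masked positions whose top prediction clears a confidence threshold are committed, so the number of tokens finalized per step adapts to the model's confidence. The threshold value is an illustrative assumption.

import torch

def confident_decoding_step(logits, tokens, mask_id, threshold=0.9):
    #logits: (seq_len, vocab_size) model outputs for the current partially masked sequence
    #tokens: (seq_len,) current token ids, with mask_id at still-masked positions
    probs = torch.softmax(logits, dim=-1)
    confidence, prediction = probs.max(dim=-1)
    masked = tokens == mask_id
    commit = masked & (confidence >= threshold)
    #Always commit at least one masked position so decoding makes progress
    if masked.any() and not commit.any():
        commit[confidence.masked_fill(~masked, -1.0).argmax()] = True
    return torch.where(commit, prediction, tokens)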

The experiments evaluate Dimple, a DMLLM, against autoregressive models on instruction-following tasks. Dimple, trained with a hybrid strategy that combines autoregressive and diffusion tuning, exhibits strong performance, surpassing models with similar training data on most benchmarks. Although it lags behind models trained on much larger datasets, Dimple benefits from a stronger base language model. Ablation studies reveal that combining autoregressive and diffusion tuning mitigates issues like length bias and improves consistency. Prefilling further boosts inference speed significantly, with only minor performance drops, making the model both efficient and competitive in multimodal understanding tasks. 

In conclusion, Dimple, the first DMLLM, is designed to overcome the limitations of purely discrete diffusion training, such as instability and length bias. Dimple employs a hybrid training approach that starts with autoregressive learning, followed by diffusion tuning, yielding the Dimple-7B model, which outperforms LLaVA-NEXT by 3.9%. A decoding strategy, confident decoding, significantly reduces inference steps, while prefilling improves speed with minimal performance trade-offs. Dimple also enables structured and controllable outputs through structure priors, offering fine-grained control over format and length capabilities that autoregressive models struggle to provide. 

Check out the Paper, Model on Hugging Face, and GitHub Page. All credit for this research goes to the researchers of this project.
The post National University of Singapore Researchers Introduce Dimple: A Discrete Diffusion Multimodal Language Model for Efficient and Controllable Text Generation appeared first on MarkTechPost.

Incorrect Answers Improve Math Reasoning? Reinforcement Learning with …

In natural language processing (NLP), RL methods, such as reinforcement learning with human feedback (RLHF), have been utilized to enhance model outputs by optimizing responses based on feedback signals. A specific variant, reinforcement learning with verifiable rewards (RLVR), extends this approach by utilizing automatic signals, such as mathematical correctness or syntactic features, as feedback, enabling the large-scale tuning of language models. RLVR is especially interesting because it promises to enhance models’ reasoning abilities without needing extensive human supervision. This intersection of automated feedback and reasoning tasks forms an exciting area of research, where developers aim to uncover how models can learn to reason mathematically, logically, or structurally using limited supervision.

A persistent challenge in machine learning is building models that can reason effectively under minimal or imperfect supervision. In tasks like mathematical problem-solving, where the correct answer might not be immediately available, researchers grapple with how to guide a model’s learning. Models often learn from ground-truth data, but it’s impractical to label vast datasets with perfect accuracy, particularly in reasoning tasks that require understanding complex structures like proofs or programmatic steps. Consequently, there’s an open question about whether models can learn to reason if they are exposed to noisy, misleading, or even incorrect signals during training. This issue is significant because models that overly rely on perfect feedback may not generalize well when such supervision is unavailable, thereby limiting their utility in real-world scenarios.

Several existing techniques aim to enhance models’ reasoning abilities through reinforcement learning (RL), with RLVR being a key focus. Traditionally, RLVR has used “ground truth” labels, correct answers verified by humans or automated tools, to provide rewards during training. Some approaches have relaxed this requirement by using majority vote labels or simple format-based heuristics, such as rewarding answers that follow a specific output style. Other methods have experimented with random rewards, offering positive signals without considering the correctness of the answer. These methods aim to explore whether models can learn even with minimal guidance, but they mostly concentrate on specific models, such as Qwen, raising concerns about generalizability across different architectures.
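To ground the comparison that follows, here is a hedged Python sketch (not the paper's code) of what these reward variants might look like for a single response; the boxed-answer format check and the helper signatures are illustrative assumptions:

import random
import re

def ground_truth_reward(answer, gold):
    return 1.0 if answer.strip() == gold.strip() else 0.0

def majority_vote_reward(answer, sampled_answers):
    #Pseudo-label: the most common answer among the model's own samples
    pseudo_gold = max(set(sampled_answers), key=sampled_answers.count)
    return 1.0 if answer.strip() == pseudo_gold.strip() else 0.0

def format_reward(response):
    #Reward responses that present a final answer inside a \boxed{...} expression
    return 1.0 if re.search(r"\\boxed\{.+?\}", response) else 0.0

def random_reward(_response):
    return 1.0 if random.random() < 0.5 else 0.0

def incorrect_reward(answer, gold):
    #Deliberately reward only wrong answers
    return 1.0 if answer.strip() != gold.strip() else 0.0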

Researchers from the University of Washington, the Allen Institute for AI, and UC Berkeley investigate this question by testing various reward signals on Qwen2.5-Math, a family of large language models fine-tuned for mathematical reasoning. They tested ground-truth rewards, majority-vote rewards, format rewards based on boxed expressions, random rewards, and incorrect rewards. Remarkably, they observed that even completely spurious signals, like random rewards and rewards for wrong answers, could lead to substantial performance gains in Qwen models. For example, training Qwen2.5-Math-7B on MATH-500 with ground-truth rewards yielded a 28.8% improvement, while using incorrect labels resulted in a 24.6% gain. Random rewards still produced a 21.4% boost, and format rewards led to a 16.4% improvement. Majority-vote rewards provided a 26.5% accuracy gain. These improvements were not limited to a single model; Qwen2.5-Math-1.5B also showed strong gains: format rewards boosted accuracy by 17.6%, and incorrect labels by 24.4%. However, the same reward strategies failed to deliver similar benefits on other model families, such as Llama3 and OLMo2, which showed minimal or negative changes when trained with spurious rewards. For instance, Llama3.1-8B saw performance drops of up to 8.5% under certain spurious signals, highlighting the model-specific nature of the observed improvements.

The research team’s approach involved using RLVR training to fine-tune models with these varied reward signals, replacing the need for ground-truth supervision with heuristic or randomized feedback. They found that Qwen models, even without access to correct answers, could still learn to produce high-quality reasoning outputs. A key insight was that Qwen models tended to exhibit a distinct behavior called “code reasoning”, generating math solutions structured like code, particularly in Python-like formats, regardless of whether the reward signal was meaningful. This code reasoning tendency became more frequent over training, rising from 66.7% to over 90% in Qwen2.5-Math-7B when trained with spurious rewards. Answers that included code reasoning showed higher accuracy rates, often around 64%, compared to just 29% for answers without such reasoning patterns. These patterns emerged consistently, suggesting that spurious rewards may unlock latent capabilities learned during pretraining rather than introducing new reasoning skills.

Performance data underscored the surprising robustness of Qwen models. Gains from random rewards (21.4% on MATH-500) and incorrect labels (24.6%) nearly matched the ground-truth reward gain of 28.8%. Similar trends appeared across tasks, such as AMC, where format, wrong, and random rewards produced around an 18% improvement, only slightly lower than the 25% improvement from ground-truth or majority-vote rewards. Even on AIME2024, spurious rewards like format (+13.0%), incorrect (+8.7%), and random (+6.3%) led to meaningful gains, though the advantage of ground-truth labels (+12.8%) remained evident, particularly for AIME2025 questions created after model pretraining cutoffs.

Several key takeaways from the research include:

Qwen2.5-Math-7B gained 28.8% accuracy on MATH-500 with ground-truth rewards, but also 24.6% with incorrect rewards, 21.4% with random rewards, 16.4% with format rewards, and 26.5% with majority-vote rewards.

Code reasoning patterns emerged in Qwen models, increasing from 66.7% to 90%+ under RLVR, which boosted accuracy from 29% to 64%.

Non-Qwen models, such as Llama3 and OLMo2, did not show similar improvements, with Llama3.1-8B experiencing up to 8.5% performance drops on spurious rewards.

Gains from spurious signals appeared within 50 training steps in many cases, suggesting rapid elicitation of reasoning abilities.

The research warns that RLVR studies should avoid generalizing results based on Qwen models alone, as spurious reward effectiveness is not universal.

In conclusion, these findings suggest that while Qwen models can leverage spurious signals to improve performance, the same is not true for other model families. Non-Qwen models, such as Llama3 and OLMo2, showed flat or negative performance changes when trained with spurious signals. The research emphasizes the importance of validating RLVR methods on diverse models rather than relying solely on Qwen-centric results, as many recent papers have done.

Check out the Paper, Official Release, and GitHub Page. All credit for this research goes to the researchers of this project.
The post Incorrect Answers Improve Math Reasoning? Reinforcement Learning with Verifiable Rewards (RLVR) Surprises with Qwen2.5-Math appeared first on MarkTechPost.

Part 3: Building an AI-powered assistant for investment research with …

In the financial services industry, analysts need to switch between structured data (such as time-series pricing information), unstructured text (such as SEC filings and analyst reports), and audio/visual content (earnings calls and presentations). Each format requires different analytical approaches and specialized tools, creating workflow inefficiencies. Add on top of this the intense time pressure resulting from rapidly evolving industry conditions and narrow decision windows. Delayed analysis can mean missed opportunities or failure to identify emerging risks, with potentially significant financial consequences.
AI assistants boost productivity by automating routine data collection and processing tasks, surfacing relevant insights, and allowing analysts to focus on higher-value activities. However, a single AI agent struggles to handle the full spectrum of specialized tasks in complex, multistep investment research workflows. This is where multi-agent collaboration provides an advancement. By creating specialized AI subagents that excel at specific tasks and using them together under a coordinated framework with an AI supervisor agent, we can address the full complexity of investment research workflows. A supervisor agent can intelligently decompose complex queries, delegate specialized tasks to subagents, and synthesize their outputs into comprehensive answers—similar to how a research team functions in the real world. The benefits of this approach are substantial: distributed problem-solving and specialization, improved accuracy through specialized expertise, enhanced scalability by adding new agent capabilities without rebuilding the entire system, and greater transparency because the reasoning process of each specialized agent can be tracked and verified.
Amazon Bedrock Agents uses the reasoning of foundation models (FMs), APIs, and data to break down user requests, gather relevant information, and efficiently complete tasks. With the multi-agent collaboration capability of Amazon Bedrock, you can build, deploy, and manage multiple AI agents working together on complex multistep tasks that require specialized skills. Amazon Bedrock Data Automation (BDA) enables the generation of useful insights from unstructured multimodal content such as documents, images, audio, and video for your AI-powered applications, and it can be used as a parser when setting up a knowledge base for Retrieval Augmented Generation (RAG) workflows.
In this post, we walk through how to build a multi-agent investment research assistant using the multi-agent collaboration capability of Amazon Bedrock. Our solution demonstrates how a team of specialized AI agents can work together to analyze financial news, evaluate stock performance, optimize portfolio allocations, and deliver comprehensive investment insights—all orchestrated through a unified, natural language interface. We use BDA as a parser when setting up a knowledge base for RAG workflows with multi-modal data. Amazon Nova understanding models are used as the large language models (LLMs) associated with the supervisor agents and subagents.
Check out the following video, which demonstrates the solution.

Results from these agents are demonstrative, not to be taken as financial advice.
Solution overview
The multi-agent investment research assistant is composed of a supervisor agent and three subagents (quantitative analysis agent, news agent, and smart summarizer agent) working together in a coordinated framework, as shown in the following diagram.

The supervisor agent uses the instructions you provide to understand the structure and role of each collaborator agent. The supervisor agent orchestrates the overall investment research process by breaking down user prompts, delegating subtasks to specialized subagents, and consolidating their outputs to generate the final response. Each subagent is configured to carry out specialized tasks as follows:

Quantitative analysis agent

Functions – Powered by stock-query and portfolio-optimization action groups.
Responsibilities – Query and analyze historical stock data and build optimized portfolio allocations based on user inputs such as stock tickers or investment amount.

News agent

Functions – Powered by the web_search action group and the financial_analysis_KB knowledge base.
Responsibilities – Search and retrieve relevant financial data such as earnings reports and filings from the knowledge base for context. If information isn’t present in the knowledge base, construct a web query. Handles queries related to analyzing company financials, performance drivers, and management commentary.

Smart summarizer agent

Functions – Uses LLM capabilities with specific prompt instructions, no custom action group.
Responsibilities – Takes in output from other subagents, such as recent news and financial data, and synthesizes the information into structured investment insights. Summarizes long texts or documents such as earnings reports into concise insights.

Multi-agent collaboration works as follows:

User prompts the multi-agent investment research assistant with a high-level research query.
Supervisor agent breaks the query into subtasks and orchestrates and invokes the relevant subagents in parallel or sequentially as needed:

News agent to retrieve financial reports or data
Quantitative analysis agent for stock data analysis
Smart summarizer agent to summarize lengthy texts

Supervisor agent consolidates the subagent outputs along with the original query context.
Supervisor agent passes the consolidated information to the LLM to generate the final research insights.

This architecture uses the strengths of the different specialized agents to comprehensively address the various aspects of investment research, and the supervisor agent provides the intelligent orchestration.
Technical architecture
The following diagram is the technical architecture.

Some of the key components of the technical architecture and their associated code are:

Amazon Bedrock Data Automation (BDA) for processing data – The following invokes a BDA job asynchronously using the invoke_data_automation_async() method from the bda-runtime client in boto3:

bda_response = bda_runtime_client.invoke_data_automation_async(
    inputConfiguration={
        "s3Uri": s3_uri_input
    },
    outputConfiguration={
        "s3Uri": s3_uri_output
    },
)

Defining and building subagents with tools – The following creates a subagent using Agent.create. It defines two Lambda tools—one for fetching historical stock data and another for running portfolio optimization—and registers them with the subagent along with its role, goal, instructions, and the underlying FM (only the stock data lookup tool is shown in the following snippet).

quantitative_analysis_agent = Agent.create(
    name="quantitative_analysis_agent",
    role="Financial Data Collector",
    goal="Retrieve real-time and historic stock prices as well as optimizing a portfolio given tickers.",
    instructions="""Specialist in real-time financial data extraction and portfolio optimization.
                        …""",
    tools=[
        # Stock Data Lookup Tool
        {
            "code": stock_data_lookup_arn,
            "definition": {
                "name": "stock_data_lookup",
                "description": "Gets the 1-month stock price history for a given stock ticker, formatted as JSON.",
                "parameters": {
                    "ticker": {"description": "The ticker to retrieve price history for", "type": "string", "required": True}
                },
            },
        },
    ],
    llm=LLM,
)

Defining and building the supervisor agent (multi-agent investment research assistant) – The following creates a supervisor agent using SupervisorAgent.create. It defines the collaborator agents (subagents), and registers them with the supervisor agent along with its role, goal, instructions, and the underlying FM.

investment_research_assistant = SupervisorAgent.create(
    "investment_research_assistant",
    role="Investment Research Assistant",
    goal="A seasoned investment research expert responsible for orchestrating subagents …",
    collaboration_type="SUPERVISOR",
    instructions=f"""You are an Investment Research Assistant, a financial research supervisor overseeing multiple specialized agents. Your goal is to coordinate and synthesize their outputs to create a structured stock investment analysis.
                …""",
    collaborator_agents=[
        {
            "agent": "news_agent",
            "instructions": f"Always check the knowledge base (ID: {kb_id}) first. Use this collaborator for finding news and analyzing specific documents."
        },
        {
            "agent": "quantitative_analysis_agent",
            "instructions": "Use this collaborator for retrieving stock price history and performing portfolio optimization."
        },
        {
            "agent": "smart_summarizer_agent",
            "instructions": "Use this collaborator for synthesizing stock trends, financial data, and generating structured investment insights."
        }
    ],
    collaborator_objects=[news_agent, quantitative_analysis_agent, smart_summarizer_agent],
    llm=LLM,
)

Invoking the supervisor agent (multi-agent investment research assistant) – The following uses the invoke method to pass in the defined request as a prompt and get response from the supervisor agent, using the multi-agent collaboration architecture.

request = """What's AMZN stock price doing over the last week
               and relate that to recent news"""
print(f"Request:\n{request}\n")
result = investment_research_assistant.invoke(
    request,
    enable_trace=True,
    trace_level="core",
)
print(f"Final answer:\n{result}")

Prerequisites
To deploy this solution, attach the appropriate permissions for Amazon Bedrock, AWS Lambda, Amazon OpenSearch Serverless, Amazon Simple Storage Service (Amazon S3), and AWS Identity and Access Management (IAM) to your execution role. You also need to deploy two AWS CloudFormation stacks: web_search and stock_data.
For more information on the permission policy, CloudFormation stacks, and setting up the environment, refer to the README.md in the investment_research_agent repository on GitHub.
Dive deeper into the solution
To dive deeper into the solution and the code shown in this post, refer to the investment research assistant agent GitHub repo. The repo contains instructions for the end-to-end solution, including setting up the supervisor agent and subagents, associated action groups, unstructured data (earnings reports PDF files and earnings call audio files), and structured data (stocks time series).
In the appendix at the end of this post, we share different questions asked by a financial analyst to the supervisor agent, the subagent invoked, and the answers from the supervisor agent.
Cleanup
To clean up this solution, make sure to delete all agents, knowledge bases, and associated Amazon OpenSearch Serverless instances. Delete both the web_search and stock_data stacks. This will make sure that no charges are incurred after you’re finished testing this solution. Read the associated notebook in the GitHub repository for more instructions on cleaning up the agents.
Conclusion
The multi-agent investment research assistant built using Amazon Bedrock Agents with its multi-agent collaboration feature represents how financial professionals can use AI to enhance their analytical capabilities. By using specialized agents—each focusing on distinct aspects of financial analysis—orchestrated by a supervisor agent, we’ve created a system that can process diverse data types, apply appropriate analytical techniques, and synthesize comprehensive insights in response to complex investment queries. The result is an AI assistant that can handle multistep financial analysis workflows that would otherwise overwhelm a single-agent system.
The application of multi-agent collaboration extends broadly across the financial sector, from equity research and portfolio management to risk assessment and compliance monitoring. A complex financial workflow that requires multiple specialized skills and data sources can benefit from this architectural approach.
The implementation uses key AWS services, including Amazon Bedrock for foundation models, the multi-agent collaboration capability of Amazon Bedrock Agents, Amazon S3 for document storage, Amazon OpenSearch Serverless for vector search capabilities, and Lambda functions for specialized actions. This serverless architecture provides scalability, and the knowledge base integration provides context from financial documents to ground the AI’s responses in factual data.
For more information:

Get started building your own multi-agent investment research assistant by accessing our complete solution code and implementation guide in the GitHub repo.
Explore how to customize the solution for your specific investment research needs by modifying the subagents’ specialized tools and action groups and by bringing your data securely to Amazon Bedrock Knowledge Bases.
You can also explore and run Amazon Bedrock multi-agent collaboration workshop with AWS specialists or on your own.

Read more about single agent solutions for investment research with multi-modal data:

Part 1 – Generative AI and multi-modal agents in AWS: The key to unlocking new value in financial markets, which demonstrates how multi-modal agents use AWS services to efficiently gather financial insights from an organization’s text, speech, and tabular data sources for a financial analyst’s query using open source LangChain agents as alternative implementation.
Part 2 – AI-powered assistants for investment research with multi-modal data: An application of Amazon Bedrock Agents, which introduces a solution using Amazon Bedrock Agents and Amazon Bedrock Knowledge Bases that can help financial analysts use various multi-modal financial data to gather financial insights.

Read the blog post on Unlocking complex problem-solving with multi-agent collaboration on Amazon Bedrock for more details on benefits, design, and evaluation of multi-agent collaboration for enterprise applications.

About the Authors
Sovik Kumar Nath is an AI/ML and generative AI senior solutions architect with AWS. He has extensive experience designing end-to-end machine learning and business analytics solutions in finance, operations, marketing, healthcare, supply chain management, and IoT. He has dual master's degrees from the University of South Florida and the University of Fribourg, Switzerland, and a bachelor's degree from the Indian Institute of Technology, Kharagpur. Outside of work, Sovik enjoys traveling, taking ferry rides, and watching movies.
Lucas Banerji is an AI/ML and GenAI specialist Solutions Architect at AWS. He is passionate about building agentic AI systems and exploring the frontier of what’s possible with intelligent automation. Lucas holds a degree in Computer Science from the University of Virginia. Outside of work, he enjoys running, practicing Muay Thai, and traveling the world.
Mohan Musti is a Principal Technical Account Manager based out of Dallas. Mohan helps customers architect and optimize applications on AWS. Mohan has a degree in Computer Science and Engineering from JNT University, India. In his spare time, he enjoys spending time with his family and camping.

Appendix
The following examples show questions asked by a financial analyst and the responses from the AI-powered assistant for investment research, based on structured and unstructured data.

Each example lists, in order: the question purpose, the question asked by the financial analyst to the AI-powered assistant, the subagent, action group, and/or knowledge base invoked, the intermediate steps from the supervisor agent's chain-of-thought reasoning, and the response from the multi-agent investment research assistant.

Prompt the agent to analyze the company’s financial health and performance.
Analyze Amazon’s financial health based on the 2024 10k report. Calculate important financial ratios. Limit to 5 sentences.
news_agent (financial_analysis_kb), smart_summarizer_agent
1) I need to specifically request the 2024 10-K report information from the knowledge base. Let me ask the news agent for this specific report. 2) Since the knowledge base contains recent financial information and forward-looking guidance, I’ll provide a comprehensive summary of Amazon’s financial metrics and projections. 3) Now I’ll ask the analyst agent to analyze this information and calculate key financial ratios.
Based on Amazon’s recent financial data, the company demonstrates strong financial health with total assets of $624.9B and a robust cash position of $78.8B, while generating $143.3B in net sales and $59.2B in net income. Key profitability metrics show improvement with a 46% gross margin, 4.4% operating margin, and 14.8% return on equity, while efficiency ratios remain healthy with inventory turnover at 13.2x and asset turnover at 1.4x. Liquidity ratios (current ratio: 1.06, quick ratio: 0.88) are adequate though tight, but are supported by strong operating cash flow of $54.3B, representing a 38% year-over-year increase. The company maintains a balanced leverage profile with a debt-to-equity ratio of 0.98 and solid interest coverage ratio of 8.5x, indicating good debt management. Despite ongoing high capital expenditures, Amazon’s improving margins, strong cash generation, and efficient operations suggest a healthy financial position with sustainable growth potential.

Prompt the agent to analyze the company’s financial performance from company’s earnings reports
What were the main drivers of AWS’s revenue growth in the Q1 2023 quarter? Please limit your answers to 5 sentences.
news_agent (financial_analysis_kb), smart_summarizer_agent
1.) I need to get information about AWS’s Q1 2023 earnings report. I’ll ask the news agent to check the knowledge base for this specific information. 2.) Based on the search results, I can provide a comprehensive answer about AWS’s Q1 2023 performance and growth drivers. The information comes directly from their earnings reports. 3.) I’ll synthesize this information into a concise response focusing on the main drivers and send it to the user.
AWS’s Q1 2023 revenue growth was primarily driven by continued cloud adoption, achieving net sales of $21.4 billion and a 16% year-over-year growth rate. However, the growth was tempered by customers across all segments optimizing their cloud spending in response to economic uncertainty. The business maintained strong fundamentals with an annualized sales run rate exceeding $85 billion, though momentum showed signs of slowing with April growth rates declining by approximately 500 basis points compared to Q1.

Prompt the agent to analyze the company’s management team and their plans for the future from company’s earnings reports
What are Amazon’s top priorities for 2024? What are the biggest risks facing the company? Please limit your answers to 5 sentences.
news_agent (financial_analysis_kb & web_search), smart_summarizer_agent
1.) I’ll ask the news agent to check the knowledge base for recent information about Amazon’s strategic priorities and risk factors for 2024. 2.) Let me get additional recent news and risk factors by asking the news agent to search for supplementary information. 3.) Based on the recent news searches, I can now provide a comprehensive overview of the significant risks currently facing Amazon
Amazon’s top priorities for 2024 include expanding international stores, growing grocery and B2B retail segments, scaling the “Buy with Prime” program, advancing healthcare initiatives, and developing low Earth orbit satellite capabilities for improved connectivity. The company faces several significant risks, including substantial AI-related capital expenditure pressures (expected $75 billion in 2024), increasing regulatory scrutiny from the FTC and competition authorities, potential impacts from new Chinese import tariffs, and intensifying competition in both cloud computing and retail sectors. Amazon is balancing these ambitious growth initiatives with cost optimization efforts, while viewing their strategic investments as potentially transformative for their long-term future.

Prompt the agent to build an optimal portfolio using the collected data
Optimize my portfolio with AAA, BBB, and CCC [anonymized tickers]
quantitative_analysis_agent (stock_data_lookup & portfolio tool)
1.) I need to get historical stock data for these three stocks before running portfolio optimization. I’ll ask the stock data agent for this information. 2.) Now that I have retrieved the historical price data for all three stocks, I’ll perform portfolio optimization using this data. 3.) I’ll synthesize this information into a clear, comprehensive response for the user.
Based on portfolio optimization analysis, here is the recommended allocation: Company A (AAA): 61.04% Company B (BBB): 32.54% Company C (CCC): 6.42% This allocation optimizes risk-adjusted returns based on recent market data. Please note that these allocations are based on mathematical optimization and historical data. Consider your personal investment goals, risk tolerance, and market conditions when making investment decisions.

Prompt the agent to look up news and relate it to stock data
What was happening to AMZN stock in November 2024, relate that to news in the same timeframe
news_agent (web_search), quantitative_analysis_agent (stock_data_lookup), & smart_summarizer_agent
1.) I’ll need to get both recent stock price data and news, then synthesize them together. I’ll request both simultaneously to be efficient. 2.) I’ll now synthesize the stock price movement with the recent news developments to provide a comprehensive analysis.
November 2024: Stock Price Movement: – Trading Range: $161.43 (low) to $206.21 (high) – Overall Trend: Significant upward momentum with notable volatility – Key Technical Levels: Resistance at $206, Support at $190. Key News Impact: 1. Q4 2024 Guidance (Released Oct 31): – Projected net sales: $181.5-188.5 billion – Expected growth: 7-11% year-over-year – This guidance served as a major catalyst for the stock’s movement Price-News Correlation: – The strong upward movement (~27% range) appears to be primarily driven by positive Q4 guidance – Investor confidence was bolstered by the projected sales growth – The wide guidance range reflected optimism around the holiday shopping season – Higher trading volumes were observed during price movements, indicating strong market participation The stock’s performance suggests that investors responded positively to Amazon’s forward-looking guidance, particularly regarding holiday season expectations. The volatility in price movements indicates active trading around these news events, with the market generally maintaining an optimistic outlook on Amazon’s near-term prospects.

A generative AI prototype with Amazon Bedrock transforms life sciences …

It takes biopharma companies over 10 years, at a cost of over $2 billion and with a failure rate of over 90%, to deliver a new drug to patients. The Market to Molecule (M2M) value stream process, which biopharma companies must apply to bring new drugs to patients, is resource-intensive, lengthy, and highly risky. Nine out of ten biopharma companies are AWS customers, and helping them streamline and transform the M2M processes can help deliver drugs to patients faster, reduce risk, and bring value to our customers.
Pharmaceutical companies are taking a new approach to drug discovery, looking for variants in the human genome and linking them to diseases. This genetic validation approach can improve the success ratio in the M2M value stream process by focusing on the root cause of disease and the gene variants.
As depicted in the following M2M value stream diagram, the Research process (and the Basic Research sub-process) is critical to downstream processes where linking the gene variant to a disease occurs, and is instrumental in defining the target molecule. This can be a critical step in expediting and reducing the cost of delivering a new drug to patients.

To transform the M2M value stream process, our customer has been working on associating genes with diseases by using their large dataset of over 2 million sequenced exomes (genes that are expressed into proteins). To accomplish this, the customer’s clinical scientists have to develop methods to navigate through the enormous dataset by using online genome browsers, a mechanical data-first experience that doesn’t fully meet the needs of users. Starting with a search query to get results, the typical interactions of navigating levels, filtering, waiting, and repeating the search can be time-consuming and tedious. Simplifying the UI from the traditional human browser to a conversational AI assistant can enhance the user experience in the clinical research process.
Generative AI is a promising next step in leading this change. As generative AI began making a significant impact in healthcare and life sciences, this use case was primed for experimentation. In collaboration with the customer, AWS built a generative AI-enabled gene assistant that lets scientists pose a question, or a series of questions, giving them more flexibility and agility in exploring the genome and aiming to save researchers countless hours of work. Identifying variants and their potential correlation with diseases can be done more efficiently using words rather than filters, settings, and buttons. A more streamlined research process can help increase the likelihood of new breakthroughs.
This post explores deploying a text-to-SQL pipeline using generative AI models and Amazon Bedrock to ask natural language questions to a genomics database. We demonstrate how to implement an AI assistant web interface with AWS Amplify and explain the prompt engineering strategies adopted to generate the SQL queries. Finally, we present instructions to deploy the service in your own AWS account. Amazon Bedrock is a fully managed service that provides access to large language models (LLMs) and other foundation models (FMs) from leading AI companies through a single API, allowing you to use it instantly without much effort, saving developers valuable time. We used the AWS HealthOmics variant stores to store the Variant Call Format (VCF) files with omics data. A VCF file is typically the output of a bioinformatics pipeline. VCFs encode Single Nucleotide Polymorphisms (SNPs) and other structural genetic variants. The format is further described on the 1000 Genomes project website. We used the AWS HealthOmics – End to End workshop to deploy the variants and annotation stores.
Although this post focuses on a text-to-SQL approach to an omics database, the generative AI approaches discussed here can be applied to a variety of complex schemas of relational databases.
Text-to-SQL for genomics data
Text-to-SQL is a task in natural language processing (NLP) to automatically convert natural language text into SQL queries. This involves translating the written text into a structured format and using it to generate an accurate SQL query that can run on a database. The task is difficult because there are big differences between human language, which is flexible, ambiguous, and dependent on context, and SQL, which is structured.
Before LLMs for text-to-SQL, user queries had to be preprocessed to match specific templates, which were then used to rephrase the queries. This approach was use case-specific and required data preparation and manual work. Now, with LLMs, the text-to-SQL task has undergone a major transformation. LLMs continue to showcase key performance improvements in generating valid SQL queries from natural language queries. Relying on pre-trained models trained on massive datasets, LLMs can identify the relationships between words in language and accurately predict the next ones to be used.
However, although LLMs have remarkable performance in many text-to-SQL problems, they have limitations that lead to hallucinations. This post describes the main approaches used to overcome these limitations.
There are two key strategies to achieve high accuracy in text-to-SQL services:

Prompt engineering – The prompt is structured to annotate different components, such as pointing to columns and schemas, and then instructing the model on which type of SQL to create. These annotations act as instructions that guide the model in formatting the SQL output correctly. For example, a prompt might contain annotations showing specific table columns and guiding the model to generate a SQL query. This approach allows for more control over the model’s output by explicitly specifying the desired structure and format of the SQL query.
Fine-tuning – You can start with a pre-trained model on a large general text corpus and then proceed with an instruction-based fine-tuning with labeled examples to improve the model’s performance on text-to-SQL tasks. This process adapts the model to the target task by directly training it on the end task, but it requires a substantial number of text-SQL examples.

This post focuses on the prompt engineering strategy for SQL generation. AWS customers deploy prompt engineering strategies first because they’re efficient in returning high-quality results and require a less complex infrastructure and process. For more details and best practices on when to follow each approach, refer to Best practices to build generative AI applications on AWS.
We experimented with prompt engineering using chain-of-thought and tree-of-thought approaches to improve reasoning and SQL generation capabilities. Chain-of-thought prompting guides the LLM to break a problem into a series of intermediate reasoning steps and to articulate its thought process before producing a definitive answer; the tree-of-thought approach extends this with a branching structure, where each node represents a sub-task, sub-question, or intermediate step in the overall problem-solving process. We describe how we applied both techniques later in this post.
Solution overview
The following architecture depicts the solution and AWS services we used to accomplish the prototype.

The workflow consists of the following steps:

A scientist submits a natural language question or request to a chat web application connected through Amplify and integrated with an AWS AppSync GraphQL API.
The request is submitted to Amazon API Gateway, which forwards it to an AWS Lambda function that contains the text-to-SQL implementation. We recommend implementing a second helper Lambda function to fetch variant data, gene names, or ClinVar-listed diseases, to simplify the user experience and facilitate SQL generation.
The text-to-SQL Lambda function receives the natural language request, merges the input question with the prompt template, and submits it to Amazon Bedrock to generate the SQL (a minimal sketch of this step follows the list).

Our implementation also adds a step to simplify the incoming history into a single request. We submit a request to Amazon Bedrock to transform the historical inputs from that user session into a simplified natural language request. This step is optional.

With the generated SQL, the Lambda function submits the query to Amazon Athena to retrieve the genomic data from the Amazon Simple Storage Service (Amazon S3) bucket.

If successful, the Lambda function updates the user session stored in Amazon DynamoDB through an AWS AppSync request. That change will automatically appear on the UI that is subscribed to changes to the session table.
If an error occurs, the code attempts to re-generate the SQL query, passing the returned error as input and requesting it to fix the error. The Lambda function then reruns the re-generated SQL against Athena and returns the result.
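The following is a minimal sketch of the text-to-SQL generation and Athena submission steps described in the preceding list. The model ID, prompt template, and S3 output location are illustrative assumptions rather than the customer's actual configuration; the <SQL_QUERY> tag convention matches the prompt template shown later in this post:

import json
import boto3

bedrock = boto3.client("bedrock-runtime")
athena = boto3.client("athena")

PROMPT_TEMPLATE = """You are an expert in writing Athena SQL for genomics data. Using the table and
column descriptions below, write a SQL query that answers the question.
Return the query inside <SQL_QUERY> tags.

{schema_description}

Question: {question}"""

def generate_sql(question, schema_description):
    #Merge the user question with the prompt template and ask the model for SQL
    body = json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 1024,
        "messages": [{"role": "user",
                      "content": PROMPT_TEMPLATE.format(schema_description=schema_description,
                                                        question=question)}],
    })
    response = bedrock.invoke_model(modelId="anthropic.claude-3-sonnet-20240229-v1:0", body=body)
    text = json.loads(response["body"].read())["content"][0]["text"]
    #Extract the SQL between the tags requested in the prompt
    return text.split("<SQL_QUERY>")[1].split("</SQL_QUERY>")[0].strip()

def run_query(sql):
    #Submit the generated SQL to Athena; the output bucket is a placeholder
    result = athena.start_query_execution(
        QueryString=sql,
        ResultConfiguration={"OutputLocation": "s3://example-results-bucket/athena/"},
    )
    return result["QueryExecutionId"]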

Generative AI approaches to text-to-SQL
We tested the following prompt-engineering strategies:

LLM SQL agents
LLM with Retrieval Augmented Generation (RAG) to detect tables and columns of interest
Prompt engineering with full description of tables and columns of interest
Prompt engineering with chain-of-thought and tree-of-thought approaches
Prompt engineering with a dynamic few-shot approach

We didn’t achieve good results with SQL agents. We experimented with LangChain SQL agents. It was difficult for the agent to use contextual information from the dataset to generate accurate and syntactically correct SQL. A big challenge in omics data is that certain columns are arrays of structs or maps. At the time of building this project, the agents were incapable of detecting these nuances and failed to generate relevant SQL.
We also experimented with a RAG approach that retrieves the tables and columns relevant to a user question and then prompts the LLM to generate a SQL query using only those tables and columns. A motivation behind this experiment is that RAG can scale to hundreds or thousands of columns or tables. However, this approach didn't return good results either: it retrieved too many irrelevant columns for each SQL generation.
The next three approaches were successful, and we used them in combination to get the highest accuracy on syntactically correct SQL generation.
The first prompt idea we tested was to provide a full description of the main tables and columns to be used in SQL generation for a given user question. The following example shows a snapshot of the prompts used to describe the 1000 Genomes variants tables. The goal of describing database tables and columns in the prompt is to teach the LLM how to use the schema to generate queries. We approached it as if teaching a new developer who will write queries against that database: giving examples of SQL queries that extract the correct dataset, showing how to filter the data, and using only the most relevant columns.

<table>
       <table_name>
       variants
       </table_name>
       <table_description>
       This table contains information about genetic variants.
       </table_description>
       <column>
              <column_name>contigname</column_name>
              <column_description>
This column specifies the name of the contig (a contiguous sequence of DNA) or chromosome where the variant is located. It is typically prefixed with "chr". If the user asks for variants at chromosome 22, use `chr22` to access variants in this table.
              </column_description>
              <example_use>
                       select *
                       from variants
                       where contigname = 'chr22'
                      and start between 45509414 and 45509418;
              </example_use>
       </column>
       <column>
              <column_name>start</column_name>
              <column_description>
                      The start position of the variant on the chromosome. This should
                      be used to compose the primary key of the variant, along with the
                       following columns: `contigname`, `end`, `referenceallele`, `alternatealleles`.
              </column_description>
              <example_use>
                      SELECT * FROM variants WHERE start > 100000 and end < 200000;
              </example_use>
       </column>
</table>

The team also created a prompt that applied the concept of chain-of-thought and its evolution, tree-of-thought, to improve the model's reasoning and SQL generation capabilities.
The chain-of-thought prompting technique encourages LLMs to break down a problem into a series of intermediate steps, explicitly expressing their thought process before arriving at a definitive answer or output. This approach takes inspiration from the way humans often break down problems into smaller, manageable parts.
Through the use of prompts, we compelled the LLM to generate a chain of thought, articulating its reasoning process and producing more accurate and understandable outputs. This technique can improve performance on tasks that require multi-step reasoning, such as SQL generation from open-ended natural language questions, and it produced excellent results with the FM that we tested.
As a next step in our experimentation, we used the tree-of-thought technique to generate even better results than the chain-of-thought approach. The tree-of-thought approach introduces a more structured and branching approach to the reasoning process. Instead of a linear chain, we prompt the LLM to generate a tree-like structure, where each node represents a sub-task, sub-question, or intermediate step in the overall problem-solving process. The following example presents how we used these two approaches in the prompt template:

Imagine three different experts are answering this question. All experts will write down 1 step
of their thinking, then share it with the group. Then all experts will go on to the next step, etc.
If any expert realises they’re wrong at any point then they leave. Each of the three experts should
explain their thinking along with the generated SQL statement. Your final step is to review the
generated SQL code for syntax errors. Pay close attention to any use of the UNNEST function – it
MUST be immediately followed by ‘AS t(unpacked)’ rather than ‘AS t’ . If you find a syntax error
with the generated SQL, produce a corrected version within <SQL_FIXED> tags. Only produce
the <SQL_FIXED> code if you find a syntax problem in the <SQL_QUERY> tags.

Finally, we tested a few-shot and a dynamic few-shot approach. Few-shot prompting provides the LLM with a few examples or demonstrations, along with the input prompt, to guide the model’s generation or output. In the few-shot setting, the prompt comprises the following:

An instruction or task description
A few examples or demonstrations of the desired output, given a specific input
The new input for which the LLM will generate an output

By exposing the LLM to these examples, the model better recognizes patterns and infers the underlying rules or mappings between the input and the desired output.
The dynamic few-shot approach extends the few-shot prompting technique by dynamically generating or selecting the examples used in the prompt based on the specific input or context. In this approach, instead of providing a fixed set of examples, the prompt generation process involves the following steps (see the sketch after this list):

Analyzing the input or context
Creating embeddings of the examples and of the input, and retrieving or generating relevant examples or demonstrations tailored to the specific input by applying a semantic search
Constructing the prompt with the selected examples and the input
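The following is a minimal sketch of the example-selection step, assuming a small in-memory example pool and Amazon Titan Text Embeddings on Amazon Bedrock; in practice, the example embeddings would be precomputed and stored in a vector store rather than recomputed per request.

import json
import boto3
import numpy as np

bedrock = boto3.client("bedrock-runtime")

# Hypothetical example pool; in practice these are curated question/SQL pairs.
EXAMPLES = [
    {"question": "List variants on chromosome 22 between 45509414 and 45509418",
     "sql": "select * from variants where contigname = 'chr22' and start between 45509414 and 45509418;"},
    {"question": "Count variants per sample",
     "sql": "select sampleid, count(*) from variants group by sampleid;"},
]

def embed(text):
    # Titan Text Embeddings is one option; any embedding model works here.
    response = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": text}),
    )
    return np.array(json.loads(response["body"].read())["embedding"])

def select_examples(question, k=2):
    # Rank stored examples by cosine similarity to the incoming question.
    q = embed(question)
    scored = []
    for ex in EXAMPLES:
        e = embed(ex["question"])
        score = float(np.dot(q, e) / (np.linalg.norm(q) * np.linalg.norm(e)))
        scored.append((score, ex))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [ex for _, ex in scored[:k]]

def build_prompt(question):
    # Assemble the dynamically selected demonstrations ahead of the new question.
    shots = "\n\n".join(
        f"Question: {ex['question']}\nSQL: {ex['sql']}" for ex in select_examples(question)
    )
    return f"{shots}\n\nQuestion: {question}\nSQL:"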

Conclusion
This post demonstrated how to implement a text-to-SQL solution to democratize access to omics data for users who aren’t data analytics specialists. The approach used HealthOmics and Amazon Bedrock to generate SQL based on natural language queries, and it has the potential to open omics data to a larger audience than is possible today.
The code is available in the accompanying GitHub repo. The deployment instructions for the HealthOmics variants and annotation store can be found in the AWS HealthOmics – End to End workshop. The deployment instructions for the text-to-SQL project are available in the README file.
We would like to acknowledge Thomaz Silva and Saeed Elnaj for their contributions to this blog. It couldn’t have been done without them.

About the Authors
Ganesh Raam Ramadurai is a Senior Technical Program Manager at Amazon Web Services (AWS), where he leads the PACE (Prototyping and Cloud Engineering) team. He specializes in delivering innovative, AI/ML and Generative AI-driven prototypes that help AWS customers explore emerging technologies and unlock real-world business value. With a strong focus on experimentation, scalability, and impact, Ganesh works at the intersection of strategy and engineering—accelerating customer innovation and enabling transformative outcomes across industries.
Jeff Harman is a Senior Prototyping Architect on the Amazon Web Services (AWS) Prototyping and Cloud Engineering team, where he specializes in developing innovative solutions that leverage AWS’s cloud infrastructure to meet complex business needs. Jeff is a seasoned technology professional with over three decades of experience in software engineering, enterprise architecture, and cloud computing. Prior to his tenure at AWS, Jeff held various leadership roles at Webster Bank, including Vice President of Platform Architecture for Core Banking, Vice President of Enterprise Architecture, and Vice President of Application Architecture. During his time at Webster Bank, he was instrumental in driving digital transformation initiatives and enhancing the bank’s technological capabilities. He holds a Master of Science degree from the Rochester Institute of Technology, where he conducted research on creating a Java-based, location-independent desktop environment, a forward-thinking project that anticipated the growing need for remote computing solutions. Based in Unionville, Connecticut, Jeff continues to be a driving force in the field of cloud computing, applying his extensive experience to help organizations harness the full potential of AWS technologies.
Kosal Sen is a Design Technologist on the Amazon Web Services (AWS) Prototyping and Cloud Engineering team. Kosal specializes in creating solutions that bridge the gap between technology and actual human needs. As an AWS Design Technologist, that means building prototypes on AWS cloud technologies, and ensuring they bring empathy and value into the real world. Kosal has extensive experience spanning design, consulting, software development, and user experience. Prior to AWS, Kosal held various roles where he combined technical skillsets with human-centered design principles across enterprise-scale projects.

Gemma 3 27B model now available on Amazon Bedrock Marketplace and Amaz …

We are excited to announce the availability of Gemma 3 27B Instruct models through Amazon Bedrock Marketplace and Amazon SageMaker JumpStart. With this launch, developers and data scientists can now deploy Gemma 3, a 27-billion-parameter language model, along with its specialized instruction-following versions, to help accelerate building, experimentation, and scalable deployment of generative AI solutions on AWS.
In this post, we show you how to get started with Gemma 3 27B Instruct on both Amazon Bedrock Marketplace and SageMaker JumpStart, and how to use the model’s powerful instruction-following capabilities in your applications.
Overview of Gemma 3 27B
Gemma 3 27B is a high-performance, open-weight, multimodal language model by Google designed to handle both text and image inputs with efficiency and contextual understanding. It introduces a redesigned attention architecture, enhanced multilingual support, and extended context capabilities. With its optimized memory usage and support for large input sequences, it is well-suited for complex reasoning tasks, long-form interactions, and vision-language applications. With 27 billion parameters and training on up to 6 trillion tokens of text, these models are optimized for tasks requiring advanced reasoning, multilingual capabilities, and instruction following. According to Google, Gemma3 27B Instruct models are ideal for developers, researchers, and businesses looking to build generative AI applications such as chatbots, virtual assistants, and automated content generation tools. The following are its key features:

Multimodal input – Processes text, images, and short videos for unified reasoning across modalities
Long context support – Handles up to 128,000 tokens, enabling seamless processing of long documents, conversations, and multimedia transcripts
Multilingual support – Offers out-of-the-box support for over 35 languages, with pre-training exposure to more than 140 languages in total
Function calling – Facilitates building agentic workflows by using natural‐language interfaces to APIs
Memory-efficient inference – Offers architectural updates that reduce KV-cache usage and introduce QK-norm for faster and more accurate outputs

Key use cases for Gemma3, as described by Google, include:

Q&A and summarization – Processing and condensing long documents or articles
Visual understanding – Image captioning, object identification, visual Q&A, and document understanding
Multilingual applications – Building AI assistants and tools across over 140 languages
Document processing – Analyzing multi-page articles or extracting information from large texts
Automated workflows – Using function calling to create AI agents that can interact with other systems

There are two primary methods for deploying Gemma 3 27B in AWS: The first approach involves using Amazon Bedrock Marketplace, which offers a streamlined way of accessing Amazon Bedrock APIs (Invoke and Converse) and tools such as Amazon Bedrock Knowledge Bases, Amazon Bedrock Agents, Amazon Bedrock Flows, Amazon Bedrock Guardrails, and model evaluation. The second approach is using SageMaker JumpStart, a machine learning (ML) hub, with foundation models (FMs), built-in algorithms, and pre-built ML solutions. You can deploy pre-trained models using either the Amazon SageMaker console or SDK.
Deploy Gemma 3 27B Instruct on Amazon Bedrock Marketplace
Amazon Bedrock Marketplace offers access to over 150 specialized FMs, including Gemma 3 27B Instruct.
Prerequisites
To try the Gemma 3 27B Instruct model using Amazon Bedrock Marketplace, you need the following:

An AWS account that will contain all your AWS resources
Access to accelerated instances (GPUs) for hosting the large language models (LLMs)

Deploy the model
To deploy the model using Amazon Bedrock Marketplace, complete the following steps:

On the Amazon Bedrock console, under Foundation models in the navigation pane, select Model catalog.
Filter for Gemma as the provider and choose Gemma 3 27B Instruct.

Information about Gemma3’s features, costs, and setup instructions can be found on its model overview page. This resource includes integration examples, API documentation, and programming samples. The model excels at a variety of text generation and image understanding tasks, including question answering, summarization, and reasoning. You can also access deployment guidelines and license details to begin implementing Gemma3 into your projects.

Review the model details, pricing, and deployment guidelines, and choose Deploy to start the deployment process.

For Endpoint name, enter an endpoint name (1–50 alphanumeric characters) or keep the pre-populated default name.
For Number of instances, enter a number of instances (between 1–100).
Select your preferred instance type, with GPU-powered options like ml.g5.48xlarge being particularly well-suited for running Gemma 3 efficiently.

Although default configurations are typically sufficient for basic needs, you have the option to customize security features such as virtual private cloud (VPC) networking, role-based permissions, and data encryption. These advanced settings might require adjustment for production environments to maintain compliance with your organization’s security protocols.

Prior to deploying Gemma 3, verify that your AWS account has sufficient quota allocation for ml.g5.48xlarge instances. A quota set to 0 will trigger deployment failures, as shown in the following screenshot.

To request a quota increase, open the AWS Service Quotas console and search for SageMaker. Locate ml.g5.48xlarge for endpoint usage and choose Request quota increase, then specify your required limit value.

While the deployment is in progress, you can choose Managed deployments in the navigation pane to monitor the deployment status.
When deployment is complete, you can test Gemma 3’s capabilities directly in the Amazon Bedrock playground by selecting the managed deployment and choosing Open in playground.

You can now use the playground to interact with Gemma 3.

For detailed steps and example code for invoking the model using Amazon Bedrock APIs, refer to Submit prompts and generate response using the API and the following code:

import boto3

# Create a Bedrock runtime client and reference the managed deployment endpoint
bedrock_runtime = boto3.client("bedrock-runtime")
endpoint_arn = "arn:aws:sagemaker:us-east-2:061519324070:endpoint/endpoint-quick-start-3t7kp"

# Send a single-turn request to the deployed model using the Converse API
response = bedrock_runtime.converse(
    modelId=endpoint_arn,
    messages=[
        {
            "role": "user",
            "content": [{"text": "What is Amazon doing in the field of generative AI?"}]
        }
    ],
    inferenceConfig={
        "maxTokens": 256,
        "temperature": 0.1,
        "topP": 0.999
    }
)
print(response["output"]["message"]["content"][0]["text"])

Deploy Gemma 3 27B Instruct with SageMaker JumpStart
SageMaker JumpStart offers access to a broad selection of publicly available FMs. These pre-trained models serve as powerful starting points that can be deeply customized to address specific use cases. You can use state-of-the-art model architectures—such as language models, computer vision models, and more—without having to build them from scratch.
With SageMaker JumpStart, you can deploy models in a secure environment. The models can be provisioned on dedicated SageMaker inference instances and can be isolated within your VPC. After deploying an FM, you can further customize and fine-tune it using the extensive capabilities of Amazon SageMaker AI, including SageMaker inference for deploying models and container logs for improved observability. With SageMaker AI, you can streamline the entire model deployment process.
There are two ways to deploy the Gemma 3 model using SageMaker JumpStart:

Through the user-friendly SageMaker JumpStart interface
Using the SageMaker Python SDK for programmatic deployment

We examine both deployment methods to help you determine which approach aligns best with your requirements.
Prerequisites
To try the Gemma 3 27B Instruct model in SageMaker JumpStart, you need the following prerequisites:

An AWS account that will contain your AWS resources.
An AWS Identity and Access Management (IAM) role to access SageMaker AI. To learn more about how IAM works with SageMaker AI, see Identity and Access Management for Amazon SageMaker AI.
Access to Amazon SageMaker Studio and a SageMaker AI notebook instance or an interactive development environment (IDE) such as PyCharm or Visual Studio Code. We recommend using SageMaker Studio for straightforward deployment and inference.
Access to accelerated instances (GPUs) for hosting the LLMs.

Deploy the model through the SageMaker JumpStart UI
SageMaker JumpStart provides a user-friendly interface for deploying pre-built ML models with just a few clicks. Through the SageMaker JumpStart UI, you can select, customize, and deploy a wide range of models for various tasks such as image classification, object detection, and natural language processing, without the need for extensive coding or ML expertise.

On the SageMaker AI console, choose Studio in the navigation pane.
First-time users will be prompted to create a domain.
On the SageMaker Studio console, choose JumpStart in the navigation pane.

The model browser displays available models, with details like the provider name and model capabilities.

Search for Gemma 3 to view the Gemma 3 model card. Each model card shows key information, including:

Model name
Provider name
Task category (for example, Text Generation)
The Bedrock Ready badge (if applicable), indicating that this model can be registered with Amazon Bedrock, so you can use Amazon Bedrock APIs to invoke the model

Choose the model card to view the model details page.

The model details page includes the following information:

The model name and provider information
The Deploy button to deploy the model
About and Notebooks tabs with detailed information. The About tab includes important details, such as:
Model description
License information
Technical specifications
Usage guidelines

Before you deploy the model, we recommend that you review the model details and license terms to confirm compatibility with your use case.

Choose Deploy to proceed with deployment.
For Endpoint name, enter an endpoint name (1–50 alphanumeric characters) or leave it as the default.
For Instance type, choose an instance type (default: ml.g5.48xlarge).
For Initial instance count, enter the number of instances (default: 1).

Selecting appropriate instance types and counts is crucial for cost and performance optimization. Monitor your deployment to adjust these settings as needed. Under Inference type, Real-time inference is selected by default. This is optimized for sustained traffic and low latency.

Review all configurations for accuracy. For this model, we strongly recommend adhering to SageMaker JumpStart default settings and making sure that network isolation remains in place.
Choose Deploy to deploy the model.

The deployment process can take several minutes to complete.
Deploy the model programmatically using the SageMaker Python SDK
To use Gemma 3 with the SageMaker Python SDK, first make sure you have installed the SDK and set up your AWS permissions and environment correctly. The following is a code example showing how to programmatically deploy and run inference with Gemma 3:

import sagemaker
from sagemaker.jumpstart.model import JumpStartModel

# Initialize SageMaker session
session = sagemaker.Session()
role = sagemaker.get_execution_role()

# Specify model parameters
model_id = "huggingface-vlm-gemma-3-27b-instruct"
instance_type = "ml.g5.48xlarge"  # Choose an appropriate instance based on your needs

# Create the model
model = JumpStartModel(
    model_id=model_id,
    role=role,
    instance_type=instance_type,
    model_version="*",  # Latest version
)

# Deploy the model
predictor = model.deploy(
    initial_instance_count=1,
    accept_eula=True  # Required for deploying foundation models
)

Run inference using the SageMaker API
With your Gemma 3 model successfully deployed as a SageMaker endpoint, you’re now ready to start making predictions. The SageMaker SDK provides a straightforward way to interact with your model endpoint for inference tasks. The following code demonstrates how to format your input and make API calls to the endpoint. The code handles both sending requests to the model and processing its responses, making it straightforward to integrate Gemma 3 into your applications.

import json
import boto3

# Initialize AWS session (ensure your AWS credentials are configured)
session = boto3.Session()
sagemaker_runtime = session.client("sagemaker-runtime")

# Define the SageMaker endpoint name (replace with your deployed endpoint name)
endpoint_name = "hf-vlm-gemma-3-27b-instruct-2025-05-07-18-09-16-221"

payload = {
    "inputs": "What is Amazon doing in the field of generative AI?",
    "parameters": {
        "max_new_tokens": 256,
        "temperature": 0.1,
        "top_p": 0.9,
        "return_full_text": False
    }
}

# Run inference
try:
    response = sagemaker_runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps(payload)
    )
    # Parse the response
    result = json.loads(response["Body"].read().decode("utf-8"))
    generated_text = result[0]["generated_text"].strip()
    print("Generated Response:")
    print(generated_text)
except Exception as e:
    print(f"Error during inference: {e}")

Clean up
To avoid incurring ongoing charges for AWS resources used during exploration of Gemma3 27B Instruct models, it’s important to clean up deployed endpoints and associated resources. Complete the following steps:

Delete SageMaker endpoints:

On the SageMaker console, in the navigation pane, choose Endpoints under Inference.
Select the endpoint associated with the Gemma3 27B Instruct model (for example, gemma3-27b-instruct-endpoint).
Choose Delete and confirm the deletion. This stops the endpoint and prevents further compute charges.

Delete SageMaker models (if applicable):
On the SageMaker console, choose Models under Inference.
Select the model associated with your endpoint and choose Delete.
Verify Amazon Bedrock Marketplace resources:
On the Amazon Bedrock console, choose Model catalog in the navigation pane.
Make sure no additional endpoints are running for the Gemma3 27B Instruct model deployed through Amazon Bedrock Marketplace.

Always verify that all endpoints are deleted after experimentation to optimize costs. Refer to the Amazon SageMaker documentation for additional guidance on managing resources.
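If you prefer to script the cleanup rather than use the console, the following is a minimal boto3 sketch; the endpoint name is a placeholder and should be replaced with the name of your deployed endpoint.

import boto3

sagemaker = boto3.client("sagemaker")

# Replace with the endpoint name from your deployment (placeholder shown here).
endpoint_name = "gemma3-27b-instruct-endpoint"

# Look up the endpoint configuration and attached models before deleting the endpoint.
endpoint = sagemaker.describe_endpoint(EndpointName=endpoint_name)
config_name = endpoint["EndpointConfigName"]
config = sagemaker.describe_endpoint_config(EndpointConfigName=config_name)
model_names = [variant["ModelName"] for variant in config["ProductionVariants"]]

# Delete the endpoint first to stop compute charges, then the config and models.
sagemaker.delete_endpoint(EndpointName=endpoint_name)
sagemaker.delete_endpoint_config(EndpointConfigName=config_name)
for model_name in model_names:
    sagemaker.delete_model(ModelName=model_name)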
Conclusion
The availability of Gemma3 27B Instruct models in Amazon Bedrock Marketplace and SageMaker JumpStart empowers developers, researchers, and businesses to build cutting-edge generative AI applications with ease. With their high performance, multilingual capabilities and efficient deployment on AWS infrastructure, these models are well-suited for a wide range of use cases, from conversational AI to code generation and content automation. By using the seamless discovery and deployment capabilities of SageMaker JumpStart and Amazon Bedrock Marketplace, you can accelerate your AI innovation while benefiting from the secure, scalable, and cost-effective AWS Cloud infrastructure.
We encourage you to explore the Gemma3 27B Instruct models today by visiting the SageMaker JumpStart console or Amazon Bedrock Marketplace. Deploy the model and experiment with sample prompts to meet your specific needs. For further learning, explore the AWS Machine Learning Blog, the SageMaker JumpStart GitHub repository, and the Amazon Bedrock documentation. Start building your next generative AI solution with Gemma3 27B Instruct models and unlock new possibilities with AWS!

About the Authors
Santosh Vallurupalli is a Sr. Solutions Architect at AWS. Santosh specializes in networking, containers, and migrations, and enjoys helping customers in their journey of cloud adoption and building cloud-based solutions for challenging issues. In his spare time, he likes traveling, watching Formula1, and watching The Office on repeat.
Aravind Singirikonda is an AI/ML Solutions Architect at AWS. He works with AWS customers in the healthcare and life sciences domain to provide guidance and technical assistance, helping them improve the value of their AI/ML solutions when using AWS.
Pawan Matta is a Sr. Solutions Architect at AWS. He works with AWS customers in the gaming industry and guides them to deploy highly scalable, performant architectures. His area of focus is management and governance. In his free time, he likes to play FIFA and watch cricket.
Ajit Mahareddy is an experienced Product and Go-To-Market (GTM) leader with over 20 years of experience in product management, engineering, and GTM. Prior to his current role, Ajit led product management building AI/ML products at leading technology companies, including Uber, Turing, and eHealth. He is passionate about advancing generative AI technologies and driving real-world impact with generative AI.

GuardianGamer scales family-safe cloud gaming with AWS

This blog post is co-written with Heidi Vogel Brockmann and Ronald Brockmann from GuardianGamer.

Millions of families face a common challenge: how to keep children safe in online gaming without sacrificing the joy and social connection these games provide.
In this post, we share how GuardianGamer—a member of the AWS Activate startup community—has built a cloud gaming platform that helps parents better understand and engage with their children’s gaming experiences using AWS services. Built specifically for families with children under 13, GuardianGamer uses AWS services including Amazon Nova and Amazon Bedrock to deliver a scalable and efficient supervision platform. The team uses Amazon Nova for intelligent narrative generation to provide parents with meaningful insights into their children’s gaming activities and social interactions, while maintaining a non-intrusive approach to monitoring.
The challenge: Monitoring children’s online gaming experiences
Monitoring children’s online gaming activities has been overwhelming for parents, who have had little visibility and limited control. GuardianGamer fills a significant void in the market by letting parents effectively monitor their children’s gaming activities without being intrusive.
Traditional parental controls were primarily focused on blocking content rather than providing valuable data related to their children’s gaming experiences and social interactions. This led GuardianGamer’s founders to develop a better solution—one that uses AI to summarize gameplay and chat interactions, helping parents better understand and engage with their children’s gaming activities in a non-intrusive way, by using short video reels, while also helping identify potential safety concerns.
Creating connected experiences for parent and child
GuardianGamer is a cloud gaming platform built specifically for families with pre-teen children under 13, combining seamless gaming experiences with comprehensive parental insights. Built on AWS and using Amazon Nova for intelligent narrative generation, the platform streams popular games while providing parents with much-desired visibility into their children’s gaming activities and social interactions. The service prioritizes both safety and social connection through integrated private voice chat, delivering a positive gaming environment that keeps parents informed in a non-invasive way.
There are two connected experiences offered in the platform: one for parents to stay informed and one for kids to play in a highly trusted and safe GuardianGamer space.
For parents, GuardianGamer offers a comprehensive suite of parental engagement tools and insights, empowering them to stay informed and involved in their children’s online activities. Insights are generated from gaming and video understanding, and texted to parents to foster positive conversations between parents and kids. Through these tools, parents can actively manage their child’s gaming experience, enjoying a safe and balanced approach to online entertainment.
For kids, GuardianGamer offers uninterrupted gameplay with minimal latency, all while engaging in social interactions. The platform makes it possible for children to connect and play exclusively within a trusted circle of friends—each vetted and approved by parents—creating a secure digital extension of their real-world relationships. This transforms gaming sessions into natural extensions of friendships formed through school, sports, and community activities, all enhanced by advanced parental AI insights.
By seamlessly blending technology, community, and family, GuardianGamer creates a safer, more enriching digital space, called “The Trusted Way for Kids to Play.”
Solution overview
When the GuardianGamer team set out to build a platform that would help parents supervise their children’s gaming experiences across Minecraft, Roblox, and beyond, they knew they needed a cloud infrastructure partner with global reach and proven scalability. Having worked with AWS on previous projects, the team found it to be the natural choice for their ambitious vision.

“Our goal was to build a solution that could scale from zero to millions of users worldwide while maintaining low latency and high reliability—all with a small, nimble engineering team. AWS serverless architecture gave us exactly what we needed without requiring a massive DevOps investment.”
– Heidi Vogel Brockmann, founder and CEO of GuardianGamer.

The following diagram illustrates the backend’s AWS architecture.

GuardianGamer’s backend uses a fully serverless stack built on AWS Lambda, Amazon DynamoDB, Amazon Cognito, Amazon Simple Storage Service (Amazon S3), and Amazon Simple Notification Service (Amazon SNS), making it possible to expand the platform effortlessly as user adoption grows while keeping operational overhead minimal. This architecture enables the team to focus on their core innovation: AI-powered game supervision for parents, rather than infrastructure management.
The cloud gaming component presented unique challenges, requiring low-latency GPU resources positioned close to users around the world.

“Gaming is an inherently global activity, and latency can make or break the user experience. The extensive Regional presence and diverse Amazon Elastic Compute Cloud (Amazon EC2) instance types give us the flexibility to deploy gaming servers where our users are.”
– Heidi Vogel Brockmann.

The team uses Amazon Elastic File System (Amazon EFS) for efficient game state storage within each AWS Region and Amazon Elastic Container Service (Amazon ECS) for streamlined cluster management.
For the AI analysis capabilities that form the heart of GuardianGamer’s parental supervision features, the team relies on AWS Batch to coordinate analysis jobs, and Amazon Bedrock provides access to powerful large language models (LLMs).

“We’re currently using Amazon Nova Lite for summary generation and highlight video selection, which helps parents quickly understand what’s happening in their children’s gameplay without watching hours of content, just a few minutes a day to keep up to date and start informed conversations with their child,”
– Heidi Vogel Brockmann.

Results
Together, AWS and GuardianGamer have successfully scaled GuardianGamer’s cloud gaming platform to handle thousands of concurrent users across multiple game environments. The company’s recent expansion to support Roblox—in addition to its existing Minecraft capabilities—has broadened its serviceable addressable market to 160 million children and their families.

“What makes our implementation special is how we use Amazon Nova to maintain a continuous record of each child’s gaming activities across sessions. When a parent opens our app, they see a comprehensive view of their child’s digital journey, not just isolated moments.”
– Ronald Brockmann, CTO of GuardianGamer.

Conclusion
GuardianGamer demonstrates how a small, agile team can use AWS services to build a sophisticated, AI-powered gaming platform that prioritizes both child safety and parent engagement. By combining cloud gaming infrastructure across multiple Regions with the capabilities of Amazon Bedrock and Amazon Nova, GuardianGamer is pioneering a new approach to family-friendly gaming. Through continuous parent feedback and responsible AI practices, the platform delivers safer, more transparent gaming experiences while maintaining rapid innovation.

“AWS has been exceptional at bringing together diverse teams and technologies across the company to support our vision. Our state-of-the-art architecture leverages several specialized AI components, including speech analysis, video processing, and game metadata collection. We’re particularly excited about incorporating Amazon Nova, which helps us transform complex gaming data into coherent narratives for parents. With AWS as our scaling partner, we’re confident we can deliver our service to millions of families worldwide.”
–  Heidi Vogel Brockmann.

Learn more about building family-safe gaming experiences on AWS. And for further reading, check out The psychology behind why children are hooked on Minecraft and Keep kids off Roblox if you’re worried, its CEO tells parents.

About the Authors
Heidi Vogel Brockmann is the CEO & Founder at GuardianGamer AI. Heidi is an engineer and a proactive mom of four with a mission to transform digital parenting in the gaming space. Frustrated by the lack of tools available for parents with gaming kids, Heidi built the platform to enable fun for kids and peace of mind for parents.
Ronald Brockmann is the CTO of GuardianGamer AI. With extensive expertise in cloud technology and video streaming, Ronald brings decades of experience in building scalable, secure systems. A named inventor on dozens of patents, he excels at building high-performance teams and deploying products at scale. His leadership combines innovative thinking with precise execution to drive GuardianGamer’s technical vision.
Raechel Frick is a Sr Product Marketing Manager at AWS. With over 20 years of experience in the tech industry, she brings a customer-first approach and growth mindset to building integrated marketing programs. Based in the greater Seattle area, Raechel balances her professional life with being a soccer mom and after-school carpool manager, demonstrating her ability to excel both in the corporate world and family life.
John D’Eufemia is an Account Manager at AWS supporting customers within Media, Entertainment, Games, and Sports. With an MBA from Clark University, where he graduated Summa Cum Laude, John brings entrepreneurial spirit to his work, having co-founded multiple ventures at Femia Holdings. His background includes significant leadership experience through his 8-year involvement with DECA Inc., where he served as both an advisor and co-founder of Clark University’s DECA chapter.

Researchers at UT Austin Introduce Panda: A Foundation Model for Nonli …

Chaotic systems, such as fluid dynamics or brain activity, are highly sensitive to initial conditions, making long-term predictions difficult. Even minor errors in modeling these systems can rapidly grow, which limits the effectiveness of many scientific machine learning (SciML) approaches. Traditional forecasting methods rely on models trained on specific time series or broad datasets lacking true dynamical structure. However, recent work has demonstrated the potential for local forecasting models to predict chaotic systems more accurately over longer timeframes by learning the numerical rules governing these systems. The real challenge is achieving out-of-domain generalization—creating models that can adapt and forecast new, previously unseen dynamical systems. This would require integrating prior knowledge with the ability to adapt locally. Still, the need for task-specific data constrains current methods and often overlooks key dynamical system properties such as ergodicity, channel coupling, and conserved quantities.

Machine learning for dynamical systems (MLDS) utilizes the unique properties of such systems as inductive biases. These include fixed relationships among system variables and invariant statistical measures, like strange attractors or conserved quantities. MLDS models use these properties to build more accurate and generalizable models, sometimes incorporating probabilistic or latent variable techniques. While datasets of dynamical systems have been curated and new systems are often generated by tweaking parameters or using symbolic methods, these approaches typically don’t ensure diverse or stable dynamics. Structural stability is a challenge—small changes may not yield new behaviors, while large ones can cause trivial dynamics. Foundation models aim to address this by enabling transfer learning and zero-shot inference. Still, most current models perform comparably to standard time series models or are limited in generating meaningful, dynamic variety. Some progress has been made through techniques like embedding spaces or symbolic discovery, but a richer, more diverse sampling of dynamical behaviors remains an open challenge. 

Researchers at the Oden Institute, UT Austin, introduce Panda (Patched Attention for Nonlinear Dynamics), a pretrained model trained solely on synthetic data from 20,000 algorithmically-generated chaotic systems. These systems were created using an evolutionary algorithm based on known chaotic ODEs. Despite training only on low-dimensional ODEs, Panda shows strong zero-shot forecasting on real-world nonlinear systems—including fluid dynamics and electrophysiology—and unexpectedly generalizes to PDEs. The model incorporates innovations like masked pretraining, channel attention, and kernelized patching to capture dynamical structure. A neural scaling law also emerges, linking Panda’s forecasting performance to the diversity of training systems. 

The researchers generated 20,000 new chaotic systems using a genetic algorithm that evolves from a curated set of 135 known chaotic ODEs. These systems are mutated and recombined using a skew product approach, with only truly chaotic behaviors retained through rigorous tests. Augmentations like time-delay embeddings and affine transformations expand the dataset while preserving its dynamics. A separate set of 9,300 unseen systems is held out for zero-shot testing. The model, Panda, is built on PatchTST and enhanced with features like channel attention, temporal-channel attention layers, and dynamic embeddings using polynomial and Fourier features, inspired by Koopman operator theory. 

Panda demonstrates strong zero-shot forecasting capabilities on unseen nonlinear dynamical systems, outperforming models like Chronos-SFT across various metrics and prediction horizons. Trained solely on 3D systems, it generalizes to higher-dimensional ones due to channel attention. Despite never encountering PDEs during training, Panda also succeeds on real-world experimental data and chaotic PDEs, such as the Kuramoto-Sivashinsky and von Kármán vortex street. Architectural ablations confirm the importance of channel attention and dynamics embeddings. The model exhibits neural scaling with increased dynamical system diversity and forms interpretable attention patterns, suggesting resonance and attractor-sensitive structure. This indicates Panda’s broad generalization across complex dynamical behaviors. 

In conclusion, Panda is a pretrained model designed to uncover generalizable patterns in dynamical systems. Trained on a large, diverse set of synthetic chaotic systems, Panda demonstrates strong zero-shot forecasting on unseen real-world data and even partial differential equations, despite only being trained on low-dimensional ODEs. Its performance improves with system diversity, revealing a neural scaling law. The model also shows emergent nonlinear resonance in attention patterns. While focused on low-dimensional dynamics, the approach may extend to higher-dimensional systems by leveraging sparse interactions. Future directions include alternative pretraining strategies to improve rollout performance when forecasting chaotic behaviors.

Check out the Paper. All credit for this research goes to the researchers of this project.