An innovative financial services leader finds the right AI solution: R …

This post is cowritten with Renyu Chen and Dev Tagare from Robinhood.
Robinhood has been a pioneer and disruptor in the once staid world of online brokerages. Founded in 2013, the company transformed an industry better known for gatekeeping into an open platform accessible to all. Robinhood pioneered commission-free trading, and harnessed the power of technology and intuitive design to create a seamless and engaging experience for modern investors. To this day, the company continues to disrupt the financial services industry by launching groundbreaking product innovations on AWS.
Such innovations have made Robinhood one of the fastest growing brokerages in history, with more than 25 million customers worldwide and a global reputation as an innovator and technology leader. Fueled by its mission of “democratizing finance for all,” the company’s focus on accessibility, particularly for first-time investors, has kept Robinhood as one of the top finance apps on the Apple App Store for more than a decade and earned Robinhood accolades such as an award from Fast Company magazine as one of World’s 50 Most Innovative Companies. This annual ranking highlights companies that are reshaping industries and culture through innovation.
Robinhood’s Chief Executive Officer, Vlad Tenev, explains why this focus is important to Robinhood:

“Our belief is, the more we lower the barriers to entry, the more we level the playing field and allow people to invest their money at a younger age, the better off our economy will be and the better off society will be.”

Built to operate in the cloud, Robinhood uses AWS to power its online business, deliver and update its mobile trading app, securely store information and data, and perform business analytics. Robinhood recently used AI to improve customer experience and expand accessibility. For example, in 2025, the company will launch Robinhood Cortex, an AI investment tool designed to provide real-time insights that help users better navigate markets, identify potential opportunities, and stay up to date on the latest market-moving news. Cortex is an exciting step forward, providing a level of premium investment and market digests that has historically been reserved for institutional investors and wealthy individuals.
As Robinhood customers are able to do more on the platform, the company is working with AWS to explore new generative AI solutions such as Amazon Nova, a family of foundation models (FMs) that make generative AI development faster and more efficient, with exceptional price performance. These new solutions will help the company accommodate rapid expansion of customer requirements.
In this post, we share how Robinhood delivers democratized finance and real-time market insights using generative AI and Amazon Nova.
An AI/ML journey built on customer obsession
Robinhood, like all financial services firms, operates in a highly regulated environment. Historically, the industry was seen as slow-moving and wary of new technologies. Robinhood’s founders put technology at the forefront by initially building a no-frills, no-fee app that, by design, would make investing accessible to everyone, not just the very wealthy. As Robinhood grew, it attracted a wider variety of customers who need the speed, reliability, security, and low cost the platform offers, but who also want a richer set of services for different and novel use cases.
Robinhood listens closely to these active traders. As Renyu Chen, staff machine learning (ML) engineer at Robinhood, explains,

“We wanted to create a seamless journey for AI/ML applications to go from experimentation to Robinhood scale. We looked to the AWS team to help meet the AI/ML needs of our developers while providing advanced ML tooling to serve our most sophisticated ‘active trader’ customers. This would also require a plug-and-play approach that could adopt the latest generative AI technologies from open source, model providers, and home-grown platform tooling.”

Robinhood explored various generative AI solutions during 2023, concluding that the best way to get to Robinhood scale was with Amazon Bedrock, a fully managed service that helps users build generative AI models. Amazon Bedrock offers an extensive selection of FMs from various providers, and allows a high level of customization and security through a single API.
According to Robinhood’s Renyu Chen,

“For us, the security of our customers’ data comes first. Nothing is more important. With Amazon Bedrock, data stays under our control. When we query a model, the input and output never leave our virtual private cloud. When we fine-tune a foundation model, it is based on a private copy of that model. This means our customers’ data is not shared with model providers, and is not used to improve the base models.”

To meet the needs of Robinhood’s ever-growing base of power users, Robinhood is exploring Amazon Nova. The company estimates that the price per token using Amazon Nova can be up to 80% lower than other models it has tested, which would make it cost-effective to power new high-demand use cases such as a fraud investigation assistant, enhanced document processing, and AI-created content generation.
In addition, AWS generative AI solutions working through Amazon Nova can power new agentic workflows for Robinhood, in which autonomous AI agents can independently make decisions, adapt to changing situations, and execute actions.

“Robinhood offers its customers simplicity, speed, security, and cost savings. Working developer-to-developer with the Robinhood team and building together, we can design generative AI solutions that meet Robinhood’s priorities and customer-focused goals. For example, Amazon Nova models can be easily customized with Amazon Bedrock Model Distillation, which ‘distills’ knowledge from a larger, more capable ‘teacher’ model to a smaller, faster, and cost-efficient ‘student’ model. This solution can help Robinhood use models such as DeepSeek to explore exciting new use cases quickly, securely, and at a 75% lower cost than equivalent offerings from competitors.”
– Dushan Tharmal, Principal Product Manager, Amazon Artificial General Intelligence (AGI).

Amazon Nova: More services, greater value for Robinhood and its customers
Working with AWS on its ambitious AI journey, Robinhood is able to rapidly scale new services for customers without needing the costly structures, staff, and infrastructure found at traditional brokerages. With support from AWS, Robinhood is able to offer a richer customer experience while remaining true to its mission of simplicity, clarity, low cost, speed, security, and reliability.

“We see that Amazon Nova can be a great match for our mission. Amazon Nova offers the lowest latency responses at very low cost, and is accurate and lightning-fast across a wide range of interactive and high-volume Robinhood applications. And, consistent with Robinhood’s commitment to simplicity and low cost for its customers, using Amazon Nova models through Amazon Bedrock makes these large-scale tasks significantly easier, cheaper, and more cost-effective.”
– Dev Tagare, Robinhood’s head of AI.

Learn more about Amazon Nova and how it can deliver frontier intelligence and industry leading price-performance for your organization.

About the authors
Renyu Chen is a Staff AI Engineer at Robinhood Markets
Dev Tagare is the Head of AI at Robinhood Markets
Uchenna Egbe is a GenAI Solutions Architect at AWS FSI.
Trevor Spires is a GenAI Solutions Architect at AWS FinTech.

Build conversational interfaces for structured data using Amazon Bedro …

Organizations manage extensive structured data in databases and data warehouses. Large language models (LLMs) have transformed natural language processing (NLP), yet converting conversational queries into structured data analysis remains complex. Data analysts must translate business questions into SQL queries, creating workflow bottlenecks.
Amazon Bedrock Knowledge Bases enables direct natural language interactions with structured data sources. The system interprets database schemas and context, converting natural language questions into accurate queries while maintaining data reliability standards. You can chat with your structured data by setting up structured data ingestion from AWS Glue Data Catalog tables and Amazon Redshift clusters in a few steps, using the power of Amazon Bedrock Knowledge Bases structured data retrieval.
This post provides instructions to configure a structured data retrieval solution, with practical code examples and templates. It covers implementation samples and additional considerations, empowering you to quickly build and scale your conversational data interfaces. Through clear examples and proven methodologies, organizations can transform their data access capabilities and accelerate decision-making processes.
Solution overview
The solution demonstrates how to build a conversational application using Amazon Bedrock Knowledge Bases structured data retrieval. Developers often face challenges integrating structured data into generative AI applications. This includes difficulties training LLMs to convert natural language queries to SQL queries based on complex database schemas, as well as making sure appropriate data governance and security controls are in place. Amazon Bedrock Knowledge Bases alleviates these complexities by providing a managed natural language to SQL (NL2SQL) module. Amazon Bedrock Knowledge Bases offers an end-to-end managed workflow for you to build custom generative AI applications that can access and incorporate contextual information from a variety of structured and unstructured data sources. Using advanced NLP, Amazon Bedrock Knowledge Bases can transform natural language queries into SQL queries, so you can retrieve data directly from the source without the need to move or preprocess the data.
This solution includes Amazon Bedrock Knowledge Bases, Amazon Redshift, AWS Glue, and Amazon Simple Storage Service (Amazon S3). The solution architecture consists of two parts: a data ingestion pipeline, and a structured data retrieval application using Amazon Bedrock Knowledge Bases.
Amazon Bedrock Knowledge Bases structured data retrieval supports Amazon Redshift as the query engine and multiple data ingestion options. The data ingestion pipeline is a one-time setup, and supports multiple ingestion options. In this post, we discuss a common data ingestion use case using Amazon S3, AWS Glue, and Amazon Redshift.
You can configure Amazon Bedrock Knowledge Bases structured data retrieval to retrieve data from AWS Glue databases and S3 datasets. This setup uses automatic mounting of the Data Catalog in Amazon Redshift. With this ingestion option, you can seamlessly integrate existing S3 datasets and Data Catalog tables into your Retrieval Augmented Generation (RAG) applications with the access permissions configured through Lake Formation. The following diagram illustrates this pipeline.

The following screenshot shows the configuration options on the Amazon Bedrock console.

After the data ingestion is configured and the knowledge bases data source sync job is complete, users can ask natural language questions, and Amazon Bedrock Knowledge Bases will generate the SQL, execute the SQL against the query engine, and process it through the LLM to provide a user-friendly response. The following diagram illustrates a sample architecture of the structured data retrieval workflow.

The data retrieval workflow consists of the following steps:

In a RAG application, the user can ask a natural language data analytics question through the chat interface, such as “What is the sales revenue for the Month of February 2025?”
The natural language query is sent to Amazon Bedrock Knowledge Bases for data retrieval and processing.
Amazon Bedrock Knowledge Bases generates a SQL query based on the underlying data schema configured during the knowledge base creation.
The SQL query is executed against the query engine (Amazon Redshift) to retrieve data from a structured data store (AWS Glue tables). The query can include multiple joins and aggregation.
The generated SQL response is sent to an LLM along with additional context to generate a response in natural language.
The response is sent back to the user. The user can ask follow-up questions based on the retrieved response, such as “What is the product that generated highest revenue in this period?”

Amazon Bedrock Knowledge Bases structured data retrieval supports three different APIs to meet your data retrieval requirements (a minimal code sketch follows the list):

Retrieval and response generation – The retrieval and response generation API, similar to the solution workflow we’ve discussed, generates a SQL query, retrieves data through the query engine, and processes it through the LLM to generate a natural language response
Retrieval only – The retrieval only API generates a SQL query, retrieves data through the query engine, and returns the data without processing it through an LLM
Generate SQL queries – The generate SQL query API returns the raw SQL query that was generated by Amazon Bedrock Knowledge Bases, which can be used for review and further processing by applications
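
To make the mapping concrete, here is a minimal boto3 sketch (not from the original post) showing how the three APIs correspond to calls on the bedrock-agent-runtime client. The knowledge base ID, model ARN, Region, and sample question are placeholders, and the generate_query request shape in particular should be verified against the current SDK documentation.

import boto3

# Placeholders -- substitute your own values
KB_ID = "YOUR_KB_ID"
MODEL_ARN = "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-sonnet-20240229-v1:0"
QUESTION = "What is the sales revenue for the month of February 2025?"

client = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

# 1. Retrieval and response generation: SQL generation + query execution + LLM response
rag_response = client.retrieve_and_generate(
    input={"text": QUESTION},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {"knowledgeBaseId": KB_ID, "modelArn": MODEL_ARN},
    },
)
print(rag_response["output"]["text"])

# 2. Retrieval only: returns the query results without LLM post-processing
retrieve_response = client.retrieve(
    knowledgeBaseId=KB_ID,
    retrievalQuery={"text": QUESTION},
)
print(retrieve_response["retrievalResults"])

# 3. Generate SQL query only: returns the raw SQL for review before execution
sql_response = client.generate_query(
    queryGenerationInput={"text": QUESTION, "type": "TEXT"},
    transformationConfiguration={
        "mode": "TEXT_TO_SQL",
        "textToSqlConfiguration": {
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {"knowledgeBaseId": KB_ID},
        },
    },
)
print(sql_response["queries"])  # list of generated SQL statements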

The following screenshot shows the configuration options on the Amazon Bedrock console.

Code resources and templates
The solution uses the following notebooks:

Data ingestion notebook – Structured-rag-s3-glue-ingestion includes the step-by-step guide to ingest an open dataset to Amazon S3, configure AWS Glue tables using crawlers, and set up the Amazon Redshift Serverless query engine.
Structured data retrieval notebook – Structured-rag-s3-glue-retrieval walks through the implementation steps and provides sample code for configuring Amazon Bedrock Knowledge Bases structured data retrieval using Amazon S3, AWS Glue, and the Amazon Redshift query engine.

For more details, refer to the GitHub repo.
Prerequisites
To implement the solution provided in this post, you must have an AWS account. Additionally, access to the required foundation models must be enabled in Amazon Bedrock.
Set up the data ingestion pipeline
To set up the data ingestion pipeline, we load the sample dataset in an S3 bucket and configure AWS Glue as data storage and a Redshift Serverless workgroup as the query engine. Complete the following steps in the data ingestion notebook:

For data ingestion, download the following sample ecommerce dataset, convert it to a pandas data frame, and upload it to an S3 bucket using Amazon SageMaker Data Wrangler.
Create an AWS Glue database and table using an AWS Glue crawler by crawling the source S3 bucket with the dataset (a minimal boto3 sketch follows these steps). You can update this step to crawl your own S3 bucket or use your existing Data Catalog tables as storage metadata.
Use the data ingestion notebook to create a Redshift Serverless namespace and workgroup in the default VPC. If you plan to use your own Redshift Serverless workgroup or Amazon Redshift provisioned cluster, you can skip this step.
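
As an illustration of the AWS Glue crawler step above, the following minimal boto3 sketch creates a Glue database, defines a crawler over the source S3 bucket, and starts it. The bucket path, IAM role ARN, and database name are placeholders; the notebook's own code may differ.

import boto3

glue = boto3.client("glue")

# Placeholder names -- replace with your own bucket, role, and database
DATABASE_NAME = "ecommerce_db"
CRAWLER_NAME = "ecommerce-crawler"
S3_PATH = "s3://your-bucket/ecommerce-dataset/"
GLUE_ROLE_ARN = "arn:aws:iam::123456789012:role/GlueCrawlerRole"

# Create the Glue database that will hold the crawled table metadata
glue.create_database(DatabaseInput={"Name": DATABASE_NAME})

# Create and start a crawler that infers the table schema from the S3 dataset
glue.create_crawler(
    Name=CRAWLER_NAME,
    Role=GLUE_ROLE_ARN,
    DatabaseName=DATABASE_NAME,
    Targets={"S3Targets": [{"Path": S3_PATH}]},
)
glue.start_crawler(Name=CRAWLER_NAME)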

Set up the structured data retrieval solution
In this section, we detail the steps to set up the structured data retrieval component of the solution.
Amazon Bedrock Knowledge Bases supports multiple data access patterns, including AWS Identity and Access Management (IAM), AWS Secrets Manager, and database users. For this post, we demonstrate the setup option with IAM access. You can use IAM access with the Redshift Serverless workgroup configured as part of the ingestion workflow, or with an existing Redshift Serverless or provisioned cluster, to complete these steps.
Complete the following steps in the structured data retrieval notebook:

Create an execution role with the necessary policies for accessing data from Amazon Redshift, AWS Glue, and the S3 bucket.
Invoke the CreateKnowledgeBase API to create the knowledge base with the execution role and knowledge base configurations. In the knowledge base configuration, the AWS Glue database and tables are used as storage metadata with Amazon Redshift as the query engine.
After you create the knowledge base, you must complete additional steps to make sure the IAM execution role has the necessary permissions to execute the query in Amazon Redshift and retrieve data from AWS Glue. The notebook includes the necessary instructions to create and grant database access to the execution role, and grant AWS Lake Formation permissions.
The ingestion job syncs the data store schema metadata for the AWS Glue database and tables with the NL2SQL module. This schema metadata is used when generating the SQL query during structured data retrieval.
After the knowledge base sync job is complete, you can use the three data retrieval APIs – retrieve and generate response, retrieval only, and generate SQL query – to query and validate the structured data retrieval solution.

For more details, refer to Create a knowledge base by connecting to a structured data store.
Clean up
We have included cleanup instructions in both the data ingestion and structured data retrieval notebooks to clean up resources after the end-to-end solution is implemented and validated.
Conclusion
Amazon Bedrock Knowledge Bases simplifies data analysis by converting natural language questions into SQL queries, eliminating the need for specialized database expertise. The service integrates with Amazon Redshift, AWS Glue, and Amazon S3, allowing business analysts, data scientists, and operations teams to query data directly using conversation-like questions. It maintains data security through built-in governance controls and access permissions. Customers can deploy this managed service to enable users to analyze data using natural language questions, while maintaining data integrity and security standards.
To learn more, refer to Build a knowledge base by connecting to a structured data store and Amazon Bedrock Knowledge Bases now supports structured data retrieval.

About the authors
George Belsian is a Senior Cloud Application Architect at Amazon Web Services, helping organizations navigate the complexities of cloud adoption, AI integration, and data-driven innovation. By transforming legacy systems into cloud-based platforms and incorporating AI/ML capabilities, he helps businesses create new opportunities for growth, optimize their processes, and deliver scalable solutions.
Sandeep Singh is a Senior Generative AI Data Scientist at Amazon Web Services, helping businesses innovate with generative AI. He specializes in generative AI, machine learning, and system design. He has successfully delivered state-of-the-art AI/ML-powered solutions to solve complex business problems for diverse industries, optimizing efficiency and scalability.
Mani Khanuja is a Principal Generative AI Specialist SA and author of the book Applied Machine Learning and High-Performance Computing on AWS. She leads machine learning projects in various domains such as computer vision, natural language processing, and generative AI. She speaks at internal and external conferences such AWS re:Invent, Women in Manufacturing West, YouTube webinars, and GHC 23. In her free time, she likes to go for long runs along the beach.
Gopikrishnan Anilkumar is a Principal Technical Product Manager in AWS Agentic AI organization. He has over 10 years of product management experience across a variety of domains and is passionate about AI/ML.

Innovate business logic by implementing return of control in Amazon Be …

In the context of distributed systems and microservices architecture, orchestrating communication between diverse components presents significant challenges. However, with the launch of Amazon Bedrock Agents, the landscape is evolving, offering a simplified approach to agent creation and seamless integration of the return of control capability. In this post, we explore how Amazon Bedrock Agents revolutionizes agent creation and demonstrates the efficacy of the return of control capability in orchestrating complex interactions between multiple systems.
Amazon Bedrock Agents simplifies the creation, deployment, and management of agents in distributed systems. By using the power of AWS Lambda and AWS Step Functions, Amazon Bedrock Agents abstracts away the complexities of agent implementation, which means developers can focus on building robust and scalable applications without worrying about infrastructure management.
You can use agents in Amazon Bedrock in various scenarios where you need to handle the return of control to the user or the system. Use cases include conversational assistants, task automation, decision support systems, interactive tutorials and walkthroughs, and virtual assistants. In these use cases, the key aspect of the agents is their ability to handle the return of control to the user or the system. This allows for a more natural and responsive interaction, where the user feels in control of the process while still benefiting from the agent’s guidance and automation capabilities.
Solution overview
In this post, we demonstrate an automated personalized investment portfolio solution using Amazon Bedrock Agents. The solution calls a third-party API to fetch a user’s current investment portfolio. The portfolio is then analyzed using foundation models (FMs) available on Amazon Bedrock to produce recommendations aligned with the inputs provided by the end user, showcasing the return of control capability integrated with Amazon Bedrock Agents.
This solution uses a combination of synchronous data retrieval and generative AI to provide tailored investment recommendations that align with users’ specific financial goals and risk tolerance. By incorporating machine learning (ML) and simulation techniques, the system can generate personalized portfolios and assess their potential performance, making sure the recommended solutions are optimized for individual needs.
With Amazon Bedrock Agents, the capability to return control to the application invoking the agent can handle external functions and business logic at the application level instead of using a Lambda function. This way, an application can manage external interactions and return the response while the agent continues its orchestration. This is illustrated in the following diagram.

The option to return control is particularly useful in two main scenarios:

Calling an API from an existing application rather than building a new Lambda function with the required authentication and networking configurations
Handling tasks that might run longer than 15 minutes and can’t be accommodated through a Lambda function, instead requiring containers, virtual servers, or workflow orchestration tools such as AWS Step Functions

The following sample code demonstrates how to handle return of control with Amazon Bedrock Agents. With this feature, you can manage return of control in your backend services and simplify application integrations. To demonstrate this, we have the following four code snippets: external-bedrock-agent-api.py, streamlit-app-portfolio-recommender.py, Portfolio-Recommender-CFN-Template.yaml, and requirements.txt, along with detailed steps to replicate the scenario.
The external-bedrock-agent-api code implements a portfolio recommendation system using Amazon Bedrock Agents and Flask. Here’s a high-level overview of the functions used:

fetch_user_data: Processes user profile information such as risk tolerance or investment goals
generate_portfolios: Creates sample investment portfolios with different risk levels
fetch_custom_portfolio: Combines user data and portfolio generation
send_custom_portfolio_as_email: Sends portfolio recommendations by email using an Amazon Simple Email Service (Amazon SES) verified email identity
/sns-handler endpoint: This API endpoint receives POST requests with user investment preferences, processes the message containing user preference details, invokes the Amazon Bedrock agent to generate recommendations, and handles email communication of the recommendations (a simplified sketch of this flow follows the list)
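
To illustrate the return of control handling at the heart of the /sns-handler flow, here is a simplified boto3 sketch (not the repository's external-bedrock-agent-api.py). It invokes the agent, watches the event stream for a returnControl payload, runs the portfolio logic in the application, and passes the result back so the agent can finish its orchestration. The agent IDs are placeholders, the call to fetch_custom_portfolio is simplified, and the field handling assumes a single function invocation input; consult the repository code for the complete implementation.

import json
import uuid
import boto3

bedrock_agent_runtime = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

AGENT_ID = "YOUR_AGENT_ID"        # placeholder -- from the CloudFormation outputs
AGENT_ALIAS_ID = "YOUR_ALIAS_ID"  # placeholder

def invoke_with_return_of_control(user_preferences: dict) -> str:
    session_id = str(uuid.uuid4())
    response = bedrock_agent_runtime.invoke_agent(
        agentId=AGENT_ID,
        agentAliasId=AGENT_ALIAS_ID,
        sessionId=session_id,
        inputText=json.dumps(user_preferences),
    )

    final_text = ""
    for event in response["completion"]:
        if "returnControl" in event:
            # The agent pauses and hands control back to the application
            payload = event["returnControl"]
            invocation_id = payload["invocationId"]
            fn_input = payload["invocationInputs"][0]["functionInvocationInput"]
            params = {p["name"]: p["value"] for p in fn_input.get("parameters", [])}

            # Run the business logic at the application level (simplified signature)
            portfolio = fetch_custom_portfolio(params)

            # Send the result back so the agent can continue its orchestration
            followup = bedrock_agent_runtime.invoke_agent(
                agentId=AGENT_ID,
                agentAliasId=AGENT_ALIAS_ID,
                sessionId=session_id,
                sessionState={
                    "invocationId": invocation_id,
                    "returnControlInvocationResults": [{
                        "functionResult": {
                            "actionGroup": fn_input["actionGroup"],
                            "function": fn_input["function"],
                            "responseBody": {"TEXT": {"body": json.dumps(portfolio)}},
                        }
                    }],
                },
            )
            for followup_event in followup["completion"]:
                if "chunk" in followup_event:
                    final_text += followup_event["chunk"]["bytes"].decode("utf-8")
        elif "chunk" in event:
            final_text += event["chunk"]["bytes"].decode("utf-8")
    return final_text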

The streamlit-app-portfolio-recommender code is a Streamlit web application for investment portfolio recommendations. The code sets up the webpage with a title and configuration. The app collects several pieces of information through form elements:

Email address – Text input
Financial goal – Dropdown with options for retirement, wealth accumulation, and passive income
Risk tolerance – Dropdown with options for low, medium, and high
Investment horizon – Dropdown with options for short-term and long-term
Environmental, social, and governance (ESG) preference – Checkbox for environmental, social, and governance preferences
Email preference – Checkbox for receiving recommendations by email

The system operates through a portfolio generation function that sends POST requests to a local API endpoint. This function transforms the user preferences into JSON data and returns either the API response or an error message to the user.
Results are displayed when the user clicks the Submit button, which triggers the custom_portfolio function with their specific inputs. The system then displays the portfolio recommendation in a text area on success, and immediately alerts the user with an error message if any issues occur during the process.
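
A condensed sketch of such a Streamlit form is shown below; the real streamlit-app-portfolio-recommender.py includes more options and error handling, and the endpoint URL and payload field names here are placeholders.

import requests
import streamlit as st

st.set_page_config(page_title="Portfolio Recommender")
st.title("Personalized Investment Portfolio Recommender")

# Collect user preferences through form elements
email = st.text_input("Email address")
goal = st.selectbox("Financial goal", ["Retirement", "Wealth accumulation", "Passive income"])
risk = st.selectbox("Risk tolerance", ["Low", "Medium", "High"])
horizon = st.selectbox("Investment horizon", ["Short-term", "Long-term"])
esg = st.checkbox("ESG preference")
send_email = st.checkbox("Send recommendation by email")

if st.button("Submit"):
    payload = {
        "email": email,
        "financial_goal": goal,
        "risk_tolerance": risk,
        "investment_horizon": horizon,
        "esg_preference": esg,
        "send_email": send_email,
    }
    try:
        # POST the preferences to the local Flask API started earlier
        resp = requests.post("http://localhost:5000/sns-handler", json=payload, timeout=300)
        resp.raise_for_status()
        st.text_area("Portfolio recommendation", resp.text, height=300)
    except Exception as err:
        st.error(f"Request failed: {err}")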
Solution walkthrough
Follow the steps to set up the environment and test the application in the US East (N. Virginia) us-east-1 Region.
To enable Anthropic’s Claude model on Amazon Bedrock in your AWS account:

On the Amazon Bedrock console, in the left navigation pane under Amazon Bedrock configurations, select Model access
Select Claude 3 Sonnet, as shown in the following screenshot

To create the Amazon Bedrock agents, related action groups, Amazon SageMaker AI domain, sample user profile, and JupyterLab space, follow these steps:

Launch the AWS CloudFormation template at Portfolio-Recommender-CloudFormation-Template.yml
Give a name to the stack
Provide an email address for the EmailIdentityParameter

Select the checkbox to acknowledge that the template contains AWS Identity and Access Management (IAM) resources, as shown in the following screenshot

Monitor AWS CloudFormation until it completes the resource creation process. You can verify the successful deployment by checking the Outputs tab on the stack details page, which will display the AgentId and AgentAliasId values, as shown in the screenshot below.

You will receive an email address verification request from AWS for the US East (N. Virginia) Region. Select the link in the email to verify the address.
After creating your CloudFormation resources, follow these steps to access Amazon SageMaker Studio:

On the Amazon SageMaker AI console, under Admin configurations in the left navigation pane, select Domains
Select the bedrock-return-of-control-demo domain created by the CloudFormation template, as shown in the following screenshot

Select the User profiles tab
To open the SageMaker Studio environment, under User profiles, next to the sagemakeruser profile on the right, select Launch. From the dropdown menu, choose Studio, as shown in the following screenshot

You should now observe the SageMaker Studio home page. This environment is where you will execute Python scripts to set up your application.
To access the JupyterLab environment for this lab, follow these steps:

On the SageMaker Studio console, in the left navigation pane under Applications, select JupyterLab
You’ll find bedrock-agent-space that has been preprovisioned for this lab. Its Status should be Stopped. On the right side under Action, choose Run
Within 30–40 seconds, the JupyterLab application status will change from Starting to Running

When it’s running, under Action, choose Open, as shown in the following screenshot

Three required files are copied under the /home/sagemaker-user/scripts directory: two Python files (external-bedrock-agent-api and streamlit-app-portfolio-recommender) and one requirements.txt file, as shown in the following screenshot. The JupyterLab application environment is under the default directory.

In the File menu, select New. In the dropdown menu, select Terminal to open a new terminal window, as shown in the following screenshot.
In the terminal, go to the scripts directory containing the required files and enter:

pip install -r requirements.txt

Enter the following command on the terminal:

python3 external-bedrock-agent-api.py

Open a new terminal and go to the /home/sagemaker-user/scripts directory and enter:

streamlit run streamlit-app-portfolio-recommender.py

From the command output in the terminal, note the port number (8501), and note the Studio URL from the browser. The URL will be in the format: https://{domainid}.studio.{region}.sagemaker.aws/jupyterlab/default/lab/tree/scripts
To access the Streamlit app, modify the Studio URL, replacing everything after default/ (that is, lab/tree/scripts) with proxy/[PORT NUMBER]/. The modified Streamlit UI URL will look like this: https://{domainid}.studio.{region}.sagemaker.aws/jupyterlab/default/proxy/8501/
Select all appropriate inputs for generating your custom portfolio recommendation. Choose whether you prefer to receive email notifications or inline recommendations through the application interface by checking the corresponding box. Then choose Submit. Provide the same email address that was verified earlier in this walkthrough.

The sample output and email response are shown in the following demo screenshot.

Cleanup
When you’re done, delete resources you no longer need to avoid ongoing costs. Follow these steps:

Go to the SageMaker AI JupyterLab environment and stop the Amazon SageMaker Studio application or running instance
Delete the CloudFormation stack to remove the resources it created.

The following screenshot demonstrates how to view and stop running instances in the SageMaker AI JupyterLab environment. For more information, refer to Delete a stack from the CloudFormation console.

Amazon Bedrock Agents return of control considerations
When implementing return of control, consider the following:

Return of control performance considerations – When implementing return of control, developers should focus on optimizing action execution times and response handling. Each action should be designed to complete within reasonable timeframes to maintain conversation flow. Consider implementing caching mechanisms for frequently accessed data and facilitate efficient state management between return of control cycles. The application should be designed to handle concurrent user sessions effectively while maintaining responsiveness.
Return of control limitations – Actions must be defined with clear input and output schemas. Each action should be atomic and focused on a specific task to maintain simplicity and reliability. Consider payload sizes for requests and responses because there might be size limitations. Actions execute sequentially, and the system needs to maintain conversation context throughout the interaction cycle.
Security recommendations – Security implementation requires proper authentication and authorization mechanisms for all actions, following the principle of least privilege when defining permissions. Input parameters must be validated before processing, with comprehensive error handling in place. Rate limiting and request validation should be implemented to prevent abuse, and sensitive data handling must comply with security requirements and include proper logging mechanisms for audit trails. Additionally, implement input filtering to prevent prompt injection attacks, configure response filters to protect sensitive information, and set up content scanning for both input and output. Deploy regex-based response filtering to help prevent personally identifiable information (PII) exposure and establish content moderation filters to block inappropriate content.
Monitoring and observability – Implement comprehensive logging for all action executions and responses. Monitor key metrics such as action execution times, success rates, and error rates. Set up alerts for abnormal patterns or failures. Use Amazon CloudWatch for monitoring system health and performance. Consider implementing tracing to track request flow through different components of your system. Regular review of metrics and logs helps identify potential issues and optimization opportunities.

Conclusion
In this post, we’ve demonstrated how Amazon Bedrock Agents simplifies agent creation and streamlines the orchestration of complex interactions between microservices using the return of control capability. By abstracting away infrastructure management and providing seamless integration with your application, Amazon Bedrock Agents empowers developers to build resilient and scalable applications with ease. As organizations embrace microservices architecture and distributed systems, tools such as Amazon Bedrock Agents play a pivotal role in accelerating innovation and driving digital transformation.
Resources
For the most current and specific information, refer to:

Amazon Bedrock documentation
AWS Well-Architected Framework best practices
AWS Security best practices
AWS observability best practices

About the Authors
Vishwanatha Handadi is a Sr. Solutions Architect within the Global Financial Services vertical, working with Amazon Web Services (AWS) for over 2 years and has over 22 years of experience in the IT industry primarily in data and analytics. At AWS, he drives customers through their cloud transformation journeys by converting complex challenges into actionable roadmaps for both technical and business audiences. He is based out of Bangalore, India.
Mohammed Asadulla Baig is a Sr. Technical Account Manager with Amazon Web Services (AWS) Enterprise Support. Asad helps customers architect scalable, resilient, and secure solutions. With a keen eye for innovation and a passion for delivering customer success, Asad has established himself as a thought leader in the industry, helping enterprises navigate their cloud transformation journeys with confidence and ease.

OThink-R1: A Dual-Mode Reasoning Framework to Cut Redundant Computatio …

The Inefficiency of Static Chain-of-Thought Reasoning in LRMs

Recent large reasoning models (LRMs) achieve top performance by using detailed chain-of-thought (CoT) reasoning to solve complex tasks. However, many of the simple tasks they handle could be solved by smaller models with fewer tokens, making such elaborate reasoning unnecessary. This echoes human thinking, where we use fast, intuitive responses for easy problems and slower, analytical thinking for complex ones. While LRMs mimic slow, logical reasoning, they generate significantly longer outputs, thereby increasing computational cost. Current methods for reducing reasoning steps lack flexibility, limiting models to a single fixed reasoning style. There is a growing need for adaptive reasoning that adjusts effort according to task difficulty.

Limitations of Existing Training-Based and Training-Free Approaches

Recent research on improving reasoning efficiency in LRMs can be categorized into two main areas: training-based and training-free methods. Training strategies often use reinforcement learning or fine-tuning to limit token usage or adjust reasoning depth, but they tend to follow fixed patterns without flexibility. Training-free approaches utilize prompt engineering or pattern detection to shorten outputs during inference; however, they also lack adaptability. More recent work focuses on variable-length reasoning, where models adjust reasoning depth based on task complexity. Others study “overthinking,” where models over-reason unnecessarily. However, few methods enable dynamic switching between quick and thorough reasoning—something this paper addresses directly. 

Introducing OThink-R1: Dynamic Fast/Slow Reasoning Framework

Researchers from Zhejiang University and OPPO have developed OThink-R1, a new approach that enables LRMs to switch between fast and slow thinking smartly, much like humans do. By analyzing reasoning patterns, they identified which steps are essential and which are redundant. With help from another model acting as a judge, they trained LRMs to adapt their reasoning style based on task complexity. Their method reduces unnecessary reasoning by over 23% without losing accuracy. Using a loss function and fine-tuned datasets, OThink-R1 outperforms previous models in both efficiency and performance on various math and question-answering tasks. 

System Architecture: Reasoning Pruning and Dual-Reference Optimization

The OThink-R1 framework helps LRMs dynamically switch between fast and slow thinking. First, it identifies when LRMs include unnecessary reasoning, like overexplaining or double-checking, versus when detailed steps are truly essential. Using this, it builds a curated training dataset by pruning redundant reasoning and retaining valuable logic. Then, during fine-tuning, a special loss function balances both reasoning styles. This dual-reference loss compares the model’s outputs with both fast and slow thinking variants, encouraging flexibility. As a result, OThink-R1 can adaptively choose the most efficient reasoning path for each problem while preserving accuracy and logical depth. 
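
As a rough sketch (not necessarily the paper's exact formulation), a dual-reference objective of this kind can be written as a fine-tuning loss on the pruned data plus two KL terms that keep the policy anchored to both the fast and slow reasoning references:

\[
\mathcal{L}(\theta) \;=\; \mathcal{L}_{\mathrm{SFT}}(\theta) \;+\; \alpha\,\mathrm{KL}\!\left(\pi_{\theta} \,\|\, \pi_{\mathrm{fast}}\right) \;+\; \beta\,\mathrm{KL}\!\left(\pi_{\theta} \,\|\, \pi_{\mathrm{slow}}\right)
\]

Here \(\alpha\) and \(\beta\) control how strongly the model is pulled toward concise versus fully elaborated reasoning.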

Empirical Evaluation and Comparative Performance

The OThink-R1 model was tested on simpler QA and math tasks to evaluate its ability to switch between fast and slow reasoning. Using datasets like OpenBookQA, CommonsenseQA, ASDIV, and GSM8K, the model demonstrated strong performance, generating fewer tokens while maintaining or improving accuracy. Compared to baselines such as NoThinking and DualFormer, OThink-R1 demonstrated a better balance between efficiency and effectiveness. Ablation studies confirmed the importance of pruning, KL constraints, and LLM-Judge in achieving optimal results. A case study illustrated that unnecessary reasoning can lead to overthinking and reduced accuracy, highlighting OThink-R1’s strength in adaptive reasoning. 

Conclusion: Towards Scalable and Efficient Hybrid Reasoning Systems

In conclusion, OThink-R1 is a large reasoning model that adaptively switches between fast and slow thinking modes to improve both efficiency and performance. It addresses the issue of unnecessarily complex reasoning in large models by analyzing and classifying reasoning steps as either essential or redundant. By pruning the redundant ones while maintaining logical accuracy, OThink-R1 reduces unnecessary computation. It also introduces a dual-reference KL-divergence loss to strengthen hybrid reasoning. Tested on math and QA tasks, it cuts down reasoning redundancy by 23% without sacrificing accuracy, showing promise for building more adaptive, scalable, and efficient AI reasoning systems in the future. 

Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.
The post OThink-R1: A Dual-Mode Reasoning Framework to Cut Redundant Computation in LLMs appeared first on MarkTechPost.

Building AI-Powered Applications Using the Plan → Files → Code Wor …

In this tutorial, we introduce the TinyDev class implementation, a minimal yet powerful AI code generation tool that uses the Gemini API to transform simple app ideas into comprehensive, structured applications. Designed to run effortlessly in a notebook environment, TinyDev follows a clean three-phase workflow (Plan → Files → Code) to ensure consistency, functionality, and modular design. Whether building a web interface, a Python backend, or a utility script, TinyDev lets users describe their project in natural language and receive ready-to-run code files, automatically generated and saved in an organized directory. This makes it an ideal starting point for rapid prototyping or for learning how AI can assist in development tasks.

import google.generativeai as genai
import os
import json
import re
from pathlib import Path
from typing import List, Dict

We begin by importing essential libraries required for the TinyDev code generator. google.generativeai is used to interact with the Gemini API, while standard libraries like os, json, and re support file handling and text processing. Path and type hints from typing ensure clean file operations and better code readability.

class TinyDev:
    """
    TinyDev: A lightweight AI code generator inspired by smol-dev
    Uses Gemini API to generate complete applications from simple prompts
    Follows the proven three-phase workflow: Plan → Files → Code
    """

    def __init__(self, api_key: str, model: str = "gemini-1.5-flash"):
        genai.configure(api_key=api_key)
        self.model = genai.GenerativeModel(model)
        self.generation_config = {
            'temperature': 0.1,
            'top_p': 0.8,
            'max_output_tokens': 8192,
        }

    def plan(self, prompt: str) -> str:
        """
        Phase 1: Generate project plan and shared dependencies
        Creates the foundation for consistent code generation
        """
        planning_prompt = f"""As an AI developer, you're building a tool that automatically generates code tailored to the user's needs.

the program you are writing is based on the following description:
{prompt}

the files we write will be generated by a python script. the goal is for us to all work together to write a program that will write the code for the user.

since we are working together, we need to understand what our shared dependencies are. this includes:
- import statements we all need to use
- variable names that are shared between files
- functions that are called from one file to another
- any other shared state

this is the most critical part of the process, if we don't get this right, the generated code will not work properly.

please output a markdown file called shared_dependencies.md that lists all of the shared dependencies.

the dependencies should be organized as:
1. shared variables (globals, constants)
2. shared functions (function signatures)
3. shared classes (class names and key methods)
4. shared imports (modules to import)
5. shared DOM element ids (if web project)
6. shared file paths/names

be EXHAUSTIVE in your analysis. every file must be able to import or reference these shared items."""

        response = self.model.generate_content(
            planning_prompt,
            generation_config=self.generation_config
        )
        return response.text

    def specify_file_paths(self, prompt: str, shared_deps: str) -> List[str]:
        """
        Phase 2: Determine what files need to be created
        """
        files_prompt = f"""As an AI developer, you're building a tool that automatically generates code tailored to the user's needs.

the program:
{prompt}

the shared dependencies:
{shared_deps}

Based on the program description and shared dependencies, return a JSON array of the filenames that should be written.

Only return the JSON array, nothing else. The JSON should be an array of strings representing file paths.

For example, for a simple web app you might return:
["index.html", "styles.css", "script.js"]

For a Python project you might return:
["main.py", "utils.py", "config.py", "requirements.txt"]

JSON array:"""

        response = self.model.generate_content(
            files_prompt,
            generation_config=self.generation_config
        )

        try:
            # Extract the JSON array from the model's response
            json_match = re.search(r'\[.*?\]', response.text, re.DOTALL)
            if json_match:
                files = json.loads(json_match.group())
                return [f for f in files if isinstance(f, str)]
            else:
                # Fallback: scrape filename-like tokens from the raw text
                lines = [line.strip() for line in response.text.split('\n') if line.strip()]
                files = []
                for line in lines:
                    if '.' in line and not line.startswith('#'):
                        file = re.sub(r'[^\w\-_./]', '', line)
                        if file:
                            files.append(file)
                return files[:10]
        except Exception as e:
            print(f"Error parsing files: {e}")
            return ["main.py", "README.md"]

    def generate_code_sync(self, prompt: str, shared_deps: str, filename: str) -> str:
        """
        Phase 3: Generate code for individual files
        """
        code_prompt = f"""As an AI developer, you're building a tool that automatically generates code tailored to the user's needs.

the program:
{prompt}

the shared dependencies:
{shared_deps}

Please write the file {filename}.

Remember that your job is to write the code for {filename} ONLY. Do not write any other files.

the code should be fully functional. meaning:
- all imports should be correct
- all variable references should be correct
- all function calls should be correct
- the code should be syntactically correct
- the code should be logically correct

Make sure to implement every part of the functionality described in the program description.

DO NOT include ``` code fences in your response. Return only the raw code.

Here is the code for {filename}:"""

        response = self.model.generate_content(
            code_prompt,
            generation_config=self.generation_config
        )

        # Strip any markdown code fences the model may have added anyway
        code = response.text
        code = re.sub(r'^```[\w]*\n', '', code, flags=re.MULTILINE)
        code = re.sub(r'\n```$', '', code, flags=re.MULTILINE)

        return code.strip()

    def create_app(self, prompt: str, output_dir: str = "/content/generated_app") -> Dict:
        """
        Main workflow: Transform a simple prompt into a complete application
        """
        print(f" TinyDev workflow starting...")
        print(f" Prompt: {prompt}")

        print("\n Step 1: Planning shared dependencies...")
        shared_deps = self.plan(prompt)
        print(" Dependencies planned")

        print("\n Step 2: Determining file structure...")
        file_paths = self.specify_file_paths(prompt, shared_deps)
        print(f" Files to generate: {file_paths}")

        Path(output_dir).mkdir(parents=True, exist_ok=True)

        print(f"\n Step 3: Generating {len(file_paths)} files...")
        results = {
            'prompt': prompt,
            'shared_deps': shared_deps,
            'files': {},
            'output_dir': output_dir
        }

        with open(Path(output_dir) / "shared_dependencies.md", 'w') as f:
            f.write(shared_deps)

        for filename in file_paths:
            print(f" Generating {filename}...")
            try:
                code = self.generate_code_sync(prompt, shared_deps, filename)

                file_path = Path(output_dir) / filename
                file_path.parent.mkdir(parents=True, exist_ok=True)

                with open(file_path, 'w', encoding='utf-8') as f:
                    f.write(code)

                results['files'][filename] = code
                print(f" {filename} created ({len(code)} chars)")

            except Exception as e:
                print(f" Error generating {filename}: {e}")
                results['files'][filename] = f"# Error: {e}"

        readme = f"""# Generated by TinyDev (Gemini-Powered)

## Original Prompt
{prompt}

## Generated Files
{chr(10).join(f'- {f}' for f in file_paths)}

## About TinyDev
TinyDev is inspired by smol-ai/developer but uses free Gemini API.
It follows the proven three-phase workflow: Plan → Files → Code

## Usage
Check individual files for specific usage instructions.

Generated on: {os.popen('date').read().strip()}
"""

        with open(Path(output_dir) / "README.md", 'w') as f:
            f.write(readme)

        print(f"\n Complete! Generated {len(results['files'])} files in {output_dir}")
        return results

The TinyDev class encapsulates the full logic of an AI-powered code generator using the Gemini API. It implements a structured three-phase workflow: first, it analyzes the user prompt to generate shared dependencies (plan); next, it identifies which files are needed for the application (specify_file_paths); and finally, it generates functional code for each file individually (generate_code_sync). The create_app method brings everything together by orchestrating the full app generation pipeline and saving the results, including code files and a detailed README, into a specified output directory, offering a complete, ready-to-use application scaffold from a single prompt.

def demo_tinydev():
    """Demo the TinyDev code generator"""

    api_key = "YOUR_GEMINI_API_KEY_HERE"  # replace with your Gemini API key

    if api_key == "YOUR_GEMINI_API_KEY_HERE":
        print(" Please set your Gemini API key!")
        print("Get one free at: https://makersuite.google.com/app/apikey")
        return None

    tiny_dev = TinyDev(api_key)

    demo_prompts = [
        "a simple HTML/JS/CSS tic tac toe game",
        "a Python web scraper that gets the latest news from multiple sources",
        "a responsive landing page for a local coffee shop with contact form",
        "a Flask REST API for managing a todo list",
        "a JavaScript calculator with a modern UI"
    ]

    print(" TinyDev - AI Code Generator")
    print("=" * 50)
    print("Inspired by smol-ai/developer, powered by Gemini API")
    print(f"Available demo projects:")

    for i, prompt in enumerate(demo_prompts, 1):
        print(f"{i}. {prompt}")

    demo_prompt = demo_prompts[0]
    print(f"\n Running demo: {demo_prompt}")

    try:
        results = tiny_dev.create_app(demo_prompt)

        print(f"\n Results Summary:")
        print(f" Prompt: {results['prompt']}")
        print(f" Output: {results['output_dir']}")
        print(f" Files: {len(results['files'])}")

        print(f"\n Generated Files:")
        for filename in results['files'].keys():
            print(f" - {filename}")

        if results['files']:
            preview_file = list(results['files'].keys())[0]
            preview_code = results['files'][preview_file]
            print(f"\n Preview of {preview_file}:")
            print("-" * 40)
            print(preview_code[:400] + "..." if len(preview_code) > 400 else preview_code)
            print("-" * 40)

        print(f"\n This uses the same proven workflow as smol-ai/developer!")
        print(f" Check {results['output_dir']} for all generated files")

        return results

    except Exception as e:
        print(f" Demo failed: {e}")
        return None

The demo_tinydev() function showcases TinyDev’s capabilities by running a predefined demo using one of several sample prompts, such as generating a Tic Tac Toe game or a Python news scraper. It initializes the TinyDev class with a Gemini API key, selects the first prompt from a list of project ideas, and guides the user through the full code generation pipeline, including planning shared dependencies, defining file structure, and generating code. After execution, it summarizes the output, previews a sample file, and points to the directory where the complete app has been saved.

def interactive_tinydev():
    """Interactive version where you can try your own prompts"""
    api_key = input(" Enter your Gemini API key: ").strip()

    if not api_key:
        print(" API key required!")
        return

    tiny_dev = TinyDev(api_key)

    print("\n Interactive TinyDev Mode")
    print("Type your app ideas and watch them come to life!")

    while True:
        prompt = input("\n Describe your app (or 'quit'): ").strip()

        if prompt.lower() in ['quit', 'exit', 'q']:
            print(" Goodbye!")
            break

        if prompt:
            try:
                results = tiny_dev.create_app(prompt, f"/content/app_{hash(prompt) % 10000}")
                print(f" Success! Check {results['output_dir']}")
            except Exception as e:
                print(f" Error: {e}")


print(" TinyDev - AI Code Generator Ready!")
print("Inspired by smol-ai/developer, powered by free Gemini API")
print("\nTo run demo: demo_tinydev()")
print("To try interactive mode: interactive_tinydev()")

The interactive_tinydev() function allows users to generate applications from their custom prompts in real time. After entering a valid Gemini API key, users can describe any app idea, and TinyDev will develop the complete project, code, structure, and supporting files automatically. The process continues in a loop until the user types ‘quit’. This interactive mode enables hands-on experimentation and rapid prototyping from natural language descriptions.

demo_tinydev()

Finally, calling demo_tinydev() runs a predefined demonstration of TinyDev using a sample app prompt. It walks through the full workflow, planning, file structure creation, and code generation, to showcase how the tool automatically builds a complete application from a simple idea.

In conclusion, the TinyDev class demonstrates the potential of using AI to automate application scaffolding with remarkable accuracy and efficiency. By breaking down the code generation process into intuitive phases, it ensures that outputs are logically sound, well-structured, and aligned with the user’s intent. Whether you’re exploring new app ideas or seeking to accelerate development, TinyDev provides a lightweight and user-friendly solution powered by the Gemini models. It’s a practical tool for developers looking to integrate AI into their workflow without unnecessary complexity or overhead.

Check out the Notebook here. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.
The post Building AI-Powered Applications Using the Plan → Files → Code Workflow in TinyDev appeared first on MarkTechPost.

AI-Generated Ad Created with Google’s Veo3 Airs During NBA Finals, S …

A lone AI filmmaker, a cutting-edge generative video model, and a national TV spot during one of the year’s biggest sporting events. This isn’t the plot of a sci-fi movie; it’s the new reality of advertising, and it was created in just 3 days.

TLDR:

First of its Kind: An AI-generated commercial for the events-betting platform Kalshi was created using Google’s generative video model, Veo3, and aired nationally during the NBA Finals.

Drastic Cost Reduction: AI filmmaker PJ Accetturo created the 15-clip ad in just 3 days, resulting in an estimated 95% cost reduction compared to traditional commercial production; the ad cost about $2,000.

The “Gemini-to-Veo3” Workflow: The ad was produced using a simple but powerful 4-step process: rough script, AI-powered prompt generation with Gemini, video generation with Veo3, and final editing in standard software like Adobe Premiere.

The Future is Agile: The project signals a shift towards smaller, highly skilled creative teams leveraging AI to produce high-volume, brand-adjacent content quickly and affordably.

Human Skill is Still Key: Despite the technological leap, the creator emphasizes that professional taste, directorial experience, and, most importantly, comedy writing are the new moats for creatives in the age of AI, and I agree!

In a landmark moment for both advertising and AI, a commercial generated almost entirely by AI aired on national television this week during the NBA Finals. The ad, for the event-betting market Kalshi, was the brainchild of self-described “AI Filmmaker” PJ Accetturo, who was hired to produce a spot about people placing wagers on everything from sports to current events.

Ad is here.

The result is a testament to the rapidly advancing capabilities of generative video technology. In a detailed post on X, Accetturo unveiled the shockingly efficient process behind the ad, which leverages Google’s powerful new text-to-video model, Veo3. This achievement, coming just weeks after Veo3’s public debut, underscores the breakneck speed at which AI is being adopted for high-stakes commercial use.

The AI-Powered Creative Workflow: From Script to Screen in Days

Accetturo’s process, which he claims has generated over 30 million views across various projects in just three weeks, is a masterclass in human-AI collaboration. He breaks down his viral video workflow into four simple steps:

Write a Rough Script: The process begins with a foundational creative idea.

Use Gemini for Prompts: Google’s Gemini is used to flesh out the script into a detailed shot list and generate specific prompts for the video model.

Generate with Veo3: The prompts are fed into Veo3 (via Google Flow) to generate the raw video clips.

Edit in CapCut/Premiere: The final AI-generated clips are assembled, timed, and polished using industry-standard video editing software.
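
As a hedged illustration of the “Use Gemini for Prompts” step (not Accetturo’s actual code), the prompt-expansion stage can be scripted with the google.generativeai SDK; the model name, shot list, and wording of the instruction are assumptions, and the resulting prompts would then be pasted into Veo3 via Flow to generate clips.

import google.generativeai as genai

genai.configure(api_key="YOUR_GEMINI_API_KEY")  # placeholder
model = genai.GenerativeModel("gemini-1.5-pro")  # assumed model name

# Hypothetical rough shot list drafted by the filmmaker
rough_script = [
    "An older man in a foam cowboy hat struts down a Miami strip at night and shouts a line about Indiana.",
    "A jubilant fan celebrates a winning wager on a rooftop at sunset.",
]

# Ask Gemini to expand each shot into a detailed, self-contained Veo prompt
# (5 at a time, per the workflow described above, to keep quality high)
request = (
    "Convert each of the following shots into a detailed, standalone text-to-video prompt "
    "for a generative video model. Include camera style, setting, character description, "
    "and exact spoken dialogue. Return at most 5 prompts.\n\n"
    + "\n".join(f"- {shot}" for shot in rough_script)
)
response = model.generate_content(request)
print(response.text)  # copy these prompts into Veo3 (via Flow) to generate the clips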

Co-Writing with a Machine: Crafting the Vision

For the Kalshi spot, the creative process began by establishing a few key dialogue snippets. From there, Accetturo collaborated with Gemini to invent what he called “10 wild characters in unhinged situations” to deliver the lines.

“I co-write with Gemini,” Accetturo explains, “asking it for ideas, picking the best ones, and shaping them into a simple script.” This partnership allows for rapid ideation, blending human creative direction with the boundless imagination of a large language model.

Prompting is the New Directing

The critical translation layer between script and screen is the prompt. Accetturo has refined a specific method for this, using Gemini to convert each shot from the script into a highly detailed paragraph for Veo.

“I then ask Gemini to take the script and convert every shot into a detailed Veo prompt,” he notes. “I always tell it to return 5 prompts at a time—any more than that and the quality starts to slip.”

The key, he stresses, is to treat each prompt as a standalone instruction, providing Veo with the full context every single time to maintain consistency in character, setting, and tone.

Here’s an example of a detailed prompt used for the ad:

A handheld medium-wide shot, filmed like raw street footage on a crowded Miami strip at night. An old white man in his late 60s struts confidently down the sidewalk, surrounded by tourists and clubgoers. He’s grinning from ear to ear, his belly proudly sticking out from a cropped pink T-shirt. He wears extremely short neon green shorts, white tube socks, beat-up sneakers, and a massive foam cowboy hat with sequins on it…As he walks, he turns slightly toward the camera, still mid-strut, and shouts with full confidence and joy: ‘Indiana got that dog in ’em!’

Tips, Tricks, and the Bottom Line

Working with Veo3 requires a few insider tricks. Accetturo advises running prompts in “fast mode” and iterating quickly. If a generation is not perfect, he suggests pasting the original prompt back into Gemini, asking for specific changes, and trying again. He also notes a few current quirks, like occasional unwanted subtitles and the need to use phrases like “screaming at the top of their lungs” or all-caps dialogue to elicit yelling from the AI characters.

While Veo3 does not yet support consistent characters across multiple shots, highly detailed descriptions in each prompt can create a strong illusion of continuity.

The most staggering statistic from this experiment is the efficiency. “This took about 300–400 generations to get 15 usable clips,” Accetturo writes. “One person, 2-3 days. That’s a 95% cost reduction vs traditional ads.”

The Future of Advertising and the Creative “Moat”

Accetturo is quick to point out that this revolution doesn’t spell the end of creative professionals. “Just because this was cheap doesn’t mean anyone can do it,” he states. “I’ve been a director [for] 15+ years. Brands still pay a premium for taste.”

He envisions a future dominated by small, agile teams producing “viral, brand-adjacent content weekly,” achieving 80-90% of the impact for a fraction of the cost.

So, what is the defensible skill for filmmakers and advertisers in this new paradigm? According to Accetturo, it’s not technical prowess but creative instinct. I agree, and this is a good proxy for a human + AI future.

“The most valuable skill in entertainment and advertising is comedy writing,” he concludes. “If you can make people laugh, they’ll watch the full ad, engage with it, and some of them will become customers.”

The Kalshi NBA Finals ad is more than just a clever commercial; it’s a dispatch from the future of media creation—a future that is arriving fast.

“AI Filmmaker” PJ Accetturo’s newsletter is here: https://pjace.beehiiv.com
The post AI-Generated Ad Created with Google’s Veo3 Airs During NBA Finals, Slashing Production Costs by 95% appeared first on MarkTechPost.

Internal Coherence Maximization (ICM): A Label-Free, Unsupervised Trai …

Post-training methods for pre-trained language models (LMs) depend on human supervision through demonstrations or preference feedback to specify desired behaviors. However, this approach faces critical limitations as tasks and model behaviors become very complex. Human supervision is unreliable in these scenarios because LMs learn to mimic mistakes in demonstrations or exploit inherent flaws in feedback systems. The core challenge lies in training LMs for tasks where humans cannot reliably provide demonstrations or evaluations. Recent research has identified diverse failure modes, including reward-hacking of human-designed supervision signals or of the human evaluators themselves.

Limitations of Human Supervision in LLM Post-Training

Researchers have explored several approaches to scale beyond human supervision. One standard method utilizes high-quality verifiable rewards, such as matching model outputs with ground-truth solutions in mathematical domains. Despite evidence that pre-trained base models have strong latent capabilities for downstream tasks, with post-training adding minimal improvements, effective elicitation remains challenging. The Contrast Consistent Search (CCS) method is an unsupervised elicitation approach that uses logical consistency to find latent knowledge without supervision. However, CCS underperforms supervised approaches and often fails to identify knowledge due to other prominent features satisfying consistency properties.

Introducing Internal Coherence Maximization (ICM)

Researchers from Anthropic, Schmidt Sciences, Independent, Constellation, New York University, and George Washington University have proposed Internal Coherence Maximization (ICM), which fine-tunes pre-trained models on their own generated labels without using any provided labels. ICM solves this by searching for label sets that are both logically consistent and mutually predictable according to the pre-trained model. Since optimal label set identification remains computationally infeasible, ICM uses a simulated annealing-inspired search algorithm to approximate the maximum objective. Moreover, this method matches the performance of training on golden labels on TruthfulQA and GSM8K, and outperforms training on crowdsourced human labels on Alpaca.

How the ICM Algorithm Works

The ICM algorithm follows an iterative three-step process: (a) the system samples a new unlabeled example from the dataset for potential inclusion, (b) it determines the optimal label for this example while simultaneously resolving any logical inconsistencies, and (c) the algorithm evaluates whether to accept this new labeled example based on the scoring function. ICM is evaluated across three datasets: TruthfulQA for truthfulness assessment, GSM8K-verification for mathematical correctness, and Alpaca for helpfulness and harmlessness. The researchers used four baselines in their experiments: Zero-shot, Zero-shot (Chat), Golden Label, and Human Label. Experiments used two open-weight models, Llama 3.1 8B and 70B, and two proprietary models, Claude 3 Haiku and Claude 3.5 Haiku.
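To make the search loop concrete, here is a minimal, hypothetical sketch of an annealing-style search in the spirit of the three steps above. The scoring and consistency functions are placeholders for the paper's actual components (mutual predictability under the pre-trained model plus a logical-consistency term), and names such as score and resolve_inconsistencies are illustrative rather than the authors' API.

import math
import random

def icm_search(examples, score, resolve_inconsistencies, steps=1000, t0=5.0, cooling=0.99):
    """Annealing-style search for a mutually predictable, consistent label set (sketch)."""
    labeled = {}                     # example -> proposed label
    temperature = t0
    for _ in range(steps):
        remaining = [e for e in examples if e not in labeled]
        if not remaining:
            break
        x = random.choice(remaining)                 # (a) sample a new unlabeled example
        best = None
        for label in (True, False):                  # (b) try each candidate label...
            candidate = resolve_inconsistencies({**labeled, x: label})
            if best is None or score(candidate) > score(best):
                best = candidate                     # ...keeping the higher-scoring labeling
        delta = score(best) - score(labeled)
        # (c) accept improvements, or worse labelings with a temperature-dependent probability
        if delta > 0 or random.random() < math.exp(delta / temperature):
            labeled = best
        temperature *= cooling                       # gradually become greedier
    return labeled

In the actual method, the resulting label set is then used to fine-tune the model or to train a reward model, as described in the results below.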

Benchmark Performance and Model Comparisons

In superhuman capability elicitation tasks, ICM matches golden supervision accuracy at 80%, outperforming the estimated human accuracy of 60%. Using ICM-generated reward models, researchers successfully trained an assistant chatbot without human supervision. The unsupervised reward model achieves 75.0% accuracy on RewardBench, compared to 72.2% for human-supervised alternatives trained on production data. Moreover, using both the unsupervised and human-supervised RM, two policies are trained with RL to create helpful, harmless, and honest assistants. The policy trained with the unsupervised RM achieves a 60% win rate. However, these policies still lag behind the publicly released Claude 3.5 Haiku, which achieves 92% win rates.

Conclusion and Future Outlook

This paper introduces Internal Coherence Maximization (ICM), an advancement in unsupervised elicitation that fine-tunes pre-trained models on self-generated labels. The method consistently matches golden supervision performance and surpasses crowdsourced human supervision across GSM8K-verification, TruthfulQA, and Alpaca reward modeling tasks. However, ICM’s limitations include dependency on concept salience within pre-trained models and ineffectiveness with long inputs due to context window constraints. As LMs advance beyond human evaluation capabilities, ICM offers a promising alternative to traditional RLHF, aligning models with human intent without the limits of human supervision.

Check out the Paper. All credit for this research goes to the researchers of this project.
The post Internal Coherence Maximization (ICM): A Label-Free, Unsupervised Training Framework for LLMs appeared first on MarkTechPost.

MemOS: A Memory-Centric Operating System for Evolving and Adaptive Lar …

LLMs are increasingly seen as key to achieving Artificial General Intelligence (AGI), but they face major limitations in how they handle memory. Most LLMs rely on fixed knowledge stored in their weights and short-lived context during use, making it hard to retain or update information over time. Techniques like RAG attempt to incorporate external knowledge but lack structured memory management. This leads to problems such as forgetting past conversations, poor adaptability, and isolated memory across platforms. Fundamentally, today’s LLMs don’t treat memory as a manageable, persistent, or sharable system, limiting their real-world usefulness. 

To address the limitations of memory in current LLMs, researchers from MemTensor (Shanghai) Technology Co., Ltd., Shanghai Jiao Tong University, Renmin University of China, and the Research Institute of China Telecom have developed MemOS. This memory operating system makes memory a first-class resource in language models. At its core is MemCube, a unified memory abstraction that manages parametric, activation, and plaintext memory. MemOS enables structured, traceable, and cross-task memory handling, allowing models to adapt continuously, internalize user preferences, and maintain behavioral consistency. This shift transforms LLMs from passive generators into evolving systems capable of long-term learning and cross-platform coordination.

As AI systems grow more complex, handling multiple tasks, roles, and data types, language models must evolve beyond understanding text to also retaining memory and learning continuously. Current LLMs lack structured memory management, which limits their ability to adapt and grow over time. MemOS addresses this by treating memory as a core, schedulable resource. It enables long-term learning through structured storage, version control, and unified memory access. Unlike traditional training, MemOS supports a continuous “memory training” paradigm that blurs the line between learning and inference. It also emphasizes governance, ensuring traceability, access control, and safe use in evolving AI systems.

MemOS is a memory-centric operating system for language models that treats memory not just as stored data but as an active, evolving component of the model’s cognition. It organizes memory into three distinct types: Parametric Memory (knowledge baked into model weights via pretraining or fine-tuning), Activation Memory (temporary internal states, such as KV caches and attention patterns, used during inference), and Plaintext Memory (editable, retrievable external data, like documents or prompts). These memory types interact within a unified framework called the MemoryCube (MemCube), which encapsulates both content and metadata, allowing dynamic scheduling, versioning, access control, and transformation across types. This structured system enables LLMs to adapt, recall relevant information, and efficiently evolve their capabilities, transforming them into more than just static generators.
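As a rough mental model of the abstraction (not the authors' actual implementation), a MemCube can be pictured as a unit that pairs memory content with metadata used for scheduling, versioning, and access control. The field names below are illustrative assumptions.

from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from typing import Any, Dict

class MemoryType(Enum):
    PARAMETRIC = "parametric"    # knowledge baked into model weights
    ACTIVATION = "activation"    # transient KV caches / attention state
    PLAINTEXT = "plaintext"      # editable, retrievable external text

@dataclass
class MemCube:
    """Illustrative stand-in for MemOS's unified memory unit."""
    content: Any
    memory_type: MemoryType
    version: int = 1
    owner: str = "default-agent"
    created_at: datetime = field(default_factory=datetime.utcnow)
    metadata: Dict[str, Any] = field(default_factory=dict)   # provenance, access policy, TTL, ...

    def evolve(self, new_content: Any) -> "MemCube":
        """Return a new version of this cube, keeping lineage traceable."""
        return MemCube(new_content, self.memory_type, self.version + 1,
                       self.owner, metadata={**self.metadata, "parent_version": self.version})

The point of the abstraction is that scheduling, lifecycle, and governance modules can operate on any memory type through this one wrapper, rather than treating weights, caches, and documents as unrelated artifacts.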

At the core of MemOS is a three-layer architecture: the Interface Layer handles user inputs and parses them into memory-related tasks; the Operation Layer manages the scheduling, organization, and evolution of different types of memory; and the Infrastructure Layer ensures safe storage, access governance, and cross-agent collaboration. All interactions within the system are mediated through MemCubes, allowing traceable, policy-driven memory operations. Through modules like MemScheduler, MemLifecycle, and MemGovernance, MemOS maintains a continuous and adaptive memory loop—from the moment a user sends a prompt, to memory injection during reasoning, to storing useful data for future use. This design not only enhances the model’s responsiveness and personalization but also ensures that memory remains structured, secure, and reusable. 

In conclusion, MemOS is a memory operating system designed to make memory a central, manageable component in LLMs. Unlike traditional models that depend mostly on static model weights and short-term runtime states, MemOS introduces a unified framework for handling parametric, activation, and plaintext memory. At its core is MemCube, a standardized memory unit that supports structured storage, lifecycle management, and task-aware memory augmentation. The system enables more coherent reasoning, adaptability, and cross-agent collaboration. Future goals include enabling memory sharing across models, self-evolving memory blocks, and building a decentralized memory marketplace to support continual learning and intelligent evolution. 

Check out the Paper. All credit for this research goes to the researchers of this project.
The post MemOS: A Memory-Centric Operating System for Evolving and Adaptive Large Language Models appeared first on MarkTechPost.

Sakana AI Introduces Text-to-LoRA (T2L): A Hypernetwork that Generates …

Transformer models have significantly influenced how AI systems approach tasks in natural language understanding, translation, and reasoning. These large-scale models, particularly large language models (LLMs), have grown in size and complexity to the point where they encompass broad capabilities across various domains. However, applying these models to new, specialized tasks remains a complex operation. Each new application typically demands careful dataset selection, hours of fine-tuning, and a high degree of computational power. Although these models offer a strong foundation in knowledge, their rigidity in handling new domains with minimal data remains a core limitation. As researchers aim to bring AI closer to human-like adaptability, the focus has shifted toward more efficient methods that allow such models to modify their behavior without retraining every parameter.

The Challenge of Customizing LLMs for New Tasks

The central difficulty lies in adapting foundation models to unique applications without repeating costly and time-intensive training cycles. Most solutions today rely on creating new adapters for each task, which are separate components trained to steer the model’s behavior. These adapters must be made from scratch for every task, and any benefits learned from one application often cannot be transferred to another. This adaptation process is time-consuming and lacks scalability. Moreover, tuning models on specific datasets usually requires a high level of precision in hyperparameter choices, and failing to find the right configuration can lead to poor results. Even when adaptation is successful, the result is often a large collection of isolated task-specific components that are not easy to integrate or reuse.

In response to these limitations, researchers have adopted Low-Rank Adaptation (LoRA), a technique that modifies only a small set of parameters rather than the entire model. LoRA injects low-rank matrices into specific layers of a frozen LLM, allowing the base weights to remain unchanged while enabling task-specific customization. This method reduces the number of trainable parameters. However, for each task, a new LoRA adapter still needs to be trained from scratch. While more efficient than full fine-tuning, this method does not allow for fast, on-the-fly adaptation. Recent advancements have attempted to compress these adapters further or combine multiple adapters during inference; however, they still rely heavily on prior training and cannot generate new adapters dynamically.
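For reference, the LoRA update itself is simple: a frozen weight matrix W is augmented with a trainable low-rank product, so the effective weight becomes W + BA, with B of shape (d_out, r) and A of shape (r, d_in) for a small rank r. The following is a minimal sketch with illustrative shapes, not tied to any specific model.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update (W + B @ A)."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # base weights stay frozen
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, rank))  # zero-init so training starts at W
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

Only A and B are trained for each task, which is why per-task adapters are cheap relative to full fine-tuning; as noted above, though, each new task still needs its own training run.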

Introducing Text-to-LoRA: Instant Adapter Generation from Task Descriptions

Researchers at Sakana AI introduced Text-to-LoRA (T2L), designed to instantly generate task-specific LoRA adapters from textual descriptions of the target task, instead of creating and training new adapters for each task. T2L functions as a hypernetwork capable of outputting adapter weights in a single forward pass. It learns from a library of pre-existing LoRA adapters covering various domains, including GSM8K, Arc-challenge, BoolQ, and others. Once trained, T2L can interpret a task’s description and generate the required adapter without additional training. This ability not only eliminates the need for manual adapter generation but also enables the system to generalize to tasks it has never encountered before.

The T2L architecture uses a combination of module-specific and layer-specific embeddings to guide the generation process. Three architectural variants were tested: a large version with 55 million parameters, a medium with 34 million, and a small with just 5 million. Despite their differences in size, all models were capable of generating the necessary low-rank matrices for adapter functionality. The training utilized the Super Natural Instructions dataset across 479 tasks, with each task described in natural language and encoded into vector form. By merging these descriptions with learned layer and module embeddings, T2L produces the low-rank A and B matrices that make up each adapter. This allows one model to replace hundreds of hand-crafted LoRAs, producing consistent results with a much smaller computational footprint.
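The following sketch conveys the idea of the hypernetwork: a task-description embedding, combined with learned module and layer embeddings, is mapped in a single forward pass to the A and B matrices of a LoRA adapter for one target module. Dimensions, layer counts, and names here are assumptions for illustration; the paper's architecture details differ.

import torch
import torch.nn as nn

class TextToLoRA(nn.Module):
    """Hypernetwork sketch: task embedding -> LoRA (A, B) for one target module/layer."""
    def __init__(self, task_dim=1024, emb_dim=64, hidden=512, d_in=4096, d_out=4096, rank=8):
        super().__init__()
        self.module_emb = nn.Embedding(2, emb_dim)    # e.g. query vs. value projection
        self.layer_emb = nn.Embedding(32, emb_dim)    # one embedding per transformer layer
        self.trunk = nn.Sequential(
            nn.Linear(task_dim + 2 * emb_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.head_A = nn.Linear(hidden, rank * d_in)
        self.head_B = nn.Linear(hidden, d_out * rank)
        self.rank, self.d_in, self.d_out = rank, d_in, d_out

    def forward(self, task_embedding, module_id, layer_id):
        z = torch.cat([task_embedding,
                       self.module_emb(module_id),
                       self.layer_emb(layer_id)], dim=-1)
        h = self.trunk(z)
        A = self.head_A(h).view(self.rank, self.d_in)
        B = self.head_B(h).view(self.d_out, self.rank)
        return A, B   # plug into a LoRA layer such as the one sketched earlier

As the takeaways below note, the paper obtains the task embedding from the gte-large-en-v1.5 text encoder and generates adapters that target only the query and value projections of the attention blocks.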

Benchmark Performance and Scalability of T2L

On benchmarks such as Arc-easy and GSM8K, T2L matched or surpassed the performance of task-specific LoRAs. For instance, the accuracy on Arc-easy using T2L was 76.6%, matching the accuracy of the best manually tuned adapter. On BoolQ, it reached 89.9%, slightly outperforming the original adapter. Even on more difficult benchmarks like PIQA and Winogrande, where overfitting typically hurts performance, T2L delivered better results than manually trained adapters. These improvements are believed to stem from the lossy compression inherent in the hypernetwork training, which acts as a form of regularization. When increasing the number of training datasets from 16 to 479, the performance in zero-shot settings improved substantially, showing T2L’s capability to generalize with broader exposure during training.

Several Key Takeaways from the Research include:

T2L allows instant adaptation of LLMs using only natural language descriptions.

It supports zero-shot generalization to tasks not seen during training.

Three architectural variants of T2L were tested with parameter counts of 55M, 34M, and 5M.

Benchmarks include ArcE, BoolQ, GSM8K, Hellaswag, PIQA, MBPP, and more.

T2L achieved benchmark accuracies of 76.6% (ArcE), 89.9% (BoolQ), and 92.6% (Hellaswag).

It matched or exceeded manually trained LoRAs in performance on multiple tasks.

Trained using 479 tasks from the Super Natural Instructions dataset.

T2L uses the gte-large-en-v1.5 model for generating task embeddings.

LoRA adapters produced by T2L target only query and value projections in attention blocks, totaling 3.4M parameters.

Performance remained consistent even with higher reconstruction loss, showing resilience to compression.

In conclusion, this research highlights a major step forward in flexible and efficient model adaptation. Instead of relying on repetitive, resource-heavy procedures, T2L uses natural language itself as a control mechanism, enabling models to specialize using simple task descriptions. This capability dramatically reduces the time and cost required to adapt LLMs to new domains. Moreover, it suggests that as long as enough prior adapters are available for training, future models could potentially adapt in seconds to any task described in plain English. The use of hypernetworks to dynamically construct adapters also means less storage is needed for model specialization, further increasing the practicality of this method in production environments.

Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.
The post Sakana AI Introduces Text-to-LoRA (T2L): A Hypernetwork that Generates Task-Specific LLM Adapters (LoRAs) based on a Text Description of the Task appeared first on MarkTechPost.

Build a Secure AI Code Execution Workflow Using Daytona SDK

In this Daytona SDK tutorial, we provide a hands-on walkthrough for leveraging Daytona’s secure sandbox environment to execute untrusted or AI-generated Python code safely within a notebook. Beginning with straightforward sandbox creation and basic code execution, the guide demonstrates how to isolate processes, install dependencies, and run simple scripts without jeopardizing the host environment. As the tutorial progresses, it delves into data processing with pandas, file operations including reading and writing JSON files, and the execution of complex AI-generated snippets such as recursive functions and sorting algorithms. Finally, it showcases parallel task execution across multiple sandboxes and proper cleanup procedures, ensuring that every resource is managed and disposed of correctly.

import os
import time
import json
from typing import List, Dict, Any

try:
    import daytona_sdk
except ImportError:
    print("Installing Daytona SDK...")
    !pip install daytona-sdk
    import daytona_sdk

from daytona_sdk import Daytona, DaytonaConfig, CreateSandboxParams
We install and import the Daytona SDK (if not already present), then initialize the core Daytona classes (Daytona, DaytonaConfig, and CreateSandboxParams) for configuring and creating secure Python sandboxes. It also brings in standard utilities like os, time, and json for use within those sandboxes.

class DaytonaTutorial:
    """Complete tutorial for Daytona SDK - Secure AI Code Execution Platform"""

    def __init__(self, api_key: str):
        """Initialize Daytona client"""
        self.config = DaytonaConfig(api_key=api_key)
        self.daytona = Daytona(self.config)
        self.sandboxes: List[Any] = []

    def basic_sandbox_demo(self):
        """Demo 1: Basic sandbox creation and code execution"""
        print(" Demo 1: Basic Sandbox Operations")
        print("-" * 40)

        try:
            sandbox = self.daytona.create(CreateSandboxParams(language="python"))
            self.sandboxes.append(sandbox)

            print(f" Created sandbox: {sandbox.id}")

            code = 'print("Hello from Daytona Sandbox!")\nprint(f"2 + 2 = {2 + 2}")'
            response = sandbox.process.code_run(code)

            if response.exit_code == 0:
                print(f" Output: {response.result}")
            else:
                print(f" Error: {response.result}")

        except Exception as e:
            print(f" Error in basic demo: {e}")

    def data_processing_demo(self):
        """Demo 2: Data processing in isolated environment"""
        print("\n Demo 2: Secure Data Processing")
        print("-" * 40)

        try:
            sandbox = self.daytona.create(CreateSandboxParams(language="python"))
            self.sandboxes.append(sandbox)

            # Install pandas inside the sandbox before running the analysis
            install_cmd = "import subprocess; subprocess.run(['pip', 'install', 'pandas'])"
            response = sandbox.process.code_run(install_cmd)

            data_code = """
import pandas as pd
import json

# Create sample dataset
data = {
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 30, 35, 28],
    'salary': [50000, 60000, 70000, 55000]
}

df = pd.DataFrame(data)
result = {
    'total_records': len(df),
    'avg_age': df['age'].mean(),
    'avg_salary': df['salary'].mean(),
    'summary': df.describe().to_dict()
}

print(json.dumps(result, indent=2))
"""

            response = sandbox.process.code_run(data_code)
            if response.exit_code == 0:
                print(" Data processing completed:")
                print(response.result)
            else:
                print(f" Error: {response.result}")

        except Exception as e:
            print(f" Error in data processing demo: {e}")

    def file_operations_demo(self):
        """Demo 3: File operations within sandbox"""
        print("\n Demo 3: File Operations")
        print("-" * 40)

        try:
            sandbox = self.daytona.create(CreateSandboxParams(language="python"))
            self.sandboxes.append(sandbox)

            file_code = """
import os
import json

# Create a sample file
data = {'message': 'Hello from Daytona!', 'timestamp': '2025-06-13'}
with open('sample.json', 'w') as f:
    json.dump(data, f, indent=2)

# Read and display file contents
with open('sample.json', 'r') as f:
    content = f.read()
print("File contents:")
print(content)

# List files in current directory
files = os.listdir('.')
print(f"\\nFiles in directory: {files}")
"""

            response = sandbox.process.code_run(file_code)
            if response.exit_code == 0:
                print(" File operations completed:")
                print(response.result)
            else:
                print(f" Error: {response.result}")

        except Exception as e:
            print(f" Error in file operations demo: {e}")

    def ai_code_execution_demo(self):
        """Demo 4: Simulated AI-generated code execution"""
        print("\n Demo 4: AI-Generated Code Execution")
        print("-" * 40)

        ai_codes = [
            "# Calculate fibonacci sequence\ndef fib(n):\n    if n <= 1: return n\n    return fib(n-1) + fib(n-2)\nprint([fib(i) for i in range(10)])",
            "# Sort algorithm\ndef bubble_sort(arr):\n    n = len(arr)\n    for i in range(n):\n        for j in range(0, n-i-1):\n            if arr[j] > arr[j+1]:\n                arr[j], arr[j+1] = arr[j+1], arr[j]\n    return arr\nprint(bubble_sort([64, 34, 25, 12, 22, 11, 90]))",
            "# Data analysis\nimport math\ndata = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]\nmean = sum(data) / len(data)\nvariance = sum((x - mean) ** 2 for x in data) / len(data)\nstd_dev = math.sqrt(variance)\nprint(f'Mean: {mean}, Std Dev: {std_dev:.2f}')"
        ]

        try:
            sandbox = self.daytona.create(CreateSandboxParams(language="python"))
            self.sandboxes.append(sandbox)

            for i, code in enumerate(ai_codes, 1):
                print(f"\n Executing AI Code Snippet {i}:")
                response = sandbox.process.code_run(code)

                if response.exit_code == 0:
                    print(f" Output: {response.result}")
                else:
                    print(f" Error: {response.result}")

                time.sleep(1)

        except Exception as e:
            print(f" Error in AI code execution demo: {e}")

    def parallel_execution_demo(self):
        """Demo 5: Multiple sandboxes for parallel processing"""
        print("\n Demo 5: Parallel Execution")
        print("-" * 40)

        tasks = [
            "print('Task 1: Computing prime numbers')\nprimes = [i for i in range(2, 50) if all(i % j != 0 for j in range(2, int(i**0.5) + 1))]\nprint(f'Primes: {primes[:10]}')",
            "print('Task 2: String processing')\ntext = 'Hello Daytona World'\nprint(f'Reversed: {text[::-1]}')\nprint(f'Word count: {len(text.split())}')",
            "print('Task 3: Mathematical calculations')\nimport math\nresult = sum(math.sqrt(i) for i in range(1, 101))\nprint(f'Sum of square roots 1-100: {result:.2f}')"
        ]

        try:
            parallel_sandboxes = []
            for i in range(len(tasks)):
                sandbox = self.daytona.create(CreateSandboxParams(language="python"))
                parallel_sandboxes.append(sandbox)
                self.sandboxes.append(sandbox)

            results = []
            for i, (sandbox, task) in enumerate(zip(parallel_sandboxes, tasks)):
                print(f"\n Starting parallel task {i+1}")
                response = sandbox.process.code_run(task)
                results.append((i+1, response))

            for task_num, response in results:
                if response.exit_code == 0:
                    print(f" Task {task_num} completed: {response.result}")
                else:
                    print(f" Task {task_num} failed: {response.result}")

        except Exception as e:
            print(f" Error in parallel execution demo: {e}")

    def cleanup_sandboxes(self):
        """Clean up all created sandboxes"""
        print("\n Cleaning up sandboxes...")
        print("-" * 40)

        for sandbox in self.sandboxes:
            try:
                self.daytona.remove(sandbox)
                print(f" Removed sandbox: {sandbox.id}")
            except Exception as e:
                print(f" Error removing sandbox {sandbox.id}: {e}")

        self.sandboxes.clear()
        print(" Cleanup completed!")

    def run_full_tutorial(self):
        """Run the complete Daytona tutorial"""
        print(" Daytona SDK Complete Tutorial")
        print("=" * 50)
        print("Secure & Isolated AI Code Execution Platform")
        print("=" * 50)

        self.basic_sandbox_demo()
        self.data_processing_demo()
        self.file_operations_demo()
        self.ai_code_execution_demo()
        self.parallel_execution_demo()
        self.cleanup_sandboxes()

        print("\n Tutorial completed successfully!")
        print("Key Daytona features demonstrated:")
        print("• Secure sandbox creation")
        print("• Isolated code execution")
        print("• File system operations")
        print("• Parallel processing")
        print("• Resource cleanup")
This DaytonaTutorial class encapsulates a complete end-to-end guide for using the Daytona SDK: it initializes a secure sandbox client with your API key, demonstrates isolated code execution (from simple prints through pandas data processing and file I/O to AI-generated snippets), orchestrates parallel tasks across multiple sandboxes, and finally ensures clean teardown of all resources. Each method is self-contained, showcasing key Daytona features (sandbox creation, dependency installation, safe execution, and resource cleanup) in a clear, step-by-step workflow that’s ideal for running in a notebook.

def main():
    """Main function to run the tutorial"""

    print(" Daytona Setup Instructions:")
    print("1. Visit: https://app.daytona.io")
    print("2. Create an account")
    print("3. Generate API key at: https://app.daytona.io/dashboard/keys")
    print("4. Replace 'YOUR_API_KEY' below with your actual key")
    print("-" * 50)

    API_KEY = "YOUR_API_KEY"  # Replace with your Daytona API key

    if API_KEY == "YOUR_API_KEY":
        print(" Please set your Daytona API key before running the tutorial!")
        print(" Update the API_KEY variable with your key from https://app.daytona.io/dashboard/keys")
        return

    try:
        tutorial = DaytonaTutorial(API_KEY)
        tutorial.run_full_tutorial()

    except Exception as e:
        print(f" Tutorial failed: {e}")
        print(" Make sure your API key is valid and you have network access")
The main() function outlines the initial setup steps, guiding users to create a Daytona account and generate their API key, then validates that the key has been provided before instantiating the DaytonaTutorial class and running the full walkthrough. If the API key is missing or invalid, it prints clear instructions and aborts, ensuring a smooth first-time experience.

if __name__ == "__main__":
    main()

Finally, the above standard Python entry-point check ensures that main() is only invoked when the script is run directly, initiating the Daytona tutorial workflow in a clear and controlled manner.

In conclusion, by following this tutorial, developers gain a comprehensive understanding of Daytona’s core capabilities: creating isolated Python sandboxes, performing secure data manipulations, managing file I/O, running arbitrary or AI-generated code, and orchestrating parallel workloads, all while maintaining strict separation from the host system. The cleanup routines underscore the importance of resource hygiene in long-running workflows. Armed with these foundational skills, users can confidently integrate Daytona into larger machine-learning pipelines, automated testing frameworks, or any scenario that requires the safe execution of dynamic code.

Check out the Notebook. All credit for this research goes to the researchers of this project.
The post Build a Secure AI Code Execution Workflow Using Daytona SDK appeared first on MarkTechPost.

Apple Researchers Reveal Structural Failures in Large Reasoning Models …

Artificial intelligence has undergone a significant transition from basic language models to advanced models that focus on reasoning tasks. These newer systems, known as Large Reasoning Models (LRMs), represent a class of tools designed to simulate human-like thinking by producing intermediate reasoning steps before arriving at conclusions. The focus has moved from generating accurate outputs to understanding the process that leads to these answers. This shift has raised questions about how these models manage tasks with layered complexity and whether they truly possess reasoning abilities or are simply leveraging training patterns to guess outcomes.

Redefining Evaluation: Moving Beyond Final Answer Accuracy

A recurring problem with evaluating machine reasoning is that traditional benchmarks mostly assess the final answer without examining the steps involved in arriving at it. Final answer accuracy alone does not reveal the quality of internal reasoning, and many benchmarks are contaminated with data that may have been seen during training. This creates a misleading picture of a model’s true capabilities. To explore actual reasoning, researchers require environments where problem difficulty can be precisely controlled and intermediate steps can be analyzed. Without such settings, it is hard to determine whether these models can generalize solutions or merely memorize patterns.

To evaluate reasoning more reliably, the research team at Apple designed a setup using four puzzle environments: Tower of Hanoi, River Crossing, Checkers Jumping, and Blocks World. These puzzles allow precise manipulation of complexity by changing elements such as the number of disks, checkers, or agents involved. Each task requires different reasoning abilities, such as constraint satisfaction and sequential planning. Importantly, these environments are free from typical data contamination, enabling thorough checks of both outcomes and the reasoning steps in between. This method ensures a detailed investigation of how models behave across varied task demands.
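As an example of why these puzzles give precise control over difficulty, the optimal Tower of Hanoi solution for n disks takes exactly 2^n - 1 moves, so adding one disk roughly doubles the required plan length, and a simple verifier can replay a model's proposed move sequence step by step. The following is a minimal sketch of that idea, not the paper's evaluation code.

def hanoi_moves(n, src="A", aux="B", dst="C"):
    """Optimal move sequence for n disks; length is 2**n - 1."""
    if n == 0:
        return []
    return (hanoi_moves(n - 1, src, dst, aux)
            + [(src, dst)]
            + hanoi_moves(n - 1, aux, src, dst))

def is_valid_solution(n, moves):
    """Replay a proposed move list and check legality and completion."""
    pegs = {"A": list(range(n, 0, -1)), "B": [], "C": []}
    for src, dst in moves:
        if not pegs[src] or (pegs[dst] and pegs[dst][-1] < pegs[src][-1]):
            return False                       # empty source or larger-on-smaller move
        pegs[dst].append(pegs[src].pop())
    return pegs["C"] == list(range(n, 0, -1))  # all disks end on the target peg

assert len(hanoi_moves(7)) == 2**7 - 1 == 127
assert is_valid_solution(7, hanoi_moves(7))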

The research introduced a comparative study using two sets of models: Claude 3.7 Sonnet and DeepSeek-R1, along with their “thinking” variants and their standard LLM counterparts. These models were tested across the puzzles under identical token budgets to measure both accuracy and reasoning efficiency. This helped reveal performance shifts across low, medium, and high-complexity tasks. One of the most revealing observations was the formation of three performance zones. In simple tasks, non-thinking models outperformed reasoning variants. For medium complexity, reasoning models gained an edge, while both types collapsed completely as complexity peaked.

Comparative Insights: Thinking vs. Non-Thinking Models Under Stress

An in-depth analysis revealed that reasoning effort increased with task difficulty up to a certain point but then declined despite the availability of resources. For instance, in the Tower of Hanoi, Claude 3.7 Sonnet (thinking) maintained high accuracy until complexity reached a certain threshold, after which performance dropped to zero. Even when these models were supplied with explicit solution algorithms, they failed to execute steps beyond specific complexity levels. In one case, Claude 3.7 could manage around 100 steps correctly for the Tower of Hanoi but was unable to complete simpler River Crossing tasks requiring only 11 moves when N = 3. This inconsistency exposed serious limitations in symbolic manipulation and exact computation.

The performance breakdown also highlighted how LRMs handle their internal thought process. Models frequently engaged in “overthinking,” generating correct intermediate solutions early in the process but continuing to explore incorrect paths. This led to inefficient use of tokens. At medium complexity levels, models began to find correct answers later in their reasoning chains. However, at high levels of complexity, they failed to produce accurate solutions. Quantitative analysis confirmed that solution accuracy dropped to near zero as the problem complexity increased, and the number of reasoning tokens allocated began to decline unexpectedly.

Scaling Limits and the Collapse of Reasoning

This research presents a sobering assessment of how current Large Reasoning Models (LRMs) operate. The work from Apple makes it clear that, despite some progress, today’s reasoning models are still far from achieving generalized reasoning. It identifies how performance scales, where it collapses, and why over-reliance on benchmark accuracy fails to capture deeper reasoning behavior. Controlled puzzle environments have proven to be a powerful tool for uncovering hidden weaknesses in these systems and emphasizing the need for more robust designs in the future.

Check out the Paper. All credit for this research goes to the researchers of this project.
The post Apple Researchers Reveal Structural Failures in Large Reasoning Models Using Puzzle-Based Evaluation appeared first on MarkTechPost.

Google AI Unveils a Hybrid AI-Physics Model for Accurate Regional Clim …

Limitations of Traditional Climate Modeling

Earth system models are essential tools for forecasting environmental changes and helping us prepare for the future. However, their high computational demands make it difficult to run them at resolutions fine enough for detailed, local predictions. Currently, most models are limited to a resolution around 100 kilometers—roughly the size of Hawai’i—making it hard to generate accurate projections for specific regions. Yet, city-scale forecasts at approximately 10 kilometers are vital for real-world applications, such as agriculture, water resource planning, and disaster preparedness. Improving the resolution of these models is key to better protecting communities and supporting more effective local decision-making. 

Introducing Dynamical-Generative Downscaling with AI

Researchers at Google have introduced a method that combines traditional physics-based climate modeling with generative AI to assess regional environmental risks. Published in PNAS, their approach—called dynamical-generative downscaling—utilizes diffusion models, a type of AI that learns complex patterns, to convert broad global climate projections into detailed, local predictions at a resolution of approximately 10 km. This method not only bridges the gap between large-scale models and real-world decision-making needs but also does so far more efficiently and affordably than current high-resolution techniques, making it feasible to apply across the growing volume of climate data now available. 

To better understand local environmental changes at fine resolutions (around 10 km), scientists typically use a method called dynamical downscaling. This process takes broad data from global climate models and refines it using regional climate models, like zooming in on a worldwide map to see more detail. While this technique provides highly accurate local forecasts by factoring in terrain and regional weather patterns, it comes at a steep computational cost, making it too slow and expensive to apply broadly across many climate scenarios. Simpler statistical methods are faster but often fail to model extreme events or reliably adapt to new future conditions.

Improving Accuracy and Efficiency with R2D2

To overcome these challenges, researchers have introduced a more efficient method that merges the strengths of physics-based models with generative AI. This two-step process begins with a physics-based simulation that downscales global data to a mid-level resolution, ensuring consistency across different global models. Then, a generative AI model called R2D2 fills in the finer details—like small-scale weather features shaped by terrain—by learning from high-resolution examples. By focusing on the differences between medium and high resolutions, R2D2 improves accuracy and generalizes well to unseen scenarios. This combined approach enables faster, cost-effective, and realistic local climate projections across a wide range of future scenarios. 

To test the new approach, researchers trained the model using one high-resolution climate projection from the Western U.S. and then evaluated it on seven others. Compared to traditional statistical methods, their AI-powered downscaling model significantly reduced errors by over 40% in predicting variables like temperature, humidity, and wind. It also more accurately captured complex weather patterns, like heatwaves combined with droughts or wildfire risks from strong winds. This method enhances both accuracy and efficiency, providing more accurate estimates of extreme weather and uncertainty while utilizing only a fraction of the computing power required by traditional high-resolution simulations. 

In conclusion, the new AI-powered downscaling approach is a major leap forward in making detailed, regional climate forecasts more accessible and affordable. By combining traditional physics-based modeling with generative AI, the method delivers accurate, city-scale (~10 km) climate risk assessments while cutting computing costs by up to 85%. Unlike older methods, which are limited by scale and expense, this technique can efficiently handle large ensembles of climate projections. It captures uncertainties more comprehensively and supports smarter planning in agriculture, disaster preparedness, water management, and infrastructure. In short, it turns complex global data into actionable local insights—faster, cheaper, and more accurately than ever before. 

Check out the Paper and Technical details. All credit for this research goes to the researchers of this project.
The post Google AI Unveils a Hybrid AI-Physics Model for Accurate Regional Climate Risk Forecasts with Better Uncertainty Assessment appeared first on MarkTechPost.

Deploy Qwen models with Amazon Bedrock Custom Model Import

We’re excited to announce that Amazon Bedrock Custom Model Import now supports Qwen models. You can now import custom weights for Qwen2, Qwen2_VL, and Qwen2_5_VL architectures, including models like Qwen 2, 2.5 Coder, Qwen 2.5 VL, and QwQ 32B. You can bring your own customized Qwen models into Amazon Bedrock and deploy them in a fully managed, serverless environment—without having to manage infrastructure or model serving.
In this post, we cover how to deploy Qwen 2.5 models with Amazon Bedrock Custom Model Import, making them accessible to organizations looking to use state-of-the-art AI capabilities within the AWS infrastructure at an effective cost.
Overview of Qwen models
Qwen 2 and 2.5 are families of large language models, available in a wide range of sizes and specialized variants to suit diverse needs:

General language models: Models ranging from 0.5B to 72B parameters, with both base and instruct versions for general-purpose tasks
Qwen 2.5-Coder: Specialized for code generation and completion
Qwen 2.5-Math: Focused on advanced mathematical reasoning
Qwen 2.5-VL (vision-language): Image and video processing capabilities, enabling multimodal applications

Overview of Amazon Bedrock Custom Model Import
Amazon Bedrock Custom Model Import enables the import and use of your customized models alongside existing foundation models (FMs) through a single serverless, unified API. You can access your imported custom models on-demand and without the need to manage the underlying infrastructure. Accelerate your generative AI application development by integrating your supported custom models with native Amazon Bedrock tools and features like Amazon Bedrock Knowledge Bases, Amazon Bedrock Guardrails, and Amazon Bedrock Agents. Amazon Bedrock Custom Model Import is generally available in the US East (N. Virginia), US West (Oregon), and Europe (Frankfurt) AWS Regions.
Now, we’ll explore how you can use Qwen 2.5 models for two common use cases: as a coding assistant and for image understanding. Qwen2.5-Coder is a state-of-the-art code model, matching the capabilities of proprietary models like GPT-4o. It supports over 90 programming languages and excels at code generation, debugging, and reasoning. Qwen 2.5-VL brings advanced multimodal capabilities. According to Qwen, Qwen 2.5-VL is not only proficient at recognizing objects such as flowers and animals, but also at analyzing charts, extracting text from images, interpreting document layouts, and processing long videos.
Prerequisites
Before importing the Qwen model with Amazon Bedrock Custom Model Import, make sure that you have the following in place:

An active AWS account
An Amazon Simple Storage Service (Amazon S3) bucket to store the Qwen model files
Sufficient permissions to create Amazon Bedrock model import jobs
Verified that your Region supports Amazon Bedrock Custom Model Import

Use case 1: Qwen coding assistant
In this example, we will demonstrate how to build a coding assistant using the Qwen2.5-Coder-7B-Instruct model.

Go to Hugging Face, then search for and copy the model ID Qwen/Qwen2.5-Coder-7B-Instruct:

You will use Qwen/Qwen2.5-Coder-7B-Instruct for the rest of the walkthrough. We don’t demonstrate fine-tuning steps, but you can also fine-tune before importing.

Use the following command to download a snapshot of the model locally. The Python library for Hugging Face provides a utility called snapshot_download for this:

from huggingface_hub import snapshot_download

snapshot_download(repo_id="Qwen/Qwen2.5-Coder-7B-Instruct",
                  local_dir="./extractedmodel/")

Depending on your model size, this could take a few minutes. When completed, your Qwen Coder 7B model folder will contain the following files.

Configuration files: Including config.json, generation_config.json, tokenizer_config.json, tokenizer.json, and vocab.json
Model files: Four safetensor files and model.safetensors.index.json
Documentation: LICENSE, README.md, and merges.txt

Upload the model to Amazon S3, using boto3 or the command line:

aws s3 cp ./extractedmodel s3://yourbucket/path/ --recursive

Start the import model job using the following API call:

response = bedrock_client.create_model_import_job(
    jobName="uniquejobname",
    importedModelName="uniquemodelname",
    roleArn="fullrolearn",
    modelDataSource={
        's3DataSource': {
            's3Uri': "s3://yourbucket/path/"
        }
    }
)

You can also do this using the AWS Management Console for Amazon Bedrock.

In the Amazon Bedrock console, choose Imported models in the navigation pane.
Choose Import a model.

Enter the details, including a Model name, Import job name, and model S3 location.

Create a new service role or use an existing service role. Then choose Import model.

After you choose Import on the console, you should see the status as Importing while the model is being imported:

If you’re using your own role, make sure you add the required trust relationship, as described in Create a service role for model import.
After your model is imported, wait for model inference to be ready, and then chat with the model on the playground or through the API. In the following example, we prompt the model to directly output Python code that lists items in an S3 bucket. Remember to use the right chat template to input prompts in the format required. For example, you can get the right chat template for any compatible model on Hugging Face using the following code:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-7B-Instruct")

# The imported model is invoked through Amazon Bedrock rather than model.generate(),
# but tokenizer.apply_chat_template() is still used to format the prompt in the chat format the model expects
prompt = "Write sample boto3 python code to list files in a bucket stored in the variable `my_bucket`"
messages = [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

Note that when using the invoke_model APIs, you must use the full Amazon Resource Name (ARN) for the imported model. You can find the model ARN in the Amazon Bedrock console by navigating to the Imported models section and then viewing the Model details page, as shown in the following figure.

After the model is ready for inference, you can use Chat Playground in Bedrock console or APIs to invoke the model.
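As a rough illustration of the end-to-end call, the following sketch passes the chat-templated prompt built above to the imported model through the Amazon Bedrock Runtime invoke_model API. The model ARN is a placeholder, and the exact request body fields can vary by imported architecture; the image-understanding example later in this post uses prompt, temperature, max_gen_len, and top_p, which this sketch assumes as well.

import json
import boto3

# Placeholder: copy the real ARN of your imported model from the Amazon Bedrock console
model_arn = "arn:aws:bedrock:us-east-1:123456789012:imported-model/example"

bedrock_runtime = boto3.client("bedrock-runtime")

# 'text' is the chat-templated prompt produced by tokenizer.apply_chat_template() above
response = bedrock_runtime.invoke_model(
    modelId=model_arn,
    body=json.dumps({
        "prompt": text,
        "max_gen_len": 512,
        "temperature": 0.2,
        "top_p": 0.9
    }),
    accept="application/json",
    contentType="application/json"
)

result = json.loads(response["body"].read())
print(result)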

Use case 2: Qwen 2.5 VL image understanding
Qwen2.5-VL-* offers multimodal capabilities, combining vision and language understanding in a single model. This section demonstrates how to deploy Qwen2.5-VL using Amazon Bedrock Custom Model Import and test its image understanding capabilities.
Import Qwen2.5-VL-7B to Amazon Bedrock
Download the model from Hugging Face and upload it to Amazon S3:

import os
from huggingface_hub import snapshot_download

hf_model_id = "Qwen/Qwen2.5-VL-7B-Instruct"
local_directory = "qwen2-5-vl-7b-instruct"  # local folder for the downloaded weights

# Enable faster downloads
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

# Download model locally
snapshot_download(repo_id=hf_model_id, local_dir=f"./{local_directory}")

Next, import the model to Amazon Bedrock (either via Console or API):

response = bedrock.create_model_import_job(
    jobName=job_name,
    importedModelName=imported_model_name,
    roleArn=role_arn,
    modelDataSource={
        's3DataSource': {
            's3Uri': s3_uri
        }
    }
)

Test the vision capabilities
After the import is complete, test the model with an image input. The Qwen2.5-VL-* model requires proper formatting of multimodal inputs:

import base64
import json

import boto3
from transformers import AutoProcessor

# Amazon Bedrock Runtime client and the ARN of your imported model
client = boto3.client("bedrock-runtime")
model_id = "<your imported model ARN>"

def image_to_base64(file_path):
    """Read an image file and return its base64-encoded contents."""
    with open(file_path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def generate_vl(messages, image_base64, temperature=0.3, max_tokens=4096, top_p=0.9):
    # Use the processor that matches the imported model to build the chat prompt
    processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")
    prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

    response = client.invoke_model(
        modelId=model_id,
        body=json.dumps({
            'prompt': prompt,
            'temperature': temperature,
            'max_gen_len': max_tokens,
            'top_p': top_p,
            'images': [image_base64]
        }),
        accept='application/json',
        contentType='application/json'
    )

    return json.loads(response['body'].read().decode('utf-8'))

# Using the model with an image
file_path = "cat_image.jpg"
base64_data = image_to_base64(file_path)

messages = [
    {
        "role": "user",
        "content": [
            {"image": base64_data},
            {"text": "Describe this image."}
        ]
    }
]

response = generate_vl(messages, base64_data)

# Print response
print("Model Response:")
if 'choices' in response:
    print(response['choices'][0]['text'])
elif 'outputs' in response:
    print(response['outputs'][0]['text'])
else:
    print(response)

When provided with an example image of a cat (such as the following image), the model accurately describes key features such as the cat’s position, fur color, eye color, and general appearance. This demonstrates the Qwen2.5-VL model’s ability to process visual information and generate relevant text descriptions.

The model’s response:

This image features a close-up of a cat lying down on a soft, textured surface, likely a couch or a bed. The cat has a tabby coat with a mix of dark and light brown fur, and its eyes are a striking green with vertical pupils, giving it a captivating look. The cat’s whiskers are prominent and extend outward from its face, adding to the detailed texture of the image. The background is softly blurred, suggesting a cozy indoor setting with some furniture and possibly a window letting in natural light. The overall atmosphere of the image is warm and serene, highlighting the cat’s relaxed and content demeanor.

Pricing
You can use Amazon Bedrock Custom Model Import to use your custom model weights within Amazon Bedrock for supported architectures, serving them alongside Amazon Bedrock hosted FMs in a fully managed way through On-Demand mode. Custom Model Import doesn’t charge for model import. You are charged for inference based on two factors: the number of active model copies and their duration of activity. Billing occurs in 5-minute increments, starting from the first successful invocation of each model copy. The pricing per model copy per minute varies based on factors including architecture, context length, Region, and compute unit version, and is tiered by model copy size. The custom model units required for hosting depend on the model’s architecture, parameter count, and context length. Amazon Bedrock automatically manages scaling based on your usage patterns. If there are no invocations for 5 minutes, it scales to zero and scales up when needed, though this might involve cold-start latency of up to a minute. Additional copies are added if inference volume consistently exceeds single-copy concurrency limits. The maximum throughput and concurrency per copy is determined during import, based on factors such as input/output token mix, hardware type, model size, architecture, and inference optimizations.
For more information, see Amazon Bedrock pricing.
Clean up
To avoid ongoing charges after completing the experiments:

Delete your imported Qwen models from Amazon Bedrock Custom Model Import using the console or the API.
Optionally, delete the model files from your S3 bucket if you no longer need them.

Remember that while Amazon Bedrock Custom Model Import doesn’t charge for the import process itself, you are billed for model inference usage and storage.
Conclusion
Amazon Bedrock Custom Model Import empowers organizations to use powerful publicly available models like Qwen 2.5, among others, while benefiting from enterprise-grade infrastructure. The serverless nature of Amazon Bedrock eliminates the complexity of managing model deployments and operations, allowing teams to focus on building applications rather than infrastructure. With features like auto scaling, pay-per-use pricing, and seamless integration with AWS services, Amazon Bedrock provides a production-ready environment for AI workloads. The combination of Qwen 2.5’s advanced AI capabilities and Amazon Bedrock managed infrastructure offers an optimal balance of performance, cost, and operational efficiency. Organizations can start with smaller models and scale up as needed, while maintaining full control over their model deployments and benefiting from AWS security and compliance capabilities.
For more information, refer to the Amazon Bedrock User Guide.

About the Authors
Ajit Mahareddy is an experienced Product and Go-To-Market (GTM) leader with over 20 years of experience in Product Management, Engineering, and Go-To-Market. Prior to his current role, Ajit led product management building AI/ML products at leading technology companies, including Uber, Turing, and eHealth. He is passionate about advancing Generative AI technologies and driving real-world impact with Generative AI.
Shreyas Subramanian is a Principal Data Scientist and helps customers by using generative AI and deep learning to solve their business challenges using AWS services. Shreyas has a background in large-scale optimization and ML and in the use of ML and reinforcement learning for accelerating optimization tasks.
Yanyan Zhang is a Senior Generative AI Data Scientist at Amazon Web Services, where she has been working on cutting-edge AI/ML technologies as a Generative AI Specialist, helping customers use generative AI to achieve their desired outcomes. Yanyan graduated from Texas A&M University with a PhD in Electrical Engineering. Outside of work, she loves traveling, working out, and exploring new things.
Dharinee Gupta is an Engineering Manager at AWS Bedrock, where she focuses on enabling customers to seamlessly utilize open source models through serverless solutions. Her team specializes in optimizing these models to deliver the best cost-performance balance for customers. Prior to her current role, she gained extensive experience in authentication and authorization systems at Amazon, developing secure access solutions for Amazon offerings. Dharinee is passionate about making advanced AI technologies accessible and efficient for AWS customers.
Lokeshwaran Ravi is a Senior Deep Learning Compiler Engineer at AWS, specializing in ML optimization, model acceleration, and AI security. He focuses on enhancing efficiency, reducing costs, and building secure ecosystems to democratize AI technologies, making cutting-edge ML accessible and impactful across industries.
June Won is a Principal Product Manager with Amazon SageMaker JumpStart. He focuses on making foundation models easily discoverable and usable to help customers build generative AI applications. His experience at Amazon also includes mobile shopping applications and last mile delivery.

Build generative AI solutions with Amazon Bedrock

Generative AI is revolutionizing how businesses operate, interact with customers, and innovate. If you’re embarking on the journey to build a generative AI-powered solution, you might wonder how to navigate the complexities involved from selecting the right models to managing prompts and enforcing data privacy.
In this post, we show you how to build generative AI applications on Amazon Web Services (AWS) using the capabilities of Amazon Bedrock, highlighting how Amazon Bedrock can be used at each step of your generative AI journey. This guide is valuable for both experienced AI engineers and newcomers to the generative AI space, helping you use Amazon Bedrock to its fullest potential.
Amazon Bedrock is a fully managed service that provides a unified API to access a wide range of high-performing foundation models (FMs) from leading AI companies like Anthropic, Cohere, Meta, Mistral AI, AI21 Labs, Stability AI, and Amazon. It offers a robust set of tools and features designed to help you build generative AI applications efficiently while adhering to best practices in security, privacy, and responsible AI.
Calling an LLM with an API
You want to integrate a generative AI feature into your application through a straightforward, single-turn interaction with a large language model (LLM). Perhaps you need to generate text, answer a question, or provide a summary based on user input. Amazon Bedrock simplifies generative AI application development and scaling through a unified API for accessing diverse, leading FMs. With support for Amazon models and leading AI providers, you have the freedom to experiment without being locked into a single model or provider. With the rapid pace of development in AI, you can seamlessly switch models for optimized performance with no application rewrite required.
Beyond direct model access, Amazon Bedrock expands your options with the Amazon Bedrock Marketplace. This marketplace gives you access to over 100 specialized FMs; you can discover, test, and integrate new capabilities all through fully managed endpoints. Whether you need the latest innovation in text generation, image synthesis, or domain-specific AI, Amazon Bedrock provides the flexibility to adapt and scale your solution with ease.
With one API, you stay agile and can effortlessly switch between models, upgrade to the latest versions, and future-proof your generative AI applications with minimal code changes. To summarize, Amazon Bedrock offers the following benefits:

Simplicity: No need to manage infrastructure or deal with multiple APIs
Flexibility: Experiment with different models to find the best fit
Scalability: Scale your application without worrying about underlying resources

To get started, use the Chat or Text playground to experiment with different FMs, and use the Converse API to integrate FMs into your application.
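For illustration, here is a minimal single-turn call using the AWS SDK for Python (Boto3); the Region, model ID, and prompt are example values you would replace with your own.

```python
import boto3

# Bedrock Runtime client (example Region)
client = boto3.client("bedrock-runtime", region_name="us-east-1")

# One unified request shape, regardless of the underlying FM
response = client.converse(
    modelId="amazon.nova-micro-v1:0",  # swapping models is typically just a change to this ID
    messages=[
        {"role": "user", "content": [{"text": "Summarize the benefits of diversified index funds in two sentences."}]}
    ],
    inferenceConfig={"maxTokens": 300, "temperature": 0.2},
)

print(response["output"]["message"]["content"][0]["text"])
```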
After you’ve integrated a basic LLM feature, the next step is optimizing the performance and making sure you’re using the right model for your requirements. This brings us to the importance of evaluating and comparing models.
Choosing the right model for your use case
Selecting the right FM for your use case is crucial, but with so many options available, how do you know which one will give you the best performance for your application? Whether it’s for generating more relevant responses, summarizing information, or handling nuanced queries, choosing the best model is key to providing optimal performance.
You can use Amazon Bedrock model evaluation to rigorously test different FMs to find the one that delivers the best results for your use case. Whether you’re in the early stages of development or preparing for launch, selecting the right model can make a significant difference in the effectiveness of your generative AI solutions.
The model evaluation process consists of the following components:

Automatic and human evaluation: Begin by experimenting with different models using automated evaluation metrics like accuracy, robustness, or toxicity. You can also bring in human evaluators to measure more subjective factors, such as friendliness, style, or how well the model aligns with your brand voice.
Custom datasets and metrics: Evaluate the performance of models using your own datasets or pre-built options. Customize the metrics that matter most for your project, making sure the selected model aligns with your business or operational goals.
Iterative feedback: Throughout the development process, run evaluations iteratively, allowing for faster refinement. This helps you compare models side by side, so you can make a data-driven decision when selecting the FM that fits your use case.

Imagine you’re building a customer support AI assistant for an ecommerce service. You can use model evaluation to test multiple FMs with real customer queries, evaluating which model provides the most accurate, friendly, and contextually appropriate responses. By comparing models side by side, you can choose the model that will deliver the best possible user experience for your customers.
After you’ve evaluated and selected the ideal model, the next step is making sure it aligns with your business needs. Off-the-shelf models might perform well, but for a truly tailored experience, you need more customization. This leads to the next important step in your generative AI journey: personalizing models to reflect your business context. You need to make sure the model generates the most accurate and contextually relevant responses. Even the best FMs will not have access to the latest or domain-specific information critical to your business. To solve this, the model needs to use your proprietary data sources, making sure its outputs reflect the most up-to-date and relevant information. This is where you can use Retrieval Augmented Generation (RAG) to enrich the model’s responses by incorporating your organization’s unique knowledge base.
Enriching model responses with your proprietary data
A publicly available LLM might perform well on general knowledge tasks, but struggle with outdated information or lack context from your organization’s proprietary data. You need a way to provide the model with the most relevant, up-to-date insights to provide accuracy and contextual depth. There are two key approaches that you can use to enrich model responses:

RAG: Use RAG to dynamically retrieve relevant information at query time, enriching model responses without requiring retraining
Fine-tuning: Customize your chosen model by training it on proprietary data, improving its ability to handle organization-specific tasks or domain knowledge

We recommend starting with RAG because it’s flexible and straightforward to implement. You can then fine-tune the model for deeper domain adaptation if needed. RAG dynamically retrieves relevant information at query time, making sure model responses stay accurate and context aware. In this approach, data is first processed and indexed in a vector database or similar retrieval system. When a user submits a query, Amazon Bedrock searches this indexed data to find relevant context, which is injected into the prompt. The model then generates a response based on both the original query and the retrieved insights without requiring additional training.
Amazon Bedrock Knowledge Bases automates the RAG pipeline—including data ingestion, retrieval, prompt augmentation, and citations—reducing the complexity of setting up custom integrations. By seamlessly integrating proprietary data, you can make sure that the models generate accurate, contextually rich, and continuously updated responses.
Bedrock Knowledge Bases supports various data types to tailor AI-generated responses to business-specific needs:

Unstructured data: Extract insights from text-heavy sources like documents, PDFs, and emails
Structured data: Enable natural language queries on databases, data lakes, and warehouses without moving or preprocessing data
Multimodal data: Process both text and visual elements in documents and images using Amazon Bedrock Data Automation
GraphRAG: Enhance knowledge retrieval with graph-based relationships, enabling AI to understand entity connections for more context-aware responses

With these capabilities, Amazon Bedrock reduces data silos, making it straightforward to enrich AI applications with both real-time and historical knowledge. Whether working with text, images, structured datasets, or interconnected knowledge graphs, Amazon Bedrock provides a fully managed, scalable solution without the need for complex infrastructure. To summarize, using RAG with Amazon Bedrock offers the following benefits:

Up-to-date information: Responses include the latest data from your knowledge bases
Accuracy: Reduces the risk of incorrect or irrelevant answers
No extra infrastructure: You can avoid setting up and managing your own vector databases or custom integrations
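
As a rough sketch of what this looks like in code, the RetrieveAndGenerate API performs retrieval, prompt augmentation, and generation in a single call; the knowledge base ID and model ARN below are placeholders.

```python
import boto3

client = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

response = client.retrieve_and_generate(
    input={"text": "What is our current return policy for online orders?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "EXAMPLEKB123",  # placeholder knowledge base ID
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/amazon.nova-pro-v1:0",
        },
    },
)

print(response["output"]["text"])

# Citations link the answer back to the retrieved source documents
for citation in response.get("citations", []):
    for reference in citation.get("retrievedReferences", []):
        print(reference.get("location"))
```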

When your model is pulling from the most accurate and relevant data, you might find that its general behavior still needs some refinement, perhaps in its tone, style, or understanding of industry-specific language. This is where you can further fine-tune the model to align it even more closely with your business needs.
Tailoring models to your business needs
Out-of-the-box FMs provide a strong starting point, but they often lack the precision, brand voice, or industry-specific expertise required for real-world applications. Maybe the language doesn’t align with your brand, or the model struggles with specialized terminology. You might have experimented with prompt engineering and RAG to enhance responses with additional context. Although these techniques help, they have limitations (for example, longer prompts can increase latency and cost), and models might still lack deep domain expertise needed for domain-specific tasks. To fully harness generative AI, businesses need a way to securely adapt models, making sure AI-generated responses are not only accurate but also relevant, reliable, and aligned with business goals.
Amazon Bedrock simplifies model customization, enabling businesses to fine-tune FMs with proprietary data without building models from scratch or managing complex infrastructure.
Rather than retraining an entire model, Amazon Bedrock provides a fully managed fine-tuning process that creates a private copy of the base FM. This makes sure your proprietary data remains confidential and isn’t used to train the original model. Amazon Bedrock offers two powerful techniques to help businesses refine models efficiently:

Fine-tuning: You can train an FM with labeled datasets to improve accuracy in industry-specific terminology, brand voice, and company workflows. This allows the model to generate more precise, context-aware responses without relying on complex prompts.
Continued pre-training: If you have unlabeled domain-specific data, you can use continued pre-training to further train an FM on specialized industry knowledge without manual labeling. This approach is especially useful for regulatory compliance, domain-specific jargon, or evolving business operations.

By combining fine-tuning for core domain expertise with RAG for real-time knowledge retrieval, businesses can create highly specialized AI models that stay accurate and adaptable and whose response style aligns with business goals. To summarize, Amazon Bedrock offers the following benefits:

Privacy-preserved customization: Fine-tune models securely while making sure that your proprietary data remains private
Efficiency: Achieve high accuracy and domain relevance without the complexity of building models from scratch
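
A fine-tuning job can also be started programmatically; the following is a minimal sketch in which the job name, IAM role, S3 locations, and hyperparameter values are assumptions (hyperparameter names vary by model family).

```python
import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")

# Start a fine-tuning job on a base FM using labeled data in Amazon S3
bedrock.create_model_customization_job(
    jobName="support-tone-finetune-001",           # assumed job name
    customModelName="support-assistant-v1",        # assumed custom model name
    roleArn="arn:aws:iam::123456789012:role/BedrockCustomizationRole",  # placeholder IAM role
    baseModelIdentifier="amazon.nova-micro-v1:0",  # example base model
    customizationType="FINE_TUNING",               # use CONTINUED_PRE_TRAINING for unlabeled data
    trainingDataConfig={"s3Uri": "s3://example-bucket/train/train.jsonl"},
    outputDataConfig={"s3Uri": "s3://example-bucket/output/"},
    hyperParameters={"epochCount": "2", "learningRate": "0.00001"},  # names and values depend on the model
)
```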

As your project evolves, managing and optimizing prompts becomes critical, especially when dealing with different iterations or testing multiple prompt versions. The next step is refining your prompts to maximize model performance.
Managing and optimizing prompts
As your AI projects scale, managing multiple prompts efficiently becomes a growing challenge. Tracking versions, collaborating with teams, and testing variations can quickly become complex. Without a structured approach, prompt management can slow down innovation, increase costs, and make iteration cumbersome. Optimizing a prompt for one FM doesn’t always translate well to another. A prompt that performs well with one FM might produce inconsistent or suboptimal outputs with another, requiring significant rework. This makes switching between models time-consuming and inefficient, limiting your ability to experiment with different AI capabilities effectively. Without a centralized way to manage, test, and refine prompts, AI development becomes slower, more costly, and less adaptable to evolving business needs.
Amazon Bedrock simplifies prompt engineering with Amazon Bedrock Prompt Management, an integrated system that helps teams create, refine, version, and share prompts effortlessly. Instead of manually adjusting prompts for months, Amazon Bedrock accelerates experimentation and enhances response quality without additional code. Bedrock Prompt Management introduces the following capabilities:

Versioning and collaboration: Manage prompt iterations in a shared workspace, so teams can track changes and reuse optimized prompts.
Side-by-side testing: Compare up to two prompt variations simultaneously to analyze model behavior and identify the most effective format.
Automated prompt optimization: Refine and rewrite prompts based on the selected FM to improve response quality. You can select a model, apply optimization, and generate a more accurate, contextually relevant prompt.

Bedrock Prompt Management offers the following benefits:

Efficiency: Quickly iterate and optimize prompts without writing additional code
Teamwork: Enhance collaboration with shared access and version control
Insightful testing: Identify which prompts perform best for your use case
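
To give a feel for the workflow, the sketch below registers a prompt with one variant and then creates an immutable version; the prompt name, template, and variable names are assumptions, and the exact variant schema may differ from this simplified form.

```python
import boto3

client = boto3.client("bedrock-agent", region_name="us-east-1")

# Register a reusable prompt with a single text variant (simplified sketch)
prompt = client.create_prompt(
    name="order-status-summary",  # assumed prompt name
    description="Summarizes an order status update for the customer",
    variants=[
        {
            "name": "variant-one",
            "templateType": "TEXT",
            "modelId": "amazon.nova-micro-v1:0",
            "templateConfiguration": {
                "text": {
                    "text": "Summarize this order update for the customer: {{order_update}}",
                    "inputVariables": [{"name": "order_update"}],
                }
            },
        }
    ],
)

# Freeze the current draft as a numbered version for team-wide reuse
client.create_prompt_version(promptIdentifier=prompt["id"])
```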

After you’ve optimized your prompts for the best results, the next challenge is optimizing your application for cost and latency by choosing the most appropriate model within a family for a given task. This is where intelligent prompt routing can help.
Optimizing efficiency with intelligent model selection
Not all prompts require the same level of AI processing. Some are straightforward and need fast responses, whereas others require deeper reasoning and more computational power. Using high-performance models for every request increases costs and latency, even when a lighter, faster model could generate an equally effective response. At the same time, relying solely on smaller models might reduce accuracy for complex queries. Without an automated approach, businesses must manually determine which model to use for each request, leading to higher costs, inefficiencies, and slower development cycles.
Amazon Bedrock Intelligent Prompt Routing optimizes AI performance and cost by dynamically selecting the most appropriate FM for each request. Instead of manually choosing a model, Amazon Bedrock automates model selection within a model family, making sure that each prompt is routed to the best-performing model for its complexity. Bedrock Intelligent Prompt Routing offers the following capabilities:

Adaptive model routing: Automatically directs simple prompts to lightweight models and complex queries to more advanced models, providing the right balance between speed and efficiency
Performance balance: Makes sure that you use high-performance models only when necessary, reducing AI inference costs by up to 30%
Effortless integration: Automatically selects the right model within a family, simplifying deployment
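
Calling a prompt router looks the same as calling a single model: you pass the router's ARN as the modelId in a Converse request and Amazon Bedrock selects the model behind the scenes. The ARN below is an illustrative placeholder; you can find the real one in the Amazon Bedrock console.

```python
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

# Illustrative placeholder ARN for a prompt router in your account and Region
router_arn = "arn:aws:bedrock:us-east-1:123456789012:default-prompt-router/example-router"

response = client.converse(
    modelId=router_arn,  # the router picks the model within the family for each request
    messages=[{"role": "user", "content": [{"text": "What are the trading hours for US equities?"}]}],
)

print(response["output"]["message"]["content"][0]["text"])
```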

By automating model selection, Amazon Bedrock removes the need for manual decision-making, reduces operational overhead, and makes sure AI applications run efficiently at scale. With Amazon Bedrock Intelligent Prompt Routing, each query is processed by the most efficient model, delivering speed, cost savings, and high-quality responses. The next step in optimizing AI efficiency is reducing redundant computations in frequently used prompts. Many AI applications require maintaining context across multiple interactions, which can lead to performance bottlenecks, increased costs, and unnecessary processing overhead.
Reducing redundant processing for faster responses
As your generative AI applications scale, efficiency becomes just as critical as accuracy. Applications that repeatedly use the same context—such as document Q&A systems (where users ask multiple questions about the same document) or coding assistants that maintain context about code files—often face performance bottlenecks and rising costs because of redundant processing. Each time a query includes long, static context, models reprocess unchanged information, leading to increased latency as models repeatedly analyze the same content and unnecessary token usage inflates compute expenses. To keep AI applications fast, cost-effective, and scalable, optimizing how prompts are reused and processed is essential.
Amazon Bedrock Prompt Caching enhances efficiency by storing frequently used portions of prompts—reducing redundant computations and improving response times. It offers the following benefits:

Faster processing: Skips unnecessary recomputation of cached prompt prefixes, boosting overall throughput
Lower latency: Reduces processing time for long, repetitive prompts, delivering a smoother user experience and reducing latency by up to 85% for supported models
Cost-efficiency: Minimizes compute resource usage by avoiding repeated token processing, reducing costs by up to 90%
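
With the Converse API, caching is opt-in: you place a cache point after the static portion of the prompt (for example, a long document in the system prompt), and later requests on supported models reuse the cached prefix. This is a minimal sketch with a placeholder document.

```python
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

long_document = "...full text of a long reference document reused across many questions..."

response = client.converse(
    modelId="amazon.nova-pro-v1:0",  # example of a model that supports prompt caching
    system=[
        {"text": f"Answer questions using only this document:\n{long_document}"},
        {"cachePoint": {"type": "default"}},  # content before this marker is cached for reuse
    ],
    messages=[{"role": "user", "content": [{"text": "What are the key risks mentioned?"}]}],
)

print(response["output"]["message"]["content"][0]["text"])
```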

With prompt caching, AI applications respond faster, reduce operational costs, and scale efficiently while maintaining high performance. With Bedrock Prompt Caching providing faster responses and cost-efficiency, the next step is enabling AI applications to move beyond static prompt-response interactions. This is where agentic AI comes in, empowering applications to dynamically orchestrate multistep processes, automate decision-making, and drive intelligent workflows.
Automating multistep tasks with agentic AI
As AI applications grow more sophisticated, automating complex, multistep tasks becomes essential. You need a solution that can interact with internal systems, APIs, and databases to execute intricate workflows autonomously. The goal is to reduce manual intervention, improve efficiency, and create more dynamic, intelligent applications. Traditional AI models are reactive; they generate responses based on inputs but lack the ability to plan and execute multistep tasks. Agentic AI refers to AI systems that act with autonomy, breaking down complex tasks into logical steps, making decisions, and executing actions without constant human input. Unlike traditional models that only respond to prompts, agentic AI models have the following capabilities:

Autonomous planning and execution: Breaks complex tasks into smaller steps, makes decisions, and plans actions to complete the workflow
Chaining capabilities: Handles sequences of actions based on a single request, enabling the AI to manage intricate tasks that would otherwise require manual intervention or multiple interactions
Interaction with APIs and systems: Connects to your enterprise systems and automatically invokes necessary APIs or databases to fetch or update data

Amazon Bedrock Agents enables AI-powered task automation by using FMs to plan, orchestrate, and execute workflows. With a fully managed orchestration layer, Amazon Bedrock simplifies the process of deploying, scaling, and managing AI agents. Bedrock Agents offers the following benefits:

Task orchestration: Uses FMs’ reasoning capabilities to break down tasks, plan execution, and manage dependencies
API integration: Automatically calls APIs within enterprise systems to interact with business applications
Memory retention: Maintains context across interactions, allowing agents to remember previous steps, providing a seamless user experience

When a task requires multiple specialized agents, Amazon Bedrock supports multi-agent collaboration, making sure agents work together efficiently while alleviating manual orchestration overhead. This unlocks the following capabilities:

Supervisor-agent coordination: A supervisor agent delegates tasks to specialized subagents, providing optimal distribution of workloads
Efficient task execution: Supports parallel task execution, enabling faster processing and improved accuracy
Flexible collaboration modes: You can choose between the following modes:

Fully orchestrated supervisor mode: A central agent manages the full workflow, providing seamless coordination
Routing mode: Basic tasks bypass the supervisor and go directly to subagents, reducing unnecessary orchestration

Seamless integration: Works with enterprise APIs and internal knowledge bases, making it straightforward to automate business operations across multiple domains

By using multi-agent collaboration, you can increase task success rates, reduce execution time, and improve accuracy, making AI-driven automation more effective for real-world, complex workflows. To summarize, agentic AI offers the following benefits:

Automation: Reduces manual intervention in complex processes
Flexibility: Agents can adapt to changing requirements or gather additional information as needed
Transparency: You can use the trace capability to debug and optimize agent behavior
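
After an agent has been created and aliased, invoking it is a single streaming call; the agent ID, alias ID, and request text below are placeholders.

```python
import boto3
import uuid

client = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

response = client.invoke_agent(
    agentId="EXAMPLEAGENTID",     # placeholder agent ID
    agentAliasId="EXAMPLEALIAS",  # placeholder alias ID
    sessionId=str(uuid.uuid4()),  # the agent keeps context within a session
    inputText="Check the status of order 12345 and draft an email update for the customer.",
)

# The agent streams its final answer back as completion chunks
completion = ""
for event in response["completion"]:
    if "chunk" in event:
        completion += event["chunk"]["bytes"].decode("utf-8")

print(completion)
```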

Although automating tasks with agents can streamline operations, handling sensitive information and enforcing privacy are paramount, especially when interacting with user data and internal systems. As your application grows more sophisticated, so do the security and compliance challenges.
Maintaining security, privacy, and responsible AI practices
As you integrate generative AI into your business, security, privacy, and compliance become critical concerns. AI-generated responses must be safe, reliable, and aligned with your organization’s policies to help avoid violating brand guidelines or regulatory requirements, and they must not include inaccurate or misleading information.
Amazon Bedrock Guardrails provides a comprehensive framework to enhance security, privacy, and accuracy in AI-generated outputs. With built-in safeguards, you can enforce policies, filter content, and improve trustworthiness in AI interactions. Bedrock Guardrails offers the following capabilities:

Content filtering: Block undesirable topics and harmful content in user inputs and model responses.
Privacy protection: Detect and redact sensitive information like personally identifiable information (PII) and confidential data to help prevent data leaks.
Custom policies: Define organization-specific rules to make sure AI-generated content aligns with internal policies and brand guidelines.
Hallucination detection: Identify and filter out responses not grounded in your data sources through the following capabilities:

Contextual grounding checks: Make sure model responses are factually correct and relevant by validating them against enterprise data sources, and detect hallucinations when outputs contain unverified or irrelevant information.
Automated reasoning for accuracy: Moves beyond “trust me” to “prove it” for AI outputs by applying mathematically sound logic and structured reasoning to verify factual correctness.
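
A guardrail you have already created can be attached directly to a Converse request; the guardrail identifier and version below are placeholders.

```python
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

response = client.converse(
    modelId="amazon.nova-micro-v1:0",
    messages=[{"role": "user", "content": [{"text": "My card number is 4111 1111 1111 1111. Can you save it for later?"}]}],
    guardrailConfig={
        "guardrailIdentifier": "gr-exampleid",  # placeholder guardrail ID
        "guardrailVersion": "1",
        "trace": "enabled",  # include trace details to see which policies fired
    },
)

# If the guardrail intervenes, stopReason indicates it and the configured blocked message is returned
print(response["stopReason"])
print(response["output"]["message"]["content"][0]["text"])
```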

With security and privacy measures in place, your AI solution is not only powerful but also responsible. However, if you’ve already made significant investments in custom models, the next step is to integrate them seamlessly into Amazon Bedrock.
Using existing custom models with Amazon Bedrock Custom Model Import
Use Amazon Bedrock Custom Model Import if you’ve already invested in custom models developed outside of Amazon Bedrock and want to integrate them into your new generative AI solution without managing additional infrastructure.
Bedrock Custom Model Import includes the following capabilities:

Seamless integration: Import your custom models into Amazon Bedrock
Unified API access: Interact with models—both base and custom—through the same API
Operational efficiency: Let Amazon Bedrock handle the model lifecycle and infrastructure management

Bedrock Custom Model Import offers the following benefits:

Cost savings: Maximize the value of your existing models
Simplified management: Reduce overhead by consolidating model operations
Consistency: Maintain a unified development experience across models
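
The import itself is a single job submission; in this sketch, the job name, model name, IAM role, and S3 location are assumptions. Once the job completes, the imported model is invoked through the same unified API as the base FMs.

```python
import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")

# Import custom model weights stored in Amazon S3 into Amazon Bedrock
bedrock.create_model_import_job(
    jobName="import-my-custom-model-001",  # assumed job name
    importedModelName="my-custom-llm",     # name the model will have in Amazon Bedrock
    roleArn="arn:aws:iam::123456789012:role/BedrockModelImportRole",  # placeholder IAM role
    modelDataSource={"s3DataSource": {"s3Uri": "s3://example-bucket/custom-model/"}},
)
```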

By importing custom models, you can build on your prior investments. To truly unlock the potential of your models and prompt structures, you can automate more complex workflows, combining multiple prompts and integrating with other AWS services.
Automating workflows with Amazon Bedrock Flows
You need to build complex workflows that involve multiple prompts and integrate with other AWS services or business logic, but you want to avoid extensive coding.
Amazon Bedrock Flows has the following capabilities:

Visual builder: Drag-and-drop components to create workflows
Workflow automation: Link prompts with AWS services and automate sequences
Testing and versioning: Test flows directly in the console and manage versions

Amazon Bedrock Flows offers the following benefits:

No-code solution: Build workflows without writing code
Speed: Accelerate development and deployment of complex applications
Collaboration: Share and manage workflows within your team

With workflows now automated and optimized, you’re nearly ready to deploy your generative AI-powered solution. The final stage is making sure that your generative AI solution can scale efficiently and maintain high performance as demand grows.
Monitoring and logging to close the loop on AI operations
As you prepare to move your generative AI application into production, it’s critical to implement robust logging and observability to monitor system health, verify compliance, and quickly troubleshoot issues. Amazon Bedrock offers built-in observability capabilities that integrate seamlessly with AWS monitoring tools, enabling teams to track performance, understand usage patterns, and maintain operational control.

Model invocation logging: You can enable detailed logging of model invocations, capturing input prompts and output responses. These logs can be streamed to Amazon CloudWatch or Amazon Simple Storage Service (Amazon S3) for real-time monitoring or long-term analysis. Logging is configurable through the AWS Management Console or the Amazon Bedrock API.
CloudWatch metrics: Amazon Bedrock provides rich operational metrics out-of-the-box, including:

Invocation count
Token usage (input/output)
Response latency
Error rates (for example, invalid input and model failures)

These capabilities are essential for running generative AI solutions at scale with confidence. By using CloudWatch, you gain visibility across the full AI pipeline, from input prompts to model behavior, making it straightforward to maintain uptime, performance, and compliance as your application grows.
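As a minimal sketch, invocation logging can be turned on account-wide with one call; the log group, role, and bucket names below are assumptions.

```python
import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")

# Deliver full model invocation logs to CloudWatch Logs and Amazon S3
bedrock.put_model_invocation_logging_configuration(
    loggingConfig={
        "cloudWatchConfig": {
            "logGroupName": "/bedrock/model-invocations",  # assumed log group
            "roleArn": "arn:aws:iam::123456789012:role/BedrockLoggingRole",  # placeholder IAM role
        },
        "s3Config": {"bucketName": "example-bedrock-logs", "keyPrefix": "invocations/"},
        "textDataDeliveryEnabled": True,
        "imageDataDeliveryEnabled": False,
        "embeddingDataDeliveryEnabled": False,
    }
)
```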
Finalizing and scaling your generative AI solution
You’re ready to deploy your generative AI application and need to scale it efficiently while providing reliable performance. Whether you’re handling unpredictable workloads, enhancing resilience, or needing consistent throughput, you must choose the right scaling approach. Amazon Bedrock offers three flexible scaling options that you can use to tailor your infrastructure to your workload needs:

On-demand: Start with the flexibility of on-demand scaling, where you pay only for what you use. This option is ideal for early-stage deployments or applications with variable or unpredictable traffic. It offers the following benefits:

No commitments.
Pay only for tokens processed (input/output).
Great for dynamic or fluctuating workloads.

Cross-Region inference: When your traffic grows or becomes unpredictable, you can use cross-Region inference to handle bursts by distributing compute across multiple AWS Regions, enhancing availability without additional cost. It offers the following benefits:

Up to two times larger burst capacity.
Improved resilience and availability.
No additional charges; you pay the same price as in your primary Region.

Provisioned Throughput: For large, consistent workloads, Provisioned Throughput maintains a fixed level of performance. This option is perfect when you need predictable throughput, particularly for custom models. It offers the following benefits:

Consistent performance for high-demand applications.
Required for custom models.
Flexible commitment terms (1 month or 6 months).
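
To illustrate how little code changes between these options, switching to cross-Region inference is typically just a different model identifier: you reference a geo-prefixed inference profile instead of the bare model ID. The profile ID below is an example.

```python
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

# A geo-prefixed inference profile routes requests across Regions in that geography
response = client.converse(
    modelId="us.amazon.nova-micro-v1:0",  # example cross-Region inference profile ID
    messages=[{"role": "user", "content": [{"text": "Give me a one-line summary of today's market activity."}]}],
)

print(response["output"]["message"]["content"][0]["text"])
```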

Conclusion
Building generative AI solutions is a multifaceted process that requires careful consideration at every stage. Amazon Bedrock simplifies this journey by providing a unified service that supports each phase, from model selection and customization to deployment and compliance.
Amazon Bedrock offers a comprehensive suite of features that you can use to streamline and enhance your generative AI development process. By using its unified tools and APIs, you can significantly reduce complexity, enabling accelerated development and smoother workflows. Collaboration becomes more efficient because team members can work seamlessly across different stages, fostering a more cohesive and productive environment. Additionally, Amazon Bedrock integrates robust security and privacy measures, helping to ensure that your solutions meet industry and organization requirements. Finally, you can use its scalable infrastructure to bring your generative AI solutions to production faster while minimizing overhead.
Amazon Bedrock stands out as a one-stop solution that you can use to build sophisticated, secure, and scalable generative AI applications. Its extensive capabilities alleviate the need for multiple vendors and tools, streamlining your workflow and enhancing productivity.
Explore Amazon Bedrock and discover how you can use its features to support your needs at every stage of generative AI development. To learn more, see the Amazon Bedrock User Guide.

About the authors
Venkata Santosh Sajjan Alla is a Senior Solutions Architect at AWS Financial Services, driving AI-led transformation across North America’s FinTech sector. He partners with organizations to design and execute cloud and AI strategies that speed up innovation and deliver measurable business impact. His work has consistently translated into millions in value through enhanced efficiency and additional revenue streams. With deep expertise in AI/ML, Generative AI, and cloud-native architectures, Sajjan enables financial institutions to achieve scalable, data-driven outcomes. When not architecting the future of finance, he enjoys traveling and spending time with family. Connect with him on LinkedIn.
Axel Larsson is a Principal Solutions Architect at AWS based in the greater New York City area. He supports FinTech customers and is passionate about helping them transform their business through cloud and AI technology. Outside of work, he is an avid tinkerer and enjoys experimenting with home automation.

How Netsertive built a scalable AI assistant to extract meaningful ins …

This post was co-written with Herb Brittner from Netsertive.
Netsertive is a leading digital marketing solutions provider for multi-location brands and franchises, helping businesses maximize local advertising, improve engagement, and gain deep customer insights.
With growing demand for more actionable insights from their customer call tracking data, Netsertive needed a solution that could unlock business intelligence from every call, making it easier for franchises to improve customer service and boost conversion rates. The team was looking for a single, flexible system that could do several things:

Understand phone calls – Automatically create summaries of what was discussed
Gauge customer feelings – Determine if the caller was happy, upset, or neutral
Identify important topics – Pull out keywords related to frequent services, questions, problems, and mentions of competitors
Improve agent performance – Offer advice and suggestions for coaching
Track performance over time – Generate reports on trends for individual locations, regions, and the entire country

Crucially, this new system needed to work smoothly with their existing Multi-Location Experience (MLX) platform. The MLX platform is specifically designed for businesses with many locations and helps them manage both national and local marketing. It allows them to run campaigns across various online channels, including search engines, social media, display ads, videos, connected TVs, and online reviews, as well as manage SEO, business listings, reviews, social media posting, and individual location web pages.
In this post, we show how Netsertive introduced a generative AI-powered assistant into MLX, using Amazon Bedrock and Amazon Nova, to bring their next generation of the platform to life.
Solution overview
Operating a comprehensive digital marketing solution, Netsertive handles campaign execution while providing key success metrics through their Insights Manager product. The platform features location-specific content management capabilities and robust lead capture functionality, collecting data from multiple sources, including paid campaigns, organic website traffic, and attribution forms. With CRM integration and call tracking features, MLX creates a seamless flow of customer data and marketing insights. This combination of managed services, automated tools, and analytics makes MLX a single source of truth for businesses seeking to optimize their digital marketing efforts while taking advantage of Netsertive’s expertise in campaign management.
To address their desire to provide more actionable insights on the platform from customer call tracking data, Netsertive considered various solutions. After evaluating different tools and models, they decided to use Amazon Bedrock and the Amazon Nova Micro model. This choice was driven by the API-driven approach of Amazon Bedrock, its wide selection of large language models (LLMs), and the performance of the Amazon Nova Micro model specifically. They selected Amazon Nova Micro based on its ability to deliver fast response times at a low cost, while providing consistent and intelligent insights, which were key factors for Netsertive. With its generation speed of over 200 tokens per second and highly performant language understanding skills, this text-only model proved ideal for Netsertive.
The following diagram shows how their MLX platform receives real-time phone calls and uses Amazon Nova Micro in Amazon Bedrock to process them.

The real-time call processing flow consists of the following steps:

When a call comes in, it’s immediately routed to the Lead API. This process captures both the live call transcript and important metadata about the caller. This system continuously processes new calls as they arrive, facilitating real-time handling of incoming communications.
The captured transcript is forwarded to Amazon Bedrock for analysis. The system currently uses a standardized base prompt for all customers, and the architecture is designed to allow for customer-specific prompt customization as an added layer of context.
Amazon Nova Micro processes the transcript and returns a structured JSON response. This response includes multiple analysis components: sentiment analysis of the conversation, a concise call summary, identified key terms, overall call theme classification, and specific coaching suggestions for improvement.
All analysis results are systematically stored in an Amazon Aurora database with their associated key metrics. This makes sure the processed data is properly indexed and readily available for both immediate access and future analysis.
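
The sketch below illustrates the general shape of steps 2 and 3 with a simplified prompt; it is not Netsertive's production prompt or schema, and the JSON field names are assumptions for illustration only.

```python
import boto3
import json

client = boto3.client("bedrock-runtime", region_name="us-east-1")

def analyze_transcript(transcript: str) -> dict:
    """Ask Amazon Nova Micro for a structured analysis of one call transcript (illustrative only)."""
    prompt = (
        "Analyze the following customer call transcript. Return only JSON with the keys "
        "sentiment, summary, key_terms, theme, and coaching_suggestions.\n\n"
        f"Transcript:\n{transcript}"
    )
    response = client.converse(
        modelId="amazon.nova-micro-v1:0",
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": 800, "temperature": 0.1},
    )
    # Assumes the model honors the "JSON only" instruction; production code would validate this
    return json.loads(response["output"]["message"]["content"][0]["text"])

# Example usage with a placeholder transcript
analysis = analyze_transcript("Caller asked about weekend availability for a brake inspection...")
print(analysis["sentiment"], analysis["summary"])
```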

The aggregate report schedule flow consists of the following steps:

The aggregate analysis process automatically initiates on both weekly and monthly schedules. During each run, the system gathers call data that falls within the specified time period.
This aggregate analysis uses both Amazon Bedrock and Amazon Nova Micro, applying a specialized prompt designed specifically for trend analysis. This prompt differs from the real-time analysis to focus on identifying patterns and insights across multiple calls.

The processed aggregate data from both workflows is transformed into comprehensive reports displaying trend analysis and comparative metrics through the UI. This provides stakeholders with valuable insights into performance patterns and trends over time while allowing the user to dive deeper into specific metrics.
Results
The implementation of generative AI to create a real-time call data analysis solution has been a transformative journey for Netsertive. Their new Call Insights AI feature, using Amazon Nova Micro on Amazon Bedrock, only takes minutes to create actionable insights, compared to their previous manual call review processes, which took hours or even days for customers with high call volumes. Netsertive chose Amazon Bedrock and Amazon Nova Micro for their solution after a swift evaluation period of approximately 1 week of testing different tools and models.
Their development approach was methodical and customer-focused. The Call Insights AI feature was added to their platform’s roadmap based on direct customer feedback and internal marketing expertise. The entire development process, from creating and testing their Amazon Nova Micro prompts to integrating Amazon Bedrock with their MLX platform, was completed within approximately 30 days before launching in beta.
The transformation of real-time call data analysis isn’t just about processing more calls—it’s about creating a more comprehensive understanding of customer interactions. By implementing Amazon Bedrock and Amazon Nova Micro, Netsertive is able to better understand call purposes and value, enhance measurement capabilities, and progress towards more automated and efficient analysis systems. This evolution can not only streamline operations but also provide customers with more actionable insights about their digital marketing performance.
Conclusion
In this post, we shared how Netsertive introduced a generative AI-powered assistant into MLX, using Amazon Bedrock and Amazon Nova. This solution helped scale their MLX platform to provide their customers with instant, actionable insights, creating a more engaging and informative user experience. By using the advanced natural language processing capabilities of Amazon Bedrock and the high-performance, low-latency Amazon Nova Micro model, Netsertive was able to build a comprehensive call intelligence system that goes beyond just transcription and sentiment analysis.
The success of this project has demonstrated the transformative potential of generative AI in driving business intelligence and operational efficiency. To learn more about building powerful, generative AI assistants and applications using Amazon Bedrock and Amazon Nova, see Generative AI on AWS.

About the authors
Nicholas Switzer is an AI/ML Specialist Solutions Architect at Amazon Web Services. He joined AWS in 2022 and specializes in AI/ML, generative AI, IoT, and edge AI. He is based in the US and enjoys building intelligent products that improve everyday life.
Jane Ridge is a Senior Solutions Architect at Amazon Web Services with over 20 years of technology experience. She joined AWS in 2020 and is based in the US. She is passionate about enabling her customers’ growth through innovative solutions combined with her deep technical expertise in the AWS ecosystem. She is known for her ability to guide customers through all stages of their cloud journey and deliver impactful solutions.
Herb Brittner is the Vice President of Product & Engineering at Netsertive, where he leads the development of AI-driven digital marketing solutions for multi-location brands and franchises. With a strong background in product innovation and scalable engineering, he specializes in using machine learning and cloud technologies to drive business insights and customer engagement. Herb is passionate about building data-driven platforms that enhance marketing performance and operational efficiency.