Google DeepMind Achieves State-of-the-Art Data-Efficient Reinforcement Learning with Improved Transformer World Models

Reinforcement learning (RL) trains agents to maximize rewards by interacting with an environment. Online RL alternates between taking actions, collecting observations and rewards, and updating policies using this experience. Model-free RL (MFRL) maps observations to actions but requires extensive data collection. Model-based RL (MBRL) mitigates this by learning a world model (WM) for planning in an imagined environment. Standard benchmarks like Atari-100k test sample efficiency, but their deterministic nature allows memorization rather than generalization. To encourage broader skills, researchers use Crafter, a 2D Minecraft-like environment. Craftax-classic, a JAX-based version, introduces procedurally generated environments, partial observability, and a sparse reward system, requiring deep exploration.

MBRL methods vary based on how WMs are used—for background planning (training policies with imagined data) or decision-time planning (conducting lookahead searches during inference). As seen in MuZero and EfficientZero, decision-time planning is effective but computationally expensive for large WMs like transformers. Background planning, originating from Dyna-Q learning, has been refined in deep RL models like Dreamer, IRIS, and DART. WMs also differ in generative ability; while non-generative WMs excel in efficiency, generative WMs better integrate real and imagined data. Many modern architectures use transformers, though recurrent state-space models like DreamerV2/3 remain relevant.

Researchers from Google DeepMind introduce an advanced MBRL method that sets a new benchmark in the Craftax-classic environment, a complex 2D survival game requiring generalization, deep exploration, and long-term reasoning. Their approach achieves a 67.42% reward after 1M steps, surpassing DreamerV3 (53.2%) and human performance (65.0%). They enhance MBRL with a robust model-free baseline, “Dyna with warmup” for real and imagined rollouts, a nearest-neighbor tokenizer for patch-based image processing, and block teacher forcing for efficient token prediction. These refinements collectively improve sample efficiency, achieving state-of-the-art performance in data-efficient RL.

The study enhances the MFRL baseline by expanding the model size and incorporating a Gated Recurrent Unit (GRU), increasing rewards from 46.91% to 55.49%. Additionally, the study introduces an MBRL approach using a Transformer World Model (TWM) with VQ-VAE quantization, achieving 31.93% rewards. To further optimize performance, a Dyna-based method integrates real and imagined rollouts, improving learning efficiency. Replacing VQ-VAE with a patch-wise nearest-neighbor tokenizer boosts performance from 43.36% to 58.92%. These advancements demonstrate the effectiveness of combining memory mechanisms, transformer-based models, and improved observation encoding in reinforcement learning.

The study presents results from experiments on the Craftax-classic benchmark, conducted on 8 H100 GPUs over 1M steps. Each method collected 96-length trajectories in 48 parallel environments. For MBRL methods, imagined rollouts were generated starting at 200k environment steps and updated 500 times. The “MBRL ladder” progression showed significant improvements, with the best agent (M5) achieving a 67.42% reward. Ablation studies confirmed the importance of each component, such as Dyna, the nearest-neighbor tokenizer (NNT), patch-based image processing, and block teacher forcing (BTF). Compared with existing methods, the best MBRL agent achieved state-of-the-art performance. Additionally, Craftax Full experiments demonstrated generalization to harder environments.

In conclusion, the study introduces three key improvements to vision-based MBRL agents using TWM for background planning. These enhancements include Dyna with warmup, patch nearest-neighbor tokenization, and block teacher forcing. The proposed MBRL agent performs better on the Craftax-classic benchmark, surpassing previous state-of-the-art models and the human expert reward. Future work includes exploring generalization beyond Craftax, prioritizing experience replay, integrating off-policy RL algorithms, and refining the tokenizer for large pre-trained models like SAM and DINOv2. Additionally, the policy will be modified to accept latent tokens from non-reconstructive world models.


Trellix lowers cost, increases speed, and adds delivery flexibility wi …

This post is co-written with Martin Holste from Trellix. 
Security teams are dealing with an evolving universe of cybersecurity threats. These threats are expanding in form factor, sophistication, and the attack surface they target. Constrained by talent and budget limitations, teams are often forced to prioritize the events pursued for investigation, limiting the ability to detect and identify new threats. Trellix Wise is an AI-powered technology enabling security teams to automate threat investigation and add risk scores to events. With Trellix Wise, security teams can now complete what used to take multiple analysts hours of work to investigate in seconds, enabling them to expand the security events they are able to cover.
Trellix, a leading company delivering cybersecurity’s broadest AI-powered platform to over 53,000 customers worldwide, emerged in 2022 from the merger of McAfee Enterprise and FireEye. The company’s comprehensive, open, and native AI-powered security platform helps organizations build operational resilience against advanced threats. Trellix Wise is available to customers as part of the Trellix Security Platform. This post discusses the adoption and evaluation of Amazon Nova foundation models (FMs) by Trellix.
With growing adoption and use, the Trellix team has been exploring ways to optimize the cost structure of Trellix Wise investigations. Smaller, cost-effective FMs seemed promising and Amazon Nova Micro stood out as an option because of its quality and cost. In early evaluations, the Trellix team observed that Amazon Nova Micro delivered inferences three times faster and at nearly 100-fold lower cost.
The following figures are the results of tests by Trellix comparing Amazon Nova Micro to other models on Amazon Bedrock.

The Trellix team identified areas where Amazon Nova Micro can complement their use of Anthropic’s Claude Sonnet, delivering lower costs and higher overall speeds. Additionally, the professional services team at Trellix found Amazon Nova Lite to be a strong model for code generation and code understanding and is now using Amazon Nova Lite to speed up their custom solution delivery workflows.
Trellix Wise, generative-AI-powered threat investigation to assist security analysts
Trellix Wise is built on Amazon Bedrock and uses Anthropic’s Claude Sonnet as its primary model. The platform uses Amazon OpenSearch Service, which stores billions of security events collected from the monitored environments. OpenSearch Service comes with a built-in vector database capability, making it straightforward to use data stored in OpenSearch Service as context data in a Retrieval Augmented Generation (RAG) architecture with Amazon Bedrock Knowledge Bases. Using OpenSearch Service and Amazon Bedrock, Trellix Wise carries out its automated, proprietary threat investigation steps on each event. This includes retrieval of the data required for analysis, analysis of the data using insights from other custom-built machine learning (ML) models, and risk scoring. This approach enables the service to interpret complex security data patterns and make intelligent decisions about each event. The Trellix Wise investigation assigns each event a risk score and allows analysts to dive deeper into the results of the analysis to determine whether human follow-up is necessary.
The following screenshot shows an example of an event on the Trellix Wise dashboard.

With the growing scale of adoption, Trellix has been evaluating ways to improve cost and speed. The Trellix team determined that not all stages in the investigation need the accuracy of Claude Sonnet, and that some stages can benefit from faster, lower-cost models that are still highly accurate for the target task. This is where Amazon Nova Micro has helped improve the cost structure of investigations.
Improving investigation cost with Amazon Nova Micro, RAG, and repeat inferences
The threat investigation workflow consists of multiple steps, from data collection, to analysis, to assigning a risk score for the event. The collection stage retrieves event-related information for analysis. This is implemented through one or more inference calls to a model in Amazon Bedrock. The priority in this stage is to maximize the completeness of the retrieved data and minimize inaccuracy (hallucinations). The Trellix team identified this stage as the optimal place in the workflow to optimize for speed and cost.
The Trellix team concluded, based on their testing, that Amazon Nova Micro offered two key advantages: its speed allows it to process 3-5 inferences in the time of a single Claude Sonnet inference, and its cost per inference is almost 100 times lower. By running multiple inferences, the team can maximize the coverage of required data and still lower costs by a factor of 30. Although the model's responses had higher variability than those of larger models, running multiple passes produces a more exhaustive response set. The response limitations enforced through proprietary prompt engineering and reference data constrain the response space, limiting hallucinations and inaccuracies in the response.
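As a rough illustration of this multiple-pass pattern (not Trellix's implementation), the following sketch calls a Nova Micro model several times through the Amazon Bedrock Converse API and collects the responses for downstream merging; the model ID, prompt, and inference parameters are assumptions:

import boto3

bedrock = boto3.client("bedrock-runtime")

def gather_event_context(prompt: str, passes: int = 3) -> list:
    """Run several fast, low-cost inferences and return all responses for merging."""
    responses = []
    for _ in range(passes):
        result = bedrock.converse(
            modelId="amazon.nova-micro-v1:0",  # assumed model ID; confirm availability in your Region
            messages=[{"role": "user", "content": [{"text": prompt}]}],
            inferenceConfig={"temperature": 0.7, "maxTokens": 512},
        )
        responses.append(result["output"]["message"]["content"][0]["text"])
    # Downstream logic deduplicates and merges the passes to maximize coverage of the retrieved data.
    return responses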
Before implementing the approach, the Trellix team carried out detailed testing to review response completeness, cost, and speed. The team realized early in their generative AI journey that standardized benchmarks are not sufficient when evaluating models for a specific use case. They set up a test harness replicating the information-gathering workflows and carried out detailed evaluations of multiple models to validate the benefits of this approach before moving ahead. The observed speed and cost gains confirmed the approach, which is now deployed in a limited pilot environment, with detailed evaluations continuing as part of a phased rollout into production.
Conclusion
In this post, we shared how Trellix adopted and evaluated Amazon Nova models, resulting in significant inference speedup and lower costs. Reflecting on the project, the Trellix team recognizes the following as key enablers allowing them to achieve these results:

Access to a broad range of models, including smaller highly capable models like Amazon Nova Micro and Amazon Nova Lite, accelerated the team’s ability to easily experiment and adopt new models as appropriate.
The ability to constrain responses using pre-built, use-case-specific scaffolding that incorporated proprietary data, processes, and policies reduced the risk of hallucinations and inaccuracies.
Data services that enabled effective integration of data alongside foundation models simplified implementation and reduced the time to production for new components.

“Amazon Bedrock makes it easy to evaluate new models and approaches as they become available. Using Amazon Nova Micro alongside Anthropic’s Claude Sonnet allows us to deliver the best coverage to our customers, fast, and at the best operating cost,” says Martin Holste, Senior Director, Engineering, Trellix. “We’re really happy with the flexibility that Amazon Bedrock allows us as we continue to evaluate and improve Trellix Wise and the Trellix Security Platform.”
Get started with Amazon Nova on the Amazon Bedrock console. Learn more at the Amazon Nova product page.

About the Authors
Martin Holste is the CTO for Cloud and GenAI at Trellix. Firat Elbey is a Principal Product Manager at Amazon AGI. Deepak Mohan is a Principal Product Marketing Manager at AWS.

OfferUp improved local results by 54% and relevance recall by 27% with …

This post is co-written with Andrés Vélez Echeveri and Sean Azlin from OfferUp.

OfferUp is an online, mobile-first marketplace designed to facilitate local transactions and discovery. Known for its user-friendly app and trust-building features, including user ratings and in-app chat, OfferUp enables users to buy and sell items and explore a broad range of jobs and local services. As part of its ongoing mission to enhance user experience and drive business growth, OfferUp constantly seeks to improve its search capabilities, making it faster and more intuitive for users to discover, transact, and connect in their local communities.
In this two-part blog post series, we explore the key opportunities OfferUp embraced on their journey to boost and transform their existing search solution from traditional lexical search to modern multimodal search powered by Amazon Bedrock and Amazon OpenSearch Service. OfferUp found that multimodal search improved relevance recall by 27%, reduced geographic spread (which means more local results) by 54%, and grew search depth by 6.5%. This series delves into the strategies, architecture patterns, business benefits, and technical steps to modernize your own search solution.
Foundational search architecture
OfferUp hosts millions of active listings, with millions more added monthly by its users. Previously, OfferUp’s search engine was built with Elasticsearch (v7.10) on Amazon Elastic Compute Cloud (Amazon EC2), using a keyword search algorithm to find relevant listings. The following diagram illustrates the data pipeline for indexing and query in the foundational search architecture.

Figure 1: Foundational search architecture
The data indexing workflow consists of the following steps:

As an OfferUp user creates or updates a listing, any new images are uploaded directly to Amazon Simple Storage Service (Amazon S3) using signed upload URLs.
The OfferUp user submits the new or updated listing details (title, description, image ids) to a posting microservice.
The posting microservice then persists the changes using the listing writer microservice in Amazon DynamoDB.
The listing writer microservice publishes listing change events to an Amazon Simple Notification Service (Amazon SNS) topic, which an Amazon Simple Queue Service (Amazon SQS) queue subscribes to.
The listing indexer AWS Lambda function continuously polls the queue and processes incoming listing updates.
The indexer retrieves the full listing details through the listing reader microservice from the DynamoDB table.
Finally, the indexer updates or inserts these listing details into Elasticsearch.

This flow makes sure that new or updated listings are indexed and made available for search queries in Elasticsearch.
The data query workflow consists of the following steps:

OfferUp users perform text searches, such as “summer shirt” or “running shoes”.
The search microservice processes the query requests and retrieves relevant listings from Elasticsearch using keyword search (BM25 as a ranking algorithm).
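For illustration, a keyword query of this kind against the listings index might look like the following; the index name, fields, and endpoint are assumptions, not OfferUp's actual schema:

from elasticsearch import Elasticsearch

es_client = Elasticsearch(["https://search-endpoint:9200"])  # placeholder endpoint

keyword_query = {
    "size": 10,
    "query": {
        "multi_match": {               # BM25 scoring over the matched fields
            "query": "running shoes",
            "fields": ["title", "description"],
        }
    },
}
results = es_client.search(index="listings", body=keyword_query)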

Challenges with the foundational search architecture
OfferUp continuously strives to enhance user experience, focusing specifically on improving search relevance, which directly impacts Engagement with Seller Response (EWSR) and drives ad impressions. Although the foundational search architecture effectively surfaces a broad and diverse inventory, OfferUp encountered several limitations that prevent it from achieving optimal outcomes. These challenges include:

Context understanding – Keyword searches don’t account for the context in which a term is used. This can lead to irrelevant results if the same keyword has different meanings or uses. Keywords alone can’t discern user intent. For instance, “apple” could refer to the fruit, the technology company, or the brand name in different contexts.
Synonym and variation awareness – Keyword searches might miss results if the search terms vary or if synonyms are used. For example, searching for “car” might not return results for “sedan”. Similarly, searching for iPhone 11 can return results for iPhone 10 and iPhone 12.
Complex query management – The foundational search approach struggled with complex, multi-concept queries like “red running shoes,” often returning results that included shoes in other colors or footwear not designed for running.

Keyword search, which uses BM25 as a ranking algorithm, lacks the ability to understand semantic relationships between words, often missing semantically relevant results if they don’t contain exact keywords.
Solution overview
To improve search quality, OfferUp explored various software and hardware solutions focused on boosting search relevance while maintaining cost-efficiency. Ultimately, OfferUp selected Amazon Titan Multimodal Embeddings and Amazon OpenSearch Service for their fully managed services, which support a robust multimodal search solution capable of delivering high accuracy and fast responses across search and recommendation use cases. This choice also simplifies the deployment and operation of large-scale search capabilities on the OfferUp app, meeting the high throughput and latency requirements.
Amazon Titan Multimodal Embeddings G1 model
This model is pre-trained on large datasets, so you can use it as-is or customize this model by fine-tuning with your own data for a particular task. This model is used for use cases like searching images by text, by image, or by a combination of text and image for similarity and personalization. It translates the input image or text into an embedding that contains the semantic meaning of both the image and text in the same semantic space. By comparing embeddings, the model produces more relevant and contextual responses than keyword matching alone.
The Amazon Titan Multimodal Embeddings G1 offers the following configurations:

Model ID – amazon.titan-embed-image-v1
Max input text tokens – 256
Max input image size – 25 MB
Output vector size – 1,024 (default), 384, 256
Inference types – On-Demand, Provisioned Throughput
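As a minimal sketch, the model can be invoked through Amazon Bedrock as follows; the input text, image path, and output length are illustrative:

import base64
import json
import boto3

bedrock = boto3.client("bedrock-runtime")

# Base64-encode a listing image (placeholder file name)
with open("listing.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

body = json.dumps({
    "inputText": "gray faux leather sofa",
    "inputImage": image_b64,
    "embeddingConfig": {"outputEmbeddingLength": 1024},
})
response = bedrock.invoke_model(
    modelId="amazon.titan-embed-image-v1",
    body=body,
    contentType="application/json",
    accept="application/json",
)
embedding = json.loads(response["body"].read())["embedding"]  # 1,024-dimensional vector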

OpenSearch Service’s vector database capabilities
Vector databases enable the storage and indexing of vectors alongside metadata, facilitating low-latency queries to discover assets based on similarity. These databases typically use k-nearest neighbor (k-NN) indexes built with advanced algorithms such as Hierarchical Navigable Small Worlds (HNSW) and Inverted File (IVF) systems. Beyond basic k-NN functionality, vector databases offer a robust foundation for applications that require data management, fault tolerance, resource access controls, and an efficient query engine.
OpenSearch is a powerful, open-source suite that provides scalable and flexible tools for search, analytics, security monitoring, and observability—all under the Apache 2.0 license. With Amazon OpenSearch Service, you get a fully managed solution that makes it simple to deploy, scale, and operate OpenSearch in the AWS Cloud. By using Amazon OpenSearch Service as a vector database, you can combine traditional search, analytics, and vector search into one comprehensive solution. OpenSearch’s vector capabilities help accelerate AI application development, making it easier for teams to operationalize, manage, and integrate AI-driven assets.
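For example, a k-NN enabled index for listings could be defined as follows; the endpoint, field names, and engine choice are assumptions rather than OfferUp's production settings:

from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "my-domain-endpoint", "port": 443}], use_ssl=True)  # placeholder endpoint

index_body = {
    "settings": {"index.knn": True},
    "mappings": {
        "properties": {
            "vector_embedding": {
                "type": "knn_vector",
                "dimension": 1024,           # matches the Titan Multimodal Embeddings output size
                "method": {
                    "name": "hnsw",          # Hierarchical Navigable Small Worlds
                    "space_type": "cosinesimil",
                    "engine": "nmslib",
                },
            },
            "title": {"type": "text"},
            "description": {"type": "text"},
        }
    },
}
client.indices.create(index="listings", body=index_body)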
To further boost these capabilities, OpenSearch offers advanced features, such as:

Connector for Amazon Bedrock – You can seamlessly integrate Amazon Bedrock machine learning (ML) models with OpenSearch through built-in connectors for services, enabling direct access to advanced ML features.
Ingest Pipeline – With ingest pipelines, you can process, transform, and route data efficiently, maintaining smooth data flows and real-time accessibility for search.
Neural Search – Neural search transforms text and images into vectors and facilitates vector search both at ingestion time and at search time. This allows end-to-end configuration of ingest pipelines, search pipelines, and the necessary connectors without having to leave OpenSearch.

Transformed multimodal search architecture
OfferUp transformed its foundational search architecture with Amazon Bedrock Titan Multimodal Embeddings and Amazon OpenSearch Service.

The following diagram illustrates the data pipeline for indexing and query in the transformed multimodal search architecture:

Figure 2: Transformed multimodal search architecture
The data indexing workflow consists of the following steps:

As an OfferUp user creates or updates a listing, any new images are uploaded directly to Amazon Simple Storage Service (Amazon S3) using signed upload URLs.
The OfferUp user submits the new or updated listing details (title, description, image ids) to a posting microservice.
The posting microservice then persists the changes using the listing writer microservice in Amazon DynamoDB.
The listing writer microservice publishes listing change events to an Amazon Simple Notification Service (Amazon SNS) topic, which an Amazon Simple Queue Service (Amazon SQS) queue subscribes to.
The listing indexer AWS Lambda function continuously polls the queue and processes incoming listing updates.
The indexer retrieves the full listing details through the listing reader microservice from the DynamoDB table.
The Lambda indexer relies on the image microservice to retrieve listing images and encode them in base64 format.
The indexer Lambda function sends inserts and updates with listing details and base64-encoded images to an Amazon OpenSearch Service domain.
An OpenSearch ingest pipeline invokes the OpenSearch connector for Amazon Bedrock, and the Titan Multimodal Embeddings model generates multi-dimensional vector embeddings for the listing image and description (a sample pipeline definition is sketched after this list).
Listing data and embeddings are then stored in an Amazon OpenSearch index.
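A sample definition of the ingest pipeline described above is sketched here, using the OpenSearch neural-search text_image_embedding processor. The endpoint, model ID (pointing at the Bedrock connector model registered in OpenSearch), and field names are placeholders:

from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "my-domain-endpoint", "port": 443}], use_ssl=True)  # placeholder endpoint

pipeline_body = {
    "description": "Generate multimodal embeddings for listings",
    "processors": [
        {
            "text_image_embedding": {
                "model_id": "<bedrock-connector-model-id>",   # placeholder connector model ID
                "embedding": "vector_embedding",              # target k-NN field
                "field_map": {
                    "text": "description",
                    "image": "image_binary",                  # base64-encoded listing image
                },
            }
        }
    ],
}
client.ingest.put_pipeline(id="listing-embedding-pipeline", body=pipeline_body)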

The data query workflow consists of the following steps:

OfferUp users perform both text and image searches, such as “gray faux leather sofa” or “running shoes”.
The search microservice captures the query and forwards it to the Amazon OpenSearch Service domain, which invokes a neural search pipeline. The neural search pipeline forwards each search request to the same Amazon Titan Multimodal Embeddings model to convert the text and images into multi-dimensional vector embeddings.
OpenSearch Service then uses the vectors to find the k-nearest neighbors (KNN) to the vectorized search term and image to retrieve the relevant listings.

After extensive A/B testing with various k values, OfferUp found that a k value of 128 delivers the best search results while optimizing compute resources.
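For reference, a multimodal neural query with this k value might look like the following; the index, field, and model ID names are placeholders:

query_body = {
    "size": 10,
    "query": {
        "neural": {
            "vector_embedding": {
                "query_text": "gray faux leather sofa",
                "query_image": "<base64-encoded image>",       # optional image input
                "model_id": "<bedrock-connector-model-id>",    # placeholder connector model ID
                "k": 128,
            }
        }
    },
}
results = client.search(index="listings", body=query_body)  # client: OpenSearch client as in the earlier sketches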
OfferUp multimodal search migration path
OfferUp adopted a three-step process to implement multimodal search functionality into their foundational search architecture.

Identify the Designated Market Areas (DMAs) – OfferUp categorizes its DMAs into high density and low density. High DMA density represents geographic locations with a higher user concentration, whereas low DMA density refers to locations with fewer users. OfferUp initially identified three business-critical high-density locations where multimodal search solutions demonstrated promising results in offline experiments, making them ideal candidates for multimodal search.
Set up Infrastructure and necessary configurations – This includes the following

OpenSearch Service: The OpenSearch domain is deployed across three Availability Zones (AZs) to provide high availability. The cluster comprises three cluster manager nodes (m6g.xlarge.search instances) dedicated to managing cluster operations. For data handling, 24 data nodes (r6gd.2xlarge.search instances) are used, optimized for both storage and processing. The index is configured with 12 shards and three read replicas to enhance read performance. Each shard consumes around 11.6 GB of memory.
Embeddings model: The infrastructure enables access to Amazon Titan Multimodal Embeddings G1 in Amazon Bedrock.

Use backfilling – Backfilling converts an image of every active listing into vectors using Amazon Titan Multimodal Embeddings and stores them in OpenSearch Service. In the first phase, OfferUp backfilled 12 million active listings. OfferUp then rolled out multimodal search experiments in these three DMAs, where input token size could vary between 3 and 15.

Benefits of multimodal search
In this section, we discuss the benefits of multimodal search.
Business metrics
OfferUp evaluated the impact of multimodal search through A/B testing to manage traffic control and user experiment variations. In this experiment, the control group used the existing keyword-based search, and the variant group experienced the new multimodal search functionality. The test included a substantial user base, allowing for a robust comparison.

The results of the multimodal search implementation were compelling.
User engagement increased by 2.2%, and EWSR saw a 3.8% improvement, highlighting enhanced relevance in search outcomes
Search depth grew by 6.5%, as users explored results more thoroughly, indicating improved relevance beyond the top search items
Importantly, the need for fanout searches (broader search queries) decreased by 54.2%, showing that more users found relevant local results quickly
Ad impressions also rose by 0.91%, sustaining ad visibility while enhancing search performance

Technical metrics
OfferUp conducted additional experiments to assess technical metrics, utilizing 6 months of production system data to examine relevance recall with a focus on the top k=10 most relevant results within high-density and low-density DMAs. By segmenting these locations, OfferUp gained insights into how variations in user distribution across different market densities affect system performance, allowing for a deeper understanding of relevance recall efficiency in diverse markets.
Relevance recall (RR) = sum(listing relevance scores) / number of retrieved listings
Listing relevance is labeled as 1 or 0, based on the query's correlation with the retrieved listing.

1: Listing is relevant
0: Listing is not relevant
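A minimal computation of the metric defined above over the top k retrieved listings:

def relevance_recall(relevance_labels):
    """relevance_labels: 1 if a retrieved listing is relevant, 0 otherwise (top-k results)."""
    if not relevance_labels:
        return 0.0
    return sum(relevance_labels) / len(relevance_labels)

# Example: 7 of the top 10 retrieved listings are relevant, so RR = 0.7
print(relevance_recall([1, 1, 0, 1, 1, 1, 0, 1, 0, 1]))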

Conclusion
In this post, we demonstrated how OfferUp transformed its foundational search architecture using Amazon Titan Multimodal Embeddings and OpenSearch Service, significantly increasing user engagement, improving search quality and offering users the ability to search with both text and images. OfferUp selected Amazon Titan Multimodal Embeddings and Amazon OpenSearch Service for their fully managed capabilities, enabling the development of a robust multimodal search solution with high accuracy and a faster time to market for search and recommendation use cases.
We are excited to share these insights with the broader community and support organizations embarking on their own multimodal search journeys or seeking to improve search precision. Based on our experience, we highly recommend using Amazon Bedrock and Amazon OpenSearch services to achieve similar outcomes.
In the next part of the series, we discuss how to build a multimodal search solution using an Amazon SageMaker Jupyter notebook, the Amazon Titan Multimodal Embeddings model, and OpenSearch Service.

About the authors
Purna Sanyal is GenAI Specialist Solution Architect at AWS, helping customers to solve their business problems with successful adoption of cloud native architecture and digital transformation. He has specialization in data strategy, machine learning and Generative AI. He is passionate about building large-scale ML systems that can serve global users with optimal performance.
Andrés Vélez Echeveri is a Staff Data Scientist and Machine Learning Engineer at OfferUp, focused on enhancing the search experience by optimizing retrieval and ranking components within a recommendation system. He has a specialization in machine learning and generative AI. He is passionate about creating scalable AI systems that drive innovation and user impact.
Sean Azlin is a Principal Software Development Engineer at OfferUp, focused on leveraging technology to accelerate innovation, decrease time-to-market, and empower others to succeed and thrive. He is highly experienced in building cloud-native distributed systems at any scale. He is particularly passionate about GenAI and its many potential applications.

Enhancing LLM Capabilities with NeMo Guardrails on Amazon SageMaker JumpStart

As large language models (LLMs) become increasingly integrated into customer-facing applications, organizations are exploring ways to leverage their natural language processing capabilities. Many businesses are investigating how AI can enhance customer engagement and service delivery, and they are facing challenges in making sure LLM-driven engagements stay on topic and follow the desired instructions.
In this blog post, we explore a real-world scenario where a fictional retail store, AnyCompany Pet Supplies, leverages LLMs to enhance their customer experience. Specifically, this post will cover:

What NeMo Guardrails is. We will provide a brief introduction to guardrails and the NeMo Guardrails framework for managing LLM interactions.
Integrating with Amazon SageMaker JumpStart to utilize the latest large language models with managed solutions.
Creating an AI Assistant capable of understanding customer inquiries, providing contextually aware responses, and steering conversations as needed.
Implementing Sophisticated Conversation Flows using variables and branching flows to react to the conversation content, ask for clarifications, provide details, and guide the conversation based on user intent.
Incorporating your Data into the Conversation to provide factual, grounded responses aligned with your use case goals using retrieval augmented generation or by invoking functions as tools.

Through this practical example, we’ll illustrate how startups can harness the power of LLMs to enhance customer experiences and how the simplicity of NeMo Guardrails helps guide LLM-driven conversations toward the desired outcomes.
Note: For any considerations of adopting this architecture in a production setting, it is imperative to consult your company's specific security policies and requirements. Each production environment demands a uniquely tailored security architecture that comprehensively addresses its particular risks and regulatory standards. Some links for security best practices are shared below, but we strongly recommend reaching out to your account team for detailed guidance and to discuss the appropriate security architecture needed for a secure and compliant deployment.
What is NeMo Guardrails?
First, let’s try to understand what guardrails are and why we need them. Guardrails (or “rails” for short) in LLM applications function much like the rails on a hiking trail: they guide you through the terrain, keeping you on the intended path. These mechanisms help ensure that the LLM’s responses stay within the desired boundaries and produce answers from a set of pre-approved statements.
NeMo Guardrails, developed by NVIDIA, is an open-source solution for building conversational AI products. It allows developers to define and constrain the topics the AI agent will engage with, the possible responses it can provide, and how the agent interacts with various tools at its disposal.
The architecture consists of five processing steps, each with its own set of controls, referred to as “rails” in the framework. Each rail defines the allowed outcomes (see Diagram 1):

Input and Output Rails: These identify the topic and provide a blocking mechanism for what the AI can discuss.
Retrieval and Execution Rails: These govern how the AI interacts with external tools and data sources.
Dialog Rails: These maintain the conversational flow as defined by the developer.

For a retail chatbot like AnyCompany Pet Supplies’ AI assistant, guardrails help make sure that the AI collects the information needed to serve the customer, provides accurate product information, maintains a consistent brand voice, and integrates with the surrounding services to perform actions on behalf of the user.

Diagram 1: The architecture of NeMo Guardrails, showing how interactions, rails and integrations are structured.

Within each rail, NeMo can understand user intent, invoke integrations when necessary, select the most appropriate response based on the intent and conversation history and generate a constrained message as a reply (see Diagram 2).

Diagram 2: The flow from input forms to the final output, including how integrations and AI services are utilized.

An Introduction to Colang
Creating a conversational AI that’s smart, engaging, and operates with your use case goals in mind can be challenging. This is where NeMo Guardrails comes in. NeMo Guardrails is a toolset designed to create robust conversational agents, utilizing Colang, a modelling language specifically tailored for defining dialogue flows and guardrails. Let’s delve into how NeMo Guardrails’ own language can enhance your AI’s performance and provide a guided and seamless user experience.
Colang is purpose-built for simplicity and flexibility, featuring fewer constructs than typical programming languages, yet offering remarkable versatility. It leverages natural language constructs to describe dialogue interactions, making it intuitive for developers and simple to maintain.
Let’s delve into a basic Colang script to see how it works:

define user express greeting
  "hello"
  "hi"
  "what's up?"

define bot express greeting
  "Hey there!"

define bot ask how are you
  "How are you doing?"

define flow greeting
  user express greeting
  bot express greeting
  bot ask how are you

In this script, we see the three fundamental types of blocks in Colang:

User Message Blocks (define user …): These define possible user inputs.
Bot Message Blocks (define bot …): These specify the bot’s responses.
Flow Blocks (define flow …): These describe the sequence of interactions.

In the example above, we defined a simple dialogue flow where a user greets the bot, and the bot responds with a welcoming message and a follow-up question. This straightforward approach allows developers to construct intricate conversational pathways that use the given examples to route the conversation toward the desired responses.
Integrating Llama 3.1 and NeMo Guardrails on SageMaker JumpStart
For this post, we’ll use the Llama 3.1 8B Instruct model from Meta, a recent model that strikes an excellent balance between size, inference cost, and conversational capabilities. We will launch it via Amazon SageMaker JumpStart, which provides access to numerous foundation models from providers such as Meta, Cohere, Hugging Face, Anthropic, and more.
By leveraging SageMaker JumpStart, you can quickly evaluate and select suitable foundation models based on quality, alignment and reasoning metrics. The selected models can then be further fine-tuned on your data to better match your specific use case needs. On top of ample model choice, the additional benefit is that it enables your data to remain within your Amazon VPC during both inference and fine-tuning.
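As a sketch, deploying the model programmatically looks roughly like this; the model ID string and instance type are assumptions, so check the SageMaker JumpStart catalog for the exact identifiers available in your Region:

from sagemaker.jumpstart.model import JumpStartModel

llm_model = JumpStartModel(model_id="meta-textgeneration-llama-3-1-8b-instruct")  # assumed JumpStart model ID
llm_predictor = llm_model.deploy(
    instance_type="ml.g5.2xlarge",  # illustrative GPU instance
    accept_eula=True,               # Meta models require accepting the EULA
)

The llm_predictor handle is what the adapter below uses to resolve the endpoint name and Region.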
When integrating models from SageMaker JumpStart with NeMo Guardrails, the direct interaction with the SageMaker inference API requires some customization, which we will explore below.
Creating an Adapter for NeMo Guardrails
To verify compatibility, we need to create an adapter to make sure that requests and responses match the format expected by NeMo Guardrails. Although NeMo Guardrails provides a SagemakerEndpoint wrapper class, it requires some customization to handle the Llama 3.1 model API exposed by SageMaker JumpStart properly.
Below, you will find an implementation of a NeMo-compatible class that arranges the parameters required to call our SageMaker endpoint:

class ContentHandler(LLMContentHandler):
    content_type = 'application/json'
    accepts = 'application/json'

    def transform_input(self, prompt: str, model_kwargs=None):
        if model_kwargs is None:
            model_kwargs = {}

        # Ensure the 'stop' parameter is set
        model_kwargs.setdefault('stop', ['<|eot_id|>'])

        input_data = {
            'inputs': prompt,
            'parameters': model_kwargs
        }
        return json.dumps(input_data).encode("utf-8")

    def transform_output(self, output):
        output_data = json.loads(output.read().decode("utf-8"))
        return output_data.get("generated_text", f"Error: {output_data.get('error', 'Unknown error')}")

class CustomSagemakerEndpoint(SagemakerEndpoint):
    content_handler = ContentHandler()
    endpoint_name = llm_predictor.endpoint_name
    region_name = llm_predictor.sagemaker_session.boto_region_name

Structuring the Prompt for Llama 3.1
The Llama 3.1 model from Meta requires prompts to follow a specific structure, including special tokens like </s> and {role} to define parts of the conversation. When invoking the model through NeMo Guardrails, you must make sure that the prompts are formatted correctly.
To achieve seamless integration, you can modify the prompt.yaml file. Here’s an example:

prompts:
  - prompt: "<|endoftext|>{role}: {user_message}<|endoftext|>{role}: {assistant_response}"

For more details on formatting input text for Llama models, you can explore these resources:

Meta Llama 3.1 models are now available in Amazon Bedrock
Meta Llama 3.1 models are now available in Amazon SageMaker JumpStart

Creating an AI Assistant
In our task to create an intelligent and responsible AI assistant for AnyCompany Pet Supplies, we’re leveraging NeMo Guardrails to build a conversational AI chatbot that can understand customer needs, provide product recommendations, and guide users through the purchase process. Here’s how we implement this.
At the heart of NeMo Guardrails are two key concepts: flows and intents. These work together to create a structured, responsive, and context-aware conversational AI.
Flows in NeMo Guardrails
Flows define the conversation structure and guide the AI’s responses. They are sequences of actions that the AI should follow in specific scenarios. For example:

define flow unrelated question
  user ask unrelated question
  bot refuse unrelated question

define flow help user with pets
  user ask about pets
  bot answer question

These flows outline how the AI should respond in different situations. When a user asks about pets, the chatbot will provide an answer. When faced with an unrelated question, it will politely refuse to answer.
Intent Capturing and Flow Selection
The process of choosing which flow to follow begins with capturing the user intent. NeMo Guardrails uses a multi-faceted approach to understand user intent:

Pattern Matching: The system first looks for predefined patterns that correspond to specific intents:

define user ask about pets
  "What are the best shampoos for dogs?"
  "How frequently should I take my puppy on walks?"
  "How do I train my cat?"

define user ask unrelated question
  "Can you help me with my travel plans?"
  "What's the weather like today?"
  "Can you provide me with some investment advice?"

Dynamic Intent Recognition: After selecting the most likely candidates, NeMo uses a sophisticated intent recognition system defined in the prompts.yml file to narrow down the intent:

- task: generate_user_intent
  content: |-
    <|begin_of_text|><|start_header_id|>system<|end_header_id|>
    You are a helpful AI assistant specialised in pet products.<|eot_id|>
    <|start_header_id|>user<|end_header_id|>
    """
    {{ general_instructions }}
    """

    # This is an example how a conversation between a user and the bot can go:
    {{ sample_conversation }}
    # The previous conversation was just an example

    # This is the current conversation between the user and the bot:
    Assistant: Hello
    {{ history | user_assistant_sequence }}

    # Choose the user current intent from this list: {{ potential_user_intents }}

    Ignore the user question. The assistant task is to choose an intent from this list. The last messages are more important for defining the current intent.
    Write only one of: {{ potential_user_intents }}
    <|eot_id|><|start_header_id|>assistant<|end_header_id|>
  stop:
    - "<|eot_id|>"

This prompt is designed to guide the chatbot in determining the user’s intent. Let’s break it down:

Context Setting: The prompt begins by defining the AI’s role as a pet product specialist. This focuses the chatbot’s attention on pet-related queries.
General Instructions: The {{ general_instructions }} variable contains overall guidelines for the chatbot’s behavior, as defined in our config.yml.
Example Conversation: The {{ sample_conversation }} provides a model of how interactions should flow, giving the chatbot context for understanding user intents.
Current Conversation: The {{ history | user_assistant_sequence }} variable includes the actual conversation history, allowing the chatbot to consider the context of the current interaction.
Intent Selection: The chatbot is instructed to choose from a predefined list of intents {{ potential_user_intents }}. This constrains the chatbot to a set of known intents, ensuring consistency and predictability in intent recognition.
Recency Bias: The prompt specifically mentions that “the last messages are more important for defining the current intent.” This instructs the chatbot to prioritize recent context, which is often most relevant to the current intent.
Single Intent Output: The chatbot is instructed to “Write only one of: {{ potential_user_intents }}”. This provides a clear, unambiguous intent selection.

In Practice:
Here’s how this process works in practice (see Diagram 3):

When a user sends a message, NeMo Guardrails initiates the intent recognition task.
The chatbot reviews the conversation history, focusing on the most recent messages.
It matches the user’s input against a list of predefined intents.
The chatbot selects the most suitable intent based on this analysis.
The identified intent determines the corresponding flow to guide the conversation.

Diagram 3: Two example conversation flows, one denied by the input rails, one allowed to the dialog rail where the LLM picks up the conversation.

For example, if a user asks, “What’s the best food for a kitten?”, the chatbot might classify this as a “product_inquiry” intent. This intent would then activate a flow designed to recommend pet food products.
While this structured approach to intent recognition makes sure that the chatbot’s responses are focused and relevant to the user’s needs, it may introduce latency due to the need to process and analyze conversation history and intent in real-time. Each step, from intent recognition to flow selection, involves computational processing, which can impact the response time, especially in more complex interactions. Finding the right balance between flexibility, control, and real-time processing is crucial for creating an effective and reliable conversational AI system.
Implementing Sophisticated Conversation Flows
In our earlier discussion about Colang, we examined its core structure and its role in crafting conversational flows. Now, we will delve into one of Colang’s standout features: the ability to utilize variables to capture and process user input. This functionality enables us to construct conversational agents that are not only more dynamic but also highly responsive, tailoring their interactions based on precise user data.
Continuing with our practical example of developing a pet store assistant chatbot:

define flow answer about pet products
  user express pet products needs

  $pet_type = ...

  if $pet_type == "not available"
    bot need clarification
  else
    if $pet_type == "dog"
      bot say "Depending on your dog's coat type, different grooming tools like deshedding brushes or mitts might be useful."
    else if $pet_type == "bird"
      bot say "For birds, it's crucial to use non-toxic cleaning products and sprays to maintain healthy feathers. I recommend looking for products specifically labeled for avian use."
    else
      bot say "For cats, especially those with long hair, a good brushing routine with a wire or bristle brush can help prevent mats and keep their coat healthy."

In the provided example above, we encounter the line:
$pet_type = ...
The ellipsis (...) serves as a placeholder in Colang, signaling where data extraction or inference is to be performed. This notation does not represent executable code but rather suggests that some form of logic or natural language processing should be applied at this stage.
More specifically, the use of an ellipsis here implies that the system is expected to:

Analyze the user’s input previously captured under “user express pet products needs.”
Determine or infer the type of pet being discussed.
Store this information in the $pet_type variable.

The comment accompanying this line sheds more light on the intended data extraction process:
# extract the specific pet type at very high level if available, like dog, cat, bird. Make sure you still class things like puppy as "dog", kitty as "cat", etc. if available or "not available" if none apply
This directive indicates that the extraction should:

Recognize the pet type at a high level (dog, cat, bird).
Classify common variations (e.g., “puppy” as “dog”).
Default to “not available” if no clear pet type is identified.

Returning to our initial code snippet, we use the $pet_type variable to customize responses, enabling the bot to offer specific advice based on whether the user has a dog, bird, or cat.
Next, we will expand on this example to integrate a Retrieval Augmented Generation (RAG) workflow, enhancing our assistant’s capabilities to recommend specific products tailored to the user’s inputs.
Bring Your Data into the Conversation
Incorporating advanced AI capabilities using a model like the Llama 3.1 8B instruct model requires more than just managing the tone and flow of conversations; it necessitates controlling the data the model accesses to respond to user queries. A common technique to achieve this is Retrieval Augmented Generation (RAG). This method involves searching a semantic database for content relevant to a user’s request and incorporating those findings into the model’s response context.
The typical approach uses an embedding model, which converts a sentence into a semantic numeric representation—referred to as a vector. These vectors are then stored in a vector database designed to efficiently search and retrieve closely related semantic information. For more information on this topic, please refer to Getting started with Amazon Titan Text Embeddings in Amazon Bedrock.
NeMo Guardrails simplifies this process: developers can store relevant content in a designated ‘kb’ folder. NeMo automatically reads this data, applies its internal embedding model and stores the vectors in an “Annoy” index, which functions as an in-memory vector database. However, this method might not scale well for extensive data sets typical in e-commerce environments. To address scalability, here are two solutions:

Custom Adapter or Connector: Implement your own extension of the EmbeddingsIndex base class. This allows you to customize storage, search, and data embedding processes according to your specific requirements, whether local or remote. This integration makes sure that relevant information remains in the conversational context throughout the user interaction, though it does not allow for precise control over when or how the information is used. For example:

class CustomVectorDatabaseConnector(EmbeddingsIndex):
    @property
    def embedding_size(self):
        return 768

    async def add_item(self, item: IndexItem):
        """Adds a new item to the index."""
        pass  # Implementation needed

    async def add_items(self, items: List[IndexItem]):
        """Adds multiple items to the index."""
        pass  # Implementation needed

    async def search(self, text: str, max_results: int) -> List[IndexItem]:
        """
        Searches for items in the index that are most similar to the provided text.
        """
        pass  # Implementation needed

Retrieval Augmented Generation via Function Call: Define a function that handles the retrieval process using your preferred provider and technique. This function can directly update the conversational context with relevant data, ensuring that the AI can consider this information in its responses. For example:

async def rag(context: dict, llm: BaseLLM, search: str) -> ActionResult:
    """Retrieval component of a retrieval augmented generation function.
    You can directly manipulate NeMo's in-context knowledge here using:
    context_updates = {"relevant_chunks": json.dumps(data)}
    """
    return "true"

In the conversation rail’s flow, use variables and function calls to precisely manage searches and the integration of results:

# describe the characteristics of a product that would satisfy the user
$product_characteristics = ...
$is_success = execute rag(search=$product_characteristics)
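For the execute rag(...) call above to resolve to the Python function, the function typically needs to be registered as an action with the NeMo Guardrails runtime. A minimal sketch, assuming a Colang/YAML configuration under ./config and the custom SageMaker LLM wrapper defined earlier (both paths and variable names are placeholders):

from nemoguardrails import LLMRails, RailsConfig

config = RailsConfig.from_path("./config")   # folder containing the Colang and YAML files
rails = LLMRails(config, llm=custom_llm)     # custom_llm: instance of the SageMaker endpoint wrapper
rails.register_action(rag, name="rag")       # makes `execute rag(...)` available in flows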

These methods offer different levels of flexibility and control, making them suitable for various applications depending on the complexity of your system. In the next section, we will see how these techniques are applied in a more complex scenario to further enhance the capabilities of our AI assistant.
Complete Example with Variables, Retrievers and Conversation Flows
Scenario Overview
Let’s explore a complex implementation scenario with NeMo Guardrails interacting with multiple tools to drive specific business outcomes. We’ll keep the focus on the pet store e-commerce site that is being upgraded with a conversational sales agent. This agent is integrated directly into the search field at the top of the page. For instance, when a user searches for “double coat shampoo,” the results page displays several products and a chat window automatically engages the user by processing the search terms.
Detailed Conversation Flow
As the user interaction begins, the AI processes the input from the search field:

define flow answer about pet products
  user express pet products needs
  # extract the specific pet type at very high level if available, like dog, cat, bird if available or "pet" if none apply
  $pet_type = ...
  # extract the specific breed of the pet if available or "not available" if none apply
  $pet_breed = ...
  if $pet_breed == "not available"
    bot ask informations about the pet breed

Output: “Would you be able to share the type of your dog breed?”
This initiates the engine’s recognition of the user’s intent to inquire about pet products. Here, the chatbot uses variables to try and extract the type and breed of the pet. If the breed isn’t immediately available from the input, the bot requests further clarification.
Retrieval and Response Generation
If the user responds with the breed (e.g., “It’s a Labradoodle”), the chatbot proceeds to tailor its search for relevant products:

  else
    # describe the characteristics of a product that would satisfy the user
    $product_characteristics = ...
    # call our previously defined retrieval function
    $results = execute rag(search=$product_characteristics)
    # write a text message describing in a list all the products in the additional context, writing their name and their ASIN number
    # and any features that relate to the user needs, offering to put one in the cart, in a single paragraph
    $product_message = ...
    bot $product_message

Output: We found several shampoos for Labradoodles: [Product List]. Would you like to add any of these to your cart? 
The chatbot uses the extracted variables to refine product search criteria, then retrieves relevant items using an embedded retrieval function. It formats this information into a user-friendly message, listing available products and offering further actions.
Advancing the Sale
If the user expresses a desire to purchase a product (“I’d like to buy the second option from the list”), the chatbot transitions to processing the order:

define flow user wants product
  user express intent to buy
  # the value must be the user city of residence or "unknown"
  $user_city = ...
  # the value must be the user full street address or "unknown"
  $user_address_street = ...
  if $user_city == "unknown"
    if $user_address_street == "unknown"
      bot ask address

Output: “Great choice! To finalize your order, could you please provide your full shipping address?” 
At this point, we wouldn’t have the shipping information, so the bot asks for it. However, if this were a known customer, the data could be injected into the conversation from other sources. For example, if the user is authenticated and has made previous orders, their shipping address can be retrieved from the user profile database and automatically populated within the conversation flow. The model would then just ask for confirmation of the purchase, skipping the request for shipping information.
Completing the Sale
Once our variables are filled and we have enough information to process the order, we can transition the conversation naturally into a sales motion and have the bot finalize the order:

  else
    # the value must be the ASIN of the product the user intends to buy
    $product_asin = ...
    $cart = execute add_order(city=$user_city, street=$user_address_street, product=$product_asin)
    bot $cart

Output: “Success” 
In this example, we’ve implemented a mock function called add_order to simulate a backend service call. This function verifies the address and places the chosen product into the user’s session cart. You can capture the return string from this function on the client side and take further action, for instance, if it indicates ‘Success,’ you can then run some JavaScript to display the filled cart to the user. This will show the cart with the item, pre-entered shipping details and a ready checkout button within the user interface, closing the sales loop experience for the user and tying together the conversational interface with the shopping cart and purchasing flow.
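A hypothetical sketch of such a mock action (the session cart and the validation logic are placeholders, not the actual backend integration):

session_cart = []  # stand-in for the user's session cart

async def add_order(city: str, street: str, product: str) -> str:
    """Mock backend call: validate the address and place the product in the session cart."""
    if city == "unknown" or street == "unknown":
        return "Missing shipping information"
    session_cart.append({"asin": product, "ship_to": f"{street}, {city}"})
    return "Success"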
Maintaining Conversation Integrity
During this interaction, the NeMo Guardrails framework keeps the conversation within the boundaries set by the Colang configuration. For example, if the user deviates with a question such as ‘What’s the weather like today?’, NeMo Guardrails classifies this as part of a refusal flow, outside the relevant topics of ordering pet supplies. It then tactfully declines to address the unrelated query and steers the discussion back towards selecting and ordering products, replying with a standard response like, ‘I’m afraid I can’t help with weather information, but let’s continue with your pet supplies order.’, as defined in Colang.
Clean Up
When using Amazon SageMaker JumpStart, you deploy the selected models on on-demand GPU instances managed by Amazon SageMaker. These instances are billed per second, so it’s important to optimize your costs by turning off the endpoints when they are not needed.
To clean up your resources, please ensure that you run the clean up cells in the three notebooks that you used. Make sure you delete the appropriate model and endpoints by executing similar cells:

llm_model.delete_model()
llm_predictor.delete_predictor()

Please note that in the third notebook, you additionally need to delete the embedding endpoints:

embedding_model.delete_model()
embedding_predictor.delete_predictor()

Additionally, you can make sure that you have deleted the appropriate resources manually by completing the following steps:

Delete the model artifacts:

On the Amazon SageMaker console, choose Models under Inference in the navigation pane.
Please ensure you do not have llm-model and embedding-model artifacts.
To delete these artifacts, choose the appropriate models and click Delete under Actions dropdown menu.

Delete endpoint configurations:

On the Amazon SageMaker console, choose Endpoint configuration under Inference in the navigation pane.
Please ensure you do not have llm-model and embedding-model endpoint configuration.
To delete these configurations, choose the appropriate endpoint configurations and click Delete under Actions dropdown menu.

Delete the endpoints:

On the Amazon SageMaker console, choose Endpoints under Inference in the navigation pane.
Please ensure you do not have llm-model and embedding-model endpoints running.
To delete these endpoints, choose the appropriate model endpoint names and click Delete under Actions dropdown menu.

Best Practices and Considerations
When integrating NeMo Guardrails with SageMaker JumpStart, it’s important to consider AI governance frameworks and security best practices to ensure responsible AI deployment. While this blog focuses on showcasing the core functionality and capabilities of NeMo Guardrails, security aspects are beyond its scope.
For further guidance, please explore:

Amazon SageMaker Best Practices
NVIDIA NeMo Guardrails Documentation
Building AI Responsibly on AWS

Conclusion
Integrating NeMo Guardrails with Large Language Models (LLMs) is a powerful step forward in deploying AI in customer-facing applications. The example of AnyCompany Pet Supplies illustrates how these technologies can enhance customer interactions while handling refusals and guiding the conversation toward the intended outcomes. Looking forward, maintaining this balance of innovation and responsibility will be key to realizing the full potential of AI in various industries. This journey towards ethical AI deployment is crucial for building sustainable, trust-based relationships with customers and shaping a future where technology aligns seamlessly with human values.
Next Steps
You can find the examples used within this article via this link.
We encourage you to explore and implement NeMo Guardrails to enhance your own conversational AI solutions. By leveraging the guardrails and techniques demonstrated in this post, you can quickly constrain LLMs to drive tailored and effective results for your use case.

About the Authors
Georgi Botsihhin is a Startup Solutions Architect at Amazon Web Services (AWS), based in the United Kingdom. He helps customers design and optimize applications on AWS, with a strong interest in AI/ML technology. Georgi is part of the Machine Learning Technical Field Community (TFC) at AWS. In his free time, he enjoys staying active through sports and taking long walks with his dog.
Lorenzo Boccaccia is a Startup Solutions Architect at Amazon Web Services (AWS), based in Spain. He helps startups create cost-effective, scalable solutions for their workloads running on AWS, with a focus on containers and EKS. Lorenzo is passionate about generative AI and is a certified AWS Solutions Architect Professional, a Machine Learning Specialist, and part of the Containers TFC. In his free time, he can be found online taking part in sim racing leagues.

Fine-Tuning Llama 3.2 3B Instruct for Python Code: A Comprehensive Gui …

In this tutorial, we’ll walk through how to set up and perform fine-tuning on the Llama 3.2 3B Instruct model using a specially curated Python code dataset. By the end of this guide, you’ll have a better understanding of how to customize large language models for code-related tasks and practical insight into the tools and configurations needed to leverage Unsloth for fine-tuning.

Installing Required Dependencies

!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install "git+https://github.com/huggingface/transformers.git"
!pip install -U trl
!pip install --no-deps trl peft accelerate bitsandbytes
!pip install torch torchvision torchaudio triton
!pip install xformers
!python -m xformers.info
!python -m bitsandbytes

These commands install and update all the necessary libraries—such as Unsloth, Transformers, and xFormers—needed for fine-tuning the Llama 3.2 3B Instruct model on Python code. Finally, we run diagnostic commands to verify the successful installation of xFormers and BitsAndBytes.

Essential Imports

from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
import torch
from datasets import load_dataset

We import classes and functions from Unsloth, TRL, and Transformers for model training and fine-tuning. Also, we load a Python code dataset with Hugging Face’s `load_dataset` to prepare training samples.

Loading the Python Code Dataset

max_seq_length = 2048
dataset = load_dataset("user/Llama-3.2-Python-Alpaca-143k", split="train") #Save the dataset on your user profile on HF, then load the dataset on your user id

We set the sequence length to 2048 tokens for the fine-tuned model and load a custom Python code dataset from Hugging Face. Ensure you have the dataset stored under your username for proper access.

Initializing the Llama 3.2 3B Model

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-3B-Instruct-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = None,
    load_in_4bit = True
)

We load the Llama 3.2 3B Instruct model in 4-bit format using the Unsloth library, which reduces memory usage. To handle longer text inputs, we also set the maximum sequence length to 2048.

Configuring LoRA with Unsloth

model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
    max_seq_length = max_seq_length
)

We apply LoRA (Low-Rank Adaptation) to our 4-bit loaded model, specifying the rank (r), alpha (lora_alpha), and dropout settings. The use_gradient_checkpointing = “unsloth” enables more efficient memory usage and allows training with longer context lengths. Additional LoRA options like use_rslora and loftq_config are available for more advanced fine-tuning techniques but are disabled here for simplicity. Finally, we set the maximum sequence length to match our earlier configuration.

Mounting Google Drive

from google.colab import drive
drive.mount("/content/drive")

We import the Google Colab drive module and mount Google Drive at /content/drive so that training outputs can be saved there from within the Colab environment.

Setting Up and Running the Training Loop

trainer = SFTTrainer(
    model = model,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    tokenizer = tokenizer,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 10,
        # num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 60,
        learning_rate = 2e-4,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "/content/drive/My Drive/Llama-3.2-3B-Instruct-bnb-4bit"
    ),
)

trainer.train()

We create an instance of SFTTrainer with our loaded model, tokenizer, and Python code dataset, specifying the text field for training. The TrainingArguments define key hyperparameters such as batch size, learning rate, maximum training steps, and hardware-specific settings like fp16 or bf16. In this example, we set the output directory to Google Drive to conveniently store checkpoints and logs. Finally, we invoke the trainer.train() method to begin the fine-tuning process.

Saving the Fine-Tuned Model

model.save_pretrained("lora_model") # Local saving
tokenizer.save_pretrained("lora_model")

We save the LoRA-trained model and its tokenizer to a local folder named lora_model. This allows you to load and use the fine-tuned model later without repeating the training process.
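As a quick illustration of that reuse, the following sketch reloads the saved adapters with Unsloth and runs a short generation. The local folder name matches the save call above, while the prompt and generation settings are arbitrary assumptions for this example.

from unsloth import FastLanguageModel

# Reload the LoRA adapters saved above from the local "lora_model" folder.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "lora_model",
    max_seq_length = 2048,
    dtype = None,
    load_in_4bit = True,
)
FastLanguageModel.for_inference(model)  # switch Unsloth into faster inference mode

inputs = tokenizer("Write a Python function that reverses a string.", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))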

In conclusion, throughout this tutorial, we demonstrated how to fine-tune the Llama 3.2 3B Instruct model on a Python code dataset using the Unsloth library, LoRA, and efficient 4-bit quantization. By leveraging the provided scripts, you can train a smaller, memory-efficient model that excels at both generating and understanding Python code. In the process, we showcased the integration of Unsloth for optimized memory usage, LoRA for flexible model adaptation, and Hugging Face tools for dataset handling and training. This setup enables you to build and customize language models tailored to specific code-related tasks, improving accuracy and resource efficiency.

Download the Colab Notebook here. All credit for this research goes to the researchers of this project.


Deep Agent Released R1-V: Reinforcing Super Generalization in Vision-L …

Vision-language models (VLMs) face a critical challenge in achieving robust generalization beyond their training data while remaining computationally efficient and cost-effective. Approaches such as chain-of-thought supervised fine-tuning (CoT-SFT) often lead to overfitting, where models perform well on seen data but struggle with new, unseen scenarios. This limitation reduces their effectiveness in applications that demand adaptability, such as autonomous systems, medical imaging, and visual reasoning tasks. Also, the prevailing assumption is that increasing model size is the key to improved performance. The need for a more efficient training paradigm that enhances generalization, minimizes overfitting, and reduces computational costs has become crucial for advancing VLMs.

Deep Agent released R1-V to resolve some of the above concerns. This novel reinforcement learning approach enhances the generalization ability of VLMs while being cost-effective. This approach demonstrates how reinforcement learning with verifiable rewards (RLVR) can outperform traditional CoT-SFT in effectiveness and robustness when dealing with out-of-distribution (OOD) data.

The main objective of the R1-V approach is to enhance VLMs’ ability to generalize beyond their training datasets. R1-V tackles this issue by employing reinforcement learning techniques that guide the model to learn generalizable skills rather than memorizing training examples. In particular, it focuses on teaching VLMs to develop robust visual counting abilities, an essential skill in many AI applications, including image recognition, autonomous systems, and visual reasoning.
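To make the contrast with CoT-SFT concrete, the following minimal sketch (our illustration, not the released R1-V code) shows what a verifiable reward for visual counting can look like: the rollout is scored on whether its final answer matches the ground-truth count, rather than on similarity to a reference chain of thought.

import re

def verifiable_counting_reward(model_output: str, ground_truth_count: int) -> float:
    """Return 1.0 if the predicted count matches the label, else 0.0.

    Illustrative RLVR-style reward: it checks the final answer instead of
    scoring how closely the output imitates a reference rationale.
    """
    numbers = re.findall(r"-?\d+", model_output)
    if not numbers:
        return 0.0
    predicted = int(numbers[-1])  # treat the last number as the model's answer
    return 1.0 if predicted == ground_truth_count else 0.0

# Example: a rollout that ends with the correct count earns full reward.
print(verifiable_counting_reward("I see 3 red cubes and 1 sphere, so 4 objects.", 4))  # 1.0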


A major highlight of R1-V is its training efficiency. Despite utilizing a relatively small model with only 2 billion parameters, R1-V performs better than a significantly larger 72 billion parameter model in OOD tests. This demonstrates that model size is not the sole determinant of performance; the training methodology and reinforcement learning strategies are crucial in enhancing a model’s capabilities.

R1-V was trained on eight A100 GPUs for 30 minutes, with a total computational cost of only $2.62. This cost-effectiveness makes it an attractive alternative for researchers and developers who wish to achieve high performance without extensive computational resources. R1-V also stands out due to its reliance on a curated training dataset. The model was trained using CLEVR-70k and R1-Distilled Visual Reasoning datasets, specifically designed to encourage visual reasoning and robust decision-making. Using these datasets ensures that the model develops a deep understanding of visual relationships and logical reasoning rather than simply learning to recognize patterns from a given dataset.


In conclusion, the development of R1-V supports open-source AI research by making its code, model weights, datasets, and training scripts publicly available. This allows the AI research community to refine and improve vision-language modeling. R1-V’s reinforcement learning approach enables rapid learning of patterns and structures in data. It leads to high performance with minimal computational cost. This challenges the assumption that extensive training and massive datasets are necessary for state-of-the-art AI performance. Instead, efficient training methodologies can reduce computational demands while maintaining or surpassing traditional results.

Check out the GitHub Page. All credit for this research goes to the researchers of this project.


NYU Researchers Introduce WILDCHAT-50M: A Large-Scale Synthetic Datase …

Large language model (LLM) post-training focuses on refining model behavior and enhancing capabilities beyond their initial training phase. It includes supervised fine-tuning (SFT) and reinforcement learning to align models with human preferences and specific task requirements. Synthetic data is crucial, allowing researchers to evaluate and optimize post-training techniques. However, open research in this domain is still in its early stages, facing data availability and scalability limitations. Without high-quality datasets, analyzing the performance of different fine-tuning strategies and assessing their effectiveness in real-world applications becomes difficult.

One of the primary challenges in this field is the scarcity of large-scale, publicly available synthetic datasets suitable for LLM post-training. Researchers must access diverse conversational datasets to conduct meaningful comparative analyses and improve alignment strategies. The lack of standardized datasets limits the ability to evaluate post-training performance across different models. Moreover, large-scale data generation costs and computational requirements are prohibitive for many academic institutions. These factors create barriers to improving model efficiency and ensuring fine-tuned LLMs generalize well across tasks and user interactions.

Existing approaches to synthetic data collection for LLM training rely on a combination of model-generated responses and benchmark datasets. Datasets, such as WildChat-1M from Allen AI and LMSys-Chat-1M, provide valuable insights into synthetic data usage. However, they are often restricted in scale and model diversity. Researchers have developed various techniques to assess synthetic data quality, including LLM judge-based evaluations and efficiency metrics for runtime and VRAM usage. Despite these efforts, the field still lacks a comprehensive and publicly accessible dataset that allows for large-scale experimentation and optimization of post-training methodologies.

Researchers from New York University (NYU) introduced WILDCHAT-50M, an extensive dataset designed to facilitate LLM post-training. The dataset builds upon the WildChat collection and expands it to include responses from over 50 open-weight models. These models range from 0.5 billion to 104 billion parameters, making WILDCHAT-50M the largest and most diverse public dataset of chat transcripts. The dataset enables a broad comparative analysis of synthetic data generation models and is a foundation for further improving post-training techniques. By making WILDCHAT-50M publicly accessible, the research team aims to bridge the gap between industry-scale post-training and academic research.

The dataset was developed by synthesizing chat transcripts from multiple models, each participating in over one million multi-turn conversations. The dataset comprises approximately 125 million chat transcripts, offering an unprecedented scale of synthetic interactions. The data collection process took place over two months using a shared research cluster of 12×8 H100 GPUs. This setup allowed researchers to optimize runtime efficiency and ensure a diverse range of responses. The dataset also served as the basis for RE-WILD, a novel supervised fine-tuning (SFT) mix that enhances LLM training efficiency. Through this approach, researchers successfully demonstrated that WILDCHAT-50M could optimize data usage while maintaining high levels of post-training performance.

The effectiveness of WILDCHAT-50M was validated through a series of rigorous benchmarks. The RE-WILD SFT approach, based on WILDCHAT-50M, outperformed the Tulu-3 SFT mixture developed by Allen AI while using only 40% of the dataset size. The evaluation included multiple performance metrics, with specific improvements in response coherence, model alignment, and benchmark accuracy. The dataset’s ability to enhance runtime efficiency was also highlighted, with throughput efficiency analyses indicating substantial improvements in token processing speed. Further, models fine-tuned using WILDCHAT-50M demonstrated significant enhancements in instruction-following capabilities and overall chat performance across various evaluation benchmarks.

This research underscores the importance of high-quality synthetic data in LLM post-training and presents WILDCHAT-50M as a valuable resource for optimizing model alignment. By providing a large-scale, publicly available dataset, the researchers have enabled further advancements in supervised fine-tuning methodologies. The comparative analyses conducted in this study offer key insights into the effectiveness of different data generation models and post-training strategies. Moving forward, the introduction of WILDCHAT-50M is expected to support a broader range of academic and industrial research efforts, ultimately contributing to developing more efficient and adaptable language models.

Check out the Paper, Dataset on Hugging Face and GitHub Page. All credit for this research goes to the researchers of this project.


Orchestrate seamless business systems integrations using Amazon Bedroc …

Generative AI has revolutionized technology by generating content and solving complex problems. To take full advantage of this potential, seamless integration with existing business systems and efficient access to data are crucial. Amazon Bedrock Agents provides the integration capabilities to connect generative AI models with the wealth of information and workflows already in place within an organization, enabling the creation of efficient and impactful generative AI applications.
Amazon Bedrock is a fully managed service that enables the development and deployment of generative AI applications using high-performance foundation models (FMs) from leading AI companies through a single API. Amazon Bedrock Agents allows you to streamline workflows and automate repetitive tasks across your company systems and data sources, while maintaining security, privacy, and responsible AI practices. Using these agents, you can enable generative AI applications to execute multiple tasks across your company systems and data sources. Businesses can now unlock the power of generative AI to automate tasks, generate content, and solve complex problems—all while maintaining connectivity to critical enterprise systems and data sources.
This post showcases how generative AI can be used to reason, apply logic, and orchestrate integrations using a fictitious business process. It demonstrates strategies and techniques for orchestrating Amazon Bedrock agents and action groups to seamlessly integrate generative AI with existing business systems, enabling efficient data access and unlocking the full potential of generative AI.
This solution also integrates with Appian Case Management Studio. Cases are a vital part of case management applications and represent a series of tasks to complete or a multi-step problem to solve. Appian Case Management Studio is an out-of-the-box suite of applications that facilitates rapid development of case management apps. The fictitious business process used in this post creates a case in Appian for further review.
Business workflow
The following workflow shows the fictitious business process.

The workflow consists of the following steps:

The user asks the generative AI assistant to determine if a device needs review.
If a device type is provided, the assistant checks if it’s a Type 3 device.
If it’s a Type 3 device, the assistant asks the user for the device name.
The assistant checks if a document exists with the provided name.
If the document exists, the assistant creates a case in Appian to start a review.
If the document doesn’t exist, the assistant sends an email for review.

Solution overview
The following diagram illustrates the architecture of the solution.

The system workflow includes the following steps:

The user interacts with the generative AI application, which connects to Amazon Bedrock Agents.
The application uses Amazon Bedrock Knowledge Bases to answer the user questions. These knowledge bases are created with Amazon Simple Storage Service (Amazon S3) as the data source and Amazon Titan (or another model of your choice) as the embedding model.
Amazon Bedrock Agents uses action groups to integrate with different systems.
The action groups call different AWS Lambda functions within private subnet of a virtual private cloud (VPC).
The agent uses a tree-of-thought (ToT) prompt to execute different actions from the action groups.
A Lambda function fetches the classification of the device from Amazon DynamoDB. The function invokes DynamoDB using a gateway endpoint (a sketch of such a handler follows this list).
A Lambda function checks if quality documents exist in Amazon S3. The function invokes Amazon S3 using interface endpoints.
A Lambda function calls the Appian REST API using a NAT gateway in a public subnet.
The Appian key is stored in AWS Secrets Manager.
A Lambda function uses AWS Identity and Access Management (IAM) permissions to make an SDK call to Amazon Simple Email Service (Amazon SES). Amazon SES sends an email using SMTP to verified emails provided by the user.
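To make step 6 more concrete, the following is a minimal sketch of the general shape a device-classification Lambda handler could take when invoked by an Amazon Bedrock Agents action group defined with an API schema. The table name, key, and attribute names (DeviceClassification, deviceType, classification) are assumptions for illustration, and the event and response shapes are simplified.

import json
import os
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table(os.environ.get("TABLE_NAME", "DeviceClassification"))  # assumed table name

def lambda_handler(event, context):
    # Pull the deviceType parameter passed by the agent's action group.
    params = {p["name"]: p["value"] for p in event.get("parameters", [])}
    device_type = params.get("deviceType", "")

    item = table.get_item(Key={"deviceType": device_type}).get("Item", {})
    classification = item.get("classification", "Unknown")

    # Simplified response body in the format expected for API-schema action groups.
    return {
        "messageVersion": "1.0",
        "response": {
            "actionGroup": event["actionGroup"],
            "apiPath": event["apiPath"],
            "httpMethod": event["httpMethod"],
            "httpStatusCode": 200,
            "responseBody": {
                "application/json": {"body": json.dumps({"classification": classification})}
            },
        },
    }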

Prerequisites
You will need the following prerequisites before you can build the solution:

A valid AWS account.
Access to Anthropic’s Claude 3 Sonnet or the model you intend to use (for more information, see Access Amazon Bedrock foundation models). For this post, we use Anthropic’s Claude 3 Sonnet, and all instructions are pertaining to that model. If you want to use another FM, update the prompts accordingly.
An IAM role in the account that has sufficient permissions to create the necessary resources.
AWS CloudTrail logging enabled for operational and risk auditing. For more details, see Creating a trail for your AWS account.
AWS Budgets policy notifications enabled to protect you from unwanted billing. For more details, see Enable Budget policy.
Two email addresses to send and receive emails. Do not use existing verified identities in Amazon SES for these email addresses. The AWS CloudFormation template will fail otherwise.

This solution is supported only in the us-east-1 AWS Region. You can make the necessary changes to the CloudFormation template to deploy to other Regions.
Create an Appian account
Depending on your needs, follow the corresponding steps to create an Appian account.
Sign up for Appian Community Edition for personal use
The Appian Community Edition provides a personal environment for learning and exploration at no additional cost. To sign up for Appian Community Edition, complete the following steps:

Visit the Appian Community Edition page.
Enter your email address and choose Submit to receive confirmation and login details.
Check your inbox for a verification email from Appian.
Choose the link in the email to validate your email address and finish setting up your account by providing your first name, last name, email, and password, then accept the terms.
Choose Register to complete the registration.
Choose the activation link and log in with your email address and password.
Complete your profile by entering information about your company, phone number, and learning interests, among other details.
Choose Access Environment.
Choose your region (USA, India, or Germany) by choosing the appropriate link.
Navigate to Appian Designer and start exploring Appian’s features and capabilities.

Purchase Appian Platform for business use
If you’re evaluating Appian for your organization, complete the following steps:

Visit the Appian Platform listing at AWS Marketplace.
Choose View purchase options.
Fill out the contract form by providing your duration, renewal settings, and contract options.
Choose Create Contract to submit your request.

An Appian representative will contact you to discuss your needs. They might provide access to a trial environment or schedule a personalized demo.

Follow the instructions provided by the Appian representative to access your account.

By following these steps, you can create an Appian account suited to your personal learning or business evaluation needs. Whether you’re exploring Appian’s platform individually or assessing it for your organization, Appian provides resources and support to help you get started.
Note the following values, which we will use in the CloudFormation template below.

AppianHostEndpoint
AppianAPIKey

Deploy the CloudFormation template
Complete the following steps to deploy the CloudFormation template:

Download the CloudFormation template.
Open the AWS CloudFormation console in the us-east-1 Region.
Choose Stacks in the navigation pane, then choose Create stack.
Upload the template and choose Next.
For Stack name, enter a name, such as QualityReviewStack.
In the Parameters section, provide the following information:

For DynamoDBTableName, enter the name of the DynamoDB table.
For Fromemailaddress, enter the email address to send emails.
For Toemailaddress, enter the email address to receive emails.
For AppianHostEndpoint, enter the AppianHostEndpoint value captured earlier.
For AppianAPIKey, enter the AppianAPIKey value captured earlier.

Leave other settings as default and choose Next.

Under Capabilities on the last page, select I acknowledge that AWS CloudFormation might create IAM resources.
Choose Submit to create the CloudFormation stack.

After the successful deployment of the whole stack, an email will be sent to the email addresses provided earlier.

Verify the newly created email identities by choosing the link in the email.
On the Resources tab of the CloudFormation stack, make a note of the physical IDs for the following resource logical IDs. You will need them later.

OpenAPISpecsS3Bucket
QualityFormsBucket

This post does not cover auto scaling of AWS Lambda. To integrate Lambda with AWS Application Auto Scaling, see AWS Lambda and Application Auto Scaling.
Upload Open API files to the S3 bucket
Complete the following steps to upload the Open API specifications to Amazon S3:

Download the following Open API specifications:

Device Classification (deviceclassification.json)
Verify Quality Documents (verifyQualityDocuments.json)
Email Reviewers (emailReviewers.json)
Appian Case (appian-case.json)

On the Amazon S3 console, navigate to the OpenAPISpecsS3Bucket captured earlier.
Upload the downloaded files to the bucket.
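If you prefer to script this upload, a short boto3 sketch like the following can copy the downloaded specifications to the bucket; the bucket name is a placeholder you replace with the OpenAPISpecsS3Bucket physical ID noted earlier, and the file names come from the list above.

import boto3

s3 = boto3.client("s3")
bucket = "<OpenAPISpecsS3Bucket-physical-id>"  # value captured from the CloudFormation Resources tab

for spec in [
    "deviceclassification.json",
    "verifyQualityDocuments.json",
    "emailReviewers.json",
    "appian-case.json",
]:
    s3.upload_file(spec, bucket, spec)
    print(f"Uploaded {spec} to s3://{bucket}/{spec}")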

Upload the quality forms to the S3 bucket
Complete the following steps to upload the quality form to Amazon S3:

Download the dummy quality form.
On the AWS CloudFormation console, navigate to the Resources tab of the stack and choose the link next to the physical ID of QualityFormsBucket.

Upload the downloaded sample file to the bucket.

Create an effective prompt
Before we configure the agents, we will define a prompt. Prompts are the key to unlocking the full potential of Amazon Bedrock agents. Prompts are the textual inputs that guide the agent’s behavior and responses. Crafting well-designed prompts is essential for making sure that the agent understands the context, intent, and desired output.
When creating prompts, consider the following best practices:

Provide clear and concise instructions
Include relevant background information and context
Follow the model best practices to format the prompt

Amazon Bedrock Agents supports advanced prompting techniques, including chain-of-thought (CoT) and tree-of-thought (ToT) prompting. CoT prompting is a technique that enhances the reasoning capabilities of FMs by breaking down complex questions or tasks into smaller, more manageable steps. ToT prompting is a technique used to improve FM reasoning capabilities by breaking down larger problem statements into a treelike format, where each problem is divided into smaller subproblems. We use ToT prompting: we start by breaking down the business process into logical steps and then incorporate model-specific formatting.
The following is the prompt developed for Anthropic’s Claude 3 Sonnet:

You are an agent that helps determine if device requires a quality review and you always use actions groups to answer. To verify if a review is needed, follow these steps:

1. Ask the user to provide the device type. If not provided, prompt for it.
2. Fetch the device classification from the database based on the provided device type using deviceClassification action group
3. If the classification returned from action group is Class III or 3
4. Ask the user for the specific device name.
5. Check if the device name has quality review forms using the verifyifformsExists action group
6. If a quality review document exists:
7. Prepare an email with the relevant content.
8. Ask for to email address and from email address
9. Send the email to the user.
10. If no quality review document exists, create a case.

Create an Amazon Bedrock Agent
The first step in configuring Amazon Bedrock Agents is to define their capabilities. Amazon Bedrock agents can be trained to perform a wide range of tasks, from natural language processing and generation to task completion and decision-making. When defining an agent’s capabilities, consider the specific use case and the desired outcomes.
To create an agent, complete the following steps:

On the Amazon Bedrock console, choose Agents in the navigation pane.
Choose Create Agent.

In the Agent details section, enter a name for the agent and an optional description.
Choose Create.

In the agent builder, choose Create and use a new service role for the agent resource role.

Choose Anthropic’s Claude 3 Sonnet as the model.
In the Instructions for the Agent section, provide the prompt crafted earlier.

In the Additional settings section, for User input, select Enabled.

Choose Save and exit to save the agent.

Create action groups
Complete the following steps to create the action groups for the newly created agent:

On the Amazon Bedrock console, choose Agents in the navigation pane.
Choose the newly created agent and choose Edit in Agent Builder.
In the Action groups section, choose Add.

In the Action group details section, change the automatically generated name to checkdeviceclassification and provide an optional description for your action group.
In the Action group type section, select Define with API schemas to use the OpenAPI schema.

In the Action group invocation section, select Select an existing Lambda function to use an existing Lambda function.
On the drop-down menu, choose the Lambda function with the name containing DeviceClassification.

In the Action group schema section, select Define via in-line schema editor to define the schema.
Choose JSON on the drop-down menu next to
Open the device classification file downloaded earlier and copy the content of the schema file.
Enter the content in the schema editor.

Choose Create to create an action group.
Repeat the preceding steps to create additional action groups. Use the following table to map the action groups to the respective Lambda functions and Open API schemas.

Action Group Name           Lambda Function Name Containing    Open API Schema
checkdeviceclassification   DeviceClassification               deviceclassification.json
verifyqualitydocuments      VerifyQualityDocuments             verifyQualityDocuments.json
emailreviewers              EmailReviewers                     emailReviewers.json
appiancase                  Appian                             appian-case.json

To customize the agent’s behavior to your specific use case, you can modify the prompt templates for the preprocessing, orchestration, knowledge base response generation, and postprocessing steps. For more information, see Enhance agent’s accuracy using advanced prompt templates in Amazon Bedrock.
Create a knowledge base
You can create an Amazon Bedrock knowledge base to retrieve information from your proprietary data and generate responses to answer natural language questions. As part of creating a knowledge base, you configure a data source and a vector store of your choice.
The prompt crafted earlier provides instructions that are not dependent on a knowledge base. To use a knowledge base, modify the prompt accordingly.
Prepare the agent
Complete the following steps to prepare the agent for deployment:

On the Amazon Bedrock console, navigate to the agent you created.
In the agent builder, choose Save.

After the agent is saved, the Prepare button will be enabled.

Choose Prepare to build the agent.

Test the agent
To test the agent, we use the Amazon Bedrock agent console. You can embed the API calls into your applications.
If you use AWS published API calls to access Amazon Bedrock over the network, the client must adhere to the networking and authentication requirements described in the Amazon Bedrock documentation.
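For reference, a minimal sketch of invoking the prepared agent programmatically with boto3 might look like the following; the agent ID and alias ID are placeholders you copy from the agent's details page, and error handling is omitted.

import uuid
import boto3

client = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

response = client.invoke_agent(
    agentId="<your-agent-id>",             # placeholder
    agentAliasId="<your-agent-alias-id>",  # placeholder
    sessionId=str(uuid.uuid4()),
    inputText="verify if the device requires review",
)

# The completion is returned as an event stream of chunks.
answer = ""
for event in response["completion"]:
    chunk = event.get("chunk", {})
    answer += chunk.get("bytes", b"").decode("utf-8")
print(answer)
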
Complete the following steps to test the agent on the Amazon Bedrock console:

On the Test page for the agent, choose the arrows icon to enlarge the test window.

In the message bar, enter “verify if the device requires review.”

The agent will respond by asking for the type of device.

Enter “HIV diagnostic tests.”

The CloudFormation template only deploys “HIV diagnostic tests” as a Type 3 device.
The agent fetches the classification of the device from the DynamoDB table. You can update the CloudFormation template to add more values.
Because the classification of HIV diagnostic tests is Type 3, the agent will ask for the device name to verify if the quality document exists.

Enter anytech.

The agent will verify if the document with the name anytech exists in Amazon S3. (Earlier, you uploaded a dummy document for anytech.)
The agent should now ask for an email address to receive the quality review request.

An email will be sent with the review details.

Repeat the preceding steps but this time, enter anytechorg as the document name.

We did not upload a document named anytechorg, so the agent will create a case by asking for the following information:

First name
Last name
Mobile phone number
Description
Title of the case

Provide the required information to the agent.

The agent now creates a case.
Best practices
Consider the following best practices for building efficient and well-architected generative AI applications:

Follow best practices for creating accurate and reliable agents using Amazon Bedrock Agents.
Review architectural considerations and development lifecycle practices to build robust, scalable, and secure intelligent agents.
Refer to the guidance to optimize the performance of Amazon Bedrock agents.
Follow best practices to protect your application against prompt injection.
Use the VPC interface endpoints to create a private connection between your VPC and Amazon Bedrock.
Minimize harmful content in models using Amazon Bedrock Guardrails.
Monitor your generative AI applications using Amazon CloudWatch logs and metrics.

Clean up
To avoid incurring future charges, delete the resources you created. To clean up the AWS environment, complete the following steps:

Empty the contents of the S3 buckets you created as part of the CloudFormation stack.
Delete the agent from Amazon Bedrock.
Delete the CloudFormation stack you created.

Conclusion
Integrating generative AI with existing systems is crucial to unlocking its transformative potential. By using tools like Amazon Bedrock Agents, organizations can seamlessly connect generative AI to core data and workflows, enabling automation, content generation, and problem-solving while maintaining connectivity. The strategies and techniques showcased in this post demonstrate how generative AI can be orchestrated to drive maximum value across a wide range of use cases, from extracting intelligence from regulatory submissions to providing prescriptive guidance to industry. As generative AI continues to evolve, the ability to integrate it with existing infrastructure will be paramount to realizing its true business impact.
To get started with integrating generative AI into your business, explore How Amazon Bedrock Agents works and discover how you can unlock the transformative potential of this technology across your organization.
Stay up to date with the latest advancements in generative AI and start building on AWS. If you’re seeking assistance on how to begin, check out the Generative AI Innovation Center.

About the Authors
Sujatha Dantuluri is a seasoned Senior Solutions Architect in the US federal civilian team at AWS, with over two decades of experience supporting commercial and federal government clients. Her expertise lies in architecting mission-critical solutions and working closely with customers to ensure their success. Sujatha is an accomplished public speaker, frequently sharing her insights and knowledge at industry events and conferences.
Arianna Burgman is a Solutions Architect at AWS based in NYC, supporting state and local government agencies. She is a data and AI enthusiast with experience collaborating with organizations to architect technical solutions that further their missions for continuous innovation and positive, lasting impact.
Annie Cimack is an Associate Solutions Architect based in Arlington, VA, supporting public sector customers across the federal government as well as higher education. Her area of focus is data analytics, and she works closely with customers of all sizes to support projects ranging from storage to intelligent document processing.
Sunil Bemarkar is a Sr. Partner Solutions Architect at AWS based out of San Francisco with over 20 years of experience in the information technology field. He works with various independent software vendors and AWS partners specialized in cloud management tools and DevOps segments to develop joint solutions and accelerate cloud adoption on AWS.
Marcelo Silva is a Principal Product Manager at Amazon Web Services, leading strategy and growth for Amazon Bedrock Knowledge Bases and Amazon Lex.

Top AI Coding Agents in 2025

AI-powered coding agents have significantly transformed software development in 2025, offering advanced features that enhance productivity and streamline workflows. Below is an overview of some of the leading AI coding agents available today.

Devin AI

Designed for complex development tasks, Devin AI utilizes multi-agent parallel workflows to manage intricate projects efficiently. Its architecture supports the simultaneous execution of multiple processes, making it suitable for large-scale applications that require robust performance and scalability.

GitHub Copilot

GitHub Copilot is an AI-powered coding assistant that provides code suggestions and autocompletion features. Integrated directly into the development environment, it helps developers write code more efficiently by predicting and generating code snippets in real-time.

Magic Patterns

Magic Patterns enhances development efficiency by providing tools to build UI components more swiftly. It offers a library of reusable patterns and components, reducing the time spent on repetitive tasks and allowing developers to focus on more complex aspects of their projects.

Windsurf

Windsurf focuses on automated code analysis with cascading features, providing developers with insights into code quality and potential issues. Its analytical tools help maintain high coding standards and facilitate the early detection of bugs or vulnerabilities.

Uizard AI

Uizard AI focuses on rapid prototyping, enabling UI/UX designers to quickly transform ideas into interactive prototypes. This accelerates the design process, allowing for swift iterations and user testing, ultimately leading to more user-centered designs.

Replit Agent

Optimized for small and medium-sized enterprise (SME) workflows, Replit Agent offers advanced coding automation features. It integrates with various development tools and platforms, streamlining the coding process and reducing the overhead associated with managing development environments.

Galileo AI

Tailored for mobile UI development, Galileo AI assists developers in creating intuitive and visually appealing interfaces. Its features are optimized for mobile platforms, ensuring that applications deliver a seamless user experience across different devices and screen sizes.

Warp

Warp employs a multi-agent-based approach to coding, automating various aspects of the development process. Its architecture allows for the delegation of tasks to specialized agents, enhancing efficiency and enabling developers to manage complex projects more effectively.

Lovable Dev

Specializing in converting Figma designs into fully functional applications, Lovable Dev bridges the gap between design and development. This tool streamlines the UI/UX development process, allowing designers and developers to collaborate more effectively and bring designs to life with minimal manual coding.

Bolt New

Bolt New is recognized for its user-friendly interface and ease of deployment, making it suitable for both novice and experienced developers. It supports rapid prototyping and integrates seamlessly with various development environments, facilitating quick iterations during the development process.

V0 Dev

V0 Dev offers support for multiple frontend frameworks, providing developers with the flexibility to choose the most appropriate tools for their projects. Its compatibility with various frameworks makes it a versatile choice for building diverse applications, accommodating different project requirements and developer preferences.

Cursor

Cursor is designed to help developers maintain control over their codebases, offering tools for version control, code reviews, and collaboration. It ensures that code remains organized and maintainable, supporting best practices in software development.

Conclusion

The landscape of AI coding agents in 2025 offers a diverse array of tools tailored to various aspects of software development. From design conversion and rapid prototyping to advanced code analysis and multi-agent workflows, these tools enhance efficiency and support developers in creating high-quality applications. As AI continues to evolve, we can anticipate even more sophisticated features and integrations that will further transform the development process.



Anthropic Introduces Constitutional Classifiers: A Measured AI Approac …

Large language models (LLMs) have become an integral part of various applications, but they remain vulnerable to exploitation. A key concern is the emergence of universal jailbreaks—prompting techniques that bypass safeguards, allowing users to access restricted information. These exploits can be used to facilitate harmful activities, such as synthesizing illegal substances or evading cybersecurity measures. As AI capabilities advance, so too do the methods used to manipulate them, underscoring the need for reliable safeguards that balance security with practical usability.

To mitigate these risks, Anthropic researchers introduce Constitutional Classifiers, a structured framework designed to enhance LLM safety. These classifiers are trained using synthetic data generated in accordance with clearly defined constitutional principles. By outlining categories of restricted and permissible content, this approach provides a flexible mechanism for adapting to evolving threats.

Rather than relying on static rule-based filters or human moderation, Constitutional Classifiers take a more structured approach by embedding ethical and safety considerations directly into the system. This allows for more consistent and scalable filtering without significantly compromising usability.

How It Works and Its Benefits

Anthropic’s approach centers on three key aspects:

Robustness Against Jailbreaks: The classifiers are trained on synthetic data that reflects constitutional rules, improving their ability to identify and block harmful content.

Practical Deployment: The framework introduces a manageable 23.7% inference overhead, ensuring that it remains feasible for real-world use.

Adaptability: Because the constitution can be updated, the system remains responsive to emerging security challenges.

The classifiers function at both input and output stages. The input classifier screens prompts to prevent harmful queries from reaching the model, while the output classifier evaluates responses as they are generated, allowing for real-time intervention if necessary. This token-by-token evaluation helps maintain a balance between safety and user experience.
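Conceptually, the deployment wraps the model in two screening passes. The following simplified sketch is our illustration of that control flow, not Anthropic's implementation; the classifier and model objects are placeholders supplied by the caller.

def guarded_generate(prompt, model, input_classifier, output_classifier):
    """Illustrative control flow: screen the prompt, then screen the streamed output."""
    # Stage 1: block clearly harmful prompts before they reach the model.
    if input_classifier(prompt) == "harmful":
        return "I can't help with that request."

    # Stage 2: evaluate the response as it is generated, token by token,
    # so generation can be halted mid-stream if it drifts into restricted content.
    generated = ""
    for token in model.stream(prompt):
        generated += token
        if output_classifier(generated) == "harmful":
            return "I can't continue with that response."
    return generated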

Findings and Observations

Anthropic conducted extensive testing, involving over 3,000 hours of red-teaming with 405 participants, including security researchers and AI specialists. The results highlight the effectiveness of Constitutional Classifiers:

No universal jailbreak was discovered that could consistently bypass the safeguards.

The system successfully blocked 95% of jailbreak attempts, a significant improvement over the 14% refusal rate observed in unguarded models.

The classifiers introduced only a 0.38% increase in refusals on real-world usage, indicating that unnecessary restrictions remain minimal.

Most attack attempts focused on subtle rewording and exploiting response length, rather than finding genuine vulnerabilities in the system.

While no security measure is completely infallible, these findings suggest that Constitutional Classifiers offer a meaningful improvement in reducing the risks associated with universal jailbreaks.

Figure 5: Constitutional classifiers substantially improve robustness over harmlessness training alone.

Conclusion

Anthropic’s Constitutional Classifiers represent a pragmatic step toward strengthening AI safety. By structuring safeguards around explicit constitutional principles, this approach provides a flexible and scalable way to manage security risks without unduly restricting legitimate use. As adversarial techniques continue to evolve, ongoing refinement will be necessary to maintain the effectiveness of these defenses. Nonetheless, this framework demonstrates that a well-designed, adaptive safety mechanism can significantly mitigate risks while preserving practical functionality.

Check out the Paper. All credit for this research goes to the researchers of this project.


This AI Paper from Meta Introduces Diverse Preference Optimization (Di …

Large language models (LLMs) have advanced the field of artificial intelligence and are used in many applications. Although they can almost perfectly simulate human language, they tend to lose response diversity. This limitation is particularly problematic in tasks requiring creativity, such as synthetic data generation and storytelling, where diverse outputs are essential for maintaining relevance and engagement.

One of the major challenges in language model optimization is the reduction in response diversity due to preference training techniques. Post-training methods like reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO) tend to concentrate probability mass on a limited number of high-reward responses. This results in models generating repetitive outputs for various prompts, restricting their adaptability in creative applications. The decline in diversity hinders the potential of language models to function effectively in fields that require broad-ranging outputs.

Previous methods for preference optimization primarily emphasize aligning models with high-quality human preferences. Supervised fine-tuning and RLHF techniques, while effective at improving model alignment, inadvertently lead to response homogenization. Direct Preference Optimization (DPO) selects highly rewarded responses while discarding low-quality ones, reinforcing the tendency for models to produce predictable outputs. Attempts to counteract this issue, such as adjusting sampling temperatures or applying KL divergence regularization, have failed to significantly enhance diversity without compromising output quality.

Researchers from Meta, New York University, and ETH Zurich have introduced Diverse Preference Optimization (DivPO), a novel technique designed to enhance response diversity while maintaining high quality. Unlike traditional optimization methods prioritizing the highest-rewarded response, DivPO selects preference pairs based on quality and diversity. This ensures that the model generates outputs that are not only human-aligned but also varied, making them more effective in creative and data-driven applications.

DivPO operates by sampling multiple responses for a given prompt and scoring them using a reward model. Instead of selecting the single highest-rewarded response, the most diverse, high-quality response is chosen as the preferred output. Simultaneously, the least varied response that does not meet the quality threshold is selected as the rejected output. This contrastive optimization strategy allows DivPO to learn a broader distribution of responses while ensuring that each output retains a high-quality standard. The approach incorporates various diversity criteria, including model probability, word frequency, and an LLM-based diversity judgment, to assess each response’s distinctiveness systematically.
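The pair-selection rule described above can be summarized in a few lines. The sketch below is our paraphrase of that construction, with the reward and diversity scoring functions left as placeholders rather than the paper's exact implementation.

def select_divpo_pair(responses, reward_fn, diversity_fn, quality_threshold):
    """Pick a (chosen, rejected) pair that trades off quality and diversity.

    chosen   = most diverse response among those meeting the quality threshold
    rejected = least diverse response among those below the threshold
    Returns None if either pool is empty for this prompt.
    """
    scored = [(r, reward_fn(r), diversity_fn(r)) for r in responses]
    high_quality = [s for s in scored if s[1] >= quality_threshold]
    low_quality = [s for s in scored if s[1] < quality_threshold]
    if not high_quality or not low_quality:
        return None
    chosen = max(high_quality, key=lambda s: s[2])[0]
    rejected = min(low_quality, key=lambda s: s[2])[0]
    return chosen, rejected  # train on this pair with a standard DPO-style loss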

Extensive experiments were conducted to validate the effectiveness of DivPO, focusing on structured persona generation and open-ended creative writing tasks. The results demonstrated that DivPO significantly increased diversity without sacrificing quality. Compared to standard preference optimization methods, DivPO led to a 45.6% increase in persona attribute diversity and a 74.6% rise in story diversity. The experiments also showed that DivPO prevents models from generating a small subset of responses disproportionately, ensuring a more even distribution of generated attributes. A key observation was that models trained using DivPO consistently outperformed baseline models in diversity evaluations while maintaining high quality, as assessed by the ArmoRM reward model.

Further analysis of persona generation revealed that traditional fine-tuned models, such as Llama-3.1-8B-Instruct, failed to produce varied persona attributes, often repeating a limited set of names. DivPO rectified this issue by expanding the generated attribute range, leading to a more balanced and representative output distribution. The structured persona generation task demonstrated that online DivPO with word frequency criteria improved diversity by 30.07% compared to the baseline model while maintaining a comparable level of response quality. Similarly, the keyword-based creative writing task showed a substantial improvement, with DivPO achieving a 13.6% increase in diversity and a 39.6% increase in quality relative to the standard preference optimization models.

These findings confirm that preference optimization methods inherently reduce diversity, challenging language models designed for open-ended tasks. DivPO effectively mitigates this issue by incorporating diversity-aware selection criteria, enabling language models to maintain high-quality responses without limiting variability. By balancing diversity with alignment, DivPO enhances the adaptability and utility of LLMs across multiple domains, ensuring they remain useful for creative, analytical, and synthetic data generation applications. The introduction of DivPO marks a significant advancement in preference optimization, offering a practical solution to the long-standing problem of response collapse in language models.

Check out the Paper. All credit for this research goes to the researchers of this project.


Accelerate video Q&A workflows using Amazon Bedrock Knowledge Base …

Organizations are often inundated with video and audio content that contains valuable insights. However, extracting those insights efficiently and with high accuracy remains a challenge. This post explores an innovative solution to accelerate video and audio review workflows through a thoughtfully designed user experience that enables human and AI collaboration. By approaching the problem from the user’s point of view, we can create a powerful tool that allows people to quickly find relevant information within long recordings without the risk of AI hallucinations.
Many professionals, from lawyers and journalists to content creators and medical practitioners, need to review hours of recorded content regularly to extract verifiably accurate insights. Traditional methods of manual review or simple keyword searches over transcripts are time-consuming and often miss important context. More advanced AI-powered summarization tools exist, but they risk producing hallucinations or inaccurate information, which can be dangerous in high-stakes environments like healthcare or legal proceedings.
Our solution, the Recorded Voice Insight Extraction Webapp (ReVIEW), addresses these challenges by providing a seamless method for humans to collaborate with AI, accelerating the review process while maintaining accuracy and trust in the results. The application is built on top of Amazon Transcribe and Amazon Bedrock, a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI.
User experience
To accelerate a user’s review of a long-form audio or video while mitigating the risk of hallucinations, we introduce the concept of timestamped citations. Not only are large language models (LLMs) capable of answering a user’s question based on the transcript of the file, they are also capable of identifying the timestamp (or timestamps) of the transcript during which the answer was discussed. By using a combination of transcript preprocessing, prompt engineering, and structured LLM output, we enable the user experience shown in the following screenshot, which demonstrates the conversion of LLM-generated timestamp citations into clickable buttons (shown underlined in red) that navigate to the correct portion of the source video.

The user in this example has uploaded a number of videos, including some recordings of AWS re:Invent talks. You’ll notice that the preceding answer actually contains a hallucination originating from an error in the transcript; the AI assistant replied that “Hyperpaths” was announced, when in reality the service is called Amazon SageMaker HyperPod.
The user in the preceding screenshot had the following journey:

The user asks the AI assistant “What’s new with SageMaker?” The assistant searches the timestamped transcripts of the uploaded re:Invent videos.
The assistant provides an answer with citations. Those citations contain both the name of the video and a timestamp, and the frontend displays buttons corresponding to the citations. Each citation can point to a different video, or to different timestamps within the same video.
The user reads that SageMaker “Hyperpaths” was announced. They proceed to verify the accuracy of the generated answer by selecting the buttons, which automatically play the source video starting at that timestamp.
The user sees that the product is actually called Amazon SageMaker HyperPod, and can be confident that SageMaker HyperPod was the product announced at re:Invent.

This experience, which is at the heart of the ReVIEW application, enables users to efficiently get answers to questions based on uploaded audio or video files and to verify the accuracy of the answers by rewatching the source media for themselves.
Solution overview
The full code for this application is available on the GitHub repo.
The architecture of the solution is shown in the following diagram, showcasing the flow of data through the application.

The workflow consists of the following steps:

A user accesses the application through an Amazon CloudFront distribution, which adds a custom header and forwards HTTPS traffic to an Elastic Load Balancing application load balancer. Behind the load balancer is a containerized Streamlit application running on Amazon Elastic Container Service (Amazon ECS).
Amazon Cognito handles user logins to the frontend application and Amazon API Gateway.
When a user uploads a media file through the frontend, a pre-signed URL is generated for the frontend to upload the file to Amazon Simple Storage Service (Amazon S3).
The frontend posts the file to an application S3 bucket, at which point a file processing flow is initiated through a triggered AWS Lambda. The file is sent to Amazon Transcribe and the resulting transcript is stored in Amazon S3. The transcript gets postprocessed into a text form more appropriate for use by an LLM, and an AWS Step Functions state machine syncs the transcript to a knowledge base configured in Amazon Bedrock Knowledge Bases. The knowledge base sync process handles chunking and embedding of the transcript, and storing embedding vectors and file metadata in an Amazon OpenSearch Serverless vector database.
If a user asks a question of one specific transcript (designated by the “pick media file” dropdown menu in the UI), the entire transcript is used to generate the response, so a retrieval step using the knowledge base is not required and an LLM is called directly through Amazon Bedrock.
If the user asks a question whose answer might appear in any number of source videos (by choosing Chat with all media files on the dropdown menu in the UI), the Amazon Bedrock Knowledge Bases RetrieveAndGenerate API is used to embed the user query, find semantically similar chunks in the vector database, feed those chunks into an LLM prompt, and generate a specially formatted response (a minimal sketch of this API call follows the list).
Throughout the process, application data such as transcription and ingestion status, the mapping of user names to uploaded files, and cached responses is stored in Amazon DynamoDB.
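As a rough illustration of the “chat with all media files” path, the following minimal sketch calls the Amazon Bedrock Knowledge Bases RetrieveAndGenerate API through boto3. The knowledge base ID and model ARN are placeholders, and the actual ReVIEW implementation in the GitHub repo wires this call up with additional prompt and citation handling:

import boto3

# Placeholder identifiers; substitute the knowledge base ID and model ARN
# created by your own deployment.
KNOWLEDGE_BASE_ID = "XXXXXXXXXX"
MODEL_ARN = "arn:aws:bedrock:us-east-1::foundation-model/amazon.nova-pro-v1:0"

client = boto3.client("bedrock-agent-runtime")

response = client.retrieve_and_generate(
    input={"text": "What's new with SageMaker?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": KNOWLEDGE_BASE_ID,
            "modelArn": MODEL_ARN,
        },
    },
)

# The generated answer and the retrieved chunks (with their source metadata)
# come back in a single response object.
print(response["output"]["text"])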

One important characteristic of the architecture is the clear separation of frontend and backend logic through an API Gateway deployed REST API. This was a design decision to enable users of this application to replace the Streamlit frontend with a custom frontend. There are instructions for replacing the frontend in the README of the GitHub repository.
Timestamped citations
The key to this solution lies in the prompt engineering and structured output format. When generating a response to a user’s question, the LLM is instructed to not only provide an answer to the question (if possible), but also to cite its sources in a specific way.
The full prompt can be seen in the GitHub repository, but a shortened pseudo prompt (for brevity) is shown here:

You are an intelligent AI which attempts to answer questions based on retrieved chunks of automatically generated transcripts.
Below are retrieved chunks of transcript with metadata including the file name. Each chunk includes a <media_name> and lines of a transcript, each line beginning with a timestamp.
$$ retrieved transcript chunks $$
Your answer should be in json format, including a list of partial answers, each of which has a citation. The citation should include the source file name and timestamp. Here is the user’s question:
$$ user question $$

The frontend then parses the LLM response into a fixed schema data model, described with Pydantic BaseModels:
from typing import List

from pydantic import BaseModel


class Citation(BaseModel):
    """A single citation from a transcript"""

    media_name: str
    timestamp: int


class PartialQAnswer(BaseModel):
    """Part of a complete answer, to be concatenated with other partial answers"""

    partial_answer: str
    citations: List[Citation]


class FullQAnswer(BaseModel):
    """Full user query response including citations and one or more partial answers"""

    answer: List[PartialQAnswer]
This format allows the frontend to parse the response and display buttons for each citation that cue up the relevant media segment for user review.
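As a concrete, purely illustrative example of that parsing step, the raw JSON string returned by the LLM can be validated against the schema before any buttons are rendered. The sample payload and variable names below are assumptions for illustration, not values taken from the application; model_validate_json is the Pydantic v2 API (Pydantic v1 uses parse_raw):

# Hypothetical LLM output following the requested JSON schema.
llm_json = (
    '{"answer": [{"partial_answer": "SageMaker HyperPod was announced.", '
    '"citations": [{"media_name": "reinvent_keynote.mp4", "timestamp": 1320}]}]}'
)

parsed = FullQAnswer.model_validate_json(llm_json)
for partial in parsed.answer:
    for citation in partial.citations:
        # Each citation becomes a button that seeks to citation.timestamp
        # within the named media file.
        print(partial.partial_answer, citation.media_name, citation.timestamp)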
Deployment details
The solution is deployed in the form of one AWS Cloud Development Kit (AWS CDK) stack, which contains four nested stacks (a minimal CDK sketch follows the list):

A backend that handles transcribing uploaded media and tracking job statuses
A Retrieval Augmented Generation (RAG) stack that handles setting up OpenSearch Serverless and Amazon Bedrock Knowledge Bases
An API stack that stands up an Amazon Cognito authorized REST API and various Lambda functions to logically separate the frontend from the backend
A frontend stack that consists of a containerized Streamlit application running as a load balanced service in an ECS cluster, with a CloudFront distribution connected to the load balancer
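The following is a minimal sketch of how a parent CDK stack can compose nested stacks in Python. The class and construct names are illustrative assumptions and do not mirror the ReVIEW codebase; see the GitHub repository for the actual stack definitions.

from aws_cdk import App, NestedStack, Stack
from constructs import Construct

class BackendStack(NestedStack):
    """Placeholder for the transcription and job-status tracking resources."""
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        # Transcribe jobs, Lambda functions, and DynamoDB tables would be defined here.

class ReviewParentStack(Stack):
    """Parent stack that composes the nested stacks."""
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        BackendStack(self, "Backend")
        # The RAG, API, and frontend stacks would follow the same NestedStack pattern.

app = App()
ReviewParentStack(app, "ReVIEW")
app.synth()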

Prerequisites
The solution requires the following prerequisites:

You need to have an AWS account and an AWS Identity and Access Management (IAM) role and user with permissions to create and manage the necessary resources and components for this application. If you don’t have an AWS account, see How do I create and activate a new Amazon Web Services account?
You also need to request access to at least one Amazon Bedrock LLM (to generate answers to questions) and one embedding model (to find transcript chunks that are semantically similar to a user question). The following Amazon Bedrock models are the defaults, but they can be changed using a configuration file at deployment time, as described later in this post:

Amazon Titan Embeddings V2 – Text
Amazon’s Nova Pro

You need a Python environment with AWS CDK dependencies installed. For instructions, see Working with the AWS CDK in Python.
Docker is required to build the Streamlit frontend container at deployment time.
The minimal IAM permissions needed to bootstrap and deploy the AWS CDK are described in the ReVIEW/infra/minimal-iam-policy.json file in the GitHub repository. Make sure the IAM user or role deploying the stacks has these permissions.

Clone the repository
Fork the repository, and clone it to the location of your choice. For example:

$ git clone https://github.com/aws-samples/recorded-voice-insight-extraction-webapp.git

Edit the deployment config file
Optionally, edit the infra/config.yaml file to provide a descriptive base name for your stack. This file is also where you can choose specific Amazon Bedrock embedding models for semantic retrieval and LLMs for response generation, and define chunking strategies for the knowledge base that will ingest transcriptions of uploaded media files. This file is also where you can reuse an existing Amazon Cognito user pool if you want to bootstrap your application with an existing user base.
Deploy the AWS CDK stacks
Deploy the AWS CDK stacks with the following code:

$ cd infra
$ cdk bootstrap
$ cdk deploy --all

You only need to run the cdk bootstrap command one time per AWS account. The deploy command will deploy the parent stack and four nested stacks. The process takes approximately 20 minutes to complete.
When the deployment is complete, a CloudFront distribution URL of the form xxx.cloudfront.net will be printed on the console screen to access the application. This URL can also be found on the AWS CloudFormation console by locating the stack whose name matches the value in the config file, then choosing the Outputs tab and locating the value associated with the key ReVIEWFrontendURL. That URL will lead you to a login screen like the following screenshot.

Create an Amazon Cognito user to access the app
To log in to the running web application, you have to create an Amazon Cognito user. Complete the following steps:

On the Amazon Cognito console, navigate to the recently created user pool.
In the Users section under User Management, choose Create user.
Create a user name and password to log in to the ReVIEW application deployed in the account.

When the application deployment is destroyed (as described in the cleanup section), the Amazon Cognito pool remains to preserve the user base. The pool can be fully removed manually using the Amazon Cognito console.
Test the application
Test the application by uploading one or more audio or video files on the File Upload tab. The application supports media formats supported by Amazon Transcribe. If you are looking for a sample video, consider downloading a TED talk. After uploading, you will see the file appear on the Job Status tab. You can track processing progress through transcription, postprocessing, and knowledge base syncing steps on this tab. After at least one file is marked Complete, you can chat with it on the Chat With Your Media tab.
The Analyze Your Media tab allows you to create and apply custom LLM template prompts to individual uploaded files. For example, you can create a basic summary template, or an extract key information template, and apply it to your uploaded files here. This functionality was not described in detail in this post.
Clean up
The deployed application will incur ongoing costs even if it isn’t used, for example from OpenSearch Serverless indexing and search OCU minimums. To delete all resources created when deploying the application, run the following command:

$ cdk destroy --all

Conclusion
The solution presented in this post demonstrates a powerful pattern for accelerating video and audio review workflows while maintaining human oversight. By combining the power of AI models in Amazon Bedrock with human expertise, you can create tools that not only boost productivity but also maintain the critical element of human judgment in important decision-making processes.
We encourage you to explore this fully open sourced solution, adapt it to your specific use cases, and provide feedback on your experiences.
For expert assistance, the AWS Generative AI Innovation Center, AWS Professional Services, and our AWS Partners are here to help.

About the Author
David Kaleko is a Senior Applied Scientist in the AWS Generative AI Innovation Center.

Boost team innovation, productivity, and knowledge sharing with Amazon …

As enterprises rapidly expand their applications, platforms, and infrastructure, it becomes increasingly challenging to keep up with technology trends, best practices, and programming standards. Enterprises typically provide their developers, engineers, and architects with a variety of knowledge resources such as user guides, technical wikis, code repositories, and specialized tools. However, over time these resources often become siloed within individual teams or organizational silos, making it difficult for employees to easily access relevant information across the broader organization. This lack of knowledge sharing can lead to duplicated efforts, reduced productivity, and missed opportunities to use institutional expertise.
Imagine you’re a developer tasked with troubleshooting a complex issue in your company’s cloud infrastructure. You scour through outdated user guides and scattered conversations, but can’t find the right answer. Minutes turn into hours, sometimes days, as you struggle to piece together the information you need, all while your project falls behind.
To address these challenges, the MuleSoft team integrated Amazon Q Apps, a capability within Amazon Q Business, a generative AI-powered assistant service, directly into their Cloud Central portal—an individualized portal that shows assets owned, costs and usage, and AWS Well-Architected recommendations to over 100 engineer teams. Amazon Q Apps is designed to use Amazon Q Business and its ability to draw upon an enterprise’s own internal data, documents, and systems to provide conversational assistance to users. By tapping into these rich information sources, you can enable your users to create Amazon Q Apps that can answer questions, summarize key points, generate custom content, and even securely complete certain tasks—all without the user having to navigate through disparate repositories or systems. Prior to Amazon Q Apps, MuleSoft was using a chatbot that used Slack, Amazon Lex V2, and Amazon Kendra. The chatbot solution didn’t meet the needs of the engineering and development teams, which prompted the exploration of Amazon Q Apps.
In this post, we demonstrate how Amazon Q Apps can help maximize the value of existing knowledge resources and improve productivity among various teams, ranging from finance to DevOps to support engineers. We share specific examples of how the generative AI assistant can surface relevant information, distill complex topics, generate custom content, and execute workflows, all while maintaining robust security and data governance controls.
In addition to demonstrating the power of Amazon Q Apps, we provide guidance on prompt engineering and system prompts reflective of real-world use cases using the rich features of Amazon Q Apps. For instance, let’s consider the scenario of troubleshooting network connectivity. By considering personas and their specific lines of business, we can derive the optimal tone and language to provide a targeted, actionable response. This level of personalization is key to delivering optimized customer experiences and building trust.
Improve production with Amazon Q Apps
Amazon Q Apps is a feature within Amazon Q Business that assists you in creating lightweight, purpose-built applications within Amazon Q Business. You can create these apps in several ways, such as describing the application you want in your own words to fit specific requirements, or by transforming your conversations with an Amazon Q Business assistant into prompts that can then be used to generate an application.
With Amazon Q Apps, you can build, share, and customize applications on enterprise data to streamline tasks and boost individual and team productivity. You can also publish applications to an admin-managed library and share them with your coworkers. Amazon Q Apps inherits user permissions, access controls, and enterprise guardrails from Amazon Q Business for secure sharing and adherence to data governance policies.
Amazon Q Apps is only available to users with a Pro subscription. If you have the Lite subscription, you will not be able to view or use Amazon Q Apps.
MuleSoft’s use case with Amazon Q Apps
The team needed a more personalized approach to Amazon Q Business. Upon the announcement of Amazon Q Apps, the team determined it could solve an immediate need across teams. Their Cloud Central portal is already geared for a personalized experience for its users. MuleSoft completed a successful proof of concept integrating Amazon Q Apps into their overall Cloud Central portal. Cloud Central (see the following screenshot) serves as a single pane of glass for both managers and team members to visualize and understand each persona’s personalized cloud assets, cost metrics, and Well-Architected status based on application or infrastructure.

Fig 1: Salesforce MuleSoft Cloud Central Portal

The MuleSoft support team was looking for a way to help them troubleshoot network traffic latency when they rolled out a new customer into their production environment. The MuleSoft team found Amazon Q Apps helpful in providing possible causes for network latency for virtual private clouds (VPCs) as well as in providing prescriptive guidance on how to troubleshoot VPC network latencies. We explore a similar network latency use case in this post.
Solution overview
In this post, we focus on creating Amazon Q applications from the Amazon Q Business Chat and Amazon Q Apps Creator:

Amazon Q Business Chat – You can use the Amazon Q Apps icon in the Amazon Q Business Chat assistant to generate a prompt that can be used to create an application. This feature summarizes the Amazon Q Business Chat conversation to create a prompt that you can review and edit before generating an application.
Amazon Q Apps Creator – With Amazon Q Apps Creator, you can describe the type of application you want to build using your own words to generate an application. Amazon Q Apps will generate an application for you based on the provided prompt.

Prerequisites
Make sure you have an AWS account. If not, you can sign up for one. Refer to Prerequisites for Amazon Q Apps for the steps to complete prior to deploying Amazon Q Apps. For more information, see Getting started with Amazon Q Business.
Create an application using Amazon Q Business Chat
You can choose the Amazon Q Apps icon from an Amazon Q chat conversation to generate an application prompt and use it to create an Amazon Q application. The icon is available in the conversations pane on the left, in the upper-right corner of the Amazon Q Assistant Chat conversation, or on the prompt dropdown menu.

Let’s explore an example of using an Amazon Q chat assistant conversation to create an application.

Begin by asking the Amazon Q Business assistant a question related to the data that is provided in the Amazon Q Business application.

For this example, we ask about steps to troubleshoot network latency.

After you’ve finished your conversation, choose the Amazon Q Apps icon in either the conversation pane or in the upper-right corner to launch Amazon Q App Creator.
Review the generated prompt from the conversation and update the prompt to match your application purpose as needed.
Choose Generate to create the application.
To test the application, we enter under User input “I am unable to reach my EC2 host via port 22,” and choose Run.
Review the generated text output and confirm that the troubleshooting steps look correct.
To share the app with everyone in the library, choose Publish.

The Amazon Q Apps library will show all published applications shared by your teammates. Only users who have access to the Amazon Q Business application will be able to view your published application.

You can choose labels where the application will reside, relating to teams, personas, or categories.

Create an application using Amazon Q Apps Creator
You can start building an Amazon Q application with Amazon Q Apps Creator by describing the task you want to create an application for. Complete the following steps:

Choose Apps in the navigation pane.
Enter your prompt or use an example prompt.

For this post, we enter the prompt “Create an app that crafts insightful content for users to troubleshoot AWS services. It takes inputs like as a use case to work backwards from on a solution. Based on these inputs, the app generates a tailored response for resolving the AWS service use case, providing steps to remediate and content links.”

Choose Generate to create the application.

The Amazon Q application was created with AWS Use Case, Troubleshooting Steps, and Additional Resources sections translated from your prompt.

To test the application, we enter under User input “which AWS tool to manage many AWS accounts and take advantage of consolidated billing,” and choose Run.

The Troubleshooting Steps section highlights using AWS Organizations and provides a walkthrough. The Additional Resources section provides more information about your use case, while citing AWS customer references.

Choose Share to publish your application and choose the appropriate labels.

Results
MuleSoft offers a prime example of the transformative impact of Amazon Q Apps. With this solution, MuleSoft was able to realize a 50% reduction in team inquiries—from 100 down to just 50. These inquiries spanned a wide range, from basic AWS service information to complex networking troubleshooting and even Amazon Elastic Block Store (Amazon EBS) volume migrations from gp2 to gp3.
Pricing
Amazon Q Business offers subscription options for you to customize your access. For more details, see Amazon Q Business pricing.
Conclusion
Amazon Q Business empowers enterprises to maximize the value of their knowledge resources by democratizing access to powerful conversational AI capabilities. Through Amazon Q Apps, organizations can create purpose-built applications using internal data and systems, unlocking new solutions and accelerating innovation.
The MuleSoft team demonstrated this by integrating Amazon Q Apps into their Cloud Central portal, enhancing user experience, streamlining collaboration, and optimizing cloud infrastructure while maintaining robust security and data governance.
Amazon Q Apps provides flexible generative AI application development using natural language, allowing organizations to build and securely publish custom applications tailored to their unique needs. This approach enables teams to boost innovation, productivity, and knowledge sharing across job functions.
By leveraging Amazon Q Business, enterprises can find answers, build applications, and drive productivity using their own enterprise data and conversational AI capabilities.
To learn about other Amazon Q Business customers’ success stories, see Amazon Q Developer customers.
*Note: Amazon Q Apps is only available to users with the Pro subscription. If you have the Lite subscription, you will not be able to view or use Amazon Q Apps.

About the Authors
Rueben Jimenez is an AWS Sr. Solutions Architect who designs and implements complex data analytics, machine learning, generative AI, and cloud infrastructure solutions.
Tiffany Myers is an AWS Product Manager for Amazon Q Apps, launching generative AI solutions for business users.
Summer Petersil is a Strategic Account Representative (SAR) on the AWS Salesforce team, where she leads Generative AI (GenAI) enablement efforts.

Creating a Medical Question-Answering Chatbot Using Open-Source BioMistral LLM, LangChain, Chroma’s Vector Storage, and RAG: A Step-by-Step Guide

In this tutorial, we’ll build a powerful, PDF-based question-answering chatbot tailored for medical or health-related content. We’ll leverage the open-source BioMistral LLM and LangChain’s flexible data orchestration capabilities to process PDF documents into manageable text chunks. We’ll then encode these chunks using Hugging Face embeddings, capturing deep semantic relationships and storing them in a Chroma vector database for high-efficiency retrieval. Finally, by employing a Retrieval-Augmented Generation (RAG) system, we’ll integrate the retrieved context directly into our chatbot’s responses, ensuring clear, authoritative answers for users. This approach allows us to rapidly sift through large volumes of medical PDFs, providing context-rich, accurate, and easy-to-understand insights.

Setting up tools

!pip install langchain sentence-transformers chromadb llama-cpp-python langchain_community pypdf

from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS, Chroma
from langchain_community.llms import LlamaCpp
from langchain.chains import RetrievalQA, LLMChain
import pathlib
import textwrap
from IPython.display import display
from IPython.display import Markdown

def to_markdown(text):
    text = text.replace('•', ' *')
    return Markdown(textwrap.indent(text, '> ', predicate=lambda _: True))

from google.colab import drive
drive.mount('/content/drive')

First, we install and configure Python packages for document processing, embedding generation, local LLMs, and advanced retrieval-based workflows with LlamaCpp. We leverage langchain_community for PDF loading and text splitting, set up RetrievalQA and LLMChain for question answering, and include a to_markdown utility plus Google Drive mounting.

Setting up API key access

from google.colab import userdata
# Or use `os.getenv('HUGGINGFACEHUB_API_TOKEN')` to fetch an environment variable.
import os
from getpass import getpass

HF_API_KEY = userdata.get("HF_API_KEY")
os.environ["HF_API_KEY"] = HF_API_KEY

Here, we securely fetch the Hugging Face API key from Colab user data and set it as an environment variable in Google Colab. You can also use the HUGGINGFACEHUB_API_TOKEN environment variable to avoid directly exposing sensitive credentials in your code.

Loading and Extracting PDFs from a Directory

loader = PyPDFDirectoryLoader('/content/drive/My Drive/Data')
docs = loader.load()

We use PyPDFDirectoryLoader to scan the specified folder for PDFs, extract their text into a document list, and lay the groundwork for tasks like question answering, summarization, or keyword extraction.

Splitting Loaded Text Documents into Manageable Chunks

text_splitter = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=50)
chunks = text_splitter.split_documents(docs)

In this code snippet, RecursiveCharacterTextSplitter is applied to break down each document in docs into smaller, more manageable segments.

Initializing Hugging Face Embeddings

embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-base-en-v1.5")

Using HuggingFaceEmbeddings, we create an embedding object backed by the BAAI/bge-base-en-v1.5 model, which converts text into numerical vectors.

Building a Vector Store and Running a Similarity Search

vectorstore = Chroma.from_documents(chunks, embeddings)
query = "who is at risk of heart disease"
search = vectorstore.similarity_search(query)
to_markdown(search[0].page_content)

We first build a Chroma vector store (Chroma.from_documents) from the text chunks and the specified embedding model. Next, we issue the query “who is at risk of heart disease” and perform a similarity search against the stored embeddings. The top result (search[0].page_content) is then converted to Markdown for clearer display.

Creating a Retriever and Fetching Relevant Documents

retriever = vectorstore.as_retriever(
    search_kwargs={'k': 5}
)
retriever.get_relevant_documents(query)

We convert the Chroma vector store into a retriever (vectorstore.as_retriever) that efficiently fetches the five most relevant documents (k=5) for a given query.

Initializing BioMistral-7B Model with LlamaCpp

llm = LlamaCpp(
    model_path="/content/drive/MyDrive/Model/BioMistral-7B.Q4_K_M.gguf",
    temperature=0.3,
    max_tokens=2048,
    top_p=1)

We set up an open-source local BioMistral LLM using LlamaCpp, pointing to a pre-downloaded model file. We also configure generation parameters such as temperature, max_tokens, and top_p, which control randomness, the maximum tokens generated, and the nucleus sampling strategy.

Setting Up a Retrieval-Augmented Generation (RAG) Chain with a Custom Prompt

from langchain.schema.runnable import RunnablePassthrough
from langchain.schema.output_parser import StrOutputParser
from langchain.prompts import ChatPromptTemplate

# The {context} placeholder receives the documents returned by the retriever,
# and {query} receives the user's question.
template = """
<|context|>
You are an AI assistant that follows instruction extremely well.
Please be truthful and give direct answers
{context}
</s>
<|user|>
{query}
</s>
<|assistant|>
"""
prompt = ChatPromptTemplate.from_template(template)

rag_chain = (
    {'context': retriever, 'query': RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

Using the above, we set up a RAG pipeline using the LangChain framework. It creates a custom prompt with instructions and placeholders, incorporates a retriever for context, and leverages a language model for generating answers. The flow is defined as a series of operations (RunnablePassthrough for direct query handling, the ChatPromptTemplate for prompt construction, the LLM for response generation, and finally, the StrOutputParser to produce a clean text string).

Invoking the RAG Chain to Answer a Health-Related Query

response = rag_chain.invoke("Why should I care about my heart health?")
to_markdown(response)

Now, we call the previously constructed RAG chain with a user’s query. It passes the query to the retriever, retrieves relevant context from the document collection, and feeds that context into the LLM to generate a concise, accurate answer.

In conclusion, by integrating BioMistral via LlamaCpp and taking advantage of LangChain’s flexibility, we are able to build a medical-RAG chatbot with context awareness. From chunk-based indexing to seamless RAG pipelines, it streamlines the process of mining large volumes of PDF data for relevant insights. Users receive clear and easily readable answers by formatting final responses in Markdown. This design can be extended or tailored for various domains, ensuring scalability and precision in knowledge retrieval across diverse documents.


Google AI Introduces Parfait: A Privacy-First AI System for Secure Data Aggregation and Analytics

Protecting user data while enabling advanced analytics and machine learning is a critical challenge. Organizations must process and analyze data without compromising privacy, but existing solutions often struggle to balance security with functionality. This creates barriers to innovation, limiting collaboration and the development of privacy-conscious technologies. A solution is needed that ensures transparency, minimizes data exposure, preserves anonymity, and allows external verification. Addressing these challenges makes it possible to unlock new opportunities for secure and privacy-first computing, enabling businesses and researchers to collaborate effectively while maintaining strict data protection standards.

Recent research has explored various privacy-preserving techniques for data aggregation, model training, and analytics. Differential privacy has been widely adopted to add noise to datasets, ensuring individual data points remain unidentifiable. Federated learning allows models to be trained across decentralized devices without sharing raw data, enhancing security. Additionally, trusted execution environments (TEEs) provide hardware-based security for private computations. Despite these advancements, existing methods often involve trade-offs between accuracy, efficiency, and privacy, highlighting the need for more robust, scalable, and verifiable privacy-first solutions.
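To make these building blocks concrete, the following short sketch (generic Python with NumPy, not Parfait’s actual code) shows how federated averaging can be combined with differentially private noise: each client computes and clips an update on its own data, the server aggregates only those updates, and calibrated Gaussian noise is added so that no individual contribution can be singled out. All function names and parameters here are illustrative assumptions.

import numpy as np

def local_update(weights, client_data, clip_norm=1.0):
    # Placeholder for one round of local training on a single client's data;
    # a real system would run SGD on the client's own examples.
    update = client_data.mean(axis=0) - weights
    norm = np.linalg.norm(update)
    # Clip the update so each client's influence on the aggregate is bounded.
    return update * min(1.0, clip_norm / (norm + 1e-12))

def federated_round(weights, client_datasets, clip_norm=1.0, noise_scale=0.5):
    # Clients send only their clipped updates, never their raw data.
    updates = [local_update(weights, data, clip_norm) for data in client_datasets]
    aggregate = np.mean(updates, axis=0)
    # Gaussian noise calibrated to the clipping bound provides differential privacy.
    noise = np.random.normal(0.0, noise_scale * clip_norm / len(client_datasets),
                             size=aggregate.shape)
    return weights + aggregate + noise

rng = np.random.default_rng(0)
weights = np.zeros(4)
clients = [rng.normal(loc=1.0, size=(20, 4)) for _ in range(5)]
weights = federated_round(weights, clients)
print(weights)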

Researchers from Google introduced a new approach, Parfait, designed to enhance privacy-first computing by integrating multiple privacy-preserving techniques into a unified framework. It prioritizes transparency by offering clear insights into data usage and processing methods. It incorporates federated learning, federated analytics, and secure aggregation to minimize data exposure, allowing computations to occur locally without transferring raw data. Additionally, it employs differential privacy algorithms for tasks like model training and analytics, ensuring sensitive information remains anonymized. By combining these techniques, Parfait enables secure data handling while maintaining accuracy and efficiency.

Another key aspect of Parfait is external verifiability, which ensures that privacy claims can be independently verified. TEEs are utilized to create secure workflows where computations can be audited without compromising confidentiality. This enhances trust among users and organizations by ensuring that privacy protocols are upheld. Parfait fosters a collaborative space and enables businesses and open-source projects to innovate securely while adhering to strict privacy principles. Its comprehensive design aims to address existing challenges in privacy-preserving computation, striking a balance between data security, accessibility, and performance.

The results demonstrate that Parfait effectively enhances privacy-preserving computing by ensuring secure data aggregation, retrieval, and analysis. It successfully maintains data confidentiality while enabling collaborative innovation across various domains. Using federated learning and differential privacy techniques minimizes the risk of privacy breaches. Additionally, trusted execution environments provide verifiability, reinforcing user trust. The framework balances privacy and efficiency, proving its capability to handle tasks like model training, analytics, and secure computation. These findings highlight Parfait’s potential to set a new standard for privacy-first computing, making it a valuable tool for businesses and open-source projects.

In conclusion, Parfait introduces a robust framework for privacy-preserving computing, enabling secure data aggregation, retrieval, and analytics without compromising confidentiality. Integrating advanced privacy techniques such as federated learning, differential privacy, and trusted execution environments ensures transparency, minimizes data exposure, and enhances security. The results highlight its effectiveness in balancing privacy with computational efficiency, making it a tool for businesses and open-source communities. Parfait sets the stage for future innovations in privacy-first computing, paving the way for more secure, verifiable, and collaborative AI applications that respect user data while enabling meaningful insights and advancements.
