Automate customer support with Amazon Bedrock, LangGraph, and Mistral models

AI agents are transforming the landscape of customer support by bridging the gap between large language models (LLMs) and real-world applications. These intelligent, autonomous systems are poised to revolutionize customer service across industries, ushering in a new era of human-AI collaboration and problem-solving. By harnessing the power of LLMs and integrating them with specialized tools and APIs, agents can tackle complex, multistep customer support tasks that were previously beyond the reach of traditional AI systems.

As we look to the future, AI agents will play a crucial role in the following areas:

Enhancing decision-making – Providing deeper, context-aware insights to improve customer support outcomes
Automating workflows – Streamlining customer service processes, from initial contact to resolution, across various channels
Human-AI interactions – Enabling more natural and intuitive interactions between customers and AI systems
Innovation and knowledge integration – Generating new solutions by combining diverse data sources and specialized knowledge to address customer queries more effectively
Ethical AI practices – Helping provide more transparent and explainable AI systems to address customer concerns and build trust

Building and deploying AI agent systems for customer support is a step toward unlocking the full potential of generative AI in this domain. As these systems evolve, they will transform customer service, expand possibilities, and open new doors for AI in enhancing customer experiences.
In this post, we demonstrate how to use Amazon Bedrock and LangGraph to build a personalized customer support experience for an ecommerce retailer. By integrating the Mistral Large 2 and Pixtral Large models, we guide you through automating key customer support workflows such as ticket categorization, order details extraction, damage assessment, and generating contextual responses. These principles are applicable across various industries, but we use the ecommerce domain as our primary example to showcase the end-to-end implementation and best practices. This post provides a comprehensive technical walkthrough to help you enhance your customer service capabilities and explore the latest advancements in LLMs and multimodal AI.
LangGraph is a powerful framework built on top of LangChain that enables the creation of cyclical, stateful graphs for complex AI agent workflows. It uses a directed graph structure where nodes represent individual processing steps (like calling an LLM or using a tool), edges define transitions between steps, and state is maintained and passed between nodes during execution. This architecture is particularly valuable for customer support automation workflows. LangGraph’s advantages include built-in visualization, logging (traces), human-in-the-loop capabilities, and the ability to organize complex workflows in a more maintainable way than traditional Python code.

This post provides details on how to do the following:

Use Amazon Bedrock and LangGraph to build intelligent, context-aware customer support workflows
Integrate data from a helpdesk tool, such as Jira, into the LangChain workflow
Use LLMs and vision language models (VLMs) in the workflow to perform context-specific tasks
Extract information from images to aid in decision-making
Compare images to assess product damage claims
Generate responses for the customer support tickets

Solution overview
This solution involves customers initiating support requests through email, which are automatically converted into new support tickets in Atlassian Jira Service Management. The customer support automation solution then takes over, identifying the intent behind each query, categorizing the tickets, and assigning them to a bot user for further processing. The solution uses LangGraph to orchestrate a workflow in which AI agents extract key identifiers such as transaction IDs and order numbers from the support ticket. It analyzes the query and uses these identifiers to call relevant tools, extracting additional information from the database to generate a comprehensive and context-aware response. After the response is prepared, it’s updated in Jira for human support agents to review before sending the response back to the customer. This process is illustrated in the following figure. This solution is capable of extracting information not only from the ticket body and title but also from attached images like screenshots and external databases.

The solution uses two foundation models (FMs) from Amazon Bedrock, each selected based on its specific capabilities and the complexity of the tasks involved. For instance, the Pixtral model is used for vision-related tasks like image comparison and ID extraction, whereas the Mistral Large 2 model handles a variety of tasks like ticket categorization, response generation, and tool calling. Additionally, the solution includes fraud detection and prevention capabilities. It can identify fraudulent product returns by comparing the stock product image with the returned product image to verify if they match and assess whether the returned product is genuinely damaged. This integration of advanced AI models with automation tools enhances the efficiency and reliability of the customer support process, facilitating timely resolutions and protecting against fraudulent activity. LangGraph provides a framework for orchestrating the information flow between agents, featuring built-in state management and checkpointing to facilitate seamless process continuity. This functionality allows the inclusion of initial ticket summaries and descriptions in the State object, with additional information appended in subsequent steps of the workflows. By maintaining this evolving context, LangGraph enables LLMs to generate context-aware responses. See the following code:

# class to hold state information

class JiraAppState(MessagesState):
    key: str
    summary: str
    description: str
    attachments: list
    category: str
    response: str
    transaction_id: str
    order_no: str
    usage: list

The framework integrates effortlessly with Amazon Bedrock and LLMs, supporting task-specific diversification by using cost-effective models for simpler tasks while reducing the risks of exceeding model quotas. Furthermore, LangGraph offers conditional routing for dynamic workflow adjustments based on intermediate results, and its modular design facilitates the addition or removal of agents to extend system capabilities.
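
Concretely, the routing function passed to add_conditional_edges is a plain Python function that inspects the shared state and returns the name of the next node to run. The following is a minimal sketch, not the repository's exact logic; the category values are illustrative, and the node names match the workflow shown later in this post:

# Minimal sketch of a LangGraph routing function for conditional edges.
# It reads the detected category from the state and returns the name of
# the next node; category values and the fallback are illustrative.
def decide_ticket_flow_condition(state: dict) -> str:
    category = state.get("category", "")
    if category == "Transactions":
        return "Extract Transaction ID"
    if category in ("Deliveries", "Refunds"):
        return "Extract Order Number"
    # anything else goes straight to response generation
    return "Generate Response"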
Responsible AI
It’s crucial for customer support automation applications to validate inputs and make sure LLM outputs are secure and responsible. Amazon Bedrock Guardrails can significantly enhance customer support automation applications by providing configurable safeguards that monitor and filter both user inputs and AI-generated responses, making sure interactions remain safe, relevant, and aligned with organizational policies. By using features such as content filters, which detect and block harmful categories like hate speech, insults, sexual content, and violence, as well as denied topics to help prevent discussions on sensitive or restricted subjects (for example, legal or medical advice), customer support applications can avoid generating or amplifying inappropriate or harmful information. Additionally, guardrails can help redact personally identifiable information (PII) from conversation transcripts, protecting user privacy and fostering trust. These measures not only reduce the risk of reputational harm and regulatory violations but also create a more positive and secure experience for customers, allowing support teams to focus on resolving issues efficiently while maintaining high standards of safety and responsibility.
The following diagram illustrates this architecture.

Observability
Along with Responsible AI, observability is vital for customer support applications to provide deep, real-time visibility into model performance, usage patterns, and operational health, enabling teams to proactively detect and resolve issues. With comprehensive observability, you can monitor key metrics such as latency and token consumption, and track and analyze input prompts and outputs for quality and compliance. This level of insight helps identify and mitigate risks like hallucinations, prompt injections, toxic language, and PII leakage, helping make sure that customer interactions remain safe, reliable, and aligned with regulatory requirements.
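For example, the token counts and latency returned by the Amazon Bedrock Converse API can be captured and forwarded to your monitoring stack. The following minimal sketch is illustrative and not part of the solution code:

import boto3

# Illustrative only: capture token usage and latency from a Converse API call
# so they can be logged or pushed to a monitoring system.
bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-west-2")

response = bedrock_runtime.converse(
    modelId="mistral.mistral-large-2407-v1:0",
    messages=[{"role": "user", "content": [{"text": "Summarize this support ticket ..."}]}],
)

usage = response["usage"]                      # inputTokens, outputTokens, totalTokens
latency_ms = response["metrics"]["latencyMs"]  # end-to-end model latency
print(f"input={usage['inputTokens']} output={usage['outputTokens']} latency={latency_ms} ms")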
Prerequisites
In this post, we use Atlassian Jira Service Management as an example. You can use the same general approach to integrate with other service management tools that provide APIs for programmatic access. The configuration required in Jira includes:

A Jira service management project with API token to enable programmatic access
The following custom fields:

Name: Category, Type: Select List (multiple choices)
Name: Response, Type: Text Field (multi-line)

A bot user to assign tickets

The following code shows a sample Jira configuration:

JIRA_API_TOKEN = "<JIRA_API_TOKEN>"
JIRA_USERNAME = "<JIRA_USERNAME>"
JIRA_INSTANCE_URL = "https://<YOUR_JIRA_INSTANCE_NAME>.atlassian.net/"
JIRA_PROJECT_NAME = "<JIRA_PROJECT_NAME>"
JIRA_PROJECT_KEY = "<JIRA_PROJECT_KEY>"
JIRA_BOT_USER_ID = "<JIRA_BOT_USER_ID>"

In addition to Jira, the following services and Python packages are required:

A valid AWS account.
An AWS Identity and Access Management (IAM) role in the account that has sufficient permissions to create the necessary resources.
Access to the following models hosted on Amazon Bedrock:

Mistral Large 2 (model ID: mistral.mistral-large-2407-v1:0).
Pixtral Large (model ID: us.mistral.pixtral-large-2502-v1:0). The Pixtral Large model is available in Amazon Bedrock under cross-Region inference profiles.

A LangGraph application up and running locally. For instructions, see Quickstart: Launch Local LangGraph Server.

For this post, we use the us-west-2 AWS Region. For details on available Regions, see Amazon Bedrock endpoints and quotas.
The source code of this solution is available in the GitHub repository. This is an example code; you should conduct your own due diligence and adhere to the principle of least privilege.
Implementation with LangGraph
At the core of customer support automation is a suite of specialized tools and functions designed to collect, analyze, and integrate data from service management systems and a SQLite database. These tools serve as the foundation of our system, empowering it to deliver context-aware responses. In this section, we delve into the essential components that power our system.
BedrockClient class
The BedrockClient class is implemented in the cs_bedrock.py file. It provides a wrapper for interacting with Amazon Bedrock services, specifically for managing language models and content safety guardrails in customer support applications. It simplifies the process of initializing language models with appropriate configurations and managing content safety guardrails. This class is used by LangChain and LangGraph to invoke LLMs on Amazon Bedrock.
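For illustration, the following minimal sketch shows the kind of initialization this class wraps, using the ChatBedrock integration from the langchain-aws package; the exact parameters used in cs_bedrock.py may differ:

import boto3
from langchain_aws import ChatBedrock

# Minimal sketch of initializing Amazon Bedrock models for LangChain/LangGraph;
# model IDs come from this post, other parameters are illustrative.
bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-west-2")

llm = ChatBedrock(
    model_id="mistral.mistral-large-2407-v1:0",
    client=bedrock_runtime,
    model_kwargs={"temperature": 0.1},
)
vision_llm = ChatBedrock(
    model_id="us.mistral.pixtral-large-2502-v1:0",
    client=bedrock_runtime,
)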
This class also provides methods to create guardrails for responsible AI implementation. The following Amazon Bedrock Guardrails policy filters sexual content, violence, hate, insults, misconduct, and prompt attacks, and helps prevent models from generating stock and investment advice, profanity, and hateful, violent, or sexual content. It also helps mitigate prompt attacks that attempt to expose model vulnerabilities.

# guardrails policy

contentPolicyConfig={
    'filtersConfig': [
        {
            'type': 'SEXUAL',
            'inputStrength': 'MEDIUM',
            'outputStrength': 'MEDIUM'
        },
        {
            'type': 'VIOLENCE',
            'inputStrength': 'MEDIUM',
            'outputStrength': 'MEDIUM'
        },
        {
            'type': 'HATE',
            'inputStrength': 'MEDIUM',
            'outputStrength': 'MEDIUM'
        },
        {
            'type': 'INSULTS',
            'inputStrength': 'MEDIUM',
            'outputStrength': 'MEDIUM'
        },
        {
            'type': 'MISCONDUCT',
            'inputStrength': 'MEDIUM',
            'outputStrength': 'MEDIUM'
        },
        {
            'type': 'PROMPT_ATTACK',
            'inputStrength': 'LOW',
            'outputStrength': 'NONE'
        }
    ]
},
wordPolicyConfig={
    'wordsConfig': [
        {'text': 'stock and investment advice'}
    ],
    'managedWordListsConfig': [
        {'type': 'PROFANITY'}
    ]
},
contextualGroundingPolicyConfig={
    'filtersConfig': [
        {
            'type': 'GROUNDING',
            'threshold': 0.65
        },
        {
            'type': 'RELEVANCE',
            'threshold': 0.75
        }
    ]
}
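
These keyword arguments are passed to the Amazon Bedrock create_guardrail API. The following minimal sketch shows the surrounding call using the boto3 bedrock client, assuming the policy dictionaries above are bound to variables of the same names; the guardrail name and blocked messages are placeholders:

import boto3

# Minimal sketch of the create_guardrail call that cs_bedrock.py wraps.
bedrock = boto3.client("bedrock", region_name="us-west-2")

response = bedrock.create_guardrail(
    name="customer-support-guardrail",
    description="Guardrail for the customer support automation workflow",
    contentPolicyConfig=contentPolicyConfig,                      # filters shown above
    wordPolicyConfig=wordPolicyConfig,                            # denied words shown above
    contextualGroundingPolicyConfig=contextualGroundingPolicyConfig,
    blockedInputMessaging="Sorry, I am unable to assist with this request.",
    blockedOutputsMessaging="Sorry, I am unable to assist with this request.",
)
guardrail_id = response["guardrailId"]
guardrail_version = response["version"]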

Database class
The Database class is defined in the cs_db.py file. This class is designed to facilitate interactions with a SQLite database. It’s responsible for creating a local SQLite database and importing synthetic data related to customers, orders, refunds, and transactions. By doing so, it makes sure that the necessary data is readily available for various operations. Furthermore, the class includes convenient wrapper functions that simplify the process of querying the database.
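The following minimal sketch illustrates the style of query wrapper this class provides; the table and column names are illustrative, not the repository's actual schema:

import sqlite3

# Minimal sketch of a query wrapper over the local SQLite database; the
# orders table and its columns are assumptions for illustration.
def find_order(db_path: str, order_no: str) -> dict | None:
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row
    try:
        row = conn.execute(
            "SELECT * FROM orders WHERE order_no = ?", (order_no,)
        ).fetchone()
        return dict(row) if row else None
    finally:
        conn.close()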
JiraSM class
The JiraSM class is implemented in the cs_jira_sm.py file. It serves as an interface for interacting with Jira Service Management. It establishes a connection to Jira by using the API token, user name, and instance URL, all of which are configured in the .env file. This setup provides secure and flexible access to the Jira instance. The class is designed to handle various ticket operations, including reading tickets and assigning them to a preconfigured bot user. Additionally, it supports downloading attachments from tickets and updating custom fields as needed.
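The following minimal sketch shows the kind of Jira Cloud REST API call this class wraps; the endpoint follows the public Jira Cloud REST API v3, and the returned fields are illustrative:

import requests
from requests.auth import HTTPBasicAuth

# Minimal sketch of reading a ticket through the Jira Cloud REST API v3;
# credentials would normally come from the .env file described in this post.
def get_ticket(instance_url: str, username: str, api_token: str, key: str) -> dict:
    response = requests.get(
        f"{instance_url.rstrip('/')}/rest/api/3/issue/{key}",
        auth=HTTPBasicAuth(username, api_token),
        headers={"Accept": "application/json"},
        timeout=30,
    )
    response.raise_for_status()
    issue = response.json()
    return {
        "key": issue["key"],
        "summary": issue["fields"]["summary"],
        "description": issue["fields"].get("description"),
    }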
CustomerSupport class
The CustomerSupport class is implemented in the cs_cust_support_flow.py file. This class encapsulates the customer support processing logic by using LangGraph and Amazon Bedrock. Using LangGraph nodes and tools, this class orchestrates the customer support workflow. The workflow initially determines the category of the ticket by analyzing its content and classifying it as related to transactions, deliveries, refunds, or other issues. It updates the support ticket with the category detected. Following this, the workflow extracts pertinent information such as transaction IDs or order numbers, which might involve analyzing both text and images, and queries the database for relevant details. The next step is response generation, which is context-aware and adheres to content safety guidelines while maintaining a professional tone. Finally, the workflow integrates with Jira, assigning categories, updating responses, and managing attachments as needed.
The LangGraph orchestration is implemented in the build_graph function, as illustrated in the following code. This function also generates a visual representation of the workflow using a Mermaid graph for better clarity and understanding. This setup supports an efficient and structured approach to handling customer support tasks.

def build_graph(self):
    """
    This function prepares LangGraph nodes, edges, conditional edges, compiles the graph and displays it
    """

    # create StateGraph object
    graph_builder = StateGraph(JiraAppState)

    # add nodes to the graph
    graph_builder.add_node("Determine Ticket Category", self.determine_ticket_category_tool)
    graph_builder.add_node("Assign Ticket Category in JIRA", self.assign_ticket_category_in_jira_tool)
    graph_builder.add_node("Extract Transaction ID", self.extract_transaction_id_tool)
    graph_builder.add_node("Extract Order Number", self.extract_order_number_tool)
    graph_builder.add_node("Find Transaction Details", self.find_transaction_details_tool)

    graph_builder.add_node("Find Order Details", self.find_order_details_tool)
    graph_builder.add_node("Generate Response", self.generate_response_tool)
    graph_builder.add_node("Update Response in JIRA", self.update_response_in_jira_tool)

    graph_builder.add_node("tools", ToolNode([StructuredTool.from_function(self.assess_damaged_delivery), StructuredTool.from_function(self.find_refund_status)]))

    # add edges to connect nodes
    graph_builder.add_edge(START, "Determine Ticket Category")
    graph_builder.add_edge("Determine Ticket Category", "Assign Ticket Category in JIRA")
    graph_builder.add_conditional_edges("Assign Ticket Category in JIRA", self.decide_ticket_flow_condition)
    graph_builder.add_edge("Extract Order Number", "Find Order Details")

    graph_builder.add_edge("Extract Transaction ID", "Find Transaction Details")
    graph_builder.add_conditional_edges("Find Order Details", self.order_query_decision, ["Generate Response", "tools"])
    graph_builder.add_edge("tools", "Generate Response")
    graph_builder.add_edge("Find Transaction Details", "Generate Response")

    graph_builder.add_edge("Generate Response", "Update Response in JIRA")
    graph_builder.add_edge("Update Response in JIRA", END)

    # compile the graph
    checkpoint = MemorySaver()
    app = graph_builder.compile(checkpointer=checkpoint)
    self.graph_app = app
    self.util.log_data(data="Workflow compiled successfully", ticket_id="NA")

    # Visualize the graph
    display(Image(app.get_graph().draw_mermaid_png(draw_method=MermaidDrawMethod.API)))

    return app

LangGraph generates the following Mermaid diagram to visually represent the workflow.

Utility class
The Utility class, implemented in the cs_util.py file, provides essential functions to support the customer support automation. It encompasses utilities for logging, file handling, usage metric tracking, and image processing operations. The class is designed as a central hub for various helper methods, streamlining common tasks across the application. By consolidating these operations, it promotes code reusability and maintainability within the system. Its functionality makes sure that the automation framework remains efficient and organized.
A key feature of this class is its comprehensive logging capabilities. It provides methods to log informational messages, errors, and significant events directly into the cs_logs.log file. Additionally, it tracks Amazon Bedrock LLM token usage and latency metrics, facilitating detailed performance monitoring. The class also logs the execution flow of application-generated prompts and LLM generated responses, aiding in troubleshooting and debugging. These log files can be seamlessly integrated with standard log pusher agents, allowing for automated transfer to preferred log monitoring systems. This integration makes sure that system activity is thoroughly monitored and quickly accessible for analysis.
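The following minimal sketch illustrates this style of logging helper; the cs_logs.log file name comes from the solution, while the log format and the structure of usage entries are assumptions:

import logging

# Minimal sketch of logging helpers similar to those in cs_util.py; writes
# informational records and per-ticket usage entries to cs_logs.log.
logging.basicConfig(
    filename="cs_logs.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)

def log_usage(usage: list, ticket_id: str) -> None:
    # each usage entry is assumed to carry model, token, and latency data
    for entry in usage:
        logging.info("ticket=%s usage=%s", ticket_id, entry)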
Run the agentic workflow
Now that the customer support workflow is defined, it can be executed for various ticket types. The following functions use the provided ticket key to fetch the corresponding Jira ticket and download available attachments. Additionally, they initialize the State object with details such as the ticket key, summary, description, attachment file path, and a system prompt for the LLM. This State object is used throughout the workflow execution.

def generate_response_for_ticket(ticket_id: str):
    
    llm, vision_llm, llm_with_guardrails = bedrock_client.init_llms(ticket_id=ticket_id)
    cust_support = CustomerSupport(llm=llm, vision_llm=vision_llm, llm_with_guardrails=llm_with_guardrails)
    app   = cust_support.build_graph()
    
    state = cust_support.get_jira_ticket(key=ticket_id)
    state = app.invoke(state, thread)
    
    util.log_usage(state['usage'], ticket_id=ticket_id)
    util.log_execution_flow(state["messages"], ticket_id=ticket_id)
    

The following code snippet invokes the workflow for the Jira ticket with key AS-6:

# initialize classes and create bedrock guardrails
bedrock_client = BedrockClient()
util = Utility()
guardrail_id = bedrock_client.create_guardrail()

# process a JIRA ticket
generate_response_for_ticket(ticket_id='AS-6')

The following screenshot shows the Jira ticket before processing. Notice that the Response and Category fields are empty, and the ticket is unassigned.

The following screenshot shows the Jira ticket after processing. The Category field is updated to Refunds and the Response field is updated with the AI-generated content.

This logs LLM usage information as follows:

Model                                Input Tokens   Output Tokens   Latency (ms)
mistral.mistral-large-2407-v1:0      385            2               653
mistral.mistral-large-2407-v1:0      452            27              884
mistral.mistral-large-2407-v1:0      1039           36              1197
us.mistral.pixtral-large-2502-v1:0   4632           425             5952
mistral.mistral-large-2407-v1:0      1770           144             4556

Clean up
Delete any IAM roles and policies created specifically for this post. Delete the local copy of this post’s code.
If you no longer need access to an Amazon Bedrock FM, you can remove access from it. For instructions, see Add or remove access to Amazon Bedrock foundation models.
Delete the temporary files and guardrails used in this post with the following code:

shutil.rmtree(util.get_temp_path())
bedrock_client.delete_guardrail()

Conclusion
In this post, we developed an AI-driven customer support solution using Amazon Bedrock, LangGraph, and Mistral models. This advanced agent-based workflow efficiently handles diverse customer queries by integrating multiple data sources and extracting relevant information from tickets or screenshots. It also evaluates damage claims to mitigate fraudulent returns. The solution is designed with flexibility, allowing the addition of new conditions and data sources as business needs evolve. With this multi-agent approach, you can build robust, scalable, and intelligent systems that redefine the capabilities of generative AI in customer support.
Want to explore further? Check out the following GitHub repo. There, you can observe the code in action and experiment with the solution yourself. The repository includes step-by-step instructions for setting up and running the multi-agent system, along with code for interacting with data sources and agents, routing data, and visualizing workflows.

About the authors
Deepesh Dhapola is a Senior Solutions Architect at AWS India, specializing in helping financial services and fintech clients optimize and scale their applications on the AWS Cloud. With a strong focus on trending AI technologies, including generative AI, AI agents, and the Model Context Protocol (MCP), Deepesh uses his expertise in machine learning to design innovative, scalable, and secure solutions. Passionate about the transformative potential of AI, he actively explores cutting-edge advancements to drive efficiency and innovation for AWS customers. Outside of work, Deepesh enjoys spending quality time with his family and experimenting with diverse culinary creations.

Build responsible AI applications with Amazon Bedrock Guardrails

As organizations embrace generative AI, they face critical challenges in making sure their applications align with their designed safeguards. Although foundation models (FMs) offer powerful capabilities, they can also introduce unique risks, such as generating harmful content, exposing sensitive information, being vulnerable to prompt injection attacks, and returning model hallucinations.
Amazon Bedrock Guardrails has helped address these challenges for multiple organizations, such as MAPRE, KONE, Fiserv, PagerDuty, Aha, and more. Just as traditional applications require multi-layered security, Amazon Bedrock Guardrails implements essential safeguards across model, prompt, and application levels—blocking up to 88% more undesirable and harmful multimodal content. Amazon Bedrock Guardrails helps filter over 75% of hallucinated responses in Retrieval Augmented Generation (RAG) and summarization use cases, and stands as the first and only safeguard using Automated Reasoning to prevent factual errors from hallucinations.
In this post, we show how to implement safeguards using Amazon Bedrock Guardrails in a healthcare insurance use case.
Solution overview
We consider an innovative AI assistant designed to streamline interactions of policyholders with the healthcare insurance firm. With this AI-powered solution, policyholders can check coverage details, submit claims, find in-network providers, and understand their benefits through natural, conversational interactions. The assistant provides all-day support, handling routine inquiries while allowing human agents to focus on complex cases. To help enable secure and compliant operations of our assistant, we use Amazon Bedrock Guardrails to serve as a critical safety framework. Amazon Bedrock Guardrails can help maintain high standards of blocking undesirable and harmful multimodal content. This not only protects the users, but also builds trust in the AI system, encouraging wider adoption and improving overall customer experience in healthcare insurance interactions.
This post walks you through the capabilities of Amazon Bedrock Guardrails from the AWS Management Console. Refer to the following GitHub repo for information about creating, updating, and testing Amazon Bedrock Guardrails using the SDK.
Amazon Bedrock Guardrails provides configurable safeguards to help safely build generative AI applications at scale. It evaluates user inputs and model responses based on specific policies, working with all large language models (LLMs) on Amazon Bedrock, fine-tuned models, and external FMs using the ApplyGuardrail API. The solution integrates seamlessly with Amazon Bedrock Agents and Amazon Bedrock Knowledge Bases, so organizations can apply multiple guardrails across applications with tailored controls.
Guardrails can be implemented in two ways: direct integration with Invoke APIs (InvokeModel and InvokeModelWithResponseStream) and Converse APIs (Converse and ConverseStream) for models hosted on Amazon Bedrock, applying safeguards during inference, or through the flexible ApplyGuardrail API, which enables independent content evaluation without model invocation. This second method is ideal for assessing inputs or outputs at various application stages and works with custom or third-party models that are not hosted on Amazon Bedrock. Both approaches empower developers to implement use case-specific safeguards aligned with responsible AI policies, helping to block undesirable and harmful multimodal content from generative AI applications.
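For example, the following minimal sketch evaluates a user input against a preconfigured guardrail with the ApplyGuardrail API, without invoking a model; the guardrail ID and version are placeholders:

import boto3

# Minimal sketch of independent content evaluation with the ApplyGuardrail API.
bedrock_runtime = boto3.client("bedrock-runtime")

result = bedrock_runtime.apply_guardrail(
    guardrailIdentifier="<GUARDRAIL_ID>",
    guardrailVersion="1",
    source="INPUT",  # use "OUTPUT" to evaluate a model response instead
    content=[{"text": {"text": "Can you send my lab test report to abc@gmail.com?"}}],
)
print(result["action"])  # "GUARDRAIL_INTERVENED" or "NONE"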
The following diagram depicts the six safeguarding policies offered by Amazon Bedrock Guardrails.

Prerequisites
Before we begin, make sure you have access to the console with appropriate permissions for Amazon Bedrock. If you haven’t set up Amazon Bedrock yet, refer to Getting started in the Amazon Bedrock console.
Create a guardrail
To create a guardrail for our healthcare insurance assistant, complete the following steps:

On the Amazon Bedrock console, choose Guardrails in the navigation pane.
Choose Create guardrail.
In the Provide guardrail details section, enter a name (for this post, we use MyHealthCareGuardrail), an optional description, and a message to display if your guardrail blocks the user prompt, then choose Next.

Configuring Multimodal Content filters
Security is paramount when building AI applications. With image support in Amazon Bedrock Guardrails, content filters can now detect and filter both text and image content across six protection categories: Hate, Insults, Sexual, Violence, Misconduct, and Prompt Attacks.

In the Configure content filters section, for maximum protection, especially in sensitive sectors like healthcare in our example use case, set your confidence thresholds to High across all categories for both text and image content.
Enable prompt attack protection to prevent system instruction tampering, and use input tagging to maintain accurate classification of system prompts, then choose Next.

Denied topics
In healthcare applications, we need clear boundaries around medical advice. Let’s configure Amazon Bedrock Guardrails to prevent users from attempting disease diagnosis, which should be handled by qualified healthcare professionals.

In the Add denied topics section, create a new topic called Disease Diagnosis, add example phrases that represent diagnostic queries, and choose Confirm.

This setting helps make sure our application stays within appropriate boundaries for insurance-related queries while avoiding medical diagnosis discussions. For example, when users ask questions like “Do I have diabetes?” or “What’s causing my headache?”, the guardrail will detect these as diagnosis-related queries and block them with an appropriate response.

After you set up your denied topics, choose Next to proceed with word filters.

Word filters
Configuring word filters in Amazon Bedrock Guardrails helps keep our healthcare insurance application focused and professional. These filters help maintain conversation boundaries and make sure responses stay relevant to health insurance queries.
Let’s set up word filters for two key purposes:

Block inappropriate language to maintain professional discourse
Filter irrelevant topics that fall outside the healthcare insurance scope

To set them up, do the following:

In the Add word filters section, add custom words or phrases to filter (in our example, we include off-topic terms like “stocks,” “investment strategies,” and “financial performance”), then choose Next.

Sensitive information filters
With sensitive information filters, you can configure filters to block email addresses, phone numbers, and other personally identifiable information (PII), as well as set up custom regex patterns for industry-specific data requirements. For example, healthcare providers use these filters to help maintain HIPAA compliance by automatically blocking the PII types they specify. This way, they can use AI capabilities while helping to maintain strict patient privacy standards.

For our example, configure filters for blocking the email address and phone number of healthcare insurance users, then choose Next.
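
For reference, a minimal sketch of the equivalent SDK configuration, passed to create_guardrail as sensitiveInformationPolicyConfig, might look like the following; the console steps above accomplish the same result:

# Minimal sketch: block email addresses and phone numbers via the guardrail SDK.
sensitiveInformationPolicyConfig = {
    "piiEntitiesConfig": [
        {"type": "EMAIL", "action": "BLOCK"},
        {"type": "PHONE", "action": "BLOCK"},
    ]
}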

Contextual grounding checks
We use Amazon Bedrock Guardrails contextual grounding and relevance checks in our application to help validate model responses, detect hallucinations, and support alignment with reference sources.

Set up the thresholds for contextual grounding and relevance checks (we set them to 0.7), then choose Next.

Automated Reasoning checks
Automated Reasoning checks help detect hallucinations and provide a verifiable proof that our application’s model (LLM) response is accurate.
The first step to incorporate Automated Reasoning checks for our application is to create an Automated Reasoning policy that is composed of a set of variables, defined with a name, type, and description, and the logical rules that operate on the variables. These rules are expressed in formal logic, but they’re translated to natural language to make it straightforward for a user without formal logic expertise to refine a model. Automated Reasoning checks use the variable descriptions to extract their values when validating a Q&A.

To create an Automated Reasoning policy, choose the new Automated Reasoning menu option under Safeguards.
Create a new policy and give it a name, then upload an existing document that defines the right solution space, such as an HR guideline or an operational manual. For this demo, we use an example healthcare insurance policy document that includes the insurance coverage policies applicable to insurance holders.

Automated Reasoning checks is in preview in Amazon Bedrock Guardrails in the US West (Oregon) AWS Region. To request access to the preview, contact your AWS account team.

Define the policy’s intent and processing parameters and choose Create policy.

The system now initiates an automated process to create your Automated Reasoning policy. This process involves analyzing your document, identifying key concepts, breaking down the document into individual units, translating these natural language units into formal logic, validating the translations, and finally combining them into a comprehensive logical model. You can review the generated structure, including the rules and variables, and edit these for accuracy through the UI.

To attach the Automated Reasoning policy to your guardrail, turn on Enable Automated Reasoning policy, choose the policy and policy version you want to use, then choose Next.

Review the configurations set in the previous steps and choose Create guardrail.

Test your guardrail
We can now test our healthcare insurance call center application with different inputs and see how the configured guardrail intervenes for harmful and undesirable multimodal content.

On the Amazon Bedrock console, on the guardrail details page, choose Select model in the Test panel.

Choose your model, then choose Apply.

For our example, we use the Amazon Nova Lite FM, which is a low-cost multimodal model that is lightning fast for processing image, video, and text input. For your use case, you can use another model of your choice.

Enter a query prompt with a denied topic.

For example, if we ask “I have cold and sore throat, do you think I have Covid, and if so please provide me information on what is the coverage,” the system recognizes this as a request for a disease diagnosis. Because Disease Diagnosis is configured as a denied topic in the guardrail settings, the system blocks the response.

Choose View trace to see the details of the intervention.

You can test with other queries. For example, if we ask “What is the financial performance of your insurance company in 2024?”, the word filter guardrail that we configured earlier intervenes. You can choose View trace to see that the word filter was invoked.

Next, we use a prompt to validate if PII data in input can be blocked using the guardrail. We ask “Can you send my lab test report to abc@gmail.com?” Because the guardrail was set up to block email addresses, the trace shows an intervention due to PII detection in the input prompt.

If we enter the prompt “I am frustrated on someone, and feel like hurting the person,” the text content filter is invoked for Violence because we set a high threshold for detecting violent content when creating the guardrail.

If we provide an image file in the prompt that contains content of the category Violence, the image content filter gets invoked for Violence.

Finally, we test the Automated Reasoning policy by using the Test playground on the Amazon Bedrock console. You can input a sample user question and an incorrect answer to check if your Automated Reasoning policy works correctly. In our example, according to the insurance policy provided, new insurance claims take a minimum 7 days to get processed. Here, we input the question “Can you process my new insurance claim in less than 3 days?” and the incorrect answer “Yes, I can process it in 3 days.”

The Automated Reasoning checks marked the answer as Invalid and provided details about why, including which specific rule was broken, the relevant variables it found, and recommendations for fixing the issue.

Independent API
In addition to using Amazon Bedrock Guardrails as shown in the preceding section for Amazon Bedrock hosted models, you can now use Amazon Bedrock Guardrails to apply safeguards on input prompts and model responses for FMs available in other services (such as Amazon SageMaker), on infrastructure such as Amazon Elastic Compute Cloud (Amazon EC2), in on-premises deployments, and for other third-party FMs beyond Amazon Bedrock. The ApplyGuardrail API assesses text using your preconfigured guardrails in Amazon Bedrock, without invoking the FMs.
When testing Amazon Bedrock Guardrails, select Use ApplyGuardrail API to validate user inputs using MyHealthCareGuardrail. This test doesn’t require you to choose an Amazon Bedrock hosted model; you can test the configured guardrails as an independent API.

Conclusion
In this post, we demonstrated how Amazon Bedrock Guardrails helps block harmful and undesirable multimodal content. Using a healthcare insurance call center scenario, we walked through the process of configuring and testing various guardrails. We also highlighted the flexibility of our ApplyGuardrail API, which implements guardrail checks on any input prompt, regardless of the FM in use. You can seamlessly integrate safeguards across models deployed on Amazon Bedrock or external platforms.
Ready to take your AI applications to the next level of safety and compliance? Check out Amazon Bedrock Guardrails announces IAM Policy-based enforcement to deliver safe AI interactions, which enables security and compliance teams to establish mandatory guardrails for model inference calls, helping to consistently enforce your guardrails across AI interactions. To dive deeper into Amazon Bedrock Guardrails, refer to Use guardrails for your use case, which includes advanced use cases with Amazon Bedrock Knowledge Bases and Amazon Bedrock Agents.
This guidance is for informational purposes only. You should still perform your own independent assessment and take measures to ensure that you comply with your own specific quality control practices and standards, and the local rules, laws, regulations, licenses and terms of use that apply to you, your content, and the third-party model referenced in this guidance. AWS has no control or authority over the third-party model referenced in this guidance and does not make any representations or warranties that the third-party model is secure, virus-free, operational, or compatible with your production environment and standards. AWS does not make any representations, warranties, or guarantees that any information in this guidance will result in a particular outcome or result.
References

Use the ApplyGuardrail API with long-context inputs and streaming outputs in Amazon Bedrock
Guardrails for Amazon Bedrock helps implement safeguards customized to your use cases and responsible AI policies
Detect and filter harmful content by using Amazon Bedrock Guardrails

About the authors
Divya Muralidharan is a Solutions Architect at AWS, supporting a strategic customer. Divya is an aspiring member of the AI/ML technical field community at AWS. She is passionate about using technology to accelerate growth, provide value to customers, and achieve business outcomes. Outside of work, she spends time cooking, singing, and growing plants.
Rachna Chadha is a Principal Technologist at AWS, where she helps customers leverage generative AI solutions to drive business value. With decades of experience in helping organizations adopt and implement emerging technologies, particularly within the healthcare domain, Rachna is passionate about the ethical and responsible use of artificial intelligence. She believes AI has the power to create positive societal change and foster both economic and social progress. Outside of work, Rachna enjoys spending time with her family, hiking, and listening to music.

Effective cost optimization strategies for Amazon Bedrock

Customers are increasingly using generative AI to enhance efficiency, personalize experiences, and drive innovation across various industries. For instance, generative AI can be used to perform text summarization, facilitate personalized marketing strategies, create business-critical chat-based assistants, and so on. However, as generative AI adoption grows, associated costs can escalate in several areas, including inference, deployment, and model customization. Effective cost optimization can help to make sure that generative AI initiatives remain financially sustainable and deliver a positive return on investment. Proactive cost management helps businesses realize generative AI’s transformative potential while maintaining their financial health.
Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, DeepSeek, Luma, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI. Using Amazon Bedrock, you can experiment with and evaluate top FMs for your use case, privately customize them with your data using techniques such as fine-tuning and Retrieval Augmented Generation (RAG), and build agents that execute tasks using your enterprise systems and data sources.
With the increasing adoption of Amazon Bedrock, optimizing costs is a must to help keep the expenses associated with deploying and running generative AI applications manageable and aligned with your organization’s budget. In this post, you’ll learn about strategic cost optimization techniques while using Amazon Bedrock.
Understanding Amazon Bedrock pricing
Amazon Bedrock offers a comprehensive pricing model based on actual usage of FMs and related services. The core pricing components include model inference (available in On-Demand, Batch, and Provisioned Throughput options), model customization (charging for training, storage, and inference), and Custom Model Import (free import but charges for inference and storage). Through Amazon Bedrock Marketplace, you can access over 100 models with varying pricing structures for proprietary and public models. You can check out Amazon Bedrock pricing for a pricing overview and more details on pricing models.
Cost monitoring in Amazon Bedrock
You can monitor the cost of your Amazon Bedrock usage using the following approaches:

Application inference profiles – Amazon Bedrock provides application inference profiles that you can use to apply custom cost allocation tags to track, manage, and control on-demand FM costs and usage across different workloads and tenants.
Cost allocation tagging – You can tag all Amazon Bedrock models, aligning usage to specific organizational taxonomies such as cost centers, business units, teams, and applications for precise expense tracking. To carry out tagging operations, you need the Amazon Resource Name (ARN) of the resource on which you want to carry out a tagging operation.
Integration with AWS cost tools – Amazon Bedrock cost monitoring integrates with AWS Budgets, AWS Cost Explorer, AWS Cost and Usage Reports, and AWS Cost Anomaly Detection, enabling organizations to set tag-based budgets, receive alerts for usage thresholds, and detect unusual spending patterns.
Amazon CloudWatch metrics monitoring – You can use Amazon CloudWatch to monitor runtime metrics for Amazon Bedrock applications by inference profile. CloudWatch collects raw data and processes it into readable, near real-time metrics that you can graph on the AWS Management Console, and you can set alarms that watch for thresholds and send notifications or take action when values exceed them, enabling real-time management of resource usage and costs (see the sketch after this list).
Resource-specific visibility – CloudWatch provides metrics such as Invocations, InvocationLatency, InputTokenCount, OutputTokenCount, and various error metrics that can be filtered by model IDs and other dimensions for granular monitoring of Amazon Bedrock usage and performance.
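
The following minimal sketch pulls one of these metrics from CloudWatch; the model ID, period, and statistic are illustrative choices:

import datetime
import boto3

# Minimal sketch: retrieve daily input token counts for one model from the
# AWS/Bedrock CloudWatch namespace. The model ID is an example.
cloudwatch = boto3.client("cloudwatch")

now = datetime.datetime.utcnow()
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/Bedrock",
    MetricName="InputTokenCount",
    Dimensions=[{"Name": "ModelId", "Value": "amazon.nova-lite-v1:0"}],
    StartTime=now - datetime.timedelta(hours=24),
    EndTime=now,
    Period=3600,
    Statistics=["Sum"],
)
for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Sum"])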

Cost optimization strategies for Amazon Bedrock
When building generative AI applications with Amazon Bedrock, implementing thoughtful cost optimization strategies can significantly reduce your expenses while maintaining application performance. In this section, you’ll find key approaches to consider in the following order:

Select the appropriate model
Determine if it needs customization

If yes, explore options in the correct order
If no, proceed to the next step

Perform prompt engineering and management
Design efficient agents
Select the correct consumption option

This flow is shown in the following flow diagram.

Choose an appropriate model for your use case
Amazon Bedrock provides access to a diverse portfolio of FMs through a single API. The service continually expands its offerings with new models and providers, each with different pricing structures and capabilities.
For example, consider the on-demand pricing variation among Amazon Nova models in the US East (Ohio) AWS Region. This pricing is current as of May 21, 2025. Refer to the Amazon Bedrock pricing page for latest data.
As shown in the following table, the price varies significantly between the Amazon Nova Micro, Amazon Nova Lite, and Amazon Nova Pro models. For example, Amazon Nova Micro is approximately 1.71 times cheaper than Amazon Nova Lite based on price per 1,000 input tokens as of this writing. If you don’t need multimodal capability and the accuracy of Amazon Nova Micro meets your use case, then you need not opt for Amazon Nova Lite. This demonstrates why selecting the right model for your use case is critical. The largest or most advanced model isn’t always necessary for every application.

Amazon Nova models   Price per 1,000 input tokens   Price per 1,000 output tokens
Amazon Nova Micro    $0.000035                      $0.00014
Amazon Nova Lite     $0.00006                       $0.00024
Amazon Nova Pro      $0.0008                        $0.0032

One of the key advantages of Amazon Bedrock is its unified API, which abstracts the complexity of working with different models. You can switch between models by changing the model ID in your request with minimal code modifications. With this flexibility, you can select the most cost and performance optimized model that meets your requirements and upgrade only when necessary.
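The following minimal sketch illustrates this: switching models is a one-line change to the model ID in a Converse API call. The model IDs shown are examples, and your Region might require a cross-Region inference profile ID instead:

import boto3

# Minimal sketch: the unified Converse API lets you swap models by changing
# only the model ID; prompt and response handling stay the same.
bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-2")

MODEL_ID = "amazon.nova-micro-v1:0"  # swap to "amazon.nova-lite-v1:0" if you need multimodal input

response = bedrock_runtime.converse(
    modelId=MODEL_ID,
    messages=[{"role": "user", "content": [{"text": "Summarize our return policy in two sentences."}]}],
    inferenceConfig={"maxTokens": 200, "temperature": 0.2},
)
print(response["output"]["message"]["content"][0]["text"])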
Best practice: Use Amazon Bedrock native features to evaluate the performance of the foundation model for your use case. Begin with an automatic model evaluation job to narrow down the scope. Follow it up by using LLM as a judge or human-based evaluation as required for your use case.
Perform model customization in the right order
When customizing FMs in Amazon Bedrock for contextualizing responses, choosing the strategy in correct order can significantly reduce your expenses while maximizing performance. You have four primary strategies available, each with different cost implications:

Prompt Engineering – Start by crafting high-quality prompts that effectively condition the model to generate desired responses. This approach requires minimal resources and no additional infrastructure costs beyond your standard inference calls.
RAG – Amazon Bedrock Knowledge Bases is a fully managed feature with built-in session context management and source attribution that helps you implement the entire RAG workflow from ingestion to retrieval and prompt augmentation without having to build custom integrations to data sources and manage data flows.
Fine-tuning – This approach involves providing labeled training data to improve model performance on specific tasks. Although it’s effective, fine-tuning requires additional compute resources and creates custom model versions with associated hosting costs.
Continued pre-training – The most resource-intensive option involves providing unlabeled data to further train an FM on domain-specific content. This approach incurs the highest costs and longest implementation time.

The following graph shows the escalation of the complexity, quality, cost, and time of these four approaches.

Best practice: Implement these strategies progressively. Begin with prompt engineering as your foundation—it’s cost-effective and can often deliver impressive results with minimal investment. Refer to the Optimize for clear and concise prompts section to learn about different strategies that you can follow to write good prompts. Next, integrate RAG when you need to incorporate proprietary information into responses. These two approaches together should address most use cases while maintaining efficient cost structures. Explore fine-tuning and continued pre-training only when you have specific requirements that can’t be addressed through the first two methods and your use case justifies the additional expense.
By following this implementation hierarchy, shown in the following figure, you can optimize both your Amazon Bedrock performance and your budget allocation. Here is the high-level mental model for choosing different options:

Use Amazon Bedrock native model distillation feature
Amazon Bedrock Model Distillation is a powerful feature that you can use to access smaller, more cost-effective models without sacrificing performance and accuracy for your specific use cases.

Enhance accuracy of smaller (student) cost-effective models – With Amazon Bedrock Model Distillation, you can select a teacher model whose accuracy you want to achieve for your use case and then select a student model that you want to fine-tune. Model distillation automates the process of generating responses from the teacher and using those responses to fine-tune the student model.
Maximize distilled model performance with proprietary data synthesis – Fine-tuning a smaller, cost-efficient model to achieve accuracy similar to a larger model for your specific use case is an iterative process. To remove some of the burden of iteration needed to achieve better results, Amazon Bedrock Model Distillation might choose to apply different data synthesis methods that are best suited for your use case. For example, Amazon Bedrock might expand the training dataset by generating similar prompts, or it might generate high-quality synthetic responses using customer provided prompt-response pairs as golden examples.
Reduce cost by bringing your production data – With traditional fine-tuning, you’re required to create prompts and responses. With Amazon Bedrock Model Distillation, you only need to provide prompts, which are used to generate synthetic responses and fine-tune student models.

Best practice: Consider model distillation when you have a specific, well-defined use case where a larger model performs well but costs more than desired. This approach is particularly valuable for high-volume inference scenarios where the ongoing cost savings will quickly offset the initial investment in distillation.
Use Amazon Bedrock intelligent prompt routing
With Amazon Bedrock Intelligent Prompt Routing, you can now use a combination of FMs from the same model family to help optimize for quality and cost when invoking a model. For example, you can route between models in Anthropic’s Claude family, such as Claude 3.5 Sonnet and Claude 3 Haiku, depending on the complexity of the prompt. This is particularly useful for applications like customer service assistants, where uncomplicated queries can be handled by smaller, faster, and more cost-effective models, and complex queries are routed to more capable models. Intelligent prompt routing can reduce costs by up to 30% without compromising on accuracy.
Best practice: Implement intelligent prompt routing for applications that handle a wide range of query complexities.
Optimize for clear and concise prompts
Optimizing prompts for clarity and conciseness in Amazon Bedrock focuses on structured, efficient communication with the model to minimize token usage and maximize response quality. Through techniques such as clear instructions, specific output formats, and precise role definitions, you can achieve better results while reducing costs associated with token consumption.

Structured instructions – Break down complex prompts into clear, numbered steps or bullet points. This helps the model follow a logical sequence and improves the consistency of responses while reducing token usage.
Output specifications – Explicitly define the desired format and constraints for the response. For example, specify word limits, format requirements, or use indicators like Please provide a brief summary in 2-3 sentences to control output length.
Avoid redundancy – Remove unnecessary context and repetitive instructions. Keep prompts focused on essential information and requirements because superfluous content can increase costs and potentially confuse the model.
Use separators – Employ clear delimiters (such as triple quotes, dashes, or XML-style tags) to separate different parts of the prompt to help the model to distinguish between context, instructions, and examples.
Role and context precision – Start with a clear role definition and specific context that’s relevant to the task. For example, You are a technical documentation specialist focused on explaining complex concepts in simple terms provides better guidance than a generic role description.

Best practice: Amazon Bedrock offers a fully managed feature to optimize prompts for a select model. This helps to reduce costs by improving prompt efficiency and effectiveness, leading to better results with fewer tokens and model invocations. The prompt optimization feature automatically refines your prompts to follow best practices for each specific model, eliminating the need for extensive manual prompt engineering that could take months of experimentation. Use this built-in prompt optimization feature in Amazon Bedrock to get started and optimize further to get better results as needed. Experiment with prompts to make them clear and concise to reduce the number of tokens without compromising the quality of the responses.
Optimize cost and performance using Amazon Bedrock prompt caching
You can use prompt caching with supported models on Amazon Bedrock to reduce inference response latency and input token costs. By adding portions of your context to a cache, the model can use the cache to skip recomputation of inputs, enabling Amazon Bedrock to share in the compute savings and lower your response latencies.

Significant cost reduction – Prompt caching can reduce costs by up to 90% compared to standard model inference costs, because cached tokens are charged at a reduced rate compared to non-cached input tokens.
Ideal use cases – Prompt caching is particularly valuable for applications with long and repeated contexts, such as document Q&A systems where users ask multiple questions about the same document or coding assistants that maintain context about code files.
Improved latency – Implementing prompt caching can decrease response latency by up to 85% for supported models by eliminating the need to reprocess previously seen content, making applications more responsive.
Cache retention period – Cached content remains available for up to 5 minutes after each access, with the timer resetting upon each successful cache hit, making it ideal for multiturn conversations about the same context.
Implementation approach – To implement prompt caching, developers identify frequently reused prompt portions, tag these sections using the cachePoint block in API calls, and monitor cache usage metrics (cacheReadInputTokenCount and cacheWriteInputTokenCount) in response metadata to optimize performance.

Best practice: Prompt caching is valuable in scenarios where applications repeatedly process the same context, such as document Q&A systems where multiple users query the same content. The technique delivers maximum benefit when dealing with stable contexts that don’t change frequently, multiturn conversations about identical information, applications that require fast response times, high-volume services with repetitive requests, or systems where cost optimization is critical without sacrificing model performance.
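The following minimal sketch shows how a reusable context might be tagged with a cachePoint block in a Converse API call, assuming a model that supports prompt caching; the model ID and document text are placeholders:

import boto3

# Minimal sketch of prompt caching with the Converse API: everything before
# the cachePoint block is eligible for caching on supported models.
bedrock_runtime = boto3.client("bedrock-runtime")

long_document = "<several thousand tokens of policy text>"

response = bedrock_runtime.converse(
    modelId="amazon.nova-lite-v1:0",  # assumes a model with prompt caching support
    messages=[{
        "role": "user",
        "content": [
            {"text": long_document},
            {"cachePoint": {"type": "default"}},  # cache boundary
            {"text": "What is the claim processing time?"},
        ],
    }],
)
# Cache read/write token counts appear in the usage metadata when caching is active.
print(response["usage"])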
Cache prompts within the client application
Client-side prompt caching helps reduce costs by storing frequently used prompts and responses locally within your application. This approach minimizes API calls to Amazon Bedrock models, resulting in significant cost savings and improved application performance.

Local storage implementation – Implement a caching mechanism within your application to store common prompts and their corresponding responses, using techniques such as in-memory caching (Redis, Memcached) or application-level caching systems.
Cache hit optimization – Before making an API call to Amazon Bedrock, check if the prompt or similar variations exist in the local cache. This reduces the number of billable API calls to the FMs, directly impacting costs. You can check Caching Best Practices to learn more.
Expiration strategy – Implement a time-based cache expiration strategy such as Time To Live (TTL) to help make sure that cached responses remain relevant while maintaining cost benefits. This aligns with the 5-minute cache window used by Amazon Bedrock for optimal cost savings.
Hybrid caching approach – Combine client-side caching with the built-in prompt caching of Amazon Bedrock for maximum cost optimization. Use the local cache for exact matches and the Amazon Bedrock cache for partial context reuse.
Cache monitoring – Implement cache hit:miss ratio monitoring to continually optimize your caching strategy and identify opportunities for further cost reduction through cached prompt reuse.

Best practice: In performance-critical systems and high-traffic websites, client-side caching enhances response times and user experience while minimizing dependency on ongoing Amazon Bedrock API interactions.
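As a hedged illustration of this pattern, the following sketch keeps an exact-match, in-process TTL cache in front of Amazon Bedrock calls. A production system would more likely use Redis or Memcached and might also normalize or fuzzily match prompts; the model ID is a placeholder.

import hashlib
import time
import boto3

bedrock_runtime = boto3.client("bedrock-runtime")
_local_cache = {}  # prompt hash -> (timestamp, response text)
TTL_SECONDS = 300  # expire entries so cached answers stay relevant

def cached_generate(prompt: str, model_id: str = "amazon.nova-pro-v1:0") -> str:
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    entry = _local_cache.get(key)
    if entry and time.time() - entry[0] < TTL_SECONDS:
        return entry[1]  # cache hit: no billable call to Amazon Bedrock

    response = bedrock_runtime.converse(
        modelId=model_id,  # placeholder model ID
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    text = response["output"]["message"]["content"][0]["text"]
    _local_cache[key] = (time.time(), text)
    return text

Tracking how often cached_generate returns without calling the API gives you the cache hit:miss ratio mentioned above.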
Build small and focused agents that interact with each other rather than a single large monolithic agent
Creating small, specialized agents that interact with each other in Amazon Bedrock can lead to significant cost savings while improving solution quality. This approach uses the multi-agent collaboration capability of Amazon Bedrock to build more efficient and cost-effective generative AI applications.
The multi-agent architecture advantage: You can use Amazon Bedrock multi-agent collaboration to orchestrate multiple specialized AI agents that work together to tackle complex business problems. By creating smaller, purpose-built agents instead of a single large one, you can:

Optimize model selection based on specific tasks – Use more economical FMs for simpler tasks and reserve premium models for complex reasoning tasks
Enable parallel processing – Multiple specialized agents can work simultaneously on different aspects of a problem, reducing overall response time
Improve solution quality – Each agent focuses on its specialty, leading to more accurate and relevant responses

Best practice: Select appropriate models for each specialized agent, matching capabilities to task requirements while optimizing for cost. Based on the complexity of the task, you can choose either a low-cost model or a high-cost model to optimize the cost. Use AWS Lambda functions that retrieve only the essential data to reduce unnecessary cost in Lambda execution. Orchestrate your system with a lightweight supervisor agent that efficiently handles coordination without consuming premium resources.
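The following conceptual sketch illustrates only the cost principle behind this best practice: a lightweight supervisor routes simple tasks to an economical model and reserves a premium model for complex reasoning. It is not the Amazon Bedrock multi-agent collaboration API; the task names, model IDs, and routing table are assumptions.

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

# Hypothetical mapping of specialized tasks to appropriately sized models
MODEL_BY_TASK = {
    "categorize_ticket": "amazon.nova-lite-v1:0",  # simple classification: economical model
    "draft_response": "amazon.nova-pro-v1:0",      # nuanced generation: premium model
}

def run_specialized_agent(task: str, prompt: str) -> str:
    model_id = MODEL_BY_TASK.get(task, "amazon.nova-lite-v1:0")
    response = bedrock_runtime.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    return response["output"]["message"]["content"][0]["text"]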
Choose the desired throughput depending on the usage
Amazon Bedrock offers two distinct throughput options, each designed for different usage patterns and requirements:

On-Demand mode – Provides a pay-as-you-go approach with no upfront commitments, making it ideal for early-stage proofs of concept (POCs), development and test environments, and applications with unpredictable, seasonal, or sporadic traffic that varies significantly.

With On-Demand pricing, you’re charged based on actual usage:

Text generation models – Pay per input token processed and output token generated
Embedding models – Pay per input token processed
Image generation models – Pay per image generated

Provisioned Throughput mode – By using Provisioned Throughput, you can purchase dedicated model units for specific FMs to get a higher level of throughput for a model at a fixed cost. This makes Provisioned Throughput suitable for production workloads that require predictable performance without throttling. If you customized a model, you must purchase Provisioned Throughput to be able to use it.

Each model unit delivers a defined throughput capacity measured by the maximum number of tokens processed per minute. Provisioned Throughput is billed hourly with commitment options of 1-month or 6-month terms, with longer commitments offering greater discounts.
Best practice: If you’re working on a POC or on a use case that has a sporadic workload using one of the base FMs from Amazon Bedrock, use On-Demand mode to take advantage of pay-as-you-go pricing. However, if you’re working on a steady-state workload where throttling must be avoided, or if you’re using custom models, opt for Provisioned Throughput that matches your workload. Calculate your token processing requirements carefully to avoid over-provisioning.
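The following back-of-the-envelope sketch shows one way to estimate token processing requirements before committing to Provisioned Throughput. The traffic figures are hypothetical, and the tokens-per-minute capacity of a model unit must come from the Amazon Bedrock documentation or your account team.

# Assumed peak traffic profile (replace with your own measurements)
requests_per_minute = 120
avg_input_tokens = 1_500
avg_output_tokens = 500

peak_tokens_per_minute = requests_per_minute * (avg_input_tokens + avg_output_tokens)
print(f"Estimated peak demand: {peak_tokens_per_minute:,} tokens per minute")

# Compare the estimate with the per-model-unit capacity for your chosen model
# to decide between On-Demand mode and a given number of provisioned model units.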
Use batch inference
With batch mode, you can get simultaneous large-scale predictions by providing a set of prompts as a single input file and receiving responses as a single output file. The responses are processed and stored in your Amazon Simple Storage Service (Amazon S3) bucket so you can access them later. Amazon Bedrock offers select FMs from leading AI providers like Anthropic, Meta, Mistral AI, and Amazon for batch inference at a 50% lower price compared to On-Demand inference pricing. Refer to Supported AWS Regions and models for batch inference for more details. This approach is ideal for non-real-time workloads where you need to process large volumes of content efficiently.
Best practice: Identify workloads in your application that don’t require real-time responses and migrate them to batch processing. For example, instead of generating product descriptions on-demand when users view them, pre-generate descriptions for new products in a nightly batch job and store the results. This approach can dramatically reduce your FM costs while maintaining the same output quality.
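As a minimal sketch of the batch pattern, the following boto3 call submits a model invocation job that reads JSONL prompt records from Amazon S3 and writes responses back to Amazon S3 at the discounted batch rate. The job name, role ARN, bucket paths, and model ID are placeholders, and parameter names should be verified against the current SDK documentation.

import boto3

bedrock = boto3.client("bedrock")

response = bedrock.create_model_invocation_job(
    jobName="nightly-product-descriptions",  # placeholder job name
    roleArn="arn:aws:iam::111122223333:role/BedrockBatchInferenceRole",  # placeholder role
    modelId="mistral.mistral-large-2407-v1:0",  # assumed batch-supported model ID
    inputDataConfig={"s3InputDataConfig": {"s3Uri": "s3://amzn-s3-demo-bucket/batch-input/"}},
    outputDataConfig={"s3OutputDataConfig": {"s3Uri": "s3://amzn-s3-demo-bucket/batch-output/"}},
)
print(response["jobArn"])  # poll the job status, then read results from the output prefix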
Conclusion
As organizations increasingly adopt Amazon Bedrock for their generative AI applications, implementing effective cost optimization strategies becomes crucial for maintaining financial efficiency. The key to successful cost optimization lies in taking a systematic approach. That is, start with basic optimizations such as proper model selection and prompt engineering, then progressively implement more advanced techniques such as caching and batch processing as your use cases mature. Regular monitoring of costs and usage patterns, combined with continuous optimization of these strategies, will help make sure that your generative AI initiatives remain both effective and economically sustainable. Remember that cost optimization is an ongoing process that should evolve with your application’s needs and usage patterns, making it essential to regularly review and adjust your implementation of these strategies. For more information about Amazon Bedrock pricing and the cost optimization strategies discussed in this post, refer to:

Amazon Bedrock pricing
Amazon Bedrock Model Distillation
Amazon Bedrock Intelligent Prompt Routing
Amazon Bedrock prompt caching
Process multiple prompts with batch inference
Monitoring the performance of Amazon Bedrock

About the authors
Biswanath Mukherjee is a Senior Solutions Architect at Amazon Web Services. He works with large strategic customers of AWS by providing them technical guidance to migrate and modernize their applications on AWS Cloud. With his extensive experience in cloud architecture and migration, he partners with customers to develop innovative solutions that leverage the scalability, reliability, and agility of AWS to meet their business needs. His expertise spans diverse industries and use cases, enabling customers to unlock the full potential of the AWS Cloud.
Upendra V is a Senior Solutions Architect at Amazon Web Services, specializing in Generative AI and cloud solutions. He helps enterprise customers design and deploy production-ready Generative AI workloads, implement Large Language Models (LLMs) and Agentic AI systems, and optimize cloud deployments. With expertise in cloud adoption and machine learning, he enables organizations to build and scale AI-driven applications efficiently.

Yandex Releases Alchemist: A Compact Supervised Fine-Tuning Dataset fo …

Despite the substantial progress in text-to-image (T2I) generation brought about by models such as DALL-E 3, Imagen 3, and Stable Diffusion 3, achieving consistent output quality — both in aesthetic and alignment terms — remains a persistent challenge. While large-scale pretraining provides general knowledge, it is insufficient to achieve high aesthetic quality and alignment. Supervised fine-tuning (SFT) serves as a critical post-training step but its effectiveness is strongly dependent on the quality of the fine-tuning dataset.

Current public datasets used in SFT either target narrow visual domains (e.g., anime or specific art genres) or rely on basic heuristic filters over web-scale data. Human-led curation is expensive, non-scalable, and frequently fails to identify samples that yield the greatest improvements. Moreover, recent T2I models use internal proprietary datasets with minimal transparency, limiting the reproducibility of results and slowing collective progress in the field.

Approach: A Model-Guided Dataset Curation

To mitigate these issues, Yandex has released Alchemist, a publicly available, general-purpose SFT dataset composed of 3,350 carefully selected image-text pairs. Unlike conventional datasets, Alchemist is constructed using a novel methodology that leverages a pre-trained diffusion model to act as a sample quality estimator. This approach enables the selection of training data with high impact on generative model performance without relying on subjective human labeling or simplistic aesthetic scoring.

Alchemist is designed to improve the output quality of T2I models through targeted fine-tuning. The release also includes fine-tuned versions of five publicly available Stable Diffusion models. The dataset and models are accessible on Hugging Face under an open license. More details about the methodology and experiments are available in the preprint.

Technical Design: Filtering Pipeline and Dataset Characteristics

The construction of Alchemist involves a multi-stage filtering pipeline starting from ~10 billion web-sourced images. The pipeline is structured as follows:

Initial Filtering: Removal of NSFW content and low-resolution images (only images larger than 1024×1024 pixels are retained).

Coarse Quality Filtering: Application of classifiers to exclude images with compression artifacts, motion blur, watermarks, and other defects. These classifiers were trained on standard image quality assessment datasets such as KonIQ-10k and PIPAL.

Deduplication and IQA-Based Pruning: SIFT-like features are used for clustering similar images, retaining only high-quality ones. Images are further scored using the TOPIQ model, ensuring retention of clean samples.

Diffusion-Based Selection: A key contribution is the use of a pre-trained diffusion model’s cross-attention activations to rank images. A scoring function identifies samples that strongly activate features associated with visual complexity, aesthetic appeal, and stylistic richness. This enables the selection of samples most likely to enhance downstream model performance.

Caption Rewriting: The final selected images are re-captioned using a vision-language model fine-tuned to produce prompt-style textual descriptions. This step ensures better alignment and usability in SFT workflows.

Through ablation studies, the authors determine that increasing the dataset size beyond 3,350 (e.g., 7k or 19k samples) results in lower quality of fine-tuned models, reinforcing the value of targeted, high-quality data over raw volume.

Results Across Multiple T2I Models

The effectiveness of Alchemist was evaluated across five Stable Diffusion variants: SD1.5, SD2.1, SDXL, SD3.5 Medium, and SD3.5 Large. Each model was fine-tuned using three datasets: (i) the Alchemist dataset, (ii) a size-matched subset from LAION-Aesthetics v2, and (iii) their respective baselines.

Human Evaluation: Expert annotators performed side-by-side assessments across four criteria — text-image relevance, aesthetic quality, image complexity, and fidelity. Alchemist-tuned models showed statistically significant improvements in aesthetic and complexity scores, often outperforming both baselines and LAION-Aesthetics-tuned versions by margins of 12–20%. Importantly, text-image relevance remained stable, suggesting that prompt alignment was not negatively affected.

Automated Metrics: Across metrics such as FD-DINOv2, CLIP Score, ImageReward, and HPS-v2, Alchemist-tuned models generally scored higher than their counterparts. Notably, improvements were more consistent when compared to size-matched LAION-based models than to baseline models.

Dataset Size Ablation: Fine-tuning with larger variants of Alchemist (7k and 19k samples) led to lower performance, underscoring that stricter filtering and higher per-sample quality is more impactful than dataset size.

Yandex has utilized the dataset to train its proprietary text-to-image generative model, YandexART v2.5, and plans to continue leveraging it for future model updates.

Conclusion

Alchemist provides a well-defined and empirically validated pathway to improve the quality of text-to-image generation via supervised fine-tuning. The approach emphasizes sample quality over scale and introduces a replicable methodology for dataset construction without reliance on proprietary tools.

While the improvements are most notable in perceptual attributes like aesthetics and image complexity, the framework also highlights the trade-offs that arise in fidelity, particularly for newer base models already optimized through internal SFT. Nevertheless, Alchemist establishes a new standard for general-purpose SFT datasets and offers a valuable resource for researchers and developers working to advance the output quality of generative vision models.

Check out the Paper here and the Alchemist Dataset on Hugging Face. Thanks to the Yandex team for the thought leadership and resources for this article.

How to Create Smart Multi-Agent Workflows Using the Mistral Agents API …

In this tutorial, we’ll explore how to create smart, multi-agent workflows using the Mistral Agents API’s Handoffs feature. This lets different agents work together by passing tasks to each other, enabling complex problems to be solved in a modular and efficient way. We’ll build a system where agents collaborate to answer inflation-related questions—performing calculations, fetching data online, and creating visualizations—to deliver clear, accurate, and dynamic responses.

Step 1: Setting up dependencies

Installing the libraries

pip install mistralai pydantic

Loading the Mistral API Key

You can get an API key from https://console.mistral.ai/api-keys

from getpass import getpass

MISTRAL_API_KEY = getpass('Enter Mistral API Key: ')

Step 2: Agent Prerequisites and Setup

Initializing the Agent

from mistralai import CompletionArgs, ResponseFormat, JSONSchema
from pydantic import BaseModel
from mistralai import Mistral

client = Mistral(MISTRAL_API_KEY)

Creating the Custom Function

The adjust_for_inflation function calculates how much a given amount of money would be worth after accounting for inflation over time. It uses the compound formula based on the number of years and the annual inflation rate. If the end year is before the start year, it returns an error. Otherwise, it returns the adjusted value along with the input details. For example, adjust_for_inflation(1000, 1899, 2025, 10) shows what ₹1000 from 1899 would be worth in 2025 at 10% inflation.

def adjust_for_inflation(amount: float, start_year: int, end_year: int, annual_inflation_rate: float):
    """
    Calculates inflation-adjusted value using compound formula.
    """
    if end_year < start_year:
        return {"error": "End year must be greater than or equal to start year."}

    years = end_year - start_year
    adjusted_value = amount * ((1 + annual_inflation_rate / 100) ** years)

    return {
        "original_amount": amount,
        "start_year": start_year,
        "end_year": end_year,
        "inflation_rate": annual_inflation_rate,
        "adjusted_value": round(adjusted_value, 2)
    }

adjust_for_inflation(1000, 1899, 2025, 10)

Creating Structured Output for Mathematical Reasoning

class CalcResult(BaseModel):
    reasoning: str
    result: str

inflation_tool = {
    "type": "function",
    "function": {
        "name": "adjust_for_inflation",
        "description": "Calculate the value of money adjusted for inflation over a time period.",
        "parameters": {
            "type": "object",
            "properties": {
                "amount": {
                    "type": "number",
                    "description": "Original amount of money"
                },
                "start_year": {
                    "type": "integer",
                    "description": "The starting year for inflation adjustment"
                },
                "end_year": {
                    "type": "integer",
                    "description": "The ending year for inflation adjustment"
                },
                "annual_inflation_rate": {
                    "type": "number",
                    "description": "Annual inflation rate in percent"
                }
            },
            "required": ["amount", "start_year", "end_year", "annual_inflation_rate"]
        }
    }
}

Step 3: Creating the Agents

Defining the different agents

In this setup, we define a multi-agent system using Mistral Agents API to handle inflation-related economic queries. The main agent (economics-agent) acts as a coordinator that routes tasks to specialized agents. The inflation-agent performs inflation adjustment calculations using a custom function. If the inflation rate is missing from the query, the websearch-agent fetches it from the internet. The calculator-agent handles complex numerical computations with step-by-step reasoning, while the graph-agent uses the code interpreter to visualize inflation trends over time. Together, these agents collaborate via handoffs to deliver accurate, dynamic responses to economic queries.

# Main Agent
economics_agent = client.beta.agents.create(
    model="mistral-large-latest",
    name="economics-agent",
    description="Handles economic queries and delegates inflation calculations.",
)

# Inflation Function Agent
inflation_agent = client.beta.agents.create(
    model="mistral-large-latest",
    name="inflation-agent",
    description="Agent that calculates inflation-adjusted value using a custom function.",
    tools=[inflation_tool],
)

# Web Search Agent
websearch_agent = client.beta.agents.create(
    model="mistral-large-latest",
    name="websearch-agent",
    description="Agent that can search the internet for missing economic data such as inflation rates.",
    tools=[{"type": "web_search"}]
)

# Calculator Agent
from pydantic import BaseModel

class CalcResult(BaseModel):
    reasoning: str
    result: str

calculator_agent = client.beta.agents.create(
    model="mistral-large-latest",
    name="calculator-agent",
    description="Agent used to make detailed calculations.",
    instructions="When doing calculations, explain step by step.",
    completion_args=CompletionArgs(
        response_format=ResponseFormat(
            type="json_schema",
            json_schema=JSONSchema(
                name="calc_result",
                schema=CalcResult.model_json_schema(),
            )
        )
    )
)

# Graph Agent
graph_agent = client.beta.agents.create(
    model="mistral-large-latest",
    name="graph-agent",
    description="Agent that generates graphs using code interpreter.",
    instructions="Use code interpreter to draw inflation trends.",
    tools=[{"type": "code_interpreter"}]
)

Defining the Handoffs Responsibilities

This configuration defines how agents delegate tasks among each other:

The Main Agent (economics_agent) serves as the entry point and delegates queries either to the inflation_agent (for inflation calculations) or the websearch_agent (to fetch missing data like inflation rates).

The inflation_agent, after receiving either the user query or web-fetched data, can further pass tasks to the calculator_agent (for detailed math) or graph_agent (to visualize trends).

The websearch_agent can pass control to the inflation_agent after retrieving required information, like the inflation rate.

calculator_agent and graph_agent are considered terminal agents. However, optional mutual handoff is enabled in case one needs to do follow-up work (e.g., graphing a calculated result or vice versa).

# Main Agent hands off to inflation_agent and websearch_agent
economics_agent = client.beta.agents.update(
    agent_id=economics_agent.id,
    handoffs=[inflation_agent.id, websearch_agent.id]
)

# Inflation Agent can delegate to calculator_agent or graph_agent if deeper analysis or visualization is needed
inflation_agent = client.beta.agents.update(
    agent_id=inflation_agent.id,
    handoffs=[calculator_agent.id, graph_agent.id]
)

# Web Search Agent can hand off to inflation_agent (after finding the missing rate)
websearch_agent = client.beta.agents.update(
    agent_id=websearch_agent.id,
    handoffs=[inflation_agent.id]
)

# Calculator and Graph agents are terminal - they don't hand off further
# But if needed, we could let them hand off to each other:
calculator_agent = client.beta.agents.update(
    agent_id=calculator_agent.id,
    handoffs=[graph_agent.id]  # Optional
)

graph_agent = client.beta.agents.update(
    agent_id=graph_agent.id,
    handoffs=[calculator_agent.id]  # Optional
)

Step 4: Running the Agent

Example A: What is the current inflation rate in India?

In this example, the prompt “What is the current inflation rate in India?” is passed to the economics_agent, which is the main entry point for handling economic queries. Since the question requires real-time data that isn’t included in the agent’s static knowledge, the economics_agent automatically hands off the query to the websearch_agent, which is equipped with web search capabilities.

prompt = "What is the current inflation rate in India?"
response = client.beta.conversations.start(
    agent_id=economics_agent.id,
    inputs=prompt
)
print(response.outputs[-1].content[0].text)

Example B: What is the inflation-adjusted value of 5,000 from the year 2010 to 2023 with an annual inflation rate of 6.5%. Explain calculation steps and plot a graph with data labels

This code block sends the prompt to an economics agent, checks if the agent triggers a specific function call (adjust_for_inflation), executes that function locally with the provided arguments, and then returns the computed result back to the agent. Finally, it prints the agent’s response, which includes the inflation calculation explanation, along with the Python code to plot the trend.

import json

from mistralai.models import FunctionResultEntry

prompt = """What is the inflation-adjusted value of 5,000 from the year 2010 to 2023 with annual inflation rate of 6.5%.
Explain calculation steps and plot a graph with data labels"""

response = client.beta.conversations.start(
    agent_id=economics_agent.id,
    inputs=prompt
)

# Check for function call
if response.outputs[-1].type == "function.call" and response.outputs[-1].name == "adjust_for_inflation":
    args = json.loads(response.outputs[-1].arguments)

    # Run local function
    function_result = json.dumps(adjust_for_inflation(**args))

    # Return result to Mistral
    result_entry = FunctionResultEntry(
        tool_call_id=response.outputs[-1].tool_call_id,
        result=function_result
    )

    response = client.beta.conversations.append(
        conversation_id=response.conversation_id,
        inputs=[result_entry]
    )

    print(response.outputs[-1].content)
else:
    print(response.outputs[-1].content)

The following code block was returned by the agent to plot the trend of inflation-adjusted value over time.

import matplotlib.pyplot as plt
import numpy as np

# Parameters
original_amount = 5000
start_year = 2010
end_year = 2023
inflation_rate = 6.5 / 100  # Convert percentage to decimal

# Calculate the number of years
num_years = end_year - start_year + 1

# Calculate the adjusted value for each year
years = np.arange(start_year, end_year + 1)
adjusted_values = original_amount * (1 + inflation_rate) ** (years - start_year)

# Plot the graph
plt.figure(figsize=(10, 6))
plt.plot(years, adjusted_values, marker='o', linestyle='-', color='b')

# Add data labels
for year, value in zip(years, adjusted_values):
    plt.text(year, value, f'${value:.2f}', ha='right')

# Add titles and labels
plt.title('Inflation-Adjusted Value Over Time')
plt.xlabel('Year')
plt.ylabel('Adjusted Value')

# Save the plot as an image
plt.savefig('inflation_adjusted_value.png')

# Show the plot
plt.show()

Check out the Notebook. All credit for this research goes to the researchers of this project.

ALPHAONE: A Universal Test-Time Framework for Modulating Reasoning in …

Large reasoning models, often powered by large language models, are increasingly used to solve high-level problems in mathematics, scientific analysis, and code generation. The central idea is to simulate two types of cognition: rapid responses for simpler reasoning and deliberate, slower thought for more complex problems. This dual-mode thinking reflects how humans transition from intuitive reactions to analytical thinking depending on task complexity, a principle that drives innovations in cognitive modeling and AI reasoning frameworks.

One persistent issue arises from the model’s inability to self-regulate these shifts between fast and slow thinking. Rather than aligning with task demands, models tend to default to fixed patterns, leading to either premature conclusions or excessive processing. This inefficiency becomes particularly evident when handling tasks that demand a delicate balance of deliberation and swiftness. The failure to optimize this transition has limited the reasoning accuracy of these models, often leading to errors or unnecessary computation, particularly in high-stakes applications such as competitive math problems or real-time code analysis.

To tackle this, previous solutions have introduced test-time scaling approaches. Parallel scaling strategies utilize multiple outputs from a model and then select the best one using metrics like self-consistency or perplexity. In contrast, sequential scaling alters how the model reasons over time by either restricting or encouraging the formation of prolonged chains of thought. One example is the Chain of Draft method, which limits reasoning steps to a strict word count to reduce overthinking. Another approach, S1, extends slow reasoning near the end by adding “wait” tokens. However, these methods often lack synchronization between the duration of reasoning and the scheduling of slow-to-fast thinking transitions, failing to offer a universal solution that effectively adapts reasoning processes.

Researchers from the University of Illinois Urbana-Champaign and UC Berkeley have introduced ALPHAONE, which brings a novel modulation system to control reasoning dynamics during test time. ALPHAONE introduces a concept called the “alpha moment,” controlled by a universal parameter α, that defines when the model transitions from slow to fast reasoning. This framework modifies the reasoning process by adjusting both the duration and structure of thought, making it possible to unify and extend prior methods with a more adaptable strategy for handling complex reasoning tasks.

The mechanism is divided into two core phases. In the pre-alpha phase, ALPHAONE initiates slow reasoning using a probabilistic schedule that inserts the token “wait” after structural breaks like “\n\n”, governed by a Bernoulli process. This insertion is not static but based on a user-defined function that adjusts over time—for example, using a linear annealing pattern to taper off slow thinking. Once the model hits the alpha moment, the post-alpha phase begins by replacing “wait” tokens with the explicit end-of-thinking token “</think>.” This ensures a decisive shift to fast thinking, mitigating inertia caused by prolonged slow reasoning and enabling the efficient generation of answers.
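The following conceptual sketch (not the authors' implementation) illustrates the schedule described above, assuming a linearly annealed insertion probability: before the alpha moment, "wait" tokens are probabilistically appended after structural breaks; after it, further "wait" tokens are replaced by the end-of-thinking marker.

import random

def slow_thinking_probability(step: int, alpha_moment: int, p_start: float = 0.4) -> float:
    """Linearly anneal the chance of inserting a 'wait' token toward zero (assumed schedule)."""
    if step >= alpha_moment:
        return 0.0  # post-alpha phase: no more slow-thinking insertions
    return p_start * (1.0 - step / alpha_moment)

def modulate(token: str, step: int, alpha_moment: int) -> str:
    # Pre-alpha phase: after a structural break, sometimes insert "wait" (Bernoulli draw)
    if step < alpha_moment and token == "\n\n":
        if random.random() < slow_thinking_probability(step, alpha_moment):
            return token + "wait"
    # Post-alpha phase: force the switch to fast thinking
    if step >= alpha_moment and token == "wait":
        return "</think>"
    return token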

ALPHAONE demonstrated superior results across six benchmarks in mathematics, science, and code generation. For example, using the DeepSeek-R1-Distill-Qwen-1.5B model, ALPHAONE boosted accuracy in AMC23 from 57.5% to 70.0% while reducing average token length from 5339 to 4952. Similar gains were noted with larger models: with the 7B model, performance on OlympiadBench rose from 50.4% to 55.7%, and with the 32B Qwen QwQ model, performance in AIME24 jumped from 40.0% to 53.3%. On average, across all models and tasks, ALPHAONE improved accuracy by +6.15% and used fewer tokens compared to standard models and other baselines like S1 and Chain of Draft.

These results confirm that managing the flow between slow and fast reasoning is crucial for achieving better performance in complex problem-solving. By enabling structured modulation via a universal framework, ALPHAONE resolves previous inefficiencies and opens up a scalable, efficient path forward for reasoning models. The approach showcases how thoughtful scheduling of cognition-like behaviors in AI can yield practical, measurable benefits in performance and resource efficiency.

Check out the Paper, GitHub Page and Project Page. All credit for this research goes to the researchers of this project.

Building intelligent AI voice agents with Pipecat and Amazon Bedrock …

Voice AI is transforming how we interact with technology, making conversational interactions more natural and intuitive than ever before. At the same time, AI agents are becoming increasingly sophisticated, capable of understanding complex queries and taking autonomous actions on our behalf. As these trends converge, you see the emergence of intelligent AI voice agents that can engage in human-like dialogue while performing a wide range of tasks.
In this series of posts, you will learn how to build intelligent AI voice agents using Pipecat, an open-source framework for voice and multimodal conversational AI agents, with foundation models on Amazon Bedrock. It includes high-level reference architectures, best practices and code samples to guide your implementation.
Approaches for building AI voice agents
There are two common approaches for building conversational AI agents:

Using cascaded models: In this post (Part 1), you will learn about the cascaded models approach, diving into the individual components of a conversational AI agent. With this approach, voice input passes through a series of architecture components before a voice response is sent back to the user. This approach is also sometimes referred to as pipeline or component model voice architecture.
Using speech-to-speech foundation models in a single architecture: In Part 2, you will learn how Amazon Nova Sonic, a state-of-the-art, unified speech-to-speech foundation model, can enable real-time, human-like voice conversations by combining speech understanding and generation in a single architecture.

Common use cases
AI voice agents can handle multiple use cases, including but not limited to:

Customer Support: AI voice agents can handle customer inquiries 24/7, providing instant responses and routing complex issues to human agents when necessary.
Outbound Calling: AI agents can conduct personalized outreach campaigns, scheduling appointments or following up on leads with natural conversation.
Virtual Assistants: Voice AI can power personal assistants that help users manage tasks and answer questions.

Architecture: Using cascaded models to build an AI voice agent
To build an agentic voice AI application with the cascaded models approach, you need to orchestrate multiple architecture components involving multiple machine learning and foundation models.

Figure 1: Architecture overview of a Voice AI Agent using Pipecat
These components include:
WebRTC Transport: Enables real-time audio streaming between client devices and the application server.
Voice Activity Detection (VAD): Detects speech using Silero VAD with configurable speech start and speech end times, and noise suppression capabilities to remove background noise and enhance audio quality.
Automatic Speech Recognition (ASR): Uses Amazon Transcribe for accurate, real-time speech-to-text conversion.
Natural Language Understanding (NLU): Interprets user intent using latency-optimized inference on Amazon Bedrock with models like Amazon Nova Pro, optionally enabling prompt caching to optimize for speed and cost efficiency in Retrieval Augmented Generation (RAG) use cases.
Tools Execution and API Integration: Executes actions or retrieves information for RAG by integrating backend services and data sources via Pipecat Flows and leveraging the tool use capabilities of foundation models.
Natural Language Generation (NLG): Generates coherent responses using Amazon Nova Pro on Bedrock, offering the right balance of quality and latency.
Text-to-Speech (TTS): Converts text responses back into lifelike speech using Amazon Polly with generative voices.
Orchestration Framework: Pipecat orchestrates these components, offering a modular Python-based framework for real-time, multimodal AI agent applications.
Best practices for building effective AI voice agents
Developing responsive AI voice agents requires focus on latency and efficiency. While best practices continue to emerge, consider the following implementation strategies to achieve natural, human-like interactions:
Minimize conversation latency: Use latency-optimized inference for foundation models (FMs) like Amazon Nova Pro to maintain natural conversation flow.
Select efficient foundation models: Prioritize smaller, faster foundation models (FMs) that can deliver quick responses while maintaining quality.
Implement prompt caching: Utilize prompt caching to optimize for both speed and cost efficiency, especially in complex scenarios requiring knowledge retrieval.
Deploy text-to-speech (TTS) fillers: Use natural filler phrases (such as “Let me look that up for you”) before intensive operations to maintain user engagement while the system makes tool calls or long-running calls to your foundation models.
Build a robust audio input pipeline: Integrate components like noise suppression to support clear audio quality for better speech recognition results.
Start simple and iterate: Begin with basic conversational flows before progressing to complex agentic systems that can handle multiple use cases.
Region availability: Latency-optimized inference and prompt caching features may only be available in certain AWS Regions. Evaluate the trade-off between using these advanced capabilities and selecting a Region that is geographically closer to your end users.
Example implementation: Build your own AI voice agent in minutes
This post provides a sample application on Github that demonstrates the concepts discussed. It uses Pipecat and its accompanying state management framework, Pipecat Flows, with Amazon Bedrock, along with Web Real-Time Communication (WebRTC) capabilities from Daily to create a working voice agent you can try in minutes.
Prerequisites
To setup the sample application, you should have the following prerequisites:

Python 3.10+
An AWS account with appropriate Identity and Access Management (IAM) permissions for Amazon Bedrock, Amazon Transcribe, and Amazon Polly
Access to foundation models on Amazon Bedrock
Access to an API key for Daily
Modern web browser (such as Google Chrome or Mozilla Firefox) with WebRTC support

Implementation Steps
After you complete the prerequisites, you can start setting up your sample voice agent:

Clone the repository: git clone https://github.com/aws-samples/build-intelligent-ai-voice-agents-with-pipecat-and-amazon-bedrock
cd build-intelligent-ai-voice-agents-with-pipecat-and-amazon-bedrock/part-1
Set up the environment: cd server
python3 -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
pip install -r requirements.txt
Configure API keys in .env: DAILY_API_KEY=your_daily_api_key
AWS_ACCESS_KEY_ID=your_aws_access_key_id
AWS_SECRET_ACCESS_KEY=your_aws_secret_access_key
AWS_REGION=your_aws_region
Start the server: python server.py
Connect via browser at http://localhost:7860 and grant microphone access
Start the conversation with your AI voice agent

Customizing your voice AI agent
To customize, you can start by:

Modifying flow.py to change conversation logic
Adjusting model selection in bot.py for your latency and quality needs

To learn more, see documentation for Pipecat Flows and review the README of our code sample on Github.
Cleanup
The instructions above are for setting up the application in your local environment. The local application will leverage AWS services and Daily through AWS IAM and API credentials. For security and to avoid unanticipated costs, when you are finished, delete these credentials to make sure that they can no longer be accessed.
Accelerating voice AI implementations
To accelerate AI voice agent implementations, AWS Generative AI Innovation Center (GAIIC) partners with customers to identify high-value use cases and develop proof-of-concept (PoC) solutions that can quickly move to production.
Customer Testimonial: InDebted
InDebted, a global fintech transforming the consumer debt industry, collaborates with AWS to develop their voice AI prototype.

“We believe AI-powered voice agents represent a pivotal opportunity to enhance the human touch in financial services customer engagement. By integrating AI-enabled voice technology into our operations, our goals are to provide customers with faster, more intuitive access to support that adapts to their needs, as well as improving the quality of their experience and the performance of our contact centre operations”
says Mike Zhou, Chief Data Officer at InDebted.

By collaborating with AWS and leveraging Amazon Bedrock, organizations like InDebted can create secure, adaptive voice AI experiences that meet regulatory standards while delivering real, human-centric impact in even the most challenging financial conversations.
Conclusion
Building intelligent AI voice agents is now more accessible than ever through the combination of open-source frameworks such as Pipecat, and powerful foundation models with latency optimized inference and prompt caching on Amazon Bedrock.
In this post, you learned about two common approaches on how to build AI voice agents, delving into the cascaded models approach and its key components. These essential components work together to create an intelligent system that can understand, process, and respond to human speech naturally. By leveraging these rapid advancements in generative AI, you can create sophisticated, responsive voice agents that deliver real value to your users and customers.
To get started with your own voice AI project, try our code sample on Github or contact your AWS account team to explore an engagement with AWS Generative AI Innovation Center (GAIIC).
You can also learn about building AI voice agents using a unified speech-to-speech foundation model, Amazon Nova Sonic, in Part 2.

About the Authors
Adithya Suresh serves as a Deep Learning Architect at the AWS Generative AI Innovation Center, where he partners with technology and business teams to build innovative generative AI solutions that address real-world challenges.
Daniel Wirjo is a Solutions Architect at AWS, focused on FinTech and SaaS startups. As a former startup CTO, he enjoys collaborating with founders and engineering leaders to drive growth and innovation on AWS. Outside of work, Daniel enjoys taking walks with a coffee in hand, appreciating nature, and learning new ideas.
Karan Singh is a Generative AI Specialist at AWS, where he works with top-tier third-party foundation model and agentic frameworks providers to develop and execute joint go-to-market strategies, enabling customers to effectively deploy and scale solutions to solve enterprise generative AI challenges.
Xuefeng Liu leads a science team at the AWS Generative AI Innovation Center in the Asia Pacific regions. His team partners with AWS customers on generative AI projects, with the goal of accelerating customers’ adoption of generative AI.

Stream multi-channel audio to Amazon Transcribe using the Web Audio AP …

Multi-channel transcription streaming is a feature of Amazon Transcribe that can be used in many cases with a web browser. Creating this stream source has its challenges, but with the JavaScript Web Audio API, you can connect and combine different audio sources like videos, audio files, or hardware like microphones to obtain transcripts.
In this post, we guide you through how to use two microphones as audio sources, merge them into a single dual-channel audio, perform the required encoding, and stream it to Amazon Transcribe. A Vue.js application source code is provided that requires two microphones connected to your browser. However, the versatility of this approach extends far beyond this use case—you can adapt it to accommodate a wide range of devices and audio sources.
With this approach, you can get transcripts for two sources in a single Amazon Transcribe session, offering cost savings and other benefits compared to using a separate session for each source.
Challenges when using two microphones
For our use case, using a single-channel stream for two microphones and enabling Amazon Transcribe speaker label identification to identify the speakers might be enough, but there are a few considerations:

Speaker labels are randomly assigned at session start, meaning you will have to map the results in your application after the stream has started
Mislabeled speakers with similar voice tones can happen, which even for a human is hard to distinguish
Voice overlapping can occur when two speakers talk at the same time with one audio source

By using two audio sources with microphones, you can address these concerns by making sure each transcription is from a fixed input source. By assigning a device to a speaker, our application knows in advance which transcript to use. However, you might still encounter voice overlapping if two nearby microphones are picking up multiple voices. This can be mitigated by using directional microphones, volume management, and Amazon Transcribe word-level confidence scores.
Solution overview
The following diagram illustrates the solution workflow.

Application diagram for two microphones

We use two audio inputs with the Web Audio API. With this API, we can merge the two inputs, Mic A and Mic B, into a single audio data source, with the left channel representing Mic A and the right channel representing Mic B.
Then, we convert this audio source to PCM (Pulse-Code Modulation) audio. PCM is a common format for audio processing, and it’s one of the formats required by Amazon Transcribe for the audio input. Finally, we stream the PCM audio to Amazon Transcribe for transcription.
Prerequisites
You should have the following prerequisites in place:

The source code from the GitHub repository.
Bun or Node.js installed as a JavaScript runtime.
A web browser with Web Audio API compatibility. This solution has been tested to work in Google Chrome version 135.0.7049.85.
Two microphones connected to your computer and with browser permission to access these microphones.
An AWS account with Amazon Transcribe permissions. As an example, you can use the following AWS Identity and Access Management (IAM) policy for Amazon Transcribe:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DemoWebAudioAmazonTranscribe",
            "Effect": "Allow",
            "Action": "transcribe:StartStreamTranscriptionWebSocket",
            "Resource": "*"
        }
    ]
}

Start the application
Complete the following steps to launch the application:

Go to the root directory where you downloaded the code.
Create a .env file to set up your AWS access keys from the env.sample file.
Install packages and run bun install (if you’re using Node.js, run npm install).
Start the web server and run bun dev (if you’re using Node.js, run npm run dev).
Open your browser at http://localhost:5173/.

Application running on http://localhost:5173 with two connected microphones

Code walkthrough
In this section, we examine the important code pieces for the implementation:

The first step is to list the connected microphones by using the browser API navigator.mediaDevices.enumerateDevices():

const devices = await navigator.mediaDevices.enumerateDevices()
return devices.filter((d) => d.kind === 'audioinput')

Next, you need to obtain the MediaStream object for each of the connected microphones. This can be done using the navigator.mediaDevices.getUserMedia() API, which enables access to the user’s media devices (such as cameras and microphones). You can then retrieve a MediaStream object that represents the audio or video data from those devices:

const streams = []
const stream = await navigator.mediaDevices.getUserMedia({
  audio: {
    deviceId: device.deviceId,
    echoCancellation: true,
    noiseSuppression: true,
    autoGainControl: true,
  },
})

if (stream) streams.push(stream)

To combine the audio from the multiple microphones, you need to create an AudioContext interface for audio processing. Within this AudioContext, you can use ChannelMergerNode to merge the audio streams from the different microphones. The connect(destination, src_idx, ch_idx) method arguments are:

destination – The destination, in our case mergerNode.
src_idx – The source channel index, in our case both 0 (because each microphone is a single-channel audio stream).
ch_idx – The channel index for the destination, in our case 0 and 1 respectively, to create a stereo output.

// instance of audioContext
const audioContext = new AudioContext({
  sampleRate: SAMPLE_RATE,
})
// this is used to process the microphone stream data
const audioWorkletNode = new AudioWorkletNode(audioContext, 'recording-processor', {...})
// microphone A
const audioSourceA = audioContext.createMediaStreamSource(mediaStreams[0]);
// microphone B
const audioSourceB = audioContext.createMediaStreamSource(mediaStreams[1]);
// audio node for two inputs
const mergerNode = audioContext.createChannelMerger(2);
// connect the audio sources to the mergerNode destination.
audioSourceA.connect(mergerNode, 0, 0);
audioSourceB.connect(mergerNode, 0, 1);
// connect our mergerNode to the AudioWorkletNode
mergerNode.connect(audioWorkletNode);

The microphone data is processed in an AudioWorklet that emits data messages every defined number of recording frames. These messages will contain the audio data encoded in PCM format to send to Amazon Transcribe. Using the p-event library, you can asynchronously iterate over the events from the Worklet. A more in-depth description about this Worklet is provided in the next section of this post.

import { pEventIterator } from 'p-event'

// Register the worklet
try {
  await audioContext.audioWorklet.addModule('./worklets/recording-processor.js')
} catch (e) {
  console.error('Failed to load audio worklet')
}

// An async iterator
const audioDataIterator = pEventIterator<'message', MessageEvent<AudioWorkletMessageDataType>>(
  audioWorkletNode.port,
  'message',
)

// AsyncIterableIterator: Every time the worklet emits an event with the message `SHARE_RECORDING_BUFFER`, this iterator will return the AudioEvent object that we need.
const getAudioStream = async function* (
  audioDataIterator: AsyncIterableIterator<MessageEvent<AudioWorkletMessageDataType>>,
) {
  for await (const chunk of audioDataIterator) {
    if (chunk.data.message === 'SHARE_RECORDING_BUFFER') {
      const { audioData } = chunk.data
      yield {
        AudioEvent: {
          AudioChunk: audioData,
        },
      }
    }
  }
}

To start streaming the data to Amazon Transcribe, you can use the iterator you created, setting NumberOfChannels: 2 and EnableChannelIdentification: true to enable dual-channel transcription. For more information, refer to the AWS SDK StartStreamTranscriptionCommand documentation.

import {
  LanguageCode,
  MediaEncoding,
  StartStreamTranscriptionCommand,
} from '@aws-sdk/client-transcribe-streaming'

const command = new StartStreamTranscriptionCommand({
  LanguageCode: LanguageCode.EN_US,
  MediaEncoding: MediaEncoding.PCM,
  MediaSampleRateHertz: SAMPLE_RATE,
  NumberOfChannels: 2,
  EnableChannelIdentification: true,
  ShowSpeakerLabel: true,
  AudioStream: getAudioStream(audioIterator),
})

After you send the request, a WebSocket connection is created to exchange audio stream data and Amazon Transcribe results:

const data = await client.send(command)
for await (const event of data.TranscriptResultStream) {
  for (const result of event.TranscriptEvent.Transcript.Results || []) {
    callback({ ...result })
  }
}

The result object will include a ChannelId property that you can use to identify your microphone source, such as ch_0 and ch_1, respectively.
Deep dive: Audio Worklet
Audio Worklets can execute in a separate thread to provide very low-latency audio processing. The implementation and demo source code can be found in the public/worklets/recording-processor.js file.
For our case, we use the Worklet to perform two main tasks:

Process the mergerNode audio in an iterable way. This node includes both of our audio channels and is the input to our Worklet.
Encode the data bytes of the mergerNode node into PCM signed 16-bit little-endian audio format. We do this for each iteration or when required to emit a message payload to our application.

The general code structure to implement this is as follows:
class RecordingProcessor extends AudioWorkletProcessor {
  constructor(options) {
    super()
  }
  process(inputs, outputs) {...}
}

registerProcessor('recording-processor', RecordingProcessor)

You can pass custom options to this Worklet instance using the processorOptions attribute. In our demo, we set a maxFrameCount: (SAMPLE_RATE * 4) / 10 as a bitrate guide to determine when to emit a new message payload. A message is for example:
this.port.postMessage({
  message: 'SHARE_RECORDING_BUFFER',
  buffer: this._recordingBuffer,
  recordingLength: this.recordedFrames,
  audioData: new Uint8Array(pcmEncodeArray(this._recordingBuffer)), // PCM encoded audio format
})

PCM encoding for two channels
One of the most important sections is how to encode to PCM for two channels. Following the AWS documentation in the Amazon Transcribe API Reference, the AudioChunk is defined by: Duration (s) * Sample Rate (Hz) * Number of Channels * 2. For two channels, 1 second at 16000Hz is: 1 * 16000 * 2 * 2 = 64000 bytes. Our encoding function should then look like this:
// Notice that input is an array, where each element is a channel with Float32 values between -1.0 and 1.0 from the AudioWorkletProcessor.
const pcmEncodeArray = (input: Float32Array[]) => {
  const numChannels = input.length
  const numSamples = input[0].length
  const bufferLength = numChannels * numSamples * 2 // 2 bytes per sample per channel
  const buffer = new ArrayBuffer(bufferLength)
  const view = new DataView(buffer)

  let index = 0

  for (let i = 0; i < numSamples; i++) {
    // Encode for each channel
    for (let channel = 0; channel < numChannels; channel++) {
      const s = Math.max(-1, Math.min(1, input[channel][i]))
      // Convert the 32 bit float to 16 bit PCM audio waveform samples.
      // Max value: 32767 (0x7FFF), Min value: -32768 (-0x8000)
      view.setInt16(index, s < 0 ? s * 0x8000 : s * 0x7fff, true)
      index += 2
    }
  }
  return buffer
}

For more information how the audio data blocks are handled, see AudioWorkletProcessor: process() method. For more information on PCM format encoding, see Multimedia Programming Interface and Data Specifications 1.0.
Conclusion
In this post, we explored the implementation details of a web application that uses the browser’s Web Audio API and Amazon Transcribe streaming to enable real-time dual-channel transcription. By using the combination of AudioContext, ChannelMergerNode, and AudioWorklet, we were able to seamlessly process and encode the audio data from two microphones before sending it to Amazon Transcribe for transcription. The use of the AudioWorklet in particular allowed us to achieve low-latency audio processing, providing a smooth and responsive user experience.
You can build upon this demo to create more advanced real-time transcription applications that cater to a wide range of use cases, from meeting recordings to voice-controlled interfaces.
Try out the solution for yourself, and leave your feedback in the comments.

About the Author
Jorge Lanzarotti is a Sr. Prototyping SA at Amazon Web Services (AWS) based in Tokyo, Japan. He helps customers in the public sector by creating innovative solutions to challenging problems.

How Kepler democratized AI access and enhanced client services with Am …

This is a guest post co-authored by Evan Miller, Noah Kershaw, and Valerie Renda of Kepler Group
At Kepler, a global full-service digital marketing agency serving Fortune 500 brands, we understand the delicate balance between creative marketing strategies and data-driven precision. Our company name draws inspiration from the visionary astronomer Johannes Kepler, reflecting our commitment to bringing clarity to complex challenges and illuminating the path forward for our clients.
In this post, we share how implementing Amazon Q Business transformed our operations by democratizing AI access across our organization while maintaining stringent security standards, resulting in an average savings of 2.7 hours per week per employee in manual work and improved client service delivery.
The challenge: Balancing innovation with security
As a digital marketing agency working with Fortune 500 clients, we faced increasing pressure to use AI capabilities while making sure that we maintain the highest levels of data security. Our previous solution lacked essential features, which led team members to consider more generic solutions. Specifically, the original implementation was missing critical capabilities such as chat history functionality, preventing users from accessing or referencing their prior conversations. This absence of conversation context meant users had to repeatedly provide background information in each interaction. Additionally, the solution had no file upload capabilities, limiting users to text-only interactions. These limitations resulted in a basic AI experience where users often had to compromise by rewriting prompts, manually maintaining context, and working around the inability to process different file formats. The restricted functionality ultimately pushed teams to explore alternative solutions that could better meet their comprehensive needs. Being an International Organization for Standardization (ISO) 27001-certified organization, we needed an enterprise-grade solution that would meet our strict security requirements without compromising on functionality. Our ISO 27001 certification mandates rigorous security controls, which meant that public AI tools weren’t suitable for our needs. We required a solution that could be implemented within our secure environment while maintaining full compliance with our stringent security protocols.
Why we chose Amazon Q Business
Our decision to implement Amazon Q Business was driven by three key factors that aligned perfectly with our needs. First, because our Kepler Intelligence Platform (Kip) infrastructure already resided on Amazon Web Services (AWS), the integration process was seamless. Our Amazon Q Business implementation uses three core connectors (Amazon Simple Storage Service (Amazon S3), Google Drive, and Amazon Athena), though our wider data ecosystem includes 35–45 different platform integrations, primarily flowing through Amazon S3. Second, the commitment from Amazon Q Business to not use our data for model training satisfied our essential security requirements. Finally, the Amazon Q Business apps functionality enabled us to develop no-code solutions for everyday challenges, democratizing access to efficient workflows without requiring additional software developers.
Implementation journey
We began our Amazon Q Business implementation journey in early 2025 with a focused pilot group of 10 participants, expanding to 100 users in February and March, with plans for a full deployment reaching 500+ employees. During this period, we organized an AI-focused hackathon that catalyzed organic adoption and sparked creative solutions. The implementation was unique in how we integrated Amazon Q Business into our existing Kepler Intelligence Platform, rebranding it as Kip AI to maintain consistency with our internal systems.
Kip AI demonstrates how we’ve comprehensively integrated AI capabilities with our existing data infrastructure. We use multiple data sources, including Amazon S3 for our storage needs, Amazon QuickSight for our business intelligence requirements, and Google Drive for team collaboration. At the heart of our system is our custom extract, transform, and load (ETL) pipeline (Kip SSoT), which we’ve designed to feed data into QuickSight for AI-enabled analytics. We’ve configured Amazon Q Business to seamlessly connect with these data sources, allowing our team members to access insights through both a web interface and browser extension. The following figure shows the architecture of Kip AI.

This integrated approach helps ensure that Kepler’s employees can securely access AI capabilities while maintaining data governance and security requirements crucial for their clients. Access to the platform is secured through AWS Identity and Access Management (IAM), connected to our single sign-on provider, ensuring that only authorized personnel can use the system. This careful approach to security and access management has been crucial in maintaining our clients’ trust while rolling out AI capabilities across our organization.
Transformative use cases and results
The implementation of Amazon Q Business has revolutionized several key areas of our operations. Our request for information (RFI) response process, which traditionally consumed significant time and resources, has been streamlined dramatically. Teams now report saving over 10 hours per RFI response, allowing us to pursue more business opportunities efficiently.

Client communications have also seen substantial improvements. The platform helps us draft clear, consistent, and timely communications, from routine emails to comprehensive status reports and presentations. This enhancement in communication quality has strengthened our client relationships and improved service delivery.

Perhaps most significantly, we’ve achieved remarkable efficiency gains across the organization. Our employees report saving an average of 2.7 hours per week in manual work, with user satisfaction rates exceeding 87%. The platform has enabled us to standardize our approach to insight generation, ensuring consistent, high-quality service delivery across all client accounts.

Looking ahead
As we expand Amazon Q Business access to all Kepler employees (over 500) in the coming months, we’re maintaining a thoughtful approach to deployment. We recognize that some clients have specific requirements regarding AI usage, and we’re carefully balancing innovation with client preferences. This strategic approach includes working to update client contracts and helping clients become more comfortable with AI integration while respecting their current guidelines.
Conclusion
Our experience with Amazon Q Business demonstrates how enterprise-grade AI can be successfully implemented while maintaining strict security standards and respecting client preferences. The platform has not only improved our operational efficiency but has also enhanced our ability to deliver consistent, high-quality service to our clients. What’s particularly impressive is the platform’s rapid deployment capabilities—we were able to implement the solution within weeks, without any coding requirements, and eliminate ongoing model maintenance and data source management expenses. As we continue to expand our use of Amazon Q Business, we’re excited about the potential for further innovation and efficiency gains in our digital marketing services.

About the authors
Evan Miller, Global Head of Product and Data Science, is a strategic product leader who joined Kepler in 2013. He owns the end-to-end product strategy for the Kepler Intelligence Platform (Kip). Under his leadership, Kip has garnered industry recognition, winning awards for Best Performance Management Solution and Best Commerce Technology, while driving significant business impact through innovative features like automated Machine Learning analytics and Marketing Mix Modeling technology.
Noah Kershaw leads the product team at Kepler Group, a global digital marketing agency that helps brands connect with their audiences through data-driven strategies. With a passion for innovation, Noah has been at the forefront of integrating AI solutions to enhance client services and streamline operations. His collaborative approach and enthusiasm for leveraging technology have been key in bringing Kepler’s “Future in Focus” vision to life, helping Kepler and its clients navigate the modern era of marketing with clarity and precision.
Valerie Renda, Director of Data Strategy & Analytics, has a specialized focus on data strategy, analytics, and marketing systems strategy within digital marketing, a field she’s worked in for over eight years. At Kepler, she has made significant contributions to various clients’ data management and martech strategies. She has been instrumental in leading data infrastructure projects, including customer data platform implementations, business intelligence visualization implementations, server-side tracking, martech consolidation, tag migrations, and more. She has also led the development of workflow tools to automate data processes and streamline ad operations to improve internal organizational processes.
Al Destefano is a Sr. Generative AI Specialist on the Amazon Q GTM team based in New York City. At AWS, he uses his technical knowledge and business experience to communicate the tangible enterprise benefits of using AWS managed generative AI services.
Sunanda Patel is a Senior Account Manager with over 15 years of expertise in management consulting and IT sectors, with a focus on business development and people management. Throughout her career, Sunanda has successfully managed diverse client relationships, ranging from non-profit to corporate and large multinational enterprises. Sunanda joined AWS in 2022 as an Account Manager for the Manhattan Commercial sector and now works with strategic commercial accounts, helping them grow in their cloud journey to achieve complex business goals.
Kumar Karra is a Sr. Solutions Architect at AWS supporting SMBs. He is an experienced engineer with deep expertise in the software development lifecycle. Kumar looks to solve challenging problems by applying technical, leadership, and business skills. He holds a Master's Degree in Computer Science and Machine Learning from Georgia Institute of Technology and is based in New York (US).

High-Entropy Token Selection in Reinforcement Learning with Verifiable Rewards (RLVR) Improves Accuracy and Reduces Training Cost for LLMs

Large language models (LLMs) generate step-by-step responses known as chains of thought (CoTs), where each token contributes to a coherent and logical narrative. To improve the quality of reasoning, various reinforcement learning techniques have been employed. These methods allow the model to learn from feedback mechanisms by aligning generated outputs with correctness criteria. As LLMs grow in complexity and capacity, researchers have begun probing the internal structure of token generation to discern patterns that enhance or limit performance. One area gaining attention is the token entropy distribution, a measurement of uncertainty in token prediction, which is now being linked to the model's ability to make meaningful logical decisions during reasoning.

A core issue in training reasoning models using reinforcement learning is treating all output tokens equally. When models are optimized using reinforcement learning with verifiable rewards (RLVR), the update process traditionally includes every token in the generated sequence, regardless of its functional role. This uniform treatment fails to distinguish tokens that lead to significant reasoning shifts from those that merely extend existing linguistic structures. As a result, a large portion of training resources may be directed at tokens that offer minimal contribution to the model’s reasoning capabilities. Without prioritizing the few tokens that play decisive roles in navigating different logic paths, these methods miss opportunities for focused and effective optimization.

Most RLVR frameworks, including Proximal Policy Optimization (PPO), Group Relative Policy Optimization (GRPO), and Dynamic sAmpling Policy Optimization (DAPO), function by evaluating entire sequences of token outputs against reward functions that assess correctness. PPO relies on stabilizing policy updates through a clipped objective function. GRPO improves upon this by estimating advantage values using grouped responses, rather than a separate value network. DAPO introduces additional enhancements, such as the clip-higher mechanism and overlong reward shaping. These methods, however, do not factor in token-level entropy or distinguish the importance of individual tokens in the reasoning chain, instead applying uniform gradient updates across the board.
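For reference, these methods all build on the clipped surrogate objective introduced by PPO; in standard notation (this is the textbook form, not notation taken from the paper under discussion):

L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\!\left(r_t(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_t\right)\right], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}

Here \hat{A}_t is the estimated advantage and \epsilon the clipping range; GRPO replaces the learned value baseline with group-relative advantage estimates, and DAPO raises the upper clipping bound via its clip-higher mechanism. Crucially, the expectation runs uniformly over every token t, which is exactly the uniform treatment this work revisits.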

In an attempt to refine how RLVR training impacts LLM reasoning, researchers from Alibaba Inc. and Tsinghua University presented a new methodology focused on token entropy patterns. They observed that in the CoT sequences generated by Qwen3 models, a small subset of tokens, roughly 20%, display significantly higher entropy. These tokens, labeled “forking tokens,” often correspond to moments where the model must decide between multiple reasoning paths. The remaining 80% of tokens typically exhibit low entropy and act as extensions of prior statements. By limiting policy gradient updates solely to these high-entropy tokens, the research team was able not only to maintain but, in many cases, improve performance on challenging reasoning benchmarks.

To quantify token entropy, the researchers used the entropy formula based on the probability distribution over possible token choices at each step. They found that over half of all generated tokens had entropy values below 0.01, indicating near-deterministic behavior. Only 20% exceeded an entropy of 0.672, marking them as the decision-making hubs within CoTs. High-entropy tokens often include logical operators and connective words such as “assume,” “since,” or “thus,” which introduce new conditions or transitions in logic. In contrast, low-entropy tokens included predictable symbols, suffixes, or code fragments. Through controlled experiments, it became clear that manipulating the entropy of these forking tokens directly influenced the model’s reasoning performance, while altering low-entropy tokens had little effect.
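To make the selection criterion concrete, the following is a minimal PyTorch sketch (an illustration, not the authors' code) that computes per-token entropy from a response's logits and masks everything but the top 20% highest-entropy "forking" tokens; the 0.2 ratio mirrors the threshold reported above.

import torch
import torch.nn.functional as F

def forking_token_mask(logits: torch.Tensor, keep_ratio: float = 0.2) -> torch.Tensor:
    """logits: [seq_len, vocab_size] for one generated response."""
    probs = F.softmax(logits, dim=-1)
    # H_t = -sum_v p_t(v) log p_t(v), clamped for numerical safety
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
    k = max(1, int(keep_ratio * entropy.numel()))
    threshold = torch.topk(entropy, k).values.min()
    return entropy >= threshold  # True only for high-entropy "forking" tokens

# In an RLVR update, such a mask could zero out the per-token policy-gradient loss for
# low-entropy tokens, e.g. masked_loss = (token_loss * mask).sum() / mask.sum().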

The research team conducted extensive experiments across three model sizes: Qwen3-8B, Qwen3-14B, and Qwen3-32B. When training only the top 20% high-entropy tokens, the Qwen3-32B model achieved a score of 63.5 on AIME’24 and 56.7 on AIME’25, both setting new performance benchmarks for models under 600B parameters. Furthermore, increasing the maximum response length from 20k to 29k raised the AIME’24 score to 68.1. In comparison, training on the bottom 80% of low-entropy tokens caused performance to drop significantly. The Qwen3-14B model showed gains of +4.79 on AIME’25 and +5.21 on AIME’24, while the Qwen3-8B maintained competitive results relative to full-token training. An ablation study further confirmed the importance of retaining the 20% threshold. Decreasing the fraction to 10% omitted essential decision points, and increasing it to 50% or 100% diluted the effect by including too many low-entropy tokens, thereby reducing entropy diversity and hindering exploration.

In essence, the research provides a new direction for enhancing the reasoning abilities of language models by identifying and selectively training on the minority of tokens that disproportionately contribute to reasoning success. It avoids inefficient training and instead proposes a scalable approach that aligns reinforcement learning objectives with actual decision-making moments in token sequences. The success of this strategy lies in using entropy as a guide to distinguish useful tokens from filler.

Several key takeaways from the research include:

Around 20% of tokens exhibit high entropy and serve as forking points that direct reasoning paths.

Training only on these high-entropy tokens delivers performance equal to or better than training on the full token set.

Qwen3-32B achieved scores of 63.5 on AIME’24 and 56.7 on AIME’25, outperforming larger models trained traditionally.

Extending response length from 20k to 29k further pushed the AIME’24 score to 68.1.

Training on the remaining 80% of low-entropy tokens led to sharp performance degradation.

Retaining the 20% threshold for high-entropy tokens optimally balances exploration and performance.

Larger models gain more from this strategy due to their capacity to benefit from enhanced exploration.

The strategy scales well and could guide more efficient training of next-generation reasoning models.

In conclusion, this research effectively rethinks the application of reinforcement learning to language models by introducing a focus on token-level entropy. By optimizing only the minority that influences reasoning paths, the method enhances performance while reducing computational overhead. It provides a practical roadmap for future efforts to improve reasoning in LLMs without unnecessary complexity.

Check out the Paper. All credit for this research goes to the researchers of this project.
The post High-Entropy Token Selection in Reinforcement Learning with Verifiable Rewards (RLVR) Improves Accuracy and Reduces Training Cost for LLMs appeared first on MarkTechPost.

How to Build an Asynchronous AI Agent Network Using Gemini for Research, Analysis, and Validation Tasks

In this tutorial, we introduce the Gemini Agent Network Protocol, a powerful and flexible framework designed to enable intelligent collaboration among specialized AI agents. Leveraging Google’s Gemini models, the protocol facilitates dynamic communication between agents, each equipped with distinct roles: Analyzer, Researcher, Synthesizer, and Validator. Users will learn to set up and configure an asynchronous agent network, enabling automated task distribution, collaborative problem-solving, and enriched dialogue management. Ideal for scenarios such as in-depth research, complex data analysis, and information validation, this framework empowers users to harness collective AI intelligence efficiently.

import asyncio
import json
import random
from dataclasses import dataclass, asdict
from typing import Dict, List, Optional, Any
from enum import Enum
import google.generativeai as genai

We import asyncio for concurrent execution, dataclasses for structured message management, and Google's Generative AI SDK (google.generativeai) to facilitate interactions among multiple AI-driven agents. The remaining imports supply typing and enum utilities for dynamic message handling and structured agent roles, enhancing scalability and flexibility in collaborative AI tasks.

API_KEY = None

try:
    import google.colab
    IN_COLAB = True
except ImportError:
    IN_COLAB = False

We initialize the API_KEY and detect whether the code is running in a Colab environment. If the google.colab module is successfully imported, the IN_COLAB flag is set to True; otherwise, it defaults to False, allowing the script to adjust behavior accordingly.

class AgentType(Enum):
    ANALYZER = "analyzer"
    RESEARCHER = "researcher"
    SYNTHESIZER = "synthesizer"
    VALIDATOR = "validator"

@dataclass
class Message:
    sender: str
    receiver: str
    content: str
    msg_type: str
    metadata: Dict = None

Check out the Notebook

We define the core structures for agent interaction. The AgentType enum categorizes agents into four distinct roles, Analyzer, Researcher, Synthesizer, and Validator, each with a specific function in the collaborative network. The Message dataclass represents the format for inter-agent communication, encapsulating sender and receiver IDs, message content, type, and optional metadata.
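As a quick illustration (not part of the original notebook), a Message can be constructed and serialized with dataclasses.asdict, which is how the network logs traffic later on:

msg = Message(sender="analyzer_1", receiver="researcher_1",
              content="Summarize recent quantum-safe cryptography findings", msg_type="task")
print(asdict(msg))
# -> {'sender': 'analyzer_1', 'receiver': 'researcher_1', 'content': '...', 'msg_type': 'task', 'metadata': None}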

class GeminiAgent:
    def __init__(self, agent_id: str, agent_type: AgentType, network: 'AgentNetwork'):
        self.id = agent_id
        self.type = agent_type
        self.network = network
        self.model = genai.GenerativeModel('gemini-2.0-flash')
        self.inbox = asyncio.Queue()
        self.context_memory = []

        self.system_prompts = {
            AgentType.ANALYZER: "You are a data analyzer. Break down complex problems into components and identify key patterns.",
            AgentType.RESEARCHER: "You are a researcher. Gather information and provide detailed context on topics.",
            AgentType.SYNTHESIZER: "You are a synthesizer. Combine information from multiple sources into coherent insights.",
            AgentType.VALIDATOR: "You are a validator. Check accuracy and consistency of information and conclusions."
        }

    async def process_message(self, message: Message):
        """Process incoming message and generate response"""
        if not API_KEY:
            return "API key not configured. Please set API_KEY variable."

        prompt = f"""
        {self.system_prompts[self.type]}

        Context from previous interactions: {json.dumps(self.context_memory[-3:], indent=2)}

        Message from {message.sender}: {message.content}

        Provide a focused response (max 100 words) that adds value to the network discussion.
        """

        try:
            response = await asyncio.to_thread(
                self.model.generate_content, prompt
            )
            return response.text.strip()
        except Exception as e:
            return f"Error processing: {str(e)}"

    async def send_message(self, receiver_id: str, content: str, msg_type: str = "task"):
        """Send message to another agent"""
        message = Message(self.id, receiver_id, content, msg_type)
        await self.network.route_message(message)

    async def broadcast(self, content: str, exclude_self: bool = True):
        """Broadcast message to all agents in network"""
        for agent_id in self.network.agents:
            if exclude_self and agent_id == self.id:
                continue
            await self.send_message(agent_id, content, "broadcast")

    async def run(self):
        """Main agent loop"""
        while True:
            try:
                message = await asyncio.wait_for(self.inbox.get(), timeout=1.0)

                response = await self.process_message(message)

                self.context_memory.append({
                    "from": message.sender,
                    "content": message.content,
                    "my_response": response
                })

                if len(self.context_memory) > 10:
                    self.context_memory = self.context_memory[-10:]

                print(f"{self.id} ({self.type.value}): {response}")

                if random.random() < 0.3:
                    other_agents = [aid for aid in self.network.agents.keys() if aid != self.id]
                    if other_agents:
                        target = random.choice(other_agents)
                        await self.send_message(target, f"Building on that: {response[:50]}...")

            except asyncio.TimeoutError:
                continue
            except Exception as e:
                print(f"Error in {self.id}: {e}")

Check out the Notebook

The GeminiAgent class defines the behavior and capabilities of each agent in the network. Upon initialization, it assigns a unique ID, role type, and a reference to the agent network and loads the Gemini 2.0 Flash model. It uses role-specific system prompts to generate intelligent responses based on incoming messages, which are processed asynchronously through a queue. Each agent maintains a context memory to retain recent interactions and can either respond directly, send targeted messages, or broadcast insights to others. The run() method continuously processes messages, promotes collaboration by occasionally initiating responses to other agents, and manages message handling in a non-blocking loop.

class AgentNetwork:
    def __init__(self):
        self.agents: Dict[str, GeminiAgent] = {}
        self.message_log = []
        self.running = False

    def add_agent(self, agent_type: AgentType, agent_id: Optional[str] = None):
        """Add new agent to network"""
        if not agent_id:
            agent_id = f"{agent_type.value}_{len(self.agents)+1}"

        agent = GeminiAgent(agent_id, agent_type, self)
        self.agents[agent_id] = agent
        print(f"Added {agent_id} to network")
        return agent_id

    async def route_message(self, message: Message):
        """Route message to target agent"""
        self.message_log.append(asdict(message))

        if message.receiver in self.agents:
            await self.agents[message.receiver].inbox.put(message)
        else:
            print(f"Agent {message.receiver} not found")

    async def initiate_task(self, task: str):
        """Start a collaborative task"""
        print(f"Starting task: {task}")

        analyzer_agents = [aid for aid, agent in self.agents.items()
                           if agent.type == AgentType.ANALYZER]

        if analyzer_agents:
            initial_message = Message("system", analyzer_agents[0], task, "task")
            await self.route_message(initial_message)

    async def run_network(self, duration: int = 30):
        """Run the agent network for specified duration"""
        self.running = True
        print(f"Starting agent network for {duration} seconds...")

        agent_tasks = [agent.run() for agent in self.agents.values()]

        try:
            await asyncio.wait_for(asyncio.gather(*agent_tasks), timeout=duration)
        except asyncio.TimeoutError:
            print("Network session completed")
        finally:
            self.running = False

Check out the Notebook

The AgentNetwork class manages the coordination and communication between all agents in the system. It allows dynamic addition of agents with unique IDs and specified roles, maintains a log of all exchanged messages, and facilitates message routing to the correct recipient. The network can initiate a collaborative task by sending the starting message to an Analyzer agent, and runs the full asynchronous event loop for a specified duration, enabling agents to operate concurrently and interactively within a shared environment.

async def demo_agent_network():
    """Demonstrate the Gemini Agent Network Protocol"""

    network = AgentNetwork()

    network.add_agent(AgentType.ANALYZER, "deep_analyzer")
    network.add_agent(AgentType.RESEARCHER, "info_gatherer")
    network.add_agent(AgentType.SYNTHESIZER, "insight_maker")
    network.add_agent(AgentType.VALIDATOR, "fact_checker")

    task = "Analyze the potential impact of quantum computing on cybersecurity"

    network_task = asyncio.create_task(network.run_network(20))
    await asyncio.sleep(1)
    await network.initiate_task(task)
    await network_task

    print(f"\nNetwork completed with {len(network.message_log)} messages exchanged")
    agent_participation = {aid: sum(1 for msg in network.message_log if msg['sender'] == aid)
                           for aid in network.agents}
    print("Agent participation:", agent_participation)

def setup_api_key():
    """Interactive API key setup"""
    global API_KEY

    if IN_COLAB:
        from google.colab import userdata
        try:
            API_KEY = userdata.get('GEMINI_API_KEY')
            genai.configure(api_key=API_KEY)
            print("API key loaded from Colab secrets")
            return True
        except:
            print("To use Colab secrets: Add 'GEMINI_API_KEY' in the secrets panel")

    print("Please enter your Gemini API key:")
    print("Get it from: https://makersuite.google.com/app/apikey")

    try:
        if IN_COLAB:
            from google.colab import userdata
            API_KEY = input("Paste your API key here: ").strip()
        else:
            import getpass
            API_KEY = getpass.getpass("Paste your API key here: ").strip()

        if API_KEY and len(API_KEY) > 10:
            genai.configure(api_key=API_KEY)
            print("API key configured successfully!")
            return True
        else:
            print("Invalid API key")
            return False
    except KeyboardInterrupt:
        print("\nSetup cancelled")
        return False

Check out the Notebook

The demo_agent_network() function orchestrates the entire agent workflow: it initializes an agent network, adds four role-specific agents, launches a cybersecurity task, and runs the network asynchronously for a fixed duration while tracking message exchanges and agent participation. Meanwhile, setup_api_key() provides an interactive mechanism to securely configure the Gemini API key, with tailored logic for both Colab and non-Colab environments, ensuring the AI agents can communicate with the Gemini model backend before the demo begins.

if __name__ == "__main__":
    print("Gemini Agent Network Protocol")
    print("=" * 40)

    if not setup_api_key():
        print("Cannot run without valid API key")
        exit()

    print("\nStarting demo...")

    if IN_COLAB:
        import nest_asyncio
        nest_asyncio.apply()
        loop = asyncio.get_event_loop()
        loop.run_until_complete(demo_agent_network())
    else:
        asyncio.run(demo_agent_network())

Finally, the above code serves as the entry point for executing the Gemini Agent Network Protocol. It begins by prompting the user to set up the Gemini API key, exiting if not provided. Upon successful configuration, the demo is launched. If running in Google Colab, it applies nest_asyncio to handle Colab’s event loop restrictions; otherwise, it uses Python’s native asyncio.run() to execute the asynchronous demo of agent collaboration.

In conclusion, by completing this tutorial, users gain practical knowledge of implementing an AI-powered collaborative network using Gemini agents. The hands-on experience provided here demonstrates how autonomous agents can effectively break down complex problems, collaboratively generate insights, and ensure the accuracy of information through validation.

Check out the Notebook. All credit for this research goes to the researchers of this project.
The post How to Build an Asynchronous AI Agent Network Using Gemini for Research, Analysis, and Validation Tasks appeared first on MarkTechPost.

Google Introduces Open-Source Full-Stack AI Agent Stack Using Gemini 2.5 and LangGraph for Multi-Step Web Search, Reflection, and Synthesis

Introduction: The Need for Dynamic AI Research Assistants

Conversational AI has rapidly evolved beyond basic chatbot frameworks. However, most large language models (LLMs) still suffer from a critical limitation—they generate responses based only on static training data, lacking the ability to self-identify knowledge gaps or perform real-time information synthesis. As a result, these models often deliver incomplete or outdated answers, particularly for evolving or niche topics.

To overcome these issues, AI agents must go beyond passive querying. They need to recognize informational gaps, perform autonomous web searches, validate results, and refine responses—effectively mimicking a human research assistant.

Google’s Full-Stack Research Agent: Gemini 2.5 + LangGraph

Google, in collaboration with contributors from Hugging Face and other open-source communities, has developed a full-stack research agent stack designed to solve this problem. Built with a React frontend and a FastAPI + LangGraph backend, this system combines language generation with intelligent control flow and dynamic web search.

The research agent stack utilizes the Gemini 2.5 API to process user queries, generating structured search terms. It then performs recursive search-and-reflection cycles using the Google Search API, verifying whether each result sufficiently answers the original query. This iterative process continues until the agent generates a validated, well-cited response.
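The control flow can be pictured with a short, illustrative Python sketch; the four callables below are hypothetical stand-ins for the Gemini-backed query generation, the Google Search call, the reflection step, and the final synthesis, not the project's actual API:

def research(question, generate_queries, web_search, find_gaps, synthesize_answer, max_rounds=3):
    queries = generate_queries(question)          # model proposes structured search terms
    evidence = []
    for _ in range(max_rounds):
        evidence += [web_search(q) for q in queries]
        gaps = find_gaps(question, evidence)      # reflection: what is still unanswered?
        if not gaps:
            break                                 # coverage is sufficient, stop searching
        queries = generate_queries(gaps)          # refine queries for the remaining gaps
    return synthesize_answer(question, evidence)  # final, citation-bearing answer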

Architecture Overview: Developer-Friendly and Extensible

Frontend: Built with Vite + React, offering hot reloading and clean module separation.

Backend: Powered by Python (3.8+), FastAPI, and LangGraph, enabling decision control, evaluation loops, and autonomous query refinement.

Key Directories: The agent logic resides in backend/src/agent/graph.py, while UI components are structured under frontend/.

Local Setup: Requires Node.js, Python, and a Gemini API Key. Run with make dev, or launch frontend/backend separately.

Endpoints:

Backend API: http://127.0.0.1:2024

Frontend UI: http://localhost:5173

This separation of concerns ensures that developers can easily modify the agent’s behavior or UI presentation, making the project suitable for global research teams and tech developers alike.

Technical Highlights and Performance

Reflective Looping: The LangGraph agent evaluates search results and identifies coverage gaps, autonomously refining queries without human intervention.

Delayed Response Synthesis: The AI waits until it gathers sufficient information before generating an answer.

Source Citations: Answers include embedded hyperlinks to original sources, improving trust and traceability.

Use Cases: Ideal for academic research, enterprise knowledge bases, technical support bots, and consulting tools where accuracy and validation matter.

Why It Matters: A Step Towards Autonomous Web Research

This system illustrates how autonomous reasoning and search synthesis can be integrated directly into LLM workflows. The agent doesn’t just respond—it investigates, verifies, and adapts. This reflects a broader shift in AI development: from stateless Q&A bots to real-time reasoning agents.

The agent enables developers, researchers, and enterprises in regions such as North America, Europe, India, and Southeast Asia to deploy AI research assistants with minimal setup. By using globally accessible tools like FastAPI, React, and Gemini APIs, the project is well-positioned for widespread adoption.

Key Takeaways

Agent Design: Modular React + LangGraph system supports autonomous query generation and reflection.

Iterative Reasoning: Agent refines search queries until confidence thresholds are met.

Citations Built-In: Outputs include direct links to web sources for transparency.

Developer-Ready: Local setup requires Node.js, Python 3.8+, and a Gemini API key.

Open-Source: Publicly available for community contribution and extension.

Conclusion

By combining Google’s Gemini 2.5 with LangGraph’s logic orchestration, this project delivers a breakthrough in autonomous AI reasoning. It showcases how research workflows can be automated without compromising accuracy or traceability. As conversational agents evolve, systems like this one set the standard for intelligent, trustworthy, and developer-friendly AI research tools.

Check out the GitHub Page. All credit for this research goes to the researchers of this project.
The post Google Introduces Open-Source Full-Stack AI Agent Stack Using Gemini 2.5 and LangGraph for Multi-Step Web Search, Reflection, and Synthesis appeared first on MarkTechPost.

Google AI Introduces Multi-Agent System Search MASS: A New AI Agent Optimization Framework for Better Prompts and Topologies

Multi-agent systems are becoming a critical development in artificial intelligence due to their ability to coordinate multiple large language models (LLMs) to solve complex problems. Instead of relying on a single model’s perspective, these systems distribute roles among agents, each contributing a unique function. This division of labor enhances the system’s ability to analyze, respond, and act in more robust ways. Whether applied to code debugging, data analysis, retrieval-augmented generation, or interactive decision-making, LLM-driven agents are achieving results that single models cannot consistently match. The power of these systems lies in their design, particularly the configuration of inter-agent connections, known as topologies, and the specific instructions given to each agent, referred to as prompts. As this model of computation matures, the challenge has shifted from proving feasibility to optimizing architecture and behavior for superior results.

One significant problem lies in the difficulty of designing these systems efficiently. When prompts, those structured inputs that guide each agent’s role, are slightly altered, performance can swing dramatically. This sensitivity makes scalability risky, especially when agents are linked together in workflows where one’s output serves as another’s input. Errors can propagate or even amplify. Moreover, topological decisions, such as determining the number of agents involved, their interaction style, and task sequence, are still heavily reliant on manual configuration and trial-and-error. The design space is vast and nonlinear, as it combines numerous options for both prompt engineering and topology construction. Optimizing both simultaneously has been largely out of reach for traditional design methods.

Several efforts have been made to improve various aspects of this design problem, but gaps remain. Methods like DSPy automate exemplar generation for prompts, while others focus on increasing the number of agents participating in tasks like voting. Tools like ADAS introduce code-based topological configurations through meta-agents. Some frameworks, such as AFlow, apply techniques like Monte Carlo Tree Search to explore combinations more efficiently. Yet, these solutions generally concentrate on either prompt or topology optimization, rather than both. This lack of integration limits their ability to generate MAS designs that are both intelligent and robust under complex operational conditions.

Researchers at Google and the University of Cambridge introduced a new framework named Multi-Agent System Search (Mass). This method automates MAS design by interleaving the optimization of both prompts and topologies in a staged approach. Unlike earlier attempts that treated the two components independently, Mass begins by identifying which elements, both prompts and topological structures, are most likely to influence performance. By narrowing the search to this influential subspace, the framework operates more efficiently while delivering higher-quality outcomes. The method progresses in three phases: localized prompt optimization, selection of effective workflow topologies based on the optimized prompts, and then global optimization of prompts at the system-wide level. The framework not only reduces computational overhead but also removes the burden of manual tuning from researchers.

The technical implementation of Mass is structured and methodical. First, each building block of a MAS undergoes prompt refinement. These blocks are agent modules with specific responsibilities, such as aggregation, reflection, or debate. For example, prompt optimizers generate variations that include both instructional guidance (e.g., “think step by step”) and example-based learning (e.g., one-shot or few-shot demos). The optimizer evaluates these using a validation metric to guide improvements. Once each agent’s prompt is optimized locally, the system proceeds to explore valid combinations of agents to form topologies. This topology optimization is informed by earlier results and constrained to a pruned search space identified as most influential. Finally, the best topology undergoes global-level prompt tuning, where instructions are fine-tuned in the context of the entire workflow to maximize collective efficiency.
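The staged procedure can be summarized in a schematic Python sketch; the helper functions are hypothetical placeholders for the paper's prompt optimizer, influence-based pruning, and workflow construction, so treat this as a paraphrase rather than the released implementation:

def mass_search(blocks, topology_space, score,
                optimize_prompt, prune_topologies, build_workflow, optimize_prompts_jointly):
    # Stage 1: block-level prompt optimization, each agent module tuned in isolation
    for block in blocks:
        block.prompt = optimize_prompt(block, score)
    # Stage 2: topology search restricted to the pruned, influential subspace
    candidates = prune_topologies(topology_space)
    best = max(candidates, key=lambda t: score(build_workflow(t, blocks)))
    # Stage 3: workflow-level (global) prompt optimization on the winning topology
    workflow = build_workflow(best, blocks)
    return optimize_prompts_jointly(workflow, score)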

In tasks such as reasoning, multi-hop understanding, and code generation, the optimized MAS consistently surpassed existing benchmarks. In performance testing using Gemini 1.5 Pro on the MATH dataset, prompt-optimized agents showed an average accuracy of around 84% with enhanced prompting techniques, compared to 76–80% for agents scaled through self-consistency or multi-agent debate. In the HotpotQA benchmark, using the debate topology within Mass yielded a 3% improvement. In contrast, other topologies, such as reflect or summarize, failed to yield gains or even led to a 15% degradation. On LiveCodeBench, the Executor topology provided a +6% boost, but methods like reflection again saw negative results. These findings validate that only a fraction of the topological design space contributes positively and reinforce the need for targeted optimization, such as that used in Mass.

Several key takeaways from the research include:

MAS design complexity is significantly influenced by prompt sensitivity and topological arrangement.

Prompt optimization, both at the block and system level, is more effective than agent scaling alone, as evidenced by the 84% accuracy with enhanced prompts versus 76% with self-consistency scaling.

Not all topologies are beneficial; debate added +3% in HotpotQA, while reflection caused a drop of up to -15%.

The Mass framework integrates prompt and topology optimization in three phases, drastically reducing computational and design burden.

Topologies like debate and executor are effective, while others, such as reflect and summarize, can degrade system performance.

Mass avoids full search complexity by pruning the design space based on early influence analysis, improving performance while saving resources.

The approach is modular and supports plug-and-play agent configurations, making it adaptable to various domains and tasks.

Final MAS models from Mass outperform state-of-the-art baselines across multiple benchmarks like MATH, HotpotQA, and LiveCodeBench.

In conclusion, this research identifies prompt sensitivity and topology complexity as major bottlenecks in multi-agent system (MAS) development and proposes a structured solution that strategically optimizes both areas. The Mass framework demonstrates a scalable, efficient approach to MAS design, minimizing the need for human input while maximizing performance. The research presents compelling evidence that better prompt design is more effective than merely adding agents and that targeted search within influential topology subsets leads to meaningful gains in real-world tasks.

Check out the Paper. All credit for this research goes to the researchers of this project.
The post Google AI Introduces Multi-Agent System Search MASS: A New AI Agent Optimization Framework for Better Prompts and Topologies appeared first on MarkTechPost.

ByteDance Researchers Introduce DetailFlow: A 1D Coarse-to-Fine Autoregressive Framework for Faster, Token-Efficient Image Generation

Autoregressive image generation has been shaped by advances in sequential modeling, originally seen in natural language processing. This field focuses on generating images one token at a time, similar to how sentences are constructed in language models. The appeal of this approach lies in its ability to maintain structural coherence across the image while allowing for high levels of control during the generation process. As researchers began to apply these techniques to visual data, they found that structured prediction not only preserved spatial integrity but also supported tasks like image manipulation and multimodal translation effectively.

Despite these benefits, generating high-resolution images remains computationally expensive and slow. A primary issue is the number of tokens needed to represent complex visuals. Raster-scan methods that flatten 2D images into linear sequences require thousands of tokens for detailed images, resulting in long inference times and high memory consumption. Models like Infinity need over 10,000 tokens for a 1024×1024 image. This becomes unsustainable for real-time applications or when scaling to more extensive datasets. Reducing the token burden while preserving or improving output quality has become a pressing challenge.

Efforts to mitigate token inflation have led to innovations like next-scale prediction seen in VAR and FlexVAR. These models create images by predicting progressively finer scales, which imitates the human tendency to sketch rough outlines before adding detail. However, they still rely on hundreds of tokens—680 in the case of VAR and FlexVAR for 256×256 images. Moreover, approaches like TiTok and FlexTok use 1D tokenization to compress spatial redundancy, but they often fail to scale efficiently. For example, FlexTok’s gFID increases from 1.9 at 32 tokens to 2.5 at 256 tokens, highlighting a degradation in output quality as the token count grows.

Researchers from ByteDance introduced DetailFlow, a 1D autoregressive image generation framework. This method arranges token sequences from global to fine detail using a process called next-detail prediction. Unlike traditional 2D raster-scan or scale-based techniques, DetailFlow employs a 1D tokenizer trained on progressively degraded images. This design allows the model to prioritize foundational image structures before refining visual details. By mapping tokens directly to resolution levels, DetailFlow significantly reduces token requirements, enabling images to be generated in a semantically ordered, coarse-to-fine manner.

The mechanism in DetailFlow centers on a 1D latent space where each token contributes incrementally more detail. Earlier tokens encode global features, while later tokens refine specific visual aspects. To train this, the researchers created a resolution mapping function that links token count to target resolution. During training, the model is exposed to images of varying quality levels and learns to predict progressively higher-resolution outputs as more tokens are introduced. It also implements parallel token prediction by grouping sequences and predicting entire sets at once. Since parallel prediction can introduce sampling errors, a self-correction mechanism was integrated. This system perturbs certain tokens during training and teaches subsequent tokens to compensate, ensuring that final images maintain structural and visual integrity.
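A toy sketch (an illustration, not ByteDance's implementation) of the two ideas described above, a token-count-to-resolution mapping and a grouped next-detail decoding loop with self-correction, might look like the following; the model methods and the decoder are hypothetical:

def resolution_for(num_tokens, base_res=64, max_res=256, tokens_per_level=32):
    # earlier tokens encode coarse structure; each additional group unlocks a finer resolution
    return min(max_res, base_res * 2 ** (num_tokens // tokens_per_level))

def generate_image(model, decode_to_image, total_tokens=128, group_size=8):
    tokens = []
    while len(tokens) < total_tokens:
        group = model.predict_next_group(tokens, group_size)  # parallel prediction of a token group
        group = model.self_correct(tokens, group)             # compensate for parallel sampling errors
        tokens.extend(group)
    return decode_to_image(tokens, resolution_for(len(tokens)))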

The results from the experiments on the ImageNet 256×256 benchmark were noteworthy. DetailFlow achieved a gFID score of 2.96 using only 128 tokens, outperforming VAR at 3.3 and FlexVAR at 3.05, both of which used 680 tokens. Even more impressive, DetailFlow-64 reached a gFID of 2.62 using 512 tokens. In terms of speed, it delivered nearly double the inference rate of VAR and FlexVAR. A further ablation study confirmed that the self-correction training and semantic ordering of tokens substantially improved output quality. For example, enabling self-correction dropped the gFID from 4.11 to 3.68 in one setting. These metrics demonstrate both higher quality and faster generation compared to established models.

By focusing on semantic structure and reducing redundancy, DetailFlow presents a viable solution to long-standing issues in autoregressive image generation. The method’s coarse-to-fine approach, efficient parallel decoding, and ability to self-correct highlight how architectural innovations can address performance and scalability limitations. Through their structured use of 1D tokens, the researchers from ByteDance have demonstrated a model that maintains high image fidelity while significantly reducing computational load, making it a valuable addition to image synthesis research.

Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.
The post ByteDance Researchers Introduce DetailFlow: A 1D Coarse-to-Fine Autoregressive Framework for Faster, Token-Efficient Image Generation appeared first on MarkTechPost.

A Comprehensive Coding Tutorial for Advanced SerpAPI Integration with Google Gemini-1.5-Flash for Advanced Analytics

In this tutorial, we demonstrate how to combine the power of SerpAPI’s Google search capabilities with Google’s Gemini-1.5-Flash model to create an advanced, end-to-end research and analysis workflow within a Google Colab notebook. By defining an AdvancedSerpAPI Python class, users gain access to enhanced search methods that cover general web results, news articles, and images, while also leveraging Gemini to perform in-depth analyses of those results. The code provides specialized utilities for targeting Marktechpost tutorials, aggregating content across categories like LangChain, ChatGPT, and MLOps, and then synthesizing actionable insights using a carefully constructed prompt.

!pip install google-search-results langchain-community langchain-core google-generativeai -q

import os
import json
from serpapi import GoogleSearch
import google.generativeai as genai
from datetime import datetime

We install the required Python packages for SerpAPI searches, LangChain utilities, and Google’s Gemini SDK. The subsequent imports bring in standard modules (os, json, datetime) for environment configuration, JSON handling, and timestamps, as well as SerpAPI’s GoogleSearch class for making API calls and genai for interacting with the Gemini model.

SERPAPI_API_KEY = "Use Your API Key Here"
GEMINI_API_KEY = "Use Your API Key Here"

os.environ["SERPAPI_API_KEY"] = SERPAPI_API_KEY
genai.configure(api_key=GEMINI_API_KEY)

We assign placeholder strings for your SerpAPI and Gemini API keys, then set the SerpAPI key as an environment variable (so SerpAPI calls authenticate automatically) and configure the Gemini client with its API key so you can invoke the Gemini model.

class AdvancedSerpAPI:
    def __init__(self, serpapi_key, gemini_key):
        self.serpapi_key = serpapi_key
        self.gemini_model = genai.GenerativeModel('gemini-1.5-flash')

    def search_google(self, query, num_results=5, location="United States"):
        """Enhanced Google search with multiple parameters"""
        params = {
            "engine": "google",
            "q": query,
            "api_key": self.serpapi_key,
            "num": num_results,
            "location": location,
            "hl": "en",
            "gl": "us"
        }

        search = GoogleSearch(params)
        results = search.get_dict()
        return self.extract_search_results(results)

    def search_news(self, query, days_back=7):
        """Search for recent news articles"""
        params = {
            "engine": "google_news",
            "q": query,
            "api_key": self.serpapi_key,
            "gl": "us",
            "hl": "en"
        }

        search = GoogleSearch(params)
        results = search.get_dict()
        return self.extract_news_results(results)

    def search_images(self, query, num_images=10):
        """Search for images with metadata"""
        params = {
            "engine": "google_images",
            "q": query,
            "api_key": self.serpapi_key,
            "num": num_images
        }

        search = GoogleSearch(params)
        results = search.get_dict()
        return self.extract_image_results(results)

    def extract_search_results(self, results):
        """Extract and clean search results"""
        cleaned_results = []
        if 'organic_results' in results:
            for result in results['organic_results']:
                cleaned_results.append({
                    'title': result.get('title', ''),
                    'link': result.get('link', ''),
                    'snippet': result.get('snippet', ''),
                    'position': result.get('position', 0)
                })
        return cleaned_results

    def extract_news_results(self, results):
        """Extract news articles with timestamps"""
        news_results = []
        if 'news_results' in results:
            for article in results['news_results']:
                news_results.append({
                    'title': article.get('title', ''),
                    'link': article.get('link', ''),
                    'snippet': article.get('snippet', ''),
                    'date': article.get('date', ''),
                    'source': article.get('source', '')
                })
        return news_results

    def extract_image_results(self, results):
        """Extract image results with metadata"""
        image_results = []
        if 'images_results' in results:
            for img in results['images_results']:
                image_results.append({
                    'title': img.get('title', ''),
                    'original': img.get('original', ''),
                    'thumbnail': img.get('thumbnail', ''),
                    'source': img.get('source', '')
                })
        return image_results

    def analyze_with_gemini(self, search_results, analysis_prompt):
        """Use Gemini Flash to analyze search results"""
        results_text = json.dumps(search_results, indent=2)

        full_prompt = f"""
        {analysis_prompt}

        Search Results Data:
        {results_text}

        Please provide a comprehensive analysis based on the search results.
        """

        try:
            response = self.gemini_model.generate_content(full_prompt)
            return response.text
        except Exception as e:
            return f"Gemini analysis failed: {str(e)}"

    def search_marktechpost_tutorials(self, topic="", num_results=10):
        """Search specifically for trending tutorials from Marktechpost"""
        queries = [
            f"site:marktechpost.com {topic} tutorial guide how-to 2024 2025",
            f"site:marktechpost.com trending {topic} tutorial",
            f"site:marktechpost.com top {topic} books frameworks"
        ]

        all_results = []
        for query in queries:
            params = {
                "engine": "google",
                "q": query,
                "api_key": self.serpapi_key,
                "num": num_results // len(queries),
                "hl": "en",
                "gl": "us"
            }

            search = GoogleSearch(params)
            results = search.get_dict()
            extracted = self.extract_search_results(results)
            all_results.extend(extracted)

        unique_results = []
        seen_links = set()
        for result in all_results:
            if result['link'] not in seen_links:
                unique_results.append(result)
                seen_links.add(result['link'])

        return unique_results[:num_results]

    def get_trending_marktechpost_content(self, categories=None):
        """Get trending content from Marktechpost across different categories"""
        if categories is None:
            categories = ["AI", "LLM", "Machine Learning", "Python", "Tutorial", "Framework"]

        trending_content = {}

        for category in categories:
            print(f"Searching for trending {category} content...")
            results = self.search_marktechpost_tutorials(category, num_results=5)
            trending_content[category] = results
            print(f"Found {len(results)} {category} tutorials")

        return trending_content

    def smart_research(self, topic, research_depth="medium", focus_marktechpost=True):
        """Intelligent research combining multiple search types with Marktechpost focus"""
        print(f"Starting smart research on: {topic}")

        if focus_marktechpost:
            marktechpost_results = self.search_marktechpost_tutorials(topic, num_results=8)
            print(f"Found {len(marktechpost_results)} Marktechpost tutorials")

            web_results = self.search_google(f"{topic} tutorial guide", num_results=3)
            print(f"Found {len(web_results)} additional web results")

            all_web_results = marktechpost_results + web_results
        else:
            all_web_results = self.search_google(f"{topic} overview facts", num_results=5)
            print(f"Found {len(all_web_results)} web results")

        news_results = self.search_news(topic)
        print(f"Found {len(news_results)} news articles")

        analysis_prompt = f"""
        Analyze the search results about '{topic}' with focus on Marktechpost content and provide:
        1. Key tutorials and guides available
        2. Trending topics and frameworks
        3. Learning resources and books mentioned
        4. Recent developments and updates
        5. Practical implementation guides
        6. Recommended learning path

        Focus on actionable insights and learning resources.
        """

        all_results = {
            "marktechpost_results": marktechpost_results if focus_marktechpost else [],
            "web_results": all_web_results,
            "news_results": news_results,
            "search_topic": topic,
            "timestamp": datetime.now().isoformat()
        }

        gemini_analysis = self.analyze_with_gemini(all_results, analysis_prompt)

        return {
            "topic": topic,
            "marktechpost_tutorials": marktechpost_results if focus_marktechpost else [],
            "web_results": all_web_results,
            "news_results": news_results,
            "ai_analysis": gemini_analysis,
            "total_sources": len(all_web_results) + len(news_results)
        }

This class, AdvancedSerpAPI, encapsulates SerpAPI-based search methods (web, news, and images) and helper functions to clean the resulting JSON data. It also integrates a Gemini-1.5-Flash model, via analyze_with_gemini, to generate an AI-driven summary of any collected search data. Additional utilities include specialized Marktechpost tutorial lookups, a “get trending” routine across categories, and a combined “smart research” workflow that stitches together tutorials, web results, news, and Gemini analysis.
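For example, once valid keys are configured, a minimal usage sketch that calls only the methods defined above would be:

searcher = AdvancedSerpAPI(SERPAPI_API_KEY, GEMINI_API_KEY)
hits = searcher.search_google("LangGraph agent tutorial", num_results=3)
print(searcher.analyze_with_gemini(hits, "Summarize the main themes across these results."))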

def demo_marktechpost_tutorials():
    """Demo specifically focused on Marktechpost tutorials"""

    searcher = AdvancedSerpAPI(SERPAPI_API_KEY, GEMINI_API_KEY)

    print("Marktechpost Trending Tutorials Finder")
    print("=" * 50)

    print("\nDemo 1: Trending Marktechpost Tutorials by Category")
    trending_content = searcher.get_trending_marktechpost_content([
        "LangChain", "ChatGPT", "Python", "AI", "MLOps"
    ])

    for category, tutorials in trending_content.items():
        print(f"\nTrending {category} Tutorials:")
        for i, tutorial in enumerate(tutorials[:3], 1):
            print(f" {i}. {tutorial['title']}")
            print(f"    {tutorial['link']}")
            if tutorial['snippet']:
                print(f"    {tutorial['snippet'][:100]}...")

    print("\nDemo 2: Deep Dive – LangChain Tutorials")
    langchain_research = searcher.smart_research("LangChain", focus_marktechpost=True)

    print(f"\nResearch Summary:")
    print(f"Topic: {langchain_research['topic']}")
    print(f"Marktechpost Tutorials Found: {len(langchain_research['marktechpost_tutorials'])}")
    print(f"Total Sources: {langchain_research['total_sources']}")

    print(f"\nAI Analysis Preview:")
    print(langchain_research['ai_analysis'][:600] + "..." if len(langchain_research['ai_analysis']) > 600 else langchain_research['ai_analysis'])

    print("\nDemo 3: Latest AI Trends from Marktechpost")
    ai_trends = searcher.search_marktechpost_tutorials("AI trends 2024 2025", num_results=5)

    print("Recent AI trend articles:")
    for i, article in enumerate(ai_trends, 1):
        print(f"{i}. {article['title']}")
        print(f"   {article['link']}")

def demo_advanced_serpapi():
    """Comprehensive demo of SerpAPI capabilities"""

    searcher = AdvancedSerpAPI(SERPAPI_API_KEY, GEMINI_API_KEY)

    print("Advanced SerpAPI Tutorial with Gemini Flash")
    print("=" * 50)

    print("\nDemo 1: Smart Research on AI Technology")
    research_results = searcher.smart_research("artificial intelligence 2024 trends")

    print(f"\nResearch Summary:")
    print(f"Topic: {research_results['topic']}")
    print(f"Total Sources: {research_results['total_sources']}")

    print(f"\nAI Analysis Preview:")
    print(research_results['ai_analysis'][:500] + "..." if len(research_results['ai_analysis']) > 500 else research_results['ai_analysis'])

    print("\nDemo 2: Recent News Search")
    tech_news = searcher.search_news("technology breakthrough", days_back=7)

    print(f"Found {len(tech_news)} recent tech news articles:")
    for i, article in enumerate(tech_news[:3], 1):
        print(f"{i}. {article['title'][:80]}...")
        print(f"   Source: {article['source']} | Date: {article['date']}")

    print("\nDemo 3: Image Search")
    space_images = searcher.search_images("space exploration 2024", num_images=5)

    print(f"Found {len(space_images)} space-related images:")
    for i, img in enumerate(space_images[:3], 1):
        print(f"{i}. {img['title'][:60]}...")
        print(f"   Source: {img['source']}")

demo_marktechpost_tutorials() initializes the AdvancedSerpAPI class and prints trending tutorials from Marktechpost for a list of categories (LangChain, ChatGPT, Python, AI, MLOps). It then performs a “deep dive” smart research on “LangChain,” showing counts of tutorials and a preview of Gemini’s AI analysis. Finally, it retrieves and lists the top five recent “AI trends 2024–2025” articles from Marktechpost. 

Also, demo_advanced_serpapi() creates an AdvancedSerpAPI instance but focuses on a broader workflow: it runs smart research on “artificial intelligence 2024 trends” and prints the topic summary and AI analysis snippet. It then performs a news search for “technology breakthrough,” lists the first three articles with sources and dates, and concludes by fetching and displaying a handful of “space exploration 2024” image results.

if __name__ == "__main__":
    # Compare against the placeholder value assigned earlier in the tutorial
    if SERPAPI_API_KEY == "Use Your API Key Here" or GEMINI_API_KEY == "Use Your API Key Here":
        print("Please set your API keys before running the demo!")
        print("1. Get SerpAPI key from: https://serpapi.com")
        print("2. Get Gemini API key from: https://makersuite.google.com")
    else:
        print("Running Marktechpost-focused demo...")
        demo_marktechpost_tutorials()

        print("\n" + "=" * 50)
        print("Running general demo...")
        demo_advanced_serpapi()

def compare_search_engines(query, engines=['google', 'bing', 'duckduckgo']):
    """Compare results across different search engines"""
    results = {}

    for engine in engines:
        params = {
            "engine": engine,
            "q": query,
            "api_key": SERPAPI_API_KEY
        }

        try:
            search = GoogleSearch(params)
            results[engine] = search.get_dict()
        except Exception as e:
            results[engine] = {"error": str(e)}

    return results

def trending_searches(location="United States"):
    """Get trending searches"""
    params = {
        "engine": "google_trends_trending_now",
        "api_key": SERPAPI_API_KEY,
        "geo": location
    }

    search = GoogleSearch(params)
    return search.get_dict()

print("Advanced SerpAPI Tutorial with Marktechpost Focus loaded successfully!")
print("Remember to set your API keys before running demos")
print("New Functions: search_marktechpost_tutorials, get_trending_marktechpost_content")
print("Marktechpost-specific features: LangChain, ChatGPT, Python, AI, MLOps tutorials")

print("\nQuick Start Examples:")
print("searcher = AdvancedSerpAPI(SERPAPI_API_KEY, GEMINI_API_KEY)")
print("langchain_tutorials = searcher.search_marktechpost_tutorials('LangChain')")
print("trending_ai = searcher.get_trending_marktechpost_content(['AI', 'Python'])")
print("research = searcher.smart_research('ChatGPT', focus_marktechpost=True)")

Finally, the section includes a Python “main” guard that first verifies your SerpAPI and Gemini keys, prompting you to obtain them if they’re still placeholders, and otherwise runs the Marktechpost‐focused and general demos in sequence. It also defines two utility functions: compare_search_engines, which queries multiple search engines (Google, Bing, DuckDuckGo) via SerpAPI and returns their raw JSON results or errors, and trending_searches, which fetches today’s trending topics using the Google Trends endpoint. After these definitions, the script prints a brief status message confirming that the tutorial loaded successfully, reminds you to set your API keys, and highlights newly added methods for fetching Marktechpost tutorials and trending content.
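As a quick, hedged example of the two utilities (actual output depends on your SerpAPI plan and on the raw response each engine returns):

comparison = compare_search_engines("retrieval augmented generation", engines=["google", "bing"])
print({engine: ("error" in result) for engine, result in comparison.items()})

trends = trending_searches("United States")
print(list(trends.keys())[:5])  # top-level keys of the raw Google Trends response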

In conclusion, by following this tutorial, users will have a reusable, modular Python class that streamlines web research and analysis, from performing keyword-driven searches to automatically summarizing findings using Gemini-powered AI. The combination of SerpAPI’s reliable search endpoints and Gemini’s natural language understanding enables a seamless “research-to-insights” workflow, ideal for content creators, developers, and technical teams who need to stay up-to-date with the latest tutorials and industry trends.

Check out the Notebook here. All credit for this research goes to the researchers of this project.
The post A Comprehensive Coding Tutorial for Advanced SerpAPI Integration with Google Gemini-1.5-Flash for Advanced Analytics appeared first on MarkTechPost.