A Coding Implementation to Build a Document Search Agent (DocSearchAgent) with Hugging Face, ChromaDB, and Langchain

In today’s information-rich world, finding relevant documents quickly is crucial. Traditional keyword-based search systems often fall short when dealing with semantic meaning. This tutorial demonstrates how to build a powerful document search engine using:

Hugging Face’s embedding models to convert text into rich vector representations

Chroma DB as our vector database for efficient similarity search

Sentence transformers for high-quality text embeddings

This implementation enables semantic search capabilities – finding documents based on meaning rather than just keyword matching. By the end of this tutorial, you’ll have a working document search engine that can:

Process and embed text documents

Store these embeddings efficiently

Retrieve the most semantically similar documents to any query

Handle a variety of document types and search needs

Please follow the detailed steps mentioned below in sequence to implement DocSearchAgent.

First, we need to install the necessary libraries. 

!pip install chromadb sentence-transformers langchain datasets

Let’s start by importing the libraries we’ll use:

import os
import numpy as np
import pandas as pd
from datasets import load_dataset
import chromadb
from chromadb.utils import embedding_functions
from sentence_transformers import SentenceTransformer
from langchain.text_splitter import RecursiveCharacterTextSplitter
import time

For this tutorial, we’ll use a subset of Wikipedia articles from the Hugging Face datasets library. This gives us a diverse set of documents to work with.

dataset = load_dataset("wikipedia", "20220301.en", split="train[:1000]")
print(f"Loaded {len(dataset)} Wikipedia articles")

documents = []
for i, article in enumerate(dataset):
    doc = {
        "id": f"doc_{i}",
        "title": article["title"],
        "text": article["text"],
        "url": article["url"]
    }
    documents.append(doc)

df = pd.DataFrame(documents)
df.head(3)

Now, let’s split our documents into smaller chunks for more granular searching:

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
)

chunks = []
chunk_ids = []
chunk_sources = []

for i, doc in enumerate(documents):
    doc_chunks = text_splitter.split_text(doc["text"])
    chunks.extend(doc_chunks)
    chunk_ids.extend([f"chunk_{i}_{j}" for j in range(len(doc_chunks))])
    chunk_sources.extend([doc["title"]] * len(doc_chunks))

print(f"Created {len(chunks)} chunks from {len(documents)} documents")

We’ll use a pre-trained sentence transformer model from Hugging Face to create our embeddings:

model_name = "sentence-transformers/all-MiniLM-L6-v2"
embedding_model = SentenceTransformer(model_name)

sample_text = "This is a sample text to test our embedding model."
sample_embedding = embedding_model.encode(sample_text)
print(f"Embedding dimension: {len(sample_embedding)}")
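As an optional sanity check (not part of the original tutorial flow), you can compare a few sentences with cosine similarity to confirm that semantically related text yields closer vectors; the sentences below are purely illustrative:

# Optional: verify that the embeddings capture semantic similarity.
from sentence_transformers import util

sentence_a = "The cat sat on the mat."
sentence_b = "A feline was resting on the rug."
sentence_c = "Quarterly revenue grew by ten percent."

emb_a, emb_b, emb_c = embedding_model.encode([sentence_a, sentence_b, sentence_c])

# Related sentences should score noticeably higher than unrelated ones.
print(f"a vs b: {util.cos_sim(emb_a, emb_b).item():.3f}")
print(f"a vs c: {util.cos_sim(emb_a, emb_c).item():.3f}")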

Now, let’s set up Chroma DB, a lightweight vector database perfect for our search engine:

chroma_client = chromadb.Client()

embedding_function = embedding_functions.SentenceTransformerEmbeddingFunction(model_name=model_name)

collection = chroma_client.create_collection(
    name="document_search",
    embedding_function=embedding_function
)

batch_size = 100
for i in range(0, len(chunks), batch_size):
    end_idx = min(i + batch_size, len(chunks))

    batch_ids = chunk_ids[i:end_idx]
    batch_chunks = chunks[i:end_idx]
    batch_sources = chunk_sources[i:end_idx]

    collection.add(
        ids=batch_ids,
        documents=batch_chunks,
        metadatas=[{"source": source} for source in batch_sources]
    )

    print(f"Added batch {i//batch_size + 1}/{(len(chunks)-1)//batch_size + 1} to the collection")

print(f"Total documents in collection: {collection.count()}")

Now comes the exciting part – searching through our documents:

def search_documents(query, n_results=5):
    """
    Search for documents similar to the query.

    Args:
        query (str): The search query
        n_results (int): Number of results to return

    Returns:
        dict: Search results
    """
    start_time = time.time()

    results = collection.query(
        query_texts=[query],
        n_results=n_results
    )

    end_time = time.time()
    search_time = end_time - start_time

    print(f"Search completed in {search_time:.4f} seconds")
    return results

queries = [
    "What are the effects of climate change?",
    "History of artificial intelligence",
    "Space exploration missions"
]

for query in queries:
    print(f"\nQuery: {query}")
    results = search_documents(query)

    for i, (doc, metadata) in enumerate(zip(results['documents'][0], results['metadatas'][0])):
        print(f"\nResult {i+1} from {metadata['source']}:")
        print(f"{doc[:200]}...")

Let’s create a simple function to provide a better user experience:

def interactive_search():
    """
    Interactive search interface for the document search engine.
    """
    while True:
        query = input("\nEnter your search query (or 'quit' to exit): ")

        if query.lower() == 'quit':
            print("Exiting search interface...")
            break

        n_results = int(input("How many results would you like? "))

        results = search_documents(query, n_results)

        print(f"\nFound {len(results['documents'][0])} results for '{query}':")

        for i, (doc, metadata, distance) in enumerate(zip(
            results['documents'][0],
            results['metadatas'][0],
            results['distances'][0]
        )):
            relevance = 1 - distance
            print(f"\n--- Result {i+1} ---")
            print(f"Source: {metadata['source']}")
            print(f"Relevance: {relevance:.2f}")
            print(f"Excerpt: {doc[:300]}...")
            print("-" * 50)

interactive_search()

Let’s add the ability to filter our search results by metadata:

def filtered_search(query, filter_source=None, n_results=5):
    """
    Search with optional filtering by source.

    Args:
        query (str): The search query
        filter_source (str): Optional source to filter by
        n_results (int): Number of results to return

    Returns:
        dict: Search results
    """
    where_clause = {"source": filter_source} if filter_source else None

    results = collection.query(
        query_texts=[query],
        n_results=n_results,
        where=where_clause
    )

    return results

unique_sources = list(set(chunk_sources))
print(f"Available sources for filtering: {len(unique_sources)}")
print(unique_sources[:5])

if len(unique_sources) > 0:
    filter_source = unique_sources[0]
    query = "main concepts and principles"

    print(f"\nFiltered search for '{query}' in source '{filter_source}':")
    results = filtered_search(query, filter_source=filter_source)

    for i, doc in enumerate(results['documents'][0]):
        print(f"\nResult {i+1}:")
        print(f"{doc[:200]}...")

In conclusion, we demonstrated how to build a semantic document search engine using Hugging Face embedding models and ChromaDB. The system retrieves documents based on meaning rather than keywords by transforming text into vector representations. The implementation processes Wikipedia articles, chunks them for granularity, embeds them using sentence transformers, and stores them in a vector database for efficient retrieval. The final product features interactive searching, metadata filtering, and relevance ranking.

Here is the Colab Notebook.


Streamline AWS resource troubleshooting with Amazon Bedrock Agents and …

As AWS environments grow in complexity, troubleshooting issues with resources can become a daunting task. Manually investigating and resolving problems can be time-consuming and error-prone, especially when dealing with intricate systems. Fortunately, AWS provides a powerful tool called AWS Support Automation Workflows, which is a collection of curated AWS Systems Manager self-service automation runbooks. These runbooks are created by AWS Support Engineering with best practices learned from solving customer issues. They enable AWS customers to troubleshoot, diagnose, and remediate common issues with their AWS resources.
Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI. Using Amazon Bedrock, you can experiment with and evaluate top FMs for your use case, privately customize them with your data using techniques such as fine-tuning and Retrieval Augmented Generation (RAG), and build agents that execute tasks using your enterprise systems and data sources. Because Amazon Bedrock is serverless, you don’t have to manage infrastructure, and you can securely integrate and deploy generative AI capabilities into your applications using the AWS services you are already familiar with.
In this post, we explore how to use the power of Amazon Bedrock Agents and AWS Support Automation Workflows to create an intelligent agent capable of troubleshooting issues with AWS resources.
Solution overview
Although the solution is versatile and can be adapted to use a variety of AWS Support Automation Workflows, we focus on a specific example: troubleshooting an Amazon Elastic Kubernetes Service (Amazon EKS) worker node that failed to join a cluster. The following diagram provides a high-level overview of troubleshooting agents with Amazon Bedrock.

Our solution is built around the following key components that work together to provide a seamless and efficient troubleshooting experience:

Amazon Bedrock Agents – Amazon Bedrock Agents acts as the intelligent interface between users and AWS Support Automation Workflows. It processes natural language queries to understand the issue context and manages conversation flow to gather required information. The agent uses Anthropic’s Claude 3.5 Sonnet model for advanced reasoning and response generation, enabling natural interactions throughout the troubleshooting process.
Amazon Bedrock agent action groups – These action groups define the structured API operations that the Amazon Bedrock agent can invoke. Using OpenAPI specifications, they define the interface between the agent and AWS Lambda functions, specifying the available operations, required parameters, and expected responses. Each action group contains the API schema that tells the agent how to properly format requests and interpret responses when interacting with Lambda functions.
Lambda function – The Lambda function acts as the integration layer between the Amazon Bedrock agent and AWS Support Automation Workflows (SAW). It validates input parameters from the agent and initiates the appropriate SAW runbook execution. It monitors the automation progress while processing the technical output into a structured format. When the workflow is complete, it returns formatted results back to the agent for user presentation.
IAM role – The AWS Identity and Access Management (IAM) role provides the Lambda function with the necessary permissions to execute AWS Support Automation Workflows and interact with required AWS services. This role follows the principle of least privilege to maintain security best practices.
AWS Support Automation Workflows – These pre-built diagnostic runbooks are developed by AWS Support Engineering. The workflows execute comprehensive system checks based on AWS best practices in a standardized, repeatable manner. They cover a wide range of AWS services and common issues, encapsulating AWS Support’s extensive troubleshooting expertise.

The following steps outline the workflow of our solution:

Users start by describing their AWS resource issue in natural language through the Amazon Bedrock chat console. For example, “Why isn’t my EKS worker node joining the cluster?”
The Amazon Bedrock agent analyzes the user’s question and matches it to the appropriate action defined in its OpenAPI schema. If essential information is missing, such as a cluster name or instance ID, the agent engages in a natural conversation to gather the required parameters. This makes sure that necessary data is collected before proceeding with the troubleshooting workflow.
The Lambda function receives the validated request and triggers the corresponding AWS Support Automation Workflow. These SAW runbooks contain comprehensive diagnostic checks developed by AWS Support Engineering to identify common issues and their root causes. The checks run automatically without requiring user intervention.
The SAW runbook systematically executes its diagnostic checks and compiles the findings. These results, including identified issues and configuration problems, are structured in JSON format and returned to the Lambda function.
The Amazon Bedrock agent processes the diagnostic results using chain of thought (CoT) reasoning, based on the ReAct (synergizing reasoning and acting) technique. This enables the agent to analyze the technical findings, identify root causes, generate clear explanations, and provide step-by-step remediation guidance.

During the reasoning phase of the agent, the user is able to view the reasoning steps.
Troubleshooting examples
Let’s take a closer look at a common issue we mentioned earlier and how our agent can assist in troubleshooting it.
EKS worker node failed to join EKS cluster
When an EKS worker node fails to join an EKS cluster, our Amazon Bedrock agent can be invoked with the relevant information: cluster name and worker node ID. The agent will execute the corresponding AWS Support Automation Workflow, which will perform checks like verifying the worker node’s IAM role permissions and verifying the necessary network connectivity.
The automation workflow runs all the checks. The Amazon Bedrock agent then ingests the troubleshooting results, explains the root cause of the issue to the user, and suggests remediation steps based on the AWSSupport-TroubleshootEKSWorkerNode output, such as updating the worker node's IAM role or resolving network configuration issues, enabling the user to take the necessary actions to resolve the problem.
OpenAPI example
When you create an action group in Amazon Bedrock, you must define the parameters that the agent needs to invoke from the user. You can also define API operations that the agent can invoke using these parameters. To define the API operations, we will create an OpenAPI schema in JSON:
"Body_troubleshoot_eks_worker_node_troubleshoot_eks_worker_node_post": {
    "properties": {
        "cluster_name": {
            "type": "string",
            "title": "Cluster Name",
            "description": "The name of the EKS cluster"
        },
        "worker_id": {
            "type": "string",
            "title": "Worker Id",
            "description": "The ID of the worker node"
        }
    },
    "type": "object",
    "required": [
        "cluster_name",
        "worker_id"
    ],
    "title": "Body_troubleshoot_eks_worker_node_troubleshoot_eks_worker_node_post"
}

The schema consists of the following components:

Body_troubleshoot_eks_worker_node_troubleshoot_eks_worker_node_post – This is the name of the schema, which corresponds to the request body for the troubleshoot-eks-worker-node POST endpoint.
Properties – This section defines the properties (fields) of the schema:

"cluster_name" – This property represents the name of the EKS cluster. It is a string type and has a title and description.
"worker_id" – This property represents the ID of the worker node. It is also a string type and has a title and description.

Type – This property specifies that the schema is an "object" type, meaning it is a collection of key-value pairs.
Required – This property lists the required fields for the schema, which in this case are "cluster_name" and "worker_id". These fields must be provided in the request body.
Title – This property provides a human-readable title for the schema, which can be used for documentation purposes.

The OpenAPI schema defines the structure of the request body. To learn more, see Define OpenAPI schemas for your agent’s action groups in Amazon Bedrock and OpenAPI specification.
Lambda function code
Now let’s explore the Lambda function code:
@app.post("/troubleshoot-eks-worker-node")
@tracer.capture_method
def troubleshoot_eks_worker_node(
    cluster_name: Annotated[str, Body(description="The name of the EKS cluster")],
    worker_id: Annotated[str, Body(description="The ID of the worker node")]
) -> dict:
    """
    Troubleshoot EKS worker node that failed to join the cluster.

    Args:
        cluster_name (str): The name of the EKS cluster.
        worker_id (str): The ID of the worker node.

    Returns:
        dict: The output of the Automation execution.
    """
    return execute_automation(
        automation_name='AWSSupport-TroubleshootEKSWorkerNode',
        parameters={
            'ClusterName': [cluster_name],
            'WorkerID': [worker_id]
        },
        execution_mode='TroubleshootWorkerNode'
    )

The code consists of the following components:

@app.post("/troubleshoot-eks-worker-node") – This decorator sets up a route for a POST request to the /troubleshoot-eks-worker-node endpoint.
@tracer.capture_method – This decorator is used for tracing and monitoring purposes, typically as part of an application performance monitoring (APM) tool. It captures information about the execution of the function, such as duration, errors, and other metrics.
cluster_name: Annotated[str, Body(description="The name of the EKS cluster")] – This parameter specifies that cluster_name is a string and is expected to be passed in the request body. The Body annotation indicates that the parameter should be extracted from the request body, and the description explains what it represents.
worker_id: Annotated[str, Body(description="The ID of the worker node")] – This parameter specifies that worker_id is also a string and is expected to be passed in the request body.
-> dict – This is the return type of the function: a dictionary containing the output of the Automation execution.

To link a new SAW runbook in the Lambda function, you can follow the same template.
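The execute_automation helper called above is not shown in this excerpt. The following is a minimal sketch of what such a helper might look like using the AWS Systems Manager Automation API through boto3; the polling logic, the returned field names, and the way execution_mode is handled are assumptions rather than the repository's actual implementation:

import time
import boto3

ssm = boto3.client("ssm")

def execute_automation(automation_name: str, parameters: dict, execution_mode: str) -> dict:
    # Start the SAW runbook (an SSM Automation document).
    response = ssm.start_automation_execution(
        DocumentName=automation_name,
        Parameters=parameters,
    )
    execution_id = response["AutomationExecutionId"]

    # Poll until the automation finishes (simplified; production code should add a timeout and backoff).
    while True:
        execution = ssm.get_automation_execution(
            AutomationExecutionId=execution_id
        )["AutomationExecution"]
        status = execution["AutomationExecutionStatus"]
        if status not in ("Pending", "InProgress", "Waiting"):
            break
        time.sleep(10)

    # Return a structured result for the agent to interpret. How execution_mode is
    # actually consumed is an assumption here; it is simply echoed back.
    return {
        "execution_mode": execution_mode,
        "execution_id": execution_id,
        "status": status,
        "outputs": execution.get("Outputs", {}),
    }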
Prerequisites
Make sure you have the following prerequisites:

An AWS account
Access for Anthropic’s Claude 3.5 Sonnet model enabled in Amazon Bedrock
Your credentials configured in the AWS Command Line Interface (AWS CLI)
Node.js installed (required for the AWS CDK)
The AWS Cloud Development Kit (AWS CDK) 143.0

Deploy the solution
Complete the following steps to deploy the solution:

Clone the GitHub repository and go to the root of your downloaded repository folder:

$ git clone https://github.com/aws-samples/sample-bedrock-agent-for-troubleshooting-aws-resources.git
$ cd sample-bedrock-agent-for-troubleshooting-aws-resources

Install local dependencies:

$ npm install

Sign in to your AWS account using the AWS CLI by configuring your credential file (replace <PROFILE_NAME> with the profile name of your deployment AWS account):

$ export AWS_PROFILE=<PROFILE_NAME>

Bootstrap the AWS CDK environment (this is a one-time activity and is not needed if your AWS account is already bootstrapped):

$ cdk bootstrap

Run the script to replace the placeholders for your AWS account and AWS Region in the config files:

$ cdk deploy --all
Test the agent
Navigate to the Amazon Bedrock Agents console in your Region and find your deployed agent. You will find the agent ID in the cdk deploy command output.
You can now interact with the agent and test troubleshooting a worker node not joining an EKS cluster. The following are some example questions:

I want to troubleshoot why my Amazon EKS worker node is not joining the cluster. Can you help me?
Why is instance <instance_ID> not able to join the EKS cluster <Cluster_Name>?

The following screenshot shows the console view of the agent.

The agent understood the question and mapped it to the right action group. It also spotted that the required parameters were missing from the user prompt, so it came back with a follow-up question requesting the Amazon Elastic Compute Cloud (Amazon EC2) instance ID and the EKS cluster name.

We can see the agent's thought process in trace step 1. The agent determines that it is ready to call the right Lambda function with the right API path.

With the results coming back from the runbook, the agent reviews the troubleshooting outcome, then writes up a solution with instructions for the user to follow.
In the answer provided, the agent spotted all the issues and turned them into solution steps. We can also see the agent citing the right details, such as the IAM policy and the required tag.
Clean up
When implementing Amazon Bedrock Agents, there are no additional charges for resource construction. However, costs are incurred for embedding model and text model invocations on Amazon Bedrock, with charges based on the pricing of each FM used. In this use case, you will also incur costs for Lambda invocations.
To avoid incurring future charges, delete the resources created by the AWS CDK. From the root of your repository folder, run the following command:
$ npm run cdk destroy --all
Conclusion
Amazon Bedrock Agents and AWS Support Automation Workflows are powerful tools that, when combined, can revolutionize AWS resource troubleshooting. In this post, we explored a serverless application built with the AWS CDK that demonstrates how these technologies can be integrated to create an intelligent troubleshooting agent. By defining action groups within the Amazon Bedrock agent and associating them with specific scenarios and automation workflows, we’ve developed a highly efficient process for diagnosing and resolving issues such as Amazon EKS worker node failures.
Our solution showcases the potential for automating complex troubleshooting tasks, saving time and streamlining operations. Powered by Anthropic’s Claude 3.5 Sonnet, the agent demonstrates improved understanding and responding in languages other than English, such as French, Japanese, and Spanish, making it accessible to global teams while maintaining its technical accuracy and effectiveness. The intelligent agent quickly identifies root causes and provides actionable insights, while automatically executing relevant AWS Support Automation Workflows. This approach not only minimizes downtime, but also scales effectively to accommodate various AWS services and use cases, making it a versatile foundation for organizations looking to enhance their AWS infrastructure management.
Explore the AWS Support Automation Workflow for additional use cases and consider using this solution as a starting point for building more comprehensive troubleshooting agents tailored to your organization’s needs. To learn more about using agents to orchestrate workflows, see Automate tasks in your application using conversational agents. For details about using guardrails to safeguard your generative AI applications, refer to Stop harmful content in models using Amazon Bedrock Guardrails.
Happy coding!
Acknowledgements
The authors thank all the reviewers for their valuable feedback.

About the Authors
Wael Dimassi is a Technical Account Manager at AWS, building on his 7-year background as a Machine Learning specialist. He enjoys learning about AWS AI/ML services and helping customers meet their business outcomes by building solutions for them.
Marwen Benzarti is a Senior Cloud Support Engineer at AWS Support where he specializes in Infrastructure as Code. With over 4 years at AWS and 2 years of previous experience as a DevOps engineer, Marwen works closely with customers to implement AWS best practices and troubleshoot complex technical challenges. Outside of work, he enjoys playing both competitive multiplayer and immersive story-driven video games.

Create generative AI agents that interact with your companies’ syste …

Today, we are announcing the general availability of Amazon Bedrock in Amazon SageMaker Unified Studio.
Companies of all sizes face mounting pressure to operate efficiently as they manage growing volumes of data, systems, and customer interactions. Manual processes and fragmented information sources can create bottlenecks and slow decision-making, limiting teams from focusing on higher-value work. Generative AI agents offer a powerful solution by automatically interfacing with company systems, executing tasks, and delivering instant insights, helping organizations scale operations without scaling complexity.
Amazon Bedrock in SageMaker Unified Studio addresses these challenges by providing a unified service for building AI-driven solutions that centralize customer data and enable natural language interactions. It integrates with existing applications and includes key Amazon Bedrock features like foundation models (FMs), prompts, knowledge bases, agents, flows, evaluation, and guardrails. Users can access these AI capabilities through their organization’s single sign-on (SSO), collaborate with team members, and refine AI applications without needing AWS Management Console access.
Generative AI-powered agents for automated workflows
Amazon Bedrock in SageMaker Unified Studio allows you to create and deploy generative AI agents that integrate with organizational applications, databases, and third-party systems, enabling natural language interactions across the entire technology stack. The chat agent bridges complex information systems and user-friendly communication. By using Amazon Bedrock functions and Amazon Bedrock Knowledge Bases, the agent can connect with data sources like JIRA APIs for real-time project status tracking, retrieve customer information, update project tasks, and manage preferences.
Sales and marketing teams can quickly access customer information and their meeting preferences, and project managers can efficiently manage JIRA tasks and timelines. This streamlined process enhances productivity and customer interactions across the organization.
The following diagram illustrates the generative AI agent solution workflow.

Solution overview
Amazon Bedrock provides a governed collaborative environment to build and share generative AI applications within SageMaker Unified Studio. Let’s look at an example solution for implementing a customer management agent:

An agentic chat can be built with Amazon Bedrock chat applications, and integrated with functions that can be quickly built with other AWS services such as AWS Lambda and Amazon API Gateway.
SageMaker Unified Studio, using Amazon DataZone, provides a comprehensive data management solution through its integrated services. Organization administrators can control member access to Amazon Bedrock models and features, maintaining secure identity management and granular access control.

Before we dive deep into the deployment of the AI agent, let’s walk through the key steps of the architecture, as shown in the following diagram.

The workflow is as follows:

The user logs into SageMaker Unified Studio using their organization’s SSO from AWS IAM Identity Center. Then the user interacts with the chat application using natural language.
The Amazon Bedrock chat application uses a function to retrieve JIRA status and customer information from the database through the endpoint using API Gateway.
The chat application authenticates with API Gateway to securely access the endpoint, using an API key stored in AWS Secrets Manager, and triggers the Lambda function based on the user's request.
The Lambda function performs the actions by calling the JIRA API or database with the required parameters provided by the agent (a minimal sketch of such a handler follows this list). The agent has the capability to:

Provide a brief customer overview.
List recent customer interactions.
Retrieve the meeting preferences for a customer.
Retrieve open JIRA tickets for a project.
Update the due date for a JIRA ticket.
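To make the integration concrete, here is a minimal sketch of how such a Lambda function might handle one of these requests and call the JIRA REST API. The event shape, the environment variable and secret names (JIRA_API_KEY_ARN, JIRA_URL, JIRA_USER_NAME), and the JQL query are assumptions based on the description above, not the actual repository code:

import json
import os
import boto3
import requests  # assumes the requests library is packaged with the Lambda function

secrets = boto3.client("secretsmanager")

def _get_secret(secret_id: str) -> str:
    # Fetch a plaintext secret value from AWS Secrets Manager.
    return secrets.get_secret_value(SecretId=secret_id)["SecretString"]

def lambda_handler(event, context):
    # Assumption: API Gateway proxies the agent's request body as JSON.
    body = json.loads(event.get("body", "{}"))
    project_id = body.get("project_id", "CRM")

    jira_url = _get_secret(os.environ["JIRA_URL"])
    jira_user = _get_secret(os.environ["JIRA_USER_NAME"])
    jira_token = _get_secret(os.environ["JIRA_API_KEY_ARN"])

    # List open tickets for the project via the JIRA Cloud REST API.
    response = requests.get(
        f"{jira_url}/rest/api/3/search",
        params={"jql": f"project = {project_id} AND statusCategory != Done"},
        auth=(jira_user, jira_token),
        timeout=30,
    )
    response.raise_for_status()
    issues = [
        {"key": issue["key"], "summary": issue["fields"]["summary"]}
        for issue in response.json().get("issues", [])
    ]

    return {"statusCode": 200, "body": json.dumps({"open_tickets": issues})}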

Prerequisites
You need the following prerequisites to follow along with this solution implementation:

An AWS account
User access to Amazon Bedrock in SageMaker Unified Studio
Model access to Amazon Nova Pro on Amazon Bedrock in a supported AWS Region
A JIRA application, JIRA URL, and a JIRA API token to your account

We assume you are familiar with fundamental serverless constructs on AWS, such as API Gateway, Lambda functions, and IAM Identity Center. We don’t focus on defining these services in this post, but we do use them to show use cases for the new Amazon Bedrock features within SageMaker Unified Studio.
Deploy the solution
Complete the following deployment steps:

Download the code from GitHub.
Get the value of JIRA_API_KEY_ARN, JIRA_URL, and JIRA_USER_NAME for the Lambda function.
Use the following AWS CloudFormation template, and refer to Create a stack from the CloudFormation console to launch the stack in your preferred AWS Region.
After the stack is deployed, note down the API Gateway URL value from the CloudFormation Outputs tab (ApiInvokeURL).
On the Secrets Manager console, find the secrets for JIRA_API_KEY_ARN, JIRA_URL, and JIRA_USER_NAME.
Choose Retrieve secret and copy the variables from Step 2 to the secret plaintext string.
Sign in to SageMaker Unified Studio using your organization’s SSO.

Create a new project
Complete the following steps to create a new project:

On the SageMaker Unified Studio landing page, create a new project.
Give the project a name (for example, crm-agent).
Choose Generative AI application development profile and continue.
Use the default settings and continue.
Review and choose Create project to confirm.

Build the chat agent application
Complete the following steps to build the chat agent application:

Under the New section located to the right of the crm-agent project landing page, choose Chat agent.

It has a list of configurations for your agent application.

Under the model section, choose a desired FM supported by Amazon Bedrock. For this crm-agent, we choose Amazon Nova Pro.
In the system prompt section, add the following prompt. Optionally, you could add examples of user input and model responses to improve it.

You are a customer relationship management agent tasked with helping a sales person plan their work with customers. You are provided with an API endpoint. This endpoint can provide information like company overview, company interaction history (meeting times and notes), company meeting preferences (meeting type, day of week, and time of day). You can also query Jira tasks and update their timeline. After receiving a response, clean it up into a readable format. If the output is a numbered list, format it as such with newline characters and numbers.

In the Functions section, choose Create a new function.
Give the function a name, such as crm_agent_calling.
For Function schema, use the OpenAPI definition from the GitHub repo.

For Authentication method, choose API Keys (Max. 2 Keys) and enter the following details:

For Key sent in, choose Header.
For Key name, enter x-api-key.
For Key value, enter the API key stored in Secrets Manager.

In the API servers section, input the endpoint URL.
Choose Create to finish the function creation.
In the Functions section of the chat agent application, choose the function you created and choose Save to finish the application creation.

Example interactions
In this section, we explore two example interactions.
Use case 1: CRM analyst can retrieve customer details stored in the database with natural language.
For this use case, we ask the following questions in the chat application:

Give me a brief overview of customer C-jkl101112.
List the last 2 recent interactions for customer C-def456.
What communication method does customer C-mno131415 prefer?
Recommend optimal time and contact channel to reach out to C-ghi789 based on their preferences and our last interaction.

The response from the chat application is shown in the following screenshot. The agent successfully retrieves the customer’s information from the database. It understands the user’s question and queries the database to find corresponding answers.

Use case 2: Project managers can list and update the JIRA ticket.
In this use case, we ask the following questions:

What are the open JIRA Tasks for project id CRM?
Please update JIRA Task CRM-3 to 1 week out.

The response from the chat application is shown in the following screenshot. Similar to the previous use case, the agent accesses the JIRA board and fetches the JIRA project information. It provides a list of open JIRA tasks and updates the timeline of the task following the user’s request.

Clean up
To avoid incurring additional costs, complete the following steps:

Delete the CloudFormation stack.
Delete the function component in Amazon Bedrock.
Delete the chat agent application in Amazon Bedrock.
Delete the domains in SageMaker Unified Studio.

Cost
Amazon Bedrock in SageMaker Unified Studio doesn’t incur separate charges, but you will be charged for the individual AWS services and resources utilized within the service. You only pay for the Amazon Bedrock resources you use, without minimum fees or upfront commitments.
If you need further assistance with pricing calculations or have questions about optimizing costs for your specific use case, please reach out to AWS Support or consult with your account manager.
Conclusion
In this post, we demonstrated how to use Amazon Bedrock in SageMaker Unified Studio to build a generative AI application to integrate with an existing endpoint and database.
The generative AI features of Amazon Bedrock transform how organizations build and deploy AI solutions by enabling rapid agent prototyping and deployment. Teams can swiftly create, test, and launch chat agent applications, accelerating the implementation of AI solutions that automate complex tasks and enhance decision-making capabilities. The solution’s scalability and flexibility allow organizations to seamlessly integrate advanced AI capabilities into existing applications, databases, and third-party systems.
Through a unified chat interface, agents can handle project management, data retrieval, and workflow automation—significantly reducing manual effort while enhancing user experience. By making advanced AI capabilities more accessible and user-friendly, Amazon Bedrock in SageMaker Unified Studio empowers organizations to achieve new levels of productivity and customer satisfaction in today’s competitive landscape.
Try out Amazon Bedrock in SageMaker Unified Studio for your own use case, and share your questions in the comments.

About the Authors
Jady Liu is a Senior AI/ML Solutions Architect on the AWS GenAI Labs team based in Los Angeles, CA. With over a decade of experience in the technology sector, she has worked across diverse technologies and held multiple roles. Passionate about generative AI, she collaborates with major clients across industries to achieve their business goals by developing scalable, resilient, and cost-effective generative AI solutions on AWS. Outside of work, she enjoys traveling to explore wineries and distilleries.
Justin Ossai is a GenAI Labs Specialist Solutions Architect based in Dallas, TX. He is a highly passionate IT professional with over 15 years of technology experience. He has designed and implemented solutions with on-premises and cloud-based infrastructure for small and enterprise companies.

Asure’s approach to enhancing their call center experience using gen …

Asure, a company of over 600 employees, is a leading provider of cloud-based workforce management solutions designed to help small and midsized businesses streamline payroll and human resources (HR) operations and ensure compliance. Their offerings include a comprehensive suite of human capital management (HCM) solutions for payroll and tax, HR compliance services, time tracking, 401(k) plans, and more.
Asure anticipated that generative AI could aid contact center leaders to understand their team’s support performance, identify gaps and pain points in their products, and recognize the most effective strategies for training customer support representatives using call transcripts. The Asure team was manually analyzing thousands of call transcripts to uncover themes and trends, a process that lacked scalability. The overarching goal of this engagement was to improve upon this manual approach. Failing to adopt a more automated approach could have potentially led to decreased customer satisfaction scores and, consequently, a loss in future revenue. Therefore, it was valuable to provide Asure a post-call analytics pipeline capable of providing beneficial insights, thereby enhancing the overall customer support experience and driving business growth.
Asure recognized the potential of generative AI to further enhance the user experience and better understand the needs of the customer and wanted to find a partner to help realize it.
Pat Goepel, chairman and CEO of Asure, shares,

“In collaboration with the AWS Generative AI Innovation Center, we are utilizing Amazon Bedrock, Amazon Comprehend, and Amazon Q in QuickSight to understand trends in our own customer interactions, prioritize items for product development, and detect issues sooner so that we can be even more proactive in our support for our customers. Our partnership with AWS and our commitment to be early adopters of innovative technologies like Amazon Bedrock underscore our dedication to making advanced HCM technology accessible for businesses of any size.”
“We are thrilled to partner with AWS on this groundbreaking generative AI project. The robust AWS infrastructure and advanced AI capabilities provide the perfect foundation for us to innovate and push the boundaries of what’s possible. This collaboration will enable us to deliver cutting-edge solutions that not only meet but exceed our customers’ expectations. Together, we are poised to transform the landscape of AI-driven technology and create unprecedented value for our clients.”
—Yasmine Rodriguez, CTO of Asure.
“As we embarked on our journey at Asure to integrate generative AI into our solutions, finding the right partner was crucial. Being able to partner with the Gen AI Innovation Center at AWS brings not only technical expertise with AI but the experience of developing solutions at scale. This collaboration confirms that our AI solutions are not just innovative but also resilient. Together, we believe that we can harness the power of AI to drive efficiency, enhance customer experiences, and stay ahead in a rapidly evolving market.”
—John Canada, VP of Engineering at Asure.

In this post, we explore why Asure used the Amazon Web Services (AWS) post-call analytics (PCA) pipeline that generated insights across call centers at scale with the advanced capabilities of generative AI-powered services such as Amazon Bedrock and Amazon Q in QuickSight. Asure chose this approach because it provided in-depth consumer analytics, categorized call transcripts around common themes, and empowered contact center leaders to use natural language to answer queries. This ultimately allowed Asure to provide its customers with improvements in product and customer experiences.
Solution Overview
At a high level, the solution consists of first converting audio into transcripts using Amazon Transcribe and generating and evaluating summary fields for each transcript using Amazon Bedrock. In addition, Q&A can be done at a single call level using Amazon Bedrock or for many calls using Amazon Q in QuickSight. In the rest of this section, we describe these components and the services used in greater detail.
We added upon the existing PCA solution with the following services:

Amazon Bedrock
Amazon Q in QuickSight

Customer service and call center operations are highly dynamic, with evolving customer expectations, market trends, and technological advancements reshaping the industry at a rapid pace. Staying ahead in this competitive landscape demands agile, scalable, and intelligent solutions that can adapt to changing demands.
In this context, Amazon Bedrock emerges as an exceptional choice for developing a generative AI-powered solution to analyze customer service call transcripts. This fully managed service provides access to cutting-edge foundation models (FMs) from leading AI providers, enabling the seamless integration of state-of-the-art language models tailored for text analysis tasks. Amazon Bedrock offers fine-tuning capabilities that allow you to customize these pre-trained models using proprietary call transcript data, facilitating high accuracy and relevance without the need for extensive machine learning (ML) expertise. Moreover, Amazon Bedrock offers integration with other AWS services like Amazon SageMaker, which streamlines the deployment process, and its scalable architecture makes sure the solution can adapt to increasing call volumes effortlessly.
With robust security measures, data privacy safeguards, and a cost-effective pay-as-you-go model, Amazon Bedrock offers a secure, flexible, and cost-efficient service to harness generative AI’s potential in enhancing customer service analytics, ultimately leading to improved customer experiences and operational efficiencies.
Furthermore, by integrating a knowledge base containing organizational data, policies, and domain-specific information, the generative AI models can deliver more contextual, accurate, and relevant insights from the call transcripts. This knowledge base allows the models to understand and respond based on the company’s unique terminology, products, and processes, enabling deeper analysis and more actionable intelligence from customer interactions.
In this use case, Amazon Bedrock is used for both generation of summary fields for sample call transcripts and evaluation of these summary fields against a ground truth dataset. Its value comes from its simple integration into existing pipelines and various evaluation frameworks. Amazon Bedrock also allows you to choose various models for different use cases, making it an obvious choice for the solution due to its flexibility. Using Amazon Bedrock allows for iteration of the solution using knowledge bases for simple storage and access of call transcripts as well as guardrails for building responsible AI applications.
Amazon Bedrock
Amazon Bedrock is a fully managed service that makes FMs available through an API, so you can choose from a wide range of FMs to find the model that is best suited for your use case. With the Amazon Bedrock serverless experience, you can get started quickly, privately customize FMs with your own data, and quickly integrate and deploy them into your applications using AWS tools without having to manage the infrastructure.
Amazon Q in QuickSight
Amazon Q in QuickSight is a generative AI assistant that accelerates decision-making and enhances business productivity with generative business intelligence (BI) capabilities.
The original PCA solution includes the following services:

AWS Lambda
Amazon Simple Storage Service (Amazon S3)
Amazon CloudFront
Amazon Athena
Amazon Comprehend
Amazon Transcribe
Amazon Cognito

The solution consisted of the following components:

Call metadata generation – After the file ingestion step when transcripts are generated for each call transcript using Amazon Transcribe, Anthropic’s Claude Haiku FM in Amazon Bedrock is used to generate call-related metadata. This includes a summary, the category, the root cause, and other high-level fields generated from a call transcript. This is orchestrated using AWS Step Functions.
Individual call Q&A – For questions requiring a specific call, such as, “How did the customer react in call ID X,” Anthropic’s Claude Haiku is used to power a Q&A assistant located in a CloudFront application. This is powered by the web app portion of the architecture diagram (provided in the next section).
Aggregate call Q&A – To answer questions spanning multiple calls, such as "What are the most common issues detected," Amazon Q in QuickSight is used to enhance the Agent Assist interface. This step is shown by business analysts interacting with QuickSight through natural language in the storage and visualization step.

To learn more about the architectural components of the PCA solution, including file ingestion, insight extraction, storage and visualization, and web application components, refer to Post call analytics for your contact center with Amazon language AI services.
Architecture
The following diagram illustrates the solution architecture. The evaluation framework, call metadata generation, and Amazon Q in QuickSight were new components introduced from the original PCA solution.

Ragas and a human-in-the-loop UI (as described in the customer blogpost with Tealium) were used to evaluate the metadata generation and individual call Q&A portions. Ragas is an open source evaluation framework that helps evaluate FM-generated text.
The high-level takeaways from this work are the following:

Anthropic's Claude 3 Haiku successfully took in a call transcript and determined its summary, root cause, whether the issue was resolved, whether the call was a callback, and next steps for the customer and agent (the generative AI-powered fields). Using Anthropic's Claude 3 Haiku rather than Anthropic's Claude Instant reduced latency. With chain-of-thought reasoning, there was an increase in overall quality (how factual, understandable, and relevant responses are on a 1–5 scale, described in more detail later in this post) as measured by subject matter experts (SMEs). With Amazon Bedrock, various models can be chosen based on different use cases, illustrating its flexibility in this application.
Amazon Q in QuickSight proved to be a powerful analytical tool in understanding and generating relevant insights from data through intuitive chart and table visualizations. It can perform simple calculations whenever necessary while also facilitating deep dives into issues and exploring data from multiple perspectives, demonstrating great value in insight generation.
The human-in-the-loop UI plus Ragas metrics proved effective for evaluating the outputs of FMs used throughout the pipeline. In particular, answer correctness, answer relevance, faithfulness, and summarization metrics (alignment and coverage score) were used to evaluate the call metadata generation and individual call Q&A components using Amazon Bedrock. Its flexibility across FMs allowed many types of models to be tested for generating evaluation metrics, including Anthropic's Claude 3.5 Sonnet and Anthropic's Claude 3 Haiku.

Call metadata generation
The call metadata generation pipeline consisted of converting an audio file to a call transcript in a JSON format using Amazon Transcribe and then generating key information for each transcript using Amazon Bedrock and Amazon Comprehend. The following diagram shows a subset of the preceding architecture diagram that demonstrates this.

The original PCA post linked previously shows how Amazon Transcribe and Amazon Comprehend are used in the metadata generation pipeline.
The call transcript input produced by the Amazon Transcribe step of the Step Functions workflow followed the format in the following code example:

{
    "call_id": <call id>,
    "agent_id": <agent_id>,
    "customer_id": <customer_id>,
    "transcript": """
        Agent: <Agent message>
        Customer: <Customer message>
        Agent: <Agent message>
        Customer: <Customer message>
        ...
    """
}

Metadata was generated using Amazon Bedrock. Specifically, it extracted the summary, root cause, topic, and next steps, and answered key questions such as if the call was a callback and if the issue was ultimately resolved.
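As an illustration of this step, the following is a hedged sketch of generating one metadata field from a transcript with Anthropic's Claude 3 Haiku through the Amazon Bedrock Runtime API. The prompt text and output handling are simplified stand-ins; in the actual pipeline, prompts are loaded from DynamoDB (described below):

import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

def generate_call_summary(transcript: str) -> str:
    # Simplified placeholder prompt; the real pipeline loads prompts from DynamoDB.
    prompt = (
        "You are analyzing a customer support call. Summarize the call, state the root cause, "
        "and note whether the issue was resolved.\n\nTranscript:\n" + transcript
    )
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 512,
        "messages": [{"role": "user", "content": [{"type": "text", "text": prompt}]}],
    }
    response = bedrock_runtime.invoke_model(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",
        body=json.dumps(body),
    )
    result = json.loads(response["body"].read())
    return result["content"][0]["text"]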
Prompts were stored in Amazon DynamoDB, allowing Asure to quickly modify prompts or add new generative AI-powered fields based on future enhancements. The following screenshot shows how prompts can be modified through DynamoDB.

Individual call Q&A
The chat assistant powered by Anthropic’s Claude Haiku was used to answer natural language queries on a single transcript. This assistant, the call metadata values generated from the previous section, and sentiments generated from Amazon Comprehend were displayed in an application hosted by CloudFront.
The user of the final chat assistant can modify the prompt in DynamoDB. The following screenshot shows the general prompt for an individual call Q&A.

The UI hosted by CloudFront allows an agent or supervisor to analyze a specific call to extract additional details. The following screenshot shows the insights Asure gleaned for a sample customer service call.

The following screenshot shows the chat assistant, which exists in the same webpage.

Evaluation Framework
This section outlines components of the evaluation framework used. It ultimately allows Asure to highlight important metrics for their use case and provides visibility into the generative AI application’s strengths and weaknesses. This was done using automated quantitative metrics provided by Ragas, DeepEval, and traditional ML metrics as well as human-in-the-loop evaluation done by SMEs.
Quantitative Metrics
The results of the generated call metadata and individual call Q&A were evaluated using quantitative metrics provided by Ragas (answer correctness, answer relevance, and faithfulness) and DeepEval (alignment and coverage), both powered by FMs from Amazon Bedrock. Amazon Bedrock's straightforward integration with these external libraries meant it could be configured within the existing evaluation code. In addition, traditional ML metrics were used for "Yes/No" answers. The following are the metrics used for different components of the solution:

Call metadata generation – This included the following:

Summary – Alignment and coverage (find a description of these metrics in the DeepEval repository) and answer correctness
Issue resolved, callback – F1-score and accuracy
Topic, next steps, root cause – Answer correctness, answer relevance, and faithfulness

Individual call Q&A – Answer correctness, answer relevance, and faithfulness
Human in the loop – Both individual call Q&A and call metadata generation used human-in-the-loop metrics

For a description of answer correctness, answer relevance, and faithfulness, refer to the customer blogpost with Tealium.
The use of Amazon Bedrock in the evaluation framework allowed for a flexibility of different models based on different use cases. For example, Anthropic’s Claude Sonnet 3.5 was used to generate DeepEval metrics, whereas Anthropic’s Claude 3 Haiku (with its low latency) was ideal for Ragas.
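For readers who want to reproduce this kind of scoring, the following is a rough sketch of computing Ragas metrics over a small evaluation set with Bedrock-hosted models acting as judge and embedder. Exact class names, column names, and wrapper behavior vary across ragas and langchain-aws versions, so treat this as an outline rather than the project's actual evaluation code:

from datasets import Dataset
from langchain_aws import ChatBedrock, BedrockEmbeddings
from ragas import evaluate
from ragas.metrics import answer_correctness, answer_relevancy, faithfulness

# Tiny illustrative evaluation set; the real project used SME-curated ground truth.
eval_data = Dataset.from_dict({
    "question": ["Was the customer's issue resolved?"],
    "answer": ["Yes, the agent reset the password and confirmed access."],
    "contexts": [["Agent: I have reset your password. Customer: Great, it works now."]],
    "ground_truth": ["Yes, the issue was resolved during the call."],
})

# Bedrock-hosted judge and embedding models (model IDs are assumptions).
judge_llm = ChatBedrock(model_id="anthropic.claude-3-haiku-20240307-v1:0")
embedder = BedrockEmbeddings(model_id="amazon.titan-embed-text-v2:0")

results = evaluate(
    eval_data,
    metrics=[faithfulness, answer_correctness, answer_relevancy],
    llm=judge_llm,
    embeddings=embedder,
)
print(results)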
Human in the Loop
The human-in-the-loop UI is described in the Human-in-the-Loop section of the customer blogpost with Tealium. To use it to evaluate this solution, some changes had to be made:

There is a choice for the user to analyze one of the generated metadata fields for a call (such as a summary) or a specific Q&A pair.
The user can bring in two model outputs for comparison. This can include outputs from the same FMs but using different prompts, outputs from different FMs but using the same prompt, and outputs from different FMs and using different prompts.
Additional checks for fluency, coherence, creativity, toxicity, relevance, completeness, and overall quality were added, where the user adds in a measure of this metric based on the model output from a range of 0–4.

The following screenshots show the UI.

The human-in-the-loop system establishes a feedback loop between domain experts and Amazon Bedrock outputs. This in turn leads to improved generative AI applications and, ultimately, higher customer trust in such systems.
To demo the human-in-the-loop UI, follow the instructions in the GitHub repo.
Natural Language Q&A using Amazon Q in QuickSight
QuickSight, integrated with Amazon Q, enabled Asure to use natural language queries for comprehensive customer analytics. By interpreting queries on sentiments, call volumes, issue resolutions, and agent performance, the service delivered data-driven visualizations. This empowered Asure to quickly identify pain points, optimize operations, and deliver exceptional customer experiences through a streamlined, scalable analytics solution tailored for call center operations.
Integrate Amazon Q in QuickSight with the PCA solution
The Amazon Q in QuickSight integration was done by following three high-level steps:

Create a dataset on QuickSight.
Create a topic on QuickSight from the dataset.
Query using natural language.

Create a dataset on QuickSight
We used Athena as the data source, which queries data from Amazon S3. QuickSight can be configured through multiple data sources (for more information, refer to Supported data sources). For this use case, we used the data generated from the PCA pipeline as the data source for further analytics and natural language queries in Amazon Q in QuickSight. The PCA pipeline stores data in Amazon S3, which can be queried in Athena, an interactive query service that allows you to analyze data directly in Amazon S3 using standard SQL.

On the QuickSight console, choose Datasets in the navigation pane.
Choose Create new.
Choose Athena as the data source and input the particular catalog, database, and table that Amazon Q in QuickSight will reference.

Confirm the dataset was created successfully and proceed to the next step.

Create a topic on QuickSight from the dataset created
Users can use topics in QuickSight, powered by Amazon Q integration, to perform natural language queries on their data. This feature allows for intuitive data exploration and analysis by posing questions in plain language, alleviating the need for complex SQL queries or specialized technical skills. Before setting up a topic, make sure that the users have Pro level access. To set up a topic, follow these steps:

On the QuickSight console, choose Topics in the navigation pane.
Choose New topic.
Enter a name for the topic and choose the data source created.
Choose the created topic and then choose Open Q&A to start querying in natural language.

Query using natural language
We performed intuitive natural language queries to gain actionable insights into customer analytics. This capability allows users to analyze sentiments, call volumes, issue resolutions, and agent performance through conversational queries, enabling data-driven decision-making, operational optimization, and enhanced customer experiences within a scalable, call center-tailored analytics solution. Examples of the simple natural language queries “Which customer had positive sentiments and a complex query?” and “What are the most common issues and which agents dealt with them?” are shown in the following screenshots.

These capabilities are helpful when business leaders want to dive deep on a particular issue, empowering them to make informed decisions on various issues.
Success metrics
The primary success metric of this solution is improved employee productivity: quickly understanding customer interactions from calls to uncover themes and trends while also identifying gaps and pain points in products. Before the engagement, analysts took 14 days to manually go through call transcripts to retrieve insights. After the engagement, Asure observed that Amazon Bedrock and Amazon Q in QuickSight could reduce this time to minutes, even seconds, for both insights queried directly from all stored call transcripts and visualizations that can be used for report generation.
In the pipeline, Anthropic's Claude 3 Haiku was used to obtain initial call metadata fields (such as summary, root cause, next steps, and sentiments) that were stored in Athena. This allowed each call transcript to be queried using natural language from Amazon Q in QuickSight, letting business analysts answer high-level questions about issues, themes, and customer and agent insights in seconds.
Pat Goepel, chairman and CEO of Asure, shares,

“In collaboration with the AWS Generative AI Innovation Center, we have improved upon a post-call analytics solution to help us identify and prioritize features that will be the most impactful for our customers. We are utilizing Amazon Bedrock, Amazon Comprehend, and Amazon Q in QuickSight to understand trends in our own customer interactions, prioritize items for product development, and detect issues sooner so that we can be even more proactive in our support for our customers. Our partnership with AWS and our commitment to be early adopters of innovative technologies like Amazon Bedrock underscore our dedication to making advanced HCM technology accessible for businesses of any size.”

Takeaways
We had the following takeaways:

Enabling chain-of-thought reasoning and tailored assistant prompts for each prompt in the call metadata generation component, and invoking them with Anthropic’s Claude 3 Haiku, improved metadata generation for each transcript. The flexibility of Amazon Bedrock in supporting various FMs allowed broad experimentation across model types with minimal changes, which made it a natural choice for this application (a minimal invocation sketch follows this list).
Ragas metrics, particularly faithfulness, answer correctness, and answer relevance, were used to evaluate call metadata generation and individual Q&A. Summarization, however, required different metrics (alignment and coverage), which don’t need ground truth summaries, so DeepEval was used to calculate them. Overall, the ease of integrating Amazon Bedrock allowed it to power the calculation of quantitative metrics with minimal changes to the evaluation libraries, and allowed different types of models to be used with different evaluation libraries.
The human-in-the-loop approach can be used by SMEs to further evaluate Amazon Bedrock outputs. There is an opportunity to improve upon an Amazon Bedrock FM based on this feedback, but this was not worked on in this engagement.
The post-call analytics workflow, with the use of Amazon Bedrock, can be iterated upon in the future using features such as Amazon Bedrock Knowledge Bases to perform Q&A over a specific number of call transcripts as well as Amazon Bedrock Guardrails to detect harmful and hallucinated responses while also creating more responsible AI applications.
Amazon Q in QuickSight was able to answer natural language questions on customer analytics, root cause, and agent analytics, but some questions required reframing to get meaningful responses.
Data fields within Amazon Q in QuickSight needed to be defined properly and synonyms needed to be added to make Amazon Q more robust with natural language queries.
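To illustrate the first takeaway, the following is a minimal sketch of sending a transcript-level prompt with a step-by-step system instruction to Anthropic’s Claude 3 Haiku through the Amazon Bedrock runtime. The prompt wording and metadata fields are illustrative placeholders, not the prompts used in this engagement:

import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

transcript = "Agent: Thank you for calling... Customer: I cannot log in to payroll..."  # illustrative

body = {
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 1024,
    "system": (
        "You are a call center analyst. Think step by step, then return JSON with "
        "summary, root_cause, next_steps, and sentiment fields."
    ),
    "messages": [
        {"role": "user", "content": f"Analyze the following call transcript:\n\n{transcript}"}
    ],
}

response = bedrock.invoke_model(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",
    body=json.dumps(body),
)
result = json.loads(response["body"].read())
print(result["content"][0]["text"])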

Security best practices
We recommend the following security guidelines for building secure applications on AWS:

Building secure machine learning environments with Amazon SageMaker
Control root access to a SageMaker notebook instance
Security in Amazon S3
Data protection in Amazon Cognito

Conclusion
In this post, we showcased how Asure used the PCA solution powered by Amazon Bedrock and Amazon Q in QuickSight to generate consumer and agent insights at both individual and aggregate levels, including insights centered on a common theme or issue. With these services, Asure improved employee productivity, generating these insights in minutes instead of weeks.
This is one of the many ways builders can deliver great solutions using Amazon Bedrock and Amazon Q in QuickSight. To learn more, refer to Amazon Bedrock and Amazon Q in QuickSight.

About the Authors
Suren Gunturu is a Data Scientist working in the Generative AI Innovation Center, where he works with various AWS customers to solve high-value business problems. He specializes in building ML pipelines using large language models, primarily through Amazon Bedrock and other AWS Cloud services.
Avinash Yadav is a Deep Learning Architect at the Generative AI Innovation Center, where he designs and implements cutting-edge GenAI solutions for diverse enterprise needs. He specializes in building ML pipelines using large language models, with expertise in cloud architecture, Infrastructure as Code (IaC), and automation. His focus lies in creating scalable, end-to-end applications that leverage the power of deep learning and cloud technologies.
John Canada is the VP of Engineering at Asure Software, where he leverages his experience in building innovative, reliable, and performant solutions and his passion for AI/ML to lead a talented team dedicated to using Machine Learning to enhance the capabilities of Asure’s software and meet the evolving needs of businesses.
Yasmine Rodriguez Wakim is the Chief Technology Officer at Asure Software. She is an innovative Software Architect & Product Leader with deep expertise in creating payroll, tax, and workforce software development. As a results-driven tech strategist, she builds and leads technology vision to deliver efficient, reliable, and customer-centric software that optimizes business operations through automation.
Vidya Sagar Ravipati is a Science Manager at the Generative AI Innovation Center, where he leverages his vast experience in large-scale distributed systems and his passion for machine learning to help AWS customers across different industry verticals accelerate their AI and cloud adoption.

From innovation to impact: How AWS and NVIDIA enable real-world genera …

As we gather for NVIDIA GTC, organizations of all sizes are at a pivotal moment in their AI journey. The question is no longer whether to adopt generative AI, but how to move from promising pilots to production-ready systems that deliver real business value. The organizations that figure this out first will have a significant competitive advantage—and we’re already seeing compelling examples of what’s possible.
Consider Hippocratic AI’s work to develop AI-powered clinical assistants to support healthcare teams as doctors, nurses, and other clinicians face unprecedented levels of burnout. During a recent hurricane in Florida, their system called 100,000 patients in a day to check on medications and provide preventative healthcare guidance–the kind of coordinated outreach that would be nearly impossible to achieve manually. They aren’t just building another chatbot; they are reimagining healthcare delivery at scale.
Production-ready AI like this requires more than just cutting-edge models or powerful GPUs. In my decade working with customers’ data journeys, I’ve seen that an organization’s most valuable asset is its domain-specific data and expertise. And now leading our data and AI go-to-market, I hear customers consistently emphasize what they need to transform their domain advantage into AI success: infrastructure and services they can trust—with performance, cost-efficiency, security, and flexibility—all delivered at scale. When the stakes are high, success requires not just cutting-edge technology, but the ability to operationalize it at scale—a challenge that AWS has consistently solved for customers. As the world’s most comprehensive and broadly adopted cloud, our partnership with NVIDIA’s pioneering accelerated computing platform for generative AI amplifies this capability. It’s inspiring to see how, together, we’re enabling customers across industries to confidently move AI into production.
In this post, I will share some of these customers’ remarkable journeys, offering practical insights for any organization looking to harness the power of generative AI.
Transforming content creation with generative AI

Content creation represents one of the most visible and immediate applications of generative AI today. Adobe, a pioneer that has shaped creative workflows for over four decades, has moved with remarkable speed to integrate generative AI across its flagship products, helping millions of creators work in entirely new ways.
Adobe’s approach to generative AI infrastructure exemplifies what their VP of Generative AI, Alexandru Costin, calls an “AI superhighway”—a sophisticated technical foundation that enables rapid iteration of AI models and seamless integration into their creative applications. The success of their Firefly family of generative AI models, integrated across flagship products like Photoshop, demonstrates the power of this approach. For their AI training and inference workloads, Adobe uses NVIDIA GPU-accelerated Amazon Elastic Compute Cloud (Amazon EC2) P5en (NVIDIA H200 GPUs), P5 (NVIDIA H100 GPUs), P4de (NVIDIA A100 GPUs), and G5 (NVIDIA A10G GPUs) instances. They also use NVIDIA software such as NVIDIA TensorRT and NVIDIA Triton Inference Server for faster, scalable inference. Adobe needed maximum flexibility to build their AI infrastructure, and AWS provided the complete stack of services needed—from Amazon FSx for Lustre for high-performance storage, to Amazon Elastic Kubernetes Service (Amazon EKS) for container orchestration, to Elastic Fabric Adapter (EFA) for high-throughput networking—to create a production environment that could reliably serve millions of creative professionals.
Key takeaway
If you’re building and managing your own AI pipelines, Adobe’s success highlights a critical insight: although GPU-accelerated compute often gets the spotlight in AI infrastructure discussions, what’s equally important is the NVIDIA software stack along with the foundation of orchestration, storage, and networking services that enable production-ready AI. Their results speak for themselves—Adobe achieved a 20-fold scale-up in model training while maintaining the enterprise-grade performance and reliability their customers expect.
Pioneering new AI applications from the ground up

Throughout my career, I’ve been particularly energized by startups that take on audacious challenges—those that aren’t just building incremental improvements but are fundamentally reimagining how things work. Perplexity exemplifies this spirit. They’ve taken on a technology most of us now take for granted: search. It’s the kind of ambitious mission that excites me, not just because of its bold vision, but because of the incredible technical challenges it presents. When you’re processing 340 million queries monthly and serving over 1,500 organizations, transforming search isn’t just about having great ideas—it’s about building robust, scalable systems that can deliver consistent performance in production.
Perplexity’s innovative approach earned them membership in both AWS Activate and NVIDIA Inception—flagship programs designed to accelerate startup innovation and success. These programs provided them with the resources, technical guidance, and support needed to build at scale. They were one of the early adopters of Amazon SageMaker HyperPod, and continue to use its distributed training capabilities to accelerate model training time by up to 40%. They use a highly optimized inference stack built with NVIDIA TensorRT-LLM and NVIDIA Triton Inference Server to serve both their search application and pplx-api, their public API service that gives developers access to their proprietary models. The results speak for themselves—their inference stack achieves up to 3.1 times lower latency compared to other platforms. Both their training and inference workloads run on NVIDIA GPU-accelerated EC2 P5 instances, delivering the performance and reliability needed to operate at scale. To give their users even more flexibility, Perplexity complements their own models with services such as Amazon Bedrock, and provides access to additional state-of-the-art models in their API. Amazon Bedrock offers ease of use and reliability, which are crucial for their team—as they note, it allows them to effectively maintain the reliability and latency their product demands.
What I find particularly compelling about Perplexity’s journey is their commitment to technical excellence, exemplified by their work optimizing GPU memory transfer with EFA networking. The team achieved 97.1% of the theoretical maximum bandwidth of 3200 Gbps and open sourced their innovations, enabling other organizations to benefit from their learnings.
For those interested in the technical details, I encourage you to read their fascinating post Journey to 3200 Gbps: High-Performance GPU Memory Transfer on AWS Sagemaker Hyperpod.
Key takeaway
For organizations with complex AI workloads and specific performance requirements, Perplexity’s approach offers a valuable lesson. Sometimes, the path to production-ready AI isn’t about choosing between self-hosted infrastructure and managed services—it’s about strategically combining both. This hybrid strategy can deliver both exceptional performance (evidenced by Perplexity’s 3.1 times lower latency) and the flexibility to evolve.
Transforming enterprise workflows with AI
Enterprise workflows represent the backbone of business operations—and they’re a crucial proving ground for AI’s ability to deliver immediate business value. ServiceNow, which terms itself the AI platform for business transformation, is rapidly integrating AI to reimagine core business processes at scale.
ServiceNow’s innovative AI solutions showcase their vision for enterprise-specific AI optimization. As Srinivas Sunkara, ServiceNow’s Vice President, explains, their approach focuses on deep AI integration with technology workflows, core business processes, and CRM systems—areas where traditional large language models (LLMs) often lack domain-specific knowledge. To train generative AI models at enterprise scale, ServiceNow uses NVIDIA DGX Cloud on AWS. Their architecture combines high-performance FSx for Lustre storage with NVIDIA GPU clusters for training, and NVIDIA Triton Inference Server handles production deployment. This robust technology platform allows ServiceNow to focus on domain-specific AI development and customer value rather than infrastructure management.
Key takeaway
ServiceNow offers an important lesson about enterprise AI adoption: while foundation models (FMs) provide powerful general capabilities, the greatest business value often comes from optimizing models for specific enterprise use cases and workflows. In many cases, it’s precisely this deliberate specialization that transforms AI from an interesting technology into a true business accelerator.
Scaling AI across enterprise applications
Cisco’s Webex team’s journey with generative AI exemplifies how large organizations can methodically transform their applications while maintaining enterprise standards for reliability and efficiency. With a comprehensive suite of telecommunications applications serving customers globally, they needed an approach that would allow them to incorporate LLMs across their portfolio—from AI assistants to speech recognition—without compromising performance or increasing operational complexity.
The Webex team’s key insight was to separate their models from their applications. Previously, they had embedded AI models into the container images for applications running on Amazon EKS, but as their models grew in sophistication and size, this approach became increasingly inefficient. By migrating their LLMs to Amazon SageMaker AI and using NVIDIA Triton Inference Server, they created a clean architectural break between their relatively lean applications and the underlying models, which require more substantial compute resources. This separation allows applications and models to scale independently, significantly reducing development cycle time and increasing resource utilization. The team deployed dozens of models on SageMaker AI endpoints, using Triton Inference Server’s model concurrency capabilities to scale globally across AWS data centers.
The results validate Cisco’s methodical approach to AI transformation. By separating applications from models, their development teams can now fix bugs, perform tests, and add features to applications much faster, without having to manage large models in their workstation memory. The architecture also enables significant cost optimization—applications remain available during off-peak hours for reliability, and model endpoints can scale down when not needed, all without impacting application performance. Looking ahead, the team is evaluating Amazon Bedrock to further improve their price-performance, demonstrating how thoughtful architecture decisions create a foundation for continuous optimization.
Key takeaway
For enterprises with large application portfolios looking to integrate AI at scale, Cisco’s methodical approach offers an important lesson: separating LLMs from applications creates a cleaner architectural boundary that improves both development velocity and cost optimization. By treating models and applications as independent components, Cisco significantly improved development cycle time while reducing costs through more efficient resource utilization.
Building mission-critical AI for healthcare

Earlier, we highlighted how Hippocratic AI reached 100,000 patients during a crisis. Behind this achievement lies a story of rigorous engineering for safety and reliability—essential in healthcare where stakes are extraordinarily high.
Hippocratic AI’s approach to this challenge is both innovative and rigorous. They’ve developed what they call a “constellation architecture”—a sophisticated system of over 20 specialized models working in concert, each focused on specific safety aspects like prescription adherence, lab analysis, and over-the-counter medication guidance. This distributed approach to safety means they have to train multiple models, requiring management of significant computational resources. That’s why they use SageMaker HyperPod for their training infrastructure, using Amazon FSx and Amazon Simple Storage Service (Amazon S3) for high-speed storage access to NVIDIA GPUs, while Grafana and Prometheus provide the comprehensive monitoring needed to provide optimal GPU utilization. They build upon NVIDIA’s low-latency inference stack, and are enhancing conversational AI capabilities using NVIDIA Riva models for speech recognition and text-to-speech translation, and are also using NVIDIA NIM microservices to deploy these models. Given the sensitive nature of healthcare data and HIPAA compliance requirements, they’ve implemented a sophisticated multi-account, multi-cluster strategy on AWS—running production inference workloads with patient data on completely separate accounts and clusters from their development and training environments. This careful attention to both security and performance allows them to handle thousands of patient interactions while maintaining precise control over clinical safety and accuracy.
The impact of Hippocratic AI’s work extends far beyond technical achievements. Their AI-powered clinical assistants address critical healthcare workforce burnout by handling burdensome administrative tasks—from pre-operative preparation to post-discharge follow-ups. For example, during weather emergencies, their system can rapidly assess heat risks and coordinate transport for vulnerable patients—the kind of comprehensive care that would be too burdensome and resource-intensive to coordinate manually at scale.
Key takeaway
For organizations building AI solutions for complex, regulated, and high-stakes environments, Hippocratic AI’s constellation architecture reinforces what we’ve consistently emphasized: there’s rarely a one-size-fits-all model for every use case. Just as Amazon Bedrock offers a choice of models to meet diverse needs, Hippocratic AI’s approach of combining over 20 specialized models—each focused on specific safety aspects—demonstrates how a thoughtfully designed ensemble can achieve both precision and scale.
Conclusion
As the technology partners enabling these and countless other customer innovations, AWS and NVIDIA’s long-standing collaboration continues to evolve to meet the demands of the generative AI era. Our partnership, which began over 14 years ago with the world’s first GPU cloud instance, has grown to offer the industry’s widest range of NVIDIA accelerated computing solutions and software services for optimizing AI deployments. Through initiatives like Project Ceiba—one of the world’s fastest AI supercomputers hosted exclusively on AWS using NVIDIA DGX Cloud for NVIDIA’s own research and development use—we continue to push the boundaries of what’s possible.
As all the examples we’ve covered illustrate, it isn’t just about the technology we build together—it’s how organizations of all sizes are using these capabilities to transform their industries and create new possibilities. These stories ultimately reveal something more fundamental: when we make powerful AI capabilities accessible and reliable, people find remarkable ways to use them to solve meaningful problems. That’s the true promise of our partnership with NVIDIA—enabling innovators to create positive change at scale. I’m excited to continue inventing and partnering with NVIDIA and can’t wait to see what our mutual customers are going to do next.
Resources
Check out the following resources to learn more about our partnership with NVIDIA and generative AI on AWS:

Learn about the AWS and NVIDIA partnership
Explore generative AI on AWS
Cost-effectively access NVIDIA GPUs across several new AWS Regions with Amazon EC2 Capacity Blocks for ML
Get started with Amazon SageMaker HyperPod for generative AI model development
Build and scale generative AI applications with Amazon Bedrock

About the Author
Rahul Pathak is Vice President, Data and AI GTM at AWS, where he leads the global go-to-market and specialist teams who are helping customers create differentiated value with AWS AI capabilities such as Amazon Bedrock, Amazon Q, Amazon SageMaker, and Amazon EC2, and data services such as Amazon S3, AWS Glue, and Amazon Redshift. Rahul believes that generative AI will transform virtually every single customer experience and that data is a key differentiator for customers as they build AI applications. Prior to his current role, he was Vice President, Relational Database Engines, where he led Amazon Aurora, Redshift, and DSQL. During his 13+ years at AWS, Rahul has been focused on launching, building, and growing managed database and analytics services, all aimed at making it easy for customers to get value from their data. Rahul has over twenty years of experience in technology and has co-founded two companies, one focused on analytics and the other on IP-geolocation. He holds a degree in Computer Science from MIT and an Executive MBA from the University of Washington.

Amazon Q Business now available in Europe (Ireland) AWS Region

Today, we are excited to announce that Amazon Q Business—a fully managed generative AI-powered assistant that you can configure to answer questions, provide summaries, and generate content based on your enterprise data—is now generally available in the Europe (Ireland) AWS Region.
Since its launch, Amazon Q Business has been helping customers find information, gain insight, and take action at work. The general availability of Amazon Q Business in the Europe (Ireland) Region will support customers across Ireland and the EU to transform how their employees work and access information, while maintaining data security and privacy requirements.
AWS customers and partners innovate using Amazon Q Business in Europe
Organizations across the EU are using Amazon Q Business for a wide variety of use cases, including answering questions about company data, summarizing documents, and providing business insights.
Katya Dunets, the AWS Lead Sales Engineer for Adastra noted,

Adastra stands at the forefront of technological innovation, specializing in artificial intelligence, data, cloud, digital, and governance services. Our team was facing the daunting challenge of sifting through hundreds of documents on SharePoint, searching for content and information critical for market research and RFP generation. This process was not only time-consuming but also impeded our agility and responsiveness. Recognizing the need for a transformative solution, we turned to Amazon Q Business for its prowess in answering queries, summarizing documents, generating content, and executing tasks, coupled with its direct SharePoint integration. Amazon Q Business became the catalyst for unprecedented efficiency within Adastra, dramatically streamlining document retrieval, enhancing cross-team collaboration through shared insights from past projects, and accelerating our RFP development process by 70%. Amazon Q Business has not only facilitated a smoother exchange of knowledge within our teams but has also empowered us to maintain our competitive edge by focusing on innovation rather than manual tasks. Adastra’s journey with Amazon Q exemplifies our commitment to harnessing cutting-edge technology to better serve both our clients and their customers.

AllCloud is a cloud solutions provider specializing in cloud stack, infrastructure, platform, and Software-as-a-Service. Their CTO, Peter Nebel stated,

“AllCloud faces the common challenge of information sprawl. Critical knowledge for sales and delivery teams is scattered across various tools—Salesforce for customer and marketing data, Google Drive for documents, Bamboo for HR and internal information, and Confluence for internal wikis. This fragmented approach wastes valuable time as employees hunt and peck for the information they need, hindering productivity and potentially impacting client satisfaction. Amazon Q Business provides AllCloud a solution to increase productivity by streamlining information access. By leveraging Amazon Q’s natural language search capabilities, AllCloud can empower its personnel with a central hub to find answers to their questions across all their existing information sources. This drives efficiency and accuracy by eliminating the need for time-consuming searches across multiple platforms and ensures all teams have access to the most up-to-date information. Amazon Q will significantly accelerate productivity, across all lines of business, allowing AllCloud’s teams to focus on delivering exceptional service to their clients.”

Lars Ritter, Senior Manager at Woodmark Consulting noted,

“Amazon Bedrock and Amazon Q Business have been game-changers for Woodmark. Employees struggled with time-consuming searches across various siloed systems, leading to reduced productivity and slower operations. To solve for the inefficient retrieval of corporate knowledge from unstructured data sources we turned to Amazon Bedrock and Amazon Q Business for help. With this innovative solution, Woodmark has been able to revolutionize data accessibility, empowering our teams to effortlessly retrieve insights using simple natural language queries, and to make informed decisions without relying on specialized data teams, which was not feasible before. These solutions have dramatically increased efficiency, fostered a data-driven culture, and positioned us for scalable growth, driving our organization toward unparalleled success.”

Scott Kumono, Product Manager for Kinectus at Siemens Healthineers adds,

“Amazon Q Business has enhanced the delivery of service and clinical support for our ultrasound customers. Previously, finding specific information meant sifting through a 1,000-page manual or waiting for customer support to respond. Now, customers have instant access to answers and specifications right at their fingertips, using Kinectus Remote Service. With Amazon Q Business we were able to significantly reduce manual work and wait times to find the right information, allowing our customers to focus on what really matters – patient care.”

Till Gloger, Head of Digital Production Platform Region Americas at Volkswagen Group of America states,

“Volkswagen innovates not only on its products, but also on how to boost employee productivity and increase production throughput. Volkswagen is testing the use of Amazon Q to streamline employee workflows by potentially integrating it with existing processes. This integration has the possibility to help employees save time during the assembly process, reducing some processes from minutes to seconds, ultimately leading to more throughput.”

Pricing
With Amazon Q Business, enterprise customers pay for user subscriptions and index capacity. For more details, see Amazon Q Business pricing.
Get started with Amazon Q Business today
To get started with Amazon Q Business, users first need to configure an application environment and create a knowledge base using over 40 data source connectors that index documents (for example, text, PDF, images, and tables). Organizations then set up user authentication through AWS IAM Identity Center or other SAML-based identity providers like Okta, Ping Identity, and Microsoft Entra ID. After configuring access permissions, application users can navigate to their organization’s Amazon Q Business web interface using their credentials to begin interacting with Q Business and the data they have access to. Q Business enables natural language interactions where users can ask questions and receive answers based on their indexed documents, uploaded content, and world knowledge; this may include getting details, generating content, or surfacing insights. Users can access Amazon Q Business through multiple channels, including web applications, Slack, Microsoft Teams, Microsoft 365 for Word and Outlook, or browser extensions for generative AI assistance directly where they work. Additionally, customers can securely share their data with verified independent software vendors (ISVs) like Asana, Miro, PagerDuty, and Zoom using the data accessors feature, which maintains security and compliance while respecting user-level permissions.
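As a rough illustration of the first setup step, the application environment can also be created programmatically. The following boto3 sketch is an assumption-laden outline, not an authoritative recipe: the display name and IAM Identity Center instance ARN are placeholders, and you should confirm parameter names against the current Amazon Q Business API documentation:

import boto3

qbusiness = boto3.client("qbusiness", region_name="eu-west-1")

# Placeholder values -- replace with your own display name and Identity Center instance ARN
response = qbusiness.create_application(
    displayName="my-q-business-app",
    identityCenterInstanceArn="arn:aws:sso:::instance/ssoins-EXAMPLE1234567890",
    description="Amazon Q Business application for enterprise search",
)
print(response)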
Learn more about how to get started with Amazon Q Business here. Read about other Amazon Q Business customers’ success stories here. Certain Amazon Q Business features already available in US East (N. Virginia) and US West (Oregon), including Q Apps, Q Actions, and audio/video file support, will become available in Europe (Ireland) soon.

About the Authors
Jose Navarro is an AI/ML Specialist Solutions Architect at AWS, based in Spain. Jose helps AWS customers—from small startups to large enterprises—architect and take their end-to-end machine learning use cases to production.
Morgan Dutton is a Senior Technical Program Manager at AWS, Amazon Q Business based in Seattle.
Eva Pagneux is a Principal Product Manager at AWS, Amazon Q Business, based in San Francisco.
Wesleigh Roeca is a Senior Worldwide Gen AI/ML Specialist at AWS, Amazon Q Business, based in Santa Monica.

Intelligent healthcare assistants: Empowering stakeholders with person …

Large language models (LLMs) have revolutionized the field of natural language processing, enabling machines to understand and generate human-like text with remarkable accuracy. However, despite their impressive language capabilities, LLMs are inherently limited by the data they were trained on. Their knowledge is static, frozen at training time, which becomes problematic when dealing with dynamic and constantly evolving domains like healthcare.
The healthcare industry is a complex, ever-changing landscape with a vast and rapidly growing body of knowledge. Medical research, clinical practices, and treatment guidelines are constantly being updated, rendering even the most advanced LLMs quickly outdated. Additionally, patient data, including electronic health records (EHRs), diagnostic reports, and medical histories, is highly personalized and unique to each individual. Relying solely on an LLM’s pre-trained knowledge is insufficient for providing accurate and personalized healthcare recommendations.
Furthermore, healthcare decisions often require integrating information from multiple sources, such as medical literature, clinical databases, and patient records. LLMs lack the ability to seamlessly access and synthesize data from these diverse and distributed sources. This limits their potential to provide comprehensive and well-informed insights for healthcare applications.
Overcoming these challenges is crucial for using the full potential of LLMs in the healthcare domain. Patients, healthcare providers, and researchers require intelligent agents that can provide up-to-date, personalized, and context-aware support, drawing from the latest medical knowledge and individual patient data.
Enter LLM function calling, a powerful capability that addresses these challenges by allowing LLMs to interact with external functions or APIs, enabling them to access and use additional data sources or computational capabilities beyond their pre-trained knowledge. By combining the language understanding and generation abilities of LLMs with external data sources and services, LLM function calling opens up a world of possibilities for intelligent healthcare agents.
In this blog post, we will explore how the Mistral LLM on Amazon Bedrock can address these challenges and enable the development of intelligent healthcare agents with LLM function calling capabilities, while maintaining robust data security and privacy through Amazon Bedrock Guardrails.
Healthcare agents equipped with LLM function calling can serve as intelligent assistants for various stakeholders, including patients, healthcare providers, and researchers. They can assist patients by answering medical questions, interpreting test results, and providing personalized health advice based on their medical history and current conditions. For healthcare providers, these agents can help with tasks such as summarizing patient records, suggesting potential diagnoses or treatment plans, and staying up to date with the latest medical research. Additionally, researchers can use LLM function calling to analyze vast amounts of scientific literature, identify patterns and insights, and accelerate discoveries in areas such as drug development or disease prevention.
Benefits of LLM function calling
LLM function calling offers several advantages for enterprise applications, including enhanced decision-making, improved efficiency, personalized experiences, and scalability. By combining the language understanding capabilities of LLMs with external data sources and computational resources, enterprises can make more informed and data-driven decisions, automate and streamline various tasks, provide tailored recommendations and experiences for individual users or customers, and handle large volumes of data and process multiple requests concurrently.
Potential use cases for LLM function calling in the healthcare domain include patient triage, medical question answering, and personalized treatment recommendations. LLM-powered agents can assist in triaging patients by analyzing their symptoms, medical history, and risk factors, and providing initial assessments or recommendations for seeking appropriate care. Patients and healthcare providers can receive accurate and up-to-date answers to medical questions by using LLMs’ ability to understand natural language queries and access relevant medical knowledge from various data sources. Additionally, by integrating with electronic health records (EHRs) and clinical decision support systems, LLM function calling can provide personalized treatment recommendations tailored to individual patients’ medical histories, conditions, and preferences.
Amazon Bedrock supports a variety of foundation models. In this post, we will be exploring how to perform function calling using Mistral from Amazon Bedrock. Mistral supports function calling, which allows agents to invoke external functions or APIs from within a conversation flow. This capability enables agents to retrieve data, perform calculations, or use external services to enhance their conversational abilities. Function calling in Mistral is achieved through the use of specific function call blocks that define the external function to be invoked and handle the response or output.
Solution overview
LLM function calling typically involves integrating an LLM model with an external API or function that provides access to additional data sources or computational capabilities. The LLM model acts as an interface, processing natural language inputs and generating responses based on its pre-trained knowledge and the information obtained from the external functions or APIs. The architecture typically consists of the LLM model, a function or API integration layer, and external data sources and services.
Healthcare agents can integrate LLM models and call external functions or APIs through a series of steps: natural language input processing, self-correction, chain of thought, function or API calling through an integration layer, data integration and processing, and persona adoption. The agent receives natural language input, processes it through the LLM model, calls relevant external functions or APIs if additional data or computations are required, combines the LLM model’s output with the external data or results, and provides a comprehensive response to the user.

High-level architecture: Healthcare assistant

The architecture for the Healthcare Agent is shown in the preceding figure and is as follows:

Consumers interact with the system through Amazon API Gateway.
AWS Lambda orchestrator, along with tool configuration and prompts, handles orchestration and invokes the Mistral model on Amazon Bedrock.
Agent function calling allows agents to invoke Lambda functions to retrieve data, perform computations, or use external services.
Dedicated Lambda functions, such as insurance, claims, and pre-fill functions, handle specific tasks.
Conversation history is persisted, a member database (MemberDB) stores member information, and a knowledge base holds static documents used by the agent.
AWS CloudTrail, AWS Identity and Access Management (IAM), and Amazon CloudWatch handle data security.
AWS Glue, Amazon SageMaker, and Amazon Simple Storage Service (Amazon S3) facilitate data processing.

Sample code using function calling with the Mistral LLM can be found in the mistral-on-aws repository.
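To make this flow concrete, here is a minimal, self-contained sketch of function calling with a Mistral model through the Amazon Bedrock Converse API. It is not the code from the mistral-on-aws repository; the tool name, input schema, and claim-lookup scenario are illustrative placeholders, and the model ID assumes a tool-use capable Mistral model is available in your Region:

import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

# Illustrative tool definition: look up the status of an insurance claim
tool_config = {
    "tools": [
        {
            "toolSpec": {
                "name": "get_claim_status",
                "description": "Returns the status of an insurance claim for a member.",
                "inputSchema": {
                    "json": {
                        "type": "object",
                        "properties": {"claim_id": {"type": "string"}},
                        "required": ["claim_id"],
                    }
                },
            }
        }
    ]
}

messages = [{"role": "user", "content": [{"text": "What is the status of claim C-1234?"}]}]

response = bedrock.converse(
    modelId="mistral.mistral-large-2402-v1:0",  # assumed tool-use capable Mistral model
    messages=messages,
    toolConfig=tool_config,
)

# If the model decides to call the tool, the response contains a toolUse block
for block in response["output"]["message"]["content"]:
    if "toolUse" in block:
        print("Model requested tool:", block["toolUse"]["name"], block["toolUse"]["input"])

In a full agent, the orchestrator would run the requested Lambda function, return its result to the model as a tool result message, and let the model compose the final answer.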
Security and privacy considerations
Data privacy and security are of utmost importance in the healthcare sector because of the sensitive nature of personal health information (PHI) and the potential consequences of data breaches or unauthorized access. Compliance with regulations such as HIPAA and GDPR is crucial for healthcare organizations handling patient data. To maintain robust data protection and regulatory compliance, healthcare organizations can use Amazon Bedrock Guardrails, a comprehensive set of security and privacy controls provided by Amazon Web Services (AWS).
Amazon Bedrock Guardrails offers a multi-layered approach to data security, including encryption at rest and in transit, access controls, audit logging, ground truth validation and incident response mechanisms. It also provides advanced security features such as data residency controls, which allow organizations to specify the geographic regions where their data can be stored and processed, maintaining compliance with local data privacy laws.
When using LLM function calling in the healthcare domain, it’s essential to implement robust security measures and follow best practices for handling sensitive patient information. Amazon Bedrock Guardrails can play a crucial role in this regard by helping to provide a secure foundation for deploying and operating healthcare applications and services that use LLM capabilities.
Some key security measures enabled by Amazon Bedrock Guardrails are:

Data encryption: Patient data processed by LLM functions can be encrypted at rest and in transit, making sure that sensitive information remains secure even in the event of unauthorized access or data breaches.
Access controls: Amazon Bedrock Guardrails enables granular access controls, allowing healthcare organizations to define and enforce strict permissions for who can access, modify, or process patient data through LLM functions.
Secure data storage: Patient data can be stored in secure, encrypted storage services such as Amazon S3 or Amazon Elastic File System (Amazon EFS), making sure that sensitive information remains protected even when at rest.
Anonymization and pseudonymization: Healthcare organizations can use Amazon Bedrock Guardrails to implement data anonymization and pseudonymization techniques, making sure that patient data used for training or testing LLM models doesn’t contain personally identifiable information (PII); a minimal configuration sketch follows this list.
Audit logging and monitoring: Comprehensive audit logging and monitoring capabilities provided by Amazon Bedrock Guardrails enable healthcare organizations to track and monitor all access and usage of patient data by LLM functions, enabling timely detection and response to potential security incidents.
Regular security audits and assessments: Amazon Bedrock Guardrails facilitates regular security audits and assessments, making sure that the healthcare organization’s data protection measures remain up-to-date and effective in the face of evolving security threats and regulatory requirements.
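As a concrete example of the anonymization and PII controls described above, a guardrail with PII handling can be defined through the Amazon Bedrock control-plane API. The following is a minimal sketch; the guardrail name, messages, and the chosen PII entity types are placeholders rather than a recommended healthcare configuration:

import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")

# Placeholder configuration: anonymize names and block US Social Security numbers
response = bedrock.create_guardrail(
    name="healthcare-pii-guardrail",
    description="Masks or blocks PII in healthcare assistant conversations",
    sensitiveInformationPolicyConfig={
        "piiEntitiesConfig": [
            {"type": "NAME", "action": "ANONYMIZE"},
            {"type": "US_SOCIAL_SECURITY_NUMBER", "action": "BLOCK"},
        ]
    },
    blockedInputMessaging="Sorry, I can't process requests that contain this information.",
    blockedOutputsMessaging="Sorry, I can't share that information.",
)
print(response["guardrailId"], response["version"])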

By using Amazon Bedrock Guardrails, healthcare organizations can confidently deploy LLM function calling in their applications and services, maintaining robust data security, privacy protection, and regulatory compliance while enabling the transformative benefits of AI-powered healthcare assistants.
Case studies and real-world examples
3M Health Information Systems is collaborating with AWS to accelerate AI innovation in clinical documentation by using AWS machine learning (ML) services, compute power, and LLM capabilities. This collaboration aims to enhance 3M’s natural language processing (NLP) and ambient clinical voice technologies, enabling intelligent healthcare agents to capture and document patient encounters more efficiently and accurately. These agents, powered by LLMs, can understand and process natural language inputs from healthcare providers, such as spoken notes or queries, and use LLM function calling to access and integrate relevant medical data from EHRs, knowledge bases, and other data sources. By combining 3M’s domain expertise with AWS ML and LLM capabilities, the companies can improve clinical documentation workflows, reduce administrative burdens for healthcare providers, and ultimately enhance patient care through more accurate and comprehensive documentation.
GE Healthcare developed Edison, a secure intelligence solution running on AWS, to ingest and analyze data from medical devices and hospital information systems. This solution uses AWS analytics, ML, and Internet of Things (IoT) services to generate insights and analytics that can be delivered through intelligent healthcare agents powered by LLMs. These agents, equipped with LLM function calling capabilities, can seamlessly access and integrate the insights and analytics generated by Edison, enabling them to assist healthcare providers in improving operational efficiency, enhancing patient outcomes, and supporting the development of new smart medical devices. By using LLM function calling to retrieve and process relevant data from Edison, the agents can provide healthcare providers with data-driven recommendations and personalized support, ultimately enabling better patient care and more effective healthcare delivery.
Future trends and developments
Future advancements in LLM function calling for healthcare might include more advanced natural language processing capabilities, such as improved context understanding, multi-turn conversational abilities, and better handling of ambiguity and nuances in medical language. Additionally, the integration of LLM models with other AI technologies, such as computer vision and speech recognition, could enable multimodal interactions and analysis of various medical data formats.
Emerging technologies such as multimodal models, which can process and generate text, images, and other data formats simultaneously, could enhance LLM function calling in healthcare by enabling more comprehensive analysis and visualization of medical data. Personalized language models, trained on individual patient data, could provide even more tailored and accurate responses. Federated learning techniques, which allow model training on decentralized data while preserving privacy, could address data-sharing challenges in healthcare.
These advancements and emerging technologies could shape the future of healthcare agents by making them more intelligent, adaptive, and personalized. Agents could seamlessly integrate multimodal data, such as medical images and lab reports, into their analysis and recommendations. They could also continuously learn and adapt to individual patients’ preferences and health conditions, providing truly personalized care. Additionally, federated learning could enable collaborative model development while maintaining data privacy, fostering innovation and knowledge sharing across healthcare organizations.
Conclusion
LLM function calling has the potential to revolutionize the healthcare industry by enabling intelligent agents that can understand natural language, access and integrate various data sources, and provide personalized recommendations and insights. By combining the language understanding capabilities of LLMs with external data sources and computational resources, healthcare organizations can enhance decision-making, improve operational efficiency, and deliver superior patient experiences. However, addressing data privacy and security concerns is crucial for the successful adoption of this technology in the healthcare domain.
As the healthcare industry continues to embrace digital transformation, we encourage readers to explore and experiment with LLM function calling in their respective domains. By using this technology, healthcare organizations can unlock new possibilities for improving patient care, advancing medical research, and streamlining operations. With a focus on innovation, collaboration, and responsible implementation, the healthcare industry can harness the power of LLM function calling to create a more efficient, personalized, and data-driven future. AWS can help organizations use LLM function calling and build intelligent healthcare assistants through its AI/ML services, including Amazon Bedrock, Amazon Lex, and Lambda, while maintaining robust security and compliance using Amazon Bedrock Guardrails. To learn more, see AWS for Healthcare & Life Sciences.

About the Authors
Laks Sundararajan is a seasoned Enterprise Architect helping companies reset, transform and modernize their IT, digital, cloud, data and insight strategies. A proven leader with significant expertise around Generative AI, Digital, Cloud and Data/Analytics Transformation, Laks is a Sr. Solutions Architect with Healthcare and Life Sciences (HCLS).
Subha Venugopal is a Senior Solutions Architect at AWS with over 15 years of experience in the technology and healthcare sectors. Specializing in digital transformation, platform modernization, and AI/ML, she leads AWS Healthcare and Life Sciences initiatives. Subha is dedicated to enabling equitable healthcare access and is passionate about mentoring the next generation of professionals.

Cohere Released Command A: A 111B Parameter AI Model with 256K Context …

LLMs are widely used for conversational AI, content generation, and enterprise automation. However, balancing performance with computational efficiency is a key challenge in this field. Many state-of-the-art models require extensive hardware resources, making them impractical for smaller enterprises. The demand for cost-effective AI solutions has led researchers to develop models that deliver high performance with lower computational requirements.

Training and deploying AI models present hurdles for researchers and businesses. Large-scale models require substantial computational power, making them costly to maintain. Also, AI models must handle multilingual tasks, ensure high instruction-following accuracy, and support enterprise applications such as data analysis, automation, and coding. Current market solutions, while effective, often demand infrastructure beyond the reach of many enterprises. The challenge is to optimize AI models for processing efficiency without compromising accuracy or functionality.

Several AI models currently dominate the market, including GPT-4o and DeepSeek-V3. These models excel in natural language processing and generation but require high-end hardware, sometimes needing up to 32 GPUs to operate effectively. While they provide advanced capabilities in text generation, multilingual support, and coding, their hardware dependencies limit accessibility. Some models also struggle with enterprise-level instruction-following accuracy and tool integration. Businesses need AI solutions that maintain competitive performance while minimizing infrastructure and deployment costs. This demand has driven efforts to optimize language models to function with minimal hardware requirements.

Researchers from Cohere introduced Command A, a high-performance AI model, designed specifically for enterprise applications requiring maximum efficiency. Unlike conventional models that require large computational resources, Command A operates on just two GPUs while maintaining competitive performance. The model comprises 111 billion parameters and supports a context length of 256K, making it suitable for enterprise applications that involve long-form document processing. Its ability to efficiently handle business-critical agentic and multilingual tasks sets it apart from its predecessors. The model has been optimized to provide high-quality text generation while reducing operational costs, making it a cost-effective alternative for businesses aiming to leverage AI for various applications.

The underlying technology of Command A is structured around an optimized transformer architecture, which includes three layers of sliding window attention, each with a window size of 4096 tokens. This mechanism enhances local context modeling, allowing the model to retain important details across extended text inputs. A fourth layer incorporates global attention without positional embeddings, enabling unrestricted token interactions across the entire sequence. The model’s supervised fine-tuning and preference training further refine its ability to align responses with human expectations regarding accuracy, safety, and helpfulness. Also, Command A supports 23 languages, making it one of the most versatile AI models for businesses with global operations. Its chat capabilities are preconfigured for interactive behavior, enabling seamless conversational AI applications.
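For readers who want to experiment with the open weights, the following transformers sketch shows a typical loading pattern. The repository ID is our assumption of the Hugging Face model name, and serving the full 111-billion-parameter model still requires multiple high-memory GPUs, so treat this purely as an illustrative outline rather than a tested deployment recipe:

# Minimal sketch; the model ID below is an assumption, not confirmed by the article.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "CohereForAI/c4ai-command-a-03-2025"  # assumed Hugging Face repository name

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # load in the checkpoint's native precision
    device_map="auto",    # shard the weights across available GPUs
)

messages = [{"role": "user", "content": "Summarize our Q3 support tickets in three bullet points."}]
inputs = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))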


Performance evaluations indicate that Command A competes favorably with leading AI models such as GPT-4o and DeepSeek-V3 across various enterprise-focused benchmarks. The model achieves a token generation rate of 156 tokens per second, 1.75 times higher than GPT-4o and 2.4 times higher than DeepSeek-V3, making it one of the most efficient models available. Regarding cost efficiency, private deployments of Command A are up to 50% cheaper than API-based alternatives, significantly reducing the financial burden on businesses. Command A also excels in instruction-following tasks, SQL-based queries, and retrieval-augmented generation (RAG) applications. It has demonstrated high accuracy in real-world enterprise data evaluations, outperforming its competitors in multilingual business use cases.

In a direct comparison of enterprise task performance, human evaluation results show that Command A consistently outperforms its competitors in fluency, faithfulness, and response utility. The model’s enterprise-ready capabilities include robust retrieval-augmented generation with verifiable citations, advanced agentic tool use, and high-level security measures to protect sensitive business data. Its multilingual capabilities extend beyond simple translation, demonstrating superior proficiency in responding accurately in region-specific dialects. For instance, evaluations of Arabic dialects, including Egyptian, Saudi, Syrian, and Moroccan Arabic, revealed that Command A delivered more precise and contextually appropriate responses than leading AI models. These results emphasize its strong applicability in global enterprise environments where language diversity is crucial.


Several key takeaways from the research include:

Command A operates on just two GPUs, significantly reducing computational costs while maintaining high performance.

With 111 billion parameters, the model is optimized for enterprise-scale applications that require extensive text processing.

The model supports a 256K context length, enabling it to process longer enterprise documents more effectively than competing models.

Command A is trained on 23 languages, ensuring high accuracy and contextual relevance for global businesses.

It achieves 156 tokens per second, 1.75x higher than GPT-4o and 2.4x higher than DeepSeek-V3.

The model consistently outperforms competitors in real-world enterprise evaluations, excelling in SQL, agentic, and tool-based tasks.

Advanced RAG capabilities with verifiable citations make it highly suitable for enterprise information retrieval applications.

Private deployments of Command A can be up to 50% cheaper than API-based models.

The model includes enterprise-grade security features, ensuring safe handling of sensitive business data.

Demonstrates high proficiency in regional dialects, making it ideal for businesses operating in linguistically diverse regions.

Check out the Model on Hugging Face. All credit for this research goes to the researchers of this project.

The post Cohere Released Command A: A 111B Parameter AI Model with 256K Context Length, 23-Language Support, and 50% Cost Reduction for Enterprises appeared first on MarkTechPost.

Dynamic Tanh DyT: A Simplified Alternative to Normalization in Transfo …

Normalization layers have become fundamental components of modern neural networks, significantly improving optimization by stabilizing gradient flow, reducing sensitivity to weight initialization, and smoothing the loss landscape. Since the introduction of batch normalization in 2015, various normalization techniques have been developed for different architectures, with layer normalization (LN) becoming particularly dominant in Transformer models. Their widespread use is largely attributed to their ability to accelerate convergence and enhance model performance, especially as networks grow deeper and more complex. Despite ongoing architectural innovations that replace other core components like attention or convolution layers, normalization layers remain integral to most designs, underscoring their perceived necessity in deep learning.

While normalization layers have proven beneficial, researchers have also explored methods to train deep networks without them. Studies have proposed alternative weight initialization strategies, weight normalization techniques, and adaptive gradient clipping to maintain stability in models like ResNets. In Transformers, recent efforts have examined modifications that reduce reliance on normalization, such as restructuring Transformer blocks or gradually removing LN layers through fine-tuning. These approaches demonstrate that, while normalization layers offer optimization advantages, they are not strictly indispensable, and alternative training techniques can achieve stable convergence with comparable performance.

Researchers from FAIR, Meta, NYU, MIT, and Princeton propose Dynamic Tanh (DyT) as a simple yet effective alternative to normalization layers in Transformers. DyT operates as an element-wise function, DyT(x) = tanh(αx), where α is a learnable parameter that scales activations while limiting extreme values. Unlike layer normalization, DyT eliminates the need for activation statistics, simplifying computations. Empirical evaluations show that replacing normalization layers with DyT maintains or improves performance across various tasks without extensive hyperparameter tuning. Additionally, DyT enhances training and inference efficiency, challenging the assumption that normalization is essential for modern deep networks.

Researchers analyzed normalization layers in Transformers using models like ViT-B, wav2vec 2.0, and DiT-XL. They found that LN often exhibits a tanh-like, S-shaped input-output mapping, primarily linear for most values but squashing extreme activations. Inspired by this, they propose Dynamic Tanh (DyT) as a replacement for LN. Defined as DyT(x) = γ · tanh(αx) + β, where α, γ, and β are learnable parameters, DyT preserves LN’s effects without computing activation statistics. Empirical results show DyT integrates seamlessly into existing architectures, maintaining stability and reducing the need for hyperparameter tuning.
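The element-wise definition above translates directly into a few lines of PyTorch. The following is a minimal sketch consistent with the formula DyT(x) = γ · tanh(αx) + β; the initialization value for α is an illustrative choice rather than the paper’s tuned setting:

import torch
import torch.nn as nn

class DyT(nn.Module):
    """Dynamic Tanh: an element-wise drop-in replacement for LayerNorm."""

    def __init__(self, num_features: int, alpha_init: float = 0.5):
        super().__init__()
        # Scalar alpha controls how aggressively extreme activations are squashed
        self.alpha = nn.Parameter(torch.ones(1) * alpha_init)
        # Per-feature affine parameters, mirroring LayerNorm's weight and bias
        self.gamma = nn.Parameter(torch.ones(num_features))
        self.beta = nn.Parameter(torch.zeros(num_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # No mean or variance statistics are computed, just a scaled tanh plus an affine transform
        return self.gamma * torch.tanh(self.alpha * x) + self.beta

# Usage: replace nn.LayerNorm(d_model) with DyT(d_model) inside a Transformer block
x = torch.randn(2, 16, 512)   # (batch, sequence, features)
print(DyT(512)(x).shape)      # torch.Size([2, 16, 512])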

To evaluate DyT’s effectiveness, experiments were conducted across various architectures and tasks by replacing LN or RMSNorm with DyT while keeping hyperparameters unchanged. In supervised vision tasks, DyT slightly outperformed LN in ImageNet-1K classification. For self-supervised learning, diffusion models, language models, speech processing, and DNA sequence modeling, DyT achieved performance comparable to existing normalization methods. Efficiency tests on LLaMA-7B showed DyT reduced computation time. Ablation studies highlighted the importance of the tanh function and learnable parameter α, which correlated with activation standard deviation, acting as an implicit normalization mechanism. DyT demonstrated competitive performance with improved efficiency.

In conclusion, the study shows that modern neural networks, particularly Transformers, can be trained effectively without normalization layers. The proposed DyT replaces traditional normalization using a learnable scaling factor alpha and an S-shaped tanh function to regulate activation values. Despite its simplicity, DyT replicates normalization behavior and achieves comparable or superior performance across various tasks, including recognition, generation, and self-supervised learning. The results challenge the assumption that normalization layers are essential, offering new insights into their function. DyT provides a lightweight alternative that simplifies training while maintaining or improving performance, often without requiring hyperparameter adjustments.

Check out the Paper and Project Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 80k+ ML SubReddit.

The post Dynamic Tanh (DyT): A Simplified Alternative to Normalization in Transformers appeared first on MarkTechPost.

A Code Implementation to Build an AI-Powered PDF Interaction System in …

In this tutorial, we demonstrate how to build an AI-powered PDF interaction system in Google Colab using Gemini Flash 1.5, PyMuPDF, and the Google Generative AI API. By leveraging these tools, we can seamlessly upload a PDF, extract its text, and interactively ask questions, receiving intelligent responses from Google’s latest Gemini Flash 1.5 model.

!pip install -q -U google-generativeai PyMuPDF python-dotenv

First, we install the necessary dependencies for building an AI-powered PDF Q&A system in Google Colab. google-generativeai provides access to Gemini Flash 1.5, enabling natural language interactions, while PyMuPDF (imported as fitz) allows efficient text extraction from PDFs. Also, python-dotenv helps manage environment variables, such as API keys, securely within the notebook.

from google.colab import files
uploaded = files.upload()

Here we upload a file from the local device to Google Colab. When executed, the cell opens a file-selection dialog, allowing you to choose a file (e.g., a PDF) to upload. The uploaded file is stored in a dictionary-like object (uploaded), where keys are file names and values contain the file’s binary data. This step is essential for directly processing documents, datasets, or model weights in a Colab environment.
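
As a small optional illustration (assuming a single file was uploaded; the “Paper.pdf” name is just an example), you can read the filename back from the dictionary keys instead of hard-coding it:

# The uploaded dict maps filenames to raw bytes; Colab also writes each file
# into the current working directory (/content by default).
pdf_name = next(iter(uploaded))                       # e.g. "Paper.pdf"
print(f"Uploaded {pdf_name} ({len(uploaded[pdf_name])} bytes) to /content/{pdf_name}")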

import fitz

def extract_pdf_text(pdf_path):
    doc = fitz.open(pdf_path)
    full_text = ""
    for page in doc:
        full_text += page.get_text()
    return full_text

pdf_file_path = '/content/Paper.pdf'
document_text = extract_pdf_text(pdf_path=pdf_file_path)
print("Document text extracted!")
print(document_text[:1000])

We use PyMuPDF (fitz) to extract text from a PDF file in Google Colab. The function extract_pdf_text(pdf_path) reads the PDF, iterates through its pages, and retrieves the text content. The extracted text is then stored in document_text, with the first 1000 characters printed to preview the content. This step is crucial for enabling text-based analysis and AI-driven question answering from PDFs.

import os
os.environ["GOOGLE_API_KEY"] = 'Use your own API key here'

We set the Google API key as an environment variable in Google Colab. The API key is required to authenticate requests to Google Generative AI, allowing access to Gemini Flash 1.5 for AI-powered text processing. Replacing ‘Use your own API key here’ with a valid key ensures that the model can generate responses securely within the notebook.
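
If you prefer not to paste the key directly into a cell, one optional alternative (a sketch using only Python’s standard getpass module) is to prompt for it at runtime:

# Optional: prompt for the key interactively instead of hard-coding it in the notebook.
import os
from getpass import getpass

if not os.environ.get("GOOGLE_API_KEY"):
    os.environ["GOOGLE_API_KEY"] = getpass("Enter your Google API key: ")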

import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

model_name = "models/gemini-1.5-flash-001"

def query_gemini_flash(question, context):
    model = genai.GenerativeModel(model_name=model_name)
    prompt = f"""
    Context: {context[:20000]}

    Question: {question}

    Answer:
    """
    response = model.generate_content(prompt)
    return response.text

pdf_text = extract_pdf_text("/content/Paper.pdf")

question = "Summarize the key findings of this document."
answer = query_gemini_flash(question, pdf_text)
print("Gemini Flash Answer:")
print(answer)

Finally, we configure and query Gemini Flash 1.5 against the PDF document. The code initializes the genai library with the API key and loads the Gemini Flash 1.5 model (gemini-1.5-flash-001). The query_gemini_flash() function takes a question and the extracted PDF text as input, formulates a structured prompt, and retrieves an AI-generated response. This setup enables automated document summarization and intelligent Q&A over PDFs.
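
To go beyond a single summary, a small follow-up loop (a sketch that simply reuses the query_gemini_flash function defined above) lets you ask several questions against the same extracted text:

# Simple interactive loop: type a question, or an empty line to stop.
while True:
    user_question = input("Ask about the PDF (blank to quit): ").strip()
    if not user_question:
        break
    print(query_gemini_flash(user_question, pdf_text))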

In conclusion, by following this tutorial we have built an interactive PDF question-answering system in Google Colab using Gemini Flash 1.5, PyMuPDF, and the Google Generative AI API. This solution enables users to extract information from PDFs and query it interactively with ease. The combination of Google’s cutting-edge AI models and Colab’s cloud-based environment provides a powerful and accessible way to process large documents without requiring heavy computational resources.

Here is the Colab Notebook. Also, don’t forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group. Don’t Forget to join our 80k+ ML SubReddit.

The post A Code Implementation to Build an AI-Powered PDF Interaction System in Google Colab Using Gemini Flash 1.5, PyMuPDF, and Google Generative AI API appeared first on MarkTechPost.

Meet PC-Agent: A Hierarchical Multi-Agent Collaboration Framework for …

Multi-modal Large Language Models (MLLMs) have demonstrated remarkable capabilities across various domains, propelling their evolution into multi-modal agents for human assistance. GUI automation agents for PCs face particularly daunting challenges compared to their smartphone counterparts. PC environments present significantly more complex interactive elements, with dense, diverse icons and widgets that often lack textual labels, leading to perception difficulties. Even advanced models like Claude-3.5 achieve only 24.0% accuracy in GUI grounding tasks. Also, PC productivity tasks involve intricate workflows spanning multiple applications, with lengthy operation sequences and inter-subtask dependencies, causing dramatic performance declines: GPT-4o’s success rate drops from 41.8% at the subtask level to just 8% for complete instructions.

Previous approaches have developed frameworks to address PC task complexity with varying strategies. UFO implements a dual-agent architecture separating application selection from specific control interactions. Meanwhile, AgentS augments planning capabilities by combining online search with local memory. However, these methods demonstrate significant limitations in fine-grained perception and operation of on-screen text—a critical requirement for productivity scenarios like document editing. In addition, they generally fail to address the complex dependencies between subtasks, resulting in poor performance when handling realistic intra- and inter-app workflows that characterize everyday PC usage.

Researchers from MAIS (Institute of Automation, Chinese Academy of Sciences), the School of Artificial Intelligence (University of Chinese Academy of Sciences), Alibaba Group, Beijing Jiaotong University, and the School of Information Science and Technology (ShanghaiTech University) introduce the PC-Agent framework to address complex PC scenarios through three innovative designs. First, the Active Perception Module enhances fine-grained interaction by extracting the locations and meanings of interactive elements via accessibility trees, while using MLLM-driven intention understanding and OCR for precise text localization. Second, Hierarchical Multi-agent Collaboration implements a three-level decision process (Instruction-Subtask-Action) in which a Manager Agent decomposes instructions into parameterized subtasks and manages dependencies, a Progress Agent tracks operation history, and a Decision Agent executes steps using perception and progress information. Third, Reflection-based Dynamic Decision-making introduces a Reflection Agent that assesses execution correctness and provides feedback, enabling top-down task decomposition with bottom-up precision feedback across all four collaborating agents.

PC-Agent’s architecture addresses GUI interaction through a formalized approach where an agent ρ processes user instructions I, observations O, and history H to determine actions A. The Active Perception Module enhances element recognition using pywinauto to extract accessibility trees for interactive elements while employing MLLM-driven intention understanding with OCR for precise text localization. For complex workflows, PC-Agent implements Hierarchical Multi-agent Collaboration across three levels: the Manager Agent decomposes instructions into parameterized subtasks and manages dependencies; the Progress Agent tracks operation progress within subtasks; and the Decision Agent executes step-by-step actions based on environmental perception and progress information. This hierarchical division effectively reduces decision-making complexity by breaking complex tasks into manageable components with clear interdependencies.
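
The hierarchy can be pictured with a minimal conceptual sketch (illustrative Python only; every class and method name below is an assumption standing in for the roles described above, not PC-Agent’s actual API):

# Conceptual sketch of a three-level Instruction -> Subtask -> Action loop
# with reflection feedback. All objects here are hypothetical placeholders.
def run_instruction(instruction, manager, progress, decision, reflection, env):
    subtasks = manager.decompose(instruction)              # Manager: parameterized subtasks + dependencies
    for subtask in subtasks:
        while not progress.is_complete(subtask):
            observation = env.perceive()                   # active perception: accessibility tree + OCR
            action = decision.next_action(subtask, observation, progress.history())
            outcome = env.execute(action)
            feedback = reflection.assess(action, outcome)  # bottom-up error detection and correction
            progress.update(subtask, action, outcome, feedback)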

Experimental results demonstrate PC-Agent’s superior performance compared to both single and multi-agent alternatives. Single MLLM-based agents (GPT-4o, Gemini-2.0, Claude3.5, Qwen2.5-VL) consistently fail on complex instructions, with even the best performer achieving only 12% success rate, confirming that single-agent approaches struggle with lengthy operational sequences and complex dependencies. Multi-agent frameworks like UFO and AgentS show modest improvements but remain limited by perception deficiencies and dependency management issues. They struggle with fine-grained operations such as text editing in Word or proper data entry in Excel, and often fail to utilize information from previous subtasks. In contrast, PC-Agent significantly outperforms all previous methods, surpassing UFO by 44% and AgentS by 32% in success rate through its Active Perception Module and hierarchical multi-agent collaboration.

This study introduces the PC-Agent framework, a significant advancement in handling complex PC-based tasks through three key innovations. The Active Perception Module provides refined perception and operation capabilities, enabling precise interaction with GUI elements and text. The hierarchical multi-agent collaboration architecture effectively decomposes decision-making across instruction, subtask, and action levels, while reflection-based dynamic decision-making allows for real-time error detection and correction. Validation through the newly created PC-Eval benchmark with realistic, complex instructions confirms PC-Agent’s superior performance compared to previous methods, demonstrating its effectiveness in navigating the intricate workflows and interactive environments characteristic of PC productivity scenarios.

Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 80k+ ML SubReddit.

The post Meet PC-Agent: A Hierarchical Multi-Agent Collaboration Framework for Complex Task Automation on PC appeared first on MarkTechPost.

Researchers from the University of Cambridge and Monash University Int …

Reasoning capabilities have become essential for LLMs, but analyzing these complex processes poses a significant challenge. While LLMs can generate detailed reasoning output as text, the lack of process visualization creates barriers to understanding, evaluating, and improving these reasoning processes. This limitation manifests in three critical ways: increased cognitive load for users attempting to parse complex reasoning paths; difficulty detecting logical fallacies, circular reasoning, and missing steps that remain obscured in lengthy text outputs; and restrictions on downstream applications due to the absence of standardized visualization frameworks. Hence, there is a need for unified visualization solutions that can effectively illustrate diverse reasoning methodologies across the growing ecosystem of LLM providers and models.

Existing methods like sequential reasoning show step-by-step problem decomposition and have evolved through several variants. Tree-based approaches like Tree-of-Thoughts enable state-based branching for parallel path exploration, while Beam Search reasoning evaluates solution paths based on scoring mechanisms. Further, current visualization approaches fall into two categories: model behavior analysis and reasoning process illustration. Tools like BertViz and Transformers Interpret provide detailed visualizations of attention mechanisms but are limited to low-level model behaviors. Frameworks such as LangGraph offer basic flow visualization without supporting diverse reasoning methodologies, while general-purpose tools like Graphviz and Mermaid lack specific adaptations for LLM reasoning analysis.

Researchers from the University of Cambridge and Monash University have proposed ReasonGraph, a web-based platform for visualizing and analyzing LLM reasoning processes. It supports sequential and tree-based reasoning methods while seamlessly integrating with major LLM providers and over fifty state-of-the-art models. ReasonGraph incorporates an intuitive UI with meta reasoning method selection, configurable visualization parameters, and a modular framework that facilitates efficient extension. By providing a unified visualization framework, ReasonGraph effectively reduces cognitive load in analyzing complex reasoning paths, improves error detection in logical processes, and enables more effective development of LLM-based applications.

ReasonGraph utilizes a modular framework that provides extensible reasoning visualization through a clear separation of components. The front-end tier handles visualization logic and user interaction, implementing an asynchronous event-handling module in which user interactions with method selection and parameter configuration trigger corresponding state updates. The backend framework is organized around three core modules implemented in Flask: a Configuration Manager for state updates, an API Factory for LLM integration, and a Reasoning Methods module for reasoning-approach encapsulation. Framework modularity exists at both the API and reasoning-method levels, with the API Factory providing a unified interface for multiple LLM providers through the BaseAPI class.
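
The factory pattern described here can be sketched roughly as follows; this is an illustrative reconstruction in which only the BaseAPI name comes from the paper, and the provider subclasses and registry are assumptions:

# Rough sketch of a provider-agnostic API factory behind a common BaseAPI interface.
class BaseAPI:
    def generate(self, prompt: str, **params) -> str:
        raise NotImplementedError

class OpenAIAPI(BaseAPI):
    def generate(self, prompt: str, **params) -> str:
        ...  # call the provider SDK here

class AnthropicAPI(BaseAPI):
    def generate(self, prompt: str, **params) -> str:
        ...  # call the provider SDK here

class APIFactory:
    _registry = {"openai": OpenAIAPI, "anthropic": AnthropicAPI}

    @classmethod
    def create(cls, provider: str) -> BaseAPI:
        # Unified entry point: reasoning methods only ever see the BaseAPI interface.
        return cls._registry[provider]()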

The evaluation of ReasonGraph shows the platform’s robustness in three key aspects. In parsing reliability, the rule-based XML parsing approach achieves nearly 100% accuracy in extracting and visualizing reasoning paths from properly formatted LLM outputs. For processing efficiency, the Mermaid-based visualization generation time is negligible compared to the LLM’s reasoning time, maintaining consistent performance across all six reasoning methods implemented in the platform. Regarding platform usability, preliminary feedback from open-source platform users shows that approximately 90% of users successfully used the platform without assistance, though these metrics continue to evolve as the user base expands and the platform undergoes regular updates.
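
As a rough illustration of why visualization generation is cheap relative to LLM inference, converting a parsed list of reasoning steps into Mermaid flowchart text is plain string assembly; the node format below is an assumption, not ReasonGraph’s exact output:

def steps_to_mermaid(steps):
    # Build a top-down Mermaid flowchart from an ordered list of reasoning steps.
    lines = ["flowchart TD"]
    for i, step in enumerate(steps):
        lines.append(f'    S{i}["{step}"]')
        if i > 0:
            lines.append(f"    S{i-1} --> S{i}")
    return "\n".join(lines)

print(steps_to_mermaid(["Read the question", "Plan sub-steps", "Compute the answer"]))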

In this paper, researchers introduced ReasonGraph, a web-based platform that enables visualization and analysis of LLM reasoning processes across six mainstream methods and over 50 models. It achieves high usability across diverse applications in academia, education, and development through its modular framework and real-time visualization capabilities. Future work includes (a) using the open-source community to integrate additional reasoning methods and expand model API support, (b) developing the platform based on community feedback and user suggestions, (c) exploring downstream applications such as reasoning evaluation and educational tutorials, and (d) implementing editable nodes in the visualization flowcharts to enable direct modification of reasoning processes.

Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 80k+ ML SubReddit.

The post Researchers from the University of Cambridge and Monash University Introduce ReasonGraph: A Web-based Platform to Visualize and Analyze LLM Reasoning Processes appeared first on MarkTechPost.

Meet Attentive Reasoning Queries (ARQs): A Structured Approach to Enha …

Large Language Models (LLMs) have become crucial in customer support, automated content creation, and data retrieval. However, their effectiveness is often hindered by their inability to consistently follow detailed instructions across multiple interactions. This issue is particularly critical in high-stakes environments, such as financial services and customer support systems, where strict adherence to guidelines is essential. LLMs frequently struggle with instruction recall, leading to deviations from intended behaviors. They also generate misleading or incorrect information, commonly called hallucination, which makes their deployment challenging in scenarios requiring precise, context-aware decision-making.

Maintaining reasoning consistency in complex scenarios remains a challenge for LLMs. While they generate coherent responses to simple queries, their performance declines in multi-turn conversations influenced by past interactions. One key issue is alignment drift, where models gradually move away from original instructions, causing misinterpretation of guidelines and incorrect recommendations. Context forgetfulness is another concern, where models prioritize recent information over earlier details, often disregarding critical constraints. These factors contribute to errors that undermine the reliability of LLM-driven systems. Despite strategies like Chain-of-Thought (CoT) and verification-based prompting, existing methods do not provide enough structure to guide models reliably through complex tasks.

Various prompting techniques have been developed to improve instruction adherence. CoT prompting encourages step-by-step reasoning to enhance logical accuracy, while Chain-of-Verification requires explicit self-checking of outputs. Although these methods improve upon direct response generation, they lack mechanisms to reinforce domain-specific constraints and systematically prevent common failures. AI frameworks like LangChain add structural elements for tool integration and workflow automation but treat LLM reasoning as a black box, limiting their ability to enforce strict guidelines. The lack of mechanisms to prevent hallucination and instruction drift highlights the need for a more structured approach.

Researchers at Emcie Co Ltd. developed Attentive Reasoning Queries (ARQs) to address these shortcomings. This novel approach introduces a structured reasoning blueprint designed to guide LLMs systematically through predefined queries. Unlike free-form reasoning methods, ARQs implement a structured JSON schema that directs the model’s attention to specific decision points at critical moments. This design enables ARQs to enhance guideline adherence while minimizing failures caused by misinterpretation or loss of contextual details. To evaluate its effectiveness, the approach was tested within Parlant, a framework used for building customer-facing AI applications. Initial findings demonstrated that ARQs significantly improved instruction-following capabilities while mitigating hallucination-related errors.

The ARQ framework consists of multiple stages that collectively enhance reasoning performance. The first step involves issuing targeted, structured queries that remind the model of key constraints before response generation. These queries reinforce critical instructions, ensuring the model does not deviate from predefined guidelines. Next, the model processes a series of step-by-step queries to reinforce task-specific reasoning. In some implementations, an additional verification step follows, where the model checks its response against predefined correctness criteria before finalizing the output. This structured approach contrasts sharply with CoT prompting by incorporating explicit mechanisms to ensure consistency at every stage of the reasoning process.
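
A toy example of what such a structured query might look like is sketched below; this is a hedged illustration only, and the field names are assumptions rather than the actual ARQ schemas used in Parlant:

# Illustrative only: a JSON-style ARQ that restates constraints, forces
# step-by-step reasoning fields, and ends with an explicit verification flag.
arq = {
    "active_guidelines": [
        "Only discuss products that exist in the catalog.",
        "Never promise delivery dates.",
    ],
    "customer_request_summary": None,      # the model fills this in first
    "relevant_guideline_ids": [],          # which guidelines apply here, and why
    "step_by_step_reasoning": [],          # ordered reasoning before any answer is drafted
    "draft_response": None,
    "response_violates_guidelines": None,  # self-check before the final output
}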

In a performance evaluation within the Parlant framework, across a controlled test environment comprising 87 distinct conversational scenarios, ARQs achieved a 90.2% success rate, outperforming both CoT reasoning (86.1%) and direct response generation (81.5%). The ARQ methodology excelled in addressing two critical failure modes: guideline re-application and hallucination prevention. Specifically, in cases where the model needed to reapply earlier instructions, ARQs ensured a 92.19% success rate, significantly higher than CoT (87.81%) and direct response generation (85.31%). Also, ARQs reduced the occurrence of factual inaccuracies, with models using ARQs exhibiting a 23% lower hallucination rate than those relying on standard CoT techniques. These results underscore the importance of structured reasoning approaches in improving LLM reliability.

Several key takeaways from the research include:

ARQs improved instruction adherence, achieving a 90.2% success rate across 87 test cases, surpassing Chain-of-Thought (86.1%) and direct response generation (81.5%).

ARQs significantly reduced hallucination errors by 23% compared to CoT, making them particularly useful for business-critical AI applications requiring factual consistency.

In guideline re-application scenarios, ARQs outperformed CoT by 4.38%, achieving a success rate of 92.19% compared to CoT’s 87.81%.

The structured nature of ARQs allowed for more efficient reasoning in classification tasks, reducing token usage by 29% compared to CoT.

The verification mechanism in ARQs was key to preventing alignment drift. It ensured that models focused on predefined constraints even in extended conversations.

Future research aims to optimize ARQ efficiency further by refining query design and exploring its application in diverse AI-driven decision-making systems.

Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 80k+ ML SubReddit.

The post Meet Attentive Reasoning Queries (ARQs): A Structured Approach to Enhancing Large Language Model Instruction Adherence, Decision-Making Accuracy, and Hallucination Prevention in AI-Driven Conversational Systems appeared first on MarkTechPost.

Allen Institute for AI (AI2) Releases OLMo 32B: A Fully Open Model to …

The rapid evolution of artificial intelligence (AI) has ushered in a new era of large language models (LLMs) capable of understanding and generating human-like text. However, the proprietary nature of many of these models poses challenges for accessibility, collaboration, and transparency within the research community. Additionally, the substantial computational resources required to train such models often limit participation to well-funded organizations, thereby hindering broader innovation.​

Addressing these concerns, the Allen Institute for AI (AI2) has introduced OLMo 2 32B, the latest and most advanced model in the OLMo 2 series. This model distinguishes itself as the first fully open model to surpass GPT-3.5 Turbo and GPT-4o mini across a suite of widely recognized, multi-skill academic benchmarks. By making all data, code, weights, and training details freely available, AI2 promotes a culture of openness and collaboration, enabling researchers worldwide to build upon this work.
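
Because the weights are openly released, the model can in principle be loaded with the Hugging Face transformers library. The snippet below is only a sketch: the repository id shown is an assumption to verify on the Hugging Face Hub, and a 32B model requires substantial GPU memory.

# Sketch: loading the open weights with transformers (repository id is assumed).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "allenai/OLMo-2-0325-32B"   # check the exact id on the HF Hub
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Open language models matter because", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=40)[0]))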

OLMo 2 32B’s architecture comprises 32 billion parameters, reflecting a significant scaling from its predecessors. The training process was meticulously structured in two primary phases: pretraining and mid-training. During pretraining, the model was exposed to approximately 3.9 trillion tokens from diverse sources, including DCLM, Dolma, Starcoder, and Proof Pile II, ensuring a comprehensive understanding of language patterns. The mid-training phase utilized the Dolmino dataset, which consists of 843 billion tokens curated for quality, encompassing educational, mathematical, and academic content. This phased approach ensured that OLMo 2 32B developed a robust and nuanced grasp of language.

A notable aspect of OLMo 2 32B is its training efficiency. The model achieved performance levels comparable to leading open-weight models while utilizing only a fraction of the computational resources. Specifically, it required approximately one-third of the training compute compared to models like Qwen 2.5 32B, highlighting AI2’s commitment to resource-efficient AI development. ​

In benchmark evaluations, OLMo 2 32B demonstrated impressive results. It matched or exceeded the performance of models such as GPT-3.5 Turbo, GPT-4o mini, Qwen 2.5 32B, and Mistral 24B. Furthermore, it approached the performance levels of larger models like Qwen 2.5 72B and Llama 3.1 and 3.3 70B. These assessments spanned various tasks, including Massive Multitask Language Understanding (MMLU), mathematics problem-solving (MATH), and instruction-following evaluations (IFEval), underscoring the model’s versatility and competence across diverse linguistic challenges. ​

The release of OLMo 2 32B signifies a pivotal advancement in the pursuit of open and accessible AI. By providing a fully open model that not only competes with but also surpasses certain proprietary models, AI2 exemplifies how thoughtful scaling and efficient training methodologies can lead to significant breakthroughs. This openness fosters a more inclusive and collaborative environment, empowering researchers and developers globally to engage with and contribute to the evolving landscape of artificial intelligence.

Check out the Technical Details, HF Project and GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 80k+ ML SubReddit.

The post Allen Institute for AI (AI2) Releases OLMo 32B: A Fully Open Model to Beat GPT 3.5 and GPT-4o mini on a Suite of Multi-Skill Benchmarks appeared first on MarkTechPost.

This AI Paper Introduces BD3-LMs: A Hybrid Approach Combining Autoregr …

Traditional language models rely on autoregressive approaches, which generate text sequentially, ensuring high-quality outputs at the expense of slow inference speeds. In contrast, diffusion models, initially developed for image and video generation, have gained attention in text generation due to their potential for parallelized generation and improved controllability. However, existing diffusion models struggle with fixed-length constraints and inefficiencies in likelihood modeling, limiting their effectiveness in generating flexible-length text.

A major challenge in language modeling is balancing efficiency and quality. Autoregressive models capture long-range dependencies effectively but suffer from slow token-by-token generation. Diffusion models, while promising, require multiple inference steps and typically generate fixed-length outputs. This limitation prevents them from being practical for real-world applications where variable-length sequences are necessary. The research addresses this issue by proposing a method that combines the strengths of both autoregressive and diffusion models, ensuring efficient and high-quality text generation without compromising flexibility.

Current methods primarily involve autoregressive models, which generate text one token at a time based on previously generated tokens. While these models achieve high fluency and coherence, they are inherently slow due to their sequential processing nature. Diffusion-based approaches have been explored as an alternative, offering parallel generation. However, existing diffusion models generate fixed-length sequences and lack efficient means of extending beyond predefined contexts. Despite the inefficiencies of autoregressive generation, the limited scalability of diffusion models has led to continued reliance on autoregressive methods.

Cornell Tech and Stanford University researchers introduced Block Discrete Denoising Diffusion Language Models (BD3-LMs) to overcome these limitations. This new class of models interpolates between autoregressive and diffusion models by employing a structured approach that supports variable-length generation while maintaining inference efficiency. BD3-LMs use key-value caching and parallel token sampling to reduce computational overhead. The model is designed with specialized training algorithms that minimize gradient variance through customized noise schedules, optimizing performance across diverse language modeling benchmarks.

BD3-LMs operate by structuring text generation into blocks rather than individual tokens. Unlike traditional autoregressive models, which predict the next token sequentially, BD3-LMs generate a block of tokens simultaneously, significantly improving efficiency. A diffusion-based denoising process within each block ensures high-quality text generation while preserving coherence. The model architecture integrates transformers with a block-causal attention mechanism, allowing each block to condition on previously generated blocks. This approach enhances both contextual relevance and fluency. The training process includes a vectorized implementation that enables parallel computations, reducing training time and resource consumption. Researchers introduced data-driven noise schedules that stabilize training and improve gradient estimation to address the high variance issue in diffusion models.
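
The block-wise scheme can be summarized with high-level pseudocode; this is a conceptual sketch only (the model methods below are hypothetical, not the released implementation): the sequence grows one block at a time, and each block is produced by iterative denoising conditioned, via block-causal attention, on the blocks generated so far.

# Conceptual sketch of block-wise generation with per-block denoising.
def generate(model, prompt_tokens, num_blocks, block_size, num_denoise_steps):
    sequence = list(prompt_tokens)
    for _ in range(num_blocks):
        # Start the new block from fully masked/noised tokens.
        block = [model.mask_token_id] * block_size
        for step in reversed(range(num_denoise_steps)):
            # Denoise the whole block in parallel, conditioned on prior blocks
            # (block-causal attention) and cached key/value states.
            block = model.denoise_block(context=sequence, block=block, step=step)
        sequence.extend(block)   # the finished block becomes context for the next one
    return sequence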

Performance evaluations of BD3-LMs demonstrate substantial improvements over existing discrete diffusion models. The model achieves state-of-the-art perplexity scores among diffusion-based language models while enabling the generation of arbitrary-length sequences. In experiments conducted on language modeling benchmarks, BD3-LMs reduce perplexity by up to 13% compared to previous diffusion models. On the LM1B dataset, BD3-LMs achieved a perplexity of 28.23 when using a block size of four, outperforming previous models such as MDLM, which had a perplexity of 31.78. On OpenWebText, BD3-LMs attained a perplexity of 20.73, significantly better than other discrete diffusion models. Further, BD3-LMs generated sequences up to 10 times longer than those produced by traditional diffusion methods, demonstrating superior scalability. The proposed model also reduced the number of function evaluations required for inference, achieving improved sample efficiency and generation speed.

The introduction of BD3-LMs presents a significant advancement in language modeling by integrating autoregressive and diffusion-based methodologies. By addressing key challenges related to inference efficiency, likelihood estimation, and sequence flexibility, this research offers a practical and scalable solution for text generation. BD3-LMs improve training stability and computational efficiency, providing a framework that can be extended to future language modeling developments. The results highlight the effectiveness of BD3-LMs in bridging the gap between autoregressive and diffusion-based approaches, offering an optimized balance between quality and speed in text generation.

Check out the Paper, Project and GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 80k+ ML SubReddit.

The post This AI Paper Introduces BD3-LMs: A Hybrid Approach Combining Autoregressive and Diffusion Models for Scalable and Efficient Text Generation appeared first on MarkTechPost.