Scalable intelligent document processing using Amazon Bedrock Data Automation

Intelligent document processing (IDP) is a technology to automate the extraction, analysis, and interpretation of critical information from a wide range of documents. By using advanced machine learning (ML) and natural language processing algorithms, IDP solutions can efficiently extract and process structured data from unstructured text, streamlining document-centric workflows.
When enhanced with generative AI capabilities, IDP enables organizations to transform document workflows through advanced understanding, structured data extraction, and automated classification. Generative AI-powered IDP solutions can better handle the variety of documents that traditional ML models might not have seen before. This technology combination is impactful across multiple industries, including child support services, insurance, healthcare, financial services, and the public sector. Traditional manual processing creates bottlenecks and increases error risk, but by implementing these advanced solutions, organizations can dramatically enhance their document workflow efficiency and information retrieval capabilities. AI-enhanced IDP solutions improve service delivery while reducing administrative burden across diverse document processing scenarios.
This approach to document processing provides scalable, efficient, and high-value document processing that leads to improved productivity, reduced costs, and enhanced decision-making. Enterprises that embrace the power of IDP augmented with generative AI can benefit from increased efficiency, enhanced customer experiences, and accelerated growth.
In the blog post Scalable intelligent document processing using Amazon Bedrock, we demonstrated how to build a scalable IDP pipeline using Anthropic foundation models on Amazon Bedrock. Although that approach delivered robust performance, the introduction of Amazon Bedrock Data Automation brings a new level of efficiency and flexibility to IDP solutions. This post explores how Amazon Bedrock Data Automation enhances document processing capabilities and streamlines the automation journey.
Benefits of Amazon Bedrock Data Automation
Amazon Bedrock Data Automation introduces several features that significantly improve the scalability and accuracy of IDP solutions:

Confidence scores and bounding box data – Amazon Bedrock Data Automation provides confidence scores and bounding box data, enhancing data explainability and transparency. With these features, you can assess the reliability of extracted information, resulting in more informed decision-making. For instance, low confidence scores can signal the need for additional human review or verification of specific data fields.
Blueprints for rapid development – Amazon Bedrock Data Automation provides pre-built blueprints that simplify the creation of document processing pipelines, helping you develop and deploy solutions quickly. Amazon Bedrock Data Automation provides flexible output configurations to meet diverse document processing requirements. For simple extraction use cases (OCR and layout) or for a linearized output of the text in documents, you can use standard output. For customized output, you can start from scratch to design a unique extraction schema, or use preconfigured blueprints from our catalog as a starting point. You can customize your blueprint based on your specific document types and business requirements for more targeted and accurate information retrieval.
Automatic classification support – Amazon Bedrock Data Automation splits and matches documents to appropriate blueprints, resulting in precise document categorization. This intelligent routing alleviates the need for manual document sorting, drastically reducing human intervention and accelerating processing time.
Normalization – Amazon Bedrock Data Automation addresses a common IDP challenge through its comprehensive normalization framework, which handles both key normalization (mapping various field labels to standardized names) and value normalization (converting extracted data into consistent formats, units, and data types). This normalization approach helps reduce data processing complexities, so organizations can automatically transform raw document extractions into standardized data that integrates more smoothly with their existing systems and workflows.
Transformation – The Amazon Bedrock Data Automation transformation feature converts complex document fields into structured, business-ready data by automatically splitting combined information (such as addresses or names) into discrete, meaningful components. This capability simplifies how organizations handle varied document formats, helping teams define custom data types and field relationships that match their existing database schemas and business applications.
Validation – Amazon Bedrock Data Automation enhances document processing accuracy by using automated validation rules for extracted data, supporting numeric ranges, date formats, string patterns, and cross-field checks. This validation framework helps organizations automatically identify data quality issues, trigger human reviews when needed, and make sure extracted information meets specific business rules and compliance requirements before entering downstream systems.

Solution overview
The following diagram shows a fully serverless architecture that uses Amazon Bedrock Data Automation along with AWS Step Functions and Amazon Augmented AI (Amazon A2I) to provide cost-effective scaling for document processing workloads of different sizes.

The Step Functions workflow processes multiple document types including multipage PDFs and images using Amazon Bedrock Data Automation. It uses various Amazon Bedrock Data Automation blueprints (both standard and custom) within a single project to enable processing of diverse document types such as immunization documents, conveyance tax certificates, child support services enrollment forms, and driver licenses.
The workflow processes a file (PDF, JPG, PNG, TIFF, DOC, DOCX) containing a single document or multiple documents through the following steps:

For multi-page documents, splits along logical document boundaries
Matches each document to the appropriate blueprint
Applies the blueprint’s specific extraction instructions to retrieve information from each document
Performs normalization, transformation, and validation on the extracted data according to the instructions specified in the blueprint

The Step Functions Map state is used to process each document. If a document meets the confidence threshold, the output is sent to an Amazon Simple Storage Service (Amazon S3) bucket. If any extracted data falls below the confidence threshold, the document is sent to Amazon A2I for human review. Reviewers use the Amazon A2I UI with bounding box highlighting for selected fields to verify the extraction results. When the human review is complete, the callback task token is used to resume the state machine and human-reviewed output is sent to an S3 bucket.
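The following is a minimal sketch of the confidence check that drives this routing decision. It assumes a simplified output structure (a mapping of field names to values and confidence scores) rather than the exact Amazon Bedrock Data Automation output schema; in the deployed solution, Step Functions and Amazon A2I perform the actual routing.

CONFIDENCE_THRESHOLD = 0.85  # assumed value; tune per document type and use case

def fields_needing_review(bda_result: dict, threshold: float = CONFIDENCE_THRESHOLD) -> list:
    """Return the names of extracted fields whose confidence falls below the threshold."""
    return [
        name
        for name, field in bda_result.get("extracted_fields", {}).items()
        if field.get("confidence", 0.0) < threshold
    ]

# Example output for a processed enrollment form (illustrative values)
result = {
    "extracted_fields": {
        "enrollee_name": {"value": "Jane Doe", "confidence": 0.97},
        "date_of_birth": {"value": "2015-04-02", "confidence": 0.62},
    }
}

low_confidence = fields_needing_review(result)
if low_confidence:
    print(f"Route to Amazon A2I for human review: {low_confidence}")
else:
    print("Confidence threshold met; write output to the S3 bucket")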
To deploy this solution in an AWS account, follow the steps provided in the accompanying GitHub repository.
In the following sections, we review the specific Amazon Bedrock Data Automation features used in this solution, illustrated with the example of a child support enrollment form.
Automated Classification
In our implementation, we define the document class name for each custom blueprint created, as illustrated in the following screenshot. When processing multiple document types, such as driver’s licenses and child support enrollment forms, the system automatically applies the appropriate blueprint based on content analysis, making sure the correct extraction logic is used for each document type.

Data Normalization
We use data normalization to make sure downstream systems receive uniformly formatted data. We use both explicit extractions (for clearly stated information visible in the document) and implicit extractions (for information that needs transformation). For example, as shown in the following screenshot, dates of birth are standardized to YYYY-MM-DD format.

Similarly, Social Security numbers are standardized to the XXX-XX-XXXX format.
Data Transformation
For the child support enrollment application, we’ve implemented custom data transformations to align extracted data with specific requirements. One example is our custom data type for addresses, which breaks down single-line addresses into structured fields (Street, City, State, ZipCode). These structured fields are reused across different address fields in the enrollment form (employer address, home address, other parent address), resulting in consistent formatting and straightforward integration with existing systems.
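As an illustration of the concept (outside of Amazon Bedrock Data Automation, which performs this during extraction), the following plain-Python sketch splits a single-line address into the structured fields described above. The parsing is intentionally naive and assumes a "Street, City, ST ZIP" layout.

def split_address(address_line: str) -> dict:
    """Naive illustration of address transformation; assumes 'Street, City, ST ZIP'."""
    street, city, state_zip = [part.strip() for part in address_line.split(",")]
    state, zip_code = state_zip.split()
    return {"Street": street, "City": city, "State": state, "ZipCode": zip_code}

print(split_address("123 Main St, Dallas, TX 75201"))
# {'Street': '123 Main St', 'City': 'Dallas', 'State': 'TX', 'ZipCode': '75201'}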

Data Validation
Our implementation includes validation rules for maintaining data accuracy and compliance. For our example use case, we’ve implemented two validations: (1) verify the presence of the enrollee’s signature, and (2) verify that the signed date isn’t in the future.

The following screenshot shows the result of applying these validation rules to the document.
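To make the two rules concrete, here is a minimal plain-Python sketch of equivalent checks. The field names are illustrative assumptions, and in the actual solution these rules are declared in the blueprint rather than written as code.

from datetime import date

def validate_enrollment(extraction: dict) -> list:
    """Return a list of validation issues for an extracted enrollment form."""
    issues = []
    if not extraction.get("enrollee_signature_present", False):
        issues.append("Enrollee signature is missing")
    signed_date = extraction.get("signed_date")  # expected as YYYY-MM-DD after normalization
    if signed_date and date.fromisoformat(signed_date) > date.today():
        issues.append("Signed date is in the future")
    return issues

print(validate_enrollment({"enrollee_signature_present": True, "signed_date": "2030-01-15"}))
# ['Signed date is in the future']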

Human-in-the-loop validation
The following screenshot illustrates the extraction process, which includes a confidence score and is integrated with a human-in-the-loop process. It also shows normalization applied to the date of birth format.

Conclusion
Amazon Bedrock Data Automation significantly advances IDP by introducing confidence scoring, bounding box data, automatic classification, and rapid development through blueprints. In this post, we demonstrated how to take advantage of its advanced capabilities for data normalization, transformation, and validation. By upgrading to Amazon Bedrock Data Automation, organizations can significantly reduce development time, improve data quality, and create more robust, scalable IDP solutions that integrate with human review processes.
Follow the AWS Machine Learning Blog to keep up to date with new capabilities and use cases for Amazon Bedrock.

About the authors
Abdul Navaz is a Senior Solutions Architect in the Amazon Web Services (AWS) Health and Human Services team, based in Dallas, Texas. With over 10 years of experience at AWS, he focuses on modernization solutions for child support and child welfare agencies using AWS services. Prior to his role as a Solutions Architect, Navaz worked as a Senior Cloud Support Engineer, specializing in networking solutions.
Venkata Kampana is a senior solutions architect in the Amazon Web Services (AWS) Health and Human Services team and is based in Sacramento, Calif. In this role, he helps public sector customers achieve their mission objectives with well-architected solutions on AWS.
Sanjeev Pulapaka is principal solutions architect and AI lead for public sector. Sanjeev is a published author with several blogs and a book on generative AI. He is also a well-known speaker at several events including re:Invent and Summit. Sanjeev has an undergraduate degree in engineering from the Indian Institute of Technology and an MBA from the University of Notre Dame.

Whiteboard to cloud in minutes using Amazon Q, Amazon Bedrock Data Automation, and MCP

Upgrading legacy systems has become increasingly important to stay competitive in today’s market as outdated infrastructure can cost organizations time, money, and market position. However, modernization efforts face challenges like time-consuming architecture reviews, complex migrations, and fragmented systems. These delays not only impact engineering teams but have broader impacts including lost market opportunities, reduced competitiveness, and higher operational costs. With Amazon Q Developer, Amazon Bedrock Data Automation (Bedrock Data Automation) and Anthropic’s Model Context Protocol (MCP), developers can now go from whiteboard sketches and team discussions to fully deployed, secure, and scalable cloud architectures in a matter of minutes, not months.
We’re excited to share the Amazon Bedrock Data Automation Model Context Protocol (MCP) server, which enables seamless integration between Amazon Q and your enterprise data. With this new capability, developers can use the features of Amazon Q while maintaining secure access to their organization’s data through standardized MCP interactions. In this post, you will learn how to use the Amazon Bedrock Data Automation MCP server to securely integrate with AWS services, use Bedrock Data Automation operations as callable MCP tools, and build a conversational development experience with Amazon Q.
The problem: Five systems, lack of agility
Engineers looked at a whiteboard, eyeing a complex web of arrows, legacy system names, and integration points that had long stopped making sense. The diagram represented multiple disconnected systems held together by brittle scripts, fragile batch jobs, and a patchwork of manual workarounds as shown in the following illustration.

The meeting audio was synthesized using Amazon Polly to bring the conversation to life for this post.
“We need to stop patching and start transforming,” Alex said, pointing at the tangled mess. The team nodded, weary from another outage that left the finance team reconciling thousands of transactions by hand. Feature development had slowed to a crawl, infrastructure costs were unpredictable, and any change risked breaking something downstream. Migration felt inevitable but overwhelming. The question wasn’t whether to modernize – it was how to begin without burning months in planning and coordination. That’s when they turned to the new pattern.
The breakthrough
Just a few months ago, building a working prototype from a whiteboard session like this would have taken months, if not longer. The engineers would have started by manually transcribing the meeting, converting rough ideas into action items, cleaning up architecture diagrams, aligning teams across operations and security, and drafting infrastructure templates by hand. Every step would have required coordination, and each change made would have invited risk to the system. Even a proof-of-concept would have demanded hours of YAML, command line interface (CLI) commands, policy definitions, and trial-and-error troubleshooting. Now the engineers need to only ask, and what used to take months happens in minutes.

With Amazon Q CLI, the team initiates a conversation. Behind the scenes, Amazon Q CLI invokes the MCP server and extracts information from multimodal content using Bedrock Data Automation. The meeting recording and the draft architecture diagram are also analyzed using Bedrock Data Automation. Amazon Q uses the extracted content from Bedrock Data Automation to generate the AWS CloudFormation template. It even deploys it to the AWS Cloud when asked. There is no manual translation, no brittle scripting, and no dependency mapping across systems. The result is a fully deployable, secure AWS architecture generated and provisioned in minutes. What once required cross-functional coordination and prolonged development cycles now starts and completes with a chat.
Understanding the Model Context Protocol
The Model Context Protocol (MCP) is an open standard developed by Anthropic to facilitate secure, two-way connections between AI models and multiple data sources, including content repositories, business tools, and development environments. By standardizing these interactions, MCP enables AI systems to access the data they need to provide more relevant and accurate responses.
MCP operates on a client-server architecture, where developers can either expose their data through MCP servers or build AI applications (MCP clients) that connect to these servers. This setup allows for a more streamlined and scalable integration process, replacing the need for custom connectors for each data source.
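For example, exposing data through an MCP server can be as simple as the following sketch, which uses the open source MCP Python SDK. The tool shown is a toy example for illustration, not the Amazon Bedrock Data Automation MCP server discussed in this post.

# Requires the open source MCP Python SDK: pip install "mcp[cli]"
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("demo-data-server")

@mcp.tool()
def get_document_summary(document_id: str) -> str:
    """Return a short summary for a document ID (hypothetical data source)."""
    return f"Summary for document {document_id} would be looked up here."

if __name__ == "__main__":
    mcp.run()  # serve over stdio so MCP clients such as Amazon Q CLI can call the tool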
Enhancing Amazon Q with Amazon Bedrock Data Automation and MCP server
Bedrock Data Automation complements MCP by providing a robust suite of tools that automate the extraction, transformation, and loading (ETL) of enterprise data into AI workflows at scale and with minimal manual intervention. With Bedrock Data Automation, customers can:

Extract unstructured data from diverse sources such as document, image, audio, and video files.
Transform and validate data using schema-driven extraction with Blueprints, confidence scoring, and responsible AI practices to maintain accuracy, completeness, and consistency.
Load ready-to-use data into AI models for real-time, context-aware reasoning across the business.

This deep integration makes sure that AI models are not just connected to data but grounded in clean, validated, and context-rich information. As a result, intelligent agents deliver more accurate, relevant, and reliable outputs that drive faster decisions and richer insights across the enterprise. Amazon Q Developer is a generative AI-powered conversational assistant from AWS designed to help software developers and IT professionals build, operate, and transform software with greater speed, security, and efficiency. It acts as an intelligent coding companion and productivity tool, integrated with the AWS environment and available in popular code editors, the AWS Management Console, and collaboration tools such as Microsoft Teams and Slack. As shown in the following figure, the Bedrock Data Automation MCP server works in the following way:

The User sends a “Request action” to the MCP Host.
The MCP Host processes the request with an LLM.
The MCP Host then requests a tool execution to the MCP Client.
The MCP Client makes a tool call request to the MCP Server.
The MCP Server makes an API request to Bedrock Data Automation.
Bedrock Data Automation sends back an API response to the MCP Server.
The MCP Server returns the tool result to the MCP Client.
The MCP Client sends the result back to the MCP Host.
The MCP Host processes the result with the LLM.
The MCP Host sends a final response to the User.

Step-by-step guide
If this is your first time using AWS MCP servers, visit the Installation and Setup guide in the AWS Labs GitHub repository for installation instructions. After installation, add the MCP server configuration shown in the following sections to your local setup.
Prerequisites

You need an AWS account and an AWS Identity and Access Management (IAM) role and user with permissions to create and manage the necessary resources and components for this application. If you don’t have an AWS account, see How do I create and activate a new Amazon Web Services account?
Bedrock Data Automation MCP requires an Amazon Simple Storage Service (Amazon S3) bucket to function properly. If you need to create a new S3 bucket, follow the AWS security best practices for bucket configuration.
NodeJS and NPM
Follow these instructions to set up Amazon Q

Set up MCP
Install Amazon Q for command line and add the configuration to ~/.aws/amazonq/mcp.json. If you’re already an Amazon Q CLI user, add only the configuration.

{
  "mcpServers": {
    "bedrock-data-automation-mcp-server": {
      "command": "uvx",
      "args": [
        "awslabs.aws-bedrock-data-automation-mcp-server@latest"
      ],
      "env": {
        "AWS_PROFILE": "your-aws-profile",
        "AWS_REGION": "your-aws-region",
        "AWS_BUCKET_NAME": "amzn-s3-demo-bucket"
      }
    }
  }
}

To confirm the setup was successful, open a terminal and run q chat to start a chat session with Amazon Q.
Need to know what tools are at your disposal? Enter: "Tell me the tools I have access to"
If MCP has been properly configured, as shown in the following screenshot, you will see aws_bedrock_data_automation with its three tools: getprojects, getprojectdetails, and analyzeasset. This helps you quickly verify access and make sure that the necessary components are properly set up.

Now, you can ask Amazon Q to use Bedrock Data Automation as a tool and extract the transcript from the meeting stored in the .mp3 file and refer to the updated architecture diagram, as shown in the following screenshot.

can you extract the meeting recording from <your-location> and refer to the updated architecture diagram from <your-location> using Bedrock Data Automation 

You can seamlessly continue a natural language conversation with Amazon Q to generate an AWS CloudFormation template, write prototype code, or even implement monitoring solutions. The potential applications are virtually endless.
Clean up
When you’re done working with the Amazon Bedrock Data Automation MCP server, follow these steps to clean up:

Empty and delete the S3 buckets used for Bedrock Data Automation.

aws s3 rm s3://amzn-s3-demo-bucket --recursive
aws s3 rb s3://amzn-s3-demo-bucket

Remove the configuration added to ~/.aws/amazonq/mcp.json for bedrock-data-automation-mcp-server.

Conclusion
With MCP and Bedrock Data Automation, Amazon Q Developer can turn messy ideas into working cloud architectures in record time. No whiteboards are left behind.
Are you ready to build smarter, faster, and more context-aware applications? Explore Amazon Q Developer and see how MCP and Amazon Bedrock Data Automation can help your team turn ideas into reality faster than ever before.

About the authors
Wrick Talukdar is a Tech Lead and Senior Generative AI Specialist at Amazon Web Services, driving innovation through multimodal AI, generative models, computer vision, and natural language processing. He is also the author of the bestselling book “Building Agentic AI Systems”. He is a keynote speaker and often presents his innovations and solutions at leading global forums, including AWS re:Invent, ICCE, Global Consumer Technology conference, and major industry events such as CERAWeek and ADIPEC. In his free time, he enjoys writing and birding photography.
Ayush Goyal is a Senior Software Engineer at Amazon Bedrock, where he focuses on designing and scaling AI-powered distributed systems. He’s also passionate about contributing to open-source projects. When he’s not writing code, Ayush enjoys speed cubing, exploring global cuisines, and discovering new parks—both in the real world and through open-world games.
Himanshu Sah is an Associate Delivery Consultant in AWS Professional Services, specialising in Application Development and Generative AI solutions. Based in India, he helps customers architect and implement cutting-edge applications leveraging AWS services and generative AI capabilities. Working closely with cross-functional teams, he focuses on delivering best-practice implementations while ensuring optimal performance and cost-effectiveness. Outside of work, he is passionate about exploring new technologies and contributing to the tech community.

Bringing agentic Retrieval Augmented Generation to Amazon Q Business

Amazon Q Business is a generative AI-powered enterprise assistant that helps organizations unlock value from their data. By connecting to enterprise data sources, employees can use Amazon Q Business to quickly find answers, generate content, and automate tasks—from accessing HR policies to streamlining IT support workflows, all while respecting existing permissions and providing clear citations. At the heart of systems like Amazon Q Business lies Retrieval Augmented Generation (RAG), which enables AI models to ground their responses in an organization’s enterprise data.
The evolution of RAG
Traditional RAG implementations typically follow a straightforward approach: retrieve relevant documents or passages based on a user query, then generate a response using these documents or passages as context for the large language model (LLM) to respond. While this methodology works well for basic, factual queries, enterprise environments present uniquely complex challenges that expose the limitations of this single-shot retrieval approach.
Consider an employee asking about the differences between two benefits packages or requesting a comparison of project outcomes across multiple quarters. These queries require synthesizing information from various sources, understanding company-specific context, and often need multiple retrieval steps to gather comprehensive information around each aspect of the query.
Traditional RAG systems struggle with such complexity, often providing incomplete answers or failing to adapt their retrieval strategy when initial results are insufficient. When processing these more involved queries, users are left waiting without visibility into the system’s progress, leading to an opaque experience.
Bringing agency to Amazon Q Business
Agentic RAG brings a new paradigm to Amazon Q Business for handling sophisticated enterprise queries through intelligent, agent-based retrieval strategies. By introducing AI agents that dynamically plan and execute sophisticated retrieval strategies with a suite of data navigation tools, Agentic RAG represents a significant evolution in how AI assistants interact with enterprise data, delivering more accurate and comprehensive responses while maintaining the speed users expect.
With Agentic RAG in Amazon Q Business you have several new capabilities, including query decomposition and transparent events, agentic retrieval tool use, improved conversational capabilities, and agentic response optimization. Let’s dive deeper into what each of these mean.
Query decomposition and transparent response events
Traditional RAG systems often face significant challenges when processing complex enterprise queries, particularly those involving multiple steps, composite elements, or comparative analysis. With this release of Agentic RAG in Amazon Q Business, we aim to solve this problem through sophisticated query decomposition techniques, where AI agents intelligently break down complex questions into discrete, manageable components.
When an employee asks “Please compare the vacation policies of Washington and California?”, the question is decomposed into two queries: washington state vacation policies and california state vacation policies.
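The following is a minimal conceptual sketch of query decomposition, not Amazon Q Business’s internal implementation: an LLM is prompted to split a comparative question into independent sub-queries that can then be retrieved in parallel. The model ID shown is a placeholder, and production code would validate the model output.

import json
import boto3

bedrock = boto3.client("bedrock-runtime")

def decompose(question: str, model_id: str = "your-model-id") -> list:
    """Ask an LLM to break a question into independent search queries."""
    prompt = (
        "Break the following question into independent search queries, "
        f"returned only as a JSON list of strings:\n{question}"
    )
    response = bedrock.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    # Assumes the model returns a valid JSON array, for example:
    # ["washington state vacation policies", "california state vacation policies"]
    return json.loads(response["output"]["message"]["content"][0]["text"])

sub_queries = decompose("Please compare the vacation policies of Washington and California?")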
Because Agentic RAG runs a series of parallel steps to explore the data source and collect thorough information for more accurate query resolution, we now provide real-time visibility into its processing steps, which are displayed on the screen as data is retrieved to generate the response. After the response is generated, the steps are collapsed as the response is streamed. In the following image, we see how the decomposed queries are displayed and the relevant data retrieved for response generation.

This allows users to see meaningful updates to the system’s operations, including query decomposition patterns, document retrieval paths, and response generation workflows. This granular visibility into the system’s decision-making process enhances user confidence and provides valuable insights into the sophisticated mechanisms driving accurate response generation.
This agentic solution facilitates comprehensive data collection and enables more accurate, nuanced responses. The result is enhanced responses that maintain both granular precision and holistic understanding of complex, multi-faceted business questions, while relying on the LLM to synthesize the information retrieved. As shown in the following image, the information fetched individually for the California and Washington vacation policies was synthesized by the LLM and presented in rich markdown format.

Agentic tool use
The designed RAG agents can intelligently deploy various data exploration tools and retrieval methods in optimal strategies, reasoning about the retrieval plan while maintaining context over multiple turns of the conversation. These retrieval tools include tools built within Amazon Q Business such as tabular search, which allows intelligent retrieval of data through either code generation or tabular linearization across small and large tables embedded in documents (such as DOCX, PPTX, and PDF) or stored in CSV or XLSX files. Another retrieval tool is long context retrieval, which determines when the full context of a document is required. For example, if a user asks a query such as Summarize the 10K of Company X, the agent could identify the query’s intent as a summarization query that requires document-level context and, as a result, deploy the long context retrieval tool, which fetches the complete document—the 10K of Company X—as part of the context for the LLM to generate a response (as shown in the following figure). This intelligent tool selection and deployment represents a significant advancement over traditional RAG systems, which often rely on fragmented passage retrieval that can compromise the coherence and completeness of complex document analysis for question answering.

Improved conversational capabilities
Agentic RAG introduces multi-turn query capabilities that elevate the conversational capabilities of Amazon Q Business into dynamic, context-aware dialogues. The agent maintains conversational context across interactions by storing short-term memory, enabling natural follow-up questions without requiring users to restate previous context. Additionally, when the agent encounters multiple possible answers based on your enterprise data, it asks clarifying questions to disambiguate the query to better understand what you’re looking for to provide more accurate responses. For instance, Q refers to any of the many implementations of Amazon Q. The system handles semantic ambiguity gracefully by recognizing multiple potential interpretations of what Q could be and asks for clarifications in its responses to verify accuracy and relevance. This sophisticated approach to dialogue management makes complex tasks like policy interpretation or technical troubleshooting more efficient, because the system can progressively refine its understanding through targeted clarification and follow-up exchanges.
In the following image, the user asks tell me about Q with the system providing a high-level overview of the various implementations and asking a follow-up question to disambiguate the user’s search intent.

Upon successful disambiguation, the system persists both the conversation state and previously retrieved contextual data in memory, enabling the generation of precisely targeted responses that align with the user’s clarified intent and are therefore more accurate, relevant, and complete.

Agentic response optimization
Agentic RAG introduces dynamic response optimization where AI agents actively evaluate and refine their responses. Unlike traditional systems that provide answers even when the context is insufficient, these agents continuously assess response quality and iteratively plan out new actions to improve information completeness. They can recognize when initial retrievals miss crucial information and autonomously initiate additional searches or alternative retrieval strategies. This means when discussing complex topics like compliance policies, the system captures all relevant updates, exceptions, and interdependencies while maintaining context across multiple turns of the conversation. The following diagram shows how Agentic RAG handles the conversation history across multiple turns of the conversation. The agent plans and reasons across the retrieval tool use and response generation process. Based on the initial retrieval, while taking into account the conversation state and history, the agent re-plans the process as needed to generate the most complete and accurate response for the user’s query.

Using the Agentic RAG feature
Getting started with Agentic RAG’s advanced capabilities in Amazon Q Business is straightforward and can immediately improve how your organization interacts with your enterprise data. To begin, in the Amazon Q Business web interface, you can switch on the Advanced Search toggle to enable Agentic RAG, as shown in the following image.

After advanced search is enabled, users can experience richer and more complete responses from Amazon Q Business. Agentic RAG particularly shines when handling complex business scenarios based on your enterprise data—imagine asking about cross-AWS Region performance comparisons, investigating policy implications across departments, or analyzing historical trends in project deliveries. The system excels at breaking down these complex queries into manageable search tasks while maintaining context throughout the conversation.
For the best experience, users should feel confident in asking detailed, multi-part questions. Unlike traditional search systems, Agentic RAG handles nuanced queries like
How have our metrics changed across the southeast and northeast regions in 2024?
The system will work through such questions methodically, showing its progress as it analyzes and breaks the query down into composite parts to fetch sufficient context and generate a complete and accurate response.
Conclusion
Agentic RAG represents a significant leap forward for Amazon Q Business, transforming how organizations use their enterprise data while maintaining the robust security and compliance that they expect with AWS services. Through its sophisticated query processing and contextual understanding, the system enables deeper, more nuanced interactions with enterprise data—from comparative and multi-step queries to interactive multi-turn chat experiences. All of this occurs within a secure framework that respects existing permissions and access controls, making sure that users receive only authorized information while maintaining the rich, contextual responses needed for meaningful insights.
By combining advanced retrieval capabilities with intelligent, conversation-aware interactions, Agentic RAG allows organizations to unlock the full potential of their data while maintaining the highest standards of data governance. The result is an improved chat experience and a more capable query answering engine that maximizes the value of your data assets.
Try out Amazon Q Business for your organization with your data and share your thoughts in the comments.

About the authors
Sanjit Misra is a technical product leader at Amazon Web Services, driving innovation on Amazon Q Business, Amazon’s generative AI product. He leads product development for core Agentic AI features that enhance accuracy and retrieval — including Agentic RAG, conversational disambiguation, tabular search, and long-context retrieval. With over 15 years of experience across product and engineering roles in data, analytics, and AI/ML, Sanjit combines deep technical expertise with a track record of delivering business outcomes. He is based in New York City.
Venky Nagapudi is a Senior Manager of Product Management for Amazon Q Business. His focus areas include RAG features, accuracy evaluation and enhancement, user identity management and user subscriptions.
Yi-An Lai is a Senior Applied Scientist with the Amazon Q Business team at Amazon Web Services in Seattle, WA. His expertise spans agentic information retrieval, conversational AI systems, LLM tool orchestration, and advanced natural language processing. With over a decade of experience in ML/AI, he has been enthusiastic about developing sophisticated AI solutions that bridge state-of-the-art research and practical enterprise applications.
Yumo Xu is an Applied Scientist at AWS, where he focuses on building helpful and responsible AI systems for enterprises. His primary research interests are centered on the foundational challenges of machine reasoning and agentic AI. Prior to AWS, Yumo received his PhD in Natural Language Processing from the University of Edinburgh.
Danilo Neves Ribeiro is an Applied Scientist on the Q Business team based in Santa Clara, CA. He is currently working on designing innovative solutions for information retrieval, reasoning, language model agents, and conversational experience for enterprise use cases within AWS. He holds a Ph.D. in Computer Science from Northwestern University (2023) and has over three years of experience working as an AI/ML scientist.
Kapil Badesara is a Senior Machine Learning Engineer on AWS Q Business, focusing on optimizing RAG systems for accuracy and efficiency. Kapil is based out of Seattle and has more than 10 years of building large scale AI/ML services.
Sunil Singh is an Engineering Manager on the Amazon Q Business team, where he leads the development of next-generation agentic AI solutions designed to enhance Retrieval-Augmented Generation (RAG) systems for greater accuracy and efficiency. Sunil is based out of Seattle and has more than 10 years of experience in architecting secure, scalable AI/ML services for enterprise-grade applications.

Top 10 AI Agent and Agentic AI News Blogs (2025 Update)

In the rapidly evolving field of agentic AI and AI Agents, staying informed is essential. Here’s a comprehensive, up-to-date list of the Top 10 AI Agent and Agentic AI News Blogs (2025 Update)—from industry leaders to academic voices—offering insights, tutorials, and reviews focused on AI agents and Agentic AI in 2025.

1. OpenAI Blog

The official blog of OpenAI, creators of landmark models like ChatGPT, serves as a primary source for updates, research breakthroughs, and discussions on AI ethics and developments. Follow this blog for firsthand insight into the future of agentic AI systems.

2. Marktechpost.com

A leading California-based news site, Marktechpost is known for covering the latest in machine learning, AI agents, and deep learning. The publication excels in quick updates, accessible explanations, and careful reporting on agentic workflows, making it a key resource for both newcomers and experts.

3. Google AI Blog

Google’s AI blog documents the tech giant’s advances in artificial intelligence and machine learning. The blog discusses applications of agentic AI across search, cloud, and consumer products, and regularly presents deep-dives into new research.

4. AIM

The website provides real-time updates on artificial intelligence breakthroughs, tech company news, and innovations from around the world. It offers the latest information on AI products, AI agents, AI agent-related company launches, industry investments, and research developments.

5. Towards Data Science

Hosted on Medium, this community-driven blog covers emerging trends in machine learning and data science. Contributors share perspectives, project walkthroughs, and tips on agentic AI topics, making it a rich source of up-to-date industry knowledge.

6. The Hugging Face Blog

A top resource for NLP and LLM enthusiasts, Hugging Face’s blog explores everything from training large language models to deploying agents. The blog includes tutorials, model launches, and tips for integrating advanced agentic tools into real-world applications.

7. VentureBeat

VentureBeat offers in-depth coverage of AI trends and developments, including machine learning, robotics, and virtual reality. VentureBeat also has a section dedicated to AI Agents and Agentic AI.

8. Agent.ai Blog

Agent.ai is a specialized educational blog devoted to agentic AI. It provides readers with foundational concepts, best development practices, and use cases that demonstrate the real-world impact of autonomous agents.

9. n8n Blog

Offering reviews and discussions centered on AI workflow building, n8n’s blog uncovers the role and potential of agentic AI across various applications. Its guides enable professionals to evaluate and leverage AI agents in automated workflows.

10. AI Agents SubReddit

A go-to source for ranking and comparing AI agent platforms, this subreddit addresses multi-agent orchestration, performance comparisons, and practical implementation strategies for agentic workflows.

These blogs provide invaluable resources for tech leaders, engineers, researchers, and anyone interested in the future of agentic AI. While some cover AI broadly, each includes dedicated coverage of agentic AI—often within specific sections or articles. For the latest on workflows, industry trends, and deployment guidance, search within these sites for “AI agents” or explore relevant categories to stay at the cutting edge.

An Implementation Guide to Build a Modular Conversational AI Agent with Pipecat and HuggingFace

In this tutorial, we explore how we can build a fully functional conversational AI agent from scratch using the Pipecat framework. We walk through setting up a Pipeline that links together custom FrameProcessor classes, one for handling user input and generating responses with a HuggingFace model, and another for formatting and displaying the conversation flow. We also implement a ConversationInputGenerator to simulate dialogue, and use the PipelineRunner and PipelineTask to execute the data flow asynchronously. This structure showcases how Pipecat handles frame-based processing, enabling modular integration of components like language models, display logic, and future add-ons such as speech modules. Check out the FULL CODES here.

!pip install -q pipecat-ai transformers torch accelerate numpy

import asyncio
import logging
from typing import AsyncGenerator
import numpy as np

print("Checking available Pipecat frames...")

try:
    from pipecat.frames.frames import (
        Frame,
        TextFrame,
    )
    print("Basic frames imported successfully")
except ImportError as e:
    print(f"Import error: {e}")
    from pipecat.frames.frames import Frame, TextFrame

from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineTask
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor

from transformers import pipeline as hf_pipeline
import torch

We begin by installing the required libraries, including Pipecat, Transformers, and PyTorch, and then set up our imports. We bring in Pipecat’s core components, such as Pipeline, PipelineRunner, and FrameProcessor, along with HuggingFace’s pipeline API for text generation. This prepares our environment to build and run the conversational AI agent seamlessly. Check out the FULL CODES here.

class SimpleChatProcessor(FrameProcessor):
    """Simple conversational AI processor using HuggingFace"""
    def __init__(self):
        super().__init__()
        print("Loading HuggingFace text generation model...")
        self.chatbot = hf_pipeline(
            "text-generation",
            model="microsoft/DialoGPT-small",
            pad_token_id=50256,
            do_sample=True,
            temperature=0.8,
            max_length=100
        )
        self.conversation_history = ""
        print("Chat model loaded successfully!")

    async def process_frame(self, frame: Frame, direction: FrameDirection):
        await super().process_frame(frame, direction)
        if isinstance(frame, TextFrame):
            user_text = getattr(frame, "text", "").strip()
            if user_text and not user_text.startswith("AI:"):
                print(f"USER: {user_text}")
                try:
                    if self.conversation_history:
                        input_text = f"{self.conversation_history} User: {user_text} Bot:"
                    else:
                        input_text = f"User: {user_text} Bot:"

                    response = self.chatbot(
                        input_text,
                        max_new_tokens=50,
                        num_return_sequences=1,
                        temperature=0.7,
                        do_sample=True,
                        pad_token_id=self.chatbot.tokenizer.eos_token_id
                    )

                    generated_text = response[0]["generated_text"]
                    if "Bot:" in generated_text:
                        ai_response = generated_text.split("Bot:")[-1].strip()
                        ai_response = ai_response.split("User:")[0].strip()
                        if not ai_response:
                            ai_response = "That's interesting! Tell me more."
                    else:
                        ai_response = "I'd love to hear more about that!"

                    self.conversation_history = f"{input_text} {ai_response}"
                    await self.push_frame(TextFrame(text=f"AI: {ai_response}"), direction)
                except Exception as e:
                    print(f"Chat error: {e}")
                    await self.push_frame(
                        TextFrame(text="AI: I'm having trouble processing that. Could you try rephrasing?"),
                        direction
                    )
        else:
            await self.push_frame(frame, direction)

We implement SimpleChatProcessor, which loads the HuggingFace DialoGPT-small model for text generation and maintains conversation history for context. As each TextFrame arrives, we process the user’s input, generate a model response, clean it up, and push it forward in the Pipecat pipeline for display. This design ensures our AI agent can hold coherent, multi-turn conversations in real time. Check out the FULL CODES here.

class TextDisplayProcessor(FrameProcessor):
    """Displays text frames in a conversational format"""
    def __init__(self):
        super().__init__()
        self.conversation_count = 0

    async def process_frame(self, frame: Frame, direction: FrameDirection):
        await super().process_frame(frame, direction)
        if isinstance(frame, TextFrame):
            text = getattr(frame, "text", "")
            if text.startswith("AI:"):
                print(f"{text}")
                self.conversation_count += 1
                print(f"Exchange {self.conversation_count} complete\n")
        await self.push_frame(frame, direction)


class ConversationInputGenerator:
    """Generates demo conversation inputs"""
    def __init__(self):
        self.demo_conversations = [
            "Hello! How are you doing today?",
            "What's your favorite thing to talk about?",
            "Can you tell me something interesting about AI?",
            "What makes conversation enjoyable for you?",
            "Thanks for the great chat!"
        ]

    async def generate_conversation(self) -> AsyncGenerator[TextFrame, None]:
        print("Starting conversation simulation...\n")
        for i, user_input in enumerate(self.demo_conversations):
            yield TextFrame(text=user_input)
            if i < len(self.demo_conversations) - 1:
                await asyncio.sleep(2)

We create TextDisplayProcessor to neatly format and display AI responses, tracking the number of exchanges in the conversation. Alongside it, ConversationInputGenerator simulates a sequence of user messages as TextFrame objects, adding short pauses between them to mimic a natural back-and-forth flow during the demo. Check out the FULL CODES here.

class SimpleAIAgent:
    """Simple conversational AI agent using Pipecat"""
    def __init__(self):
        self.chat_processor = SimpleChatProcessor()
        self.display_processor = TextDisplayProcessor()
        self.input_generator = ConversationInputGenerator()

    def create_pipeline(self) -> Pipeline:
        return Pipeline([self.chat_processor, self.display_processor])

    async def run_demo(self):
        print("Simple Pipecat AI Agent Demo")
        print("Conversational AI with HuggingFace")
        print("=" * 50)

        pipeline = self.create_pipeline()
        runner = PipelineRunner()
        task = PipelineTask(pipeline)

        async def produce_frames():
            async for frame in self.input_generator.generate_conversation():
                await task.queue_frame(frame)
            await task.stop_when_done()

        try:
            print("Running conversation demo...\n")
            await asyncio.gather(
                runner.run(task),
                produce_frames(),
            )
        except Exception as e:
            print(f"Demo error: {e}")
            logging.error(f"Pipeline error: {e}")

        print("Demo completed successfully!")

In SimpleAIAgent, we tie everything together by combining the chat processor, display processor, and input generator into a single Pipecat Pipeline. The run_demo method launches the PipelineRunner to process frames asynchronously while the input generator feeds simulated user messages. This orchestrated setup allows the agent to process inputs, generate responses, and display them in real time, completing the end-to-end conversational flow. Check out the FULL CODES here.

async def main():
    logging.basicConfig(level=logging.INFO)
    print("Pipecat AI Agent Tutorial")
    print("Google Colab Compatible")
    print("Free HuggingFace Models")
    print("Simple & Working Implementation")
    print("=" * 60)
    try:
        agent = SimpleAIAgent()
        await agent.run_demo()
        print("\nTutorial Complete!")
        print("\nWhat You Just Saw:")
        print("✓ Pipecat pipeline architecture in action")
        print("✓ Custom FrameProcessor implementations")
        print("✓ HuggingFace conversational AI integration")
        print("✓ Real-time text processing pipeline")
        print("✓ Modular, extensible design")
        print("\nNext Steps:")
        print("• Add real speech-to-text input")
        print("• Integrate text-to-speech output")
        print("• Connect to better language models")
        print("• Add memory and context management")
        print("• Deploy as a web service")
    except Exception as e:
        print(f"Tutorial failed: {e}")
        import traceback
        traceback.print_exc()

try:
    import google.colab
    print("Google Colab detected - Ready to run!")
    ENV = "colab"
except ImportError:
    print("Local environment detected")
    ENV = "local"

print("\n" + "=" * 60)
print("READY TO RUN!")
print("Execute this cell to start the AI conversation demo")
print("=" * 60)

print("\nStarting the AI Agent Demo...")

await main()

We define the main function to initialize logging, set up the SimpleAIAgent, and run the demo while printing helpful progress and summary messages. We also detect whether the code is running in Google Colab or locally, display environment details, and then call await main() to start the full conversational AI pipeline execution.

In conclusion, we have a working conversational AI agent where user inputs (or simulated text frames) are passed through a processing pipeline, the HuggingFace DialoGPT model generates responses, and the results are displayed in a structured conversational format. The implementation demonstrates how Pipecat’s architecture supports asynchronous processing, stateful conversation handling, and clean separation of concerns between different processing stages. With this foundation, we can now integrate more advanced features, such as real-time speech-to-text, text-to-speech synthesis, context persistence, or richer model backends, while retaining a modular and extensible code structure.

Check out the FULL CODES here. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.



Why Docker Matters for Artificial Intelligence AI Stack: Reproducibility, Portability, and Environment Parity

Artificial intelligence and machine learning workflows are notoriously complex, involving fast-changing code, heterogeneous dependencies, and the need for rigorously repeatable results. By approaching the problem from basic principles—what does AI actually need to be reliable, collaborative, and scalable—we find that container technologies like Docker are not a convenience, but a necessity for modern ML practitioners. This article unpacks the core reasons why Docker has become foundational for reproducible machine learning: reproducibility, portability, and environment parity.

Reproducibility: Science You Can Trust

Reproducibility is the backbone of credible AI development. Without it, scientific claims or production ML models cannot be verified, audited, or reliably transferred between environments.

Precise Environment Definition: Docker ensures that all code, libraries, system tools, and environment variables are specified explicitly in a Dockerfile. This enables you to recreate the exact same environment on any machine, sidestepping the classic “works on my machine” problem that has plagued researchers for decades.

Version Control for Environments: Not only code but also dependencies and runtime configurations can be version-controlled alongside your project. This allows teams—or future you—to rerun experiments perfectly, validating results and debugging issues with confidence.

Easy Collaboration: By sharing your Docker image or Dockerfile, colleagues can instantly replicate your ML setup. This eliminates setup discrepancies, streamlining collaboration and peer review.

Consistency Across Research and Production: The very container that worked for your academic experiment or benchmark can be promoted to production with zero changes, ensuring scientific rigor translates directly to operational reliability.


Portability: Building Once, Running Everywhere

AI/ML projects today span local laptops, on-prem clusters, commercial clouds, and even edge devices. Docker abstracts away the underlying hardware and OS, reducing environmental friction:

Independence from Host System: Containers encapsulate the application and all dependencies, so your ML model runs identically regardless of whether the host is Ubuntu, Windows, or macOS.

Cloud & On-Premises Flexibility: The same container can be deployed on AWS, GCP, Azure, or any local machine that supports Docker. This makes migrations (cloud to cloud, notebook to server) trivial and risk-free.

Scaling Made Simple: As data grows, containers can be replicated to scale horizontally across dozens or thousands of nodes, without any dependency headaches or manual configuration.

Future-Proofing: Docker’s architecture supports emerging deployment patterns, such as serverless AI and edge inference, ensuring ML teams can keep pace with innovation without refactoring legacy stacks.

Environment Parity: The End of “It Works Here, Not There”

Environment parity means your code behaves the same way during development, testing, and production. Docker nails this guarantee:

Isolation and Modularity: Each ML project lives in its own container, eliminating conflicts from incompatible dependencies or system-level resource contention. This is especially vital in data science, where different projects often need different versions of Python, CUDA, or ML libraries.

Rapid Experimentation: Multiple containers can run side-by-side, supporting high-throughput ML experimentation and parallel research, with no risk of cross-contamination.

Easy Debugging: When bugs emerge in production, parity makes it trivial to spin up the same container locally and reproduce the issue instantly, dramatically reducing MTTR (mean time to resolution).

Seamless CI/CD Integration: Parity enables fully automated workflows—from code commit, through automated testing, to deployment—without nasty surprises due to mismatched environments.

A Modular AI Stack for the Future

Modern machine learning workflows often break down into distinct phases: data ingestion, feature engineering, training, evaluation, model serving, and observability. Each of these can be managed as a separate, containerized component. Orchestration tools like Docker Compose and Kubernetes then let teams build reliable AI pipelines that are easy to manage and scale.

This modularity not only aids development and debugging but sets the stage for adopting best practices in MLOps: model versioning, automated monitoring, and continuous delivery—all built upon the trust that comes from reproducibility and environment parity.

Why Containers Are Essential for AI

Starting from core requirements (reproducibility, portability, environment parity), it is clear that Docker and containers tackle the “hard problems” of ML infrastructure head-on:

They make reproducibility effortless instead of painful.

They empower portability in an increasingly multi-cloud and hybrid world.

They deliver environment parity, putting an end to cryptic bugs and slow collaboration.

Whether you’re a solo researcher, part of a startup, or working in a Fortune 500 enterprise, using Docker for AI projects is no longer optional—it’s foundational to doing modern, credible, and high-impact machine learning.



Securely launch and scale your agents and tools on Amazon Bedrock AgentCore Runtime

Organizations are increasingly excited about the potential of AI agents, but many find themselves stuck in what we call “proof of concept purgatory”—where promising agent prototypes struggle to make the leap to production deployment. In our conversations with customers, we’ve heard consistent challenges that block the path from experimentation to enterprise-grade deployment:
“Our developers want to use different frameworks and models for different use cases—forcing standardization slows innovation.”
“The stochastic nature of agents makes security more complex than traditional applications—we need stronger isolation between user sessions.”
“We struggle with identity and access control for agents that need to act on behalf of users or access sensitive systems.”
“Our agents need to handle various input types—text, images, documents—often with large payloads that exceed typical serverless compute limits.”
“We can’t predict the compute resources each agent will need, and costs can spiral when overprovisioning for peak demand.”
“Managing infrastructure for agents that may be a mix of short and long-running requires specialized expertise that diverts our focus from building actual agent functionality.”
Amazon Bedrock AgentCore Runtime addresses these challenges with a secure, serverless hosting environment specifically designed for AI agents and tools. Whereas traditional application hosting systems weren’t built for the unique characteristics of agent workloads—variable execution times, stateful interactions, and complex security requirements—AgentCore Runtime was purpose-built for these needs.
The service alleviates the infrastructure complexity that has kept promising agent prototypes from reaching production. It handles the undifferentiated heavy lifting of container orchestration, session management, scalability, and security isolation, helping developers focus on creating intelligent experiences rather than managing infrastructure. In this post, we discuss how to accomplish the following:

Use different agent frameworks and different models
Deploy, scale, and stream agent responses in four lines of code
Secure agent execution with session isolation and embedded identity
Use state persistence for stateful agents along with Amazon Bedrock AgentCore Memory
Process different modalities with large payloads
Operate asynchronous multi-hour agents
Pay only for used resources

Use different agent frameworks and models
One advantage of AgentCore Runtime is its framework-agnostic and model-agnostic approach to agent deployment. Whether your team has invested in LangGraph for complex reasoning workflows, adopted CrewAI for multi-agent collaboration, or built custom agents using Strands, AgentCore Runtime can use your existing code base without requiring architectural changes or framework migrations. Refer to these samples on GitHub for examples.
With AgentCore Runtime, you can integrate different large language models (LLMs) from your preferred provider, such as Amazon Bedrock managed models, Anthropic’s Claude, OpenAI’s API, or Google’s Gemini. This keeps your agent implementations portable and adaptable as the LLM landscape evolves, and helps you pick the right model for your use case to optimize for performance, cost, or other business requirements. You and your team retain the flexibility to choose the framework or model that works best, all through a unified deployment pattern.
Let’s examine how AgentCore Runtime supports two different frameworks and model providers:

LangGraph agent using Anthropic’s Claude Sonnet on Amazon Bedrock
Strands agent using GPT-4o mini through the OpenAI API

For the full code examples, refer to langgraph_agent_web_search.py and strands_openai_identity.py on GitHub.
Both of the preceding examples show how you can use the AgentCore SDK, regardless of the underlying framework or model choice. After you have modified your code as shown in these examples, you can deploy your agent with or without the AgentCore Runtime starter toolkit, discussed in the next section.
Note that the additions to the example code are minimal and specific to the AgentCore SDK. We dive deeper into these in the next section.
Deploy, scale, and stream agent responses with four lines of code
Let’s examine the two examples above. In both examples, we only add four new lines of code:

Import – from bedrock_agentcore.runtime import BedrockAgentCoreApp
Initialize – app = BedrockAgentCoreApp()
Decorate – @app.entrypoint
Run – app.run()
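
The following is a minimal sketch of these four additions in context, assuming a simple Strands agent; the agent construction and the prompt key are illustrative, not part of the original examples:

from bedrock_agentcore.runtime import BedrockAgentCoreApp  # 1. Import
from strands import Agent

app = BedrockAgentCoreApp()  # 2. Initialize
agent = Agent()  # assumed: a default Strands agent

@app.entrypoint  # 3. Decorate
def invoke(payload, context):
    result = agent(payload.get("prompt", "Hello"))
    return result.message

if __name__ == "__main__":
    app.run()  # 4. Run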

After you have made these changes, the most straightforward way to get started is with the AgentCore starter toolkit. We suggest using uv to create and manage local development environments and package requirements in Python. Install the starter toolkit as follows:

uv pip install bedrock-agentcore-starter-toolkit

Run the appropriate commands to configure, launch, and invoke to deploy and use your agent. The following video provides a quick walkthrough.

For your chat style applications, AgentCore Runtime supports streaming out of the box. For example, in Strands, locate the following synchronous code:

result = agent(user_message)

Change the preceding code to the following and deploy:

agent_stream = agent.stream_async(user_message)
async for event in agent_stream:
    yield event  # you can process or filter these events before yielding

For more examples on streaming agents, refer to the following GitHub repo. The following is an example Streamlit application that streams back responses from an AgentCore Runtime agent.
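As a rough sketch of where this pattern lives, the streaming call sits inside an async entrypoint like the following; the handler name and the `agent` instance (a Strands Agent) are assumptions for illustration:

@app.entrypoint
async def stream_handler(payload, context):
    user_message = payload.get("prompt", "")
    agent_stream = agent.stream_async(user_message)
    async for event in agent_stream:
        yield event  # AgentCore Runtime streams yielded events back to the caller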

Secure agent execution with session isolation and embedded identity
AgentCore Runtime fundamentally changes how we think about serverless compute for agentic applications by introducing persistent execution environments that can maintain an agent’s state across multiple invocations. Rather than the typical serverless model where functions spin up, execute, and immediately terminate, AgentCore Runtime provisions dedicated microVMs that can persist for up to 8 hours. This enables sophisticated multi-step agentic workflows where each subsequent call builds upon the accumulated context and state from previous interactions within the same session. The practical implication is that you can now implement complex, stateful logic patterns that would previously require external state management solutions or cumbersome workarounds to maintain context between function executions. This doesn’t obviate the need for external state management (see the following section on using AgentCore Runtime with AgentCore Memory), but it addresses the common need to maintain local state and files temporarily, within the context of a session.
Understanding the session lifecycle
The session lifecycle operates through three distinct states that govern resource allocation and availability (see diagram below for a high level view of this session lifecycle). When you first invoke a runtime with a unique session identifier, AgentCore provisions a dedicated execution environment and transitions it to an Active state during request processing or when background tasks are running.
The system automatically tracks synchronous invocation activity, while background processes can signal their status through HealthyBusy responses to health check pings from the service (see the later section on asynchronous workloads). Sessions transition to Idle when not processing requests but remain provisioned and ready for immediate use, reducing cold start penalties for subsequent invocations.

Finally, sessions reach a Terminated state when they exceed the 15-minute inactivity threshold, hit the 8-hour maximum duration limit, or fail health checks. Understanding these state transitions is crucial for designing resilient workflows that gracefully handle session boundaries and resource cleanup. For more details on session lifecycle-related quotas, refer to AgentCore Runtime Service Quotas.
The ephemeral nature of AgentCore sessions means that runtime state exists solely within the boundaries of the active session lifecycle. The data your agent accumulates during execution—such as conversation context, user preference mappings, intermediate computational results, or transient workflow state—remains accessible only while the session persists and is completely purged when the session terminates.
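To illustrate, the following sketch invokes the same runtime twice with one session ID so both calls land in the same warm session while it remains active. It assumes the boto3 bedrock-agentcore data plane client and its invoke_agent_runtime operation; the ARN, session ID, and payload format are placeholders.

import json
import boto3

# Assumption: the "bedrock-agentcore" data plane client exposes invoke_agent_runtime;
# replace the ARN and session ID with your own values.
client = boto3.client("bedrock-agentcore")
session_id = "user-1234-conversation-5678-0123456789abcdef"  # unique per user session

for prompt in ["Summarize my open support tickets", "Now draft a reply to the oldest one"]:
    response = client.invoke_agent_runtime(
        agentRuntimeArn="arn:aws:bedrock-agentcore:us-east-1:111122223333:runtime/my_agent-XYZ",
        runtimeSessionId=session_id,  # same ID routes to the same session and its in-memory state
        payload=json.dumps({"prompt": prompt}).encode("utf-8"),
    )
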
For persistent data requirements that extend beyond individual session boundaries, AgentCore Memory provides the architectural solution for durable state management. This purpose-built service is specifically engineered for agent workloads and offers both short-term and long-term memory abstractions that can maintain user conversation histories, learned behavioral patterns, and critical insights across session boundaries. See documentation here for more information on getting started with AgentCore Memory.
True session isolation
Session isolation in AI agent workloads addresses fundamental security and operational challenges that don’t exist in traditional application architectures. Unlike stateless functions that process individual requests independently, AI agents maintain complex contextual state throughout extended reasoning processes, handle privileged operations with sensitive credentials and files, and exhibit non-deterministic behavior patterns. This creates unique risks where one user’s agent could potentially access another’s data—session-specific information could be used across multiple sessions, credentials could leak between sessions, or unpredictable agent behavior could compromise system boundaries. Traditional containerization or process isolation isn’t sufficient because agents need persistent state management while maintaining absolute separation between users.
Let’s explore a case study: In May 2025, Asana deployed a new MCP server to power agentic AI features (integrations with ChatGPT, Anthropic’s Claude, Microsoft Copilot) across its enterprise software as a service (SaaS) offering. Due to a logic flaw in MCP’s tenant isolation and relying solely on user but not agent identity, requests from one organization’s user could inadvertently retrieve cached results containing another organization’s data. This cross-tenant contamination wasn’t triggered by a targeted exploit but was an intrinsic security fault in handling context and cache separation across agentic AI-driven sessions.
The exposure silently persisted for 34 days, impacting roughly 1,000 organizations, including major enterprises. After it was discovered, Asana halted the service, remediated the bug, notified affected customers, and released a fix.
AgentCore Runtime solves these challenges through complete microVM isolation that goes beyond simple resource separation. Each session receives its own dedicated virtual machine with isolated compute, memory, and file system resources, making sure agent state, tool operations, and credential access remain completely compartmentalized. When a session ends, the entire microVM is terminated and memory sanitized, minimizing the risk of data persistence or cross-contamination. This architecture provides the deterministic security boundaries that enterprise deployments require, even when dealing with the inherently probabilistic and non-deterministic nature of AI agents, while still enabling the stateful, personalized experiences that make agents valuable. Although other offerings might provide sandboxed kernels and let you manage your own session state, persistence, and isolation, that should not be treated as a strict security boundary. AgentCore Runtime provides consistent, deterministic isolation boundaries regardless of agent execution patterns, delivering the predictable security properties required for enterprise deployments. The following diagram shows how two separate sessions run in isolated microVM kernels.

AgentCore Runtime embedded identity
Traditional agent deployments often struggle with identity and access management, particularly when agents need to act on behalf of users or access external services securely. The challenge becomes even more complex in multi-tenant environments—for example, where you need to make sure Agent A accessing Google Drive on behalf of User 1 can never accidentally retrieve data belonging to User 2.
AgentCore Runtime addresses these challenges through its embedded identity system that seamlessly integrates authentication and authorization into the agent execution environment. First, each runtime is associated with a unique workload identity (you can treat this as a unique agent identity). The service supports two primary authentication mechanisms for agents using this unique agent identity: IAM SigV4 Authentication for agents operating within AWS security boundaries, and OAuth based (JWT Bearer Token Authentication) integration with existing enterprise identity providers like Amazon Cognito, Okta, or Microsoft Entra ID.
When deploying an agent with AWS Identity and Access Management (IAM) authentication, users don’t have to incorporate other Amazon Bedrock AgentCore Identity specific settings or setup—simply configure with IAM authorization, launch, and invoke with the right user credentials.
When using JWT authentication, you configure the authorizer during the CreateAgentRuntime operation, specifying your identity provider (IdP)-specific discovery URL and allowed clients. Your existing agent code requires no modification—you simply add the authorizer configuration to your runtime deployment. When a calling entity or user invokes your agent, they pass their IdP-specific access token as a bearer token in the Authorization header. AgentCore Runtime uses AgentCore Identity to automatically validate this token against your configured authorizer and rejects unauthorized requests. The following diagram shows the flow of information between AgentCore runtime, your IdP, AgentCore Identity, other AgentCore services, other AWS services (in orange), and other external APIs or resources (in purple).

Behind the scenes, AgentCore Runtime automatically exchanges validated user tokens for workload access tokens (through the bedrock-agentcore:GetWorkloadAccessTokenForJWT API). This provides secure outbound access to external services through the AgentCore credential provider system, where tokens are cached using the combination of agent workload identity and user ID as the binding key. This cryptographic binding makes sure, for example, User 1’s Google token can never be accessed when processing requests for User 2, regardless of application logic errors. Note that in the preceding diagram, connecting to AWS resources can be achieved simply by editing the AgentCore Runtime execution role, but connections to Amazon Bedrock AgentCore Gateway or to another runtime will require reauthorization with a new access token.
The most straightforward way to configure your agent with OAuth-based inbound access is to use the AgentCore starter toolkit:

With the AWS Command Line Interface (AWS CLI), follow the prompts to interactively enter your OAuth discovery URL and allowed Client IDs (comma-separated).

With Python, use the following code:

from bedrock_agentcore_starter_toolkit import Runtime
from boto3.session import Session

boto_session = Session()
region = boto_session.region_name

discovery_url = "<your-cognito-user-pool-discovery-url>"
client_id = "<your-cognito-app-client-id>"
agent_name = "<your-agent-name>"  # placeholder

agentcore_runtime = Runtime()
response = agentcore_runtime.configure(
    entrypoint="strands_openai.py",
    auto_create_execution_role=True,  # True assumed: lets the toolkit create the execution role
    auto_create_ecr=True,             # True assumed: lets the toolkit create the ECR repository
    requirements_file="requirements.txt",
    region=region,
    agent_name=agent_name,
    authorizer_configuration={
        "customJWTAuthorizer": {
            "discoveryUrl": discovery_url,
            "allowedClients": [client_id]
        }
    }
)
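
After configuration, the toolkit can build and deploy the runtime. The launch helper below follows the pattern used in the AgentCore samples, so treat the exact method name as an assumption:

launch_result = agentcore_runtime.launch()  # builds the container image and deploys the runtime
# With a custom JWT authorizer configured, callers must pass their IdP access token
# as a Bearer token in the Authorization header when invoking the runtime.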

For outbound access (for example, if your agent uses OpenAI APIs), first set up your keys using the API or the Amazon Bedrock console, as shown in the following screenshot.

Then access your keys from within your AgentCore Runtime agent code:

import os

from bedrock_agentcore.identity.auth import requires_api_key

@requires_api_key(
    provider_name="openai-apikey-provider"  # replace with your own credential provider name
)
async def need_api_key(*, api_key: str):
    print(f"received api key for async func: {api_key}")
    os.environ["OPENAI_API_KEY"] = api_key

For more information on AgentCore Identity, refer to Authenticate and authorize with Inbound Auth and Outbound Auth and Hosting AI Agents on AgentCore Runtime.
Use AgentCore Runtime state persistence with AgentCore Memory
AgentCore Runtime provides ephemeral, session-specific state management that maintains context during active conversations but doesn’t persist beyond the session lifecycle. Each user session preserves conversational state, objects in memory, and local temporary files within isolated execution environments. For short-lived agents, you can use the state persistence offered by AgentCore Runtime without needing to save this information externally. However, at the end of the session lifecycle, the ephemeral state is permanently destroyed, making this approach suitable only for interactions that don’t require knowledge retention across separate conversations.
AgentCore Memory addresses this challenge by providing persistent storage that survives beyond individual sessions. Short-term memory captures raw interactions as events using create_event, storing the complete conversation history that can be retrieved with get_last_k_turns even if the runtime session restarts. Long-term memory uses configurable strategies to extract and consolidate key insights from these raw interactions, such as user preferences, important facts, or conversation summaries. Through retrieve_memories, agents can access this persistent knowledge across completely different sessions, enabling personalized experiences. The following diagram shows how AgentCore Runtime can use specific APIs to interact with Short-term and Long-term memory in AgentCore Memory.

This basic architecture, of using a runtime to host your agents, and a combination of short- and long-term memory has become commonplace in most agentic AI applications today. Invocations to AgentCore Runtime with the same session ID lets you access the agent state (for example, in a conversational flow) as though it were running locally, without the overhead of external storage operations, and AgentCore Memory selectively captures and structures the valuable information worth preserving beyond the session lifecycle. This hybrid approach means agents can maintain fast, contextual responses during active sessions while building cumulative intelligence over time. The automatic asynchronous processing of long-term memories according to each strategy in AgentCore Memory makes sure insights are extracted and consolidated without impacting real-time performance, creating a seamless experience where agents become progressively more helpful while maintaining responsive interactions. This architecture avoids the traditional trade-off between conversation speed and long-term learning, enabling agents that are both immediately useful and continuously improving.
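For a rough sense of how these APIs fit together, the following sketch uses the AgentCore Memory client; the MemoryClient class, method parameters, and identifiers shown are assumptions drawn from the SDK samples rather than a definitive interface.

from bedrock_agentcore.memory import MemoryClient  # assumed client class

memory = MemoryClient(region_name="us-east-1")

# Short-term memory: record the latest conversational turn as an event
memory.create_event(
    memory_id="<your-memory-id>",
    actor_id="user-1234",
    session_id="session-5678",
    messages=[("Where is my order?", "USER"), ("It ships tomorrow.", "ASSISTANT")],
)

# Long-term memory: in a later, different session, retrieve consolidated insights
memories = memory.retrieve_memories(
    memory_id="<your-memory-id>",
    namespace="/users/user-1234",
    query="delivery preferences",
)
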
Process different modalities with large payloads
Most AI agent systems struggle with large file processing due to strict payload size limits, typically capping requests at just a few megabytes. This forces developers to implement complex file chunking, multiple API calls, or external storage solutions that add latency and complexity. AgentCore Runtime removes these constraints by supporting payloads up to 100 MB in size, enabling agents to process substantial datasets, high-resolution images, audio, and comprehensive document collections in a single invocation.
Consider a financial audit scenario where you need to verify quarterly sales performance by comparing detailed transaction data against a dashboard screenshot from your analytics system. Traditional approaches would require using external storage such as Amazon Simple Storage Service (Amazon S3) or Google Drive to download the Excel file and image into the container running the agent logic. With AgentCore Runtime, you can send both the comprehensive sales data and the dashboard image in a single payload from the client:

import base64

# excel_sales_data and dashboard_screenshot are the raw file bytes
# (for example, read from local files on the client)
large_payload = {
    "prompt": "Compare the Q4 sales data with the dashboard metrics and identify any discrepancies",
    "sales_data": base64.b64encode(excel_sales_data).decode("utf-8"),
    "dashboard_image": base64.b64encode(dashboard_screenshot).decode("utf-8")
}

The agent’s entrypoint function can be modified to process both data sources simultaneously, enabling this cross-validation analysis:

@app.entrypoint
def audit_analyzer(payload, context):
    inputs = [
        {"text": payload.get("prompt", "Analyze the sales data and dashboard")},
        {"document": {"format": "xlsx", "name": "sales_data",
                      "source": {"bytes": base64.b64decode(payload["sales_data"])}}},
        {"image": {"format": "png",
                   "source": {"bytes": base64.b64decode(payload["dashboard_image"])}}}
    ]

    response = agent(inputs)
    return response.message["content"][0]["text"]

To test out an example of using large payloads, refer to the following GitHub repo.
Operate asynchronous multi-hour agents
As AI agents evolve to tackle increasingly complex tasks—from processing large datasets to generating comprehensive reports—they often require multi-step processing that can take significant time to complete. However, most agent implementations are synchronous (with response streaming) and block until completion. Although synchronous, streaming agents are a common way to expose agentic chat applications, users cannot interact with the agent while a task or tool is still running, view the status of or cancel background operations, or start additional concurrent tasks while others are still in progress.
Building asynchronous agents forces developers to implement complex distributed task management systems with state persistence, job queues, worker coordination, failure recovery, and cross-invocation state management while also navigating serverless system limitations like execution timeouts (tens of minutes), payload size restrictions, and cold start penalties for long-running compute operations—a significant heavy lift that diverts focus from core functionality.
AgentCore Runtime alleviates this complexity through stateful execution sessions that maintain context across invocations, so developers can build upon previous work incrementally without implementing complex task management logic. The AgentCore SDK provides ready-to-use constructs for tracking asynchronous tasks and seamlessly managing compute lifecycles, and AgentCore Runtime supports execution times up to 8 hours and request/response payload sizes of 100 MB, making it suitable for most asynchronous agent tasks.
Getting started with asynchronous agents
You can get started with just a couple of code changes:

pip install bedrock-agentcore

To build interactive agents that perform asynchronous tasks, simply call add_async_task when starting a task and complete_async_task when finished. The SDK automatically handles task tracking and manages compute lifecycle for you.

# Start tracking a task
task_id = app.add_async_task("data_processing")

# Do your work…
# (your business logic here)

# Mark task as complete
app.complete_async_task(task_id)

These two method calls transform your synchronous agent into a fully asynchronous, interactive system. Refer to this sample for more details.
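As a sketch of how this fits into an entrypoint, a background thread can carry the long-running work while the task markers keep the session alive; the handler name, task name, and the generate_report helper are illustrative placeholders:

import threading

@app.entrypoint
def handler(payload, context):
    task_id = app.add_async_task("report_generation")  # signals the runtime as busy

    def background_job():
        try:
            generate_report(payload)  # placeholder for your long-running business logic
        finally:
            app.complete_async_task(task_id)  # lets the session go idle when done

    threading.Thread(target=background_job, daemon=True).start()
    return {"status": "accepted", "task_id": task_id}
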
The following example shows the difference between a synchronous agent that streams back responses to the user immediately and a more complex multi-agent scenario where longer-running, asynchronous background shopping agents use Amazon Bedrock AgentCore Browser to automate a shopping experience on amazon.com on behalf of the user.

Pay only for used resources
Amazon Bedrock AgentCore Runtime introduces a consumption-based pricing model that fundamentally changes how you pay for AI agent infrastructure. Unlike traditional compute models that charge for allocated resources regardless of utilization, AgentCore Runtime bills you only for what you actually use, for however long you use it; said differently, you don’t have to pre-allocate resources like vCPU or memory, and you don’t pay for CPU resources during I/O wait periods. This distinction is particularly valuable for AI agents, which typically spend significant time waiting for LLM responses or external API calls to complete. Here is a typical agent event loop, where we only expect the purple boxes to be processed within Runtime:

The LLM call (light blue) and tool call (green) boxes take time, but are run outside the context of AgentCore Runtime; users only pay for processing that happens in Runtime itself (purple boxes). Let’s look at some real-world examples to understand the impact:
Customer support agent example
Consider a customer support agent that handles 10,000 user inquiries per day. Each interaction involves initial query processing, knowledge retrieval from Retrieval Augmented Generation (RAG) systems, LLM reasoning for response formulation, API calls to order systems, and final response generation. In a typical session lasting 60 seconds, the agent could actively use CPU for only 18 seconds (30%) while spending the remaining 42 seconds (70%) waiting for LLM responses or API calls to complete. Memory usage can fluctuate between 1.5 GB and 2.5 GB depending on the complexity of the customer query and the amount of context needed. With traditional compute models, you would pay for the full 60 seconds of CPU time and peak memory allocation. With AgentCore Runtime, you only pay for the 18 seconds of active CPU processing and the actual memory consumed moment-by-moment:

CPU cost: 18 seconds × 1 vCPU × ($0.0895/3600) = $0.0004475
Memory cost: 60 seconds × 2GB average × ($0.00945/3600) = $0.000315
Total per session: $0.0007625

For 10,000 daily sessions, this represents a 70% reduction in CPU costs compared to traditional models that would charge for the full 60 seconds.
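To double-check the math, here is the same calculation in a few lines (rates as quoted above):

CPU_RATE_PER_VCPU_HOUR = 0.0895   # USD
MEM_RATE_PER_GB_HOUR = 0.00945    # USD

cpu_cost = 18 * 1 * (CPU_RATE_PER_VCPU_HOUR / 3600)   # $0.0004475 for 18 s of 1 vCPU
mem_cost = 60 * 2 * (MEM_RATE_PER_GB_HOUR / 3600)     # $0.000315 for 60 s at 2 GB average
per_session = cpu_cost + mem_cost                     # $0.0007625
daily_cost = per_session * 10_000                     # about $7.63 for 10,000 sessions
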
Data analysis agent example
The savings become even more dramatic for data processing agents that handle complex workflows. A financial analysis agent processing quarterly reports might run for three hours but have highly variable resource needs. During data loading and initial parsing, it might use minimal resources (0.5 vCPU, 2 GB memory). When performing complex calculations or running statistical models, it might spike to 2 vCPU and 8 GB memory for just 15 minutes of the total runtime, while spending the remaining time waiting for batch operations or model inferences at much lower resource utilization. By charging only for actual resource consumption while maintaining your session state during I/O waits, AgentCore Runtime aligns costs directly with value creation, making sophisticated agent deployments economically viable at scale.
Conclusion
In this post, we explored how AgentCore Runtime simplifies the deployment and management of AI agents. The service addresses critical challenges that have traditionally blocked agent adoption at scale, offering framework-agnostic deployment, true session isolation, embedded identity management, and support for large payloads and long-running, asynchronous agents, all with a consumption based model where you pay only for the resources you use.
With just four lines of code, developers can securely launch and scale their agents while using AgentCore Memory for persistent state management across sessions. For hands-on examples on AgentCore Runtime covering simple tutorials to complex use cases, and demonstrating integrations with various frameworks such as LangGraph, Strands, CrewAI, MCP, ADK, Autogen, LlamaIndex, and OpenAI Agents, refer to the following examples on GitHub:

Amazon Bedrock AgentCore Runtime
Amazon Bedrock AgentCore Samples: Agentic Frameworks
Hosting MCP server on AgentCore Runtime
Amazon Bedrock AgentCore Starter Toolkit
Runtime QuickStart guide

About the authors
Shreyas Subramanian is a Principal Data Scientist and helps customers by using Generative AI and deep learning to solve their business challenges using AWS services like Amazon Bedrock and AgentCore. Dr. Subramanian contributes to cutting-edge research in deep learning, Agentic AI, foundation models and optimization techniques with several books, papers and patents to his name. In his current role at Amazon, Dr. Subramanian works with various science leaders and research teams within and outside Amazon, helping to guide customers to best leverage state-of-the-art algorithms and techniques to solve business critical problems. Outside AWS, Dr. Subramanian is an expert reviewer for AI papers and funding via organizations like NeurIPS, ICML, ICLR, NASA, and NSF.
Kosti Vasilakakis is a Principal PM at AWS on the Agentic AI team, where he has led the design and development of several Bedrock AgentCore services from the ground up, including Runtime. He previously worked on Amazon SageMaker since its early days, launching AI/ML capabilities now used by thousands of companies worldwide. Earlier in his career, Kosti was a data scientist. Outside of work, he builds personal productivity automations, plays tennis, and explores the wilderness with his family.
Vivek Bhadauria is a Principal Engineer at Amazon Bedrock with almost a decade of experience in building AI/ML services. He now focuses on building generative AI services such as Amazon Bedrock Agents and Amazon Bedrock Guardrails. In his free time, he enjoys biking and hiking.

PwC and AWS Build Responsible AI with Automated Reasoning on Amazon Be …

This is a guest post co-written with Scott Likens, Ambuj Gupta, Adam Hood, Chantal Hudson, Priyanka Mukhopadhyay, Deniz Konak Ozturk, and Kevin Paul from PwC
Organizations are deploying generative AI solutions while balancing accuracy, security, and compliance. In this globally competitive environment, scale matters less, speed matters more, and innovation matters most of all, according to recent PwC 2025 business insights on AI agents. To maintain a competitive advantage, organizations must support rapid deployment and verifiable trust in AI outputs. Particularly within regulated industries, mathematical verification of results can transform the speed of innovation from a potential risk into a competitive advantage.
This post presents how AWS and PwC are developing new reasoning checks that combine deep industry expertise with Automated Reasoning checks in Amazon Bedrock Guardrails to support innovation. Automated Reasoning is a branch of AI focused on algorithmic search for mathematical proofs. Automated Reasoning checks in Amazon Bedrock Guardrails, which encode knowledge into formal logic to validate whether large language model (LLM) outputs are possible, are generally available as of August 6, 2025.
This new guardrail policy maintains accuracy within defined parameters, unlike traditional probabilistic reasoning methods. The system evaluates AI-generated content against rules derived from policy documents, including company guidelines and operational standards. Automated Reasoning checks produce findings that provide insights into whether the AI-generated content aligns with the rules extracted from the policy, highlights ambiguity that exists in the content, and provides suggestions on how to remove assumptions.
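As a rough illustration of how such checks are invoked at runtime, the following sketch validates a model response with the Amazon Bedrock ApplyGuardrail API. It assumes a guardrail with an Automated Reasoning policy already exists; the identifiers are placeholders, and the exact shape of the findings in the response may differ.

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

result = bedrock_runtime.apply_guardrail(
    guardrailIdentifier="<your-guardrail-id>",  # guardrail configured with an Automated Reasoning policy
    guardrailVersion="1",
    source="OUTPUT",                            # validate model-generated content
    content=[{"text": {"text": "Employees may carry over up to 10 vacation days."}}],
)
# The assessments in the result indicate whether the content is consistent with the
# encoded policy rules, along with any suggested corrections.
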
“In a field where breakthroughs are happening at incredible speed, reasoning is one of the most important technical advances to help our joint customers succeed in generative AI,” said Matt Wood, Global CTIO at PwC, at AWS re:Invent 2024.
Industry-transforming use cases using Amazon Bedrock Automated Reasoning checks
The strategic alliance combining PwC’s proven, deep expertise and the innovative technology from AWS is set to transform how businesses approach AI-driven innovation. The following diagram illustrates PwC’s Automated Reasoning implementation. We initially focus on highly regulated industries such as pharmaceuticals, financial services, and energy.

In the following sections, we present three groundbreaking use cases developed by PwC teams.
EU AI Act compliance for financial services risk management
The European Union (EU) AI Act requires organizations to classify and verify all AI applications according to specific risk levels and governance requirements. PwC has developed a practical approach to address this challenge using Automated Reasoning checks in Amazon Bedrock Guardrails, which transforms EU AI Act compliance from a manual burden into a systematic, verifiable process. Given a description of an AI application’s use case, the solution converts risk classification criteria into defined guardrails, enabling organizations to consistently assess and monitor AI applications while supporting expert human judgment through automated compliance verification with auditable artifacts. The key benefits of using Automated Reasoning checks include:

Automated classification of AI use cases into risk categories
Verifiable logic trails for AI-generated classifications
Enhanced speed in identifying the required governance controls

The following diagram illustrates the workflow for this use case.

Pharmaceutical content review
PwC’s Regulated Content Orchestrator (RCO) is a globally scalable, multi-agent capability—powered by a core rules engine customized to company, region, product, and indication for use—that automates medical, legal, regulatory, and brand compliance. The RCO team was an early incubating collaborator of Amazon Bedrock Automated Reasoning checks, implementing it as a secondary validation layer in the marketing content generation process. This enhanced defense strengthened existing content controls, resulting in accelerated content creation and review processes while enhancing compliance standards. Key benefits of Automated Reasoning checks in Amazon Bedrock Guardrails include:

Applies automated, mathematically based safeguards for verifying RCO’s analysis
Enables transparent QA with traceable, audit-ready reasoning
Safeguards against potentially unsupported or hallucinated outputs

The following diagram illustrates the workflow for this use case.

Utility outage management for real-time decision support
Utility outage management applies Automated Reasoning checks in Amazon Bedrock Guardrails to enhance the response times and operational efficiency of utility companies. The solution generates standardized protocols from regulatory guidelines, creates procedures based on NERC and FERC requirements, and verifies AI-produced outage classifications. Through an integrated cloud-based architecture, this solution applies severity-based verification workflows to dispatch decisions—normal outages (3-hour target) assign tickets to available crews, medium severity (6-hour target) triggers expedited dispatch, and critical incidents (12-hour target) activate emergency procedures with proactive messaging.
The key benefits of using Automated Reasoning checks include:

Effective and enhanced responses to customers
Real-time operational insights with verified regulatory alignment
Accelerated decision-making with mathematical certainty

The following diagram illustrates the workflow for this use case.

Looking ahead
As the adoption of AI continues to evolve, particularly with agentic AI, the AWS and PwC alliance is focused on the following:

Expanding Automated Reasoning checks integrated solutions across more industries
Developing industry-specific agentic AI solutions with built-in compliance verification
Enhancing explainability features to provide greater transparency

Conclusion
The integration of Automated Reasoning checks in Amazon Bedrock Guardrails with PwC’s deep industry expertise offers a powerful avenue to help deploy AI-based solutions. As an important component of responsible AI, Automated Reasoning checks provides safeguards that help improve the trustworthiness of AI applications. With the expectation of mathematical certainty and verifiable trust in AI outputs, organizations can now innovate without compromising on accuracy, security, or compliance. To learn more about how Automated Reasoning checks works, refer to Minimize AI hallucinations and deliver up to 99% verification accuracy with Automated Reasoning checks: Now available and Improve accuracy by adding Automated Reasoning checks in Amazon Bedrock Guardrails.
Explore how Automated Reasoning checks in Amazon Bedrock can improve the trustworthiness of your generative AI applications. To learn more about using this capability or to discuss custom solutions for your specific needs, contact your AWS account team or an AWS Solutions Architect. Contact the PwC team to learn how you can use the combined power of AWS and PwC to drive innovation in your industry.

About the authors
Nafi Diallo is a Senior Automated Reasoning Architect at Amazon Web Services, where she advances innovations in AI safety and Automated Reasoning systems for generative AI applications. Her expertise is in formal verification methods, AI guardrails implementation, and helping global customers build trustworthy and compliant AI solutions at scale. She holds a PhD in Computer Science with research in automated program repair and formal verification, and an MS in Financial Mathematics from WPI.
Adewale Akinfaderin is a Sr. Data Scientist–Generative AI, Amazon Bedrock, where he contributes to cutting edge innovations in foundational models and generative AI applications at AWS. His expertise is in reproducible and end-to-end AI/ML methods, practical implementations, and helping global customers formulate and develop scalable solutions to interdisciplinary problems. He has two graduate degrees in physics and a doctorate in engineering.
Bharathi Srinivasan is a Generative AI Data Scientist at the AWS Worldwide Specialist Organization. She works on developing solutions for Responsible AI, focusing on algorithmic fairness, veracity of large language models, and explainability. Bharathi guides internal teams and AWS customers on their responsible AI journey. She has presented her work at various learning conferences.
Dan Spillane, a Principal at Amazon Web Services (AWS), leads global strategic initiatives in the Consulting Center of Excellence (CCOE). He works with customers and partners to solve their critical business challenges using innovative technologies. Dan specializes in generative AI and responsible AI, including automated reasoning. He applies these tools to deliver measurable business value at scale. As a lifelong learner, Dan actively studies global cultures and business mechanisms, which enhances his ability to mentor others and drive cross-cultural initiatives.
Aartika Sardana Chandras is a Senior Product Marketing Manager for AWS Generative AI solutions, with a focus on Amazon Bedrock. She brings over 15 years of experience in product marketing, and is dedicated to empowering customers to navigate the complexities of the AI lifecycle. Aartika is passionate about helping customers leverage powerful AI technologies in an ethical and impactful manner.
Rama Lankalapalli is a Senior Partner Solutions Architect (PSA) at AWS where he leads a global team of PSAs supporting PwC, a major global systems integrator. Working closely with PwC’s global practice he champions enterprise cloud adoption by leveraging the breadth and depth of AWS services across migrations, modernization, security, AI/ML, and analytics. Rama architects scalable solutions that help organizations accelerate their digital transformation while delivering measurable business outcomes. His leadership combines deep technical expertise with strategic insight to drive customer success through innovative, industry-specific cloud solutions.
Scott Likens serves as the Chief AI Engineer over Global and US teams at PwC and leads the AI Engineering and Emerging Technology R&D teams domestically, driving the firm’s strategy around AI, Blockchain, VR, Quantum Computing, and other disruptors. With over 30 years of emerging technology expertise, he has helped clients transform customer experience, digital strategy, and operations for various industries.
Ambuj Gupta is a Director in PwC’s AI and Digital Contacts & Service practice, based in Chicago. With over 15 years of experience, Ambuj brings deep expertise in Artificial Intelligence, Agentic and Generative AI, Digital Contact Solutions, and Cloud Innovation across a broad spectrum of platforms and industries. He is recognized for driving strategic transformation through Cloud Native AI Automation and emerging technologies—including GenAI-powered agents, Intelligent Agent Assists, and Customer Data Platforms—to enhance channel performance and employee effectiveness.
Adam Hood is a Partner and AWS Data and AI Leader at PwC US. As a strategic and results-oriented technology leader, Adam specializes in driving enterprise-wide transformation and unlocking business value through the strategic application of digital systems, data, and GenAI/AI/ML including building agentic workflows. With a track record of success in industry and consulting, he has guided organizations through complex digital, finance, and ERP modernizations, from initial strategy and business case development to seamless execution and global rollout.
Chantal Hudson is a Manager in PwC UK’s AI and Modelling team. She has been with PwC for just over five years, starting her career in the South African firm. Chantal works primarily with large banks on credit risk modelling, and is particularly interested in applying AI to advance modelling practices.
Priyanka Mukhopadhyay is a Manager in PwC’s Cloud and Digital Engineering practice. She is an AWS Certified Solution Architect – Associate with over 13 years of experience in Data Engineering. Over the past decade, she has honed her expertise in AWS services and has more than 12 years of experience in developing and delivering robust projects following Agile Methodologies.
Deniz Konak Ozturk is a Senior Manager within PwC’s AI & Modelling team. She has around 15 years of experience in AI/Gen AI and traditional model development, implementation, and validation across UK and EU/non-EU territories, compliance assessment against EU regulations, and IFRS9 audits. Over the past 6 years, her focus has been primarily on AI/Gen AI, highlighted by her involvement in AI Validation framework development, implementation of this framework at different clients, product management for an automated ML platform, and leading research and product ownership in an R&D initiative on Alternative Data Usage for ML based Risk Models targeting the financially underserved segment.
Kevin Paul is a Director within the AI Engineering group at PwC. He specializes in Applied AI, and has extensive experience across the AI lifecycle, building and maintaining solutions across industries.

How Amazon scaled Rufus by building multi-node inference using AWS Tra …

At Amazon, our team builds Rufus, a generative AI-powered shopping assistant that serves millions of customers at immense scale. However, deploying Rufus at scale introduces significant challenges that must be carefully navigated. Rufus is powered by a custom-built large language model (LLM). As the model’s complexity increased, we prioritized developing scalable multi-node inference capabilities that maintain high-quality interactions while delivering low latency and cost-efficiency.
In this post, we share how we developed a multi-node inference solution using Amazon Trainium and vLLM, an open source library designed for efficient and high-throughput serving of LLMs. We also discuss how we built a management layer on top of Amazon Elastic Container Service (Amazon ECS) to host models across multiple nodes, facilitating robust, reliable, and scalable deployments.
Challenges with multi-node inference
As our Rufus model grew bigger in size, we needed multiple accelerator instances because no single chip or instance had enough memory for the entire model. We first needed to engineer our model to be split across multiple accelerators. Techniques such as tensor parallelism can be used to accomplish this, which can also impact various metrics such as time to first token. At larger scale, the accelerators on a node might not be enough and require you to use multiple hosts or nodes. At that point, you must also address managing your nodes as well as how your model is sharded across them (and their respective accelerators). We needed to address two major areas:

Model performance – Maximize compute and memory resources utilization across multiple nodes to serve models at high throughput, without sacrificing low latency. This includes designing effective parallelism strategies and model weight-sharding approaches to partition computation and memory footprint both within the same node and across multiple nodes, and an efficient batching mechanism that maximizes hardware resource utilization under dynamic request patterns.
Multi-node inference infrastructure – Design a containerized, multi-node inference abstraction that represents a single model running across multiple nodes. This abstraction and underlying infrastructure needs to support fast inter-node communication, maintain consistency across distributed components, and allow for deployment and scaling as a single, deployable unit. In addition, it must support continuous integration to allow rapid iteration and safe, reliable rollouts in production environments.

Solution overview
Taking these requirements into account, we built a multi-node inference solution designed to overcome the scalability, performance, and reliability challenges inherent in serving LLMs at production scale using tens of thousands of Trn1 instances.
To create a multi-node inference infrastructure, we implemented a leader/follower multi-node inference architecture in vLLM. In this configuration, the leader node uses vLLM for request scheduling, batching, and orchestration, and follower nodes execute distributed model computations. Both leader and follower nodes share the same NeuronWorker implementation in vLLM, providing a consistent model execution path through seamless integration with the AWS Neuron SDK.
To address how we split the model across multiple instances and accelerators, we used hybrid parallelism strategies supported in the Neuron SDK. Hybrid parallelism strategies such as tensor parallelism and data parallelism are selectively applied to maximize cross-node compute and memory bandwidth utilization, significantly improving overall throughput.
Being aware of how the nodes are connected is also important to avoid latency penalties. We took advantage of network topology-aware node placement. Optimized placement facilitates low-latency, high-bandwidth cross-node communication using Elastic Fabric Adapter (EFA), minimizing communication overhead and improving collective operation efficiency.
Lastly, to manage models across multiple nodes, we built a multi-node inference unit abstraction layer on Amazon ECS. This abstraction layer supports deploying and scaling multiple nodes as a single, cohesive unit, providing robust and reliable large-scale production deployments.
By combining a leader/follower orchestration model, hybrid parallelism strategies, and a multi-node inference unit abstraction layer built on top of Amazon ECS, this architecture deploys a single model replica to run seamlessly across multiple nodes, supporting large production deployments. In the following sections, we discuss the architecture and key components of the solution in more detail.
Inference engine design
We built an architecture on Amazon ECS using Trn1 instances that supports scaling inference beyond a single node to fully use distributed hardware resources, while maintaining seamless integration with NVIDIA Triton Inference Server, vLLM, and the Neuron SDK.
Although the following diagram illustrates a two-node configuration (leader and follower) for simplicity, the architecture is designed to be extended to support additional follower nodes as needed.

In this architecture, the leader node runs the Triton Inference Server and vLLM engine, serving as the primary orchestration unit for inference. By integrating with vLLM, we can use continuous batching—a technique used in LLM inference to improve throughput and accelerator utilization by dynamically scheduling and processing inference requests at the token level. The vLLM scheduler handles batching based on the global batch size. It operates in a single-node context and is not aware of multi-node model execution. After the requests are scheduled, they’re handed off to the NeuronWorker component in vLLM, which handles broadcasting model inputs and executing the model through integration with the Neuron SDK.
The follower node operates as an independent process and acts as a wrapper around the vLLM NeuronWorker component. It continuously listens to model inputs broadcasted from the leader node and executes the model using the Neuron runtime in parallel with the leader node.
For nodes to communicate with each other with the proper information, two mechanisms are required:

Cross-node model inputs broadcasting on CPU – Model inputs are broadcasted from the leader node to follower nodes using the torch.distributed communication library with the Gloo backend. A distributed process group is initialized during NeuronWorker initialization on both the leader and follower nodes. This broadcast occurs on CPU over standard TCP connections, allowing follower nodes to receive the full set of model inputs required for model execution.
Cross-node collectives communication on Trainium chips – During model execution, cross-node collectives (such as all gather or all reduce) are managed by the Neuron Distributed Inference (NxDI) library, which uses EFA to deliver high-bandwidth, low-latency inter-node communication.
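
The following is a simplified illustration of the CPU-side broadcast, not the actual Rufus implementation: rank 0 (the leader) broadcasts the scheduled model inputs to follower ranks over the Gloo backend.

import torch.distributed as dist

# Called once during worker initialization on both leader and follower nodes;
# assumes MASTER_ADDR/MASTER_PORT, RANK, and WORLD_SIZE are set in the environment.
dist.init_process_group(backend="gloo")

def broadcast_model_inputs(model_inputs=None):
    # The leader (rank 0) supplies model_inputs; followers pass None and receive a copy.
    obj = [model_inputs]
    dist.broadcast_object_list(obj, src=0)
    return obj[0]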

Model parallelism strategies
We adopted hybrid model parallelism strategies through integration with the Neuron SDK to maximize cross-node memory bandwidth utilization (MBU) and model FLOPs utilization (MFU), while also reducing memory pressure on each individual node. For example, during the context encoding (prefill) phase, we use context parallelism by splitting inputs along the sequence dimension, facilitating parallel computation of attention layers across nodes. In the decoding phase, we adopt data parallelism by partitioning the input along the batch dimension, so each node can serve a subset of batch requests independently.
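A toy, framework-agnostic illustration of the two sharding axes (shapes only, not Neuron code); the tensor sizes are arbitrary:

import torch

batch, seq_len, hidden = 8, 4096, 2048
x = torch.randn(batch, seq_len, hidden)

# Prefill / context parallelism: split along the sequence dimension across 2 nodes
prefill_shards = torch.chunk(x, chunks=2, dim=1)   # each shard: [8, 2048, 2048]

# Decode / data parallelism: split along the batch dimension across 2 nodes
decode_shards = torch.chunk(x, chunks=2, dim=0)    # each shard: [4, 4096, 2048]
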
Multi-node inference infrastructure
We also designed a distributed LLM inference abstraction: the multi-node inference unit, as illustrated in the following diagram. This abstraction serves as a unit of deployment for inference service, supporting consistent and reliable rolling deployments on a cell-by-cell basis across the production fleet. This is important so you only have a minimal number of nodes offline during upgrades without impacting your entire service. Both the leader and follower nodes described earlier are fully containerized, so each node can be independently managed and updated while maintaining a consistent execution environment across the entire fleet. This consistency is critical for reliability, because the leader and follower nodes must run with identical software stacks—including Neuron SDKs, Neuron drivers, EFA software, and other runtime dependencies—to achieve correct and reliable multi-node inference execution. The inference containers are deployed on Amazon ECS.

A crucial aspect of achieving high-performance distributed LLM inference is minimizing the latency of cross-node collective operations, which rely on Remote Direct Memory Access (RDMA). To enable this, optimized node placement is essential: the deployment management system must compose a cell by pairing nodes based on their physical location and proximity. With this optimized placement, cross-node operations can utilize the high-bandwidth, low-latency EFA network available to instances. The deployment management system gathers this information using the Amazon EC2 DescribeInstanceTopology API to pair nodes based on their underlying network topology.
To maintain high availability for customers (making sure Rufus is always online and ready to answer a question), we developed a proxy layer positioned between the system’s ingress or load-balancing layer and the multi-node inference unit. This proxy layer is responsible for continuously probing and reporting the health of all worker nodes. Rapidly detecting unhealthy nodes in a distributed inference environment is critical for maintaining availability because it makes sure the system can immediately route traffic away from unhealthy nodes and trigger automated recovery processes to restore service stability.
The proxy also monitors real-time load on each multi-node inference unit and reports it to the ingress layer, supporting fine-grained, system-wide load visibility. This helps the load balancer make optimized routing decisions that maximize per-cell performance and overall system efficiency.
Conclusion
As Rufus continues to evolve and become more capable, we must continue to build systems to host our model. Using this multi-node inference solution, we successfully launched a much larger model across tens of thousands of AWS Trainium chips to Rufus customers, supporting Prime Day traffic. This increased model capacity has enabled new shopping experiences and significantly improved user engagement. This achievement marks a major milestone in pushing the boundaries of large-scale AI infrastructure for Amazon, delivering a highly available, high-throughput, multi-node LLM inference solution at industry scale.
AWS Trainium, in combination with solutions such as NVIDIA Triton and vLLM, can help you run large inference workloads at scale with strong price performance. We encourage you to try these solutions to host large models for your workloads.

About the authors
James Park is a ML Specialist Solutions Architect at Amazon Web Services. He works with Amazon.com to design, build, and deploy technology solutions on AWS, and has a particular interest in AI and machine learning. In his spare time he enjoys seeking out new cultures, new experiences, and staying up to date with the latest technology trends.
Faqin Zhong is a Software Engineer at Amazon Stores Foundational AI, working on LLM inference infrastructure and optimizations. Passionate about generative AI technology, Faqin collaborates with leading teams to drive innovation, making LLMs more accessible and impactful, and ultimately enhancing customer experiences across diverse applications. Outside of work, she enjoys cardio exercise and baking with her son.
Charlie Taylor is a Senior Software Engineer within Amazon Stores Foundational AI, focusing on developing distributed systems for high performance LLM inference. He builds inference systems and infrastructure to help larger, more capable models respond to customers faster. Outside of work, he enjoys reading and surfing.
Yang Zhou is a Software Engineer working on building and optimizing machine learning systems. His recent focus is enhancing the performance and cost-efficiency of generative AI inference. Beyond work, he enjoys traveling and has recently discovered a passion for running long distances.
Nicolas Trown is a Principal Engineer in Amazon Stores Foundational AI. His recent focus is lending his systems expertise across Rufus to aid the Rufus Inference team and efficient utilization across the Rufus experience. Outside of work, he enjoys spending time with his wife and taking day trips to the nearby coast, Napa, and Sonoma areas.
Michael Frankovich is a Principal Software Engineer at Amazon Core Search, where he supports the ongoing development of their cellular deployment management system used to host Rufus, among other search applications. Outside of work, he enjoys playing board games and raising chickens.
Adam (Hongshen) Zhao is a Software Development Manager at Amazon Stores Foundational AI. In his current role, Adam is leading the Rufus Inference team to build generative AI inference optimization solutions and inference system at scale for fast inference at low cost. Outside of work, he enjoys traveling with his wife and creating art.
Bing Yin is a Director of Science at Amazon Stores Foundational AI. He leads the effort to build LLMs that are specialized for shopping use cases and optimized for inference at Amazon scale. Outside of work, he enjoys running marathon races.
Parthasarathy Govindarajen is Director of Software Development at Amazon Stores Foundational AI. He leads teams that develop advanced infrastructure for large language models, focusing on both training and inference at scale. Outside of work, he spends his time playing cricket and exploring new places with his family.

NVIDIA AI Releases ProRLv2: Advancing Reasoning in Language Models wit …

What Is ProRLv2?

ProRLv2 is the latest version of NVIDIA’s Prolonged Reinforcement Learning (ProRL), designed specifically to push the boundaries of reasoning in large language models (LLMs). By scaling reinforcement learning (RL) steps from 2,000 up to 3,000, ProRLv2 systematically tests how extended RL can unlock new solution spaces, creativity, and high-level reasoning that were previously inaccessible—even with smaller models like the 1.5B-parameter Nemotron-Research-Reasoning-Qwen-1.5B-v2.

Key Innovations in ProRLv2

ProRLv2 incorporates several innovations to overcome common RL limitations in LLM training:

REINFORCE++-Baseline: A robust RL algorithm that enables long-horizon optimization over thousands of steps, handling the instability typical in RL for LLMs.

KL Divergence Regularization & Reference Policy Reset: Periodically refreshes the reference model with the current best checkpoint, allowing stable progress and continued exploration by preventing the RL objective from dominating too early.

Decoupled Clipping & Dynamic Sampling (DAPO): Encourages diverse solution discovery by boosting unlikely tokens and focusing learning signals on prompts of intermediate difficulty (a minimal code sketch of this objective appears after this list).

Scheduled Length Penalty: Cyclically applied, helping maintain diversity and prevent entropy collapse as training lengthens.

Scaling Training Steps: ProRLv2 moves the RL training horizon from 2,000 to 3,000 steps, directly testing how much longer RL can expand reasoning abilities.
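
To make these innovations concrete, the following is a minimal, illustrative sketch of a clipped policy-gradient loss that combines asymmetric (decoupled) clipping with a KL penalty against a resettable reference policy. It is not NVIDIA's implementation; the function name, the eps_low/eps_high bounds, and the kl_coef value are assumptions chosen only for illustration.

import torch

def prorl_style_loss(logp_new, logp_old, logp_ref, advantages,
                     eps_low=0.2, eps_high=0.28, kl_coef=0.001):
    # Per-token probability ratio between the current and behavior policies.
    ratio = torch.exp(logp_new - logp_old)
    # Decoupled (asymmetric) clipping: a wider upper bound lets low-probability
    # tokens gain mass, which encourages more diverse solutions.
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)
    surrogate = torch.minimum(ratio * advantages, clipped * advantages)
    # KL penalty against a reference policy; in ProRL the reference is
    # periodically reset to the current best checkpoint so this term does not
    # choke off continued exploration.
    approx_kl = logp_new - logp_ref
    return -(surrogate - kl_coef * approx_kl).mean()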

How ProRLv2 Expands LLM Reasoning

Nemotron-Research-Reasoning-Qwen-1.5B-v2, trained with ProRLv2 for 3,000 RL steps, sets a new standard for open-weight 1.5B models on reasoning tasks, including math, code, science, and logic puzzles:

Performance surpasses previous versions and competitors like DeepSeek-R1-1.5B.

Sustained gains with more RL steps: Longer training leads to continual improvements, especially on tasks where base models perform poorly, demonstrating genuine expansion in reasoning boundaries.

Generalization: Not only does ProRLv2 boost pass@1 accuracy, but it also enables novel reasoning and solution strategies on tasks not seen during training.

Benchmarks: Gains include average pass@1 improvements of 14.7% in math, 13.9% in coding, 54.8% in logic puzzles, 25.1% in STEM reasoning, and 18.1% in instruction-following tasks, with further improvements in v2 on unseen and harder benchmarks.

Why It Matters

The major finding of ProRLv2 is that continued RL training, with careful exploration and regularization, reliably expands what LLMs can learn and generalize. Rather than hitting an early plateau or overfitting, prolonged RL allows smaller models to rival much larger ones in reasoning—demonstrating that scaling RL itself is as important as model or dataset size.

Using Nemotron-Research-Reasoning-Qwen-1.5B-v2

The latest checkpoint is available for testing on Hugging Face. Loading the model:

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("nvidia/Nemotron-Research-Reasoning-Qwen-1.5B")
model = AutoModelForCausalLM.from_pretrained("nvidia/Nemotron-Research-Reasoning-Qwen-1.5B")
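
A short generation call can follow the loading step; the prompt, sampling settings, and max_new_tokens value below are illustrative choices rather than recommended defaults from NVIDIA.

prompt = "Solve step by step: if 3x + 7 = 22, what is x?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True,
                         temperature=0.6, top_p=0.95)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))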

Conclusion

ProRLv2 redefines the limits of reasoning in language models by showing that RL scaling laws matter as much as size or data. Through advanced regularization and smart training schedules, it enables deep, creative, and generalizable reasoning even in compact architectures. The future lies in how far RL can push—not just how big models can get.

Check out the Unofficial Blog and Model on Hugging Face.

The post NVIDIA AI Releases ProRLv2: Advancing Reasoning in Language Models with Extended Reinforcement Learning RL appeared first on MarkTechPost.

Meet LEANN: The Tiniest Vector Database that Democratizes Personal AI with Storage-Efficient Approximate Nearest Neighbor (ANN) Search Index

Embedding-based search outperforms traditional keyword-based methods across various domains by capturing semantic similarity using dense vector representations and approximate nearest neighbor (ANN) search. However, the ANN data structure brings excessive storage overhead, often 1.5 to 7 times the size of the original raw data. This overhead is manageable in large-scale web applications but becomes impractical for personal devices or large datasets. Reducing storage to under 5% of the original data size is critical for edge deployment, but existing solutions fall short. Techniques like product quantization (PQ) can reduce storage, but they either degrade accuracy or increase search latency.

Vector search methods depend on IVF and proximity graphs. Graph-based approaches like HNSW, NSG, and Vamana are considered state-of-the-art due to their balance of accuracy and efficiency. Efforts to reduce graph size, such as learned neighbor selection, face limitations due to high training costs and dependency on labeled data. For resource-constrained environments, DiskANN and Starling store data on disk, while FusionANNS optimizes hardware usage. Methods like AiSAQ and EdgeRAG attempt to minimize memory usage but still suffer from high storage overhead or performance degradation at scale. Embedding compression techniques like PQ and RabitQ provide quantization with theoretical error bounds, but struggle to maintain accuracy under tight budgets.

Researchers from UC Berkeley, CUHK, Amazon Web Services, and UC Davis have developed LEANN, a storage-efficient ANN search index optimized for resource-limited personal devices. It integrates a compact graph-based structure with an on-the-fly recomputation strategy, enabling fast and accurate retrieval while minimizing storage overhead. LEANN achieves up to 50 times smaller storage than standard indexes by reducing the index size to under 5% of the original raw data. It maintains 90% top-3 recall in under 2 seconds on real-world question-answering benchmarks. To reduce latency, LEANN utilizes a two-level traversal algorithm and dynamic batching that combines embedding computations across search hops, enhancing GPU utilization.

LEANN’s architecture centers on graph-based recomputation, supported by two main techniques and a straightforward system workflow. Built on the HNSW framework, it observes that each query needs embeddings for only a limited subset of nodes, prompting on-demand computation instead of pre-storing all embeddings. To address earlier challenges, LEANN introduces two techniques: (a) a two-level graph traversal with dynamic batching to lower recomputation latency, and (b) a high-degree-preserving graph pruning method to reduce metadata storage. In the system workflow, LEANN begins by computing embeddings for all dataset items and then constructs a vector index using an off-the-shelf graph-based indexing approach.
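
To make the recomputation idea concrete, the following is a heavily simplified sketch of best-first search over a proximity graph in which node embeddings are computed on demand and batched per hop instead of being stored. It illustrates the general technique rather than LEANN's actual code; the graph layout, the embed_batch callable, and the beam size are assumptions.

import heapq
import numpy as np

def lazy_graph_search(graph, texts, query_vec, embed_batch, entry, k=3, beam=32):
    # graph: {node_id: [neighbor_ids]}; texts: {node_id: raw text};
    # embed_batch: callable(list[str]) -> embeddings, recomputed per query.
    cache = {}

    def distances(nodes):
        missing = [n for n in nodes if n not in cache]
        if missing:  # dynamic batching: one embedding call per hop, not per node
            for n, v in zip(missing, embed_batch([texts[n] for n in missing])):
                cache[n] = np.asarray(v, dtype=np.float32)
        return {n: float(np.linalg.norm(cache[n] - query_vec)) for n in nodes}

    visited = {entry}
    d0 = distances([entry])[entry]
    frontier = [(d0, entry)]   # min-heap of candidates still to expand
    results = [(d0, entry)]    # best candidates found so far, bounded by beam
    while frontier:
        d, node = heapq.heappop(frontier)
        if len(results) >= beam and d > max(r[0] for r in results):
            break              # the frontier can no longer improve the result set
        neighbors = [n for n in graph.get(node, []) if n not in visited]
        visited.update(neighbors)
        for n, nd in distances(neighbors).items():
            heapq.heappush(frontier, (nd, n))
            results.append((nd, n))
        results = heapq.nsmallest(beam, results)
    return [n for _, n in heapq.nsmallest(k, results)]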

In terms of storage and latency, LEANN outperforms EdgeRAG, an IVF-based recomputation method, achieving latency reductions ranging from 21.17 to 200.60 times across various datasets and hardware platforms. This advantage stems from LEANN’s polylogarithmic recomputation complexity, which scales more efficiently than EdgeRAG’s √N growth. In terms of accuracy for downstream RAG tasks, LEANN achieves higher performance across most datasets, except GPQA, where a distributional mismatch limits its effectiveness. Similarly, on HotpotQA, the single-hop retrieval setup limits accuracy gains, as the dataset demands multi-hop reasoning. Despite these limitations, LEANN shows strong performance across diverse benchmarks.

In this paper, researchers introduced LEANN, a storage-efficient neural retrieval system that combines graph-based recomputation with innovative optimizations. By integrating a two-level search algorithm and dynamic batching, it eliminates the need to store full embeddings, achieving significant reductions in storage overhead while maintaining high accuracy. Despite its strengths, LEANN faces limitations, such as high peak storage usage during index construction, which could be addressed through pre-clustering or other techniques. Future work may focus on reducing latency and enhancing responsiveness, opening the path for broader adoption in resource-constrained environments.

Check out the Paper and GitHub Page.

The post Meet LEANN: The Tiniest Vector Database that Democratizes Personal AI with Storage-Efficient Approximate Nearest Neighbor (ANN) Search Index appeared first on MarkTechPost.

Zhipu AI Releases GLM-4.5V: Versatile Multimodal Reasoning with Scalable Reinforcement Learning

Zhipu AI has officially released and open-sourced GLM-4.5V, a next-generation vision-language model (VLM) that significantly advances the state of open multimodal AI. Based on Zhipu’s 106-billion parameter GLM-4.5-Air architecture—with 12 billion active parameters via a Mixture-of-Experts (MoE) design—GLM-4.5V delivers strong real-world performance and unmatched versatility across visual and textual content.

Key Features and Design Innovations

1. Comprehensive Visual Reasoning

Image Reasoning: GLM-4.5V achieves advanced scene understanding, multi-image analysis, and spatial recognition. It can interpret detailed relationships in complex scenes (such as distinguishing product defects, analyzing geographical clues, or inferring context from multiple images simultaneously).

Video Understanding: It processes long videos, performing automatic segmentation and recognizing nuanced events thanks to a 3D convolutional vision encoder. This enables applications like storyboarding, sports analytics, surveillance review, and lecture summarization.

Spatial Reasoning: Integrated 3D Rotational Positional Encoding (3D-RoPE) gives the model a robust perception of three-dimensional spatial relationships, crucial for interpreting visual scenes and grounding visual elements.

2. Advanced GUI and Agent Tasks

Screen Reading & Icon Recognition: The model excels at reading desktop/app interfaces, localizing buttons and icons, and assisting with automation—essential for RPA (robotic process automation) and accessibility tools.

Desktop Operation Assistance: Through detailed visual understanding, GLM-4.5V can plan and describe GUI operations, assisting users in navigating software or performing complex workflows.

3. Complex Chart and Document Parsing

Chart Understanding: GLM-4.5V can analyze charts, infographics, and scientific diagrams within PDFs or PowerPoint files, extracting summarized conclusions and structured data even from dense, long documents.

Long Document Interpretation: With support for up to 64,000 tokens of multimodal context, it can parse and summarize extended, image-rich documents (such as research papers, contracts, or compliance reports), making it ideal for business intelligence and knowledge extraction.

4. Grounding and Visual Localization

Precise Grounding: The model can accurately localize and describe visual elements—such as objects, bounding boxes, or specific UI elements—using world knowledge and semantic context, not just pixel-level cues. This enables detailed analysis for quality control, AR applications, and image annotation workflows.

Architectural Highlights

Hybrid Vision-Language Pipeline: The system integrates a powerful visual encoder, MLP adapter, and a language decoder, allowing seamless fusion of visual and textual information. Static images, videos, GUIs, charts, and documents are all treated as first-class inputs.

Mixture-of-Experts (MoE) Efficiency: While housing 106B total parameters, the MoE design activates only 12B per inference, ensuring high throughput and affordable deployment without sacrificing accuracy.

3D Convolution for Video & Images: Video inputs are processed using temporal downsampling and 3D convolution, enabling the analysis of high-resolution videos and native aspect ratios, while maintaining efficiency.

Adaptive Context Length: Supports up to 64K tokens, allowing robust handling of multi-image prompts, concatenated documents, and lengthy dialogues in one pass.

Innovative Pretraining and RL: The training regime combines massive multimodal pretraining, supervised fine-tuning, and Reinforcement Learning with Curriculum Sampling (RLCS) for long-chain reasoning mastery and real-world task robustness.

“Thinking Mode” for Tunable Reasoning Depth

A prominent feature is the “Thinking Mode” toggle:

Thinking Mode ON: Prioritizes deep, step-by-step reasoning, suitable for complex tasks (e.g., logical deduction, multi-step chart or document analysis).

Thinking Mode OFF: Delivers faster, direct answers for routine lookups or simple Q&A. The user can control the model’s reasoning depth at inference, balancing speed against interpretability and rigor.

Benchmark Performance and Real-World Impact

State-of-the-Art Results: GLM-4.5V achieves SOTA across 41–42 public multimodal benchmarks, including MMBench, AI2D, MMStar, MathVista, and more, outperforming both open and some premium proprietary models in categories like STEM QA, chart understanding, GUI operation, and video comprehension.

Practical Deployments: Businesses and researchers report transformative results in defect detection, automated report analysis, digital assistant creation, and accessibility technology with GLM-4.5V.

Democratizing Multimodal AI: Open-sourced under the MIT license, the model equalizes access to cutting-edge multimodal reasoning that was previously gated by exclusive proprietary APIs.

Example Use Cases

Feature | Example Use | Description
Image Reasoning | Defect detection, content moderation | Scene understanding, multiple-image summarization
Video Analysis | Surveillance, content creation | Long video segmentation, event recognition
GUI Tasks | Accessibility, automation, QA | Screen/UI reading, icon location, operation suggestion
Chart Parsing | Finance, research reports | Visual analytics, data extraction from complex charts
Document Parsing | Law, insurance, science | Analyze and summarize long illustrated documents
Grounding | AR, retail, robotics | Target object localization, spatial referencing

Summary

GLM-4.5V by Zhipu AI is a flagship open-source vision-language model setting new performance and usability standards for multimodal reasoning. With its powerful architecture, context length, real-time “thinking mode”, and broad capability spectrum, GLM-4.5V is redefining what’s possible for enterprises, researchers, and developers working at the intersection of vision and language.

Check out the Paper, Model on Hugging Face, and GitHub Page.

The post Zhipu AI Releases GLM-4.5V: Versatile Multimodal Reasoning with Scalable Reinforcement Learning appeared first on MarkTechPost.

Train and deploy AI models at trillion-parameter scale with Amazon Sag …

Imagine harnessing the power of 72 cutting-edge NVIDIA Blackwell GPUs in a single system for the next wave of AI innovation, unlocking 360 petaflops of dense 8-bit floating point (FP8) compute and 1.4 exaflops of sparse 4-bit floating point (FP4) compute. Today, that’s exactly what Amazon SageMaker HyperPod delivers with the launch of support for P6e-GB200 UltraServers. Accelerated by NVIDIA GB200 NVL72, P6e-GB200 UltraServers provide industry-leading GPU performance, network throughput, and memory for developing and deploying trillion-parameter AI models at scale. By seamlessly integrating these UltraServers with the distributed training environment of SageMaker HyperPod, organizations can rapidly scale model development, reduce downtime, and simplify the transition from training to large-scale deployment. With the automated, resilient, and highly scalable machine learning infrastructure of SageMaker HyperPod, organizations can seamlessly distribute massive AI workloads across thousands of accelerators and manage model development end-to-end with unprecedented efficiency. Using SageMaker HyperPod with P6e-GB200 UltraServers marks a pivotal shift towards faster, more resilient, and cost-effective training and deployment for state-of-the-art generative AI models.
In this post, we review the technical specifications of P6e-GB200 UltraServers, discuss their performance benefits, and highlight key use cases. We then walk through how to purchase UltraServer capacity through flexible training plans and get started using UltraServers with SageMaker HyperPod.
Inside the UltraServer
P6e-GB200 UltraServers are accelerated by NVIDIA GB200 NVL72, connecting 36 NVIDIA Grace CPUs and 72 Blackwell GPUs in the same NVIDIA NVLink domain. Each ml.p6e-gb200.36xlarge compute node within an UltraServer includes two NVIDIA GB200 Grace Blackwell Superchips, each connecting two high-performance NVIDIA Blackwell GPUs and an Arm-based NVIDIA Grace CPU with the NVIDIA NVLink chip-to-chip (C2C) interconnect. SageMaker HyperPod is launching P6e-GB200 UltraServers in two sizes. The ml.u-p6e-gb200x36 UltraServer includes a rack of 9 compute nodes fully connected with NVSwitch (NVS), providing a total of 36 Blackwell GPUs in the same NVLink domain, and the ml.u-p6e-gb200x72 UltraServer includes a rack-pair of 18 compute nodes with a total of 72 Blackwell GPUs in the same NVLink domain. The following diagram illustrates this configuration.

Performance benefits of UltraServers
In this section, we discuss some of the performance benefits of UltraServers.
GPU and compute power
With P6e-GB200 UltraServers, you can access up to 72 NVIDIA Blackwell GPUs within a single NVLink domain, with a total of 360 petaflops of FP8 compute (without sparsity), 1.4 exaflops of FP4 compute (with sparsity), and 13.4 TB of high-bandwidth memory (HBM3e). Each Grace Blackwell Superchip pairs two Blackwell GPUs with one Grace CPU through the NVLink-C2C interconnect, delivering 10 petaflops of dense FP8 compute, 40 petaflops of sparse FP4 compute, up to 372 GB of HBM3e, and 850 GB of cache-coherent fast memory per module. This co-location boosts bandwidth between GPU and CPU by an order of magnitude compared to previous-generation instances. Each NVIDIA Blackwell GPU features a second-generation Transformer Engine and supports the latest AI precision microscaling (MX) data formats such as MXFP6 and MXFP4, as well as NVIDIA NVFP4. When combined with frameworks like NVIDIA Dynamo, NVIDIA TensorRT-LLM, and NVIDIA NeMo, these Transformer Engines significantly accelerate inference and training for large language models (LLMs) and Mixture-of-Experts (MoE) models, supporting higher efficiency and performance for modern AI workloads.
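These headline figures follow directly from the per-superchip numbers quoted above; the short Python check below simply scales them across the 36 Grace Blackwell Superchips in a 72-GPU NVLink domain (arithmetic only, no AWS APIs involved).
superchips = 36                         # 72 Blackwell GPUs, 2 per Grace Blackwell Superchip
dense_fp8_petaflops = superchips * 10   # 360 petaflops of dense FP8
sparse_fp4_petaflops = superchips * 40  # 1,440 petaflops, roughly 1.4 exaflops of sparse FP4
hbm3e_tb = superchips * 372 / 1000      # about 13.4 TB of HBM3e
print(dense_fp8_petaflops, sparse_fp4_petaflops, hbm3e_tb)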
High-performance networking
P6e-GB200 UltraServers deliver up to 130 TBps of low-latency NVLink bandwidth between GPUs for efficient large-scale AI workload communication. At double the bandwidth of its predecessor, the fifth-generation NVIDIA NVLink provides up to 1.8 TBps of bidirectional, direct GPU-to-GPU interconnect, greatly enhancing intra-server communication. Each compute node within an UltraServer can be configured with up to 17 physical network interface cards (NICs), each supporting up to 400 Gbps of bandwidth. P6e-GB200 UltraServers provide up to 28.8 Tbps of total Elastic Fabric Adapter (EFA) v4 networking, using the Scalable Reliable Datagram (SRD) protocol to intelligently route network traffic across multiple paths, providing smooth operation even during congestion or hardware failures. For more information, refer to EFA configuration for P6e-GB200 instances.
Storage and data throughput
P6e-GB200 UltraServers support up to 405 TB of local NVMe SSD storage, ideal for large-scale datasets and fast checkpointing during AI model training. For high-performance shared storage, Amazon FSx for Lustre file systems can be accessed over EFA with GPUDirect Storage (GDS), providing direct data transfer between the file system and the GPU memory with TBps of throughput and millions of input/output operations per second (IOPS) for demanding AI training and inference.
Topology-aware scheduling
Amazon Elastic Compute Cloud (Amazon EC2) provides topology information that describes the physical and network relationships between instances in your cluster. For UltraServer compute nodes, Amazon EC2 exposes which instances belong to the same UltraServer, so your training and inference algorithms can understand NVLink connectivity patterns. This topology information helps optimize distributed training by allowing frameworks like the NVIDIA Collective Communications Library (NCCL) to make intelligent decisions about communication patterns and data placement. For more information, see How Amazon EC2 instance topology works.
With Amazon Elastic Kubernetes Service (Amazon EKS) orchestration, SageMaker HyperPod automatically labels UltraServer compute nodes with their respective AWS Region, Availability Zone, Network Node Layers (1–4), and UltraServer ID. These topology labels can be used with node affinities and pod topology spread constraints to assign Pods to cluster nodes for optimal performance.
With Slurm orchestration, SageMaker HyperPod automatically enables the topology plugin and creates a topology.conf file with the respective BlockName, Nodes, and BlockSizes to match your UltraServer capacity. This way, you can group and segment your compute nodes to optimize job performance.
Use cases for UltraServers
P6e-GB200 UltraServers can efficiently train models with over a trillion parameters due to their unified NVLink domain, ultrafast memory, and high cross-node bandwidth, making them ideal for state-of-the-art AI development. The substantial interconnect bandwidth makes sure even extremely large models can be partitioned and trained in a highly parallel and efficient manner without the performance setbacks seen in disjointed multi-node systems. This results in faster iteration cycles and higher-quality AI models, helping organizations push the boundaries of state-of-the-art AI research and innovation.
For real-time trillion-parameter model inference, P6e-GB200 UltraServers enable 30 times faster inference on frontier trillion-parameter LLMs compared to prior platforms, achieving real-time performance for complex models used in generative AI, natural language understanding, and conversational agents. When paired with NVIDIA Dynamo, P6e-GB200 UltraServers deliver significant performance gains, especially for long context lengths. NVIDIA Dynamo disaggregates the compute-heavy prefill phase and the memory-heavy decode phase onto different GPUs, supporting independent optimization and resource allocation within the large 72-GPU NVLink domain. This enables more efficient management of large context windows and high-concurrency applications.
P6e-GB200 UltraServers offer substantial benefits to startup, research, and enterprise customers with multiple teams that need to run diverse distributed training and inference workloads on shared infrastructure. When used in conjunction with SageMaker HyperPod task governance, UltraServers provide exceptional scalability and resource pooling, so different teams can launch simultaneous jobs without bottlenecks. Enterprises can maximize infrastructure utilization, reduce overall costs, and accelerate project timelines, all while supporting the complex needs of teams developing and serving advanced AI models, including massive LLMs for high-concurrency real-time inference, across a single, resilient platform.
Flexible training plans for UltraServer capacity
SageMaker AI currently offers P6e-GB200 UltraServer capacity through flexible training plans in the Dallas AWS Local Zone (us-east-1-dfw-2a). UltraServers can be used for both SageMaker HyperPod and SageMaker training jobs.
To get started, navigate to the SageMaker AI training plans console, which includes a new UltraServer compute type, from which you can select your UltraServer type: ml.u-p6e-gb200x36 (containing 9 ml.p6e-gb200.36xlarge compute nodes) or ml.u-p6e-gb200x72 (containing 18 ml.p6e-gb200.36xlarge compute nodes).

After finding the training plan that fits your needs, it is recommended that you configure at least one spare ml.p6e-gb200.36xlarge compute node to make sure faulty instances can be quickly replaced with minimal disruption.

Create an UltraServer cluster with SageMaker HyperPod
After purchasing an UltraServer training plan, you can add the capacity to an ml.p6e-gb200.36xlarge type instance group within your SageMaker HyperPod cluster and specify the quantity of instances that you want to provision up to the amount available within the training plan. For example, if you purchased a training plan for one ml.u-p6e-gb200x36 UltraServer, you could provision up to 9 compute nodes, whereas if you purchased a training plan for one ml.u-p6e-gb200x72 UltraServer, you could provision up to 18 compute nodes.

By default, SageMaker will optimize the placement of instance group nodes within the same UltraServer so that GPUs across nodes are interconnected within the same NVLink domain to achieve the best data transfer performance for your jobs. For example, if you purchase two ml.u-p6e-gb200x72 UltraServers with 17 compute nodes available each (assuming you configured two spares), then create an instance group with 24 nodes, the first 17 compute nodes will be placed on UltraServer A, and the other 7 compute nodes will be placed on UltraServer B.
Conclusion
P6e-GB200 UltraServers help organizations train, fine-tune, and serve the world’s most ambitious AI models at scale. By combining extraordinary GPU resources, ultrafast networking, and industry-leading memory with the automation and scalability of SageMaker HyperPod, enterprises can accelerate the different stages of the AI lifecycle, from experimentation and distributed training through seamless inference and deployment. This powerful solution breaks new ground in performance and flexibility and reduces operational complexity and costs, so that innovators can unlock new possibilities and lead the next era of AI advancement.

About the authors
Nathan Arnold is a Senior AI/ML Specialist Solutions Architect at AWS based out of Austin, Texas. He helps AWS customers—from small startups to large enterprises—train and deploy foundation models efficiently on AWS. When he’s not working with customers, he enjoys hiking, trail running, and playing with his dogs.

How Indegene’s AI-powered social intelligence for life sciences turn …

This post is co-written with Rudra Kannemadugu and Shravan K S from Indegene Limited.
In today’s digital-first world, healthcare conversations are increasingly happening online. Yet the life sciences industry has struggled to keep pace with this shift, facing challenges in effectively analyzing and deriving insights from complex medical discussions at scale. This post will explore how Indegene is using services like Amazon Bedrock, Amazon SageMaker, and purpose-built AWS solutions for healthcare and life sciences to help pharmaceutical companies extract valuable, actionable intelligence from digital healthcare conversations.
Indegene Limited is a digital-first, life sciences commercialization company. It helps pharmaceutical, emerging biotech, and medical device companies develop products, get them to customers, and grow their impact through the healthcare lifecycle in a more effective, efficient, and modern way. Trusted by global leaders in the pharma and biotech space, Indegene brings together healthcare domain expertise, fit-for-purpose technology, and an agile operating model to provide a diverse range of solutions. They aim to deliver a personalized, scalable and omnichannel experience for patients and physicians.
Life sciences companies face unprecedented challenges in effectively understanding and engaging with healthcare professionals (HCPs) and patients. Indegene’s Digital-Savvy HCP Report reveals that 52% of HCPs now prefer receiving medical and promotional content from pharmaceutical companies through social media (such as LinkedIn, Twitter, YouTube, or Facebook). This number is up significantly from 41% in 2020. Despite this shift, pharma companies are struggling to deliver high-quality experiences. A study by DT Consulting (an Indegene company) shows the industry currently holds a Customer Experience Quality (CXQ) score of 58. Although this rating is considered good, it merely meets basic expectations and falls short of the excellence benchmark, defined by a CXQ score of 76–100.
This post explores how Indegene’s Social Intelligence Solution uses advanced AI to help life sciences companies extract valuable insights from digital healthcare conversations. Built on AWS technology, the solution addresses the growing preference of HCPs for digital channels while overcoming the challenges of analyzing complex medical discussions at scale.
Digital transformation challenges in life sciences
Consider the scenario in the following figure: A patient shares their healthcare journey on social media, including details about their medical condition, treatment protocol, healthcare provider, medication usage patterns, treatment efficacy, and experienced side effects. When such patient narratives are collected at scale and processed through analytical models, they provide valuable strategic insights for pharmaceutical companies.

This has created an urgent need for sophisticated, healthcare-focused solutions that can automatically capture, analyze, and transform these digital conversations into actionable business intelligence. Social intelligence in healthcare can help companies achieve the following:

Monitor brand sentiment and reputation – Track relevant conversations in real time (forward listening) and historically (backward listening) to monitor sentiment and trends around specific drugs or brands.
Gauge launch reactions and adjust strategies – Monitor product launches to assess public reaction, identify leading indicators, and detect trends for brand switching, adverse events, or off-label drug use.
Identify and monitor key decision-makers – Enhance outreach by identifying and engaging with key influencers, particularly HCPs, by analyzing their posts and interactions across social media channels.
Gain competitive intelligence – Track brand sentiment and patient behavior patterns to identify emerging trends, gauge competitor performance, and adapt business strategies proactively.

Key challenges in healthcare social listening
Life sciences organizations recognize that customer-centricity becomes more attainable when decision-making is informed by data. Consequently, they are increasingly embracing strategies that use data to enhance customer experience and drive business outcomes. However, they face significant challenges:

Obsolete engagement methods – Traditional in-person interactions are becoming less effective as medical conversations migrate to digital channels.
Complex healthcare terminology – Standard social listening tools can’t adequately process healthcare-specific language, regulatory considerations, and authentic HCP identification.
Real-time insight requirements – Critical information about treatment preferences and product feedback emerges rapidly, outpacing manual analysis methods.

Solution overview
With over 25 years of industry experience, Indegene has built and continues to evolve their specialized Social Intelligence Solution on AWS, adapting to emerging healthcare and life sciences (HCLS) needs and use cases. This solution aims to transform how life sciences companies understand and engage with their stakeholders by combining machine learning (ML), natural language processing (NLP), and generative AI capabilities. Key differentiators of the solution include:

Broad social media integration – Provides automated data collection with comprehensive coverage across social media channels.
Healthcare-focused analytics – Delivers deep insights into pharmaceutical-specific attributes, including stakeholder segmentation, safety, and efficacy.
Targeted HCP identification – Uniquely detects and categorizes social media profiles of healthcare professionals for precision targeting.
Comprehensive insight capabilities – Provides granular analysis of conversations with sentiment analysis for nuanced understanding.

The following diagram illustrates an end-to-end life sciences system that integrates multiple functional layers. Starting from the bottom, it flows from data acquisition through data management layers, up to AI/ML core processing and customer-facing applications (such as HCP and DOL identification, and conference listening). The right side showcases supporting techno-functional services, including security, DevOps, and enterprise interfaces.

The system employs a modular, extensible architecture that transforms unstructured social data into actionable healthcare insights while maintaining regulatory compliance. This layered design allows for continuous evolution, helping pharmaceutical companies implement diverse use cases beyond initial applications.
Architecture layers
The architecture consists of the following layers:

Data acquisition layer – This foundation layer features specialized components for social media connectivity across channels like LinkedIn, Twitter, and YouTube, alongside sophisticated web scraping frameworks with rate limiting and randomization capabilities. A standout feature is the taxonomy-based query generator that uses healthcare terminology databases to create contextually relevant searches across medical conversations.
Data management layer – This layer provides robust data lake functionality with comprehensive governance features, including personally identifiable information (PII) detection, retention policies, and lineage tracking to help maintain regulatory compliance. This layer’s metadata repository and schema registry make sure complex healthcare data remains organized and discoverable, and extraction workers and data cleansers maintain data quality essential for reliable analytics. For more information, see Building and Scaling Robust and Effective Enterprise Data Governance in Life Sciences.
Core AI/ML service layer – This layer represents the system’s intelligence center, offering healthcare-specific capabilities like medical entity recognition, credential verification for healthcare professionals, and specialized sentiment analysis tuned for medical contexts. The system’s context-aware analyzer and confidence scoring mechanisms make sure insights reflect the nuanced nature of healthcare discussions, and the HCP-KOL-DOL identifier provides critical stakeholder classification capabilities unavailable in generic social listening tools. For more information, see Five must-have AI capabilities to lead the commercial race in life sciences.
Customer-facing analytics layer – This layer delivers actionable insights through specialized modules, including anomaly detection, predictive trend modeling, and adverse event detection, with medical side effect lexicons. Particularly valuable are the comparative analysis tools and share-of-voice calculators that provide competitive intelligence specific to the pharmaceutical industry. These components work together to power purpose-built applications like HCP identification, conference listening, brand reputation analysis, and patient sentiment tracking—all designed to help pharmaceutical companies navigate the increasingly digital healthcare conversation landscape with precision and compliance alignment.

A layered system-based modular approach offers the following benefits for healthcare use cases:

Reusability – The dynamic nature of healthcare-digital engagement demands flexible customer-facing solutions (top-layer). A modular approach provides reusable components that adapt to changing business use cases without requiring core infrastructure rebuilds. This approach delivers controlled implementation costs, consistent scalability and reliability, and minimal time-to-market.
Extensibility and separation of concerns – The solution separates four fundamental building blocks: data acquisition mechanisms, compliance-aligned data lifecycle management, healthcare-optimized AI/ML services, and domain-specific analytics. Given the accelerating pace of innovation in each area (from new social channels to advanced language models), these components must evolve independently with meticulously defined interfaces between them. This separation helps specialized teams update individual components without disrupting overall system performance or compliance requirements.
Standardization – Enterprise-wide consistency forms the backbone of a reliable healthcare analytics solution. Authentication, authorization, integration with enterprise systems like ERP and CRM, observability mechanisms, and security controls must follow standardized patterns across all social listening channels. When dealing with HCP identification and medical conversations, these standardized guardrails become not just technical best practices but essential regulatory and compliance requirements.
Domain adaptation – What fundamentally distinguishes our approach from generic social listening tools is our deep domain-specific implementation tailored for life sciences. Whereas lower layers like data acquisition and management follow industry standards, our upper layers deliver specialized capabilities engineered specifically for healthcare contexts. Identifying healthcare professionals in social conversations with high precision, enabling taxonomy-based querying across complex medical hierarchies, and contextualizing medical terminology within appropriate clinical frameworks are capabilities with transformative utility in life sciences applications. This domain specialization creates unique value that generic solutions simply cannot match, providing Indegene with a distinctive competitive advantage in helping pharmaceutical companies bridge the digital engagement gap revealed in our research.

Implementation on AWS
Indegene’s Social Intelligence Solution’s layered architecture can be efficiently implemented using AWS’s comprehensive suite of services, providing scalability, security, and specialized capabilities for life sciences analytics.
Data acquisition layer
The data acquisition layer orchestrates diverse data collection mechanisms to gather insights from multiple social and professional channels while facilitating compliance-aligned and efficient ingestion:

Amazon Managed Streaming for Apache Kafka (Amazon MSK) and Amazon Kinesis – Provide the backbone for real-time data ingestion from social media channels, handling high-throughput event streams from Twitter, LinkedIn, and other sources with built-in fault tolerance and robust message retention.
AWS Lambda – Powers the event-driven collection system, triggering data capture based on scheduled polling or webhook events.
Amazon AppFlow – Simplifies integration with social media APIs through no-code connectors to social media channels like LinkedIn and Twitter.
AWS Glue crawlers – Systematically extract data from web sources using headless browser capabilities, with rate limiting and randomization to facilitate ethical data collection.
Amazon Neptune – Stores and traverses complex medical terminology relationships needed for taxonomy-based query generation.

Data management layer
The data management layer demands robust storage, cataloging, and governance solutions:

Amazon Simple Storage Service (Amazon S3) – Serves as the cost-optimized data lake foundation, with intelligent tiering to automatically move less-accessed historical social data to lower-cost storage classes.
AWS Lake Formation – Provides fine-grained access controls and governance for the data lake, which is critical for managing sensitive healthcare information.
AWS Glue Data Catalog – Maintains the metadata repository and schema registry, making social media data discoverable and queryable.
Amazon EMR – Powers the extract, transform, and load (ETL) pipeline for large-scale data transformation, particularly useful for processing historical social media archives.
Amazon Comprehend Medical – Assists with PII detection in the data governance framework, identifying and helping protect sensitive healthcare information that might appear in social conversations.

Core AI/ML service layer
This critical layer uses AWS’s advanced AI capabilities to transform raw social data into healthcare-specific insights:

Amazon Bedrock and Amazon SageMaker AI – Form the centerpiece of the ML implementation with foundation models (FMs) fine-tuned for healthcare terminology.
Amazon ElastiCache for Redis – Implements high-performance Retrieval Augmented Generation (RAG) caching, dramatically improving response times for common healthcare queries and reducing computational costs.
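
As a minimal sketch of such a cache (not Indegene's actual implementation), the following memoizes generated answers in Redis, keyed on a hash of the normalized query and expiring after a TTL; the host, key prefix, and TTL value are illustrative assumptions.

import hashlib
import json
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def cached_answer(query, generate, ttl_seconds=3600):
    # Hash the normalized query so equivalent questions reuse the cached answer
    key = "rag:" + hashlib.sha256(query.strip().lower().encode()).hexdigest()
    hit = r.get(key)
    if hit is not None:
        return json.loads(hit)
    answer = generate(query)  # e.g., a call into the RAG pipeline
    r.setex(key, ttl_seconds, json.dumps(answer))
    return answer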

Amazon Bedrock serves as the cornerstone of the solution’s AI capabilities, offering several advantages for life sciences applications. It is a fully managed service that offers a choice of industry-leading large language models (LLMs) to build generative AI applications.
Amazon Bedrock minimizes the substantial infrastructure management burden typically associated with deploying LLMs, helping life sciences companies focus on insights rather than complex ML operations. Amazon Bedrock FMs can be specialized for healthcare terminology through domain adaptation, enabling accurate interpretation of complex medical discussions.
The RAG capabilities of Amazon Bedrock Knowledge Bases are particularly valuable for incorporating medical ontologies and taxonomies, making sure AI responses reflect current medical understanding and regulatory contexts.
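As a rough sketch of issuing such a query programmatically, the snippet below calls the Amazon Bedrock Knowledge Bases RetrieveAndGenerate API through boto3; the knowledge base ID, model ARN, Region, and question are placeholders rather than values from Indegene's deployment.
import boto3
agent_runtime = boto3.client("bedrock-agent-runtime", region_name="us-east-1")
response = agent_runtime.retrieve_and_generate(
    input={"text": "What side effects are HCPs discussing for insulin pumps this month?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "KB_ID_PLACEHOLDER",  # hypothetical knowledge base ID
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-haiku-20240307-v1:0",
        },
    },
)
print(response["output"]["text"])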
Amazon Bedrock Custom Model Import helps pharmaceutical companies use their proprietary domain-specific models and intellectual property, which is critical for companies with established investments in specialized healthcare AI.
For pharmaceutical companies monitoring product launches or adverse events, Amazon Bedrock Prompt Management allows for consistent, validated queries across different monitoring scenarios. Operational efficiency is significantly enhanced through Amazon Bedrock prompt caching mechanisms, which reduce redundant processing of similar queries and substantially lower costs—particularly valuable when analyzing recurring patterns in healthcare conversations. Amazon Bedrock Intelligent Prompt Routing enables intelligent distribution of tasks across multiple state-of-the-art LLMs, helping teams seamlessly compare and select the optimal model for each specific use case, such as Anthropic’s Claude for nuanced sentiment analysis, Meta Llama for rapid classification, or proprietary models for specialized pharmaceutical applications.
The Amazon Bedrock comprehensive responsible AI framework is particularly crucial in healthcare applications. The built-in evaluation tools enable systematic assessment of model outputs for fairness, bias, and accuracy in medical contexts, which is essential when analyzing diverse patient populations. Amazon Bedrock transparency features provide detailed model cards and lineage tracking, helping pharmaceutical companies document and justify AI-driven decisions to regulatory authorities. The human-in-the-loop workflows facilitate expert review of critical healthcare insights before they influence business decisions, and comprehensive audit logging creates the documentation trail necessary for compliance in regulated industries.
Amazon Bedrock Guardrails is especially valuable in the life sciences context, where guardrails can be configured with domain-specific constraints to help prevent the extraction or exposure of protected health information. These guardrails can be tailored to automatically block requests for individual patient information, personal details of healthcare professionals, or other sensitive data categories specific to pharmaceutical compliance requirements. This capability makes sure that even as the solution analyzes millions of healthcare conversations, it can maintain strict adherence to HIPAA, GDPR, and industry-specific privacy standards. The ability to implement these comprehensive guardrails makes sure the AI outputs comply with pharmaceutical marketing regulations and patient privacy requirements.
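A guardrail is attached at invocation time; the minimal sketch below does so through the Amazon Bedrock Converse API, with the guardrail identifier, version, model ID, and prompt all placeholders rather than Indegene's actual configuration.
import boto3
bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")
response = bedrock_runtime.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",  # example model choice
    messages=[{"role": "user",
               "content": [{"text": "Summarize this week's HCP sentiment about product X."}]}],
    guardrailConfig={
        "guardrailIdentifier": "GUARDRAIL_ID_PLACEHOLDER",  # hypothetical guardrail ID
        "guardrailVersion": "1",
    },
)
print(response["output"]["message"]["content"][0]["text"])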
Amazon Bedrock Agents can automate routine monitoring tasks while escalating potential adverse events or off-label discussions for human review.
By implementing fine-tuning pipelines through Amazon Bedrock, the solution continuously improves its understanding of emerging medical terminology and evolving social media language patterns, making sure the insights remain relevant as digital healthcare conversations evolve.
Customer-facing analytics and insights service layer
The solution’s analytics capabilities transform processed data into actionable business intelligence:

Reporting – Delivers interactive dashboards and visualizations of brand sentiment, competitor analysis, and trend detection with healthcare-specific visualizations and metrics.
Amazon Managed Service for Apache Flink – Enables real-time trend detection and anomaly identification in streaming social media data, which is particularly valuable for monitoring adverse event signals.
AWS Step Functions – Orchestrates complex analytics workflows like adverse event detection that require multiple processing steps and human review.
Amazon Athena – Provides SQL queries against the processed social media data lake, helping business users explore patterns without complex data engineering.
Amazon Lex – Powers natural language interfaces, helping users query and interact with social media insights through conversational AI.

Supporting techno-functional services
The solution’s enterprise integration and operational capabilities use AWS’s comprehensive management tools:

AWS Control Tower and AWS Organizations – Implement guardrails and compliance controls essential for life sciences applications.
Amazon CloudWatch and AWS X-Ray – Provide comprehensive observability across the solution, with specialized monitoring for healthcare-specific metrics and compliance indicators.
AWS AppSync – Builds the intuitive user experience layer with real-time data synchronization.
Amazon AppFlow and Amazon API Gateway – Enable enterprise interface integration with CRM and ERP systems.
Amazon Cognito – Delivers secure user authentication and authorization, with role-based access controls appropriate for different stakeholder groups within pharmaceutical organizations.

This AWS-powered implementation delivers the benefits we have discussed—reusability, extensibility, standardization, and domain adaptation—while providing the security, compliance-alignment, and performance capabilities essential for life sciences applications.
Example use case
Let’s explore the implementation of a domain-specific, taxonomy-based query generation system for social media data analysis. A typical implementation comprises the following components:

Medical terminology database – This repository stores standardized medical terminology from SNOMED CT, MeSH, and RxNorm. It returns detailed information about queried terms, including synonyms, parent categories, and codes. For example, querying “diabetes” returns alternatives like “diabetes mellitus” and “DM” with classification data. The database maintains specialty-specific collections for fields such as oncology and cardiology, enabling precise medical language processing.
Synonym expansion engine – This engine expands medical terms into sets of clinically equivalent expressions. For a term like “insulin pump,” it retrieves medical synonyms such as “CSII” from the terminology database, supplements these with general language alternatives, and handles abbreviations. The resulting synonym list makes sure queries capture content regardless of terminology variations.
Context-aware query builder – This component transforms medical term lists into optimized search queries. It uses the Synonym Expansion Engine for each term, formats synonym groups with Boolean operators, and applies system-specific syntax. When targeting healthcare professional content, it adds credential filters. The output balances comprehensiveness with system constraints to maximize relevant result retrieval.
Query effectiveness analyzer – This analyzer evaluates query performance and provides improvement recommendations. It calculates metrics including result relevance, topic diversity, and healthcare professional content ratio. Using NLP to identify medical entities, it suggests specific improvements such as broadening queries with few results or adding professional filters when needed.
Taxonomy-based query generator – This orchestrator manages the entire workflow as the main client interface. It coordinates with other components to construct optimized queries and packages results with expansion metadata. It also evaluates search results to provide performance metrics and improvement suggestions, delivering sophisticated search capabilities through a simplified interface.

The following sequence diagram illustrates a typical use case for a taxonomy-driven query lookup.
A typical user journey narrative includes the following phases:

Step 1: User intent – The user enters two simple terms, “diabetes” and “insulin pump,” into the search interface. They specify they want to search on Twitter and only see content from healthcare professionals. This basic information is passed on to our query enhancement system, which begins the process of creating a more comprehensive search.
Step 2: Expand first term – The system looks up “diabetes” in its medical terminology database (SNOMED CT). It identifies several related terms and technical variations, including “diabetes mellitus” (the formal medical term), “DM” (common medical abbreviation), “T1DM” (Type 1 diabetes mellitus), and “T2DM” (Type 2 diabetes mellitus). The system incorporates these terms to make sure the search captures the full spectrum of diabetes-related discussions.
Step 3: Expand second term – The system then consults its database (MeSH terminology) for “insulin pump” and discovers related clinical terms such as “insulin infusion pump” (formal medical device name), “continuous subcutaneous insulin infusion” (clinical procedure name), and “CSII” (common medical abbreviation). These variations are integrated into the search query to capture the different terminology healthcare professionals might use when discussing this treatment approach.
Step 4: Build enhanced query – The system intelligently combines the expanded terms into organized groups: Group 1 (diabetes OR diabetes mellitus OR DM OR T1DM OR T2DM) and Group 2 (insulin pump OR insulin infusion pump OR continuous subcutaneous insulin infusion OR CSII). It connects these groups with AND operators to make sure results contain references to both diabetes and insulin pump technologies, creating a focused yet comprehensive query structure (a code sketch of this query-building flow follows this list).
Step 5: Add professional filters – Because the user specifically wants content from healthcare professionals, the system adds specialized filters, including professional title indicators (doctor OR physician OR MD OR clinician OR nurse OR NP OR pharmacist OR PharmD OR healthcare professional OR HCP OR medical) and Twitter’s verification filter (filter: verified). These filters work together to prioritize content from qualified medical experts while filtering out public discussions.
Step 6: Execution and analysis – The system executes this enhanced query on Twitter and analyzes the returned results to evaluate their relevance and professional source quality. It provides the user with performance metrics such as: “Found 5 results with 80% from verified healthcare professionals.” The query effectiveness analyzer module then offers intelligent suggestions for further refinement, such as incorporating age-specific terms (pediatric, adult, elderly) to better target specific patient populations.
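
The following minimal sketch shows how steps 2 through 5 can be assembled in code, using a small hard-coded synonym map as a stand-in for the SNOMED CT and MeSH lookups; the term lists, credential filters, and Twitter-style syntax are illustrative assumptions rather than Indegene's production logic.

SYNONYMS = {
    "diabetes": ["diabetes", "diabetes mellitus", "DM", "T1DM", "T2DM"],
    "insulin pump": ["insulin pump", "insulin infusion pump",
                     "continuous subcutaneous insulin infusion", "CSII"],
}
HCP_TERMS = ["doctor", "physician", "MD", "clinician", "nurse", "NP", "pharmacist",
             "PharmD", "healthcare professional", "HCP", "medical"]

def or_group(terms):
    # Quote multi-word phrases and join alternatives with OR (steps 2-4)
    return "(" + " OR ".join(f'"{t}"' if " " in t else t for t in terms) + ")"

def build_query(user_terms, hcp_only=True):
    query = " AND ".join(or_group(SYNONYMS.get(t, [t])) for t in user_terms)
    if hcp_only:  # step 5: prioritize verified healthcare professional content
        query += " AND " + or_group(HCP_TERMS) + " filter:verified"
    return query

print(build_query(["diabetes", "insulin pump"]))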

The following diagram illustrates how the taxonomy-based query generation flow can be implemented on AWS using Amazon Bedrock Agents.

Results and next steps
Indegene’s Social Intelligence Solution demonstrates measurable impact across various dimensions:

Time-to-insight – Reduction in insight generation time
Operational cost savings – Reduced analytics outsourcing and FTE costs
Business outcomes – Measured by the percentage of insights used in downstream decision-making

Looking ahead, the solution is evolving to deliver even more comprehensive capabilities:

Omnichannel intelligence integration – Unify insights across multiple channels, including social media, CRM systems, representative interactions, email campaigns, and HCP prescription behavior, creating a true 360-degree view of stakeholder sentiment and behavior.
Conference listening capabilities – Use advanced audio/video analysis to extract valuable insights from podcasts, webinars, and live medical conference sessions—formats that have previously been difficult to analyze at scale.
Conversational insight assistant powered by generative AI – Help users interact with the system through natural language queries and receive real-time, narrative-style summaries of social insights.

Conclusion
This post explored how advancements in generative AI have sparked a change in how pharmaceutical teams access and use social intelligence, transforming insights into instantly accessible and actionable resources across the organization. In future posts, we will explore specific use cases, such as conference listening, Key Opinion Leader (KOL) identification, and Digital Opinion Leader (DOL) identification. To learn more, refer to the following resources:

Guidance for Social Media Insights on AWS
Generative AI in Healthcare & Life Sciences
AWS Skill Builder course: Building Generative AI Applications Using Amazon Bedrock
GitHub repo: Sample Healthcare and Life Sciences Agents on AWS

About the authors
Rudra Kannemadugu is a Senior Director–Data and Advanced Analytics at Indegene with 22+ years of experience, leading digital transformation across pharma, healthcare, and retail. He specializes in drug launches, sales force operations, and building enterprise data ecosystems. A strategic leader in GenAI adoption, he drives commercial analytics, predictive modeling, and marketing automation. Rudra is a proven people leader in spearheading AI transformation initiatives and talent development, and is also skilled in cross-functional collaboration and global stakeholder management to accelerate drug commercialization.
Shravan K S is a Senior Manager–Data Analytics at Indegene and an experienced GenAI Architect with 17+ years in analytics, data platforms, and system integration across life sciences and healthcare. He has led the delivery of secure, scalable solutions in Generative AI, data engineering, platform modernization, and emerging Agentic AI systems. Skilled in driving transformation through SAFe Agile, he advances innovation via cloud-native architectures and AI-driven data operations. He holds advanced certifications from AWS, Snowflake, and Dataiku, and combines cutting-edge technologies with real-world impact in pharma and healthcare analytics.
Bhagyashree Chandak is a Solutions Architect in the APAC region. She works with customers to design and build innovative solutions in the AWS Cloud, bridging the gap between complex business requirements and technical solutions across various domains. As an AI/ML enthusiast, Bhagyashree has expertise in both traditional ML and advanced GenAI techniques.
Punyabrota Dasgupta is a Principal Solutions Architect at AWS. His area of expertise includes machine learning applications for media and entertainment business. Beyond work, he loves tinkering and restoration of antique electronic appliances.

Unlocking enhanced legal document review with Lexbe and Amazon Bedrock

This post is co-authored with Karsten Weber and Rosary Wang from Lexbe.
Legal professionals are frequently tasked with sifting through vast volumes of documents to identify critical evidence for litigation. This process can be time-consuming, prone to human error, and expensive—especially when tight deadlines loom. Lexbe, a leader in legal document review software, confronted these challenges head-on by using Amazon Bedrock. By integrating the advanced AI and machine learning services offered by Amazon, Lexbe streamlined its document review process, boosting both efficiency and accuracy. In this blog post, we explore how Lexbe used Amazon Bedrock and other AWS services to overcome business challenges and deliver a scalable, high-performance solution for legal document analysis.
Business challenges and why they matter
Legal professionals routinely face the daunting task of managing and analyzing massive sets of case documents, which can range anywhere from 100,000 to over a million. Rapidly identifying relevant information within these large datasets is often critical to building a strong case—or preventing a costly oversight. Lexbe addresses this challenge by using Amazon Bedrock in their custom application: Lexbe Pilot.
Lexbe Pilot is an AI-powered Q&A assistant integrated into the Lexbe eDiscovery platform. It enables legal teams to instantly query and extract insights from the full body of documents in an entire case using generative AI—eliminating the need for time-consuming manual research and analysis. Using Amazon Bedrock Knowledge Bases, users can query an entire dataset and retrieve grounded, contextually relevant results. This approach goes far beyond traditional keyword searches by helping legal teams identify critical or smoking gun documents that could otherwise remain hidden. As legal cases grow, keyword searches that previously returned a handful of documents might now produce hundreds or even thousands. Lexbe Pilot distills these large result sets into concise, meaningful answers—giving legal teams the insights they need to make informed decisions.
Failing to address these challenges can lead to missed evidence, possibly resulting in unfavorable outcomes. With Amazon Bedrock and its associated services, Lexbe provides a scalable, high-performance solution that empowers legal professionals to navigate the growing landscape of electronic discovery efficiently and accurately.
Solution overview: Amazon Bedrock as the foundation
Lexbe transformed its document review process by integrating Amazon Bedrock together with other AWS AI and machine learning (ML) services. With its deep integration into the AWS ecosystem, Amazon Bedrock delivers the performance and scalability necessary to meet the rigorous demands of Lexbe’s clients in the legal industry.
Key AWS services used:

Amazon Bedrock – A fully managed service offering high-performing foundation models (FMs) for large-scale language tasks. By using these models, Lexbe can rapidly analyze vast amounts of legal documents with exceptional accuracy.
Amazon Bedrock Knowledge Bases – Provides fully managed support for an end-to-end Retrieval-Augmented Generation (RAG) workflow, enabling Lexbe to ingest documents, perform semantic searches, and retrieve contextually relevant information (a minimal query sketch follows this list).
Amazon OpenSearch – Indexes all the document text and corresponding metadata in both vector and text (keyword) modes, allowing Lexbe to quickly locate specific documents or key information across large datasets by vector similarity or by keyword.
AWS Fargate – Orchestrates the analysis and processing of large-scale workloads in a serverless container environment, allowing Lexbe to scale horizontally without managing underlying server infrastructure.

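To make the retrieval flow concrete, the following minimal sketch shows how a query against an Amazon Bedrock knowledge base could look using the AWS SDK for Python (Boto3). The knowledge base ID, model ARN, query text, and Region are hypothetical placeholders for illustration, not values from Lexbe’s deployment.

```python
import boto3

# Hypothetical placeholders -- not values from Lexbe's deployment
KNOWLEDGE_BASE_ID = "EXAMPLEKBID"
MODEL_ARN = "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-5-sonnet-20240620-v1:0"

# The Bedrock Agent Runtime client exposes the Knowledge Bases query APIs
client = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

response = client.retrieve_and_generate(
    input={"text": "Summarize all documents related to the 2023 warehouse incident."},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": KNOWLEDGE_BASE_ID,
            "modelArn": MODEL_ARN,
        },
    },
)

# The generated answer, grounded in the retrieved documents
print(response["output"]["text"])

# Citations point back to the source documents used for the answer
for citation in response.get("citations", []):
    for reference in citation.get("retrievedReferences", []):
        print(reference["location"])
```

A single retrieve_and_generate call performs both the vector search against the knowledge base and the response generation, which is what lets an assistant like Pilot return grounded answers with citations back to the source documents.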
Amazon Bedrock Knowledge Bases architecture and workflow
The integration of Amazon Bedrock within Lexbe’s platform is shown in the following architecture diagram. The architecture is designed to handle both large-scale ingestion and retrieval of legal documents.

User access: A user accesses the frontend application through a web browser.
Request routing: The request is routed through Amazon CloudFront, which connects to the backend through an Application Load Balancer.
Backend processing: Backend services running on Fargate handle the request and interact with the system components.
Document handling: Legal documents are stored in an Amazon Simple Storage Service (Amazon S3) bucket, and Apache Tika extracts text from these documents. The extracted text is stored as individual text files in a separate S3 bucket, which serves as the data source for Amazon Bedrock Knowledge Bases.
Embedding creation: The extracted text is processed using Amazon Titan Text Embeddings V2 to generate embeddings. Lexbe experimented with multiple embedding models—including Amazon Titan and Cohere—and tested configurations with varying token sizes (for example, 512 compared to 1024 tokens). See the embedding sketch after these steps.
Embedding storage: The generated embeddings are stored in a vector database for fast retrieval.
Query execution: Amazon Bedrock Knowledge Bases retrieves relevant data from the vector database for a given query.
LLM integration: The Anthropic Claude 3.5 Sonnet large language model (LLM) on Amazon Bedrock processes the retrieved data to generate a coherent and accurate response.
Response delivery: The final response is returned to the user through the frontend application, served via CloudFront.

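The embedding creation step above can be illustrated with a short, hypothetical sketch of embedding a single text chunk with Amazon Titan Text Embeddings V2 through the Amazon Bedrock runtime API. Chunking, batching, and the write to the vector store are omitted; in the architecture above, ingestion is managed by Amazon Bedrock Knowledge Bases, so a direct call like this is mainly useful when comparing embedding models or token sizes, as Lexbe did during testing.

```python
import json
import boto3

# Bedrock Runtime invokes the embedding model directly
bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

def embed_chunk(text: str, dimensions: int = 1024) -> list[float]:
    """Generate an embedding vector for one text chunk using Titan Text Embeddings V2."""
    body = json.dumps({
        "inputText": text,
        "dimensions": dimensions,  # 256, 512, or 1024
        "normalize": True,
    })
    response = bedrock_runtime.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=body,
    )
    payload = json.loads(response["body"].read())
    return payload["embedding"]

# Illustrative usage: embed one chunk of text extracted by Apache Tika
vector = embed_chunk("Extracted page text from a legal document ...")
print(len(vector))  # 1024
```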
Amazon and Lexbe collaboration
Over an eight-month period, Lexbe worked hand in hand with the Amazon Bedrock Knowledge Bases team to enhance the performance and accuracy of its Pilot feature. This collaboration included weekly strategy meetings between senior teams from both organizations, enabling rapid iterations. From the outset, Lexbe established clear acceptance criteria focused on achieving specific recall rates, and these metrics served as the benchmark for when the feature was ready for production. As illustrated in the following figure, the system’s performance passed through five significant milestones, each marking a leap toward production. The teams focused on recall rate because identifying the right documents is critical to producing the correct response: unlike RAG use cases where a user has a specific question that can often be answered by a few documents, Pilot generates findings-of-fact reports that draw on a large number of source documents. Tracking recall rate helped ensure that Amazon Bedrock Knowledge Bases was not leaving out important information. (A simple recall calculation is sketched after the milestones.)
First iteration (January 2024) – The initial system had only a 5% recall rate, showing that much work was needed to reach production.
Second iteration (April 2024) – New features added to Amazon Bedrock Knowledge Bases led to a noticeable boost in accuracy, bringing the recall rate to 36%.
Third iteration (June 2024) – Parameter tuning, particularly around token size, led to another jump in performance, raising the recall rate to 60%.
Fourth iteration (August 2024) – A recall rate of 66% was achieved using the Amazon Titan Text Embeddings V2 model.
Fifth iteration (December 2024) – The introduction of reranker technology proved invaluable, enabling a recall rate of up to 90%.

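Recall rate here is the standard retrieval metric: the fraction of documents known to be relevant for a query that the system actually returns. The following sketch shows one simple way to compute it against a hand-labeled ground-truth set; the document IDs are illustrative, and this is not Lexbe’s evaluation harness.

```python
def recall(retrieved_ids: set[str], relevant_ids: set[str]) -> float:
    """Fraction of the ground-truth relevant documents that were retrieved."""
    if not relevant_ids:
        return 0.0
    return len(retrieved_ids & relevant_ids) / len(relevant_ids)

# Illustrative document IDs, not real case data
ground_truth = {"DOC-0012", "DOC-0471", "DOC-0933", "DOC-1048"}
returned = {"DOC-0012", "DOC-0933", "DOC-2200"}

print(f"Recall: {recall(returned, ground_truth):.0%}")  # Recall: 50%
```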
The final outcome is impressive

Broad, human-style reporting – In an industrial-accident matter, Pilot was asked to conduct a full findings-of-fact analysis. It produced a polished, five-page report complete with clear section headings and hyperlinks back to every source document, regardless of whether those documents were in English, Spanish, or any other language.
Deep, automated inference – In a case involving tens of thousands of documents, we asked, “Who is Bob’s son?” There was no explicit reference to his children anywhere. Yet Pilot zeroed in on an email that began “Dear Fam,” closed with “Love, Mom/Linda,” and included the children’s first and last names in the metadata. By connecting those dots, it accurately identified Bob’s son and cited the exact email that supported the inference.

Traditional eDiscovery techniques cannot do either of these. With Pilot, legal teams can:

Generate actionable reports that attorneys can swiftly iterate for deeper analysis.
Streamline eDiscovery by surfacing critical connections that go far beyond simple text matches.
Unlock strategic insights in moments, even from multilingual data.

Whether you need a comprehensive, human-readable report or laser-focused intelligence on the relationships lurking in your data, Lexbe Pilot, powered by Amazon Bedrock Knowledge Bases, delivers the precise information you need—fast.
Benefits of integrating Amazon Bedrock and AWS services
By integrating Amazon Bedrock with other AWS services, Lexbe gained several strategic advantages in their document review process:
Scalability – Using Amazon Elastic Container Service (Amazon ECS) and AWS Fargate, Lexbe can dynamically scale its processing infrastructure.
Cost efficiency – Running Linux processing workloads on Amazon ECS with Spot capacity provides a significant cost advantage.
Security – The robust security framework of AWS, including encryption and role-based access controls, safeguards sensitive legal documents. This is critical for Lexbe’s clients, who must adhere to strict confidentiality requirements.
Conclusion: A scalable, accurate, and cost-effective solution
Through its integration of Amazon Bedrock, Lexbe has transformed its document review platform into a highly efficient, scalable, and accurate solution. By combining Amazon Bedrock, Amazon OpenSearch, and AWS Fargate, Lexbe achieved marked improvements in both retrieval accuracy and processing speed—all while keeping costs under control.

Lexbe’s success illustrates the power of AWS AI/ML services to tackle complex, real-world challenges. By harnessing the flexible, scalable, and cost-effective offerings of AWS, Lexbe is well-equipped to meet the evolving needs of the legal industry—both today and in the future.

If your organization is facing complex challenges that could benefit from AI/ML-powered solutions, take the next step with AWS. Start by working closely with your AWS Solutions Architect to design a tailored strategy that aligns with your unique needs. Engage with the AWS product team to explore cutting-edge services and make sure that your solution is scalable, secure, and future-ready. Together, we can help you innovate faster, reduce costs, and deliver transformative outcomes.

About the authors
Wei Chen is a Senior Solutions Architect at Amazon Web Services, based in Austin, Texas. With over 20 years of experience, he specializes in helping customers design and implement solutions for complex technical challenges. In his role at AWS, Wei partners with organizations to modernize their applications and fully leverage cloud capabilities to meet strategic business goals. His area of expertise is AI/ML and AWS Security services.
Gopikrishnan Anilkumar is a Principal Technical Product Manager in Amazon. He has over 10 years of product management experience across a variety of domains and is passionate about AI/ML.
Sandeep Singh is a Senior Generative AI Data Scientist at Amazon Web Services, helping businesses innovate with generative AI. He specializes in generative AI, machine learning, and system design. He has successfully delivered state-of-the-art AI/ML-powered solutions to solve complex business problems for diverse industries, optimizing efficiency and scalability.
Karsten Weber has been the CTO and Co-founder of Lexbe, an eDiscovery provider, since January 2006. Based in Austin, Texas, Lexbe offers Lexbe Online, a cloud-based application for eDiscovery, litigation, and legal document processing, production, review, and case management. Under Karsten’s leadership, Lexbe has developed a robust platform and comprehensive eDiscovery services that help law firms and organizations efficiently manage large ESI data sets for legal review and discovery production. Karsten’s expertise in technology and innovation has been pivotal in driving Lexbe’s success over the past 19 years.
Rosary Wang is a Sr. Software Engineer at Lexbe, an eDiscovery software and services provider based in Austin, Texas.