DoWhile loops now supported in Amazon Bedrock Flows

Today, we are excited to announce support for DoWhile loops in Amazon Bedrock Flows. With this powerful new capability, you can create iterative, condition-based workflows directly within your Amazon Bedrock flows, using Prompt nodes, AWS Lambda functions, Amazon Bedrock Agents, Amazon Bedrock Flows inline code, Amazon Bedrock Knowledge Bases, Amazon Simple Storage Service (Amazon S3), and other Amazon Bedrock nodes within the loop structure. This feature avoids the need for complex workarounds, enabling sophisticated iteration patterns that use the full range of Amazon Bedrock Flows components. Tasks like content refinement, recursive analysis, and multi-step processing can now seamlessly integrate AI model calls, custom code execution, and knowledge retrieval in repeated cycles. By providing loop support with diverse node types, this feature simplifies generative AI application development and accelerates enterprise adoption of complex, adaptive AI solutions.
Organizations using Amazon Bedrock Flows can now use DoWhile loops to design and deploy workflows for building more scalable and efficient generative AI applications fully within the Amazon Bedrock environment while achieving the following:

Iterative processing – Execute repeated operations until specific conditions are met, enabling dynamic content refinement and recursive improvements
Conditional logic – Implement sophisticated decision-making within flows based on AI outputs and business rules
Complex use cases – Manage multi-step generative AI workflows that require repeated execution and refinement
Builder-friendly – Create and manage loops through both the Amazon Bedrock API and the AWS Management Console
Observability – Employ seamless tracking of loop iterations, conditions, and execution paths

In this post, we discuss the benefits of this new feature, and show how to use DoWhile loops in Amazon Bedrock Flows.
Benefits of DoWhile loops in Amazon Bedrock Flows
Using DoWhile loops in Amazon Bedrock Flows offers the following benefits:

Simplified flow control – Create sophisticated iterative workflows without complex orchestration or external services
Flexible processing – Enable dynamic, condition-based execution paths that can adapt based on AI outputs and business rules
Enhanced development experience – Help users build complex iterative workflows through an intuitive interface, without requiring external workflow management

Solution overview
In the following sections, we show how to create a simple Amazon Bedrock flow using DoWhile loops with Lambda functions. Our example showcases a practical application where we construct a flow that generates a blog post on a given topic iteratively until certain acceptance criteria are fulfilled. The flow demonstrates the power of combining different types of Amazon Bedrock Flows nodes within a loop structure, where Prompt nodes generate and fine-tune the blog post, Inline Code nodes allow writing custom Python code to analyze the outputs, and S3 Storage nodes enable storing each version of the blog post during the process for reference. The DoWhile loop continues to execute until the quality of the blog post meets the condition set in the loop controller. This example illustrates how different flow nodes can work together within a loop to progressively transform data until desired conditions are met, providing a foundation for understanding more complex iterative workflows with various node combinations.
Prerequisites
Before implementing the new capabilities, make sure you have the following:

An AWS account
Other Amazon Bedrock services in place:

Create and test your base prompts for customer service interactions in Amazon Bedrock Prompt Management
Create guardrails with relevant rules using Amazon Bedrock Guardrails

Resources in auxiliary AWS services needed for your workflow, such as Lambda, Amazon DynamoDB, and Amazon S3
Required AWS Identity and Access Management (IAM) permissions:

Access to Amazon Bedrock Flows
Appropriate access to large language models (LLMs) in Amazon Bedrock

After these components are in place, you can proceed with using Amazon Bedrock Flows with DoWhile loop capabilities in your generative AI use case.
Create your flow using DoWhile Loop nodes
Complete the following steps to create your flow:

On the Amazon Bedrock console, choose Flows under Builder tools in the navigation pane.
Create a new flow, for example, dowhile-loop-demo. For detailed instructions on creating a flow, see Amazon Bedrock Flows is now generally available with enhanced safety and traceability.
Add a DoWhile loop node.
Add additional nodes according to the solution workflow (discussed in the next section).

Amazon Bedrock provides different node types to build your prompt flow. For this example, we use a DoWhile Loop node that calls different types of nodes for a generative AI-powered application, which creates a blog post on a given topic and checks its quality in every loop iteration. There is one DoWhile Loop node in the flow. This new node type is on the Nodes tab in the left pane, as shown in the following screenshot.

DoWhile loop workflow
A DoWhile loop consists of two parts: the loop and the loop controller. The loop controller validates the logic for the loop and decides whether to continue or exit. In this example, the loop executes Prompt, Inline Code, and S3 Storage nodes in each iteration.
Let’s go through this flow step-by-step, as illustrated in the preceding screenshot:

A user asks for a blog post on a specific topic (for example, using the following prompt: {"topic": "AWS Lambda", "Audience": "Chief Technology Officer", "word_count": "500"}). This prompt is sent to the Prompt node (Content_Generator).
The Prompt node (Content_Generator) writes a blog post based on the prompt using one of the LLMs provided by Amazon Bedrock (such as Amazon Nova or Anthropic's Claude), and the result is sent to the Loop Input node. This is the entry point to the DoWhile Loop node.
Three steps happen in tandem:

The Loop Input node forwards the blog post content to another Prompt node (Blog_Analysis_Rating), which rates the post based on criteria specified in the prompt. The output of this Prompt node is JSON like the following example. The output of a Prompt node is always of type String; you can modify the prompt to get different types of output according to your needs, for example by asking the LLM to output a single rating number.

{
  "overall_rating": 8.5,
  "category_ratings": {
    "clarity_and_readability": 9,
    "value_to_target_audience": 8,
    "engagement_level": 8,
    "technical_accuracy": 9
  }
}

The blog post is sent to the flow output during every iteration. It becomes the final version when the loop condition is no longer met (exiting the loop) or the maximum number of loop iterations is reached.
At the same time, the output of the previous Prompt node (Content_Generator) is forwarded to another Prompt node (Blog_Refinement) by the Loop Input node. This node recreates or modifies the blog post based on the feedback from the analysis.

The output of the Prompt node (Blog_Analysis_Rating) is fed into the Inline Code node, which extracts the rating (or any other value needed for checking the condition inside the loop controller) and returns it as an input variable, for example a numeric rating.

def __func(variable):
    return float(variable["overall_rating"])
__func(variable)

Treat the data flowing into the Inline Code node as untrusted, and implement appropriate parsing, validation, and data handling.
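
For example, a slightly more defensive version of the snippet above might parse and bound the rating before returning it. This is a sketch only, assuming the json module is available in the Inline Code environment:

import json

def __func(variable):
    # The Prompt node output arrives as a string, so parse it defensively.
    try:
        payload = json.loads(variable) if isinstance(variable, str) else variable
        rating = float(payload.get("overall_rating", 0.0))
    except (ValueError, TypeError, AttributeError):
        rating = 0.0  # treat malformed output as a failing score so the loop retries
    return max(0.0, min(rating, 10.0))  # clamp to the expected 0-10 range

__func(variable)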

The output of the Inline Code node is fed into the loop condition inside the loop controller and validated against the condition we set up for continuing the loop. In this example, we check for a rating less than or equal to 9 for the generated blog post. You can check up to five conditions. Additionally, a maximum loop iterations parameter makes sure the loop doesn't run infinitely.
The step consists of two parts:

A Prompt node (Blog_Refinement) forwards the newly generated blog post to loopinput inside the loop controller.
The loop controller stores this version of the post in Amazon S3 for future reference and for comparing the different versions generated.

This path executes if the continue condition is met and the maximum number of loop iterations hasn't been reached. In that case, the newly modified blog post is forwarded to the input field of the Loop Input node as LoopInput and the loop continues.
The final output is produced after the DoWhile loop's continue condition is no longer met or the maximum number of iterations is completed. The output is the final version of the blog post.

You can see the output as shown in the following screenshot. The system also provides access to node execution traces, offering detailed insights into each processing step, real-time performance metrics, and highlighting issues that may have occurred during the flow’s execution. Traces can be enabled using an API and sent to an Amazon CloudWatch log. In the API, set the enableTrace field to true in an InvokeFlow request. Each flowOutputEvent in the response is returned alongside a flowTraceEvent.
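
The following is a minimal sketch of such an InvokeFlow call using Boto3; the flow and alias identifiers, input node name, and document payload are placeholders to adapt to your own flow:

import boto3

client = boto3.client("bedrock-agent-runtime")

response = client.invoke_flow(
    flowIdentifier="FLOW_ID",             # placeholder
    flowAliasIdentifier="FLOW_ALIAS_ID",  # placeholder
    enableTrace=True,                     # emit flowTraceEvent entries alongside outputs
    inputs=[
        {
            "nodeName": "FlowInputNode",   # the flow's input node
            "nodeOutputName": "document",
            "content": {
                "document": {
                    "topic": "AWS Lambda",
                    "Audience": "Chief Technology Officer",
                    "word_count": "500",
                }
            },
        }
    ],
)

# The response stream interleaves output and trace events.
for event in response["responseStream"]:
    if "flowTraceEvent" in event:
        print(event["flowTraceEvent"])                           # per-node trace details
    elif "flowOutputEvent" in event:
        print(event["flowOutputEvent"]["content"]["document"])   # final blog post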

You have now successfully created and executed an Amazon Bedrock flow using DoWhile Loop nodes. You can also use Amazon Bedrock APIs to programmatically execute this flow. For additional details on how to configure flows, see Amazon Bedrock Flows is now generally available with enhanced safety and traceability.
Considerations
When working with DoWhile Loop nodes in Amazon Bedrock Flows, note the following:

DoWhile Loop nodes don’t support nested loops (loops within loops)
Each loop controller can evaluate up to five input conditions for its exit criteria
A maximum iteration limit must be specified to help prevent infinite loops and enable controlled execution

Conclusion
The integration of DoWhile loops in Amazon Bedrock Flows marks a significant advancement in iterative workflow capabilities, enabling sophisticated loop-based processing that can incorporate Prompt nodes, Inline Code nodes, S3 Storage nodes, Lambda functions, agents, DoWhile Loop nodes, and Knowledge Base nodes. This enhancement responds directly to enterprise customers’ needs for handling complex, repetitive tasks within their AI workflows, helping developers create adaptive, condition-based solutions without requiring external orchestration tools. By providing support for iterative processing patterns, DoWhile loops help organizations build more sophisticated AI applications that can refine outputs, perform recursive operations, and implement complex business logic directly within the Amazon Bedrock environment. This powerful addition to Amazon Bedrock Flows democratizes the development of advanced AI workflows, making iterative AI processing more accessible and manageable across organizations.
DoWhile loops in Amazon Bedrock Flows are now available in all the AWS Regions where Amazon Bedrock Flows is supported, except for the AWS GovCloud (US) Region. To get started, use the Amazon Bedrock console or Amazon Bedrock APIs to begin building flows with Amazon Bedrock Flows. To learn more, refer to Create your first flow in Amazon Bedrock and Track each step in your flow by viewing its trace in Amazon Bedrock.
We’re excited to see the innovative applications you will build with these new capabilities. As always, we welcome your feedback through AWS re:Post for Amazon Bedrock or your usual AWS contacts. Join the generative AI builder community at community.aws to share your experiences and learn from others.

About the authors
Shubhankar Sumar is a Senior Solutions Architect at AWS, where he specializes in architecting generative AI-powered solutions for enterprise software and SaaS companies across the UK. With a strong background in software engineering, Shubhankar excels at designing secure, scalable, and cost-effective multi-tenant systems on the cloud. His expertise lies in seamlessly integrating cutting-edge generative AI capabilities into existing SaaS applications, helping customers stay at the forefront of technological innovation.
Jesse Manders is a Senior Product Manager on Amazon Bedrock, the AWS Generative AI developer service. He works at the intersection of AI and human interaction with the goal of creating and improving generative AI products and services to meet our needs. Previously, Jesse held engineering team leadership roles at Apple and Lumileds, and was a senior scientist in a Silicon Valley startup. He has an M.S. and Ph.D. from the University of Florida, and an MBA from the University of California, Berkeley, Haas School of Business.
Eric Li is a Software Development Engineer II at AWS, where he builds core capabilities for Amazon Bedrock and SageMaker to support generative AI applications at scale. His work focuses on designing secure, observable, and cost-efficient systems that help developers and enterprises adopt generative AI with confidence. He is passionate about advancing developer experiences for building with large language models, making it easier to integrate AI into production-ready cloud applications.

How PropHero built an intelligent property investment advisor with con …

This post was written with Lucas Dahan, Dil Dolkun, and Mathew Ng from PropHero.
PropHero is a leading property wealth management service that democratizes access to intelligent property investment advice through big data, AI, and machine learning (ML). For the Spanish and Australian consumer base, PropHero needed an AI-powered advisory system that could engage customers in accurate property investment discussions. The goal was to provide personalized investment insights and to guide and assist users at every stage of their investment journey: from understanding the process, gaining visibility into timelines, securely uploading documents, to tracking progress in real time.
PropHero collaborated with the AWS Generative AI Innovation Center to implement an intelligent property investment advisor using AWS generative AI services with continuous evaluation. The solution helps users engage in natural language conversations about property investment strategies and receive personalized recommendations based on PropHero’s comprehensive market knowledge.
In this post, we explore how we built a multi-agent conversational AI system using Amazon Bedrock that delivers knowledge-grounded property investment advice. We cover the agent architecture, model selection strategy, and comprehensive continuous evaluation system that maintains conversation quality while enabling rapid iteration and improvement.
The challenge: Making property investment knowledge more accessible
The area of property investment presents numerous challenges for both novice and experienced investors. Information asymmetry creates barriers where comprehensive market data remains expensive or inaccessible. Traditional investment processes are manual, time-consuming, and require extensive market knowledge to navigate effectively. For the Spanish and Australian consumers specifically, we needed to build a solution that could provide accurate, contextually relevant property investment advice in Spanish while handling complex, multi-turn conversations about investment strategies. The system needed to maintain high accuracy while delivering responses at scale, continuously learning and improving from customer interactions. Most importantly, it needed to assist users across every phase of their journey, from initial onboarding through to final settlement, ensuring comprehensive support throughout the entire investment process.
Solution overview
We built a complete end-to-end solution using AWS generative AI services, architected around a multi-agent AI advisor with integrated continuous evaluation. The system provides seamless data flow from ingestion through intelligent advisory conversations with real-time quality monitoring. The following diagram illustrates this architecture.

The solution architecture consists of four virtual layers, each serving specific functions in the overall system design.
Data foundation layer
The data foundation provides the storage and retrieval infrastructure for system components:

Amazon DynamoDB – Fast storage for conversation history, evaluation metrics, and user interaction data
Amazon Relational Database Service (Amazon RDS) for PostgreSQL – A PostgreSQL database storing LangFuse observability data, including large language model (LLM) traces and latency metrics
Amazon Simple Storage Service (Amazon S3) – A central data lake storing Spanish FAQ documents, property investment guides, and conversation datasets

Multi-agent AI layer
The AI processing layer encompasses the core intelligence components that power the conversational experience:

Amazon Bedrock – Foundation models (FMs) such as LLMs and rerankers powering specialized agents
Amazon Bedrock Knowledge Bases – Semantic search engine with semantic chunking for FAQ-style content
LangGraph – Orchestration of multi-agent workflows and conversation state management
AWS Lambda – Serverless functions executing multi-agent logic and retrieval of user information for richer context

Continuous evaluation layer
The evaluation infrastructure facilitates continuous quality monitoring and improvement through these components:

Amazon CloudWatch – Real-time monitoring of quality metrics with automated alerting and threshold management
Amazon EventBridge – Real-time event triggers for conversation completion and quality assessment
AWS Lambda – Automated evaluation functions measuring context relevance, response groundedness, and goal accuracy
Amazon QuickSight – Interactive dashboards and analytics for monitoring the respective metrics

Application and integration layer
The integration layer provides secure interfaces for external communication:

Amazon API Gateway – Secure API endpoints for conversational interface and evaluation webhooks

Multi-agent AI advisor architecture
The intelligent advisor uses a multi-agent system orchestrated through LangGraph, which sits in a single Lambda function, where each agent is optimized for specific tasks. The following diagram shows the communication flow among the various agents within the Lambda function.

Agent composition and model selection
Our model selection strategy involved extensive testing to match each component's computational requirements with the most cost-effective Amazon Bedrock model. We evaluated factors including response quality, latency requirements, and cost per token to determine optimal model assignments for each agent type. Each component in the system uses the most appropriate model for its designated function, as outlined in the following table.

| Component | Amazon Bedrock Model | Purpose |
| --- | --- | --- |
| Router Agent | Anthropic Claude 3.5 Haiku | Query classification and routing |
| General Agent | Amazon Nova Lite | Common questions and conversation management |
| Advisor Agent | Amazon Nova Pro | Specialized property investment advice |
| Settlement Agent | Anthropic Claude 3.5 Haiku | Customer support specializing in the pre-settlement phase of investment |
| Response Agent | Amazon Nova Lite | Final response generation and formatting |
| Embedding | Cohere Embed Multilingual v3 | Context retrieval |
| Retriever | Cohere Rerank 3.5 | Context retrieval and ranking |
| Evaluator | Anthropic Claude 3.5 Haiku | Quality assessment and scoring |

End-to-end conversation flow
The conversation processing follows a structured workflow that supports accurate responses while maintaining quality standards (a minimal routing sketch follows the list):

User queries enter through API Gateway and are routed to the router agent.
The router agent determines the appropriate specialized agent based on query analysis.
User information is retrieved at the start for richer context, and knowledge-intensive queries trigger the retriever to access the Amazon Bedrock knowledge base.
Specialized agents process queries with retrieved user information and relevant context from the knowledge base.
The response agent formats and generates the final user-facing response with the appropriate tone.
Parallel evaluation processes assess context relevance, response groundedness, and goal accuracy.
Conversation data is stored in DynamoDB for analysis and improvement.
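
The following is a minimal, illustrative LangGraph routing sketch, not PropHero's production code; the node functions are stubs standing in for the model-backed agents described above:

from typing import TypedDict
from langgraph.graph import StateGraph, END

class AdvisorState(TypedDict):
    query: str
    route: str
    answer: str

def router_node(state: AdvisorState) -> AdvisorState:
    # In the real system an LLM classifies the query; here a simple stub decides.
    q = state["query"].lower()
    route = "settlement" if "settlement" in q else "advisor" if "invest" in q else "general"
    return {**state, "route": route}

def make_agent(name: str):
    def agent(state: AdvisorState) -> AdvisorState:
        # Placeholder for a model-backed specialized agent.
        return {**state, "answer": f"[{name}] draft answer for: {state['query']}"}
    return agent

def response_node(state: AdvisorState) -> AdvisorState:
    # Placeholder for the final formatting/tone step.
    return {**state, "answer": state["answer"] + " (formatted)"}

graph = StateGraph(AdvisorState)
graph.add_node("router", router_node)
for name in ("general", "advisor", "settlement"):
    graph.add_node(name, make_agent(name))
graph.add_node("response", response_node)
graph.set_entry_point("router")
graph.add_conditional_edges(
    "router",
    lambda s: s["route"],
    {"general": "general", "advisor": "advisor", "settlement": "settlement"},
)
for name in ("general", "advisor", "settlement"):
    graph.add_edge(name, "response")
graph.add_edge("response", END)
app = graph.compile()

print(app.invoke({"query": "How should I invest in property?", "route": "", "answer": ""}))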

Amazon Bedrock Knowledge Bases configuration
The knowledge base uses several AWS services for optimal performance:

Data source – S3 bucket
Chunking strategy – Semantic chunking optimized for FAQ-style content
Embedding model – Cohere Embed Multilingual v3 for Spanish understanding
Reranker – Cohere Rerank 3.5 for reranking retrieved Spanish content
Vector database – Amazon OpenSearch Serverless for scalable semantic search

Sample conversation experience
The system handles natural Spanish conversations seamlessly. For example:
Usuario: “Hola, ¿qué es PropHero exactamente?” (User: “Hi, what exactly is PropHero?”)
Asistente: “¡Hola! PropHero es una plataforma que te permite crear, optimizar y crecer tu patrimonio inmobiliario delegando todo el proceso en nuestro equipo de profesionales…” (Assistant: “Hi! PropHero is a platform that lets you build, optimize, and grow your property wealth by delegating the entire process to our team of professionals…”)
Usuario: “¿Y cómo funciona en la práctica?” (User: “And how does it work in practice?”)
Asistente: “Es muy sencillo. Comenzamos con una sesión con tu Property Coach para definir tu estrategia de inversión personalizada…” (Assistant: “It's very simple. We start with a session with your Property Coach to define your personalized investment strategy…”)
Integrated continuous evaluation system
The continuous evaluation system is integrated into the architecture, operating as a core component. This approach supports quality monitoring in real time alongside conversation processing.
Evaluation metrics
We used the following metrics from the Ragas library for evaluation:

Context Relevance (0–1) – Measures the relevance of retrieved context to user queries, evaluating RAG system effectiveness
Response Groundedness (0–1) – Makes sure responses are factually accurate and derived from PropHero’s official information
Agent Goal Accuracy (0–1) – Binary measure of whether responses successfully address user investment goals

Real-time evaluation workflow
The evaluation system operates seamlessly within the conversation architecture:

Amazon DynamoDB Streams triggers – Conversation data written to DynamoDB automatically triggers a Lambda function for evaluation through Amazon DynamoDB Streams
Parallel processing – Lambda functions execute evaluation logic in parallel with response delivery
Multi-dimensional assessment – Each conversation is evaluated across three key dimensions simultaneously
Intelligent scoring with LLM-as-a-judge – Anthropic’s Claude 3.5 Haiku provides consistent evaluation as an LLM judge, offering standardized assessment criteria across conversations.
Monitoring and analytics – CloudWatch captures metrics from the evaluation process, and QuickSight provides dashboards for trend analysis

The following diagram provides an overview of the Lambda function responsible for continuous evaluation.
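
As a complement, the following is a minimal sketch of the metric-publishing step inside that evaluation function; the CloudWatch namespace and metric names are illustrative rather than taken from PropHero's implementation:

import boto3

cloudwatch = boto3.client("cloudwatch")

def publish_evaluation_metrics(scores: dict) -> None:
    """Push per-conversation evaluation scores to CloudWatch for dashboards and alarms."""
    cloudwatch.put_metric_data(
        Namespace="PropHeroAdvisor/Evaluation",  # illustrative namespace
        MetricData=[
            {
                "MetricName": name,     # e.g. ContextRelevance, ResponseGroundedness, AgentGoalAccuracy
                "Value": float(value),  # scores are 0-1; goal accuracy is 0 or 1
                "Unit": "None",
            }
            for name, value in scores.items()
        ],
    )

# Example: publish the three dimensions described above for one conversation
publish_evaluation_metrics(
    {"ContextRelevance": 0.82, "ResponseGroundedness": 0.91, "AgentGoalAccuracy": 1.0}
)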

Implementation insights and best practices
Our development journey involved a 6-week iterative process with PropHero’s technical team. We conducted testing across different model combinations and evaluated chunking strategies using real customer FAQ data. This journey revealed several architectural optimizations that enhanced system performance, achieved significant cost reductions, and improved user experience.
Model selection strategy
Our approach to model selection demonstrates the importance of matching model capabilities to specific tasks. By using Amazon Nova Lite for simpler tasks and Amazon Nova Pro for complex reasoning, the solution achieves optimal cost-performance balance while maintaining high accuracy standards.
Chunking and retrieval optimization
Semantic chunking proved superior to hierarchical and fixed chunking approaches for FAQ-style content. The Cohere Rerank 3.5 model enabled the system to use fewer chunks (10 vs. 20) while maintaining accuracy, reducing latency and cost.
Multilingual capabilities
The system effectively handles Spanish and English queries by using FMs that support Spanish language on Amazon Bedrock.
Business impact
The PropHero AI advisor delivered measurable business value:

Enhanced customer engagement – A 90% goal accuracy rate makes sure customers receive relevant, actionable property investment advice. Over 50% of our users (and over 70% of paid users) are actively using the AI advisor.
Operational efficiency – Automated responses to common questions reduced customer service workload by 30%, freeing staff to focus on complex customer needs.
Scalable growth – The serverless architecture automatically scales to handle increasing customer demand without manual intervention.
Cost optimization – Strategic model selection achieved high performance while reducing AI costs by 60% compared to using premium models throughout.
Consumer base expansion – Successful Spanish language support enabled PropHero’s expansion into the Spanish consumer base with localized expertise.

Conclusion
The PropHero AI advisor demonstrates how AWS generative AI services can be used to create intelligent, context-aware conversational agents that deliver real business value. By combining a modular agent architecture with a robust evaluation system, PropHero has created a solution that enhances customer engagement while providing accurate and relevant responses. The comprehensive evaluation pipeline has been particularly valuable, providing clear metrics for measuring conversation quality and guiding ongoing improvements. This approach makes sure the AI advisor will continue to evolve and improve over time. For more information about building multi-agent AI advisors with continuous evaluation, refer to the following resources:

Retrieve data and generate AI responses with Amazon Bedrock Knowledge Bases – With Amazon Bedrock Knowledge Bases, you can implement semantic search with chunking strategies
LangGraph – LangGraph can help you build multi-agent workflows
Ragas – Ragas offers comprehensive LLM evaluation metrics, including context relevance, groundedness, and goal accuracy used in this implementation

To learn more about the Generative AI Innovation Center, get in touch with your account team.

About the authors
Adithya Suresh is a Deep Learning Architect at the AWS Generative AI Innovation Center based in Sydney, where he collaborates directly with enterprise customers to design and scale transformational generative AI solutions for complex business challenges. He uses AWS generative AI services to build bespoke AI systems that drive measurable business value across diverse industries.
Lucas Dahan was the Head of Data & AI at PropHero at the time of writing. He leads the technology team that is transforming property investment through innovative digital solutions.
Dil Dolkun is a Data & AI Engineer on PropHero's tech team and has been instrumental in designing data architectures and multi-agent workflows for PropHero's generative AI property investment advisor system.
Mathew Ng is a Technical Lead at PropHero who architected and scaled PropHero's cloud-native, high-performance software solution from an early-stage startup to successful Series A funding.
Aaron Su is a Solutions Architect at AWS, with a focus across AI and SaaS startups. He helps early-stage companies architect scalable, secure, and cost-effective cloud solutions.

Accelerate benefits claims processing with Amazon Bedrock Data Automat …

In the benefits administration industry, claims processing is a vital operational pillar that makes sure employees and beneficiaries receive timely benefits, such as health, dental, or disability payments, while controlling costs and adhering to regulations like HIPAA and ERISA. Businesses aim to optimize the workflow—covering claim submission, validation, adjudication, payment, and appeals—to enhance employee satisfaction, strengthen provider relationships, and mitigate financial risks. The process includes specific steps like claim submission (through portals or paper), data validation (verifying eligibility and accuracy), adjudication (assessing coverage against plan rules), payment or denial (including check processing for reimbursements), and appeal handling. Efficient claims processing supports competitive benefits offerings, which is crucial for talent retention and employer branding, but requires balancing speed, accuracy, and cost in a highly regulated environment.
Despite its importance, claims processing faces significant challenges in many organizations. Most notably, the reliance on legacy systems and manual processes results in frustratingly slow resolution times, high error rates, and increased administrative costs. Incomplete or inaccurate claim submissions—such as those with missing diagnosis codes or eligibility mismatches—frequently lead to denials and rework, creating frustration for both employees and healthcare providers. Additionally, fraud, waste, and abuse continue to inflate costs, yet detecting these issues without delaying legitimate claims remains challenging. Complex regulatory requirements demand constant system updates, and poor integration between systems—such as Human Resource Information Systems (HRIS) and other downstream systems—severely limits scalability. These issues drive up operational expenses, erode trust in benefits programs, and overburden customer service teams, particularly during appeals processes or peak claims periods.
Generative AI can help address these challenges. With Amazon Bedrock Data Automation, you can automate the generation of useful insights from unstructured multimodal content such as documents, images, audio, and video. Amazon Bedrock Data Automation can be used in the benefits claims process to automate document processing by extracting and classifying documents from claims packets, policy applications, and supporting documents with industry-leading accuracy, reducing manual errors and accelerating resolution times. Amazon Bedrock Data Automation's natural language processing capabilities interpret unstructured data, such as provider notes, supporting compliance with plan rules and regulations. By automating repetitive tasks and providing insights, Amazon Bedrock Data Automation helps reduce administrative burdens, enhance experiences for both employees and providers, and support compliance in a cost-effective manner. Furthermore, its scalable architecture enables seamless integration with existing systems, improving data flow across HRIS, claims systems, and provider networks, and advanced analytics help detect fraud patterns to optimize cost control.
In this post, we examine the typical benefit claims processing workflow and identify where generative AI-powered automation can deliver the greatest impact.
Benefit claims processing
When an employee or beneficiary pays out of pocket for an expense covered under their health benefits, they submit a claim for reimbursement. This process requires several supporting documents, including doctor’s prescriptions and proof of payment, which might include check images, receipts, or electronic payment confirmations.
The claims processing workflow involves several critical steps:

Document intake and processing – The system receives and categorizes submitted documentation, including:

Medical records and prescriptions
Proof of payment documentation
Supporting forms and eligibility verification

Payment verification processing – For check-based reimbursements, the system must complete the following steps:

Extract information from check images, including the account number and routing number contained in the MICR line
Verify payee and payer names against the information provided during the claim submission process
Confirm payment amounts match the claimed expenses
Flag discrepancies for human review

Adjudication and reimbursement – When verification is complete, the system performs several actions:

Determine eligibility based on plan rules and coverage limits
Calculate appropriate reimbursement amounts
Initiate payment processing through direct deposit or check issuance
Provide notification to the claimant regarding the status of their reimbursement

In this post, we walk through a real-world scenario to make the complexity of this multi-step process clearer. The following example demonstrates how Amazon Bedrock Data Automation can streamline the claims processing workflow, from initial submission to final reimbursement.
Solution overview
Let’s consider a scenario where a benefit plan participant seeks treatment and pays out of pocket for the doctor’s fee using a check. They then buy the medications prescribed by the doctor at the pharmacy store. Later, they log in to their benefit provider’s portal and submit a claim along with the image of the check and payment receipt for the medications.
This solution uses Amazon Bedrock Data Automation to automate the two most critical and time-consuming aspects of this workflow: document intake and payment verification processing. The following diagram illustrates the benefits claims processing architecture.

The end-to-end process works through four integrated stages: ingestion, extraction, validation, and integration.
Ingestion
When a beneficiary uploads supporting documents (check image and pharmacy receipt) through the company’s benefit claims portal, these documents are securely saved in an Amazon Simple Storage Service (Amazon S3) bucket, triggering the automated claims processing pipeline.
Extraction
After documents are ingested, the system immediately begins with intelligent data extraction:

The S3 object upload triggers an AWS Lambda function, which invokes the Amazon Bedrock Data Automation project.
Amazon Bedrock Data Automation uses blueprints for file processing and extraction. Blueprints are artifacts used to configure file processing business logic by specifying a list of field names for data extraction, along with their desired data formats (string, number, or Boolean) and natural language context for data normalization and validation rules. Amazon Bedrock Data Automation provides a catalog of sample blueprints out of the box. You can create a custom blueprint for your unique document types that aren’t predefined in the catalog. This solution uses two blueprints designed for different document types, as shown in the following screenshot:

The catalog blueprint US-Bank-Check for check processing.
The custom blueprint benefit-claims-pharmacy-receipt-blueprint for pharmacy-specific receipts.

US-Bank-Check is a catalog blueprint provided out of the box by Amazon Bedrock Data Automation. The custom blueprint benefit-claims-pharmacy-receipt-blueprint is created using an AWS CloudFormation template to handle pharmacy receipt processing, addressing a specific document type that wasn’t available in the standard blueprint catalog. The benefit administrator wants to look for vendor-specific information such as name, address, and phone details for benefits claims processing. The custom blueprint schema contains natural language explanation of those fields, such as VendorName, VendorAddress, VendorPhone, and additional fields, explaining what the field represents, expected data types, and inference type for each extracted field (explained in Creating Blueprints for Extraction), as shown in the following screenshot.

3. The two blueprints are added to the Amazon Bedrock Data Automation project. An Amazon Bedrock Data Automation project is a grouping of both standard and custom blueprints that you can use to process different types of files (like documents, audio, and images) using specific configuration settings, where you can control what kind of information you want to extract from each file type. When the project is invoked asynchronously, it automatically applies the appropriate blueprint, extracts information such as confidence scores and bounding box details for each field, and saves results in a separate S3 bucket. This intelligent classification alleviates the need for you to write complex document classification logic.
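
The following is a minimal sketch of the triggering Lambda handler for this step; the helper that wraps the asynchronous InvokeDataAutomationAsync call is hypothetical, and the project ARN and output prefix are placeholders:

import json
import urllib.parse

# Hypothetical helper module wrapping the asynchronous Bedrock Data Automation
# project invocation (InvokeDataAutomationAsync); its implementation is omitted.
from bda_helpers import start_data_automation_job  # hypothetical

PROJECT_ARN = "arn:aws:bedrock:...:data-automation-project/PLACEHOLDER"
OUTPUT_PREFIX = "s3://claims-bda-output/results/"  # placeholder bucket

def lambda_handler(event, context):
    """Triggered by the S3 upload of a claim document; starts BDA extraction."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        invocation = start_data_automation_job(
            input_s3_uri=f"s3://{bucket}/{key}",
            output_s3_uri=OUTPUT_PREFIX,
            project_arn=PROJECT_ARN,
        )
        print(json.dumps({"document": key, "invocation": invocation}))
    return {"statusCode": 200}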
The following screenshot illustrates the document classification by the standard catalog blueprint US-Bank-Check.

The following screenshot shows the document classification by the custom blueprint benefit-claims-pharmacy-receipt-blueprint.

Validation
With the data extracted, the system moves to the validation and decision-making process using the business rules specific to each document type.
The business rules are documented in standard operating procedure documents (AnyCompany Benefit Checks Standard Operating procedure.docx and AnyCompany Benefit Claims Standard Operating procedure.docx) and uploaded to an S3 bucket. Then the system creates a knowledge base for Amazon Bedrock with the S3 bucket as the source, as shown in the following screenshot.

When the extracted Amazon Bedrock Data Automation results are saved to the configured S3 bucket, a Lambda function is triggered automatically. Based on the business rules retrieved from the knowledge base for the specific document type and the extracted Amazon Bedrock Data Automation output, an Amazon Nova Lite large language model (LLM) makes the automated approve/deny decision for claims.
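
The following sketch illustrates this validation step with Boto3, assuming a knowledge base ID and an Amazon Nova Lite model or inference profile ID for your Region; both identifiers are placeholders:

import boto3

agent_runtime = boto3.client("bedrock-agent-runtime")
bedrock_runtime = boto3.client("bedrock-runtime")

KB_ID = "XXXXXXXXXX"                # placeholder knowledge base ID
MODEL_ID = "amazon.nova-lite-v1:0"  # adjust to the model/inference profile ID in your Region

def adjudicate(document_type: str, extracted_fields: dict) -> str:
    # 1. Pull the business rules for this document type from the knowledge base.
    rules = agent_runtime.retrieve(
        knowledgeBaseId=KB_ID,
        retrievalQuery={"text": f"Adjudication rules for {document_type}"},
        retrievalConfiguration={"vectorSearchConfiguration": {"numberOfResults": 5}},
    )
    rule_text = "\n".join(r["content"]["text"] for r in rules["retrievalResults"])

    # 2. Ask the model for an approve/deny decision grounded in those rules.
    prompt = (
        f"Business rules:\n{rule_text}\n\n"
        f"Extracted claim data:\n{extracted_fields}\n\n"
        "Decide APPROVE or DENY and explain briefly."
    )
    response = bedrock_runtime.converse(
        modelId=MODEL_ID,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    return response["output"]["message"]["content"][0]["text"]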
The following screenshot shows the benefit claim adjudication automated decision for US-Bank-Check.

The following screenshot shows the benefit claim adjudication automated decision for benefit-claims-pharmacy-receipt-blueprint.

Integration
The system seamlessly integrates with existing business processes.
When validation is complete, an event is pushed to Amazon EventBridge, which triggers a Lambda function for downstream integration. In this implementation, we use an Amazon DynamoDB table and Amazon Simple Notification Service (Amazon SNS) email for downstream integration. A DynamoDB table is created as part of the deployment stack, which is used to populate details including document classification, extracted data, and automated decision. An email notification is sent for both check and receipts after the final decision is made by the system. The following screenshot shows an example email for pharmacy receipt approval.
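
A minimal sketch of that downstream integration might look like the following; the table name, topic ARN, and item shape are illustrative rather than taken from the deployment stack:

import json
import boto3

dynamodb = boto3.resource("dynamodb")
sns = boto3.client("sns")

TABLE_NAME = "ClaimsAdjudicationResults"                  # illustrative table name
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:claims"   # placeholder topic ARN

def record_and_notify(claim_id: str, document_class: str, extracted: dict, decision: str) -> None:
    # Persist the classification, extracted fields, and automated decision.
    dynamodb.Table(TABLE_NAME).put_item(
        Item={
            "claim_id": claim_id,
            "document_class": document_class,
            "extracted_data": json.dumps(extracted),  # store extracted fields as a JSON string
            "decision": decision,
        }
    )
    # Notify the claimant or administrator once the final decision is made.
    sns.publish(
        TopicArn=TOPIC_ARN,
        Subject=f"Benefit claim {claim_id}: {decision}",
        Message=f"Document type {document_class} was processed with decision: {decision}.",
    )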

This flexible architecture helps you integrate with your existing applications through internal APIs or events to update claim status or trigger additional workflows when validation fails.
Reducing manual effort through intelligent business rules management
Beyond automating document processing, this solution addresses a common operational challenge: Traditionally, customers must write and maintain code for handling business rules around claims adjudication and processing. Every business rule change requires development effort and code updates, slowing time-to-market and increasing maintenance overhead.
Our approach converts business rules and standard operating procedures (SOPs) into knowledge bases using Amazon Bedrock Knowledge Bases, which you can use for automated decision-making. This approach can dramatically reduce time-to-market when business rules change, because updates can be made through knowledge management rather than code deployment.
In the following sections, we walk you through the steps to deploy the solution to your own AWS account.
Prerequisites
To implement the solution provided in this post, you must have the following:

An AWS account
Access to Amazon Titan Text Embeddings V2 and Amazon Nova Lite foundation models (FMs) enabled in Amazon Bedrock

This solution uses Python 3.13 with Boto3 1.38 or later, and the AWS Serverless Application Model Command Line Interface (AWS SAM CLI) version 1.138.0. We assume that you have already installed these on your local machine. If not, refer to the following instructions:

Python 3.13 installation
Install the AWS SAM CLI

Set up code in your local machine
To set up the code, clone the GitHub repository. After you have cloned the repository to your local machine, the project folder structure will match the layout described in the README file.

Deploy the solution in your account
The sample code comes with a CloudFormation template that creates necessary resources. To deploy the solution in your account, follow the deployment instructions in the README file.
Clean up
Deploying this solution in your account will incur costs. Follow the cleanup instructions in the README file to avoid charges when you are done.
Conclusion
Benefits administration companies can significantly enhance their operations by automating claims processing using the solution outlined in this post. This strategic approach directly addresses the industry’s core challenges and can deliver several key advantages:

Enhanced processing efficiency through accelerated claims resolution times, reduced manual error rates, and higher straight-through processing rates that minimize the frustrating delays and manual rework plaguing legacy systems
Streamlined document integration and fraud detection capabilities, where adding new supporting documents becomes seamless through new Amazon Bedrock Data Automation blueprints, while AI-powered analytics identify suspicious patterns without delaying legitimate claims, avoiding traditional months-long development cycles and reducing costly fraud, waste, and abuse
Agile business rule management that enables rapid adaptation to changing HIPAA and ERISA requirements and modification of business rules, significantly reducing administrative costs and time-to-market while improving scalability and integration with existing HRIS and claims systems, ultimately enhancing employee satisfaction, strengthening provider relationships, and supporting competitive benefits offerings that are crucial for talent retention and employer branding

To get started with this solution, refer to the GitHub repo. For more information about Amazon Bedrock Data Automation, refer to Transform unstructured data into meaningful insights using Amazon Bedrock Data Automation and try the Document Processing Using Amazon Bedrock Data Automation workshop.

About the authors
Saurabh Kumar is a Senior Solutions Architect at AWS based out of Raleigh, NC, with expertise in Resilience Engineering, Chaos Engineering, and Generative AI solutions. He advises customers on fault-tolerance strategies and generative AI-driven modernization approaches, helping organizations build robust architectures while leveraging generative AI technologies to drive innovation.
Kiran Lakkireddy is a Principal Solutions Architect at AWS with expertise in Financial Services, Benefits Management and HR Services industries. Kiran provides technology and architecture guidance to customers in their business transformation, with a specialized focus on GenAI security, compliance, and governance. He regularly speaks to customer security leadership on GenAI security, compliance, and governance topics, helping organizations navigate the complex landscape of AI implementation while maintaining robust security standards.
Tamilmanam Sambasivam is a Solutions Architect and AI/ML Specialist at AWS. She helps enterprise customers solve their business problems by recommending the right AWS solutions. Her strong background in information technology (24+ years of experience) helps customers strategize, develop, and modernize their solutions in the AWS Cloud. In her spare time, Tamil likes to travel and garden.

How to Master Advanced TorchVision v2 Transforms, MixUp, CutMix, and M …

In this tutorial, we explore advanced computer vision techniques using TorchVision’s v2 transforms, modern augmentation strategies, and powerful training enhancements. We walk through the process of building an augmentation pipeline, applying MixUp and CutMix, designing a modern CNN with attention, and implementing a robust training loop. By running everything seamlessly in Google Colab, we position ourselves to understand and apply state-of-the-art practices in deep learning with clarity and efficiency. Check out the FULL CODES here.

!pip install torch torchvision torchaudio --quiet
!pip install matplotlib pillow numpy --quiet

import torch
import torchvision
from torchvision import transforms as T
from torchvision.transforms import v2
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image
import requests
from io import BytesIO

print(f"PyTorch version: {torch.__version__}")
print(f"TorchVision version: {torchvision.__version__}")

We begin by installing the libraries and importing all the essential modules for our workflow. We set up PyTorch, TorchVision v2 transforms, and supporting tools like NumPy, PIL, and Matplotlib, so we are ready to build and test advanced computer vision pipelines. Check out the FULL CODES here.

class AdvancedAugmentationPipeline:
    def __init__(self, image_size=224, training=True):
        self.image_size = image_size
        self.training = training
        base_transforms = [
            v2.ToImage(),
            v2.ToDtype(torch.uint8, scale=True),
        ]
        if training:
            self.transform = v2.Compose([
                *base_transforms,
                v2.Resize((image_size + 32, image_size + 32)),
                v2.RandomResizedCrop(image_size, scale=(0.8, 1.0), ratio=(0.9, 1.1)),
                v2.RandomHorizontalFlip(p=0.5),
                v2.RandomRotation(degrees=15),
                v2.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4, hue=0.1),
                v2.RandomGrayscale(p=0.1),
                v2.GaussianBlur(kernel_size=3, sigma=(0.1, 2.0)),
                v2.RandomPerspective(distortion_scale=0.1, p=0.3),
                v2.RandomAffine(degrees=10, translate=(0.1, 0.1), scale=(0.9, 1.1)),
                v2.ToDtype(torch.float32, scale=True),
                v2.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
            ])
        else:
            self.transform = v2.Compose([
                *base_transforms,
                v2.Resize((image_size, image_size)),
                v2.ToDtype(torch.float32, scale=True),
                v2.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
            ])

    def __call__(self, image):
        return self.transform(image)

We define an advanced augmentation pipeline that adapts to both training and validation modes. We apply powerful TorchVision v2 transforms, such as cropping, flipping, color jittering, blurring, perspective, and affine transformations, during training, while keeping validation preprocessing simple with resizing and normalization. This way, we ensure that we enrich the training data for better generalization while maintaining consistent and stable evaluation. Check out the FULL CODES here.

class AdvancedMixupCutmix:
    def __init__(self, mixup_alpha=1.0, cutmix_alpha=1.0, prob=0.5):
        self.mixup_alpha = mixup_alpha
        self.cutmix_alpha = cutmix_alpha
        self.prob = prob

    def mixup(self, x, y):
        batch_size = x.size(0)
        lam = np.random.beta(self.mixup_alpha, self.mixup_alpha) if self.mixup_alpha > 0 else 1
        index = torch.randperm(batch_size)
        mixed_x = lam * x + (1 - lam) * x[index, :]
        y_a, y_b = y, y[index]
        return mixed_x, y_a, y_b, lam

    def cutmix(self, x, y):
        batch_size = x.size(0)
        lam = np.random.beta(self.cutmix_alpha, self.cutmix_alpha) if self.cutmix_alpha > 0 else 1
        index = torch.randperm(batch_size)
        y_a, y_b = y, y[index]
        bbx1, bby1, bbx2, bby2 = self._rand_bbox(x.size(), lam)
        x[:, :, bbx1:bbx2, bby1:bby2] = x[index, :, bbx1:bbx2, bby1:bby2]
        lam = 1 - ((bbx2 - bbx1) * (bby2 - bby1) / (x.size()[-1] * x.size()[-2]))
        return x, y_a, y_b, lam

    def _rand_bbox(self, size, lam):
        W = size[2]
        H = size[3]
        cut_rat = np.sqrt(1. - lam)
        cut_w = int(W * cut_rat)
        cut_h = int(H * cut_rat)
        cx = np.random.randint(W)
        cy = np.random.randint(H)
        bbx1 = np.clip(cx - cut_w // 2, 0, W)
        bby1 = np.clip(cy - cut_h // 2, 0, H)
        bbx2 = np.clip(cx + cut_w // 2, 0, W)
        bby2 = np.clip(cy + cut_h // 2, 0, H)
        return bbx1, bby1, bbx2, bby2

    def __call__(self, x, y):
        if np.random.random() > self.prob:
            return x, y, y, 1.0
        if np.random.random() < 0.5:
            return self.mixup(x, y)
        else:
            return self.cutmix(x, y)

class ModernCNN(nn.Module):
    def __init__(self, num_classes=10, dropout=0.3):
        super(ModernCNN, self).__init__()
        self.conv1 = self._conv_block(3, 64)
        self.conv2 = self._conv_block(64, 128, downsample=True)
        self.conv3 = self._conv_block(128, 256, downsample=True)
        self.conv4 = self._conv_block(256, 512, downsample=True)
        self.gap = nn.AdaptiveAvgPool2d(1)
        self.attention = nn.Sequential(
            nn.Linear(512, 256),
            nn.ReLU(),
            nn.Linear(256, 512),
            nn.Sigmoid()
        )
        self.classifier = nn.Sequential(
            nn.Dropout(dropout),
            nn.Linear(512, 256),
            nn.BatchNorm1d(256),
            nn.ReLU(),
            nn.Dropout(dropout / 2),
            nn.Linear(256, num_classes)
        )

    def _conv_block(self, in_channels, out_channels, downsample=False):
        stride = 2 if downsample else 1
        return nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 3, stride=stride, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, 3, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True)
        )

    def forward(self, x):
        x = self.conv1(x)
        x = self.conv2(x)
        x = self.conv3(x)
        x = self.conv4(x)
        x = self.gap(x)
        x = torch.flatten(x, 1)
        attention_weights = self.attention(x)
        x = x * attention_weights
        return self.classifier(x)

We strengthen our training with a unified MixUp/CutMix module, where we stochastically blend images or patch-swap regions and compute label interpolation with the exact pixel ratio. We pair this with a modern CNN that stacks progressive conv blocks, applies global average pooling, and uses a learned attention gate before a dropout-regularized classifier, so we improve generalization while keeping inference straightforward. Check out the FULL CODES here.

class AdvancedTrainer:
    def __init__(self, model, device='cuda' if torch.cuda.is_available() else 'cpu'):
        self.model = model.to(device)
        self.device = device
        self.mixup_cutmix = AdvancedMixupCutmix()
        self.optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
        self.scheduler = optim.lr_scheduler.OneCycleLR(
            self.optimizer, max_lr=1e-2, epochs=10, steps_per_epoch=100
        )
        self.criterion = nn.CrossEntropyLoss()

    def mixup_criterion(self, pred, y_a, y_b, lam):
        return lam * self.criterion(pred, y_a) + (1 - lam) * self.criterion(pred, y_b)

    def train_epoch(self, dataloader):
        self.model.train()
        total_loss = 0
        correct = 0
        total = 0
        for batch_idx, (data, target) in enumerate(dataloader):
            data, target = data.to(self.device), target.to(self.device)
            data, target_a, target_b, lam = self.mixup_cutmix(data, target)
            self.optimizer.zero_grad()
            output = self.model(data)
            if lam != 1.0:
                loss = self.mixup_criterion(output, target_a, target_b, lam)
            else:
                loss = self.criterion(output, target)
            loss.backward()
            torch.nn.utils.clip_grad_norm_(self.model.parameters(), max_norm=1.0)
            self.optimizer.step()
            self.scheduler.step()
            total_loss += loss.item()
            _, predicted = output.max(1)
            total += target.size(0)
            if lam != 1.0:
                correct += (lam * predicted.eq(target_a).sum().item() +
                            (1 - lam) * predicted.eq(target_b).sum().item())
            else:
                correct += predicted.eq(target).sum().item()
        return total_loss / len(dataloader), 100. * correct / total

We orchestrate training with AdamW, OneCycleLR, and dynamic MixUp/CutMix so we stabilize optimization and boost generalization. We compute an interpolated loss when mixing, clip gradients for safety, and step the scheduler each batch, so we track loss/accuracy per epoch in a single tight loop. Check out the FULL CODES here.

def demo_advanced_techniques():
    batch_size = 16
    num_classes = 10
    sample_data = torch.randn(batch_size, 3, 224, 224)
    sample_labels = torch.randint(0, num_classes, (batch_size,))
    transform_pipeline = AdvancedAugmentationPipeline(training=True)
    model = ModernCNN(num_classes=num_classes)
    trainer = AdvancedTrainer(model)
    print("Advanced Deep Learning Tutorial Demo")
    print("=" * 50)
    print("\n1. Advanced Augmentation Pipeline:")
    augmented = transform_pipeline(Image.fromarray((sample_data[0].permute(1, 2, 0).numpy() * 255).astype(np.uint8)))
    print(f"   Original shape: {sample_data[0].shape}")
    print(f"   Augmented shape: {augmented.shape}")
    print(f"   Applied transforms: Resize, Crop, Flip, ColorJitter, Blur, Perspective, etc.")
    print("\n2. MixUp/CutMix Augmentation:")
    mixup_cutmix = AdvancedMixupCutmix()
    mixed_data, target_a, target_b, lam = mixup_cutmix(sample_data, sample_labels)
    print(f"   Mixed batch shape: {mixed_data.shape}")
    print(f"   Lambda value: {lam:.3f}")
    print(f"   Technique: {'MixUp' if lam > 0.7 else 'CutMix'}")
    print("\n3. Modern CNN Architecture:")
    model.eval()
    with torch.no_grad():
        output = model(sample_data)
    print(f"   Input shape: {sample_data.shape}")
    print(f"   Output shape: {output.shape}")
    print(f"   Features: Conv blocks, Attention, Global Average Pooling")
    print(f"   Parameters: {sum(p.numel() for p in model.parameters()):,}")
    print("\n4. Advanced Training Simulation:")
    dummy_loader = [(sample_data, sample_labels)]
    loss, acc = trainer.train_epoch(dummy_loader)
    print(f"   Training loss: {loss:.4f}")
    print(f"   Training accuracy: {acc:.2f}%")
    print(f"   Learning rate: {trainer.scheduler.get_last_lr()[0]:.6f}")
    print("\nTutorial completed successfully!")
    print("This code demonstrates state-of-the-art techniques in deep learning:")
    print("• Advanced data augmentation with TorchVision v2")
    print("• MixUp and CutMix for better generalization")
    print("• Modern CNN architecture with attention")
    print("• Advanced training loop with OneCycleLR")
    print("• Gradient clipping and weight decay")

if __name__ == "__main__":
    demo_advanced_techniques()

We run a compact end-to-end demo where we visualize our augmentation pipeline, apply MixUp/CutMix, and double-check the ModernCNN with a forward pass. We then simulate one training epoch on dummy data to verify loss, accuracy, and learning-rate scheduling, so we confirm the full stack works before scaling to a real dataset.

In conclusion, we have successfully developed and tested a comprehensive workflow that integrates advanced augmentations, innovative CNN design, and modern training strategies. By experimenting with TorchVision v2, MixUp, CutMix, attention mechanisms, and OneCycleLR, we not only strengthen model performance but also deepen our understanding of cutting-edge techniques.

Check out the FULL CODES here. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.

The post How to Master Advanced TorchVision v2 Transforms, MixUp, CutMix, and Modern CNN Training for State-of-the-Art Computer Vision? appeared first on MarkTechPost.

Vision-RAG vs Text-RAG: A Technical Comparison for Enterprise Search

Most RAG failures originate at retrieval, not generation. Text-first pipelines lose layout semantics, table structure, and figure grounding during PDF→text conversion, degrading recall and precision before an LLM ever runs. Vision-RAG—retrieving rendered pages with vision-language embeddings—directly targets this bottleneck and shows material end-to-end gains on visually rich corpora.

Pipelines (and where they fail)

Text-RAG. PDF → (parser/OCR) → text chunks → text embeddings → ANN index → retrieve → LLM. Typical failure modes: OCR noise, multi-column flow breakage, table cell structure loss, and missing figure/chart semantics—documented by table- and doc-VQA benchmarks created to measure exactly these gaps.

Vision-RAG. PDF → page raster(s) → VLM embeddings (often multi-vector with late-interaction scoring) → ANN index → retrieve → VLM/LLM consumes high-fidelity crops or full pages. This preserves layout and figure-text grounding; recent systems (ColPali, VisRAG, VDocRAG) validate the approach.

What current evidence supports

Document-image retrieval works and is simpler. ColPali embeds page images and uses late-interaction matching; on the ViDoRe benchmark it outperforms modern text pipelines while remaining end-to-end trainable.

End-to-end lift is measurable. VisRAG reports 25–39% end-to-end improvement over text-RAG on multimodal documents when both retrieval and generation use a VLM.

Unified image format for real-world docs. VDocRAG shows that keeping documents in a unified image format (tables, charts, PPT/PDF) avoids parser loss and improves generalization; it also introduces OpenDocVQA for evaluation.

Resolution drives reasoning quality. High-resolution support in VLMs (e.g., Qwen2-VL/Qwen2.5-VL) is explicitly tied to SoTA results on DocVQA/MathVista/MTVQA; fidelity matters for ticks, superscripts, stamps, and small fonts.

Costs: vision context is (often) order-of-magnitude heavier—because of tokens

Vision inputs inflate token counts via tiling, not necessarily per-token price. For GPT-4o-class models, total tokens ≈ base + (tile_tokens × tiles), so 1–2 MP pages can be ~10× cost of a small text chunk. Anthropic recommends ~1.15 MP caps (~1.6k tokens) for responsiveness. By contrast, Google Gemini 2.5 Flash-Lite prices text/image/video at the same per-token rate, but large images still consume many more tokens. Engineering implication: adopt selective fidelity (crop > downsample > full page).
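
As a rough illustration of that tiling math, the following sketch estimates image tokens for a rendered page; the ~85 base and ~170 per-tile constants are commonly cited GPT-4o-class values and should be treated as assumptions, since providers rescale images before tiling:

import math

def estimate_image_tokens(width_px: int, height_px: int,
                          base_tokens: int = 85, tokens_per_tile: int = 170,
                          tile_px: int = 512) -> int:
    # total tokens ~= base + tokens_per_tile * number_of_512px_tiles
    tiles = math.ceil(width_px / tile_px) * math.ceil(height_px / tile_px)
    return base_tokens + tokens_per_tile * tiles

# A ~1.2 MP page render: 3 x 3 = 9 tiles -> 85 + 170 * 9 = 1615 tokens,
# roughly 10x a ~160-token text chunk.
print(estimate_image_tokens(1100, 1100))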

Design rules for production Vision-RAG

Align modalities across embeddings. Use encoders trained for text–image alignment (CLIP-family or VLM retrievers) and, in practice, dual-index: cheap text recall for coverage + vision rerank for precision. ColPali’s late-interaction (MaxSim-style) is a strong default for page images.
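
For intuition, late-interaction (MaxSim-style) scoring can be sketched in a few lines: each query-token embedding takes its maximum similarity over all page-patch embeddings, and those maxima are summed. This is an illustrative NumPy sketch with random vectors, not ColPali's actual implementation.

# Minimal late-interaction (MaxSim-style) scoring sketch with random embeddings.
import numpy as np

def maxsim_score(query_vecs, page_vecs):
    # query_vecs: (n_query_tokens, dim); page_vecs: (n_page_patches, dim); both L2-normalized.
    sims = query_vecs @ page_vecs.T          # token-to-patch cosine similarities
    return sims.max(axis=1).sum()            # best patch per query token, summed

rng = np.random.default_rng(0)
def normed(shape):
    v = rng.normal(size=shape)
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

query = normed((12, 128))                        # 12 query-token embeddings
pages = [normed((1024, 128)) for _ in range(3)]  # 3 candidate pages, 1024 patch embeddings each

ranked = sorted(range(len(pages)), key=lambda i: maxsim_score(query, pages[i]), reverse=True)
print("pages ranked by MaxSim:", ranked)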

Feed high-fidelity inputs selectively. Coarse-to-fine: run BM25/DPR, take top-k pages to a vision reranker, then send only ROI crops (tables, charts, stamps) to the generator. This preserves crucial pixels without exploding tokens under tile-based accounting.
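
A coarse-to-fine retrieval loop might look like the following skeleton. The helper functions (bm25_search, vision_rerank, extract_roi_crops, generate_answer) are hypothetical stand-ins for whatever lexical index, page-image reranker, layout detector, and VLM you deploy; only the control flow is the point.

# Coarse-to-fine skeleton: cheap text recall -> vision rerank -> ROI crops to the generator.
# All helpers below are placeholders for your own components.

def bm25_search(query, k=100):            # lexical recall over parsed text (cheap, high coverage)
    return [{"doc": "doc-1", "page": i} for i in range(k)]

def vision_rerank(query, pages, k=5):     # page-image reranker (e.g., a late-interaction VLM retriever)
    return pages[:k]

def extract_roi_crops(page):              # crop tables/charts/stamps instead of sending the full page
    return [{"page": page, "bbox": (0, 0, 512, 512)}]

def generate_answer(query, crops):        # VLM/LLM call on high-fidelity crops only
    return f"answer to {query!r} grounded in {len(crops)} crops"

def answer(query):
    candidates = bm25_search(query, k=100)        # stage 1: coverage
    top_pages = vision_rerank(query, candidates)  # stage 2: precision
    crops = [c for p in top_pages for c in extract_roi_crops(p)]
    return generate_answer(query, crops)          # stage 3: bounded token spend

print(answer("What were Q3 warranty accruals?"))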

Engineer for real documents.

Tables: if you must parse, use table-structure models (e.g., PubTables-1M/TATR); otherwise prefer image-native retrieval.

Charts/diagrams: expect tick- and legend-level cues; resolution must retain these. Evaluate on chart-focused VQA sets.

Whiteboards/rotations/multilingual: page rendering avoids many OCR failure modes; multilingual scripts and rotated scans survive the pipeline.

Provenance: store page hashes and crop coordinates alongside embeddings to reproduce the exact visual evidence used in answers.

Text-RAG vs. Vision-RAG at a glance:

Ingest pipeline – Text-RAG: PDF → parser/OCR → text chunks → text embeddings → ANN index. Vision-RAG: PDF → page render(s) → VLM page/crop embeddings (often multi-vector, late interaction) → ANN index; ColPali is a canonical implementation.

Primary failure modes – Text-RAG: parser drift, OCR noise, multi-column flow breakage, table structure loss, missing figure/chart semantics; benchmarks exist because these errors are common. Vision-RAG: preserves layout/figures; failures shift to resolution/tiling choices and cross-modal alignment; VDocRAG formalizes “unified image” processing to avoid parsing loss.

Retriever representation – Text-RAG: single-vector text embeddings; rerank via lexical or cross-encoders. Vision-RAG: page-image embeddings with late interaction (MaxSim-style) capture local regions and improve page-level retrieval on ViDoRe.

End-to-end gains (vs Text-RAG) – Text-RAG: baseline. Vision-RAG: +25–39% end-to-end on multimodal docs when both retrieval and generation are VLM-based (VisRAG).

Where it excels – Text-RAG: clean, text-dominant corpora; low latency/cost. Vision-RAG: visually rich/structured docs (tables, charts, stamps, rotated scans, multilingual typography); unified page context helps QA.

Resolution sensitivity – Text-RAG: not applicable beyond OCR settings. Vision-RAG: reasoning quality tracks input fidelity (ticks, small fonts); high-res document VLMs (e.g., Qwen2-VL family) emphasize this.

Cost model (inputs) – Text-RAG: tokens ≈ characters; cheap retrieval contexts. Vision-RAG: image tokens grow with tiling (e.g., OpenAI base+tiles formula; Anthropic guidance ~1.15 MP ≈ ~1.6k tokens); even when per-token price is equal (Gemini 2.5 Flash-Lite), high-res pages consume far more tokens.

Cross-modal alignment need – Text-RAG: not required. Vision-RAG: critical; text–image encoders must share geometry for mixed queries, and ColPali/ViDoRe demonstrate effective page-image retrieval aligned to language tasks.

Benchmarks to track – Text-RAG: DocVQA (doc QA) and PubTables-1M (table structure) for parsing-loss diagnostics. Vision-RAG: ViDoRe (page retrieval), VisRAG (pipeline), VDocRAG (unified-image RAG).

Evaluation approach – Text-RAG: IR metrics plus text QA; may miss figure-text grounding issues. Vision-RAG: joint retrieval+generation on visually rich suites (e.g., OpenDocVQA under VDocRAG) to capture crop relevance and layout grounding.

Operational pattern – Text-RAG: one-stage retrieval; cheap to scale. Vision-RAG: coarse-to-fine (text recall → vision rerank → ROI crops to generator); keeps token costs bounded while preserving fidelity, with tiling math/pricing informing budgets.

When to prefer – Text-RAG: contracts/templates, code/wikis, normalized tabular data (CSV/Parquet). Vision-RAG: real-world enterprise docs with heavy layout/graphics; compliance workflows needing pixel-exact provenance (page hash + crop coordinates).

Representative systems – Text-RAG: DPR/BM25 + cross-encoder rerank. Vision-RAG: ColPali (ICLR’25) vision retriever; VisRAG pipeline; VDocRAG unified image framework.

When is Text-RAG still the right default?

Clean, text-dominant corpora (contracts with fixed templates, wikis, code)

Strict latency/cost constraints for short answers

Data already normalized (CSV/Parquet)—skip pixels and query the table store

Evaluation: measure retrieval + generation jointly

Add multimodal RAG benchmarks to your harness—e.g., M²RAG (multi-modal QA, captioning, fact-verification, reranking), REAL-MM-RAG (real-world multi-modal retrieval), and RAG-Check (relevance + correctness metrics for multi-modal context). These catch failure cases (irrelevant crops, figure-text mismatch) that text-only metrics miss.

Summary

Text-RAG remains efficient for clean, text-only data. Vision-RAG is the practical default for enterprise documents with layout, tables, charts, stamps, scans, and multilingual typography. Teams that (1) align modalities, (2) deliver selective high-fidelity visual evidence, and (3) evaluate with multimodal benchmarks consistently get higher retrieval precision and better downstream answers—now backed by ColPali (ICLR 2025), VisRAG’s 25–39% E2E lift, and VDocRAG’s unified image-format results.

References:

https://arxiv.org/abs/2407.01449

https://github.com/illuin-tech/vidore-benchmark

https://huggingface.co/vidore

https://arxiv.org/abs/2410.10594

https://github.com/OpenBMB/VisRAG

https://huggingface.co/openbmb/VisRAG-Ret

https://arxiv.org/abs/2504.09795

https://openaccess.thecvf.com/content/CVPR2025/papers/Tanaka_VDocRAG_Retrieval-Augmented_Generation_over_Visually-Rich_Documents_CVPR_2025_paper.pdf

https://cvpr.thecvf.com/virtual/2025/poster/34926

https://vdocrag.github.io/

https://arxiv.org/abs/2110.00061

https://openaccess.thecvf.com/content/CVPR2022/papers/Smock_PubTables-1M_Towards_Comprehensive_Table_Extraction_From_Unstructured_Documents_CVPR_2022_paper.pdf

https://huggingface.co/datasets/bsmock/pubtables-1m

https://arxiv.org/abs/2007.00398

https://www.docvqa.org/datasets

https://qwenlm.github.io/blog/qwen2-vl/

https://arxiv.org/html/2409.12191v1

https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct

https://arxiv.org/abs/2203.10244

https://arxiv.org/abs/2504.05506

https://aclanthology.org/2025.findings-acl.978.pdf

https://arxiv.org/pdf/2504.05506

https://openai.com/api/pricing/

https://docs.claude.com/en/docs/build-with-claude/vision

https://docs.claude.com/en/docs/build-with-claude/token-counting

https://ai.google.dev/gemini-api/docs/pricing

https://arxiv.org/abs/2502.17297

https://openreview.net/forum?id=1oCZoWvb8i

https://github.com/NEUIR/M2RAG

https://arxiv.org/abs/2502.12342

https://aclanthology.org/2025.acl-long.1528/

https://aclanthology.org/2025.acl-long.1528.pdf

https://huggingface.co/collections/ibm-research/real-mm-rag-bench-67d2dc0ddf2dfafe66f09d34

https://research.ibm.com/publications/real-mm-rag-a-real-world-multi-modal-retrieval-benchmark

https://arxiv.org/abs/2501.03995

https://platform.openai.com/docs/guides/images-vision

The post Vision-RAG vs Text-RAG: A Technical Comparison for Enterprise Search appeared first on MarkTechPost.

Alibaba’s Qwen3-Max: Production-Ready Thinking Mode, 1T+ Parameters, and Day-One Coding/Agentic Bench Signals

Alibaba has released Qwen3-Max, a trillion-parameter Mixture-of-Experts (MoE) model positioned as its most capable foundation model to date, with an immediate public on-ramp via Qwen Chat and Alibaba Cloud’s Model Studio API. The launch moves Qwen’s 2025 cadence from preview to production and centers on two variants: Qwen3-Max-Instruct for standard reasoning/coding tasks and Qwen3-Max-Thinking for tool-augmented “agentic” workflows.

What’s new at the model level?

Scale & architecture: Qwen3-Max crosses the 1-trillion-parameter mark with an MoE design (sparse activation per token). Alibaba positions the model as its largest and most capable to date; public briefings and coverage consistently describe it as a 1T-parameter class system rather than another mid-scale refresh.

Training/runtime posture: Qwen3-Max uses a sparse Mixture-of-Experts design and was pretrained on ~36T tokens (~2× Qwen2.5). The corpus skews toward multilingual, coding, and STEM/reasoning data. Post-training follows Qwen3’s four-stage recipe: long CoT cold-start → reasoning-focused RL → thinking/non-thinking fusion → general-domain RL. Alibaba confirms >1T parameters for Max; treat token counts/routing as team-reported until a formal Max tech report is published.

Access: Qwen Chat showcases the general-purpose UX, while Model Studio exposes inference and “thinking mode” toggles (notably, incremental_output=true is required for Qwen3 thinking models). Model listings and pricing sit under Model Studio with regioned availability.

Benchmarks: coding, agentic control, math

Coding (SWE-Bench Verified). Qwen3-Max-Instruct is reported at 69.6 on SWE-Bench Verified. That places it above some non-thinking baselines (e.g., DeepSeek V3.1 non-thinking) and slightly below Claude Opus 4 non-thinking in at least one roundup. Treat these as point-in-time numbers; SWE-Bench evaluations move quickly with harness updates.

Agentic tool use (Tau2-Bench). Qwen3-Max posts 74.8 on Tau2-Bench—an agent/tool-calling evaluation—beating named peers in the same report. Tau2 is designed to test decision-making and tool routing, not just text accuracy, so gains here are meaningful for workflow automation.

Math & advanced reasoning (AIME25, etc.). The Qwen3-Max-Thinking track (with tool use and a “heavy” runtime configuration) is described as near-perfect on key math benchmarks (e.g., AIME25) in multiple secondary sources and earlier preview coverage. Until an official technical report drops, treat “100%” claims as vendor-reported or community-replicated, not peer-reviewed.

https://qwen.ai/

Why two tracks—Instruct vs. Thinking?

Instruct targets conventional chat/coding/reasoning with tight latency, while Thinking enables longer deliberation traces and explicit tool calls (retrieval, code execution, browsing, evaluators), aimed at higher-reliability “agent” use cases. Critically, Alibaba’s API docs formalize the runtime switch: Qwen3 thinking models only operate with streaming incremental output enabled; commercial defaults are false, so callers must explicitly set it. This is a small but consequential contract detail if you’re instrumenting tools or chain-of-thought-like rollouts.
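
As an illustration of that contract, a streaming call through the DashScope Python SDK might look like the sketch below. The model identifier and response-field access are assumptions for illustration; confirm parameter names against the current Model Studio documentation.

# Illustrative streaming call with incremental output enabled (DashScope Python SDK).
# The model identifier below is hypothetical; check Model Studio for the exact name.
import os
import dashscope

dashscope.api_key = os.environ["DASHSCOPE_API_KEY"]

responses = dashscope.Generation.call(
    model="qwen3-max",                    # hypothetical identifier for illustration
    messages=[{"role": "user", "content": "Plan a three-step refactor of a flaky test suite."}],
    result_format="message",
    stream=True,                          # thinking models require streaming output...
    incremental_output=True,              # ...with incremental output explicitly enabled
)

for chunk in responses:
    # With incremental_output=True, each chunk carries only the newly generated text.
    print(chunk.output.choices[0].message.content, end="", flush=True)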

How to reason about the gains (signal vs. noise)?

Coding: A 60–70 SWE-Bench Verified score range typically reflects non-trivial repository-level reasoning and patch synthesis under evaluation harness constraints (e.g., environment setup, flaky tests). If your workloads hinge on repo-scale code changes, these deltas matter more than single-file coding toys.

Agentic: Tau2-Bench emphasizes multi-tool planning and action selection. Improvements here usually translate into fewer brittle hand-crafted policies in production agents, provided your tool APIs and execution sandboxes are robust.

Math/verification: “Near-perfect” math numbers from heavy/thinky modes underscore the value of extended deliberation plus tools (calculators, validators). Portability of those gains to open-ended tasks depends on your evaluator design and guardrails.

Summary

Qwen3-Max is not a teaser—it’s a deployable 1T-parameter MoE with documented thinking-mode semantics and reproducible access paths (Qwen Chat, Model Studio). Treat day-one benchmark wins as directionally strong but continue local evals; the hard, verifiable facts are scale (≈36T tokens, >1T params) and the API contract for tool-augmented runs (incremental_output=true). For teams building coding and agentic systems, this is ready for hands-on trials and internal gating against SWE-/Tau2-style suites.

Check out the Technical details, API and Qwen Chat.

The post Alibaba’s Qwen3-Max: Production-Ready Thinking Mode, 1T+ Parameters, and Day-One Coding/Agentic Bench Signals appeared first on MarkTechPost.

Coding Implementation to End-to-End Transformer Model Optimization with Hugging Face Optimum, ONNX Runtime, and Quantization

In this tutorial, we walk through how we use Hugging Face Optimum to optimize Transformer models and make them faster while maintaining accuracy. We begin by setting up DistilBERT on the SST-2 dataset, and then we compare different execution engines, including plain PyTorch and torch.compile, ONNX Runtime, and quantized ONNX. By doing this step by step, we get hands-on experience with model export, optimization, quantization, and benchmarking, all inside a Google Colab environment. Check out the FULL CODES here.

!pip -q install "transformers>=4.49" "optimum[onnxruntime]>=1.20.0" "datasets>=2.20" "evaluate>=0.4" accelerate

from pathlib import Path
import os, time, numpy as np, torch
from datasets import load_dataset
import evaluate
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
from optimum.onnxruntime import ORTModelForSequenceClassification, ORTQuantizer
from optimum.onnxruntime.configuration import QuantizationConfig

os.environ.setdefault("OMP_NUM_THREADS", "1")
os.environ.setdefault("MKL_NUM_THREADS", "1")

MODEL_ID = "distilbert-base-uncased-finetuned-sst-2-english"
ORT_DIR = Path("onnx-distilbert")
Q_DIR = Path("onnx-distilbert-quant")
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
BATCH = 16
MAXLEN = 128
N_WARM = 3
N_ITERS = 8

print(f"Device: {DEVICE} | torch={torch.__version__}")

We begin by installing the required libraries and setting up our environment for Hugging Face Optimum with ONNX Runtime. We configure paths, batch size, and iteration settings, and we confirm whether we run on CPU or GPU. Check out the FULL CODES here.

ds = load_dataset("glue", "sst2", split="validation[:20%]")
texts, labels = ds["sentence"], ds["label"]
metric = evaluate.load("accuracy")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

def make_batches(texts, max_len=MAXLEN, batch=BATCH):
    for i in range(0, len(texts), batch):
        yield tokenizer(texts[i:i+batch], padding=True, truncation=True,
                        max_length=max_len, return_tensors="pt")

def run_eval(predict_fn, texts, labels):
    preds = []
    for toks in make_batches(texts):
        preds.extend(predict_fn(toks))
    return metric.compute(predictions=preds, references=labels)["accuracy"]

def bench(predict_fn, texts, n_warm=N_WARM, n_iters=N_ITERS):
    for _ in range(n_warm):
        for toks in make_batches(texts[:BATCH*2]):
            predict_fn(toks)
    times = []
    for _ in range(n_iters):
        t0 = time.time()
        for toks in make_batches(texts):
            predict_fn(toks)
        times.append((time.time() - t0) * 1000)
    return float(np.mean(times)), float(np.std(times))

We load an SST-2 validation slice and prepare tokenization, an accuracy metric, and batching. We define run_eval to compute accuracy from any predictor and bench to warm up and time end-to-end inference. With these helpers, we fairly compare different engines using identical data and batching. Check out the FULL CODES here.

torch_model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID).to(DEVICE).eval()

@torch.no_grad()
def pt_predict(toks):
    toks = {k: v.to(DEVICE) for k, v in toks.items()}
    logits = torch_model(**toks).logits
    return logits.argmax(-1).detach().cpu().tolist()

pt_ms, pt_sd = bench(pt_predict, texts)
pt_acc = run_eval(pt_predict, texts, labels)
print(f"[PyTorch eager] {pt_ms:.1f}±{pt_sd:.1f} ms | acc={pt_acc:.4f}")

compiled_model = torch_model
compile_ok = False
try:
    compiled_model = torch.compile(torch_model, mode="reduce-overhead", fullgraph=False)
    compile_ok = True
except Exception as e:
    print("torch.compile unavailable or failed -> skipping:", repr(e))

@torch.no_grad()
def ptc_predict(toks):
    toks = {k: v.to(DEVICE) for k, v in toks.items()}
    logits = compiled_model(**toks).logits
    return logits.argmax(-1).detach().cpu().tolist()

if compile_ok:
    ptc_ms, ptc_sd = bench(ptc_predict, texts)
    ptc_acc = run_eval(ptc_predict, texts, labels)
    print(f"[torch.compile] {ptc_ms:.1f}±{ptc_sd:.1f} ms | acc={ptc_acc:.4f}")

We load the baseline PyTorch classifier, define a pt_predict helper, and benchmark/score it on SST-2. We then attempt torch.compile for just-in-time graph optimizations and, if successful, run the same benchmarks to compare speed and accuracy under an identical setup. Check out the FULL CODES here.

provider = "CUDAExecutionProvider" if DEVICE == "cuda" else "CPUExecutionProvider"
ort_model = ORTModelForSequenceClassification.from_pretrained(
    MODEL_ID, export=True, provider=provider, cache_dir=ORT_DIR
)

@torch.no_grad()
def ort_predict(toks):
    logits = ort_model(**{k: v.cpu() for k, v in toks.items()}).logits
    return logits.argmax(-1).cpu().tolist()

ort_ms, ort_sd = bench(ort_predict, texts)
ort_acc = run_eval(ort_predict, texts, labels)
print(f"[ONNX Runtime] {ort_ms:.1f}±{ort_sd:.1f} ms | acc={ort_acc:.4f}")

Q_DIR.mkdir(parents=True, exist_ok=True)
quantizer = ORTQuantizer.from_pretrained(ORT_DIR)
qconfig = QuantizationConfig(approach="dynamic", per_channel=False, reduce_range=True)
quantizer.quantize(model_input=ORT_DIR, quantization_config=qconfig, save_dir=Q_DIR)

ort_quant = ORTModelForSequenceClassification.from_pretrained(Q_DIR, provider=provider)

@torch.no_grad()
def ortq_predict(toks):
    logits = ort_quant(**{k: v.cpu() for k, v in toks.items()}).logits
    return logits.argmax(-1).cpu().tolist()

oq_ms, oq_sd = bench(ortq_predict, texts)
oq_acc = run_eval(ortq_predict, texts, labels)
print(f"[ORT Quantized] {oq_ms:.1f}±{oq_sd:.1f} ms | acc={oq_acc:.4f}")

We export the model to ONNX, run it with ONNX Runtime, then apply dynamic quantization with Optimum’s ORTQuantizer and benchmark both to see how latency improves while accuracy stays comparable. Check out the FULL CODES here.

pt_pipe = pipeline("sentiment-analysis", model=torch_model, tokenizer=tokenizer,
                   device=0 if DEVICE == "cuda" else -1)
ort_pipe = pipeline("sentiment-analysis", model=ort_model, tokenizer=tokenizer, device=-1)

samples = [
    "What a fantastic movie—performed brilliantly!",
    "This was a complete waste of time.",
    "I'm not sure how I feel about this one."
]
print("\nSample predictions (PT | ORT):")
for s in samples:
    a = pt_pipe(s)[0]["label"]
    b = ort_pipe(s)[0]["label"]
    print(f"- {s}\n PT={a} | ORT={b}")

import pandas as pd
rows = [["PyTorch eager", pt_ms, pt_sd, pt_acc],
        ["ONNX Runtime", ort_ms, ort_sd, ort_acc],
        ["ORT Quantized", oq_ms, oq_sd, oq_acc]]
if compile_ok: rows.insert(1, ["torch.compile", ptc_ms, ptc_sd, ptc_acc])
df = pd.DataFrame(rows, columns=["Engine", "Mean ms (↓)", "Std ms", "Accuracy"])
display(df)

print("""
Notes:
- BetterTransformer is deprecated on transformers>=4.49, hence omitted.
- For larger gains on GPU, also try FlashAttention2 models or FP8 with TensorRT-LLM.
- For CPU, tune threads: set OMP_NUM_THREADS/MKL_NUM_THREADS; try NUMA pinning.
- For static (calibrated) quantization, use QuantizationConfig(approach='static') with a calibration set.
""")

We sanity-check predictions with quick sentiment pipelines and print PyTorch vs ONNX labels side by side. We then assemble a summary table to compare latency and accuracy across engines, inserting torch.compile results when available. We conclude with practical notes, allowing us to extend the workflow to other backends and quantization modes.

In conclusion, we can clearly see how Optimum helps us bridge the gap between standard PyTorch models and production-ready, optimized deployments. We achieve speedups with ONNX Runtime and quantization while retaining accuracy, and we also explore how torch.compile provides gains directly within PyTorch. This workflow demonstrates a practical approach to balancing performance and efficiency for Transformer models, providing a foundation that can be further extended with advanced backends, such as OpenVINO or TensorRT.

Check out the FULL CODES here.


The post Coding Implementation to End-to-End Transformer Model Optimization with Hugging Face Optimum, ONNX Runtime, and Quantization appeared first on MarkTechPost.

Google AI Introduces the Public Preview of Chrome DevTools MCP: Making Your Coding Agent Control and Inspect a Live Chrome Browser

Google has released a public preview of “Chrome DevTools MCP,” a Model Context Protocol (MCP) server that lets AI coding agents control and inspect a real Chrome instance—recording performance traces, inspecting the DOM and CSS, executing JavaScript, reading console output, and automating user flows. The launch directly targets a well-known limitation in code-generating agents: they usually cannot observe the runtime behavior of the pages they create or modify. By wiring agents into Chrome’s DevTools via MCP, Google is turning static suggestion engines into loop-closed debuggers that run measurements in the browser before proposing fixes.

What exactly is Chrome DevTools MCP?

MCP is an open protocol for connecting LLMs to tools and data. Google’s DevTools MCP acts as a specialized server that exposes Chrome’s debugging surface to MCP-compatible clients. Google’s developer blog positions this as “bringing the power of Chrome DevTools to AI coding assistants,” with concrete workflows like initiating a performance trace (e.g., performance_start_trace) against a target URL, then having the agent analyze the resulting trace to suggest optimizations (for example, diagnosing high Largest Contentful Paint).

Capabilities and tool surface

The official GitHub repository documents a broad tool set. Beyond performance tracing (performance_start_trace, performance_stop_trace, performance_analyze_insight), agents can run navigation primitives (navigate_page, new_page, wait_for), simulate user input (click, fill, drag, hover), and interrogate runtime state (list_console_messages, evaluate_script, list_network_requests, get_network_request). Screenshot and snapshot utilities provide visual and DOM-state capture to support diffs and regressions. The server uses Puppeteer under the hood for reliable automation and waiting semantics, and it speaks to Chrome via the Chrome DevTools Protocol (CDP).

Installation

Setup is intentionally minimal for MCP clients. Google recommends adding a single config stanza that shells out to npx, always tracking the latest server build:

{
  "mcpServers": {
    "chrome-devtools": {
      "command": "npx",
      "args": ["chrome-devtools-mcp@latest"]
    }
  }
}

This server integrates with multiple agent front ends: Gemini CLI, Claude Code, Cursor, and GitHub Copilot’s MCP support. For VS Code/Copilot, the repo documents a code --add-mcp one-liner; for Claude Code, a claude mcp add command mirrors the same npx target. The package targets Node.js ≥22 and current Chrome.

Example agent workflows

Google’s announcement highlights pragmatic prompts that demonstrate end-to-end loops: verify a proposed fix in a live browser; analyze network failures (e.g., CORS or blocked image requests); simulate user behaviors like form submission to reproduce bugs; inspect layout issues by reading DOM/CSS in context; and run automated performance audits to reduce LCP and other Core Web Vitals. These are all operations agents can now validate with actual measurements rather than heuristics.

https://developer.chrome.com/blog/chrome-devtools-mcp?hl=en

Summary

Chrome DevTools MCP’s public preview is a practical inflection point for agentic frontend tooling: it grounds AI assistants in real browser telemetry—performance traces, DOM/CSS state, network and console data—so recommendations are driven by measurements rather than guesswork. The first-party server, shipped by the Chrome DevTools team, is installable via npx and targets MCP-capable clients, with Chrome/CDP under the hood. Expect shorter diagnose-fix loops for regressions and flaky UI flows, plus tighter validation of performance work.

Check out the Technical details and GitHub Page.


The post Google AI Introduces the Public Preview of Chrome DevTools MCP: Making Your Coding Agent Control and Inspect a Live Chrome Browser appeared first on MarkTechPost.

Meet VoXtream: An Open-Sourced Full-Stream Zero-Shot TTS Model for Real-Time Use that Begins Speaking from the First Word

Real-time agents, live dubbing, and simultaneous translation die by a thousand milliseconds. Most “streaming” TTS (Text to Speech) stacks still wait for a chunk of text before they emit sound, so the human hears a beat of silence before the voice starts. VoXtream—released by KTH’s Speech, Music and Hearing group—attacks this head-on: it begins speaking after the first word, outputs audio in 80 ms frames, and reports 102 ms first-packet latency (FPL) on a modern GPU (with PyTorch compile).

What exactly is “full-stream” TTS and how is it different from “output streaming”?

Output-streaming systems decode speech in chunks but still require the entire input text upfront; the clock starts late. Full-stream systems consume text as it arrives (word-by-word from an LLM) and emit audio in lockstep. VoXtream implements the latter: it ingests a word stream and generates audio frames continuously, eliminating input-side buffering while maintaining low per-frame compute. The architecture explicitly targets first-word onset rather than only steady-state throughput.

https://arxiv.org/pdf/2509.15969

How does VoXtream start speaking without waiting for future words?

The core trick is a dynamic phoneme look-ahead inside an incremental Phoneme Transformer (PT). PT may peek up to 10 phonemes to stabilize prosody, but it does not wait for that context; generation can start immediately after the first word enters the buffer. This avoids fixed look-ahead windows that add onset delay.

What’s the model stack under the hood?

VoXtream is a single, fully-autoregressive (AR) pipeline with three transformers:

Phoneme Transformer (PT): decoder-only, incremental; dynamic look-ahead ≤ 10 phonemes; phonemization via g2pE at the word level.

Temporal Transformer (TT): AR predictor over Mimi codec semantic tokens plus a duration token that encodes a monotonic phoneme-to-audio alignment (“stay/go” and {1, 2} phonemes per frame). Mimi runs at 12.5 Hz (→ 80 ms frames).

Depth Transformer (DT): AR generator for the remaining Mimi acoustic codebooks, conditioned on TT outputs and a ReDimNet speaker embedding for zero-shot voice prompting. The Mimi decoder reconstructs the waveform frame-by-frame, enabling continuous emission.

Mimi’s streaming codec design and dual-stream tokenization are well documented; VoXtream uses its first codebook as “semantic” context and the rest for high-fidelity reconstruction.
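
To make the full-stream control flow concrete, the following is a toy sketch, not the released VoXtream code: words arrive one at a time, the phoneme buffer allows a dynamic look-ahead of at most 10 phonemes, and one 80 ms frame is emitted per step through stubbed model stages. The duration-token alignment and pacing are deliberately simplified to one phoneme per frame.

# Toy full-stream loop with a dynamic phoneme look-ahead (stubbed models, not the VoXtream code).
from collections import deque

FRAME_MS, MAX_LOOKAHEAD = 80, 10

def phonemize(word):                 # stand-in for word-level g2p phonemization
    return list(word)

def generate_frame(context):         # stand-in for PT -> TT -> DT -> Mimi decode of one 80 ms frame
    return b"\x00" * 1920            # placeholder bytes for an 80 ms frame at 24 kHz

def stream_tts(word_stream):
    buffer = deque()
    for word in word_stream:         # text arrives incrementally (e.g., from an LLM)
        buffer.extend(phonemize(word))
        while buffer:                # start speaking as soon as the first word is buffered
            context = list(buffer)[:MAX_LOOKAHEAD]   # peek at most 10 phonemes, never wait for more
            yield generate_frame(context)
            buffer.popleft()         # simplified: advance one phoneme per emitted frame

frames = list(stream_tts(["hello", "world"]))
print(f"emitted {len(frames)} frames of {FRAME_MS} ms each")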

Is it actually fast in practice—or just “fast on paper”?

The repository includes a benchmark script that measures both FPL and real-time factor (RTF). On A100, the research team reports 171 ms / 1.00 RTF without compile and 102 ms / 0.17 RTF with compile; on RTX 3090, 205 ms / 1.19 RTF uncompiled and 123 ms / 0.19 RTF compiled.
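
If you want to reproduce these numbers on your own hardware, the two metrics are simple to compute: FPL is the wall-clock time until the first audio packet, and RTF is generation time divided by the duration of audio produced. Below is a generic measurement sketch with a placeholder synthesizer standing in for the repository's benchmark script.

# Generic FPL / RTF measurement sketch; replace dummy_synthesize with your streaming TTS call.
import time

def dummy_synthesize(text):
    for _ in range(50):                      # pretend to stream 50 frames of 80 ms each
        time.sleep(0.01)
        yield b"frame"

def benchmark(stream_fn, text, frame_ms=80):
    t0 = time.perf_counter()
    first_packet_latency = None
    n_frames = 0
    for _ in stream_fn(text):
        if first_packet_latency is None:
            first_packet_latency = (time.perf_counter() - t0) * 1000   # ms until first audio
        n_frames += 1
    gen_time = time.perf_counter() - t0
    audio_seconds = n_frames * frame_ms / 1000
    return first_packet_latency, gen_time / audio_seconds             # (FPL in ms, RTF)

fpl, rtf = benchmark(dummy_synthesize, "hello world")
print(f"FPL ~{fpl:.0f} ms, RTF ~{rtf:.2f}")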

How does it compare to today’s popular streaming baselines?

The research team evaluates short-form output streaming and full-stream scenarios. On LibriSpeech-long full-stream (where text arrives word-by-word), VoXtream shows lower WER (3.24 %) than CosyVoice2 (6.11 %) and a significant naturalness preference for VoXtream in listener studies (p ≤ 5e-10), while CosyVoice2 scores higher on speaker-similarity—consistent with its flow-matching decoder. In runtime, VoXtream has the lowest FPL among the compared public streaming systems, and with compile it operates >5× faster than real time (RTF ≈ 0.17).


Why does this AR design beat diffusion/flow stacks on onset?

Diffusion/flow vocoders typically generate audio in chunks, so even if the text-audio interleaving is clever, the vocoder imposes a floor on first-packet latency. VoXtream keeps every stage AR and frame-synchronous—PT→TT→DT→Mimi decoder—so the first 80 ms packet emerges after one pass through the stack rather than a multi-step sampler. The introduction surveys prior interleaved and chunked approaches and explains how NAR flow-matching decoders used in IST-LM and CosyVoice2 impede low FPL despite strong offline quality.

Did they get here with huge data—or something smaller and cleaner?

VoXtream trains on a ~9k-hour mid-scale corpus: roughly 4.5k h Emilia and 4.5k h HiFiTTS-2 (22 kHz subset). The team diarized to remove multi-speaker clips, filtered transcripts using ASR, and applied NISQA to drop low-quality audio. Everything is resampled to 24 kHz, and the dataset card spells out the preprocessing pipeline and alignment artifacts (Mimi tokens, MFA alignments, duration labels, and speaker templates).

Are the headline quality metrics holding up outside cherry-picked clips?

Table 1 (zero-shot TTS) shows VoXtream is competitive on WER, UTMOS (MOS predictor), and speaker similarity across SEED-TTS test-en and LibriSpeech test-clean; the research team also runs an ablation: adding the CSM Depth Transformer and speaker encoder notably improves similarity without a significant WER penalty relative to a stripped baseline. The subjective study uses a MUSHRA-like protocol and a second-stage preference test tailored to full-stream generation.


Where does this land in the TTS landscape?

As per the research paper, it positions VoXtream among recent interleaved AR + NAR vocoder approaches and LM-codec stacks. The core contribution isn’t a new codec or a giant model—it’s a latency-focused AR arrangement plus a duration-token alignment that preserves input-side streaming. If you build live agents, the important trade-off is explicit: a small drop in speaker similarity vs. order-of-magnitude lower FPL than chunked NAR vocoders in full-stream conditions.

Check out the PAPER, Model on Hugging Face, GitHub Page and Project Page.


The post Meet VoXtream: An Open-Sourced Full-Stream Zero-Shot TTS Model for Real-Time Use that Begins Speaking from the First Word appeared first on MarkTechPost.

Running deep research AI agents on Amazon Bedrock AgentCore

AI agents are evolving beyond basic single-task helpers into more powerful systems that can plan, critique, and collaborate with other agents to solve complex problems. Deep Agents—a recently introduced framework built on LangGraph—bring these capabilities to life, enabling multi-agent workflows that mirror real-world team dynamics. The challenge, however, is not just building such agents but also running them reliably and securely in production. This is where Amazon Bedrock AgentCore Runtime comes in. By providing a secure, serverless environment purpose-built for AI agents and tools, Runtime makes it possible to deploy Deep Agents at enterprise scale without the heavy lifting of managing infrastructure.
In this post, we demonstrate how to deploy Deep Agents on AgentCore Runtime. As shown in the following figure, AgentCore Runtime scales any agent and provides session isolation by allocating a new microVM for each new session.

What is Amazon Bedrock AgentCore?
Amazon Bedrock AgentCore is both framework-agnostic and model-agnostic, giving you the flexibility to deploy and operate advanced AI agents securely and at scale. Whether you’re building with Strands Agents, CrewAI, LangGraph, LlamaIndex, or another framework—and running them on a large language model (LLM)—AgentCore provides the infrastructure to support them. Its modular services are purpose-built for dynamic agent workloads, with tools to extend agent capabilities and controls required for production use. By alleviating the undifferentiated heavy lifting of building and managing specialized agent infrastructure, AgentCore lets you bring your preferred framework and model and deploy without rewriting code.
Amazon Bedrock AgentCore offers a comprehensive suite of capabilities designed to transform local agent prototypes into production-ready systems. These include persistent memory for maintaining context in and across conversations, access to existing APIs using Model Context Protocol (MCP), seamless integration with corporate authentication systems, specialized tools for web browsing and code execution, and deep observability into agent reasoning processes. In this post, we focus specifically on the AgentCore Runtime component.
Core capabilities of AgentCore Runtime
AgentCore Runtime provides a serverless, secure hosting environment specifically designed for agentic workloads. It packages code into a lightweight container with a simple, consistent interface, making it equally well-suited for running agents, tools, MCP servers, or other workloads that benefit from seamless scaling and integrated identity management. AgentCore Runtime offers extended execution times up to 8 hours for complex reasoning tasks, handles large payloads for multimodal content, and implements consumption-based pricing that charges only during active processing—not while waiting for LLM or tool responses. Each user session runs in complete isolation within dedicated micro virtual machines (microVMs), maintaining security and helping to prevent cross-session contamination between agent interactions. The runtime works with many frameworks (for example: LangGraph, CrewAI, Strands, and so on) and many foundation model providers, while providing built-in corporate authentication, specialized agent observability, and unified access to the broader AgentCore environment through a single SDK.
Real-world example: Deep Agents integration
In this post we’re going to deploy the recently released Deep Agents implementation example on AgentCore Runtime—showing just how little effort it takes to get the latest agent innovations up and running.

The sample implementation in the preceding diagram includes:

A research agent that conducts deep internet searches using the Tavily API
A critique agent that reviews and provides feedback on generated reports
A main orchestrator that manages the workflow and handles file operations

Deep Agents uses LangGraph’s state management to create a multi-agent system with:

Built-in task planning through a write_todos tool that helps agents break down complex requests
Virtual file system where agents can read/write files to maintain context across interactions
Sub-agent architecture allowing specialized agents to be invoked for specific tasks while maintaining context isolation
Recursive reasoning with high recursion limits (more than 1,000) to handle complex, multi-step workflows

This architecture enables Deep Agents to handle research tasks that require multiple rounds of information gathering, synthesis, and refinement. The key integration points in our code showcase how agents work with AgentCore. The beauty is in its simplicity—we only need to add a couple of lines of code to make an agent AgentCore-compatible:

# 1. Import the AgentCore runtime
from bedrock_agentcore.runtime import BedrockAgentCoreApp
app = BedrockAgentCoreApp()

# 2. Decorate your agent function with @app.entrypoint
@app.entrypoint
async def langgraph_bedrock(payload):
    # Your existing agent logic remains unchanged
    user_input = payload.get("prompt")

    # Call your agent as before
    stream = agent.astream(
        {"messages": [HumanMessage(content=user_input)]},
        stream_mode="values"
    )

    # Stream responses back
    async for chunk in stream:
        yield chunk

# 3. Add the runtime starter at the bottom
if __name__ == "__main__":
    app.run()

That’s it! The rest of the code—model initialization, API integrations, and agent logic—remains exactly as it was. AgentCore handles the infrastructure while your agent handles the intelligence. This integration pattern works for most Python agent frameworks, making AgentCore truly framework-agnostic.
Deploying to AgentCore Runtime: Step-by-step
Let’s walk through the actual deployment process using the AgentCore Starter ToolKit, which dramatically simplifies the deployment workflow.
Prerequisites
Before you begin, make sure you have:

Python 3.10 or higher
AWS credentials configured
Amazon Bedrock AgentCore SDK installed

Step 1: IAM permissions
There are two different AWS Identity and Access Management (IAM) permissions you need to consider when deploying an agent in an AgentCore Runtime—the role you, as a developer, use to create AgentCore resources and the execution role that an agent needs to run in an AgentCore Runtime. While the latter role can now be auto-created by the AgentCore Starter Toolkit (auto_create_execution_role=True), the former must be defined as described in IAM Permissions for AgentCore Runtime.
Step 2: Add a wrapper to your agent
As shown in the preceding Deep Agents example, add the AgentCore imports and decorator to your existing agent code.
Step 3: Deploy using the AgentCore starter toolkit
The starter toolkit provides a three-step deployment process:

from bedrock_agentcore_starter_toolkit import Runtime

# Step 1: Configure
agentcore_runtime = Runtime()
config_response = agentcore_runtime.configure(
    entrypoint="hello.py",        # contains the code we showed earlier in the post
    execution_role=role_arn,      # or auto-create
    auto_create_ecr=True,
    requirements_file="requirements.txt",
    region="us-west-2",
    agent_name="deepagents-research"
)

# Step 2: Launch
launch_result = agentcore_runtime.launch()
print(f"Agent deployed! ARN: {launch_result['agent_arn']}")

# Step 3: Invoke
response = agentcore_runtime.invoke({
    "prompt": "Research the latest developments in quantum computing"
})

Step 4: What happens behind the scenes
When you run the deployment, the starter kit automatically:

Generates an optimized Docker file with Python 3.13-slim base image and OpenTelemetry instrumentation
Builds your container with the dependencies from requirements.txt
Creates an Amazon Elastic Container Registry (Amazon ECR) repository (if auto_create_ecr=True) and pushes your image
Deploys to AgentCore Runtime and monitors the deployment status
Configures networking and observability with Amazon CloudWatch and AWS X-Ray integration

The entire process typically takes 2–3 minutes, after which your agent is ready to handle requests at scale. Each new session is launched in its own fresh AgentCore Runtime microVM, maintaining complete environment isolation.
The starter kit generates a configuration file (.bedrock_agentcore.yaml) that captures your deployment settings, making it straightforward to redeploy or update your agent later.
Invoking your deployed agent
After deployment, you have two options for invoking your agent:
Option 1: Using the start kit (shown in Step 3)

response = agentcore_runtime.invoke({
    "prompt": "Research the latest developments in quantum computing"
})

Option 2: Using boto3 SDK directly

import boto3
import json

agentcore_client = boto3.client('bedrock-agentcore', region_name='us-west-2')
response = agentcore_client.invoke_agent_runtime(
    agentRuntimeArn=agent_arn,
    qualifier="DEFAULT",
    payload=json.dumps({
        "prompt": "Analyze the impact of AI on healthcare in 2024"
    })
)

# Handle streaming response
for event in response['completion']:
    if 'chunk' in event:
        print(event['chunk']['bytes'].decode('utf-8'))

Deep Agents in action
As the code executes in Bedrock AgentCore Runtime, the primary agent orchestrates specialized sub-agents—each with its own purpose, prompt, and tool access—to solve complex tasks more effectively. In this case, the orchestrator prompt (research_instructions) sets the plan:

Write the question to question.txt
Fan out to one or more research-agent calls (each on a single sub-topic) using the internet_search tool
Synthesize findings into final_report.md
Call critique-agent to evaluate gaps and structure
Optionally loop back to more research/edits until quality is met

Here it is in action:

Clean up
When finished, don’t forget to delete the provisioned AgentCore Runtime as well as the container repository that was created during the process:

agentcore_control_client = boto3.client(
    'bedrock-agentcore-control', region_name=region)
ecr_client = boto3.client('ecr', region_name=region)

runtime_delete_response = agentcore_control_client.delete_agent_runtime(
    agentRuntimeId=launch_result.agent_id,
)

response = ecr_client.delete_repository(
    repositoryName=launch_result.ecr_uri.split('/')[1],
    force=True
)

Conclusion
Amazon Bedrock AgentCore represents a paradigm shift in how we deploy AI agents. By abstracting away infrastructure complexity while maintaining framework and model flexibility, AgentCore enables developers to focus on building sophisticated agent logic rather than managing deployment pipelines. Our Deep Agents deployment demonstrates that even complex, multi-agent systems with external API integrations can be deployed with minimal code changes. The combination of enterprise-grade security, built-in observability, and serverless scaling makes AgentCore the best choice for production AI agent deployments. Specifically for deep research agents, AgentCore offers the following unique capabilities that you can explore:

AgentCore Runtime can handle asynchronous processing and long running (up to 8 hours) agents. Asynchronous tasks allow your agent to continue processing after responding to the client and handle long-running operations without blocking responses. Your background research sub-agent could be asynchronously researching for hours.
AgentCore Runtime works with AgentCore Memory, enabling capabilities such as building upon previous findings, remembering research preferences, and maintaining complex investigation context without losing progress between sessions.
You can use AgentCore Gateway to extend your deep research to include proprietary insights from enterprise services and data sources. By exposing these differentiated resources as MCP tools, your agents can quickly take advantage and combine that with publicly available knowledge.

Ready to deploy your agents to production? Here’s how to get started:

Install the AgentCore starter kit: pip install bedrock-agentcore-starter-toolkit
Experiment: Deploy your code by following this step by step guide.

The era of production-ready AI agents is here. With AgentCore, the journey from prototype to production has never been shorter.

About the authors
Vadim Omeltchenko is a Sr. AI/ML Solutions Architect who is passionate about helping AWS customers innovate in the cloud. His prior IT experience was predominantly on the ground.
Eashan Kaushik is a Specialist Solutions Architect AI/ML at Amazon Web Services. He is driven by creating cutting-edge generative AI solutions while prioritizing a customer-centric approach to his work. Before this role, he obtained an MS in Computer Science from NYU Tandon School of Engineering. Outside of work, he enjoys sports, lifting, and running marathons.
Shreyas Subramanian is a Principal data scientist and helps customers by using Machine Learning to solve their business challenges using the AWS platform. Shreyas has a background in large scale optimization and Machine Learning, and in use of Machine Learning and Reinforcement Learning for accelerating optimization tasks.
Mark Roy is a Principal Machine Learning Architect for AWS, helping customers design and build generative AI solutions. His focus since early 2023 has been leading solution architecture efforts for the launch of Amazon Bedrock, the flagship generative AI offering from AWS for builders. Mark’s work covers a wide range of use cases, with a primary interest in generative AI, agents, and scaling ML across the enterprise. He has helped companies in insurance, financial services, media and entertainment, healthcare, utilities, and manufacturing. Prior to joining AWS, Mark was an architect, developer, and technology leader for over 25 years, including 19 years in financial services. Mark holds six AWS Certifications, including the ML Specialty Certification.

Integrate tokenization with Amazon Bedrock Guardrails for secure data …

This post is co-written by Mark Warner, Principal Solutions Architect for Thales, Cyber Security Products.
As generative AI applications make their way into production environments, they integrate with a wider range of business systems that process sensitive customer data. This integration introduces new challenges around protecting personally identifiable information (PII) while maintaining the ability to recover original data when legitimately needed by downstream applications. Consider a financial services company implementing generative AI across different departments. The customer service team needs an AI assistant that can access customer profiles and provide personalized responses that include contact information, for example: “We’ll send your new card to your address at 123 Main Street.” Meanwhile, the fraud analysis team requires the same customer data but must analyze patterns without exposing actual PII, working only with protected representations of sensitive information.
Amazon Bedrock Guardrails helps detect sensitive information, such as PII, in standard format in input prompts or model responses. Sensitive information filters give organizations control over how sensitive data is handled, with options to block requests containing PII or mask the sensitive information with generic placeholders like {NAME} or {EMAIL}. This capability helps organizations comply with data protection regulations while still using the power of large language models (LLMs).
Although masking effectively protects sensitive information, it creates a new challenge: the loss of data reversibility. When guardrails replace sensitive data with generic masks, the original information becomes inaccessible to downstream applications that might need it for legitimate business processes. This limitation can impact workflows where both security and functional data are required.
Tokenization offers a complementary approach to this challenge. Unlike masking, tokenization replaces sensitive data with format-preserving tokens that are mathematically unrelated to the original information but maintain its structure and usability. These tokens can be securely reversed back to their original values when needed by authorized systems, creating a path for secure data flows throughout an organization’s environment.
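
As a toy illustration of the difference (not the Thales implementation): masking maps every email to the same irreversible placeholder, while a tokenization vault issues a unique, format-preserving token that authorized code can map back to the original value.

# Toy contrast between masking and reversible, format-preserving tokenization.
# This in-memory "vault" only illustrates the concept; real tokenization services
# (such as Thales CipherTrust) manage tokens and keys server-side.
import secrets

def mask_email(_email):
    return "{EMAIL}"                             # irreversible: the original value is gone

class TokenVault:
    def __init__(self):
        self._forward, self._reverse = {}, {}

    def tokenize_email(self, email):
        if email not in self._forward:
            local = "tok_" + secrets.token_hex(4)
            token = f"{local}@example.invalid"   # keeps an email-like shape for downstream systems
            self._forward[email], self._reverse[token] = token, email
        return self._forward[email]

    def detokenize(self, token):
        return self._reverse[token]              # only trusted components should call this

vault = TokenVault()
original = "j.smith@example.com"
print(mask_email(original))                      # {EMAIL}
token = vault.tokenize_email(original)
print(token, "->", vault.detokenize(token))      # reversible for authorized callers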
In this post, we show you how to integrate Amazon Bedrock Guardrails with third-party tokenization services to protect sensitive data while maintaining data reversibility. By combining these technologies, organizations can implement stronger privacy controls while preserving the functionality of their generative AI applications and related systems. The solution described in this post demonstrates how to combine Amazon Bedrock Guardrails with tokenization services from Thales CipherTrust Data Security Platform to create an architecture that protects sensitive data without sacrificing the ability to process that data securely when needed. This approach is particularly valuable for organizations in highly regulated industries that need to balance innovation with compliance requirements.
Amazon Bedrock Guardrails APIs
This section describes the key components and workflow for the integration between Amazon Bedrock Guardrails and a third-party tokenization service.
Amazon Bedrock Guardrails provides two distinct approaches for implementing content safety controls:

Direct integration with model invocation through APIs like InvokeModel and Converse, where guardrails automatically evaluate inputs and outputs as part of the model inference process.
Standalone evaluation through the ApplyGuardrail API, which decouples guardrails assessment from model invocation, allowing evaluation of text against defined policies.

This post uses the ApplyGuardrail API for tokenization integration because it separates content assessment from model invocation, allowing for the insertion of tokenization processing between these steps. This separation creates the necessary space in the workflow to replace guardrail masks with format-preserving tokens before model invocation, or after the model response is handed over to the target application downstream in the process.
The solution extends the typical ApplyGuardrail API implementation by inserting tokenization processing between guardrail evaluation and model invocation, as follows:

The application calls the ApplyGuardrail API to assess the user input for sensitive information.
If no sensitive information is detected (action = “NONE”), the application proceeds to model invocation via the InvokeModel API.
If sensitive information is detected (action = “ANONYMIZED”):

The application captures the detected PII and its positions.
It calls a tokenization service to convert these entities into format-preserving tokens.
It replaces the generic guardrail masks with these tokens.
The application then invokes the foundation model with the tokenized content.

For model responses:

The application applies guardrails to check the output from the model for sensitive information.
It tokenizes detected PII before passing the response to downstream systems.

Solution overview
To illustrate how this workflow delivers value in practice, consider a financial advisory application that helps customers understand their spending patterns and receive personalized financial recommendations. In this example, three distinct application components work together to provide secure, AI-powered financial insights:

Customer gateway service – This trusted frontend orchestrator receives customer queries that often contain sensitive information. For example, a customer might ask: “Hi, this is j.smith@example.com. Based on my last five transactions on acme.com, and my current balance of $2,342.18, should I consider their new credit card offer?”
Financial analysis engine – This AI-powered component analyzes financial patterns and generates recommendations but doesn’t need access to actual customer PII. It works with anonymized or tokenized information.
Response processing service – This trusted service handles the final customer communication, including detokenizing sensitive information before presenting results to the customer.

The following diagram illustrates the workflow for integrating Amazon Bedrock Guardrails with tokenization services in this financial advisory application. AWS Step Functions orchestrates the sequential process of PII detection, tokenization, AI model invocation, and detokenization across the three key components (customer gateway service, financial analysis engine, and response processing service) using AWS Lambda functions.

The workflow operates as follows:

The customer gateway service (for this example, through Amazon API Gateway) receives the user input containing sensitive information.
It calls the ApplyGuardrail API to identify PII or other sensitive information that should be anonymized or blocked.
For detected sensitive elements (such as user names or merchant names), it calls the tokenization service to generate format-preserving tokens.
The input with tokenized values is passed to the financial analysis engine for processing. (For example, “Hi, this is [[TOKEN_123]]. Based on my last five transactions on [[TOKEN_456]] and my current balance of $2,342.18, should I consider their new credit card offer?”)
The financial analysis engine invokes an LLM on Amazon Bedrock to generate financial advice using the tokenized data.
The model response, potentially containing tokenized values, is sent to the response processing service.
This service calls the tokenization service to detokenize the tokens, restoring the original sensitive values.
The final, detokenized response is delivered to the customer.

This architecture maintains data confidentiality throughout the processing flow while preserving the information’s utility. The financial analysis engine works with structurally valid but cryptographically protected data, allowing it to generate meaningful recommendations without exposing sensitive customer information. Meanwhile, the trusted components at the entry and exit points of the workflow can access the actual data when necessary, creating a secure end-to-end solution.
In the following sections, we provide a detailed walkthrough of implementing the integration between Amazon Bedrock Guardrails and tokenization services.
Prerequisites
To implement the solution described in this post, you must have the following components configured in your environment:

An AWS account with Amazon Bedrock enabled in your target AWS Region.
Appropriate AWS Identity and Access Management (IAM) permissions configured following least privilege principles with specific actions enabled: bedrock:CreateGuardrail, bedrock:ApplyGuardrail, and bedrock-runtime:InvokeModel.
For AWS Organizations, verify Amazon Bedrock access is permitted by service control policies.
A Python 3.7+ environment with the boto3 library installed. For information about installing the boto3 library, refer to AWS SDK for Python (Boto3).
AWS credentials configured for programmatic access using the AWS Command Line Interface (AWS CLI). For more details, refer to Configuring settings for the AWS CLI.
This implementation requires a deployed tokenization service accessible through REST API endpoints. Although this walkthrough demonstrates integration with Thales CipherTrust, the pattern adapts to tokenization providers offering protect and unprotect API operations. Make sure network connectivity exists between your application environment and both AWS APIs and your tokenization service endpoints, along with valid authentication credentials for accessing your chosen tokenization service. For information about setting up Thales CipherTrust specifically, refer to How Thales Enables PCI DSS Compliance with a Tokenization Solution on AWS.

Configure Amazon Bedrock Guardrails
Configure Amazon Bedrock Guardrails for PII detection and masking through the Amazon Bedrock console or programmatically using the AWS SDK. Sensitive information filter policies can anonymize or redact information from model requests or responses:

import boto3

def create_bedrock_guardrail():
    """
    Create a guardrail in Amazon Bedrock for financial applications with PII protection.
    """
    bedrock = boto3.client('bedrock')

    response = bedrock.create_guardrail(
        name="FinancialServiceGuardrail",
        description="Guardrail for financial applications with PII protection",
        sensitiveInformationPolicyConfig={
            'piiEntitiesConfig': [
                {
                    'type': 'URL',
                    'action': 'ANONYMIZE',
                    'inputAction': 'ANONYMIZE',
                    'outputAction': 'ANONYMIZE',
                    'inputEnabled': True,
                    'outputEnabled': True
                },
                {
                    'type': 'EMAIL',
                    'action': 'ANONYMIZE',
                    'inputAction': 'ANONYMIZE',
                    'outputAction': 'ANONYMIZE',
                    'inputEnabled': True,
                    'outputEnabled': True
                },
                {
                    'type': 'NAME',
                    'action': 'ANONYMIZE',
                    'inputAction': 'ANONYMIZE',
                    'outputAction': 'ANONYMIZE',
                    'inputEnabled': True,
                    'outputEnabled': True
                }
            ]
        },
        blockedInputMessaging="I can't provide information with PII data.",
        blockedOutputsMessaging="I can't generate content with PII data."
    )

    return response

Integrate the tokenization workflow
This section implements the tokenization workflow by first detecting PII entities with the ApplyGuardrail API, then replacing the generic masks with format-preserving tokens from your tokenization service.
Apply guardrails to detect PII entities
Use the ApplyGuardrail API to validate input text from the user and detect PII entities:

import boto3
from botocore.exceptions import ClientError

def invoke_guardrail(user_query):
    """
    Apply Amazon Bedrock Guardrails to validate input text and detect PII entities.

    Args:
        user_query (str): The user's input text to be checked.

    Returns:
        dict: The response from the ApplyGuardrail API.

    Raises:
        ClientError: If there's an error applying the guardrail.
    """
    try:
        bedrock_runtime = boto3.client('bedrock-runtime')

        response = bedrock_runtime.apply_guardrail(
            guardrailIdentifier='your-guardrail-id',       # Replace with your actual guardrail ID
            guardrailVersion='your-guardrail-version',     # Replace with your actual version
            source="INPUT",
            content=[{"text": {"text": user_query}}]
        )

        return response
    except ClientError as e:
        print(f"Error applying guardrail: {e}")
        raise

Invoke tokenization service
The response from the ApplyGuardrail API includes the list of PII entities matching the sensitive information policy. Parse those entities and invoke the tokenization service to generate the tokens.
The following example code uses the Thales CipherTrust tokenization service:

import json

import requests


def thales_ciphertrust_tokenizer(guardrail_response):
    """
    Process PII entities detected by the guardrail and tokenize them using Thales CipherTrust.

    Args:
        guardrail_response (dict): The response from the ApplyGuardrail API.

    Returns:
        list: List of dictionaries containing original values, types, and tokenized responses.

    Raises:
        RuntimeError: If there's an error invoking Thales CipherTrust.
    """
    try:
        protected_results = []

        for assessment in guardrail_response.get("assessments", []):
            pii_entities = assessment.get("sensitiveInformationPolicy", {}).get("piiEntities", [])

            for entity in pii_entities:
                sensitive_value = entity.get("match")
                entity_type = entity.get("type")

                if sensitive_value:
                    # Prepare payload for the Thales CipherTrust tokenization service
                    crdp_payload = {
                        "protection_policy_name": "plain-alpha-internal",
                        "DATA_KEY": sensitive_value,
                    }

                    url_str = "http://your-ciphertrust-cname:8090/v1/protect"  # Replace with your actual CipherTrust URL
                    headers = {"Content-Type": "application/json"}

                    # Invoke the Thales CipherTrust tokenization service
                    response = requests.post(url_str, headers=headers, data=json.dumps(crdp_payload))
                    response.raise_for_status()
                    response_json = response.json()

                    protected_results.append({
                        "original_value": sensitive_value,
                        "type": entity_type,
                        "protection_response": response_json
                    })

        return protected_results
    except requests.RequestException as e:
        print(f"Error invoking Thales CipherTrust: {e}")
        raise RuntimeError(f"Error invoking Thales CipherTrust: {e}") from e

Replace guardrail masks with tokens
Next, substitute the generic guardrail masks with the tokens generated by the Thales CipherTrust tokenization service. This enables downstream applications to work with structurally valid data while maintaining security and reversibility.

def process_guardrail_output(protected_results, guardrail_response):
    """
    Process guardrail output by replacing placeholders with protected values.

    Args:
        protected_results (list): List of protected data tokenized by Thales CipherTrust.
        guardrail_response (dict): Guardrail response dictionary.

    Returns:
        list: List of modified output items with placeholders replaced by tokens.

    Raises:
        ValueError: If input parameters are invalid.
        Exception: For any unexpected errors during processing.
    """
    try:
        # Validate input types
        if not isinstance(protected_results, list) or not isinstance(guardrail_response, dict):
            raise ValueError("Invalid input parameters")

        # Extract protection map: PII type (for example, EMAIL) -> tokenized value
        protection_map = {res['type'].upper(): res['protection_response']['protected_data']
                          for res in protected_results}

        # Process outputs
        modified_outputs = []
        for output_item in guardrail_response.get('outputs', []):
            if 'text' in output_item:
                modified_text = output_item['text']

                # Replace all placeholders in one pass
                for pii_type, protected_value in protection_map.items():
                    modified_text = modified_text.replace(f"{{{pii_type}}}", protected_value)

                modified_outputs.append({"text": modified_text})

        return modified_outputs
    except (ValueError, KeyError) as e:
        print(f"Error processing guardrail output: {e}")
        raise
    except Exception as e:
        print(f"Unexpected error while processing guardrail output: {e}")
        raise

The result of this process is that user inputs containing information matching the sensitive information policy applied through Amazon Bedrock Guardrails are transformed into unique, reversible tokenized versions.
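To tie the pieces together, the following is a minimal sketch of how the three functions defined above could be chained for a single user query (error handling and configuration are omitted; the sample query is the same one used in the example that follows):

# Minimal end-to-end sketch combining the functions defined above (illustrative only)
user_query = (
    "Hi, this is john.smith@example.com. Based on my last five transactions on acme.com, "
    "and my current balance of $2,342.18, should I consider their new credit card offer?"
)

# 1. Detect PII entities and apply generic masks with Amazon Bedrock Guardrails
guardrail_response = invoke_guardrail(user_query)

# 2. Tokenize the detected PII values with the tokenization service
protected_results = thales_ciphertrust_tokenizer(guardrail_response)

# 3. Replace the generic guardrail masks with format-preserving tokens
sanitized_outputs = process_guardrail_output(protected_results, guardrail_response)
sanitized_text = sanitized_outputs[0]["text"] if sanitized_outputs else user_query
print(sanitized_text)
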
The following example input contains PII elements:

“Hi, this is john.smith@example.com. Based on my last five transactions on acme.com, and my current balance of $2,342.18, should I consider their new credit card offer?”

The following is an example of the sanitized user input:

“Hi, this is 1001000GC5gDh1.D8eK71@EjaWV.lhC. Based on my last five transactions on 1001000WcFzawG.Jc9Tfc, and my current balance of $2,342.18, should I consider their new credit card offer?”

Downstream application processing
The sanitized input is ready to be used by generative AI applications, including model invocations on Amazon Bedrock. In response to the tokenized input, an LLM invoked by the financial analysis engine would produce a relevant analysis that maintains the secure token format:

“Based on your recent transactions at 1001000WcFzawG.Jc9Tfc and your current account status, I can confirm that the new credit card offer would provide approximately $33 in monthly rewards based on your spending patterns. With annual benefits of around $394 against the $55 annual fee, this card would be beneficial for your profile, 1001000GC5gDh1.D8eK71@EjaWV.lhC.”
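
As an illustration, the sanitized text can be passed to a model on Amazon Bedrock like any other prompt. The following is a minimal sketch using the Converse API; the model ID is a placeholder you would replace with a model enabled in your account, and sanitized_text refers to the variable from the earlier sketch:

import boto3

bedrock_runtime = boto3.client('bedrock-runtime')

# Invoke a model with the tokenized (sanitized) input; no raw PII leaves the trust boundary
response = bedrock_runtime.converse(
    modelId='<your-model-id>',  # Replace with a model enabled in your account
    messages=[{
        "role": "user",
        "content": [{"text": sanitized_text}]
    }]
)

print(response["output"]["message"]["content"][0]["text"])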

When authorized systems need to recover original values, tokens are detokenized. With Thales CipherTrust, this is accomplished using the Detokenize API, which requires the same parameters as in the previous tokenize action. This completes the secure data flow while preserving the ability to recover original information when needed.
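The exact detokenization call depends on your tokenization provider. The following is a minimal sketch assuming a CipherTrust reveal endpoint that mirrors the protect call shown earlier; the endpoint path and payload fields are assumptions, so verify them against your CipherTrust documentation:

import json

import requests


def thales_ciphertrust_detokenizer(protected_value):
    """Recover the original value for a token (sketch; endpoint path and fields are assumptions)."""
    crdp_payload = {
        "protection_policy_name": "plain-alpha-internal",  # Same policy used to tokenize
        "protected_data": protected_value,
    }

    url_str = "http://your-ciphertrust-cname:8090/v1/reveal"  # Assumed detokenize endpoint; replace with your actual URL
    headers = {"Content-Type": "application/json"}

    response = requests.post(url_str, headers=headers, data=json.dumps(crdp_payload))
    response.raise_for_status()
    return response.json()
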
Clean up
As you follow the approach described in this post, you will create new AWS resources in your account. To avoid incurring additional charges, delete these resources when you no longer need them.
To clean up your resources, complete the following steps:

Delete the guardrails you created. For instructions, refer to Delete your guardrail; a programmatic option is shown in the sketch after this list.
If you implemented the tokenization workflow using Lambda, API Gateway, or Step Functions as described in this post, remove the resources you created.
This post assumes a tokenization solution is already available in your account. If you deployed a third-party tokenization solution (such as Thales CipherTrust) to test this implementation, refer to that solution’s documentation for instructions to properly decommission these resources and stop incurring charges.
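
For the first cleanup step, the guardrail can also be deleted programmatically. The following is a minimal sketch; the guardrail ID is a placeholder:

import boto3

bedrock = boto3.client('bedrock')

# Delete the guardrail created earlier; replace the identifier with your guardrail ID
bedrock.delete_guardrail(guardrailIdentifier='<your-guardrail-id>')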

Conclusion
This post demonstrated how to combine Amazon Bedrock Guardrails with tokenization to enhance handling of sensitive information in generative AI workflows. By integrating these technologies, organizations can protect PII during processing while maintaining data utility and reversibility for authorized downstream applications.
The implementation illustrated uses Thales CipherTrust Data Security Platform for tokenization, but the architecture supports many tokenization solutions. To learn more about a serverless approach to building custom tokenization capabilities, refer to Building a serverless tokenization solution to mask sensitive data.
This solution provides a practical framework for builders to use the full potential of generative AI with appropriate safeguards. By combining the content safety mechanisms of Amazon Bedrock Guardrails with the data reversibility of tokenization, you can implement responsible AI workflows that align with your application requirements and organizational policies while preserving the functionality needed for downstream systems.
To learn more about implementing responsible AI practices on AWS, see Transform responsible AI from theory into practice.

About the Authors
Nizar Kheir is a Senior Solutions Architect at AWS with more than 15 years of experience spanning various industry segments. He currently works with public sector customers in France and across EMEA to help them modernize their IT infrastructure and foster innovation by harnessing the power of the AWS Cloud.
Mark Warner is a Principal Solutions Architect for Thales, Cyber Security Products division. He works with companies in various industries such as finance, healthcare, and insurance to improve their security architectures. His focus is assisting organizations with reducing risk, increasing compliance, and streamlining data security operations to reduce the probability of a breach.

Perplexity Launches an AI Email Assistant Agent for Gmail and Outlook, Aimed at Scheduling, Drafting, and Inbox Triage

Perplexity introduced “Email Assistant,” an AI agent that plugs into Gmail and Outlook to draft replies in your voice, auto-label and prioritize messages, and coordinate meetings end-to-end (availability checks, time suggestions, and calendar invites). The feature is restricted to Perplexity’s Max plan and is live today.

What it does

Email Assistant adds an agent to any thread (via cc) that handles the back-and-forth typical of scheduling. It reads availability, proposes times, and issues invites, while also surfacing daily priorities and generating reply drafts aligned to the user’s tone. Launch support covers Gmail and Outlook with one-click setup links.

https://www.perplexity.ai/assistant

How it plugs into calendars and mail

Perplexity has been shipping native connectors for Google and Microsoft stacks; the current changelog notes that Gmail/Gcal/Outlook connections support email search and “create calendar invites directly within Perplexity,” which is what the Email Assistant automates from within a live thread. Practically, users enroll, then send or cc assistant@perplexity.com to delegate scheduling and triage tasks.

https://www.perplexity.ai/assistant

Security posture

Perplexity specifies SOC 2 and GDPR compliance and says user data is not used for training. For teams evaluating agents in regulated environments, that implies standard audit controls and data-handling boundaries, but as always, production rollouts should validate data-access scopes and DLP posture in the target tenant.

Competitive context

Email Assistant overlaps with Microsoft Copilot for Outlook and Google Gemini for Gmail (summaries/assists). Perplexity’s differentiator is agentic handling of the entire negotiation loop inside email threads plus cross-account connectors already present in its Comet stack. That makes it a realistic drop-in for users who prefer an external agent rather than suite-native assistants.

Early read for implementers

Integration path: Connect Gmail/Outlook, then cc the agent on threads that need scheduling; use it for triage queries and auto-drafts.

Workflow coverage: Auto-labels for “needs reply” vs. FYI; daily summaries; draft-in-your-style replies; invite creation.

Boundary conditions: Max-only; launch support limited to Gmail/Outlook; verify calendar write permissions and compliance needs per domain.

Summary

Perplexity’s Email Assistant is a concrete agentic workflow for inboxes: cc it, let it negotiate times, send invites, and keep your triage queue lean—currently gated to Max subscribers and Gmail/Outlook environments.

Microsoft Brings MCP to Azure Logic Apps (Standard) in Public Preview, Turning Connectors into Agent Tools

Microsoft has released a public preview that enables Azure Logic Apps (Standard) to run as Model Context Protocol (MCP) servers, exposing Logic Apps workflows as agent tools discoverable and callable by MCP-capable clients (e.g., VS Code + Copilot).

What’s actually shipping

Remote MCP server on Logic Apps (Standard): You configure a Standard logic app to host an MCP endpoint (/api/mcp) and surface HTTP Request/Response workflows as tools. Authentication is front-doored by Easy Auth; MCP endpoints default to OAuth 2.0. VS Code (≥1.102) includes GA MCP client support for testing.

API Center registration path (preview): You can also create/register MCP servers in Azure API Center, where selected managed connector actions become tools with cataloging and governance.

https://learn.microsoft.com/en-us/azure/logic-apps/set-up-model-context-protocol-server-standard

Key requirements and transport details

Workflow shape: Tools must be implemented as HTTP Request trigger (“When a HTTP request is received”) plus a Response action.

Auth & access control: By default, MCP uses OAuth 2.0; Easy Auth enforces client/identity/tenant restrictions. During setup, App Service authentication must allow unauthenticated requests (the MCP flow still performs OAuth).

Transports: Streamable HTTP works out of the box. SSE additionally requires VNET integration and the host.json setting Runtime.Backend.EdgeWorkflowRuntimeTriggerListener.AllowCrossWorkerCommunication=true.

Enablement switch: MCP APIs are enabled by adding extensions.workflow.McpServerEndpoints.enable=true in host.json.

API Center path: preview limitations that matter

When creating MCP servers via API Center backed by Logic Apps, the current preview imposes the following limits:

Start with an empty Standard logic app resource.

One connector per MCP server.

Built-in service-provider and custom connectors aren’t supported in this path (managed connectors only).

One action per tool.

These constraints materially affect tool granularity and server layout in larger estates.

Why Standard (single-tenant) is the target

Standard runs on the single-tenant Logic Apps runtime (on Azure Functions), supports multiple workflows per app, and integrates directly with virtual networks and private endpoints—all relevant for exposing private systems safely to agents and for predictable throughput/latency. By contrast, Consumption is multitenant, single-workflow per app, and pay-per-execution.

Tooling semantics and discoverability

Microsoft recommends adding trigger descriptions, parameter schemas/descriptions, and required markers to improve agent tool selection and invocation reliability. These annotations are read by MCP clients and influence calling behavior.

Connectors and enterprise reach

Organizations can front existing workflows and a large catalog of Logic Apps connectors (cloud and on-prem) through MCP, turning them into callable agent tools; Microsoft explicitly cites “more than 1,400 connectors.”

Operations, governance, and testing

Run history plus Application Insights/Log Analytics are available for diagnostics and auditability. VS Code provides quick client validation via MCP: Add Server, including OAuth sign-in and tool enumeration. Registering via API Center brings discovery/governance to MCP servers across teams.

Production notes (preview)

SSE requires both VNET and the cross-worker setting; without these, use streamable HTTP.

Easy Auth must be configured precisely (including the “allow unauthenticated” toggle) or client sign-in flows will fail despite OAuth expectations.

Throttling, idempotency, and schema versioning remain your responsibility when wrapping connectors as tools (not new, but now in the agent path). InfoQ highlights similar operational concerns from early adopters.

Summary

The preview cleanly MCP-enables Logic Apps (Standard): you expose HTTP-based workflows as OAuth-protected tools; you can catalog them in API Center; and you can reach private systems through single-tenant networking. For teams already invested in Logic Apps, this is a low-friction, standards-aligned route to operationalize enterprise agent tooling—just mind the API Center limits, SSE prerequisites, and Easy Auth nuances during rollout.

Alibaba Qwen Team Just Released FP8 Builds of Qwen3-Next-80B-A3B (Instruct & Thinking), Bringing 80B/3B-Active Hybrid-MoE to Commodity GPUs

Alibaba’s Qwen team has just released FP8-quantized checkpoints for its new Qwen3-Next-80B-A3B models in two post-training variants—Instruct and Thinking—aimed at high-throughput inference with ultra-long context and MoE efficiency. The FP8 repos mirror the BF16 releases but package “fine-grained FP8” weights (block size 128) and deployment notes for sglang and vLLM nightly builds. Benchmarks in the cards remain those of the original BF16 models; FP8 is provided “for convenience and performance,” not as a separate evaluation run.

What’s in the A3B stack

Qwen3-Next-80B-A3B is a hybrid architecture combining Gated DeltaNet (a linear/conv-style attention surrogate) with Gated Attention, interleaved with an ultra-sparse Mixture-of-Experts (MoE). The 80B total parameter budget activates ~3B params per token via 512 experts (10 routed + 1 shared). The layout is specified as 48 layers arranged into 12 blocks: 3×(Gated DeltaNet → MoE) followed by 1×(Gated Attention → MoE). Native context is 262,144 tokens, validated up to ~1,010,000 tokens using RoPE scaling (YaRN). Hidden size is 2048; attention uses 16 Q heads and 2 KV heads at head dim 256; DeltaNet uses 32 V and 16 QK linear heads at head dim 128.

Qwen team reports the 80B-A3B base model outperforms Qwen3-32B on downstream tasks at ~10% of its training cost and delivers ~10× inference throughput beyond 32K context—driven by low activation in MoE and multi-token prediction (MTP). The Instruct variant is non-reasoning (no <think> tags), whereas the Thinking variant enforces reasoning traces by default and is optimized for complex problems.

FP8 releases: what actually changed

The FP8 model cards state the quantization is “fine-grained fp8” with block size 128. Deployment differs slightly from BF16: both sglang and vLLM require current main/nightly builds, with example commands provided for 256K context and optional MTP. The Thinking FP8 card also recommends a reasoning parser flag (e.g., --reasoning-parser deepseek-r1 in sglang, deepseek_r1 in vLLM). These releases retain Apache-2.0 licensing.

Benchmarks (reported on BF16 weights)

The Instruct FP8 card reproduces Qwen’s BF16 comparison table, putting Qwen3-Next-80B-A3B-Instruct on par with Qwen3-235B-A22B-Instruct-2507 on several knowledge/reasoning/coding benchmarks, and ahead on long-context workloads (up to 256K). The Thinking FP8 card lists AIME’25, HMMT’25, MMLU-Pro/Redux, and LiveCodeBench v6, where Qwen3-Next-80B-A3B-Thinking surpasses earlier Qwen3 Thinking releases (30B A3B-2507, 32B) and claims wins over Gemini-2.5-Flash-Thinking on multiple benchmarks.

Training and post-training signals

The series is trained on ~15T tokens before post-training. Qwen highlights stability additions (zero-centered, weight-decayed layer norm, etc.) and uses GSPO in RL post-training for the Thinking model to handle the hybrid attention + high-sparsity MoE combination. MTP is used to speed inference and improve pretraining signal.

Why FP8 matters

On modern accelerators, FP8 activations/weights reduce memory bandwidth pressure and resident footprint versus BF16, allowing larger batch sizes or longer sequences at similar latency. Because A3B routes only ~3B parameters per token, the combination of FP8 + MoE sparsity compounds throughput gains in long-context regimes, particularly when paired with speculative decoding via MTP as exposed in the serving flags. That said, quantization interacts with routing and attention variants; real-world acceptance rates for speculative decoding and end-task accuracy can vary with engine and kernel implementations—hence Qwen’s guidance to use current sglang/vLLM and to tune speculative settings.

Summary

Qwen’s FP8 releases make the 80B/3B-active A3B stack practical to serve at 256K context on mainstream engines, preserving the hybrid-MoE design and MTP path for high throughput. The model cards keep benchmarks from BF16, so teams should validate FP8 accuracy and latency on their own stacks, especially with reasoning parsers and speculative settings. Net outcome: lower memory bandwidth and improved concurrency without architectural regressions, positioned for long-context production workloads.

Check out the Qwen3-Next-80B-A3B models in two post-training variants—Instruct and Thinking.

Rapid ML experimentation for enterprises with Amazon SageMaker AI and …

This post was written with Sarah Ostermeier from Comet.
As enterprise organizations scale their machine learning (ML) initiatives from proof of concept to production, the complexity of managing experiments, tracking model lineage, and managing reproducibility grows exponentially. This is primarily because data scientists and ML engineers constantly explore different combinations of hyperparameters, model architectures, and dataset versions, generating massive amounts of metadata that must be tracked for reproducibility and compliance. As the ML model development scales across multiple teams and regulatory requirements intensify, tracking experiments becomes even more complex. With increasing AI regulations, particularly in the EU, organizations now require detailed audit trails of model training data, performance expectations, and development processes, making experiment tracking a business necessity and not just a best practice.
Amazon SageMaker AI provides the managed infrastructure enterprises need to scale ML workloads, handling compute provisioning, distributed training, and deployment without infrastructure overhead. However, teams still need robust experiment tracking, model comparison, and collaboration capabilities that go beyond basic logging.
Comet is a comprehensive ML experiment management platform that automatically tracks, compares, and optimizes ML experiments across the entire model lifecycle. It provides data scientists and ML engineers with powerful tools for experiment tracking, model monitoring, hyperparameter optimization, and collaborative model development. It also offers Opik, Comet’s open source platform for LLM observability and development.
Comet is available in SageMaker AI as a Partner AI App, as a fully managed experiment management capability, with enterprise-grade security, seamless workflow integration, and a straightforward procurement process through AWS Marketplace.
The combination addresses the needs of an enterprise ML workflow end-to-end, where SageMaker AI handles infrastructure and compute, and Comet provides the experiment management, model registry, and production monitoring capabilities that teams require for regulatory compliance and operational efficiency. In this post, we demonstrate a complete fraud detection workflow using SageMaker AI with Comet, showcasing reproducibility and audit-ready logging needed by enterprises today.
Enterprise-ready Comet on SageMaker AI
Before proceeding to setup instructions, organizations must identify their operating model and, based on that, decide how Comet will be set up. We recommend implementing Comet using a federated operating model. In this architecture, Comet is centrally managed and hosted in a shared services account, and each data science team maintains fully autonomous environments. Each operating model comes with its own set of benefits and limitations. For more information, refer to SageMaker Studio Administration Best Practices.
Let’s dive into the setup of Comet in SageMaker AI. Large enterprises generally have the following personas:

Administrators – Responsible for setting up the common infrastructure services and environment for use case teams
Users – ML practitioners from use case teams who use the environments set up by the platform team to solve their business problems

In the following sections, we go through each persona’s journey.
Comet works well with both SageMaker AI and Amazon SageMaker. SageMaker AI provides the Amazon SageMaker Studio integrated development environment (IDE), and SageMaker provides the Amazon SageMaker Unified Studio IDE. For this post, we use SageMaker Studio.
Administrator journey
In this scenario, the administrator receives a request from a team working on a fraud detection use case to provision an ML environment with a fully managed training and experimentation setup. The administrator’s journey includes the following steps:

Follow the prerequisites to set up Partner AI Apps. This sets up permissions for administrators, allows Comet to assume a SageMaker AI execution role on behalf of users, and grants additional privileges for managing the Comet subscription through AWS Marketplace.
On the SageMaker AI console, under Applications and IDEs in the navigation pane, choose Partner AI Apps, then choose View details for Comet.

The details are shown, including the contract pricing model for Comet and infrastructure tier estimated costs.

Comet provides subscription options ranging from a 1-month to a 36-month contract. With this contract, users can access Comet in SageMaker AI. Based on the number of users, the admin can review and select the appropriate instance size for the Comet dashboard server; Comet supports 5–500 users running more than 100 experiment jobs.

Choose Go to Marketplace to subscribe. You are redirected to the Comet listing on AWS Marketplace.
Choose View purchase options.

In the subscription form, provide the required details.

When the subscription is complete, the admin can start configuring Comet.

While deploying Comet, add the project lead of the fraud detection use case team as an admin so they can manage the Comet dashboard.

It takes a few minutes for the Comet server to be deployed. For more details on this step, refer to Partner AI App provisioning.

Set up a SageMaker AI domain following the steps in Use custom setup for Amazon SageMaker AI. As a best practice, provide a pre-signed domain URL so use case team members can directly access the Comet UI without logging in to the SageMaker console (a programmatic sketch for generating this URL follows at the end of this section).
Add the team members to this domain and enable access to Comet while configuring the domain.

Now the SageMaker AI domain is ready for users to log in to and start working on the fraud detection use case.
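The pre-signed domain URL mentioned in the setup steps can also be generated programmatically. The following is a minimal sketch using boto3; the domain ID and user profile name are placeholders:

import boto3

sagemaker_client = boto3.client('sagemaker')

# Generate a pre-signed URL that a user can open to access SageMaker Studio directly
response = sagemaker_client.create_presigned_domain_url(
    DomainId='<your-domain-id>',               # Replace with your SageMaker AI domain ID
    UserProfileName='<team-member-profile>',   # Replace with the user's profile name
    SessionExpirationDurationInSeconds=43200,  # How long the Studio session stays valid
    ExpiresInSeconds=300                       # How long the URL itself remains usable
)
print(response['AuthorizedUrl'])
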
User journey
Now let’s explore the journey of an ML practitioner from the fraud detection use case. The user completes the following steps:

Log in to the SageMaker AI domain through the pre-signed URL.

You will be redirected to the SageMaker Studio IDE. Your user name and AWS Identity and Access Management (IAM) execution role are preconfigured by the admin.

Create a JupyterLab Space following the JupyterLab user guide.
You can start working on the fraud detection use case by spinning up a Jupyter notebook.

The admin has also set up required access to the data through an Amazon Simple Storage Service (Amazon S3) bucket.

To access Comet APIs, install the comet_ml library and configure the required environment variables as described in Set up the Amazon SageMaker Partner AI Apps SDKs.
To access the Comet UI, choose Partner AI Apps in the SageMaker Studio navigation pane and choose Open for Comet.

Now, let’s walk through the use case implementation.
Solution overview
This use case highlights common enterprise challenges: working with imbalanced datasets (in this example, only 0.17% of transactions are fraudulent), requiring multiple experiment iterations, and maintaining full reproducibility for regulatory compliance. To follow along, refer to the Comet documentation and Quickstart guide for additional setup and API details.
For this use case, we use the Credit Card Fraud Detection dataset. The dataset contains credit card transactions with binary labels representing fraudulent (1) or legitimate (0) transactions. In the following sections, we walk through some of the important sections of the implementation. The entire code of the implementation is available in the GitHub repository.
Prerequisites
As a prerequisite, configure the necessary imports and environment variables for the Comet and SageMaker integration:

# Comet ML for experiment tracking
import os

import comet_ml
from comet_ml import Experiment, API, Artifact
from comet_ml.integration.sagemaker import log_sagemaker_training_job_v1

# Environment variables required for the Partner AI App integration
os.environ['AWS_PARTNER_APP_AUTH'] = 'true'
os.environ['AWS_PARTNER_APP_ARN'] = '<Your_AWS_PARTNER_APP_ARN>'
os.environ['COMET_API_KEY'] = '<Your_Comet_API_Key>'
# To find the API key: on the Comet details page, choose Open Comet, then in the top-right corner choose your user -> API Key

# Comet ML configuration
COMET_WORKSPACE = '<your-comet-workspace-name>'
COMET_PROJECT_NAME = '<your-comet-project-name>'

Prepare the dataset
One of Comet’s key enterprise features is automatic dataset versioning and lineage tracking. This capability provides full auditability of what data was used to train each model, which is critical for regulatory compliance and reproducibility. Start by loading the dataset:

# Create a Comet Artifact to track our raw dataset
dataset_artifact = Artifact(
    name="fraud-dataset",
    artifact_type="dataset",
    aliases=["raw"]
)

# Add the raw dataset file to the artifact
dataset_artifact.add_remote(s3_data_path, metadata={
    "dataset_stage": "raw",
    "dataset_split": "not_split",
    "preprocessing": "none"
})

Start a Comet experiment
With the dataset artifact created, you can now start tracking the ML workflow. Creating a Comet experiment automatically begins capturing code, installed libraries, system metadata, and other contextual information in the background. You can log the dataset artifact created earlier in the experiment. See the following code:

# Create a new Comet experiment
experiment_1 = comet_ml.Experiment(
    project_name=COMET_PROJECT_NAME,
    workspace=COMET_WORKSPACE,
)

# Log the dataset artifact to this experiment for lineage tracking
experiment_1.log_artifact(dataset_artifact)

Preprocess the data
The next steps are standard preprocessing steps, including removing duplicates, dropping unneeded columns, splitting into train/validation/test sets, and standardizing features using scikit-learn’s StandardScaler. We wrap the processing code in preprocess.py and run it as a SageMaker Processing job. See the following code:

# Run SageMaker processing job
processor = SKLearnProcessor(
    framework_version='1.0-1',
    role=sagemaker.get_execution_role(),
    instance_count=1,
    instance_type='ml.t3.medium'
)

processor.run(
    code='preprocess.py',
    inputs=[ProcessingInput(source=s3_data_path, destination='/opt/ml/processing/input')],
    outputs=[ProcessingOutput(source='/opt/ml/processing/output', destination=f's3://{bucket_name}/{processed_data_prefix}')]
)

After you submit the processing job, SageMaker AI launches the compute instances, processes and analyzes the input data, and releases the resources upon completion. The output of the processing job is stored in the S3 bucket specified.
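The preprocess.py script itself is not reproduced in this post. The following is a minimal sketch of what such a script might contain, based on the steps described above; the input file name, column names other than the Class label, and split ratios are assumptions, so refer to the GitHub repository for the actual script:

# preprocess.py (illustrative sketch; see the GitHub repository for the actual script)
import os

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

INPUT_DIR = '/opt/ml/processing/input'
OUTPUT_DIR = '/opt/ml/processing/output'

# Load the raw dataset and remove duplicate transactions
df = pd.read_csv(os.path.join(INPUT_DIR, 'creditcard.csv')).drop_duplicates()

# Separate features and the fraud label (1 = fraud, 0 = legitimate)
X = df.drop(columns=['Class'])
y = df['Class']

# Stratified train/validation/test split to preserve the fraud ratio
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=42)

# Standardize features using statistics from the training set only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)

# Write label-first CSVs without headers (the format expected by the SageMaker built-in XGBoost algorithm)
for name, features, labels in [('train', X_train_scaled, y_train),
                               ('validation', X_val_scaled, y_val),
                               ('test', X_test_scaled, y_test)]:
    out = pd.concat([labels.reset_index(drop=True), pd.DataFrame(features)], axis=1)
    out.to_csv(os.path.join(OUTPUT_DIR, f'{name}.csv'), index=False, header=False)
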
Next, create a new version of the dataset artifact to track the processed data. Comet automatically versions artifacts with the same name, maintaining complete lineage from raw to processed data.

# Create an updated version of the 'fraud-dataset' Artifact for the preprocessed data
preprocessed_dataset_artifact = Artifact(
    name="fraud-dataset",
    artifact_type="dataset",
    aliases=["preprocessed"],
    metadata={
        "description": "Credit card fraud detection dataset",
        "fraud_percentage": f"{fraud_percentage:.3f}%",
        "dataset_stage": "preprocessed",
        "preprocessing": "StandardScaler + train/val/test split",
    }
)

# Add our train, validation, and test dataset files as remote assets
preprocessed_dataset_artifact.add_remote(
    uri=f's3://{bucket_name}/{processed_data_prefix}',
    logical_path='split_data'
)

# Log the updated dataset to the experiment to track the updates
experiment_1.log_artifact(preprocessed_dataset_artifact)

The Comet and SageMaker AI experiment workflow
Data scientists prefer rapid experimentation; therefore, we organized the workflow into reusable utility functions that can be called multiple times with different hyperparameters while maintaining consistent logging and evaluation across all runs. In this section, we showcase the utility functions along with a brief snippet of the code inside the function:

train() – Spins up a SageMaker model training job using the SageMaker built-in XGBoost algorithm:

# Create SageMaker estimator
estimator = Estimator(
    image_uri=xgboost_image,
    role=execution_role,
    instance_count=1,
    instance_type='ml.m5.large',
    output_path=model_output_path,
    sagemaker_session=sagemaker_session_obj,
    hyperparameters=hyperparameters_dict,
    max_run=1800  # Maximum training time in seconds
)

# Start training
estimator.fit({
    'train': train_channel,
    'validation': val_channel
})

log_training_job() – Captures the training metadata and metrics and links the model asset to the experiment for complete traceability:

# Log SageMaker training job to Comet
log_sagemaker_training_job_v1(
    estimator=training_estimator,
    experiment=api_experiment
)

log_model_to_comet() – Links model artifacts to Comet, captures the training metadata, and links the model asset to the experiment for complete traceability:

experiment.log_remote_model(
    model_name=model_name,
    uri=model_artifact_path,
    metadata=metadata
)

deploy_and_evaluate_model() – Performs model deployment and evaluation, and metric logging:

# Deploy to endpoint
predictor = estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge"
)

# Log metrics and visualizations to Comet
experiment.log_metrics(metrics)
experiment.log_confusion_matrix(matrix=cm, labels=['Normal', 'Fraud'])

# Log ROC curve
fpr, tpr, _ = roc_curve(y_test, y_pred_prob_as_np_array)
experiment.log_curve("roc_curve", x=fpr, y=tpr)

The complete prediction and evaluation code is available in the GitHub repository.
Run the experiments
Now you can run multiple experiments by calling the utility functions with different configurations and compare experiments to find the most optimal settings for the fraud detection use case.
For the first experiment, we establish a baseline using standard XGBoost hyperparameters:

# Define hyperparameters for first experiment
hyperparameters_v1 = {
    'objective': 'binary:logistic',  # Binary classification
    'num_round': 100,                # Number of boosting rounds
    'eval_metric': 'auc',            # Evaluation metric
    'learning_rate': 0.15,           # Learning rate
    'booster': 'gbtree'              # Booster algorithm
}

# Train the model
estimator_1 = train(
    model_output_path=f"s3://{bucket_name}/{model_output_prefix}/1",
    execution_role=role,
    sagemaker_session_obj=sagemaker_session,
    hyperparameters_dict=hyperparameters_v1,
    train_channel_loc=train_channel_location,
    val_channel_loc=validation_channel_location
)

# Log the training job and model artifact
log_training_job(experiment_key=experiment_1.get_key(), training_estimator=estimator_1)
log_model_to_comet(
    experiment=experiment_1,
    model_name="fraud-detection-xgb-v1",
    model_artifact_path=estimator_1.model_data,
    metadata=metadata
)

# Deploy and evaluate
deploy_and_evaluate_model(
    experiment=experiment_1,
    estimator=estimator_1,
    X_test_scaled=X_test_scaled,
    y_test=y_test
)

While running a Comet experiment from a Jupyter notebook, we need to end the experiment to make sure everything is captured and persisted in the Comet server. See the following code: experiment_1.end()
When the baseline experiment is complete, you can run additional experiments with different hyperparameters. Check out the notebook to see the details of both experiments.
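For illustration, a second experiment can reuse the same utility functions with a different configuration. The hyperparameter values below are placeholders and are not necessarily those used in the notebook:

# Second experiment with adjusted hyperparameters (illustrative values only)
experiment_2 = comet_ml.Experiment(
    project_name=COMET_PROJECT_NAME,
    workspace=COMET_WORKSPACE,
)
experiment_2.log_artifact(preprocessed_dataset_artifact)

hyperparameters_v2 = {
    'objective': 'binary:logistic',
    'num_round': 200,           # More boosting rounds than the baseline
    'eval_metric': 'auc',
    'learning_rate': 0.05,      # Lower learning rate
    'max_depth': 6,             # Explicit tree depth
    'booster': 'gbtree'
}

estimator_2 = train(
    model_output_path=f"s3://{bucket_name}/{model_output_prefix}/2",
    execution_role=role,
    sagemaker_session_obj=sagemaker_session,
    hyperparameters_dict=hyperparameters_v2,
    train_channel_loc=train_channel_location,
    val_channel_loc=validation_channel_location
)

log_training_job(experiment_key=experiment_2.get_key(), training_estimator=estimator_2)
log_model_to_comet(
    experiment=experiment_2,
    model_name="fraud-detection-xgb-v2",
    model_artifact_path=estimator_2.model_data,
    metadata=metadata
)
deploy_and_evaluate_model(
    experiment=experiment_2,
    estimator=estimator_2,
    X_test_scaled=X_test_scaled,
    y_test=y_test
)
experiment_2.end()
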
When the second experiment is complete, navigate to the Comet UI to compare these two experiment runs.
View Comet experiments in the UI
To access the UI, you can locate the URL in the SageMaker Studio IDE or by executing the code provided in the notebook: experiment_2.url
The following screenshot shows the Comet experiments UI. The experiment details are for illustration purposes only and do not represent a real-world fraud detection experiment.

This concludes the fraud detection experiment.
Clean up
For the experimentation part, SageMaker processing and training infrastructure is ephemeral in nature and shuts down automatically when the job is complete. However, you must still manually clean up a few resources to avoid unnecessary costs:

Shut down the SageMaker JupyterLab Space after use. For instructions, refer to Idle shutdown.
The Comet subscription renews based on the contract chosen. Cancel the contract when there is no further requirement to renew the Comet subscription.
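
In addition, the deploy_and_evaluate_model utility creates a real-time inference endpoint. If that endpoint is still running after your experiments (the notebook may already remove it), delete it to stop incurring charges; a minimal sketch follows:

# Delete the evaluation endpoint and its endpoint configuration if they still exist
predictor.delete_endpoint(delete_endpoint_config=True)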

Advantages of SageMaker and Comet integration
Having demonstrated the technical workflow, let’s examine the broader advantages this integration provides.
Streamlined model development
The Comet and SageMaker combination reduces the manual overhead of running ML experiments. While SageMaker handles infrastructure provisioning and scaling, Comet’s automatic logging captures hyperparameters, metrics, code, installed libraries, and system performance from your training jobs without additional configuration. This helps teams focus on model development rather than experiment bookkeeping.
Comet’s visualization capabilities extend beyond basic metric plots. Built-in charts enable rapid experiment comparison, and custom Python panels support domain-specific analysis tools for debugging model behavior, optimizing hyperparameters, or creating specialized visualizations that standard tools can’t provide.
Enterprise collaboration and governance
For enterprise teams, the combination creates a mature platform for scaling ML projects across regulated environments. SageMaker provides consistent, secure ML environments, and Comet enables seamless collaboration with complete artifact and model lineage tracking. This helps avoid costly mistakes that occur when teams can’t recreate previous results.
Complete ML lifecycle integration
Unlike point solutions that only address training or monitoring, Comet paired with SageMaker supports your complete ML lifecycle. Models can be registered in Comet’s model registry with full version tracking and governance. SageMaker handles model deployment, and Comet maintains the lineage and approval workflows for model promotion. Comet’s production monitoring capabilities track model performance and data drift after deployment, creating a closed loop where production insights inform your next round of SageMaker experiments.
Conclusion
In this post, we showed how to use SageMaker and Comet together to spin up fully managed ML environments with reproducibility and experiment tracking capabilities.
To enhance your SageMaker workflows with comprehensive experiment management, deploy Comet directly in your SageMaker environment through the AWS Marketplace, and share your feedback in the comments.
For more information about the services and features discussed in this post, refer to the following resources:

Set up Partner AI Apps
Comet Quickstart
GitHub notebook
Comet Documentation
Opik open source platform for LLM observability

About the authors
Vikesh Pandey is a Principal GenAI/ML Specialist Solutions Architect at AWS, helping large financial institutions adopt and scale generative AI and ML workloads. He is the author of the book “Generative AI for financial services” and has more than 15 years of experience building enterprise-grade applications with generative AI/ML and related technologies. In his spare time, he plays an unnamed sport with his son that lies somewhere between football and rugby.
Naufal Mir is a Senior GenAI/ML Specialist Solutions Architect at AWS. He focuses on helping customers build, train, deploy and migrate machine learning workloads to SageMaker. He previously worked at financial services institutes developing and operating systems at scale. Outside of work, he enjoys ultra endurance running and cycling.
Sarah Ostermeier is a Technical Product Marketing Manager at Comet. She specializes in bringing Comet’s GenAI and ML developer products to the engineers who need them through technical content, educational resources, and product messaging. She has previously worked as an ML engineer, data scientist, and customer success manager, helping customers implement and scale AI solutions. Outside of work she enjoys traveling off the beaten path, writing about AI, and reading science fiction.